DOGMA: a new tool for assessing the quality of proteomes and transcriptomes

A new tool, recently published in Nucleic Acids Research, caught my eye this week.

The tool, by a team from the University of Münster, uses protein domains and domain arrangements to assess the 'completeness' of a proteome or transcriptome. From the abstract…

Even in the era of next generation sequencing, in which bioinformatics tools abound, annotating transcriptomes and proteomes remains a challenge. This can have major implications for the reliability of studies based on these datasets. Therefore, quality assessment represents a crucial step prior to downstream analyses on novel transcriptomes and proteomes. DOGMA allows such a quality assessment to be carried out. The data of interest are evaluated based on a comparison with a core set of conserved protein domains and domain arrangements. Depending on the studied species, DOGMA offers precomputed core sets for different phylogenetic clades…
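
To make the idea concrete, here is a minimal sketch of what a domain-based completeness score might look like. This is my own illustration, not DOGMA's actual implementation: the domain arrangements annotated in a proteome are compared against a conserved core set, and the fraction recovered is reported.

    # Minimal sketch of a domain-based completeness score.
    # Illustration of the general idea only; not DOGMA's implementation.

    def completeness(observed_arrangements, core_arrangements):
        """Fraction of a conserved core set recovered in a proteome.

        Each arrangement is a tuple of domain IDs in N- to C-terminal
        order, e.g. ('PF00076',) or ('PF00013', 'PF00076').
        """
        found = core_arrangements & observed_arrangements
        return len(found) / len(core_arrangements)

    # Hypothetical example: a core set of three arrangements, two recovered.
    core = {('PF00076',), ('PF00013', 'PF00076'), ('PF00271',)}
    observed = {('PF00076',), ('PF00013', 'PF00076'), ('PF09999',)}

    print(f'Completeness: {completeness(observed, core):.1%}')  # 66.7%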

Unlike CEGMA and BUSCO, which run against unannotated assemblies, DOGMA first requires a set of gene annotations. The paper focuses on the web server version of DOGMA but you can also access the source code online.

It's good to see that other groups are continuing to look at new ways of assessing the quality of large genome/transcriptome/proteome datasets.

What's in a name?

Initially, I thought the name was just a word that both echoed 'CEGMA' and nodded to the central dogma of molecular biology. Hooray, I thought: a bioinformatics tool with a regular word as its name, not relying on a contrived acronym.

Then I saw the website…

  • DOGMA: DOmain-based General Measure for transcriptome and proteome quality Assessment

This is even more tenuous than the older, unrelated version of DOGMA:

  • DOGMA: Dual Organellar GenoMe Annotator

Transcriptional noise, isoform prediction, and the utility of mass spec data in gene annotation

The human genome may be 80% functional or 8.2% functional. Maybe it's 93.7% functional or only 6.1%. I guess that all we know for sure is that it is not 0% functional (although my genome on a Monday morning may provide evidence to the contrary).

Transcript data can be used to ascribe some sort of functionality to a genome and, in an ideal world, we would sequence full-length cDNAs for every gene. But in the less-than-ideal world we often end up sequencing lots of small bits of RNA using an ever-changing set of technologies. ESTs, SAGE, CAGE, RACE, MPSS, and RNA-Seq have all been used to provide evidence for where genes are and how highly they are being expressed.

Having some transcriptional evidence is (usually) better than not having any transcriptional evidence, but it doesn't necessarily imply functionality. A protein-coding gene that is transcribed may not be translated. Transcript data is used in gene annotation to add new genes, especially in the case of a first-pass annotation of a new genome. But in established genomes, it is probably used more to annotate transcript isoforms (e.g. splice variants). This can lead to a problem for the end users of such data…how can you tell whether all isoforms are equally likely?

Consider the transcript data for the rpl-22 gene in Caenorhabditis elegans. This gene has two annotated splice variants and there is indeed EST evidence for both variants, but the evidence is rather unbalanced:

This gene encodes a component of the large ribosomal subunit…a pretty essential protein! Notice how the secondary isoform (shown on top) a) encodes a much shorter protein and b) has very little transcript evidence. In my mind, this secondary isoform is the result of 'transcriptional noise'. Maybe a couple of lucky ESTs captured the transcript on its way to destruction via nonsense-mediated decay? It seems highly unlikely that this secondary transcript gives rise to a functional protein, though someone who is new to viewing data like this might initially consider each isoform equally valid.
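
That kind of eyeball test is easy to automate. Here is a hedged sketch, using made-up EST counts chosen to mimic the rpl-22 case (not real WormBase data), of one way to flag isoforms whose transcript support is a tiny fraction of their gene's total:

    # Sketch: flag isoforms with disproportionately weak transcript support.
    # The EST counts and the 5% threshold below are hypothetical.

    est_counts = {
        'rpl-22a': 180,  # primary isoform: well supported
        'rpl-22b': 2,    # secondary isoform: barely supported
    }

    NOISE_THRESHOLD = 0.05  # flag isoforms with <5% of the gene's ESTs
    total = sum(est_counts.values())

    for isoform, n in est_counts.items():
        share = n / total
        status = 'candidate noise' if share < NOISE_THRESHOLD else 'well supported'
        print(f'{isoform}: {n} ESTs ({share:.1%}) -> {status}')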

If we turn on some additional data tracks to look at protein homology to human (shown in orange) and mass spectrometry data from C. elegans (shown in red), it becomes clear that all of the evidence really points towards just one functional isoform:

Indeed, mass spec data has the potential to clean up a lot of noisy gene annotations. In light of this, I was very happy to see this new paper published in the Journal of Proteome Research (where talented up-and-coming scientists publish!).

Pooling data from eight mass spec analyses of human samples, the authors set out to see how much protein-level support there was for the different annotated isoforms in the human genome. They could reliably map peptides to about two-thirds of the protein-coding genes from the GENCODE 20 gene set (Ensembl 76). What did they find?

We identified alternative splice isoforms for just 246 human genes; this clearly suggests that the vast majority of genes express a single main protein isoform.
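
To put that number in context, a quick back-of-the-envelope check (assuming, roughly, 20,000 protein-coding genes in GENCODE 20) reproduces the ~1.2% figure I quote below:

    # Back-of-the-envelope check on the headline figure.
    # The gene counts are approximations, not taken from the paper.

    genes_total = 20_000                        # ~GENCODE 20 protein-coding genes
    genes_with_peptides = genes_total * 2 // 3  # ~two-thirds had peptide support
    genes_with_alt_isoforms = 246               # genes with >1 isoform detected

    print(f'{genes_with_alt_isoforms / genes_total:.1%} of all genes')         # ~1.2%
    print(f'{genes_with_alt_isoforms / genes_with_peptides:.1%} of detected')  # ~1.8%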

They also found that the mass spec data was not always in agreement with the dominant isoforms that can be predicted from RNA-Seq data:

…part of the reason for this will be that more RNAseq reads map to longer sequences, it does suggest that either transcript expression is very different from protein expression for many genes or that transcript reconstruction methods may not be interpreting the RNAseq reads correctly.

The headline conclusion that mass spec evidence only supports alternative isoforms for 1.2% of human genes is thought-provoking. It suggests to me that we should be careful about relying too heavily on gene annotations that describe large numbers of isoforms mostly on the basis of transcript data. Paraphrasing George Orwell:

All isoforms are equal, but some isoforms are more equal than others

Community annotation — by any name — still isn’t a part of the research process. It should be

In order for community annotation efforts to succeed, they need to become part of the established research process: mine annotations, generate hypotheses, do experiments, write manuscripts, submit annotations. Rinse and repeat.

This comes from a thoughtful post by Todd Harris on his blog, which lists some suggestions for how to fix the failure of community annotation projects.

I particularly like Todd's 3rd suggestion:

We need to recognize the efforts of people who do [community annotation]. This system must have professional currency to it, akin to writing a review paper, and should be citable…

3-word summary of new PLOS Computational Biology paper: genome assemblies suck

Okay, I guess a more accurate four-word summary would be 'genome assemblies sometimes suck'.

The paper that I'm referring to, by Denton et al., looks at the problem of fragmentation (of genes) in many draft genome assemblies.

As they note in their introduction:

Our results suggest that low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved” gene models).

One section of this paper looks at the quality of different versions of the chicken genome and CEGMA is one of the tools they use in this analysis. I am a co-author of CEGMA, and reading this paper brought back some memories of when we were also looking at similar issues.

In our second CEGMA paper, we tried to find out why 36 core genes were not present in version 2.1 of the chicken genome (6.6x coverage). It turns out that there were ESTs available for 29 of those genes, indicating that the genes are not absent from the genome, just from the genome assembly. This led us to find pieces of these missing genes among the unanchored sequences included as part of the genome release (these often appear as a 'ChrUn' FASTA file).
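
If you want to run that kind of check yourself, here is a rough sketch of the approach using BLAST+ driven from Python. The file names are placeholders, and it assumes blastn (NCBI BLAST+) is installed and on your PATH:

    # Sketch: look for 'missing' core genes among unanchored (ChrUn)
    # sequences by aligning their ESTs with blastn. File names are
    # placeholders; requires NCBI BLAST+.
    import subprocess

    result = subprocess.run(
        ['blastn',
         '-query', 'missing_gene_ests.fasta',  # ESTs for the missing genes
         '-subject', 'chrUn.fasta',            # unanchored assembly sequences
         '-evalue', '1e-10',
         '-outfmt', '6 qseqid sseqid pident length evalue'],
        capture_output=True, text=True, check=True)

    # Any hit suggests the gene is present in the genome but missing
    # from the anchored portion of the assembly.
    hits = {line.split('\t')[0] for line in result.stdout.splitlines()}
    print(f'{len(hits)} ESTs hit ChrUn sequences')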

Something else we reported in our CEGMA paper is that a newer version of a genome assembly can sometimes be worse than the one it replaces (at least in terms of genic content). Version 1.95 of the Ciona intestinalis genome contained several core genes that subsequently disappeared in the v2.0 release.

In conclusion — and echoing some of the findings of this new paper by Denton et al.:

  1. Many genomes are poorly assembled
  2. Many genomes are poorly annotated (often a result of the poor assembly)
  3. Newer versions of genome assemblies are not always better