The long tail of the distribution: do people ever really look at their genome assemblies?

Since I (somewhat foolishly) volunteered to run CEGMA on behalf of people who were having trouble installing it, I've had the opportunity to look at a lot of genome (and transcriptome) assemblies.

Some people get a little bit obsessed by just how many (or how few) core genes are present in their assembly. They also get obsessed by how long (or short) the N50 length of their assembly is.

Earlier this week, I wrote about a new tool (N50 Booster!!!) that can increase the N50 length of any assembly. Although this was published on April 1st, the tool does actually work and it merits some further discussion. The tool achieves higher N50 lengths by doing two things:

  1. It removes the shortest 25% of all sequences
  2. It then sums the lengths of sequences that were removed and adds an equivalent length of Ns to the end of the longest sequence

Both steps contribute to increasing the N50 length, though the amount of the increase really depends on the distribution of sequence lengths in the assembly. The second step is clearly bogus, and I added it just to preserve the assembly size. The first step might seem drastic, but is exactly the same sort of step that is performed by many genome assemblers (or by the people post-processing the assembly).

If you don't remove short sequences from an assembly, you could end up including a large number of reads that don't overlap with any other read. I.e. it is possible that the shortest contig/scaffold length in an assembly will be the same as whatever the read length of the sequencing technology being used is. If you trim reads for quality, you could potentially end up with contigs/scaffolds with even shorter lengths.

How useful is it to include such short sequences in an assembly, and how often do 'assemblers' (the software and/or the people running the assemblers) do this? Well by looking at some of the assemblies that I have run CEGMA against, I can take a look.

For 34 genome assemblies (from many different species, using many different assemblers) I looked to see whether the shortest 10 sequences were all the same length or were unique:

So about half of the assemblies have unique lengths for their 10 shortest sequences. The remainder represent assemblies that probably either removed all sequences below a certain length (which seems likely with the assembly that had the shortest sequences at 2,000 bp), or which simply included all unassembled reads (six assemblies have an abundance of 100 bp sequences).

This begs the question, how useful is it to include all of these short sequences? It's always possible that a 100 bp read that doesn't overlap anything else contains something useful, but probably not. You might see an exon, possibly an intron, and very possibly an entire gene (depending on the species)...but can anyone do much with this information?

The most extreme example I came across is actually from one of the 16 assemblies which initially appeared to have sequences with unique lengths. This vertebrate genome assembly contains 72,214 sequences. If I ask what percentage of these sequence are shorter than various cutoffs, this is what I find:

This is pretty depressing. The vast majority of sequences in this assembly are likely to be too short to be of any use. The reason that this assembly counted among by 'unique' category is because the shortest ten sequences have lengths as follows: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp. That's right, this assembly includes a sequence that's only three base pairs in length!

There are a couple of points that  I am trying to make here:

  1. You can increase the N50 length of your assembly by removing the shortest sequences.  This is not cheating — though you should clearly state what cutoff has been used — and should be considered part of the genome assembly process (getting rid of the crap).
  2. Please look at your assembly before doing anything with it. On it's own, N50 is not a very useful statistic. Look at the distribution of all sequence lengths (or plot an NG(X) graph).

It turned out that this assembly did contain a fair number of core genes and these were almost all located in the 1.2% of sequences that were > 10,000 bp. That 3 bp sequence though, turned out it contained no core genes at all. Shocker!