Filtering a SAM file generated by TopHat to find uniquely mapped, concordant read pairs: AWK vs SAMtools

Software suites like SAMtools (or should that be [SAMsam]tools?) offer a powerful way to slice and dice files in the SAM, BAM, and CRAM file formats. But sometimes other approaches work just as well.

If you have aligned paired RNA-Seq read data to a genome or transcriptome using TopHat you may be interested in filtering the resulting SAM/BAM file to keep reads that are:

  • a) uniquely aligned (only match one place in target sequence)
  • b) mapped as concordant read pairs (both pairs map to same sequence in correct orientation with suitable distance between them)

TopHat has a --no-discordant command-line option which only report read alignments if both reads in a pair can be mapped, but the name of option is somewhat misleading, as you still end up with discordantly mapped read pairs in the final output (always good to check what difference a command-line option actually makes to your output!.

So if you have a TopHat SAM file, and you wanted to filter it to only keep uniqueley mapped, concordant read pairs, you could use two of the options that the samtools view command provides:

  • -q 50 — This filters on the MAPQ field (5th column of SAM file). TopHat uses a value of 50 in this field to denote unique mappings (this important piece of information is not in the TopHat manual).
  • -f 0x2 — This option filters on the bitwise FLAG score (column 2 of SAM file), and will extract only those mappings where the 2nd bit is set. From the SAM documentation, this is described as each segment properly aligned according to the aligner. In practice this means looking for values that are either 83, 89, 147, or 163 (see this helpful Biobeat blog post for more information about this).

So if you have a SAM file, in essence you just need to filter that file based on matching certain numbers in two different columns. This is something that the Unix AWK tool excels at, and unlike SAMtools, AWK is installed on just about every Unix/Linux system by default. So do both tools give ou the same result? Only one way to find out:

Using SAMtools

The 'unfiltered.sam' file is the result of a TopHat run that used the --no-discordant and --no-mixed options. The SAM file contained 34,340,754 lines of mapping information:

time samtools  view -q 50 -f 0x2 unfiltered.sam > filtered_by_samtools.sam

real    1m57.068s
user    1m18.705s
sys     0m13.712s

Using AWK

time awk '$5 == 50 && ($2 == 163 || $2 == 147 || $2 == 83 || $2 == 99) {print}' unfiltered.sam  > filtered_by_awk.sam

real    1m31.734s
user    0m46.855s
sys     0m15.775s

Does it make a difference?

wc -l filtered_by_*.sam
31710476 filtered_by_awk.sam
31710476 filtered_by_samtools.sam

diff filtered_by_samtools.sam filtered_by_awk.sam

No difference in the final output files, with AWK running quite a bit quicker than SAMtools. In this situation, filtering throws away about 8% of the mapped reads.

Inconsistent bioinformatics branding: SAMtools vs Samtools vs samtools

The popular Sequence Alignment Map format, SAM, has given rise to an equally popular toolkit for working with SAM files (and BAM, CRAM too). But what is the name of this tool?


SAMtools?

If we read the official publication, then we see this software described as 'SAMtools' (also described by Wikipedia in this manner).

Samtools?

Head to the official website and we see consistent references to 'Samtools'.

samtools?

Head to the official GitHub repository and we see consistent references to 'samtools'.


This is not exactly a problem that is halting the important work of bioinformaticians around the world, but I find it surprising that all of these names are in use by the people that developed the software. Unix-based software is typically — but not always — implemented as a set of lower-case commands and this can add one level of confusion when comparing a tool's name to the actual commands that are run ('samtools' is what you type at the terminal). However, you can still be consistent in your documentation!

More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)

I have previously written about the range of mapping quality scores (MAPQ) that you might see in BAM/SAM files, as produced by popular read mapping programs. A very quick recap:

  1. Bowtie 2 generates MAPQ scores between 0–42
  2. BWA generates MAPQ scores between 0–37
  3. Neither piece of software describes the range of possible scores in their documentation
  4. The SAM specification defines the possible ranges of the MAPQ score as 0–255 (though 255 should indicate that mapping quality was not available)
  5. I advocated that you should always take a look at your mapped sequence data to see what ranges of scores are present before doing anything else with your BAM/SAM files

So what is my latest gripe? Well, I've recently been running TopHat (version 2.0.13) to map some RNA-Seq reads to a genome sequence. TopHat uses Bowtie (or Bowtie 2) as the tool to do the intial mapping of reads to the genome, so you might expect it to generate the same range of MAPQ scores as the standalone version of Bowtie.

But it doesn't.

From my initial testing, it seems that the BAM/SAM output file from TopHat only contains MAPQ scores of 0, 1, 3, or 50. I find this puzzling and incongruous. Why produce only four MAPQ scores (compared to >30 different values that Bowtie 2 can produce), and why change the maximum possible value to 50? I turned to the TopHat manual, but found no explanation regarding MAPQ scores.

Turning to Google, I found this useful Biostars post which suggests that five MAPQ values are possible with TopHat (you can also have a value of 2 which I didn't see in my data), and that these values correspond to the following:

  • 0 = maps to 10 or more locations
  • 1 = maps to 4-9 locations
  • 2 = maps to 3 locations
  • 3 = maps to 2 locations
  • 50 = unique mapping

The post also reveals that, confusingly, TopHat previously used a value of 255 to indicate uniquely mapped reads. However, I then found another Biostars post which says that a MAPQ score of 2 isn't possible with TopHat, and that the meaning of the scores are as follows:

  • 0 = maps to 5 or more locations
  • 1 = maps to 3-4 locations
  • 3 = maps to 2 locations
  • 255 = unique mapping

This post was in reference to an older version of TopHat (1.4.1) which probably explains the use of the 255 score rather than 50. The comments on this post reflect some of the confusion over this topic. Going back to the original Biostars post, I then noticed a recent comment suggesting that MAPQ scores of 24, 28, 41, 42, and 44 are also possible with TopHat (version 2.0.13).

As this situation shows, when there is no official explanation that fully describes how a piece of software should work, it can lead to mass speculation by others. Such speculation can sometimes be inconsistant which can end up making things even more confusing. This is what drives bioinformaticians crazy.

I find it deeply frustrating when so much of this confusion could be removed with better documentation by the people that developed the original software. In this case the documentation needs just one paragraph added; something along the lines of…

Mapping Quality scores (MAPQ)
TopHat outputs MAPQ scores in the BAM/SAM files with possible values 0, 1, 2, or 50. The first three values indicate mappings to 5, 3–4, or 2 locations, whereas a value of 50 represents a unique match. Please note that older versions of TopHat used a value of 255 for unique matches. Further note that standalone versions of Bowtie and Bowie 2 (used by TopHat) produce a different range of MAPQ scores (0–42).

Would that be so hard?

Understanding MAPQ scores in SAM files: does 37 = 42?

The official specification for the Sequence Alignment Map (SAM) format outlines what is stored in each column of this tab-separated value file format. The fifth column of a SAM file stores MAPping Quality (MAPQ) values. From the SAM specification:

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

So if you happened to know that the probability of correctly mapping some random read was 0.99, then the MAPQ score should be 20 (i.e. log10 of 0.01 * -10). If the probability of a correct match increased to 0.999, the MAPQ score would increase to 30. So the upper bounds of a MAPQ score depends on the level of precision of your probability (though elswhere in the SAM spec, it defines an upper limit of 255 for this value). Conversely, as the probability of a correct match tends towards zero, so does the MAPQ score.

So I'm sure that the first thing that everyone does after generating a SAM file is to assess the spread of MAPQ scores in your dataset. Right? Anyone?

< sound of crickets >

Okay, so maybe you don't do this. Maybe you don't really care, and you are happy to trust the default output of whatever short read alignment program that you used to generate your SAM file. Why should it matter? Will these scores really vary all that much?

Here is a frequency distribution of MAPQ scores from two mapping experiments. The bottom panel zooms in to more clearly show the distribution of low frequency MAPQ scores:

Distribution of MAPQ scores from two experiments: bottom panel shows zoomed in view of MAPQ scores with frequencies < 1%. Click to enlarge.

What might we conclude from this? There seems to be some clear differences between both experiments. The most frequent MAPQ scores in the first experiment are 42 followed by 1. In the second experiment, scores only reach a maximum value of 37, and scores of 0 are the second most frequent value.

These two experiments reflect some real world data. Experiment 1 is based on data from mouse, and experiment 2 uses data from Arabidopsis thaliana. However, that is probably not why the distributions are different. The mouse data is based on unpaired Illumina reads from a DNase-Seq experiment, wheras the A. thaliana data is from paired Illumina reads from whole genome sequencing. However, that still probably isn't the reason for the differences.

The reason for the different distributions is that experiment 1 used Bowtie 2 to map the reads whereas experiment 2 used BWA. It turns out that different mapping programs calculate MAPQ scores in different ways and you shouldn't really compare these values unless they came from the same program.

The maximum MAPQ value that Bowtie 2 generates is 42 (though it doesn't say this anywhere in the documentation). In contrast, the maximum MAPQ value that BWA will generate is 37 (though once again, you — frustratingly — won't find this information in the manual).

The data for Experiment 1 is based on a sample of over 25 million mapped reads. However, you never see MAPQ scores of 9, 10, or 20, something that presumably reflects some aspect of how Bowtie 2 calculates these scores.

In the absence of any helpful information in the manuals of these two popular aligners, others have tried doing their own experimentation to work out what the values correspond to. Dave Tang has a useful post on Mappinq Qualities on his Musings from a PhD Candidate blog. There are also lots of posts about mapping quality on the SEQanswers site (e.g. see here, here or here). However, the prize for the most detailed investigation of MAPQ scores — from Bowtie 2 at least — goes to John Urban who has written a fantastic post on his Biofinysics blog:

So in conclusion, there are 3 important take home messages:

  1. MAPQ scores vary between different programs and you should not directly compare results from, say, Bowtie 2 and BWA.
  2. You should look at your MAPQ scores and potentially filter out the really bad alignments.
  3. Bioinformatics software documentation can often omit some really important details (see also my last blog post on this subject).