The dangers of default parameters in bioinformatics: lessons from Bowtie and TopHat

Updated 2015-04-28 to include a reference to a relevant blog post by Nick Loman.

Most bioinformatics tools are equipped with a vast array of command-line options which let the user configure the inputs, outputs, and performance of the software. You may not wish to explore every possible option when using a particular piece of software, but you should always try to have a look at the manual. In fact I would go so far as to say:

Never use a piece of bioinformatics software for the first time without looking to see what command-line options are available and what default parameters are being used

If a bioinformatics tool provides command-line options, it's a good bet that some of those options will have default parameters associated with them. The most common default parameter that you should always check for might be called something like -p, -t, or --threads. I.e. you should always check to see if a program can make use of multiple CPUs/cores to speed up its operation. The default value for such parameter is usually '1'.

Failure to check (and change) this option will merely result in your program taking longer than it otherwise could, so no harm done. More serious issues arise when you stick with default parameters which might actually make you miss interesting results, or produce misleading results.

E.g. the TopHat RNA-Seq aligner has an command-line option -i which lets you set a minimum intron length. The manual reveals that TopHat will ignore donor/acceptor pairs closer than this many bases apart. The default value for this parameter is 70 bp which is probably fine for all vertebrate genomes, but is probably not suitable for genomes such as Arabidopsis thaliana or Caenorhabditis elegans where there are many introns shorter than this length.

Now let's look at some specific examples which shed some light on what can happen if you stick with the default parameters.

Testing Bowtie

I used Bowtie 2 to align two FASTQ files of bovine RNA-Seq data to the reference cow transcriptome. These two files represented paired read data, about 25 million reads in each file. Following alignment, I filtered the data to only keep uniquely mapped, concordant read pairs. Furthermore, I filtered the data to only keep reads pairs that aligned with the highest possible mapping quality (for Bowtie, this means MAPQ values of 42).

From the resulting set of aligned reads, I calculated the inner read pair distance (see this great post about the difference between this and insert sizes), which I then plotted:

Thicker red vertical line indicates mean.

You will notice that most read pairs have a negative inner read pair distance, i.e. the reads overlap. But do you notice anything odd about these results? Look at the far right of the distribution…the maximum inner read pair distance cuts off abruptly at 300 bp. To me this is very suspicious. After looking in the Bowtie manual I discovered this:

-X <int> The maximum fragment length for valid paired-end alignments. Default: 500 bp.

Aha! My reads are 100 bp, so the cutoff at 300 bp in the above graph represents fragment lengths of 500 bp. If there are reads that correctly map further than 300 bp apart, I am not going to see them because of the default value of -X. So what happens if I increase this default parameter to 2,000 bp?

Thicker red line indicates mean of dataset with -X 500, thicker blue line indicates mean of dataset at -X 2000.

I have colored in the new data in blue; we now see many more paired reads that map further than 300 bp apart. There are actually still some reads that correctly map even further than 1,800 bp apart (the limit imposed by -X 2000). So not only does this default parameter of Bowtie mean that I lose some data, it also means that I lose the type of data that might be particularly useful in some applications where we especially want the reads that map far apart (e.g. transcriptome assembly).

It turns out that Nick Loman was writing about the dangers of the -X parameter of Bowtie back in 2013, and I encourage everyone to read his blog post: Use X with bowtie2 to set minimum and maximum insert sizes for Nextera libraries.

Testing TopHat

The TopHat program actually uses Bowtie for the alignment phase, and there are a couple of default parameters used by TopHat that might cause issues for you. From the TopHat manual:

-r <int>  This is the expected (mean) inner distance between mate pairs. The default is 50 bp.

--mate-std-dev <int> The standard deviation for the distribution on inner distances between mate pairs. The default is 20 bp.

The default value for --mate-std-dev seems worryingly low; how many biological datasets ever have standard deviations that are less than half the mean? So I did some testing…

I used TopHat to map a small sample of my RNA-Seq data (200,000 paired reads) to the cow transcriptome (using the -T option of TopHat). In addition to using the defaults, I also tried various runs that used different values for the -r and --mate-std-dev options. My run with default parameters ended up with 150,235 reads mapped to the transcriptome, whereas increasing those parameters (-r 100 and --mate-std-dev 125) produced fewer reads in my output (138,970). So you might conclude that the default parameters are performing better. However, of the 150,235 reads mapped with default parameters, only 84,996 represent high quality alignments (uniquely mapped, concordant read pairs). My custom parameters produced 105,748 high quality alignments. So the custom parameters are giving me better results overall.

Conclusions

It takes time and effort to be prepared to test command-line options, especially for programs like Bowtie and TopHat which have a lot of options. For TopHat, I ended up doing 40 different test runs in order to investigate the effects of various parameters to work out what was most suitable (and effective) for my input data.

Bioinformatics programs should never be treated as 'black box' devices. You may get results out of them, but you will never know whether you could have more and/or better results from changing some of those default parameters.