When 'verbose' mode is maybe a little too verbose: lessons from the Trinity transcriptome assembler

The transcriptome assembler Trinity, like many other bioinformatics command-line tools, sends its principal output (a transcriptome assembly) to a named output file. It writes other information about the status of the run to standard output.

Another feature in common with other bioinformatics programs is that it provides a --verbose mode. The Trinity command-line help describes this as follows:

verbose: provide additional job status info during the run

I recently helped a colleague use Trinity to generate a primate transcriptome assembly, and when we ran the program we did two runs, one with standard logging and one with the verbose output turned on. In both cases we used file redirection to send the output to a file. So what did we end up with?

  1. transcriptome.fasta - 60.4 MB
  2. stdout.log - 2.1 MB
  3. stdout_verbose.log - 140.7 MB

The verbose log file was about 70 times bigger than the standard log file, and over twice the size of the final transcriptome assembly! I tried converting the verbose text file to a PDF, which gave me a 15,385 page document. The Unix word count program tells me that this file contains over 15 million 'words', but the problem is that these are not words that you would necessarily want to read. There are thousands and thousands of pages of output with text that looks like this:

If you run Trinity without redirecting the output to a file, you will just see the percentage completion number overwrite itself on a single line of output. This doesn't work so well though if someone does choose to redirect the output to a file. You could also make an argument that no-one really needs to see such a high level of precision when reporting the state-of-completion of each step (four decimal places!).
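To see why the redirected file balloons, here is a minimal Python sketch of the overwrite trick. It is not Trinity's code, and the message format is invented for illustration; the only assumption is that each update starts with a carriage return, which is what makes a terminal redraw the same line. A file has no notion of redrawing, so it keeps every update.

```python
import io

def write_overwriting_progress(stream, total_steps):
    """Write one progress update per step, each starting with a carriage return.

    The message text is hypothetical; only the leading "\r" matters here.
    """
    for step in range(1, total_steps + 1):
        percent = 100.0 * step / total_steps
        stream.write(f"\r{percent:.4f}% completed.")
    stream.write("\n")

# io.StringIO stands in for a log file created by shell redirection.
buffer = io.StringIO()
write_overwriting_progress(buffer, 1000)
log_text = buffer.getvalue()

# A terminal would show a single line; the "file" holds all 1000 updates.
print(log_text.count("\r"), "updates stored,", len(log_text), "characters in total")
```

Scale that up to a step with millions of updates and you get a log that dwarfs the assembly itself.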

I think this is an example where the verbose log file ends up being so big as to be largely unusable. If you wanted to search for a specific string in that file, then maybe it would be helpful. The main problem is that the Trinity developers are trying to be smart by having the program overwrite the percentage completion status of each step in place on a single line. However, this is only useful if the user chooses not to redirect the output to a file (something which is incredibly common in bioinformatics). I would argue that in 99% of cases, it is more than sufficient for a program to emit 10–20 lines of output regarding the state of completion, e.g.

Calculating stage 1 of shamrock.pl…
10% complete
20% complete
30% complete
40% complete
50% complete
60% complete
70% complete
80% complete
90% complete
100% complete
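
For what it's worth, output like the above is not hard to produce. Here is a rough Python sketch of one way a program could do it: overwrite a single line when writing to a terminal, but fall back to one line per 10% when the output has been redirected. The ProgressReporter class and its method names are hypothetical and have nothing to do with Trinity's actual code; the only real API relied on is the stream's isatty() method.

```python
import io
import sys

class ProgressReporter:
    """Hypothetical helper: live updates on a terminal, sparse lines in a file."""

    def __init__(self, total, stream=None):
        self.total = total
        self.stream = stream if stream is not None else sys.stdout
        # isatty() is False when output is redirected to a file or pipe.
        self.interactive = self.stream.isatty()
        self.last_decile = 0

    def update(self, done):
        if self.interactive:
            # Terminal: redraw the same line on every update.
            self.stream.write(f"\r{100 * done / self.total:.0f}% complete")
            if done == self.total:
                self.stream.write("\n")
            self.stream.flush()
            return
        # Redirected: write a line only when a new 10% boundary is crossed.
        decile = (10 * done) // self.total
        if decile > self.last_decile:
            self.last_decile = decile
            self.stream.write(f"{decile * 10}% complete\n")

# io.StringIO stands in for a redirected log file (its isatty() is False).
log = io.StringIO()
reporter = ProgressReporter(total=5000, stream=log)
for done in range(1, 5001):
    reporter.update(done)
print(log.getvalue(), end="")
```

With 5,000 updates, the redirected log ends up with ten lines rather than 5,000, and someone watching in a terminal still gets the live counter.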