Thinking of writing a FASTA parser? 12 scenarios that might break your code
In today's Bits and Bites Coding class, we talked about the basics of the FASTA and GFF file formats. Specifically, we discussed the problems that can arise when you write scripts to parse these files, and the kinds of irregularities in a file that may break (or confuse) your script.
Even though you can find code to parse FASTA files in projects such as BioPerl, it can be instructive to try to do this yourself when you are learning a language. Many of the problems that occur when you write code to parse FASTA files fall into the 'I wasn't expecting the file to look like that' category. I recently wrote about how the simplicity of the FASTA format is a double-edged sword. Because almost anything is allowed, someone will at some point, accidentally or otherwise, produce a FASTA file that contains one of the following 12 scenarios. These are all things that a good FASTA parser should be able to deal with and, if necessary, warn the user about:
> space_at_start_of_header_line
ACGTACGTACGTACGT
>Extra_>_in_FASTA_header
ACGTACGTACGTACGT
>Spaces_in_sequence
ACGTACGT ACGTACGT
>Spaces_in_sequence_and_between_lines
A C G T A C

A G A T
>Redundant_sequence_in_header_ACGTACGTACGT
ACGTACGTACGTACGT
><- missing definition line
ACGTACGTACGTACGT
>mixed_case
ACGTACGTACGTACGTgtagaggacgcaccagACGTACGTACGTACGT
>missing_sequence
>rare, but valid, IUPAC characters
ACGTRYSWKMBDHVN
>errant sequence
Maybe I accidentally copied and pasted something
Maybe I used Microsoft Word to edit my FASTA sequence
>duplicate_FASTA_header
ACGTACGTACGTACGT
>duplicate_FASTA_header
ACGTACGTACGTACGT
>line_ending_problem^MACGTACGTACGTACGT^MACGTACGTACGTACGT^M>another_sequence_goes_here^MACGTACGTACGTACGT^MACGTACGTACGTACGT
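To make the list above more concrete, here is a minimal Python sketch of a parser that tolerates most of these scenarios and collects warnings instead of crashing. It is only one possible approach, not a reference implementation: the names `parse_fasta` and `IUPAC_DNA` are made up for this example, the input is assumed to be DNA, and folding mixed-case sequence to upper case is a policy choice that you may not want if lower case carries meaning in your files. It does not try to detect sequence that has been duplicated inside a header.

```python
IUPAC_DNA = frozenset("ACGTRYSWKMBDHVN")  # assumption: nucleotide input only


def parse_fasta(text):
    """Parse FASTA text into (records, problems).

    records  -- list of (header, sequence) tuples, in file order
    problems -- list of human-readable warning strings
    """
    records, problems = [], []
    seen_headers = set()
    header, chunks = None, []

    def finish_record():
        # Store the record collected so far, flagging empty or odd sequence.
        if header is None:
            return
        seq = "".join(chunks).upper()  # policy choice: mixed case is folded
        if not seq:
            problems.append(f"{header!r}: missing sequence")
        unexpected = sorted(set(seq) - IUPAC_DNA)
        if unexpected:
            problems.append(f"{header!r}: unexpected characters {unexpected}")
        records.append((header, seq))

    # str.splitlines() splits on \n, \r\n and a bare \r (shown as ^M above)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines between sequence lines are ignored
        if line.startswith(">"):
            finish_record()
            # Only the first '>' is a marker; spaces after it are dropped
            header, chunks = line[1:].strip(), []
            if not header:
                problems.append("missing definition line after '>'")
            elif header in seen_headers:
                problems.append(f"duplicate header {header!r}")
            seen_headers.add(header)
        elif header is None:
            problems.append(f"sequence before first header: {line!r}")
        else:
            chunks.append("".join(line.split()))  # drop spaces inside sequence
    finish_record()
    return records, problems
```

Calling `records, problems = parse_fasta(open("my.fasta").read())` (with a hypothetical file name) returns every record it could recover, plus a list of warnings to show the user. Reporting problems rather than raising on the first one is a deliberate choice here, because a messy file often contains several of these issues at once.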