Details of GFF version 4 have emerged


One of the most widely used file formats in bioinformatics is the General Feature Format (GFF). This venerable tab-delimited format uses 9 columns of text to help describe any set of features that can be localized to a DNA or RNA sequence.

It is most commonly used to provide a set of genome annotations that accompany a genome sequence file, and the success of this format has also spawned the similar Gene Transfer Format (GTF), which focuses on gene structural information.

GFF has been an evolving format, and the widely adopted 2nd version has largely been superseded by GFF version 3, which was developed by Lincoln Stein from around 2003 onwards.

As version 3 is now over a decade old, work has been ongoing to develop a new version, GFF 4, that is suitable for the rigors of modern-day genomics. The principal change in version 4 will be the addition of a 10th GFF column. This 'Feature ID' column is defined in the spec as follows:

Column 10: Feature ID

Format: FeatureID=<integer>

Every feature in a GFF file should be referenced by a numerical identifier which is unique to that particular feature across all GFF files in existence.

This field will store an integer in the range 1–999,999,999,999,999 (no zero-padding), and identifiers will be generated via tools available from the GFF 4 consortium. If you wish to generate a GFF 4 file, you will need to obtain officially sanctioned Feature IDs for this mandatory field.
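The format rules above (a fixed FeatureID= prefix, an integer from 1 to 999,999,999,999,999, no zero-padding) are simple enough to check mechanically. Here is a minimal sketch in Python, assuming the field contains nothing but the prefix and the digits. The function name parse_feature_id is hypothetical and is not part of any official GFF 4 tooling.

```python
import re

# Assumed pattern: "FeatureID=" followed by 1 to 15 digits with no leading
# zero, which covers exactly the range 1 to 999,999,999,999,999.
FEATURE_ID_RE = re.compile(r"FeatureID=([1-9][0-9]{0,14})")


def parse_feature_id(field):
    """Return the integer Feature ID, or raise ValueError if the field is malformed."""
    match = FEATURE_ID_RE.fullmatch(field)
    if match is None:
        raise ValueError(f"not a valid Feature ID field: {field!r}")
    return int(match.group(1))
```

Something like parse_feature_id("FeatureID=125731789") would return the integer 125731789, while zero-padded or out-of-range values such as "FeatureID=007" would be rejected. Note that this only checks the syntax of the field; whether an ID was actually issued by the consortium cannot be determined locally.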

The advantage of this new field is that all bioinformatics tools and databases will have a convenient way to uniquely reference any feature in any GFF file (as long as it is version 4 compliant).

Large institutions may wish to work with the GFF 4 consortium to reserve blocks of consecutive numeric ranges for Feature IDs.

It is intended that the GFF 4 consortium will act as a gatekeeper to all Feature IDs, and that via their APIs you will be able to check whether any given Feature ID exists. If it does, you will be able to extract the relevant details of that feature from whatever GFF file in the world contains that specific Feature ID.

Here is an example of how GFF version 4 would describe an intron from a gene:

## gff-version 4
## sub-version 1.02
## generated: 2015-02-01
## sequence-region   chr1 1 2097228
chrX    Coding_transcript   intron 14192   14266   .   -   .   gene=Gene00071  FeatureID=125731789

In this example, the intron is the 125,731,789th feature to be registered globally with the GFF 4 consortium. The big advantage of this format is that a researcher can now guarantee that this particular Feature ID will not exist in any other GFF file anywhere in the world. The use of unique identifiers like this will be a huge leap forward for bioinformatics, as we will no longer have to worry about lines in our GFF files possibly existing in someone else's GFF files as well.
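For anyone wanting to read such a line programmatically, here is a rough Python sketch. It assumes the nine standard GFF columns (seqid, source, type, start, end, score, strand, phase, attributes) come first, separated by tabs, with the Feature ID as the 10th. The sample line fills the phase column with "." on that assumption. The names Gff4Record and parse_gff4_line are made up for illustration.

```python
from typing import NamedTuple


class Gff4Record(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: str
    feature_id: int


def parse_gff4_line(line):
    """Split one tab-delimited feature line into its 10 assumed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 tab-separated columns, found {len(fields)}")
    # Column 10 is assumed to look like "FeatureID=<integer>".
    prefix, _, number = fields[9].partition("=")
    if prefix != "FeatureID" or not (number.isascii() and number.isdigit()):
        raise ValueError(f"malformed Feature ID column: {fields[9]!r}")
    return Gff4Record(
        *fields[:3], int(fields[3]), int(fields[4]), *fields[5:9], int(number)
    )


# Sample line modeled on the intron example, with "." assumed for phase.
line = "chrX\tCoding_transcript\tintron\t14192\t14266\t.\t-\t.\tgene=Gene00071\tFeatureID=125731789"
record = parse_gff4_line(line)
print(record.type, record.feature_id)
```

A line with only nine fields, as in a GFF 3 file, would raise a ValueError under this sketch rather than being silently accepted.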

Update: check the date