Feature types differ somewhat from those in a gff; “geenuff_” has been prepended to the names to reduce confusion where the meaning is not identical.
These do not exist in a gff, but are used in geenuff to denote things that might be ambiguous or unknown about a gene model.
Currently, errors are assigned whenever an obvious gene model inconsistency is encountered during gff parsing. Most error types are extended from the end of a known feature halfway to the next gene model, while some internal errors (e.g. too_short_intron) can be assigned more precisely. If the gff format were not used as an intermediary, and gene annotation were performed and stored directly in a geenuff structured database, all errors could be assigned more precise ranges for any ambiguity.
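As a rough illustration of the "halfway to the next gene model" rule, the sketch below computes such a range for the + strand. The function name `halfway_mask` and the rounding (floor division) are assumptions made for this example, not taken from the geenuff code base.

```python
from typing import Tuple


def halfway_mask(feature_end: int, next_gene_start: int) -> Tuple[int, int]:
    """Return a (start, end) error range on the + strand.

    Hypothetical helper: the range begins at the end of the known feature
    and stops halfway to the start of the next gene model.
    Coordinates are 0-based, start inclusive, end exclusive.
    """
    if next_gene_start < feature_end:
        raise ValueError("next gene model must not start before the feature ends")
    # Assumption: round down when the gap has an odd length
    midpoint = feature_end + (next_gene_start - feature_end) // 2
    return feature_end, midpoint


mask = halfway_mask(feature_end=100, next_gene_start=200)
```

Here `mask` covers the first half of the gap between the two gene models; the second half would be left to whatever is known about the next model.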
Types are:
When True, these attributes mean that the start and end attributes of a feature correspond to a meaningful biological transition.
When False, these attributes mean that the start and end attributes of a feature either do not correspond to a biological transition, or it is not known whether they do; the region they delineate is nevertheless confidently of the given type.
For instance, suppose the parser finds a gene model in a gff where the start of the first exon and the start of the first CDS (+ strand) have the same position (the A in ATG). The 5’ UTR is then apparently missing, so for the geenuff_transcript feature start_is_biological_start will be set to False, and an error mask will be added upstream of the CDS. We are still confident that all of the CDS must occur within the transcript, and we know the start codon is part of the transcript region, but we mark that the start point itself is probably wrong and mask the upstream range, as it is unclear which part of it is intergenic and which part is UTR.
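The check described above could be sketched roughly as follows for the + strand. Only `start_is_biological_start` is a name from the text; the class `TranscriptFlags`, the field `needs_upstream_error_mask`, and the function `check_five_prime_utr` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptFlags:
    # Attribute name taken from the text above
    start_is_biological_start: bool
    # Hypothetical field: whether an error mask should be added upstream
    needs_upstream_error_mask: bool


def check_five_prime_utr(first_exon_start: int, first_cds_start: int) -> TranscriptFlags:
    """Flag a + strand transcript whose first exon and first CDS start together.

    If both start at the same position, no 5' UTR was annotated, so the
    transcript start is probably not the real transcription start site.
    """
    utr_missing = first_exon_start == first_cds_start
    return TranscriptFlags(
        start_is_biological_start=not utr_missing,
        needs_upstream_error_mask=utr_missing,
    )


missing_utr = check_five_prime_utr(first_exon_start=1000, first_cds_start=1000)
with_utr = check_five_prime_utr(first_exon_start=900, first_cds_start=1000)
```

In the first call the start is flagged as non-biological and a mask is requested; in the second the annotated 5’ UTR means the transcript start is kept as is.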
Features have start and end coordinates that delineate a range.
The positioning of these features follows the common coordinate system: count from 0, start inclusive, end exclusive. So the start of a “geenuff_cds” is at the A of the ATG, i.e. the first coding base pair, while the end of a “geenuff_cds” is just after the stop codon, i.e. the first non-coding bp.
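A small sketch of what this convention means in practice on the + strand, using a made-up toy sequence. The variable names are illustrative only.

```python
# Toy + strand sequence, invented for this example: GG ATG AAA TAA CC
sequence = "GGATGAAATAACC"

# Hypothetical geenuff_cds coordinates: 0-based, start inclusive, end exclusive
cds_start = 2   # the A of ATG, i.e. the first coding bp
cds_end = 11    # just after the stop codon TAA, i.e. the first non-coding bp

# On the + strand these coordinates can be used directly as a Python slice
coding = sequence[cds_start:cds_end]

# With a half-open interval, the length is simply end minus start
cds_length = cds_end - cds_start

# The exclusive end itself points at the first non-coding bp
first_non_coding_bp = sequence[cds_end]
```

The slice returns the start codon through the stop codon, and no `+ 1` or `- 1` corrections are needed for either the slice or the length.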
Importantly, the coding start should always point to the first A of ATG, regardless of strand. This means the numeric coordinates have to change. While one could take the sequence [1, 4) on the + strand and directly use 1 and 4 as Python coordinates to get the sequence, the same will not work on the minus strand. Instead:
0 1 2 3 4 5
.N [A .T .G )N .N
| | | | | |
N. T. A. C. N. N.
To get the reverse complement of this on the minus strand, we set the inclusive start to 3 and the exclusive end to 0. Note that this is now off by one from the Python coordinates, which would be [1, 4) for the same bases.
0 1 2 3 4 5
.N [A .T .G )N .N
| | | | | |
N( T. A. C] N. N.
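The two diagrams above can be turned into a small strand-aware helper. This is a minimal sketch, assuming the sequence is stored as the + strand string; the function name `feature_sequence` and its parameters are hypothetical and not part of geenuff.

```python
# Translation table for complementing bases (N stays N)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def feature_sequence(plus_strand_seq: str, start: int, end: int, is_plus_strand: bool) -> str:
    """Return the sequence of a feature, read 5' to 3' on its own strand.

    start is inclusive and end is exclusive on both strands. On the minus
    strand start is the numerically larger coordinate.
    """
    if is_plus_strand:
        # Coordinates are usable directly as a Python slice
        return plus_strand_seq[start:end]
    # Minus strand: shift both coordinates by one to obtain the Python slice,
    # then complement and reverse to read in the 5' to 3' direction
    lower, upper = end + 1, start + 1
    return plus_strand_seq[lower:upper].translate(COMPLEMENT)[::-1]


plus = feature_sequence("NATGNN", start=1, end=4, is_plus_strand=True)
minus = feature_sequence("NATGNN", start=3, end=0, is_plus_strand=False)
```

With the coordinates from the diagrams, the + strand call reads the A, T, G at positions 1 to 3, and the minus strand call reads the complementary bases at positions 3 down to 1.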
Below is a cheat sheet for how the features compare to the gff, in particular any discrepancy between the closest coordinate in the gff and the now standardized, consistent coordinate.
First and last for gff are reported as they typically are in a gff (coordinate sorted), so they are reversed relative to the biological interpretation on the - strand.
Plus strand (+)
Common Name | GFF | GFF start | GFF end | geenuff type | bearing | position |
---|---|---|---|---|---|---|
TSS, Transcription start site | start 1st exon | x | | geenuff_transcript | start | x - 1 |
TTS, Transcription termination site | end last exon | | x | geenuff_transcript | end | x |
1st bp of start codon | start 1st CDS | x | | geenuff_cds | start | x - 1 |
coding end | end last CDS | | x | geenuff_cds | end | x |
donor splice site (5’ of intron) | end non-last exon | | x | geenuff_intron | start | x |
acceptor splice site (3’ of intron) | start 2nd+ exon | x | | geenuff_intron | end | x - 1 |
Minus strand (-)
Common Name | GFF | GFF start | GFF end | geenuff type | bearing | position |
---|---|---|---|---|---|---|
TSS, Transcription start site | end last exon | | x | geenuff_transcript | start | x - 1 |
TTS, Transcription termination site | start 1st exon | x | | geenuff_transcript | end | x - 2 |
1st bp of start codon | end last CDS | | x | geenuff_cds | start | x - 1 |
coding end | start 1st CDS | x | | geenuff_cds | end | x - 2 |
donor splice site (5’ of intron) | start 2nd+ exon | x | | geenuff_intron | start | x - 2 |
acceptor splice site (3’ of intron) | end non-last exon | | x | geenuff_intron | end | x - 1 |
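Reading across both tables, the position offset appears to depend only on the strand and on whether the value comes from the GFF start or the GFF end column. The sketch below encodes that pattern; the function name `gff_to_geenuff_position` and its parameters are hypothetical. It does not return the bearing, since for introns the bearing is not simply the GFF column name (the GFF coordinate belongs to the neighbouring exon).

```python
def gff_to_geenuff_position(gff_coordinate: int, gff_column: str, is_plus_strand: bool) -> int:
    """Convert a 1-based, inclusive GFF coordinate to a geenuff position.

    gff_column is "start" or "end", i.e. which GFF column the value x came from.
    Offsets follow the cheat-sheet tables above.
    """
    if gff_column not in ("start", "end"):
        raise ValueError("gff_column must be 'start' or 'end'")
    if is_plus_strand:
        # + strand: GFF start -> x - 1, GFF end -> x
        return gff_coordinate - 1 if gff_column == "start" else gff_coordinate
    # - strand: GFF start -> x - 2, GFF end -> x - 1
    return gff_coordinate - 2 if gff_column == "start" else gff_coordinate - 1


# Example: an exon annotated in a GFF as 101..200
plus_tss = gff_to_geenuff_position(101, "start", is_plus_strand=True)
plus_tts = gff_to_geenuff_position(200, "end", is_plus_strand=True)
minus_tss = gff_to_geenuff_position(200, "end", is_plus_strand=False)
minus_tts = gff_to_geenuff_position(101, "start", is_plus_strand=False)
```

For a single-exon transcript at GFF 101..200, this gives a half-open range starting at 100 and ending at 200 on the + strand, and a range starting at 199 (inclusive) and ending at 99 (exclusive) on the - strand.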