Software release notes for the NCBI Eukaryotic Genome Annotation Pipeline

The software used for the NCBI annotation pipelines is under active development. This page provides a list of the major changes incorporated in releases of the Eukaryotic Genome Annotation Pipeline software.

Version 10.2

Release date: September 6 2023

Process

Assignment of Gene Ontology terms to annotated proteins using InterProScan
Improvements in the handling of cross-species RNA-Seq alignments with STAR
Calculation of expression per RNA-Seq run and per gene using Subread featureCounts software
Improved filtering of PacBio and ONT RNA alignments used for model generation
Incremental improvements to internal processing and performance

Reporting

Addition of new downloadable files to our FTP site, in connection with the features above
- Gene Ontology annotation of the annotated genes in GO Annotation File (GAF) format. See files *_gene_ontology.gaf.gz
- Addition of featureCounts output files. These files provide the counts of reads per RNA-Seq run per gene, for all RNA-Seq runs used in the annotation, and some metadata:
  - *_gene_expression_counts.txt.gz: tab-delimited text file with counts of RNA-Seq reads mapped to each gene
  - *_rnaseq_runs.txt: tab-delimited text file containing information about RNA-Seq runs used for gene expression analyses
  - *_rnaseq_alignment_summary.txt: tab-delimited text file containing information about assignment of the aligned reads to genes
- Addition of RNA-Seq coverage graph in UCSC bigWig format, for each SRA run aligned to the genome. See *_graph.bw in the RNASeq_coverage_graphs directory.
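
As a rough illustration of how the *_gene_ontology.gaf.gz files might be consumed, the sketch below groups GO IDs by gene symbol. It assumes the standard GAF 2.x column layout (gene symbol in column 3, GO ID in column 5, header lines starting with "!"); the function names are hypothetical and the exact content of NCBI's files should be checked against a real download.

```python
import gzip
from collections import defaultdict


def go_terms_by_symbol(lines):
    """Group GO IDs by gene symbol from an iterable of GAF-formatted lines."""
    terms = defaultdict(set)
    for line in lines:
        # GAF header and comment lines start with "!"
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip malformed rows rather than guessing their layout
        symbol, go_id = cols[2], cols[4]
        terms[symbol].add(go_id)
    return dict(terms)


def read_gaf(path):
    """Open a gzip-compressed GAF file and group its GO IDs by gene symbol."""
    with gzip.open(path, "rt") as handle:
        return go_terms_by_symbol(handle)
```

Calling read_gaf() on a downloaded *_gene_ontology.gaf.gz file would return a dictionary mapping each gene symbol to its set of GO IDs.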

Version 10.1

Release date: December 14 2022

Process

Better identification and removal of chimeric alignments by STAR for more accurate predictions of paralogous genes
Trimming of low-entropy terminal exons identified by minimap2
Revised annotation of RefSeq NM_/NR_ features with large inserts (e.g. an Alu repeat found in the transcript and not the genome) to use a single exon rather than two abutting exons
Improvements for PAR annotation and gene placement when annotating multiple assemblies (e.g. human GRCh38.p14 and CHM13_T2Tv2.0)
Added support for annotation of human GenBank assemblies using curated RefSeq data, available under genomes/all/pilot
Incremental improvements to internal processing and performance

Reporting

New nomenclature for annotations. Starting with this release, annotations are named after the assembly accession and the date on which the annotation was started. For example, the annotation for assembly GCF_016801865.2 started in December 2022 is named GCF_016801865.2-RS_2022_12.
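
The naming scheme can be sketched as follows. This is only an illustration built from the single example above: the "RS" infix is copied from that example, and the accession pattern (GCF_ or GCA_ followed by digits and a version) is an assumption. Both function names are hypothetical.

```python
import re
from datetime import date

# Pattern inferred from the one example in the notes; other name shapes may exist.
NAME_PATTERN = re.compile(
    r"^(?P<accession>GC[AF]_\d+\.\d+)-(?P<infix>[A-Z]+)_(?P<year>\d{4})_(?P<month>\d{2})$"
)


def annotation_name(assembly_accession, started, infix="RS"):
    """Build an annotation name from an accession and the annotation start date."""
    return f"{assembly_accession}-{infix}_{started.year:04d}_{started.month:02d}"


def parse_annotation_name(name):
    """Split an annotation name into (accession, year, month)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized annotation name: {name!r}")
    return match["accession"], int(match["year"]), int(match["month"])
```

For instance, annotation_name("GCF_016801865.2", date(2022, 12, 1)) reproduces the name given in the example above.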

Version 10.0

Release date: June 14 2022

Process

Aligner change for RNA-Seq reads from Splign to STAR (Dobin A, et al. Bioinformatics 2013, 29(1):15-21)
Upgrade of RFAM library to RFAM 14.6, for the prediction of small non-coding RNAs (rRNAs, snRNAs and snoRNAs)
Incremental improvements to internal processing and performance

Version 9.0

Release date: June 8 2021

Process

Addition of a module for the refinement of transcription start sites with Cap analysis gene expression (CAGE) data. (Applied only in the annotation of species with public CAGE data in SRA.)

Reporting

Addition of cap and/or polyA sites information on genomic and transcript records, when experimental support is available (CAGE for cap and RNA-Seq for polyA sites).
- On genomic records, cap and polyA_site evidence are in the /experiment field of the .gbk files, as /experiment="COORDINATES: polyA evidence [ECO:0006239]" or /experiment="COORDINATES: cap analysis [ECO:0007248] and polyA evidence [ECO:0006239]"
- On transcript records, cap evidence is represented as misc_features and polyA as polyA_site features. See for example XM_027966739.2:

         misc_feature    1
                     /gene="D2HGDH"
                     /experiment="COORDINATES: cap analysis [ECO:0007248]"
                     /note="transcription start site"
[...]
         polyA_site      2524
                     /gene="D2HGDH"
                     /experiment="COORDINATES: polyA evidence [ECO:0006239]"

The cap and polyA sites are present in column 9 of the GFF3 file.
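
The /experiment values shown above follow a regular shape, so the evidence labels and ECO identifiers can be pulled out with a small regular expression. The sketch below is written against the two example strings only; other /experiment values may not fit this pattern, and the function name is hypothetical.

```python
import re

# Matches "<label> [ECO:nnnnnnn]", optionally preceded by "and".
EVIDENCE_PATTERN = re.compile(r"\s*(?:and\s+)?(.+?)\s*\[(ECO:\d+)\]")


def parse_experiment(value):
    """Return a list of (evidence label, ECO ID) pairs from an /experiment value."""
    head, sep, tail = value.partition("COORDINATES:")
    evidence = tail if sep else head
    return EVIDENCE_PATTERN.findall(evidence)
```

A value with both cap and polyA evidence yields two pairs, one per ECO code.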

Version 8.6

Release date: February 24 2021

Process

Change in masking of genome repeats prior to alignments of transcripts and protein evidence:
- Use of WindowMasker for all organisms but human and mouse
- For human and mouse, switched RepeatMasker to using Dfam HMMs rather than RepBase libraries
Normalization of the 5' and 3'-UTR ends of RefSeq model transcripts (with XM or XR prefix) with the curated RefSeq transcripts (NM or NR prefix) of the same gene with the same terminal exon

Reporting

Addition to the web and XML annotation reports of:
- BUSCO results, calculated on the annotated gene set using the longest protein from each gene
- Per-run alignment statistics of long RNA-Seq reads, generated with long-read sequencing technologies such as PacBio or Oxford Nanopore
Removal from the FTP site of files reporting genomic spans masked by RepeatMasker (*rm.out.gz files)
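
The BUSCO input described above uses the longest protein from each gene. A minimal sketch of that selection step is shown below, assuming the proteins are available as (gene ID, protein ID, sequence) tuples; the tuple layout, the function name and the tie-breaking rule (first protein seen wins) are assumptions, not a description of the pipeline's code.

```python
def longest_protein_per_gene(proteins):
    """Pick the longest protein for each gene.

    proteins: iterable of (gene_id, protein_id, sequence) tuples.
    Returns a dict mapping gene_id to (protein_id, sequence).
    """
    best = {}
    for gene_id, protein_id, sequence in proteins:
        current = best.get(gene_id)
        # Ties keep the first protein seen; the notes do not state a tie-break.
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best
```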

Version 8.5

Release date: July 9 2020

Process

Upgrade of minimap2 to version 2.17, for aligning SRA long read transcriptomes
Upgrade of tRNAscan-SE to version 2.0.4, for prediction of tRNAs
Incremental improvements to internal processing and performance

Version 8.4

Release date: March 17 2020

Process

Improvement in the naming process for fish genes. We have switched to primarily applying gene symbols and names from zebrafish, which are mostly provided by the Zebrafish Information Network (ZFIN), instead of human, to other fish orthologs. The end result is more ortholog connections, and better nomenclature.

Version 8.3

Release date: November 25 2019

Process

Aligner change for SRA long read transcriptomes (PacBio IsoSeq, Oxford Nanopore technologies, etc.) from Splign to minimap2 (Li H. Bioinformatics 2018, 34(18):3094-3100)
Incremental improvements to internal processing and performance

Reporting

Addition of annotated transcripts in BAM format to the files available for download
Files for the annotated assemblies now available under genomes/refseq. Files in genomes/Genus_species will be archived on February 1, 2020 as announced December 5, 2019

Version 8.2

Release date: March 8 2019

Process

Upgrade of RepeatMasker to version 4.0.8 and RepBase-20181026
Incremental improvements to internal processing and performance

Version 8.1

Release date: June 21 2018

Process

Incremental improvements to internal processing and performance

Version 8.0

Release date: November 20 2017

Process

Addition of a module to the pipeline to annotate small non-coding RNAs (rRNAs, snRNAs and snoRNAs), using cmsearch from the Infernal package and RFAM 12.0 HMMs for eukaryotes (Nawrocki EP, et al. Nucleic Acids Research 2015, 43(Database issue):D130-7).

Reporting

Changes in the web annotation reports. These result in higher consistency with the NCBI GFFs and other downloadable files. Note that web reports for annotations executed with software older than version 8.0 were not updated to the new format.
- Features annotated on organelles are now included in the 'Gene and Feature statistics' section
- Changes in the break-down of reported features:
  - Immunoglobulin/T-cell receptor gene segments are reported separately from protein-coding genes.
  - Pseudogenes are reported as two categories, transcribed and non-transcribed pseudogenes.

Version 7.4

Release date: April 19 2017

Process

Incremental improvements to internal processing and performance

Reporting

In compliance with an NCBI-wide change, gi numbers are no longer included in FASTA and GenBank format files (.fa, .mfa, .gbk and .gbs) provided on our FTP site.
In the RNA-Seq alignments section of the annotation reports, report of the 'Percent of aligned reads with introns' instead of the 'Percent spliced reads'. The 'Percent of aligned reads with introns' is the proportion of reads with a spliced alignment out of all aligned reads.
In the RNA-Seq alignments section of the annotation reports, correction in the calculation of the 'Percent aligned reads'. In some reports generated prior to version 7.4, the denominator included the count of reads from a small number of SRA runs that were not used in the annotation.
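
The 'Percent of aligned reads with introns' defined above is a simple ratio. The sketch below spells it out; the function and parameter names are hypothetical, and the input checks are an added safeguard rather than something the notes describe.

```python
def percent_aligned_reads_with_introns(spliced_aligned_reads, aligned_reads):
    """Percentage of aligned reads whose alignment is spliced (contains an intron)."""
    if aligned_reads <= 0:
        raise ValueError("aligned_reads must be positive")
    if not 0 <= spliced_aligned_reads <= aligned_reads:
        raise ValueError("spliced_aligned_reads must be between 0 and aligned_reads")
    return 100.0 * spliced_aligned_reads / aligned_reads
```

Note that the denominator is the number of aligned reads, not the total number of reads in the run.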

Version 7.3

Release date: February 9 2017

Process

Improvements in the alignment process for curated RefSeq sequences in masked regions of the genome
Improvements in the global alignment process of protein evidence to the genome
Incremental improvements to internal processing and performance

Reporting

In the eukaryotic annotation status page, addition of links to the Genome Data Viewer (GDV) for genomes assembled to the level of chromosomes
In the RNA-Seq alignments section of the annotation reports, addition of publications associated with RNA-Seq data

Version 7.2

Release date: September 27 2016

Process

Added an option to include in the final annotation Gnomon models with up to 99% ab initio sequence and no BlastP hit. This option may be used for annotating organisms that are distant from reference genomes and for which little long-ranging same-species or cross-species primary evidence is publicly available and aligns to the genome (e.g. some invertebrates or fungi).
Refinements to pairwise orthology calculations to be more conservative when there are multiple paralogs and no supporting synteny information
Incremental improvements to internal processing

Reporting

Changes to GFF3 files. ncRNA features are now represented in the type field (column 3) with specific SO terms associated with their ncRNA_class (lnc_RNA, SRP_RNA, snRNA, RNase_MRP_RNA, etc.). The "ncrna_class" attribute is no longer provided in the attributes field (column 9).
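
Because the ncRNA class now lives in column 3, a consumer can tally ncRNA features by reading the type field directly. The sketch below assumes standard nine-column, tab-delimited GFF3; the set of SO terms contains only the four named in this note, so it is incomplete by construction, and the function names are hypothetical.

```python
from collections import Counter

# Only the SO terms named in the release note; real files contain more types.
NCRNA_TYPES_FROM_NOTES = {"lnc_RNA", "SRP_RNA", "snRNA", "RNase_MRP_RNA"}


def count_feature_types(gff3_lines):
    """Count features by their type field (column 3) in GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        counts[cols[2]] += 1
    return counts


def ncrna_counts(gff3_lines, ncrna_types=NCRNA_TYPES_FROM_NOTES):
    """Restrict the tally to the given ncRNA SO terms."""
    counts = count_feature_types(gff3_lines)
    return {name: n for name, n in counts.items() if name in ncrna_types}
```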

Version 7.1

Release date: June 8 2016

Process

Upgrade of RepeatMasker to version 4.0.6, along with RepBase Update 20150807 and RM database version 20150807
Incremental improvements to internal processing

Version 7.0

Release date: April 8 2016

Process

Execution of the annotation process on top-level sequences (chromosomes, and unplaced and unlocalized scaffolds) instead of scaffolds. This change improves the annotation of features spanning gaps between adjacent scaffolds. For the near future, SNP annotation will remain on scaffolds.
Assignment of unique GeneIDs to tRNAs annotated at different locations. Note that tRNAs with the same anticodon are assigned the same Gene symbol. This change increases consistency with other gene types.
Bug fix in the handling of coding models with a high proportion of ab initio sequence (>50%)
Restriction in the generation of alternative variants for alternate loci units. If a gene with a known RefSeq transcript (NM_ or NR_ prefix) is placed on an alternate locus, no alternative variant model (XM_ or XR_ prefix) is created for the gene on this alternate locus. Given sufficient evidence, alternative variants for genes with known RefSeq transcripts will continue to be generated on the primary assembly unit. This change will affect the annotation of alternate loci units in human and mouse.
Incremental improvements to internal processing and annotation consistency

Reporting

In Nucleotide:
- GenBank, Graphics and ASN views of RefSeq placed scaffolds no longer show any annotation (see for example NW_001594469.1)
- ASN view of RefSeq chromosomes now includes the annotation.
On the FTP site (see for example the recent re-annotation of platypus)
- GFF files are now only provided for top-level sequences.
- Files in the CHR_* directories for nuclear chromosomes no longer include annotation on placed scaffolds.
- Masked spans (masking_coordinates.gz) are now in top-level coordinates.
- Comparisons of current to previous annotation (comparison directory) are now in top-level coordinates.

Version 6.5

Release date: November 23 2015

Process

Due to low usage of the STS (Sequence Tagged Sites) placement information on annotated sequences, the process that maps STSs has been discontinued. STS annotation will not be produced for new RefSeq sequences, but will remain available for sequences last annotated before November 20, 2015.
Better handling of stranded RNA-seq reads
Incremental improvements to internal processing and annotation consistency

Reporting

Addition of a section to the HTML annotation reports, "Comparison of current and previous annotations", for organisms that are re-annotated (see this example). This new section indicates how much of the annotation on each assembly has changed between the current and the previous annotation releases and provides links to downloadable full reports. The full reports (in tabular and Genome Workbench formats) are on our FTP site and contain the mappings of current to previous genes and transcripts. Summary counts by category of change are available in the XML annotation report, annotation_report.xml file (<AnnotationComparison> section), also in the FTP directory.
Addition to the annotation_report.xml <RnaseqAlignReport> section of the <Stranded> tag to the individual SRA runs that were generated with a strand-specific isolation technique
Changes to GFF3-formatted files:
- Transcript features for model RefSeqs now contain the attribute "model_evidence" in column 9, listing the source and number of supporting evidence and percent coverage by RNA-Seq samples, similar to reporting in the flatfile format.
- GFF3 output has been changed to only use small gaps (1-2 bp) (aka micro-introns) to correct for frameshifts, even if the RefSeq product has an insertion. Earlier files from software releases 6.3 and 6.4 used small overlaps to represent insertions according to INSDC specifications, but these overlaps weren’t compatible with some external software.

Version 6.4

Release date: July 22 2015

Process

Improvement in the RNA-Seq alignment process. Prior to alignment to the genome, SRA runs are now evaluated for strandedness and reads of stranded runs are aligned in the sense orientation only. Unstranded runs are aligned in both orientations and logic to determine the best strand is applied downstream as before.
Incremental improvements to internal processing and annotation consistency

Reporting

Changes to GFF3-formatted files. Genes in the GFF files for the final annotation now contain the attribute "gene_biotype" in column 9, making explicit whether a gene is coding, non-coding, pseudogene, etc. See more details in the GFF3 documentation.
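
A minimal sketch of reading the "gene_biotype" attribute from column 9 follows. It assumes standard GFF3 attribute syntax (semicolon-separated key=value pairs with URL-escaped values) and that each feature of interest carries an ID attribute; the function names and the example IDs in any test data are illustrative only.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attributes field into a dict of key -> unescaped value."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, sep, value = field.partition("=")
        if sep:
            attributes[key] = unquote(value)
    return attributes


def gene_biotypes(gff3_lines):
    """Map feature ID -> gene_biotype for every feature that has the attribute."""
    biotypes = {}
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attributes = parse_attributes(cols[8])
        if "gene_biotype" in attributes and "ID" in attributes:
            biotypes[attributes["ID"]] = attributes["gene_biotype"]
    return biotypes
```

The sketch does not filter on column 3, so any feature type that carries the attribute is reported.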

Version 6.3

Release date: April 21 2015

Process

Improvement in the annotation of model proteins containing selenocysteine residues (see for example XM_012546481.1)
- Selenocysteine residues are now represented with a "U" (instead of a code-breaking "X") in protein sequences.
- Titles of selenocysteine-containing proteins are no longer prefixed with "LOW QUALITY PROTEIN" unless the protein contains corrections for the genome.
- Transcripts and annotation of the parent genomic sequences contain a /transl_except that explicitly provides the location of the selenocysteine residue in the sequence.
Refinement in the logic that weighs alignments of same-species transcript versus cross-species validated RefSeq proteins to favor same-species transcripts. This change results in a smaller number of models with frameshifts or code-breaks.
Improvement of models bordering assembly gaps
- Better handling of alignments of protein evidence affected by assembly gaps
- Generation of alternative variants of gap-filled models, if alternative variants are supported by the evidence and if the gap-filled portion is identical in all variants
- Trimming of UTRs in gap-filled portion of a transcript if shorter than 100 bases

Reporting

Change in the reporting of RNA-Seq alignment statistics in the "Short read transcript alignments" section of the annotation reports. Raw counts of aligned and spliced reads are estimates and are subject to small variations (within 1%) from run to run, therefore only percentages rounded to the nearest integer are now reported.

Version 6.2

Release date: December 3 2014

Process

Improvements to alignments and model generation algorithms
Exclusion of low-entropy RNA-Seq reads from the set of reads aligned to the genome

Reporting

Addition of a section to the annotation reports, "Alignment of the annotated proteins to a set of high-quality proteins", providing the counts of annotated proteins with BlastP hits against a database of high-confidence proteins (e.g. UniProtKB/Swiss-Prot), at several coverage thresholds. For comparison purposes the data is also provided for a selection of related organisms that were recently annotated.
Bug fix in the calculation of the number of RNA-Seq reads aligned to the genome presented in the "Short read transcript alignments" section of the annotation reports. Statistics in reports pre-dating the 6.2 release may be off by a few percent.
Modification of the representation of multi-interval non-trans-spliced tRNA features in GFF3 files. Each multi-interval non-trans-spliced tRNA feature is now represented by a single feature (line) of type tRNA and multiple nested features of type exon (one for each interval).
Modification of the representation of transcripts with indels compared to the genome in GFF3 files. Insertions in transcripts within the coding region are now represented by a small overlap between the two halves of a split exon, and deletions within the coding region are represented by very short introns between the two halves of an exon. This allows software to properly interpret the reading frame. Note that the conceptual sequence of the feature can still differ from the transcript or protein sequence because of mismatches, gaps, and when overlapping genome sequence does not match the sequence of an insertion.
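
To make the split-exon convention concrete, the sketch below classifies the junction between consecutive exon intervals of one transcript as an overlap (insertion in the transcript), a very short intron (deletion in the transcript) or an ordinary intron. Coordinates are assumed to be 1-based and inclusive, as in GFF3. The 2 bp cut-off for "very short" is an assumption borrowed from the "1-2 bp" wording in the version 6.5 notes, and the function name is hypothetical.

```python
def classify_junctions(exons, max_micro_intron=2):
    """Classify the junction between each pair of consecutive exons.

    exons: list of (start, end) tuples, 1-based inclusive genomic coordinates.
    Returns a list of (label, size_in_bp) tuples, one per junction.
    """
    ordered = sorted(exons)
    calls = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end - 1  # bases strictly between the two exons
        if gap < 0:
            calls.append(("overlap", -gap))  # insertion in the transcript
        elif gap == 0:
            calls.append(("abutting", 0))
        elif gap <= max_micro_intron:
            calls.append(("micro_intron", gap))  # deletion in the transcript
        else:
            calls.append(("intron", gap))
    return calls
```

Files produced with version 6.5 or later use only the small gaps, so overlaps should appear only in files from releases 6.2 to 6.4.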

Version 6.1

Release date: August 4 2014

Process

Addition of a post tRNAscan-SE filter to limit probable noise in tRNA predictions
Bug fix in the unique hit exon coverage track displayed in Gene, that caused reads with multiple placements to be included

Reporting

In the "Short read transcript alignments" section of the annotation reports, addition of the alignment statistics per RNA-Seq SRA run, in addition to the alignment statistics per sample

Version 6.0

Release date: April 17 2014

Process

For model RefSeqs extending into assembly gaps, construction of transcript (XM_/XR_) and protein (XP_) products using a combination of genomic and transcript sequence (RefSeq, INSDC or TSA) to compensate for missing genomic sequence.
Improvements to identification of orthologs compared to a reference taxon, including more robust analysis of protein BLAST alignments. These changes result in more ortholog calls, especially for more distantly related taxa, with lower false-match rates. The results are used for gene naming, and are reported in Gene.
Redesign of the code for categorizing genes by type (protein-coding, pseudogene, non-coding) and assigning names to genes and products (transcript and protein RefSeqs). These changes allow for more automation and higher throughput, as well as improve the identification of pseudogenes and low-quality protein-coding genes.
Change in the naming of model RefSeq variants and isoforms to use the same isoform name for multiple variants that differ only in the UTRs, and to use the same variant and isoform names for equivalent model RefSeqs annotated on multiple assemblies.

Reporting

For model RefSeqs extending into assembly gaps, addition to the nucleotide records of the source of the model spans. For example, XM_007659754.1 is a model with three exons annotated on genomic sequence AAPN01287557.1 and was allowed to extend at the 5-prime end into an assembly gap based on the alignment of transcript JQ350810.1. The flat file for this record contains the following three indicators of the origin of the model:

A comment:

[Image: gap_filling_comment]

An assembly gap attribute:

[Image: gap_filling_attribute]

A PRIMARY block providing the spans of the RefSeq model on the genomic or transcript (primary) sequence:

[Image: gap_filling_composition]

For model RefSeqs extending into assembly gaps, annotation of the genomic mRNA and CDS features with partial features (< or > in the flatfile view), either at internal intervals or at the 5-prime or 3-prime end, to indicate the location of the missing sequence.
Addition of a structured comment of RefSeq attributes to the nucleotide and protein records of model RefSeqs with ab initio span(s) and/or with corrections (see XM_007529441.1 for example). The comment indicates the following, as appropriate for each model:
- Ab initio span(s): % bases not supported by evidence and produced by the ab initio component of Gnomon
- Frameshift(s): number of indels corrected
- Internal stop codon(s): number of genomic stop codons corrected
- Assembly gap(s): number of transcript bases added to fill a genome assembly gap (see above)
Addition of keyword "corrected model" to models with frameshifts, internal stop codons or assembly gaps; and keyword "includes ab initio" to models with ab initio spans.
Addition to the annotation reports of the number of model RefSeqs with genomic gaps filled with transcript sequence.
Change in the annotation reports for the calculation of the number of corrected model RefSeqs. The new count, "model RefSeq with major corrections", includes all model RefSeq proteins with major corrections (CDSs with correction for internal stop-codons, frameshifts or internal gaps).
Changes to GFF3-formatted files:
- Incorporation of the start_range and end_range attributes from the GVF specification to indicate partial features. The GFF3 specification currently does not include any formal mechanism to indicate partial features, so these attributes are borrowed from GVF with non-official (lowercase) tags. In NCBI's annotation files, presence of a start_range attribute can simply be interpreted as column 4 is partial, and an end_range attribute as column 5 is partial, regardless of strand, without further analysis of the tag value. Further details about the attributes are available in the GVF specifications.
- Reduced usage of URL escaping in attribute values.
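
The interpretation rule for start_range and end_range given above (presence alone marks column 4 or column 5 as partial, regardless of strand) can be sketched in a few lines. The function name is hypothetical, and the attribute values in the usage example are illustrative rather than copied from an NCBI file.

```python
def partial_ends(column9):
    """Return (column4_is_partial, column5_is_partial) for a GFF3 attributes field.

    Only the presence of the start_range / end_range keys is checked;
    the tag values are not inspected, and strand is ignored.
    """
    keys = set()
    for field in column9.strip().split(";"):
        key, sep, _ = field.partition("=")
        if sep:
            keys.add(key)
    return ("start_range" in keys, "end_range" in keys)
```

For example, partial_ends("ID=cds-1;start_range=.,100") reports that only the column 4 coordinate is partial.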

Version 5.2

Release date: November 19 2013

Process

Exclusion of spans in protein alignments from use by gene prediction if the spans contain an intron with much lower RNA-Seq support than the rest of the alignment.
Classification of model RefSeqs (XR_) for predicted non-coding genes as ncRNA of type lncRNA rather than misc_RNA.
Improvements to RNA-Seq filtering criteria in regions of alternative splicing.
Improvements to model predictions in regions of closely-spaced or overlapping genes.
Improvements to the assembly-assembly alignment process, used for tracking genes across assemblies.
Performance improvements.

Reporting

Production of a report with each annotation run summarizing the features annotated and the alignments used for gene prediction. This report is available in HTML (see URL in the README_CURRENT_RELEASE file) and in XML on the FTP site.
Change in the format of the README_CURRENT_RELEASE file distributed on the FTP site.
Phase-out of the production of RefSeq scaffold BLAST databases. Top-level (chromosomes, unplaced and unlocalized scaffolds) BLAST databases are now the default on the organism-specific BLAST pages.
Increased stringency for the CpG islands displayed in Map Viewer. Only islands meeting the "strict" definition of 500bp or more in length, 50% or higher in GC content and 0.60 or higher observed CpG / expected CpG are now shown in the CpG island map.
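
The three "strict" thresholds above can be checked on a candidate sequence as sketched below. The observed/expected CpG ratio is computed here as (number of CG dinucleotides × length) / (number of C × number of G), a commonly used convention; the notes do not state which formula the pipeline uses, so treat that as an assumption. The function name is hypothetical.

```python
def is_strict_cpg_island(sequence, min_length=500, min_gc=0.50, min_obs_exp=0.60):
    """Check a DNA sequence against the 'strict' CpG island thresholds."""
    seq = sequence.upper()
    length = len(seq)
    if length < min_length:
        return False
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return False  # expected CpG is zero, so the ratio is undefined
    gc_fraction = (c_count + g_count) / length
    # Assumed formula: observed CpG count over the count expected from base composition.
    obs_exp = seq.count("CG") * length / (c_count * g_count)
    return gc_fraction >= min_gc and obs_exp >= min_obs_exp
```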

Version 5.1

Release date: July 19 2013

Process

Exclusion of spans in EST or mRNA alignments from use by gene prediction if the spans contain an intron with much lower RNA-Seq support than the rest of the alignment.
Allowed co-existence of known RefSeq (NM/NR/NP_ accessions) and model RefSeq (XM/XR/XP_ accessions) on the same gene, resulting in an increase in the number of alternate variants for organisms with a large amount of evidence (i.e. RNA-Seq).

Version 5.0

Release date: April 11 2013

Process

Addition of a process to align RNA-Seq short reads from SRA to the genome.
Incorporation of RNA-Seq alignments in gene prediction.
Performance improvements.

Reporting

Production of RNA-Seq coverage graphs and intron feature tracks.
Addition of BioSamples in the annotated features' evidence support summary on the model RefSeq records.

Version 4.1

Release date: January 8 2013

Process

Classification of model RefSeqs (XR_) for predicted non-coding genes as misc_RNA.
Performance improvements.

Reporting

Addition of a /note on RNA and CDS features describing differences between the annotation product and the genome.
Addition of the BioProject ID on model RefSeq records (XM/XR/XP_).

Version 4.0

Release date: May 21 2012

Process

For some genomes, addition of ab initio predictions to the model RefSeq set if these have high-quality BLAST hits to known proteins.
Improvements to the assembly-assembly alignment process, used for tracking genes across assemblies.
Improvements to the alignment of genomic sequence to the genome. Alignments with long gaps are now split in the Map Viewer display.
Performance improvements.

Reporting

Addition of annotation files in GFF3 format to the FTP site.
Addition of BLAST databases of top-level molecules (chromosomes, unplaced and unlocalized scaffolds) to the set of BLAST databases displayed in the organism-specific BLAST pages.
