Genomes Selected for RefSeq Annotation

RefSeq genome assemblies are selected from public data to represent organisms across the tree of life. In general, only the best-quality genomes or genomes of the highest value to the scientific community are included in RefSeq for each species, but selection rules vary by taxonomic superkingdom. Genes are annotated on all RefSeq genome assemblies, either by NCBI or by the assembly submitters.

Vertebrates, Higher Plants, Arthropods, and some Invertebrates

For these organism groups, the genome assemblies in RefSeq are annotated using the NCBI Eukaryotic Genome Annotation Pipeline (EGAP). They are chosen using the following criteria:

  • Assembly quality:
    • Contiguity: genomes assembled to the level of chromosomes, and genomes with high contig and scaffold N50 values are preferred. Assemblies with contig N50 below 50 Kb are excluded.
    • Sequence accuracy: genome assemblies with high counts of indels or base substitutions are excluded if these substantially affect EGAP’s ability to produce a quality annotation.
  • Availability of transcriptomics data in the NCBI Sequence Read Archive (SRA): EGAP is highly dependent on experimental evidence, so only genomes for species with RNA-seq data that is public in the SRA are annotated.
  • “Best” for the species: RefSeq contains only one genome per species at any given time. The best-quality genome (see “Assembly quality” above) or a genome of high value to the scientific community is chosen.
  • Community interest/requests: we prioritize the annotation of genomes for which we receive requests.
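
The contiguity criterion above can be illustrated with a minimal sketch that computes contig N50 and applies the 50 Kb cutoff. This is not EGAP's or RefSeq's actual implementation; the function names `n50` and `passes_contig_n50_filter` are hypothetical, and the sketch assumes 1 Kb = 1,000 bases and the standard N50 definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length).

```python
from typing import Iterable

# Cutoff taken from the criteria above; assumes 1 Kb = 1,000 bases.
MIN_CONTIG_N50 = 50_000


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the overall total; the
    length at which that happens is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


def passes_contig_n50_filter(contig_lengths: Iterable[int],
                             minimum: int = MIN_CONTIG_N50) -> bool:
    """Hypothetical helper: True if contig N50 is not below the cutoff."""
    return n50(contig_lengths) >= minimum


if __name__ == "__main__":
    example = [100_000, 60_000, 40_000, 10_000]  # made-up contig lengths
    print(n50(example), passes_contig_n50_filter(example))
```

For the made-up lengths in the example, the total is 210,000 bases, so the running sum first reaches half of it (105,000) at the second-longest contig; the N50 is 60,000 and the assembly would pass this one check. The other criteria (sequence accuracy, RNA-seq availability, best for the species) are not modeled here.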

Fungi, Protostomia, and Protozoans

For these organism groups, the genome assemblies in RefSeq, including their gene annotation, are copies of the genomes submitted to the International Nucleotide Sequence Database Collaboration (INSDC [GenBank, ENA, or DDBJ]), slightly modified for format standardization and functional annotation improvements. Only one genome per species is in RefSeq at any given time. The best-quality genome or a genome of high value to the scientific community is chosen.

Prokaryotes

By default, prokaryotic genomes are all annotated using the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) and included in RefSeq. Notable exceptions are:

  • genome assemblies derived from surveillance projects for human pathogens
  • most, but not all, metagenomes and metagenome-assembled genomes
  • genomes with detected quality issues.

The reasons prokaryotic genomes are excluded from RefSeq are listed on the Genome Notes page.

Viruses and Viroids

For these organism groups, the genome assemblies in RefSeq are copies of the genomes submitted to the International Nucleotide Sequence Database Collaboration (INSDC [GenBank, ENA, or DDBJ]). Novel viruses identified as part of another organism’s genome cannot be made into an individual virus RefSeq until the virus sequence is submitted separately. In general, there is one genome per species in RefSeq. However, viruses with high diversity within the species may have more than one RefSeq. There are two ways in which virus RefSeqs are selected:

  1. The International Committee on Taxonomy of Viruses (ICTV) proposes species exemplars in its Virus Metadata Resource. The exemplars are reviewed to ensure technical standards (e.g., they are not regions on eukaryote records), and if acceptable, the record is used to create a duplicate RefSeq record. Partial genomes are considered acceptable if they are exemplars.
  2. Less often, a RefSeq is created based on high value to the scientific community, including as a representative of a potential species not yet officially recognized by the ICTV. In this case, the criteria for selection may include being an early representative of the species, genome completeness, complete annotation, and metadata. Widely used lab isolates or vaccine strains may also be selected as additional RefSeqs for a species, especially if they differ significantly from wild types.

Further, a subset of viral RefSeqs is improved by curators with more complete source metadata, additional references, or reannotation of the gene, protein, and mature peptide regions. We prioritize the annotation of genomes for which we receive requests. Segmented viruses have one RefSeq nucleotide record per segment. However, the assembly record can be used to identify all segments for the genome.

Note: only genomes that are public in INSDC are considered for inclusion in RefSeq.

Please feel free to contact us with any questions, concerns, or suggestions for additional genomes!

Generated May 7, 2024