Prokaryotic RefSeq Genomes

Which genomes are included in RefSeq?
Reference genomes
Non-reference genomes
Protein data model

Related documentation:

Which genomes are included in RefSeq?

The RefSeq archaeal and bacterial genome assemblies are annotated and maintained copies of complete and whole-genome shotgun assemblies submitted to INSDC (Genbank, ENA and DDBJ) that meet sequence and annotation quality criteria. A genome assembly may be excluded from RefSeq for reasons related to sequence or annotation quality. Most assemblies generated from metagenomic (MAG) samples are excluded due to concerns with the accuracy of the organism assignment and possible cross-contamination except for MAG assemblies estimated to be free of contaminants, for species with fewer than 50 non-MAGs, and with Taxonomy check status “OK” or “Inconclusive” with best match status “below-threshold match”.

The RefSeq archaeal and bacterial genome assemblies can be searched and downloaded from Datasets. They are also available as a Blast database for sequence homology searches.

With the exception of selected reference genomes, RefSeq genomes are annotated using NCBI’s prokaryotic genome annotation pipeline to provide consistency across the dataset. Each genome is annotated with gene and protein features that are unique to that genome. Gene features are provided with unique locus_tags on all genomes but NCBI GeneID cross-references are only annotated for reference genomes. Therefore, links between RefSeq genomes, or annotated proteins, and the Gene resource are only available for a subset of RefSeq prokaryotic genomes (and corresponding annotated proteins). Protein coding regions (CDS features) include cross-references to RefSeq non-redundant protein accessions (with WP_ prefix). A given non-redundant protein accession may be annotated on more than one genome.

Reference genomes

For each defined species with assemblies included in RefSeq, one assembly is designated as 'reference’. Reference genomes are a compact, normalized, and taxonomically diverse view of the RefSeq collection that can be used for the taxonomic identification and characterization of novel sequences. Only species with formal names accepted under the International Code of Nomenclature of Prokaryotes which is governed by the International Committee on Systematics of Prokaryotes, or species with names that have conventions for formal use, such as names that are only “effectively published”, i.e. in publications other than the journal of record International Journal of Systematic and Evolutionary Microbiology, or Candidatus species are assigned references. No references are selected for undefined species such as ‘Vibrio sp.’.

Among species in scope, the following assemblies are taken into consideration in the selection of references:

Live RefSeq assemblies - not superseded by newer assemblies or suppressed due to quality or taxonomic misassignment concerns
Assemblies that pass Average Nucleotide Identity (ANI) criteria - 1) are from type material or 2) are not from type material but a) match type material assemblies for the species at above 70% coverage, or b) in the absence of type material for the species, are not flagged as mismatches at or above ANI thresholds with at least 70% query and subject coverage to a type from a different species.

A reference genome is chosen among eligible assemblies based on the criteria below, in order of importance. Criteria that are lower on the list are only used if assemblies are judged equal based on higher-ranked criteria:

Manual selection – a few references are selected based on community input, biological features or other a priori knowledge about the assembly.
Magnitude of deviation from the mean assembly length for the species – assemblies with the lowest integral number of standard deviations from the species average assembly length are preferred. This ensures that assemblies that are significantly longer or shorter than others for the species are not chosen.
CheckM Completeness – In order, assemblies with the highest quantized level of completeness (98 to 100) are preferred over assemblies in the 95-98, 90-95, 85-90, 70-85, 50-70, and under 50 percent level of completeness, as determined by CheckM.
Magnitude of count of pseudo CDSs – assemblies with the lowest rounded natural log of pseudo CDSs are preferred.
Presence of a plasmid - assemblies containing plasmid sequences are preferred.
Magnitude of count of scaffolds – assemblies with the lowest rounded log base 10 scaffold count are preferred.
Species reference – the current reference is preferred.
Magnitude of deviation from the mean gene count for the species - assemblies with the lowest integral number of standard deviations from the species average count of genes are preferred. This ensures that assemblies that have significantly more or fewer genes than others for the species are not chosen.
Absolute count of pseudo CDSs - assemblies with fewer pseudo CDSs are preferred.
Type strain status
Release date (tie-breaker)

Reference genomes are updated several times a year to take into account newly added assemblies to RefSeq, changes in the NCBI Taxonomy, modified taxonomic assignments, and recently discovered contamination.

With the exception of a few manually selected assemblies, reference genomes are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record (see Protein data model below). Selected reference genomes are assemblies that are annotated and updated by the assembly submitters and chosen by the RefSeq curatorial staff based on their quality and importance to the community as anchors for the analysis of other genomes in their taxonomic group. Some reference genomes are selected based on a long history of collaboration and wide recognition as a community standard, such as the reference genome of Escherichia coli str. K-12 substr. MG1655. Other reference genomes are selected based on medical importance, sequence and annotation quality, and the availability of experimental support. Gene annotation on these genomes is reviewed and may be modified by RefSeq, but largely reflects the work of the submitters. These genomes are annotated with YP_ or NP_ protein accessions which in turn cross-reference the non-redundant protein records. Reference genomes for species with at least ten assemblies in RefSeq are annotated with a GeneID cross-reference to the NCBI Gene resource.

You can browse the list of reference genomes in the NCBI Datasets resource. Sequences and annotation for the latest set of prokaryotic reference genomes can also be downloaded from Datasets.

Non-reference genomes

The non-reference prokaryotic genomes make the bulk of the RefSeq prokaryotic collection and are taxonomically diverse. They are included in RefSeq to represent the sequence variation observed in isolate- and strain-specific genomes, including genomes of medical importance. They are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record (see Protein data model below). However, non-reference genomes are not annotated with GeneID cross-references to the NCBI Gene resource. Sequences and annotation for non-reference genomes can be downloaded from Datasets.

Protein data model

With the exception of selected reference genomes, only non-redundant protein accessions (WP_ accession prefix) are annotated on new or re-annotated RefSeq prokaryotic WGS and Complete genomes. A single non-redundant protein may be annotated on many RefSeq genomes, when the CDS annotated on those genomes encodes exactly the same protein that is identical in both sequence and length. For example, the coding sequence for the 50S ribosomal protein L11 that is annotated on NC_017743.1 provides a cross-link, shown below, to the non-redundant RefSeq protein WP_003156430.1. Over 2000 prokaryotic genomes are annotated with a CDS feature that encodes the identical sequence of the same length as shown in the Identical Protein report, which can be accessed by clicking on the "Identical Proteins" link near the top of the protein record.

Image of CDS feature for 50S ribosomal protein L11 as annotated on NC_017743.1. The CDS cross-references nonredundant protein WP_003156430.1.

RefSeq

Integrated reference sequences

Prokaryotic RefSeq Genomes

Which genomes are included in RefSeq?

Reference genomes

Non-reference genomes

Protein data model