Prokaryotic RefSeq Genomes
Related documentation:
- Prokaryotic RefSeq genome re-annotation project
- Non-redundant RefSeq proteins
- Prokaryotic RefSeq genomes FAQ
Which genomes are included in RefSeq?
The RefSeq archaeal and bacterial genome assemblies are annotated and maintained copies of complete and whole-genome shotgun assemblies submitted to INSDC (Genbank, ENA and DDBJ) that meet sequence and annotation quality criteria. A genome assembly may be excluded from RefSeq for reasons related to sequence or annotation quality. Most notably, assemblies generated from environmental samples are excluded due to concerns with the accuracy of the organism assignment and possible cross-contamination.
The RefSeq archaeal and bacterial genome assemblies can be searched and downloaded from Entrez Assembly. They are also available as a Blast database for sequence homology searches.
With the exception of selected reference genomes, RefSeq genomes are annotated using NCBI’s prokaryotic genome annotation pipeline to provide consistency across the dataset. Each genome is annotated with gene and protein features that are unique to that genome. Gene features are provided with unique locus_tags on all genomes but NCBI GeneID cross-references are only annotated for reference genomes and a sub-set of representative genomes. Therefore, links between RefSeq genomes, or annotated proteins, and the Gene resource are only available for a subset of RefSeq prokaryotic genomes (and corresponding annotated proteins). Protein coding regions (CDS features) include cross-references to RefSeq non-redundant protein accessions (with WP_ prefix). A given non-redundant protein accession may be annotated on more than one genome.
For each defined species with assemblies included in RefSeq, one assembly is designated as 'reference’ or ‘representative'. The collection of 'representative' and 'reference' genomes is a compact, normalized, and taxonomically diverse view of the RefSeq collection that can be used for the taxonomic identification and characterization of novel sequences.
Reference genomes
Reference genomes are genome assemblies that are annotated and updated by the assembly submitters and chosen by the RefSeq curatorial staff based on their quality and importance to the community as anchors for the analysis of other genomes in their taxonomic group. Some reference genomes are selected based on a long history of collaboration and wide recognition as a community standard, such as the reference genome of Escherichia coli str. K-12 substr. MG1655. Other reference genomes are selected based on medical importance, sequence and annotation quality, and the availability of experimental support. Gene annotation on these genomes is reviewed and may be modified by RefSeq, but largely reflects the work of the submitters.
Reference genomes are annotated with YP_ or NP_ protein accessions which in turn cross-reference the non-redundant protein records. Reference genomes are also annotated with the GeneID cross-reference to the NCBI Gene resource. You can browse the list of reference genomes in the NCBI Genome resource. Sequences and annotation for reference assemblies can be downloaded from Entrez Assembly.
Representative genomes
For species without a reference genome, one assembly per defined species is selected as representative. Only species with formal names accepted under the International Code of Nomenclature of Prokaryotes which is governed by the International Committee on Systematics of Prokaryotes, or species with names that have conventions for formal use, such as names that are only “effectively published”, i.e. in publications other than the journal of record International Journal of Systematic and Evolutionary Microbiology, or Candidatus species are assigned representatives. No representatives are selected for undefined species such as ‘Vibrio sp.’.
Among species in scope, the following assemblies are taken into consideration in the selection of representatives:
- Live RefSeq assemblies - not superseded by newer assemblies or suppressed due to quality or taxonomic misassignment concerns
- Assemblies that pass Average Nucleotide Identity (ANI) criteria - 1) are from type material or 2) are not from type material but a) match type material assemblies for the species at above 70% coverage, or b) in the absence of type material for the species, are not flagged as mismatches at or above ANI thresholds with at least 70% query and subject coverage to a type from a different species.
A representative genome is chosen among eligible assemblies based on the criteria below, in order of importance. Criteria that are lower on the list are only used if assemblies are judged equal based on higher-ranked criteria:
- Manual selection – a few representatives are selected based on community input, biological features or other a priori knowledge about the assembly.
- Magnitude of deviation from the mean assembly length for the species – assemblies with the lowest integral number of standard deviations from the species average assembly length are preferred. This ensures that assemblies that are significantly longer or shorter than others for the species are not chosen.
- CheckM Completeness – In order, assemblies with the highest quantized level of completeness (98 to 100) are preferred over assemblies in the 95-98, 90-95, 85-90, 70-85, 50-70, and under 50 percent level of completeness, as determined by CheckM.
- Magnitude of count of pseudo CDSs – assemblies with the lowest rounded natural log of pseudo CDSs are preferred.
- Presence of a plasmid - assemblies containing plasmid sequences are preferred.
- Magnitude of count of scaffolds – assemblies with the lowest rounded log base 10 scaffold count are preferred.
- Species representative – the current representative is preferred.
- Magnitude of deviation from the mean gene count for the species - assemblies with the lowest integral number of standard deviations from the species average count of genes are preferred. This ensures that assemblies that have significantly more or fewer genes than others for the species are not chosen.
- Absolute count of pseudo CDSs - assemblies with fewer pseudo CDSs are preferred.
- Type strain status
- Release date (tie-breaker)
Representative assemblies are updated several times a year to take into account newly added assemblies to RefSeq, changes in the NCBI Taxonomy, modified taxonomic assignments, and recently discovered contamination.
Representative genomes are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record (see Protein data model below). Representatives for species with at least ten assemblies in RefSeq are annotated with a GeneID cross-reference to the NCBI Gene resource. You can browse the list of reference and representative genomes in the NCBI Genome resource. Sequences and annotation for the latest set of prokaryotic reference and representative genomes can be downloaded from Entrez Assembly.
Non-representative genomes
The non-reference and non-representative prokaryotic genomes make the bulk of the RefSeq prokaryotic collection and are taxonomically diverse. They are included in RefSeq to represent the sequence variation observed in isolate- and strain-specific genomes, including genomes of medical importance. Like representative genomes, they are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record (see Protein data model below). However, non-representative genomes are not annotated with GeneID cross-references to the NCBI Gene resource. Sequences and annotation for non-representative genomes can be downloaded from Entrez Assembly.
Protein data model
With the exception of reference genomes, only non-redundant protein accessions (WP_ accession prefix) are annotated on new or re-annotated RefSeq prokaryotic WGS and Complete genomes. A single non-redundant protein may be annotated on many RefSeq genomes, when the CDS annotated on those genomes encodes exactly the same protein that is identical in both sequence and length. For example, the coding sequence for the 50S ribosomal protein L11 that is annotated on NC_017743.1 provides a cross-link, shown below, to the non-redundant RefSeq protein WP_003156430.1. Over 1000 prokaryotic genomes are annotated with a CDS feature that encodes the identical sequence of the same length as shown in the Identical Protein report, which can be accessed by clicking on the "Identical Proteins" link near the top of the protein record.