NCBI Pathogen Detection Project Help Document

Pathogen Detection Help Document

Beta Release

This is a beta release of the Pathogens help documentation. It makes new content available while development continues on the format and presentation of the information. Navigation tips:
  • The topic list icon takes you to the list of topics for the section you are currently reading
  • The back to top icon takes you to the top of the document
We welcome and appreciate feedback about the content, including comments about sections that are helpful as well as those in need of clarification and/or enhancement. Feedback can be sent to pd-help@ncbi.nlm.nih.gov.


What is the NCBI Pathogen Detection project?

Overview

The NCBI Pathogen Detection project is a centralized system that integrates sequence data for bacterial pathogens.

NCBI Pathogen Detection integrates bacterial and fungal pathogen genomic sequences from numerous ongoing surveillance and research efforts whose sources include food, environmental sources such as water or production facilities, and patient samples. Foodborne, hospital-acquired, and other clinically infectious pathogens are included. The system provides two major automated real-time analyses:
  1. It quickly clusters related pathogen genome sequences to identify potential transmission chains, helping public health scientists investigate disease outbreaks
  2. As part of the National Database of Antibiotic Resistant Organisms (NDARO), NCBI screens genomic sequences using AMRFinderPlus to identify the antimicrobial resistance, stress response, and virulence genes found in bacterial genomic sequences, which enables scientists to track the spread of resistance genes and to understand the relationships among antimicrobial resistance, stress response, and virulence.
A number of public health agencies and researchers in the US and internationally are collecting samples from clinical cases, from the environment, from food products, and from industrial production facilities to facilitate active, real-time surveillance of pathogens, including foodborne disease. Public health agencies and researchers sequence the samples and submit the data to NCBI, which analyzes the sequences and compares them to others in its database, including all genomes in GenBank, to identify closely related sequences. The aim is to identify closely or clonally related isolates to aid in outbreak investigation. For example, the FDA, CDC, and USDA use isolates from food and the environment linked to isolates associated with human illness to aid traceback investigations and outbreak response. (Read more in the Pathogen Detection factsheet and in the Contributors section of this document.)

NOTE: NCBI Pathogen Detection does not identify outbreaks or outbreak membership. All analyses are dependent on the public data submitted to the system and the quirks of our analysis pipelines. NCBI provides a service to help identify clonal relationships based on genomic similarity. Determinations of outbreaks are made by public health organizations, including CDC, FDA, and USDA. Although we take care to make the analyses as error free as possible, this is a large-scale automated pipeline that takes data from submitters and analyzes it in real time. We therefore cannot guarantee that the results are free from error or applicable for a particular use.

Where to access the Pathogen Detection Project results

The Pathogen Detection project can be accessed from a variety of entry points, such as:

Where to access Antimicrobial Resistance (AMR) Data

Update Frequency

The various components of the Pathogen Detection project are updated at the following intervals:

References and Contact Information

Separate sections of this document provide additional information, including:

How To:

  • The Pathogen Detection Project home page includes an "Explore the Data" section. It lists four foodborne pathogen groups (Salmonella enterica, E. coli and Shigella, Campylobacter jejuni, and Listeria monocytogenes) with direct links to the Isolates Browser, providing instant access to isolates from those groups.
  • The Organism Groups page also provides links for all available organism groups, along with additional details for each group. Note that the species name under the Organism Groups table reflects the most common species in each group, but does not reflect all species. For example, the Salmonella enterica organism group consists of predominantly Salmonella enterica isolates, but also Salmonella bongori isolates. To see the full list of organisms present in each group, see the scientific_name column in the Isolates Browser.
  • For example, to quickly retrieve new isolates for Salmonella enterica, open the Pathogen Detection Project home page:
    • Scroll down to "Explore the Data" and follow the "New Isolates" link for Salmonella enterica.
    • That will retrieve isolates that have become available in the Pathogen Detection Project. "New" isolates are those that have been added to a Pathogen Detection Group (PDG#) since the last calculation. For frequently updated organism groups like Salmonella, this may mean all isolates added in the last 24 hours. For groups whose last update was months ago, the "new" isolates may be several months old, but they are still the newest isolates added to that Pathogen Detection group.

General text searches

Field-specific searches

  • As an alternative to general text searches, you can conduct more precise searches by limiting your query to specific data fields.
    The general syntax of a field-specific search is:

    Important notes:
    • The names of data fields, and the values they contain, are case sensitive.
    • The exact name of the data fields can be seen by hovering the mouse over the column names, then a popup appears with the search syntax for that field.
    • The data field names and values might also include special characters such as underscores, hyphens, parentheses, and slashes. These should be included in the query string, as the Isolates Browser has been modified relative to the SOLR Standard Query Parser to recognize and properly handle special characters that are part of a search term.
  • For example, you can search the Location data field, as shown below, in order to retrieve isolates that were collected from a given geographic area:

    • Open the Isolates Browser home page. It will display all isolates by default.
    • Enter the following type of search in the text box to display only the subset of isolates that have been identified by the submitter as having been collected in the USA:
      • geo_loc_name:USA
  • For additional examples, such as searches that retrieve isolates with specific genotypes and/or phenotypes, see the Examples of SOLR queries section of this document.
  • For detailed information about searching specific data fields, see the Isolates Browser help > Advanced Search > Data Fields section of this document.
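
The field-specific search described above can also be assembled in a script before pasting it into the search box. The following Python sketch is illustrative only. It assumes the field:value form seen in the geo_loc_name:USA example; the helper name field_query is hypothetical, and quoting values that contain spaces is an assumption of this sketch rather than documented Isolates Browser behavior.

```python
def field_query(field: str, value: str) -> str:
    """Build a field-specific query string of the form field:value.

    Field names and values are case sensitive, so both are passed
    through unchanged. Quoting values that contain whitespace is an
    assumption of this sketch, not documented browser behavior.
    """
    if not field or not value:
        raise ValueError("both field and value are required")
    if any(ch.isspace() for ch in value):
        # Wrap multi-word values in double quotes, escaping embedded quotes.
        value = '"' + value.replace('"', '\\"') + '"'
    return f"{field}:{value}"


# The example from the text: isolates reported as collected in the USA.
print(field_query("geo_loc_name", "USA"))
```

Because names and values are case sensitive, the helper deliberately avoids any lowercasing or other normalization of its arguments.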

Using Filters to focus the search results

  • You can use "Filters" in order to focus on a specific subset of isolates.
  • For example, open the Isolates Browser home page. It will display all isolates by default.
  • To filter the isolates by criteria such as isolation source:
  • Click on the "Filters" menu in order to filter the data displayed by the browser.
  • Scroll down to the "Isolation source" text box to filter the data by source of isolation.
  • Now the "Isolation source" filter box pops up. By default the top 100 unique values are shown, which can be viewed using the scrollbar. The number of items for each value are also shown. This box has a search bar to search for any values not displayed. Values can be selected and will update the number of items displayed in the table below. If two or more filters are open, then the selections in one filter will update the available values and unique items in the other filter. The filters that you see are generated on the fly to reflect the attributes of the isolates that you are currently viewing in the browser.
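
The way filter values and counts follow the current selection can be pictured with a small Python sketch. This is not how the Isolates Browser is implemented; it only illustrates the idea of counting unique values over the rows currently in view, with a default cap of 100 values. The function names, column names, and rows are all hypothetical.

```python
from collections import Counter


def facet_counts(rows, field, limit=100):
    """Return up to `limit` (value, count) pairs for one column, most common first."""
    counts = Counter(row[field] for row in rows if row.get(field))
    return counts.most_common(limit)


def apply_filter(rows, field, selected):
    """Keep only the rows whose value in `field` is one of the selected values."""
    selected = set(selected)
    return [row for row in rows if row.get(field) in selected]


# Hypothetical rows; the column names and values are illustrative only.
rows = [
    {"isolation_source": "flour", "geo_loc_name": "USA"},
    {"isolation_source": "flour", "geo_loc_name": "Canada"},
    {"isolation_source": "stool", "geo_loc_name": "USA"},
]

# Counts for one filter over all rows.
print(facet_counts(rows, "isolation_source"))

# Selecting a value in one filter changes the counts offered by another.
flour_only = apply_filter(rows, "isolation_source", ["flour"])
print(facet_counts(flour_only, "geo_loc_name"))
```

Recomputing the counts from the filtered rows is what makes the values in a second filter shrink after a selection is made in the first.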

Identify the possible source of an outbreak

To identify the possible source of an outbreak, you can use either one of the following methods:
  • Analyze data that's already available in the Pathogen Detection project by using the SNP Tree Viewer to view the phylogenetic relationships among a group of sequence-similar isolates from clinical or environmental sources.
    For example, the FDA's GenomeTrakr project (BioProject PRJNA230969) for the surveillance and rapid detection of foodborne contamination events includes a subset of E. coli isolates that belong to the SNP cluster "PDS000003441." Many of the isolates in that cluster were from an outbreak that originated in all-purpose flour. (Read more on the CDC website about that outbreak.)

    In the Isolates Browser display, you can click on the "PDS*" accession number that appears in the "SNP Cluster" column for any one of those isolates to open a Tree Viewer display for the SNP cluster and interactively examine the phylogenetic distance tree. A SNP cluster contains isolate genomes that have been found, via the Pathogens data processing pipeline, to be closely related.

    The Tree View for SNP cluster PDS000003441 shows a number of clinical and environmental samples that are very closely related, in some cases, with a distance of zero SNPs between the clinical and environmental samples. (Mouse over any branch in the tree to view the SNP distance between the isolates.) The phylogenetic distance tree therefore sheds light on the possible source of the outbreak.

    The sequence data analysis and SNP Tree Viewer help sections of this document provide additional details about SNP clusters and using the SNP Tree Viewer, respectively. The SNP Tree Viewer help includes an illustrated example of SNP Tree Viewer launch points and illustrated example of a SNP Tree Viewer display.
    - or -

  • Submit sequence reads to NCBI and obtain data analysis results on the Pathogen Detection project FTP site, in the form of phylogenetic distance trees that show the relationship of your isolates to those already in the Pathogen Detection project.
See the section of this document on sequence data analysis for more information.
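
The reasoning behind the first method (looking for clinical and environmental isolates separated by very few SNPs) can be sketched in Python. This is a simplified illustration, not the Pathogens pipeline or the SNP Tree Viewer: the isolate IDs, distances, and function name are made up, and a close distance only suggests a possible link that public health organizations would need to investigate.

```python
def close_cross_type_pairs(distances, isolate_type, max_snps=0):
    """List (clinical, environmental, distance) triples within max_snps.

    distances: dict mapping a frozenset of two isolate IDs to their SNP distance.
    isolate_type: dict mapping isolate ID to "clinical" or "environmental".
    """
    pairs = []
    for pair, snps in distances.items():
        a, b = sorted(pair)
        types = {isolate_type[a], isolate_type[b]}
        # Keep only pairs that link a clinical isolate to an environmental one.
        if types == {"clinical", "environmental"} and snps <= max_snps:
            clinical = a if isolate_type[a] == "clinical" else b
            environmental = b if clinical == a else a
            pairs.append((clinical, environmental, snps))
    # Closest pairs first, then by ID for a stable order.
    return sorted(pairs, key=lambda t: (t[2], t[0], t[1]))


# Made-up isolate IDs and distances, for illustration only.
isolate_type = {"iso_A": "clinical", "iso_B": "environmental", "iso_C": "clinical"}
distances = {
    frozenset(["iso_A", "iso_B"]): 0,
    frozenset(["iso_A", "iso_C"]): 2,
    frozenset(["iso_B", "iso_C"]): 5,
}

print(close_cross_type_pairs(distances, isolate_type, max_snps=2))
```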

More examples...



Pathogens Project Components

Resources/Tools

Isolates Browser | SNP Tree Viewer | Automatic e-mail notifications of new data | Antimicrobial Resistance (AMR) resources | FTP site | Data submission tools

Isolates Browser

The NCBI Pathogens Isolates Browser is a web-based portal that integrates pathogen genomic sequences, metadata, antibiotic susceptibility and resistance gene information, and the SNP cluster information.

The Isolates Browser was built to answer two specific questions for incoming pathogen genomes:
1) is this isolate clonally related to anything else in the database?
2) what is the AMR repertoire of this isolate?

It allows users to browse and search over 300,000 pathogen isolates, effectively and efficiently providing access to the National Database of Antibiotic Resistant Organisms.

Upon opening the Isolates Browser, a table displays data for all available isolates, with the most recently added data at the top. You can query the Isolates Browser with a wide variety of allowable search terms. The data can be sorted by clicking on column headers, filtered by using the "Filters" interface (e.g., Property: has antimicrobial resistance (AMR) genotypes), or searched using basic or advanced queries.

Every row in the Isolates Browser is an assembled isolate, possibly with antimicrobial resistance (AMR), virulence, and/or stress response genotype data, and antibiotic susceptibility (AST) phenotype data, as available.

If an isolate has a "PDS*" accession number in the "SNP Cluster" column, that indicates it is part of a SNP cluster. You can click on the PDS* accession to launch the SNP Tree Viewer and examine the relationships among your isolate of interest and other isolates that have been found, via the Pathogens data processing pipeline, to be closely related.

A separate section of this file provides Isolates Browser help documentation, with details on how the browser can be used, including allowable input, a description of the output, and an illustrated example of search results.

SNP Tree Viewer

The SNP Tree Viewer is a component of the Pathogens Isolates Browser. Any isolate that belongs to a SNP cluster (group of isolates whose genome assemblies are clustered) has a link to the SNP Tree Viewer.

The SNP Tree Viewer displays a phylogenetic tree of pathogen isolates, built from assembled genomes by the maximum compatibility method. It shows relationships among the isolates based on the number of single nucleotide polymorphisms (SNPs) they contain relative to each other. Each tree represents a cluster of isolates that have been found, via the Pathogens data processing pipeline, to be closely related.

The trees can be used to examine the relationships of isolates in a SNP cluster to each other, and to identify the possible source of an outbreak based on the sequence similarity of the clinical and environmental isolates in a tree. (See an example in How to identify the possible source of an outbreak.)

A separate section of this file provides SNP Tree Viewer help documentation, with details on how the tree viewer can be used. It includes an illustrated example of SNP Tree Viewer launch points and illustrated example of a SNP Tree Viewer display.

Automatic e-mail notifications of new data

You can perform a search in the Pathogens Isolates Browser, or select an isolate of interest in the SNP Tree Viewer, and then automatically receive e-mail notifications each time new isolates become available that match your search criteria or are closely related to your isolate of interest.

There are two ways to receive automatic e-mail notifications of new data, and you must be logged into your free My NCBI account to use either one:

"Save" a search in the Isolates Browser
  • A "Save" button in the Isolates Browser interface allows you to save one or more searches, and automatically notifies you about new isolates that match the criteria of each saved search. (Read more and view an illustrated example.)
"Watch" an isolate of interest in the SNP Tree Viewer
  • A "Watch" button in the SNP Tree Viewer interface allows you to watch one or more selected isolates in a tree, and automatically notifies you about new isolates that fall within the SNP distance that you have specified from the watched isolate(s). (Read more and view an illustrated example.)
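
The "Watch" behavior can be pictured as a repeated check for isolates that are within the chosen SNP distance and have not been reported before. The Python sketch below is a conceptual illustration only, not NCBI's notification logic; the function name, isolate IDs, and distances are hypothetical.

```python
def new_matches(current_distances, already_seen, max_snps):
    """Return isolates within max_snps of the watched isolate that were not seen before.

    current_distances: dict of isolate ID -> SNP distance from the watched isolate.
    already_seen: set of isolate IDs reported in earlier notifications.
    """
    return sorted(
        isolate
        for isolate, snps in current_distances.items()
        if snps <= max_snps and isolate not in already_seen
    )


# Made-up data: iso_1 was reported earlier, iso_2 and iso_3 are new.
seen = {"iso_1"}
today = {"iso_1": 3, "iso_2": 7, "iso_3": 25}

to_report = new_matches(today, seen, max_snps=10)
seen.update(to_report)  # remember what was reported for the next check
print(to_report)
```

A saved search works on the same principle, with "matches the search criteria" in place of "falls within the SNP distance".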

Antimicrobial Resistance (AMR) resources

AMR Landing page | AMR Resources page | Pathogen Detection Reference Gene Catalog | AMRFinderPlus | MicroBIGG-E | Submit sequence and phenotype data related to AMR | FTP/Raw Data Download | Schematic illustration of AMR resources

As antimicrobial resistance (AMR) continues to evolve in many bacterial pathogens, the NCBI Pathogen Detection Project has developed a database to collect curated information about AMR genes, as well as tools to access the data. The AMR resources include:

FTP site

The Pathogens FTP site provides files that contain the results of analyses done by NCBI on the isolates data that have been submitted to the Pathogen Detection project. The files include genome assemblies that were built from sequence reads, phylogenetic distance trees for isolates placed in clusters using the methods described in the data processing section, and antimicrobial resistance (AMR) data.

A separate section of this document provides an overview of the data available on the FTP site, and the FTP readme file provides additional details.

Data submission tools

NCBI provides a number of tools for submitting data to the Pathogen Detection project, and the specific tool(s) you use depends on the types of data you are submitting.

A separate section of this document provides an overview of the data submission process, and links to detailed submission instructions.

Types of Data

The Pathogen Detection Resource integrates primary records from other NCBI databases so that you can search by their accessions and properties in the Isolates Browser. Many of the data fields that are in the Isolates Browser and other Pathogen resources are derived from these primary data sources. Other data fields are derived during processing of the primary data through the data processing pipeline. The "examples" below retrieve samples of the data types from their original source databases. The "search tips" under genotypes and phenotypes show how to retrieve those data types through the Isolates Browser. A separate section of this document provides details on how to use the Isolates Browser, including searches against specific data fields.

BioProject records | BioSample records | Raw data: Sequence reads | Genomes | Genotypes: antimicrobial resistance (AMR) genes | Phenotypes: antimicrobial susceptibility test (AST) data (antibiograms)

BioProject records

  • A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. Because the sequence data archives (GenBank and SRA) require submission to a BioProject for assembled genomes, every isolate in the Isolates Browser comes from one of these BioProjects. There may be many isolates from any particular BioProject.
    • Example: Retrieve the BioProject PRJNA230969, which describes the GenomeTrakr project by the US Food and Drug Administration (FDA) to sequence Escherichia coli (E. coli) genomes for the surveillance and rapid detection of foodborne contamination events.
    • Submit: See the data submissions section of this document for instructions on submitting BioProjects.

BioSample records

  • BioSample records describe the biological source materials used in experimental assays. For many pathogen samples, a template/package is used that has a minimal set of required fields that was developed specifically for this project: (clinical package, environmental package).
    • Example: Retrieve an individual BioSample record, SAMN05245394, for Escherichia coli isolated from all-purpose flour and sequenced as part of the FDA's GenomeTrakr project (BioProject PRJNA230969) for the surveillance and rapid detection of foodborne contamination events.
    • Example: Retrieve all biosamples that are part of the FDA's GenomeTrakr project (BioProject PRJNA230969) for the surveillance and rapid detection of foodborne contamination events.
    • Submit: See the data submissions section of this document for instructions on submitting BioSamples.

Raw data: Sequence reads

  • The Sequence Read Archive (SRA) stores raw sequencing data and alignment information from high-throughput sequencing platforms. Most of the major pathogen surveillance efforts use next generation sequencing platforms, with raw sequence data deposited in SRA. The majority of isolates in the Isolates Browser have been assembled using the Pathogen Detection data processing pipeline from the raw data in SRA.
    • Submit: See the data submissions section of this document for instructions on submitting sequence reads.

Genomes

  • Pathogen genomes come from two sources: 1) assemblies submitted to the GenBank nucleotide sequence database by outside contributors, and 2) genomes assembled in the Pathogen Detection data processing pipeline from the raw sequencing data in SRA. NCBI is currently working on depositing the pipeline assemblies into GenBank; however, the vast majority are not yet available there.
    • Submit: See the data submissions section of this document for instructions on submitting assembled genomes.
    • Note: Each Pathogen Detection Target ("PDT" record) in the Pathogen Detection Project contains the genome assembly for a single pathogen isolate.
    • There are several types of genome assemblies in the Project:
      1. isolates submitted directly to GenBank as assembled genomes, and therefore have a corresponding "GCA" accession
      2. isolate genomes assembled by the NCBI Pathogens data processing pipeline from sequence reads, but not published as genome sequence records in GenBank
      3. isolate genomes assembled by the NCBI data processing pipeline and then submitted to GenBank either by the submitter or on behalf of the submitter with their permission.

Genotypes

Phenotypes

  • Antimicrobial Susceptibility Test (AST) data, also referred to as AST phenotypes or antibiograms, are included by submitters as data in BioSample records, when available. BioSample records with AST data can be retrieved from the BioSample database directly. For BioSample records that have submitted sequencing data and are also incorporated into the Pathogen resources, the Isolates Browser displays the antibiotic compounds from each antibiogram in a separate column, AST_phenotypes, binned into the SIR (sensitive, intermediate, resistant) calls made by the submitter. You can submit AST data for your samples. See How to submit for information on how to submit those data.

  • A list of possible phenotype values is shown on the BioSample Beta-Lactamase Antibiograms page, under the "Resistance Phenotype" tab, and includes:

    • intermediate (I)
    • nonsusceptible (NS)
    • not defined (N, ND)
    • resistant (R)
    • susceptible (S, sensitive)
    • susceptible-dose dependent (SDD)
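
When working with downloaded antibiogram data in a script, the abbreviations and synonyms listed above can be normalized to one value per phenotype before binning. The Python sketch below is illustrative only: the mapping is built from the list above, the function names and the sample antibiogram are hypothetical, and both "SDD" and "SSD" are accepted as spellings of the dose-dependent abbreviation.

```python
from collections import defaultdict

# Abbreviations and synonyms from the list above, mapped to the full value.
PHENOTYPE_NAMES = {
    "i": "intermediate",
    "ns": "nonsusceptible",
    "n": "not defined",
    "nd": "not defined",
    "r": "resistant",
    "s": "susceptible",
    "sensitive": "susceptible",
    "sdd": "susceptible-dose dependent",
    "ssd": "susceptible-dose dependent",
}
FULL_NAMES = set(PHENOTYPE_NAMES.values())


def normalize_phenotype(value):
    """Map an abbreviation, synonym, or full name to the full phenotype value."""
    key = value.strip().lower()
    if key in FULL_NAMES:
        return key
    return PHENOTYPE_NAMES[key]  # raises KeyError for unknown values


def bin_antibiogram(results):
    """Group antibiotic names by normalized phenotype.

    results: iterable of (antibiotic, phenotype) pairs.
    """
    bins = defaultdict(list)
    for antibiotic, phenotype in results:
        bins[normalize_phenotype(phenotype)].append(antibiotic)
    return {phenotype: sorted(names) for phenotype, names in bins.items()}


# Hypothetical antibiogram, for illustration only.
antibiogram = [("ampicillin", "R"), ("ciprofloxacin", "S"), ("tetracycline", "resistant")]
print(bin_antibiogram(antibiogram))
```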

Contributors

List of contributors by organism | Additional contributors

List of contributors by organism

Additional contributors

  • The Pathogen Detection project continues to grow and welcomes data submissions from additional contributors. The data submissions section of this document provides an overview of the submissions process and links to pages that provide detailed instructions.

Data Retrieval & Analysis

Text term searches

input text terms | example | output list of isolates | more information

Input text term(s)

  • If you want to retrieve isolates from the existing data in the Pathogen Detection project, you can use the Isolates Browser to search for isolates that contain a term(s) of interest, as shown in the example below.

Example of text term search:

Output tabular list of isolates that contain your search term(s)

More information about text term searches: tips and techniques

Sequence data analysis

real time analysis | input sequence data | output phylogenetic distance trees | example | more information

Real time analysis

  • Unlike other NCBI systems such as BLAST, the Pathogen Detection project is not built with an interactive interface that allows users to upload their data and immediately obtain an answer. Instead, this project was set up to facilitate interactive analyses for large-scale surveillance projects that automatically submit real-time data to the NCBI archives. The data are then routed to an automated pipeline that generates interactive web reports on a daily basis. The web displays allow users to search, browse, and filter the automatically analyzed data that have already been submitted.

Input sequence data

  • If you have sequenced new isolates and want to determine their relationship to existing isolates in the Pathogen Detection project, then you can follow the data submission procedures described in a separate section of this document. Your submission(s) will go through the NCBI data processing pipeline, which includes sequence analysis to identify closely related isolates. The results of the analysis on your data are then made available on the FTP site and in the SNP Tree Viewer, as described in the example below.
  • All of the existing isolates in the Pathogen Detection project have also undergone sequence analysis after they were submitted, and their results are also available on the FTP site and in the SNP Tree Viewer.

Output phylogenetic distance trees

NCBI has developed a data processing pipeline that analyzes pathogens sequence data from GenBank or the Sequence Read Archive (SRA). Individual phylogenetic trees for each SNP cluster are available on FTP as well as the NCBI Pathogen Detection Isolates Browser, as noted below:

Example of sequence data analysis results (as interactive displays in SNP Tree Viewer)

  • The FDA's GenomeTrakr project (BioProject PRJNA230969) for the surveillance and rapid detection of foodborne contamination events includes a subset of E. coli isolates that belong to the SNP cluster "PDS000003441." Upon submission to NCBI, those isolates were compared to all other isolates in the Pathogen Detection project and were found, via the Pathogens data processing pipeline, to be closely related to other isolate genome sequences in that SNP cluster. In the Isolates Browser display, you can click on the "PDS*" accession number that appears in the "SNP Cluster" column for any one of those isolates (e.g., isolate PDT000133982.1) to open a SNP Tree Viewer display for the SNP cluster and interactively examine the phylogenetic distance tree. The Tree View shows a number of clinical and environmental samples that are very closely related, and therefore sheds light on the possible source of the outbreak. (Read more on the CDC website about that outbreak.)

More information about pathogen sequence data analysis

Automatic E-mail Notifications of New Data

You can perform a search in the Pathogens Isolates Browser, or select an isolate of interest in the SNP Tree Viewer, and then automatically receive e-mail notifications each time new isolates become available that match your search criteria or are closely related to your isolate of interest.

There are two ways to receive automatic e-mail notifications of new data, and you must be logged into your free My NCBI account to use either one:

"Save" a search in the Isolates Browser

  • A "Save" button in the Isolates Browser interface allows you to save one or more searches, and automatically notifies you about new isolates that match the criteria of each saved search. (Read more and view an illustrated example.)

"Watch" an isolate of interest in the SNP Tree Viewer

  • A "Watch" button in the SNP Tree Viewer interface allows you to watch one or more selected isolates in a tree, and automatically notifies you about new isolates that fall within the SNP distance that you have specified from the watched isolate(s). (Read more and view an illustrated example.)


Isolates Browser help

What is the Isolates Browser?

The NCBI Pathogen Detection Isolates Browser is a web-based portal that provides analysis results for the two fundamental components of the Pathogen Detection Project: 1) pathogen isolate similarity and 2) antimicrobial resistance. The results are presented in tabular format with the full unfiltered results set presented as default, as opposed to summary documents appearing only after a search is completed as in other NCBI resources. You can query the Browser by entering various text strings, as described under "Allowable search terms."

Upon opening the Isolates Browser, a table displays data for all available isolates, with the most recently added data at the top. The data can be faceted by using filters (e.g., Property: has antimicrobial resistance (AMR) genotypes), queried with a wide variety of allowable search terms, using either basic or advanced search methods, and sorted by clicking on column headers.

Every row in the Isolates Browser is an assembled isolate, possibly with antimicrobial resistance (AMR), virulence, and/or stress response genotype data, and antibiotic susceptibility (AST) phenotype data, as available.

The table summarizes and links to the data available for each pathogen, such as strain name, geographic origin, isolation type (environmental or clinical), BioSample UID, organism group (PDG* accession), antimicrobial resistance (AMR)/virulence/stress response genotypes, and antibiotic susceptibility (AST) phenotypes, and more (see list of data fields available in the Isolates Browser).

If an isolate has a "PDS*" accession number in the "SNP Cluster" column, that indicates it is part of a SNP cluster, and you can click on the PDS* accession to launch the SNP Tree Viewer and examine the relationships among your isolate of interest and other similar isolates.

The information below provides details on how the Isolates Browser can be used, including allowable input, a description of the output, and an illustrated example of search results. The browser accepts basic queries that contain one or more text terms, with or without quotes. It also accepts advanced queries using the SOLR query language, such as complex Boolean queries that look for the search terms in specific data fields. Filters on the results page enable you to further narrow your retrieval, if desired, and links from the Isolates Browser to the SNP Tree Viewer enable you to interactively explore the relationship of an isolate of interest to other isolates in the SNP cluster, which were found, via the Pathogens data processing pipeline, to have closely related genome sequences.

Input for Isolates Browser

Allowable search terms | Free text vs. controlled vocabulary | Unique identifiers | NCBI accession prefixes
Basic search | Query tips | multiple terms | special characters | phrase searches | advanced searches | case sensitive vs. case insensitive searches
Filters to refine search | filters menu options | filters are generated on the fly | look for synonyms within a filter
Advanced search | SOLR query language | Query terms | Operators | Parentheses | Data fields | Examples of SOLR queries

Allowable search terms Isolates Browser, topic listback to top

Free text vs. controlled vocabulary Isolates Browser, topic listback to top

  • Free Text - Many data fields in the Isolates Browser are free text and therefore contain the exact terms that were supplied by the data submitters.
    • Please note that data submitters might use different forms of a term in their submissions.
    • For example, some submitters might use hyphens between terms (e.g., "all-purpose flour") while others might use spaces (e.g., "all purpose flour").
    • For a comprehensive search, include synonyms in your query, or use wildcards to search for a word stem.
    • For a more precise search, you can limit your query to a specific data field, such as Isolation source. Please note that, in field-specified queries, both the data field names and values are case sensitive.
    • Separate sections of this document provide query tips about searching for synonyms, and describe how the Isolates Browser handles special characters in search terms (such as hyphens in strain names, parentheses in gene names, slashes in serovar names, etc.).

  • Controlled Vocabulary - Some data fields in the Isolates Browser contain a controlled vocabulary. In these fields, it is not necessary to search for synonyms.

    • For example, the Location data field, which lists the geographic origin of the sample from which a pathogen was isolated, contains two parts: Country and Region. Country is a controlled vocabulary (https://www.ncbi.nlm.nih.gov/genbank/collab/country). Region is not controlled and can be anything (i.e., free text, such as a state abbreviation, province name, city name, zip code, etc.).

Unique identifiers and NCBI accession prefixes Isolates Browser, topic listback to top

  • NCBI unique identifiers, such as NCBI accessions (e.g., BioSample ID SAMN05245394, BioProject ID PRJNA230969, etc.), can be used to retrieve pathogen isolates.

    Note that although each NCBI accession is unique, the same accession can appear in multiple current Pathogen records. For example, two or more isolates can belong to the same BioProject and/or same SNP cluster, so the record for each isolate will have its own PDT accession, but all of those records will contain the same PRJ and/or PDS accession.
  • Some NCBI accessions that can be searched in the Pathogen Detection Project have the following prefixes:
    GCA | GCF | NG | PDG | PDS | PDT | PRJ | SAMN | SRR | SRS | WP
    • GCA_ - Accession number prefix for a GenBank genome assembly. This is data submitted by the scientific community directly to GenBank as an assembled genome.
      (Read more about genomes in the data types section of this document.)
    • GCF_ - Accession number prefix for a RefSeq genome assembly. This is a representative genome assembly for a given organism in RefSeq, a non-redundant database.
      (Read more about Prokaryotic RefSeq Genomes.)
      (Read more about NCBI Genome Assembly Models.)
    • NG_ - Accession number prefix for a RefSeq genomic sequence record.
      (Read more about NG_* accessions.)
    • PDG - Accession number prefix for a Pathogen Detection Organism Group.
      Technical note: An organism group (PDG) contains one or more targets (PDTs). A PDT is a member of zero or one SNP cluster (PDS), and never more than one cluster. A SNP cluster is composed of two or more PDTs, and each PDS is completely contained within a PDG.
      (Read more about organism groups in the data fields section of this document.)
    • PDS - Accession number prefix for a Pathogen Detection SNP Cluster.
      (Read more about SNP clusters in the data fields section of this document.)
    • PDT - Accession number prefix for a Pathogen Detection Target. This is the Pathogen project accession for an individual isolate's genome assembly.
      (Read more about genome assemblies in the data types section of this document.)
    • PRJ - Accession number prefix for an International Nucleotide Sequence Database Collaboration (INSDC) BioProject.
      (Read more about bioprojects in the data types section of this document.)
    • SAMN - Accession number prefix for an NCBI BioSample.
      (EBI BioSamples have the prefix SAMEA, and DDBJ BioSamples have the prefix SAMD.)
      (Read more about biosamples in the data types section of this document.)
    • SRR - Accession number prefix for an NCBI Sequence Read Archive (SRA) Run. A Run is an object that contains actual sequencing data for a particular sequencing experiment. SRA experiments may contain many Runs depending on the number of sequencing instrument runs that were needed.
      (Read more about SRA accessions.)
    • SRS - Accession number prefix for an NCBI Sequence Read Archive (SRA) Experiment Sample. A Sample is an object that contains the metadata describing the physical sample upon which a sequencing experiment was performed. That information is imported from the BioSample record.
      (Read more about SRA accessions.)
    • WP_ - Accession number prefix for a RefSeq protein sequence that has been found in one or more archaeal and bacterial RefSeq genomes. If the identical protein sequence has been found in multiple genomes, the WP_ sequence record is a non-redundant representation of all the instances of the protein, and includes links to the genomic sequences that code for the protein.
      Details about WP_* accessions are provided on the web pages that describe the RefSeq non-redundant proteins, the Prokaryotic RefSeq Genome Re-annotation Project, and the New RefSeq protein product and data model.
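
The prefix list above can be turned into a small lookup, for example to decide what kind of record a pasted accession refers to before searching for it. The following is a minimal sketch in Python; the prefix descriptions come from the list above, while the function name and the lookup logic are illustrative assumptions and not part of any NCBI tool.

```python
# Prefix descriptions are taken from the list above; the lookup logic itself
# is an illustrative assumption, not part of any NCBI tool.
ACCESSION_PREFIXES = {
    "GCA_": "GenBank genome assembly",
    "GCF_": "RefSeq genome assembly",
    "NG_": "RefSeq genomic sequence record",
    "PDG": "Pathogen Detection Organism Group",
    "PDS": "Pathogen Detection SNP Cluster",
    "PDT": "Pathogen Detection Target",
    "PRJ": "INSDC BioProject",
    "SAMN": "NCBI BioSample",
    "SRR": "SRA Run",
    "SRS": "SRA Experiment Sample",
    "WP_": "RefSeq non-redundant protein",
}


def describe_accession(accession):
    """Return the description for the accession's prefix, or None if unknown."""
    text = accession.strip()
    # Check longer prefixes first so a short prefix never shadows a longer one.
    for prefix in sorted(ACCESSION_PREFIXES, key=len, reverse=True):
        if text.startswith(prefix):
            return ACCESSION_PREFIXES[prefix]
    return None
```

With the example identifiers mentioned earlier in this section, describe_accession("SAMN05245394") returns "NCBI BioSample" and describe_accession("PRJNA230969") returns "INSDC BioProject". Prefixes that are not in the list above (such as SAMEA or SAMD) return None in this sketch.
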
A basic search simply consists of one or more search terms, and does not include any Boolean operators, parentheses, or other criteria such as search field (data field) specifiers. Below is information about:

Query tips | multiple terms | special characters | phrase searches | advanced searches | case sensitive vs. case insensitive searches
Filters to refine search | filters menu options | filters are generated on the fly | Filters for gene fields | Filter for Scientific name | look for synonyms within a filter

Filters to refine search Isolates Browser, topic listback to top

Filters menu options | Filters are generated on the fly | Look for synonyms within a filter
  • The "Filters" menu options Isolates Browser, topic listback to top

    The "Filters" menu options in the Isolates Browser enable you to facet or subset the data in a variety of ways, and therefore can be used to refine your results, whether you have done a basic search or an advanced search.

    The filter menu now allows all data fields in the column chooser to be filtered. By default, each filter displays the top 100 terms (based on the number of isolates retrieved by a term). Note that:
    • A Boolean "OR" is applied if multiple items are checked in the same filter field. This way you can choose multiple values in the same filter. For example:
      • Open the "Filters" tab of the Isolates Browser, scroll to the "Isolation source" field, and check the boxes for "stool" and "feces". The system will retrieve isolates that have either one of those values in the "Isolation source" field.
    • A Boolean "AND" is applied if you select items in several different filter fields (Location, Source, etc). For example:
      • Open the "Filters" tab of the Isolates Browser, then check the boxes for "clinical" in the "Isolation type" filter and "wound" in the "Isolation source" filter. The system will retrieve isolates that have both of your specified criteria.
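    • If you prefer to apply a Boolean "AND" to multiple items within the same filter field, you can enter a SOLR query. For example, the query AMR_genotypes:mcr* AND AMR_genotypes:blaKPC* requires both values to be present in the AMR genotypes field.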
    As mentioned under "Filters are generated on the fly," the terms that are listed under each filter will depend on the data set you are currently displaying in the browser and on the number and count of values in the filters if multiple filters have selections.

  • Filters are generated on the fly for a given data set Isolates Browser, topic listback to top

    The choices listed in the "Filters" tab reflect the attributes of the isolates that you are currently viewing in the browser. By default, the top 100 terms (based on the number of isolates retrieved by a term, and listed by count of isolates per value) are displayed. The total number of unique values is also shown at the bottom of each filter tab.


  • Gene fields: AMR genotypes, Stress genotypes, Virulence genotypes, AMR genotypes core Isolates Browser, topic listback to top

    The gene and point-mutation fields have filters that separate the genes into categories based on characteristics that help to determine how likely the gene/point-mutation is to be properly transcribed and assembled. They are divided into COMPLETE, POINT, PARTIAL, HMM, MISTRANSLATION, and PARTIAL_END_OF_CONTIG. More information on what the categories mean is available below and on the AMRFinderPlus wiki. Each of the categories can be expanded by clicking on the '+' sign next to it, and within each category the gene symbols may be selected to further refine your search. As with the other filter fields, only the 100 most frequent gene symbols will appear in the filter box. To search for specific genes, you can use the search function within the filter.

  • Scientific name Isolates Browser, topic listback to top

    The Scientific name field is set up with a hierarchy that represents lineages based on NCBI Taxonomy to allow you to filter for all members of a given taxonomic group. Clicking on a node at a higher level will select all the taxa within that group even though the boxes by those names won't show up as selected. For example, selecting Enterobacteriaceae will include all isolates that are Salmonella, E. coli, Shigella, and Klebsiella even though those more specific taxa are not selected individually. As with other filters, only the 100 most common values are shown. In this case, that means the 100 most common values in the Scientific name field and the higher level taxa that they belong to. You can search within the values using the Search box within the filter to narrow the choices and reveal scientific names that are not in the most frequent set.

  • Look for synonyms Isolates Browser, topic listback to top

    A number of data fields do not use a controlled vocabulary, but instead list the various terms that submitters applied to their data. As a result, submitters might use different terms for the same concept. Therefore, if you are using filters, look for synonymous terms that are listed under a given filter and check the boxes for any/all terms that are of interest. If you are searching the data fields directly (as described in the advanced search section of this document), consider including synonyms in your query in order to broaden retrieval.
    Synonyms are also useful to include if you are doing advanced searches, such as limiting your search to specific data fields. As an example, see the sample searches of the host organism data field.
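
The Boolean behavior of the Filters menu described above (OR among values checked within one filter field, AND across different filter fields) can be sketched as a query builder. This is a minimal Python illustration, not the browser's actual implementation: the function name is hypothetical, the data field names (epi_type, isolation_source) are taken from the data fields list in this document, and the sketch assumes that values contain no double quotes.

```python
def build_filter_query(selections):
    """Combine filter selections into one SOLR-style query string.

    `selections` maps a data field name to the list of values checked in
    that filter. Values within one field are OR'ed; fields are AND'ed.
    """
    clauses = []
    for field, values in selections.items():
        # Quote every value so that values containing spaces are searched as phrases.
        terms = ['{}:"{}"'.format(field, value) for value in values]
        if not terms:
            continue  # nothing checked in this filter
        clause = " OR ".join(terms)
        if len(terms) > 1:
            # Parentheses keep the OR group together when clauses are AND'ed.
            clause = "(" + clause + ")"
        clauses.append(clause)
    return " AND ".join(clauses)
```

For example, build_filter_query({"epi_type": ["clinical"], "isolation_source": ["stool", "feces"]}) produces epi_type:"clinical" AND (isolation_source:"stool" OR isolation_source:"feces"). Listing synonyms such as "stool" and "feces" under the same field is how the sketch mirrors the advice above about checking all synonymous terms in a filter.
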

SOLR Query Language

The Isolates Browser uses a modified SOLR search platform (version 6.6) to retrieve pathogen data. The Apache SOLR Reference Guide provides detailed documentation for the platform. Some key concepts are introduced below, with links to the complete documentation in the SOLR Reference Guide 6.6, particularly the sections on: Searching > Query Syntax and Parsing > The Standard Query Parser.

In some instances, there might be some slight differences between the Isolates Browser and the SOLR Standard Query Parser. For example, the Isolates Browser has been modified relative to the SOLR Standard Query Parser in the way it handles special characters that are part of a search term. Specifically, the browser has been programmed to automatically escape special characters (such as hyphens in strain names, parentheses in gene names, slashes in serovar names, etc.) and to treat them as part of the search term. The Browser therefore retrieves isolates that include the term exactly as it was entered, including special characters.

Query terms | single term | multiple terms | phrase | term modifiers | wildcard searches | special characters in search terms
Operators | AND, OR, NOT | plus (+) and minus (-) symbols | range searches [nnnn TO nnnn]
Parentheses | use to order Boolean queries | use to input a list of query terms | automatically escaped if part of a gene name or query term
Data Fields
Examples of SOLR queries

SOLR Query terms Isolates Browser Advanced Search, topic listback to top

single term | multiple terms | phrase | term modifiers | wildcard searches | special characters in search terms
  • Single term Isolates Browser Advanced Search, topic listback to top

    A single query term, such as lettuce, will retrieve all isolates that have the term in any data field.

    Examples:

    A search for:
    lettuce
    will show isolates that contain the term lettuce in any data field.

    Tips:

    If you search a specific data field, your search will become case sensitive.

    For example, compare the search results for:
    isolation_source:lettuce
    versus:
    isolation_source:Lettuce
    For broader retrieval, you can either remove the data field specifier to conduct a case insensitive search, or include synonyms in your query, for example:
    isolation_source:lettuce OR isolation_source:Lettuce
    A separate section of this document provides additional information about searching for synonyms.
  • Multiple terms Isolates Browser Advanced Search, topic listback to top

    If you include two or more terms in your query, the system will automatically insert a Boolean OR in each space that it encounters. As a result, it will search for each word individually, and the system will show isolates that contain at least one of your search terms in any data field.

    Examples:

    A search for the following query (with no quotes or special characters):
    romaine lettuce
    will be interpreted as:
    romaine OR lettuce

    A search for the following query (with no quotes or special characters):
    all purpose flour
    will be interpreted as:
    all OR purpose OR flour
    because the system will insert an OR when it encounters each space in the query string.

    A search for the following query (with no quotes and with a hyphen in all-purpose):
    all-purpose flour
    will be interpreted as:
    all-purpose OR flour
    because the system will treat the special character (hyphen) as part of the first query term, and it will insert an OR where it encounters a space in the query string.

    Tips:

    If you include a data field specifier and you do not enclose your query terms in quotes, the data field specifier will be applied only to the term that immediately follows it, and that term will be searched in a case sensitive manner.

    For example, a search for the following query (with no quotes or special characters):
    isolation_source:romaine lettuce
    will be interpreted as:
    isolation_source:romaine OR lettuce
    The system will show all isolates that have the lower case term romaine in the Isolation Source data field, or the term lettuce in any case and in any field.
    If, on the other hand, you want to search romaine lettuce as a phrase, you will need to use quotes, as described below.
  • Phrase Isolates Browser Advanced Search, topic listback to top

    If you want to search for a phrase, surround your query terms with quotes.

    Examples:

    A search for:
    "romaine lettuce"
    will show isolates that contain that phrase in any data field.

    A search for:
    "all-purpose flour"
    will show isolates that contain the phrase all-purpose flour, and will conduct the search in a case insensitive manner because the query does not include a data field specifier.

    A search for:
    isolation_source:"All-Purpose Flour"
    will show isolates that contain the phrase All-Purpose Flour in the Isolation Source data field.
    Because the query includes a data field specifier, the search is conducted in a case sensitive manner. It will therefore only show isolates that have the exact phrase you specified, including upper and lower case letters as well as the hyphen.

    Tips:

    If no quotes are used, the system will automatically insert a Boolean OR when it encounters a space in the query string. If you query specific data fields, please note that the names of data fields, and the values they contain, are case sensitive. Special characters, such as the hyphen in the examples above, are recognized as part of the search term and therefore retained in the query, regardless of whether quotes are used.

    For example, if the last sample search above was entered as isolation_source:All-Purpose Flour, with no quotes, it would be interpreted as isolation_source:All-Purpose OR Flour. The Browser would show isolates that contain the term All-Purpose in the Isolation Source data field or the term flour (in any case) in any data field. This is because the system processes the term adjacent to the data field specifier in a case sensitive manner, and inserts a Boolean OR when it encounters a space.
  • Term modifiers Isolates Browser Advanced Search, topic listback to top

    As noted in the "Standard Query Parser" section of the SOLR Reference Guide 6.6, "Solr supports a variety of term modifiers that add flexibility or precision, as needed, to searches. These modifiers include wildcard characters, characters for making a search "fuzzy" or more general, and so on."
  • Wildcard searches Isolates Browser Advanced Search, topic listback to top

    A question mark (?) can be included in your query string to match any single character.
    An asterisk (*) can be included in your query string to match zero or more sequential characters.

    Examples:

    A search for:
    AMR_genotypes:tet(?)
    will show isolates that have a string of "tet(?)" in the AMR Genotypes data field, with the question mark serving as a wildcard to retrieve gene names that have any single character in the parentheses, such as tet(A), tet(M), tet(O), tet(X), etc.

    A search for:
    strawberr*
    will show isolates that contain terms such as strawberry, strawberries, etc. in any data field.

    A search for:
    isolation_source:*berry
    will show isolates that contain terms such as strawberry, mulberry, etc. in the Isolation Source data field.

    Tips:

    The wildcard characters can appear anywhere in your search term (at the beginning, middle, or end).
    The SOLR Reference Guide 6.6 provides additional details about the use of wildcards.

  • Special characters in search terms Isolates Browser Advanced Search, topic listback to top

    As noted in the introduction to the advanced search section of this document, the Isolates Browser uses the SOLR search platform (version 6.6) to retrieve pathogen data. However, in some instances, there might be some slight differences between the Isolates Browser and the SOLR Standard Query Parser.

    For example, the Isolates Browser has been modified relative to the SOLR Standard Query Parser in the way it handles special characters that are part of a search term. Specifically, the browser has been programmed to automatically escape special characters (such as hyphens in strain names, parentheses in gene names, slashes in serovar names, etc.) and to treat them as part of the search term. As a result, the Browser retrieves isolates that include the term exactly as it was entered, including special characters.

    Examples:

    A search for:
    strain:KCRI-598A
    will show isolates that contain the term KCRI-598A in the Strain data field.

    A search for:
    serovar:1/2a
    will show isolates that contain the term 1/2a in the Serovar data field.

    A search for:
    AMR_genotypes:ant(6)-Ia AND AMR_genotypes:aph(3')-IIIa
    will show isolates that have both ant(6)-Ia and aph(3')-IIIa in the AMR Genotypes data field.

    Tips:

    When you query specific data fields, please note that the names of data fields, and the values they contain, are case sensitive. Also, if your query string includes a space, surround the query string with quotes in order to do a phrase search. If no quotes are used, the system will automatically insert a Boolean OR when it encounters a space in the query string.
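
The rules described in this section (quoted phrases stay together, every other space is read as a Boolean OR, and special characters remain part of the term) can be illustrated with a short Python sketch. It is only a model of the documented behavior for queries that contain no explicit operators or parentheses; the function name is hypothetical and this is not the browser's actual parser.

```python
import re

# A token is either an optional prefix (such as a data field specifier)
# followed by a double-quoted phrase, or a run of non-space characters.
_TOKEN = re.compile(r'[^\s"]*"[^"]*"|\S+')


def interpret_query(query):
    """Show how a query without explicit operators would be read.

    Quoted phrases are kept as single terms; every remaining space is
    replaced by a Boolean OR. Hyphens, slashes, parentheses, and
    apostrophes are left untouched as part of their term.
    """
    return " OR ".join(_TOKEN.findall(query))
```

Applied to the examples above, interpret_query("all-purpose flour") gives all-purpose OR flour, interpret_query("isolation_source:romaine lettuce") gives isolation_source:romaine OR lettuce, and a quoted query such as "romaine lettuce" is returned unchanged as a single phrase.
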

SOLR Operators Isolates Browser Advanced Search, topic listback to top

AND, OR, NOT | Plus (+) and Minus (-) symbols | Range searches [nnnn TO nnnn]
  • AND, OR, NOT Isolates Browser Advanced Search, topic listback to top

    The SOLR search platform allows you to apply Boolean logic to queries with the AND, OR, and NOT operators. Boolean operators must be written in upper case letters, or they can be represented as symbols:

    AND can be represented as &&
    OR can be represented as ||
    NOT can be represented as !

    By default, the system applies the OR operator each time it encounters a space in the query string.

    Examples:

    A search for:
    all-purpose flour
    will be interpreted as:
    all-purpose OR flour
    because the system applies a Boolean OR when it encounters a space in your query string.
    The system recognizes special characters such as the hyphen when they are part of a search term, and therefore will show isolates that contain the term all-purpose in any data field, or the term flour in any data field.

    A search for:
    romaine lettuce
    will be interpreted as:
    romaine OR lettuce
    and will show isolates that contain the term romaine in any data field, or the term lettuce in any data field.

    A search for:
    romaine AND lettuce
    will show isolates that contain both of the terms, which can appear independently of each other in any data field. If you instead prefer to retrieve isolates in which two or more query terms appear adjacent to each other, use quotes to conduct a phrase search. For example, a search for: "romaine lettuce" will retrieve isolates that contain the phrase romaine lettuce in any data field.

    A search for:
    lettuce NOT romaine
    will show isolates that contain the term lettuce, but not the term romaine.
    That same search can also be written as:
    lettuce !romaine
    or as:
    lettuce -romaine

    Tips:

    The SOLR Reference Guide 6.6 provides additional details about the use of Boolean operators.
  • Plus (+) and Minus (-) symbols Isolates Browser Advanced Search, topic listback to top

    The plus (+) and minus (-) symbols can be used to require that a term be present or absent, respectively, in the records retrieved by a search.

    Examples:

    A search for:
    lettuce +romaine
    will show isolates that contain the term lettuce (in any data field) and that must contain the term romaine (in any data field).

    A search for:
    lettuce -romaine
    will show isolates that contain the term lettuce (in any data field) but must not contain the term romaine (in any data field).

    Tips:

    The SOLR Reference Guide 6.6 provides additional details about the use of the plus (+) and minus (-) symbols in the section on Boolean operators.
  • Range searches [nnnn TO nnnn] Isolates Browser Advanced Search, topic listback to top

    To search for a range of values, enter a query such as:
    collection_date:[value1 TO value2]
    with square brackets surrounding the query string, and with the word "TO" written in upper case.

    Examples:

    Range of Collection Dates:

    A search for:
    collection_date:[2013-02* TO 2013-08*]
    will show isolates that were collected anytime from February 2013 through August 2013.
    A search for:
    collection_date:[2013* TO 2015*]
    will show isolates that were collected in any month or date from 2013 through 2015.

    See the section of this help document for more information about the Collection Date data field, which accepts an asterisk (*) as a wildcard.

    Range of Create Dates:

    A search for:
    creation_date:[2013-02 TO 2013-08]
    will show isolates that were first seen by the Pathogen Detection system anytime from February 2013 through August 2013.
    See the section of this help document for more information about the Create Date data field, which does NOT accept an asterisk (*) as a wildcard.

    Tips:

    The SOLR Reference Guide 6.6 provides additional details about Range searches.
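
The range syntax above differs slightly between the two date fields: Collection Date accepts an asterisk (*) as a wildcard, while Create Date does not. A minimal Python sketch of a helper that builds such range queries follows. The function name and the lookup table are illustrative assumptions; only the two field names and their wildcard behavior are taken from the text above.

```python
# Per the text above: Collection Date accepts "*" as a wildcard,
# Create Date does not. Other fields are not covered by this sketch.
ACCEPTS_WILDCARD = {"collection_date": True, "creation_date": False}


def date_range_query(field, start, end):
    """Return a `field:[start TO end]` range query for a known date field."""
    if field not in ACCEPTS_WILDCARD:
        raise ValueError("unsupported date field: " + field)
    # Append the wildcard only for fields that are documented to accept it.
    suffix = "*" if ACCEPTS_WILDCARD[field] else ""
    return "{}:[{}{} TO {}{}]".format(field, start, suffix, end, suffix)
```

For example, date_range_query("collection_date", "2013-02", "2013-08") returns collection_date:[2013-02* TO 2013-08*], and date_range_query("creation_date", "2013-02", "2013-08") returns creation_date:[2013-02 TO 2013-08], matching the two sample searches above.
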

Parentheses Isolates Browser Advanced Search, topic listback to top

order Boolean queries | input a list of query terms | automatically escaped if part of a query term
  • Use parentheses to determine order of execution in Boolean queries Isolates Browser Advanced Search, topic listback to top

    By default, the system applies Boolean operators from left to right in the query. Parentheses can be used to alter the order of execution of Boolean operators. Sub-queries that are surrounded by parentheses will be executed first.

    Examples:

    A search for:
    AMR_genotypes:qnr* AND (AST_phenotypes:ciprofloxacin=R OR AST_phenotypes:"nalidixic acid=R")
    will show all of the isolates that have a qnr gene and that are resistant to either ciprofloxacin or nalidixic acid.
    (For additional information about this example, see the section of this help document on Examples of SOLR Queries > Genotypes and phenotypes: has specific gene, resistant to antibiotics.)

    Tips:

    The SOLR Reference Guide 6.6 provides additional details about use of parentheses for grouping terms to form sub-queries.
  • Use parentheses to input a list of query terms Isolates Browser Advanced Search, topic listback to top

    Search terms that are enclosed in parentheses will be OR'ed together.

    Examples:

    A search for:
    escherichia AND (FDA CDC USDA)
    will show isolates that contain the term escherichia (in any data field), and the term FDA or CDC or USDA (in any data field).

  • Parentheses are automatically escaped if they are an internal part of a gene name or query term Isolates Browser Advanced Search, topic listback to top

    As noted in the introduction to the advanced search section of this document, the Isolates Browser uses the SOLR search platform (version 6.6) to retrieve pathogen data. However, in some instances, there might be some slight differences between the Isolates Browser and the SOLR Standard Query Parser.

    For example, the Isolates Browser has been modified relative to the SOLR Standard Query Parser in the way it handles special characters that are part of a search term. Specifically, the browser has been programmed to automatically escape special characters, such as parentheses that are part of gene names, and to treat them as part of the search term. As a result, the Browser retrieves isolates that include the term exactly as it was entered, including special characters.

    Examples:

    A search for:
    AMR_genotypes:ant(6)-Ia AND AMR_genotypes:aph(3')-IIIa
    will show isolates that have both strings, ant(6)-Ia and aph(3')-IIIa, in the AMR Genotypes data field.
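
The use of parentheses to group an OR'ed sub-query, as in the qnr example above, can be sketched as a small query builder in Python. The function names are hypothetical, and the "antibiotic=R" value format is copied from the sample query in this section rather than from a full specification of the AST phenotypes field.

```python
def ast_clause(antibiotic, result="R"):
    """Build one AST_phenotypes clause such as AST_phenotypes:ciprofloxacin=R."""
    value = "{}={}".format(antibiotic, result)
    # A value containing a space must be quoted to be searched as a phrase.
    if " " in value:
        value = '"{}"'.format(value)
    return "AST_phenotypes:" + value


def gene_with_any_resistance(gene_pattern, antibiotics):
    """Require a gene pattern AND resistance to at least one listed antibiotic."""
    clauses = [ast_clause(name) for name in antibiotics]
    # The parentheses make the OR group execute before the AND.
    return "AMR_genotypes:{} AND ({})".format(gene_pattern, " OR ".join(clauses))
```

Calling gene_with_any_resistance("qnr*", ["ciprofloxacin", "nalidixic acid"]) yields AMR_genotypes:qnr* AND (AST_phenotypes:ciprofloxacin=R OR AST_phenotypes:"nalidixic acid=R"). Without the parentheses, the operators would be applied from left to right, which would change the meaning of the query.
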

Data fields in the Pathogens Isolates Browser Isolates Browser Advanced Search, topic listback to top

The Isolates Browser data fields listed below have been indexed by the Pathogen Detection project and are therefore directly searchable.

Each data field reflects an available column in the Pathogens Isolates Browser web interface. The output section of this document provides tips on how to choose columns to include in the display.

Please note: in the list of available data fields below:
  • The term shown in the regular font is the display name (column header) shown by the Isolates Browser web interface. The term shown in (italics) is the name of the corresponding data field, if you want to search that field directly.
  • For example, one data field is listed as: Location (geo_loc_name). The term "Location" appears in the Isolates Browser column header, and "geo_loc_name" is the string you should use if you want to search that data field directly.
  • Brief italicized search examples are also provided for each data field, when possible, showing how to query the data field directly. The values represent text strings exactly as they appear in the data fields, including upper case and lower case letters, special characters such as hyphens, etc. The data field names and values are case sensitive, as noted below.
Case sensitive searches: The names of the data fields, and the values they contain, are case sensitive. The values represent text strings exactly as they were entered by the submitter, including upper case and lower case letters, special characters such as hyphens, etc. (A separate section of this document provides examples of case sensitive searches.)

The case-sensitivity and the retention of special characters such as hyphens and parentheses (when they are internal to a search term) were built into the system in order to ensure precise handling of searches for values such as strain name, serovar, gene symbol, and more. The case sensitivity and handling of special characters applies to other data fields as well.

Therefore, when you search a specific data field, the system will retrieve isolates that contain the exact string you have specified, including upper case and lower case letters, as well as special characters such as hyphens and parentheses.

Case insensitive searches: If you are uncertain about the exact text string that appears in isolate records, then you can simply enter the query in any text format (all upper, all lower, or mixed case) without a data field specifier. The system will then search the Text index, which is a case insensitive compilation of terms from many text-containing data fields. This provides a flexible search mechanism, although it is less precise in its retrieval as the query terms can appear in any text field of the pathogen isolate records. (A separate section of this document provides examples of case insensitive searches.)

The query tips section of this document includes a comparison of case sensitive versus case insensitive searches.
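
Because field-specific searches are case sensitive, one way to broaden a fielded query is to OR together several capitalizations of the same term, as in the isolation_source:lettuce OR isolation_source:Lettuce example earlier in this document. The Python sketch below generates such a query. The function name is hypothetical, the choice of variants (as entered, lower case, capitalized, upper case) is an assumption about which forms submitters commonly use, and the sketch handles single-word terms only.

```python
def case_variant_query(field, term):
    """OR together a few common capitalizations of `term` within one field."""
    variants = []
    for candidate in (term, term.lower(), term.capitalize(), term.upper()):
        # Skip duplicates while preserving the order of first appearance.
        if candidate not in variants:
            variants.append(candidate)
    return " OR ".join("{}:{}".format(field, variant) for variant in variants)
```

For example, case_variant_query("isolation_source", "lettuce") returns isolation_source:lettuce OR isolation_source:Lettuce OR isolation_source:LETTUCE. When the exact form is unknown, dropping the data field specifier to search the case insensitive Text index, as described above, remains the simpler option.
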

The available data fields in the Pathogens Isolates Browser include the following: Isolates Browser Advanced Search, topic listback to top
Note that fields shown in the default display are highlighted in blue. Each field is written in this format:   Display name (data_field_name)
The "Display name" is the column header that appears in the Isolates Browser web interface, and the "data_field_name" is the case-sensitive string you should enter if you want to search the data field directly using a SOLR query:
AMR genotypes (AMR_genotypes)
AMR genotypes core (AMR_genotypes_core)
AMRFinderPlus analysis type (amrfinderplus_analysis_type)
AMRFinderPlus version (amrfinderplus_version)
Assembly (asm_acc)
AST phenotypes (AST_phenotypes)
BioProject (bioproject_acc)
BioSample (biosample_acc)
Collected by (collected_by)
Collection Date (collection_date)
Computed types (computed_types)
Contigs (asm_stats_n_contig)
Create Date (creation_date)
Host (host)
Host Disease (host_disease)
IFSAC_category (IFSAC_category)
Isolate (target_acc)
Isolate_identifiers (isolate_identifiers)
Isolation Source (isolation_source)
Isolation type (epi_type)
K-mer group (kmer_group)
Lat/Lon (lat_lon)
Length (asm_stats_length_bp)
Level (asm_level)
Library Layout (LibraryLayout)
Location (geo_loc_name)
Method (assembly_method)
Min-same (minsame)
Min-diff (mindiff)
N50 (asm_stats_contig_n50)
Organism Group (taxgroup_name)
Outbreak (outbreak)
PFGE Primary Enzyme Pattern (PFGE_PrimaryEnzyme_pattern)
PFGE Secondary Enzyme Pattern (PFGE_SecondaryEnzyme_pattern)
Platform (Platform)
PD Ref Gene Catalog version (refgene_db_version)
Run (Run)
Strain (strain)
Serovar (serovar)
SNP cluster (erd_group)
Scientific name (scientific_name)
Source type (source_type)
Species TaxID (species_taxid)
SRA Center (sra_center)
SRA Release Date (sra_release_date)
Stress genotypes (stress_genotypes)
TaxID (taxid)
Virulence genotypes (virulence_genotypes)
WGS Accession (wgs_master_acc)
WGS Prefix (wgs_acc_prefix)
  • AMR genotypes (AMR_genotypes)

    Antimicrobial resistance (AMR) genes found in the isolate during analysis with AMRFinderPlus. This is a de-duplicated list, so multiple genes that share the same symbol will only be represented once. <NONE> indicates a lack of AMR genes identified by AMRFinderPlus, while an empty field means AMRFinderPlus results are not yet available. See the AMRFinderPlus analysis type, PD Ref Gene Catalog version, and AMRFinderPlus version fields for more information about the AMRFinderPlus analysis of this isolate. (Separate sections of this document provide an overview of AMRFinderPlus and additional information about genotypes.)

    The genes that have been identified in an isolate's genome sequence are grouped into genotype categories, such as complete, partial, partial end of contig. The data processing pipeline section of this document provides more information about genotype categories.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes in the second example below), and the use of wildcards such as the asterisk and question mark (as in the first and third examples below).

    Examples:
    • To search this field directly, enter a query such as:   AMR_genotypes:searchterm
    • Search for:   AMR_genotypes:mcr* AND AMR_genotypes:blaKPC*
      to show all of the isolates that have both a mobile colistin resistance gene and a KPC beta-lactamase.
    • Search for:   AMR_genotypes:ant(6)-Ia AND AMR_genotypes:aph(3')-IIIa
      to show all of the isolates that have both strings, ant(6)-Ia and aph(3')-IIIa, in the AMR Genotype data field.
    • Search for:   AMR_genotypes:tet(?)
      to show all of the isolates that have a genotype matching "tet(?)", with the question mark serving as a wildcard to retrieve gene names that have any character in the parentheses, such as tet(A), tet(M), tet(O), tet(X), etc.
    Note: To learn more about a given gene, open the Pathogen Detection Reference Gene Catalog and search for the gene symbol of interest. For example, see the Reference Gene Catalog results of a search for mcr* or ant(6)-Ia. In the Pathogen Detection Reference Gene Catalog search results display, clicking on the gene symbol will retrieve the isolates that have been found to contain the gene.
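
    As a rough local illustration of how the two wildcards in the examples above behave, the sketch below translates a pattern into a regular expression and tests it against a few gene symbols. It assumes the usual SOLR convention that an asterisk matches zero or more characters and a question mark matches exactly one. The function names and the list of symbols are hypothetical, and the code does not contact the Isolates Browser.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a query pattern into a compiled, case-sensitive regex.

    Assumption: '*' matches zero or more characters and '?' exactly one.
    Every other character, including parentheses, is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def symbol_matches(pattern, symbol):
    """Return True if the whole gene symbol matches the pattern."""
    return wildcard_to_regex(pattern).fullmatch(symbol) is not None

# Illustrative strings only, not a dump of real field values.
symbols = ["tet(A)", "tet(M)", "tet(W/N/W)", "mcr-1.1", "blaKPC-2"]
tet_hits = [s for s in symbols if symbol_matches("tet(?)", s)]
mcr_hits = [s for s in symbols if symbol_matches("mcr*", s)]
```

    Under these assumptions, tet(?) selects only the symbols with a single character inside the parentheses, and the match stays case sensitive, so a pattern such as mcr* would not match an upper-case MCR-1.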
  • AMR genotypes core (AMR_genotypes_core)

    Core antimicrobial resistance (AMR) genes found in the isolate during analysis with AMRFinderPlus. The only difference between the AMR genotypes core (AMR_genotypes_core) and AMR genotypes (AMR_genotypes) columns is that "plus" genes are not shown in the core column. This is a de-duplicated list, so multiple genes that share the same symbol will only be represented once. <NONE> indicates a lack of AMR genes identified by AMRFinderPlus, while an empty field means AMRFinderPlus results are not yet available. See the AMRFinderPlus analysis type, PD Ref Gene Catalog version, and AMRFinderPlus version fields for more information about the AMRFinderPlus analysis of this isolate. (Separate sections of this document provide an overview of AMRFinderPlus and additional information about core vs. plus genotypes.)

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes in the second example below), and the use of wildcards such as the asterisk and question mark (as in the first and third examples below).

    Examples:
    • To search this field directly, enter a query such as:   AMR_genotypes_core:searchterm
    • Search for:   AMR_genotypes_core:mcr* AND AMR_genotypes_core:blaKPC*
      to show all of the isolates that have both a mobile colistin resistance gene and a KPC beta-lactamase.
  • AMRFinderPlus analysis type (amrfinderplus_analysis_type)

    Indicates the data types that were used to analyze the isolate's genome sequences using AMRFinderPlus. Genome sequences are generally analyzed in two passes:
    1. NUCLEOTIDE: this is an initial analysis that is done, using translated BLAST, immediately after a pathogen isolate genome is assembled. It identifies the proteins in the genome sequence.
    2. COMBINED: this is a second, more sensitive analysis that runs AMRFinderPlus on both an isolate's nucleotide and protein sequences. Protein BLAST, nucleotide BLAST, and HMMER are used to analyze the proteins. The combined analysis can produce more sensitive results than the initial nucleotide analysis.
      (Separate sections of this document provide details about the Pathogen Detection data processing pipeline and an overview of AMRFinderPlus. The AMRFinderPlus wiki provides details about installing and running the program, interpreting the results, and methods used.)
    This field will be empty if AMRFinderPlus results are not yet available.

    Data field names and values are case sensitive, as shown in the examples below, and the values of "NUCLEOTIDE" and "COMBINED" must be written in all upper case.

    Examples:
    • To search this field directly, enter a query such as:   amrfinderplus_analysis_type:searchterm
    • Search for:   amrfinderplus_analysis_type:COMBINED
      to show all of the isolates that were analyzed by running AMRFinderPlus on both their nucleotide and protein sequences.
  • AMRFinderPlus version (amrfinderplus_version)

    The version of the AMRFinderPlus software that was used to analyze a particular isolate.

    New isolates are analyzed using the latest version of AMRFinderPlus software. Older isolates may have been analyzed with earlier versions of AMRFinderPlus software. There might be occasional updates to annotation on all isolates in special circumstances, such as the identification of new genes (e.g., mobilized colistin resistance (mcr) genes).

    This field will be empty if AMRFinderPlus results are not yet available.

    (Separate sections of this document provide details about the Pathogen Detection data processing pipeline and an overview of AMRFinderPlus. The AMRFinderPlus wiki provides details about installing and running the program, interpreting the results, and methods used.)

    Data field names and values are case sensitive (as shown in the example below, in which the data field name is in all lower case). Additional query tips are provided in a separate section of this document.

    Examples:
    • To search this field directly, enter a query such as:   amrfinderplus_version:searchterm
    • Search for:   amrfinderplus_version:3.6.7
      to show all of the isolates that were analyzed with AMRFinderPlus version 3.6.7.
  • Assembly (asm_acc)

    The accession number for the genome sequence from the Assembly database.

    Data field names and values are case sensitive, as shown in the examples below.
    Note that a transient state may occur where two isolates point to the same assembly when the submitter changes the taxonomic identifier for the biosample from one taxgroup to another.
    The assembly accession should be entered in the form of Accession.version, as in the example below.
    If you enter only the accession, no hits will be returned.
    If you don't know the version number, then you can use an asterisk (*) in its place as a wildcard, for example:   asm_acc:GCA_000008865.*
    In either case, the letters that are in the accession number prefix must be in upper case. A separate section of this document provides search tips about case sensitive searches.

    Examples:
    • To search this field directly, enter a query such as:   asm_acc:searchterm
    • Search for:   asm_acc:GCA_000008865.2
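
    The Accession.version rule described above can be sketched as a small query builder. This is a minimal illustration, not part of the Isolates Browser: the function name is hypothetical, and it only assembles the query string that you would then paste into the search box.

```python
def versioned_accession_query(field, accession, version=None):
    """Build 'field:Accession.version', or 'field:Accession.*' when the version is unknown.

    The accession is upper-cased because the letters in the accession
    prefix must be in upper case for the case-sensitive search to match.
    """
    if "." in accession:
        raise ValueError("pass the accession without its version suffix")
    suffix = "*" if version is None else str(version)
    return f"{field}:{accession.upper()}.{suffix}"

exact_query = versioned_accession_query("asm_acc", "GCA_000008865", 2)
any_version_query = versioned_accession_query("asm_acc", "gca_000008865")
```

    The field name is a parameter so that the same helper can be reused for other fields that follow the Accession.version convention.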
  • AST phenotypes (AST_phenotypes)

    Antibiotic resistance phenotype, based on Antimicrobial Susceptibility Test (AST) results. (read more about phenotypes and look at sample records)

    Data field names and values are case sensitive, as shown in the examples below. A separate section of this document provides tips about the use of quotes for phrase searches.

    DISCLAIMER: In the Isolates Browser, this data field is presented as a list of antibiotic compounds broken down by the resistance call made by the data submitter. These calls are typically made using CLSI or EUCAST standards, which change over time, or by an automated instrument, which may infer the cutoff. As a result, data submitted under an earlier standard may have different resistance calls for the same antibiotic compound than data submitted under a later standard, and different tests may yield different results even for the same organism and the same isolate. Users can search this field by the antibiotic compound and by the resistance call; the format differs from that of most other fields in this document.

    Examples:
    • To search this field directly, enter a query such as:   AST_phenotypes:searchterm
    • Search for:   AST_phenotypes:imipenem=R
      to show isolates that are resistant to imipenem
    • Search for:   AST_phenotypes:ciprofloxacin=R OR AST_phenotypes:"nalidixic acid=R"
      to show isolates that are resistant to either ciprofloxacin or nalidixic acid
    A list of possible phenotype values is shown on the BioSample Antibiograms page, under the "Resistance Phenotype" tab, and includes:
    • intermediate (I)
    • nonsusceptible (NS)
    • not defined (N, ND)
    • resistant (R)
    • susceptible (S, sensitive)
    • susceptible-dose dependent (SDD)
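
    Because this field combines a compound and a resistance call in a single compound=call term, it can be convenient to generate the clauses rather than type them. The sketch below is one hedged way to do that; the helper names are hypothetical, and quoting a term only when it contains a space is an assumption modelled on the examples above.

```python
def ast_clause(antibiotic, call):
    """Return one clause such as AST_phenotypes:imipenem=R.

    Terms containing a space are wrapped in double quotes so that they
    are searched as a phrase.
    """
    term = f"{antibiotic}={call}"
    if " " in term:
        term = f'"{term}"'
    return f"AST_phenotypes:{term}"

def resistant_to_any(antibiotics):
    """Join one 'resistant' clause per antibiotic with OR."""
    return " OR ".join(ast_clause(name, "R") for name in antibiotics)

query = resistant_to_any(["ciprofloxacin", "nalidixic acid"])
```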
  • BioProject (bioproject_acc)

    BioProject accession (read more about bioprojects and look at sample records)

    Data field names and values are case sensitive. The letters that are in the accession prefix must be in upper case, as shown in the example below. (A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project.)

    Examples:
    • To search this field directly, enter a query such as:   bioproject_acc:searchterm
    • Search for:   bioproject_acc:PRJNA230969
      to show all isolates that were sequenced as part of BioProject PRJNA230969, which describes the GenomeTrakr project by the US Food and Drug Administration (FDA) to sequence Escherichia coli (E. coli) genomes for the surveillance and rapid detection of foodborne contamination events.
    • Note that some bioprojects are "parent" to other bioprojects, and a search of this data field only retrieves the bioprojects that are being searched for explicitly. For example, the search above will only retrieve BioProject PRJNA230969, and not its parent project (BioProject PRJNA230919). To access a parent project, or additional sub-projects that fall under the same parent, follow the "Navigate up" and "Navigate Across" links, respectively, that appear on a BioProject page.
  • BioSample (biosample_acc)

    BioSample accession (read more about biosamples and look at sample records).

    Data field names and values are case sensitive. The letters that are in the accession prefix must be in upper case, as shown in the example below. (A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project.)

    Examples:
    • To search this field directly, enter a query such as:   biosample_acc:searchterm
    • Search for:   biosample_acc:SAMN05245394
      to show the isolate from an individual BioSample, SAMN05245394, which was collected and sequenced as part of the FDA's GenomeTrakr project (BioProject PRJNA230969) for the surveillance and rapid detection of foodborne contamination events
  • Collected by (collected_by)

    Name of persons or institute who collected the sample, if provided by the submitter.

    Data field names and values are case sensitive, as shown in the examples below, and quotes can be used for phrase searches.

    When you enter a query, the system will retrieve isolates that contain the exact query string you specified, including punctuation, capitalization, and spaces.

    To browse the various values that are available in a data field, use the "Choose Columns" option at the top of the "Matched Isolates" table, select the desired column (data field) to display, then click on the column header to sort by the values in that column.

    Examples:
    • To search this field directly, enter a query such as:   collected_by:searchterm
    • Search for:   collected_by:FDA
  • Collection Date (collection_date)

    Date sample was collected, in the format the submitter supplied.
    (In contrast, the values in the Create Date field are in ISO format.)

    Note: collection_date is the time the sample was collected, which may differ from the time the data was submitted to INSDC and from the time the data was added to the Pathogen Detection project. For real-time submissions of pathogen surveillance data, these dates will be in close proximity. For legacy data, or research projects, these dates may differ widely and be separated by years.

    You can use an asterisk (*) as a wildcard for truncation, in order to retrieve all of the isolates that were collected in a given month or year, as shown in the examples below.
    To search for a range of values, enter a query such as: collection_date:[value1 TO value2] with square brackets surrounding the query string, and with the word "TO" written in upper case.
    Data field names and values are case sensitive, and this data field name should be written in all lower case.

    Examples:
    • To search this field directly, enter a query such as:   collection_date:searchterm
    • Search for:   collection_date:2013-08-24
      to show isolates in which the submitter entered that exact string as the collection date.
    • Search for:   collection_date:2013-08
      to show isolates in which the submitter entered that exact string as the collection date (that is, the isolates in which the submitter provided only the year and month, but not the day, as the collection_date).
    • Search for:   collection_date:2013-08*
      to show isolates that were collected in August 2013. The asterisk serves as a wildcard, and the system will therefore retrieve all isolates that have 2013-08 as the stem of their collection date.
    • Search for:   collection_date:[2013-02* TO 2013-08*]
      to show isolates that were collected anytime from February 2013 through August 2013.
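
    The difference between the exact query 2013-08 and the truncated query 2013-08* can be seen in a small local emulation. The sketch below is an assumption-laden illustration rather than the search engine's actual logic: it handles only a single trailing asterisk, the function name is hypothetical, and the sample values are made up.

```python
def matches_collection_date(term, value):
    """Emulate an exact or trailing-asterisk match against a submitter-supplied date string."""
    if term.endswith("*"):
        # Truncation: the value only has to start with the stem.
        return value.startswith(term[:-1])
    # No wildcard: the value must equal the term exactly.
    return value == term

# Hypothetical collection_date strings as different submitters might supply them.
submitted = ["2013-08-24", "2013-08", "2013-09-01", "2013"]
exact_hits = [v for v in submitted if matches_collection_date("2013-08", v)]
stem_hits = [v for v in submitted if matches_collection_date("2013-08*", v)]
```

    Here the exact query finds only the record whose submitter gave year and month without a day, while the truncated query also finds the fully dated record from the same month.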
  • Computed types (computed_types)

    "In-silico" typing results. Currently the results of executing SeqSero2 on Salmonella isolates (only) are presented in these subfields:
    • serotype - The serovar computed from the reads (if available) or the assembly of the isolate.
    • antigen_formula - The antigenic formula computed from the reads (if available) or the assembly of the isolate.

    Values for "Serotype" and "Antigen formula" in the Computed types field may not agree with the user submitted fields Serovar, TaxID, or Scientific name because those fields are reported by the submitter. The "computed_types" field, on the other hand, is a computational prediction based on the sequence calculated as part of the Pathogen Detection Pipeline.

    Examples:
    • Search for:   taxgroup_name:"Salmonella enterica" AND computed_types:serotype=Enteritidis
      to show isolates whose computed serovar is Enteritidis only.
    • Search for:   taxgroup_name:"Salmonella enterica" AND computed_types:antigen_formula=9:g,m:-
      to show isolates with the antigenic formula that corresponds to serovar Enteritidis.
    • Search for:   taxgroup_name:"Salmonella enterica" AND computed_types:serotype=Enteritidis AND NOT serovar:*nteritidis*
      to show isolates whose computed serovar is Enteritidis but were submitted with a different serovar.
  • Contigs (asm_stats_n_contig)

    Number of contigs in the isolate's genome assembly. If this was submitted to GenBank by the submitter it will be from their assembly and will match the assembly stats in the assembly database (https://www.ncbi.nlm.nih.gov/assembly/). If it is from an assembly made by the Pathogen Detection system, it may not yet be in GenBank, and therefore this will be the only place to see the assembly statistics.

    To search for a range of values, enter a query such as:   asm_stats_n_contig:[value1 TO value2]
    with square brackets surrounding the query string, and with the word "TO" written in upper case. An interesting way to use a range search of this field is to retrieve isolates whose genome assemblies are comprised of only a few contigs.
    Data field names and values are case sensitive, and this data field name should be written in all lower case.

    Examples:
    • To search this field directly, enter a query such as:   asm_stats_n_contig:searchterm
    • Search for:   asm_stats_n_contig:[1 TO 3]
      to retrieve isolates with genome assemblies comprised of contigs that range in number from 1 to 3
  • Create Date (creation_date)

    The date on which this isolate was first seen by the Pathogen Detection system, in the format: YYYY-MM-DD. Note, these dates are in ISO format.
    (In contrast, the values in the Collection Date field are in the format that was provided by the data submitter.)

    This data field does not accept an asterisk as a wildcard. However, it allows you to input either a full date or a partial date as the query. For example, enter the query in the format:
    YYYY-MM-DD to retrieve all isolates first seen on a specific date, or
    YYYY-MM to retrieve all isolates first seen during a given month, or
    YYYY to retrieve all isolates first seen during a given year.
    To search for a range of values, enter a query such as:   creation_date:[value1 TO value2]
    with square brackets surrounding the query string, and with the word "TO" written in upper case.
    Data field names and values are case sensitive, and this data field name should be written in all lower case.

    Examples:
    • To search this field directly, enter a query such as:   creation_date:searchterm
    • Search for:   creation_date:2013-11-19
      to show isolates that were first seen by the Pathogen Detection system on that exact date.
    • Search for:   creation_date:2013-11
      to show isolates that were first seen by the Pathogen Detection system in November 2013.
    • Search for:   creation_date:2013
      to show isolates that were first seen by the Pathogen Detection system in 2013, regardless of the month or date.
    • Search for:   creation_date:[2013-02 TO 2013-08]
      to show isolates that were first seen by the Pathogen Detection system anytime from February 2013 through August 2013.
    • Search for:   creation_date:[2013 TO 2015]
      to show isolates that were first seen by the Pathogen Detection system in any month or date from 2013 through 2015.
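
    Since this field accepts full or partial ISO dates but not an asterisk, a query builder can check the input before assembling the string. The following is a minimal sketch under that reading of the rules above; the function name is hypothetical, and the check covers only the shape of the date, not whether the month or day is valid.

```python
import re

# YYYY, YYYY-MM, or YYYY-MM-DD (shape only; no calendar validation).
_PARTIAL_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")

def creation_date_query(start, end=None):
    """Build a single-value or range query on creation_date."""
    for value in (start, end):
        if value is not None and not _PARTIAL_DATE.fullmatch(value):
            raise ValueError(f"expected YYYY, YYYY-MM or YYYY-MM-DD, got {value!r}")
    if end is None:
        return f"creation_date:{start}"
    # Range syntax: square brackets and an upper-case TO.
    return f"creation_date:[{start} TO {end}]"

month_query = creation_date_query("2013-11")
range_query = creation_date_query("2013-02", "2013-08")
```

    A value such as 2013-11* is rejected with a ValueError, which mirrors the rule that this field does not take an asterisk.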
  • Host (host)

    Host species, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters. Some submitters might have entered a scientific name while others might have entered a common name; therefore, search for synonyms if you would like to retrieve more comprehensive results.
    Data field names and values are case sensitive, as shown in the examples below, and a separate section of this document provides tips about using synonyms in your query.

    Examples:
    • To search this field directly, enter a query such as:   host:searchterm
    • Search for:   host:"Homo sapiens"
      to retrieve only the isolates in which the submitter used the scientific name for the host species.
    • Search for:   host:human
      to retrieve only the isolates in which the submitter used the common name for the host species.
    • Search for:   host:"Homo sapiens" OR host:human
      to retrieve only the isolates in which the submitter used either the scientific name or the common name for the host species.
  • Host Disease (host_disease)

    Host disease, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters. Search for synonyms if you would like to retrieve more comprehensive results.
    To browse the various values that are available in a data field, use the "Choose Columns" option at the top of the "Matched Isolates" table, select the desired column (data field) to display, then click on the column header to sort by the values in that column.
    Data field names and values are case sensitive, as shown in the examples below, and separate sections of this document provide tips about using synonyms in your query, and using quotes for phrase searches.

    Examples:
    • To search this field directly, enter a query such as:   host_disease:searchterm
    • Search for:   host_disease:HUS
    • Search for:   host_disease:"hemolytic uremic syndrome"
    • Search for:   host_disease:"Hemolytic Uremic Syndrome"
    • Search for:   host_disease:HUS OR host_disease:"hemolytic uremic syndrome" OR host_disease:"Hemolytic Uremic Syndrome"
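
    When a submitter-supplied field has several spellings or synonyms, the OR query in the last example can be generated from a list. The helper below is a hypothetical sketch that quotes any value containing a space and joins the clauses with OR; it only builds the query string.

```python
def any_of(field, values):
    """Return 'field:v1 OR field:"v 2" OR ...' for a list of synonyms."""
    clauses = []
    for value in values:
        # Phrases with spaces need double quotes to be searched as a unit.
        term = f'"{value}"' if " " in value else value
        clauses.append(f"{field}:{term}")
    return " OR ".join(clauses)

disease_query = any_of(
    "host_disease",
    ["HUS", "hemolytic uremic syndrome", "Hemolytic Uremic Syndrome"],
)
host_query = any_of("host", ["Homo sapiens", "human"])
```

    Because values are case sensitive, each capitalization variant has to be listed separately, as in the host_disease example.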
  • IFSAC_category (IFSAC_category)

    IFSAC_category, if provided by the submitter. The Interagency Food Safety Analytics Collaboration (IFSAC) develops regulatory-focused schemes to help categorize isolate sourcing information.

    This field contains values exactly as they were entered by the data submitters. Search for synonyms if you would like to retrieve more comprehensive results.

    To browse the various values that are available in a data field, use the "Choose Columns" option at the top of the "Matched Isolates" table, select the desired column (data field) to display, then click on the column header to sort by the values in that column.
    Data field names and values are case sensitive, as shown in the examples below, and separate sections of this document provide tips about using synonyms in your query, and using quotes for phrase searches.

    Examples:
    • To search this field directly, enter a query such as:   IFSAC_category:searchterm
    • Search for:   IFSAC_category:nuts

    An alternative way to search the IFSAC_category data field is to use the "Filters" option, which includes an "IFSAC_category" text box, where you can enter the category name. Here it is possible to search for null values by selecting <EMPTY>.

  • Isolate (target_acc)

    Pathogen Detection accession of the isolate. The accession begins with the prefix "PDT," which stands for Pathogen Detection Target. This database is the primary resource issuing PDT accessions.

    Each target is the genome assembly for a single pathogen isolate. There are several types of genome assemblies:
    1. isolate genomes assembled by the NCBI Pathogens data processing pipeline from sequence reads, but not published as genome sequence records in GenBank
    2. isolates submitted directly to GenBank as assembled genomes, and therefore have a corresponding "GCA" accession
    3. isolate genomes assembled by the NCBI data processing pipeline and then submitted to GenBank either by the submitter or on behalf of the submitter with their permission, or without their permission into the Third Party Annotation (TPA) database.

    Data field names and values are case sensitive, and the letters that are in the accession prefix must be in upper case, as shown in the example below. (A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project.)

    The contents of this field may change for a given isolate if a new assembly or new metadata cause the pipeline to be rerun. See Data Retention and History Tracking for information on the data retention policy.

    Examples:
    • To search this field directly, enter a query such as:   target_acc:searchterm
    • Search for:   target_acc:PDT000133982
  • Isolate_identifiers (isolate_identifiers)

    A list of alternative identifiers that the isolate may be known by.

    Ids are assembled from various fields in the BioSample record, including:
    1. auxiliary identifiers supplied with the Biosample
    2. sample_name
    3. strain
    4. isolate (from BioSample)
    5. NARMS_isolate_number
    6. culture_collection
    7. isolate_name_alias (split by delimiter)

    Data field names and values are case sensitive and embedded spaces must be contained in quotes.

    Examples:
    • To search this field directly, enter a query such as:   isolate_identifiers:searchterm
    • Search for a specific identifier (CFSAN045463):   isolate_identifiers:CFSAN045463
    • Search for an identifier with an embedded space (CVM N9107):   isolate_identifiers:"CVM N9107"
    • Search with a wildcard pattern (FSIS*):   isolate_identifiers:FSIS*
    • Search a list of identifiers:   isolate_identifiers:(PNUSAS185147 PNUSAS185148 PNUSAS185149)
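
    For longer lists of identifiers, the parenthesized list form shown above can be assembled programmatically. The sketch below uses a hypothetical function name and assumes that a quoted identifier with an embedded space may also appear inside the parentheses, which the examples above do not show explicitly.

```python
def identifier_list_query(identifiers):
    """Build an isolate_identifiers query for one or more identifiers."""
    # Identifiers with embedded spaces must be enclosed in quotes.
    terms = [f'"{i}"' if " " in i else i for i in identifiers]
    if len(terms) == 1:
        return f"isolate_identifiers:{terms[0]}"
    return "isolate_identifiers:(" + " ".join(terms) + ")"

list_query = identifier_list_query(["PNUSAS185147", "PNUSAS185148", "PNUSAS185149"])
single_query = identifier_list_query(["CVM N9107"])
```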
  • Isolation Source (isolation_source)

    Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters. Data field names and values are case sensitive, as shown in the examples below. Separate sections of this document provide tips about the use of quotes for phrase searches, special characters that are part of a query term, and the use of wildcards.

    Examples:
    • To search this field directly, enter a query such as:   isolation_source:searchterm
    • Search for:   isolation_source:lettuce
    • Search for:   isolation_source:"All-Purpose Flour"
      to show all isolates that have that exact string (including upper case, lower case, and hyphen) in the isolation source data field.
    • Search for:   isolation_source:*berry
      to show isolates that contain terms such as strawberry, mulberry, etc. in the isolation source data field, using the asterisk as a wildcard to match zero or more sequential characters.
    • Note that submitters might use different terms for the same type of source (e.g., "animal-chicken-young-chicken," "chicken," "chicken breast," "Chicken Breast," "chicken carcass," "comminuted chicken," and "raw intact chicken"), so search for synonyms to broaden your retrieval, if desired.
  • Isolation type (epi_type)

    Isolation type of an isolate: clinical OR environmental/other OR NULL.
    Note, this field is derived from the attribute package selected by the isolate's submitter using one of the Pathogen templates in BioSample.
    • If attribute_package=Pathogen.cl.1.0 then isolation type is clinical.
    • If attribute_package=Pathogen.env.1.0 then isolation type is environmental/other, unless host or isolation_source indicates that it was isolated from a human subject in which case isolation type is clinical.
    • If neither of these packages is used then isolation type is NULL.

    The isolation type (epi_type) is used to calculate the SNP distance values Min-same and Min-diff. These have non-negative values when there are other isolates in the cluster having the same or different isolation type. These values will both be n/a if the isolate has isolation type NULL. These values will also be n/a if there is no other isolate in the cluster having the same or different isolation type.
    This data field's names and values are case sensitive and can be searched on values clinical OR environmental/other (enter as-is without quotes). The value NULL cannot be used as a search term. However, by using filters, you can choose between clinical OR environmental/other OR <EMPTY> and thereby find isolates whose epi_type is not set.

    Examples:
    • To search this field directly, enter a query such as:   epi_type:searchterm
    • Search for clinical isolates:   epi_type:clinical
    • Search for environmental isolates:   epi_type:environmental/other
    • Search for isolates without epi_type:   NOT epi_type:clinical NOT epi_type:environmental/other
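
    The derivation rules listed above can be summarized as a short decision function. This is only a sketch of the stated rules, not the pipeline's code: the function name is hypothetical, and the set of terms taken to "indicate a human subject" is an assumption, since the exact test is not described here.

```python
# Assumption: a hypothetical, minimal list of values treated as "human".
HUMAN_TERMS = {"homo sapiens", "human"}

def isolation_type(attribute_package, host=None, isolation_source=None):
    """Return 'clinical', 'environmental/other', or None (for NULL)."""
    if attribute_package == "Pathogen.cl.1.0":
        return "clinical"
    if attribute_package == "Pathogen.env.1.0":
        # An environmental package is overridden when host or
        # isolation_source points to a human subject.
        values = (host, isolation_source)
        if any(v is not None and v.strip().lower() in HUMAN_TERMS for v in values):
            return "clinical"
        return "environmental/other"
    # Any other attribute package leaves the isolation type unset.
    return None
```

    For example, an isolate submitted with Pathogen.env.1.0 and host "Homo sapiens" would be classified as clinical under these assumptions, while one with isolation_source "lettuce" would remain environmental/other.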
  • K-mer group (kmer_group)

    K-mer group accession, which is an alphanumeric representation of the Organism group. This database is the primary resource issuing PDG accession numbers. There is a one-to-one relationship of the organism group and the PDG accession, with each version representing each update of that organism group.

    The K-mer accession should be entered in the form of Accession.version, as in the first example below.
    If you enter only the accession, no hits will be returned.
    If you don't know the version number, then you can use an asterisk (*) to serve as a wildcard, as in the second example below.
    Data field names and values are case sensitive, and the letters that are in the accession prefix must be in upper case, as shown in the examples below. (A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project.)

    Examples:
    • To search this field directly, enter a query such as:   kmer_group:searchterm
    • Search for:   kmer_group:PDG000000004.960
    • Search for:   kmer_group:PDG000000004.*
      with an asterisk (*) serving as a wildcard, if you don't know the version number of the K-mer accession.
  • Lat/Lon (lat_lon)

    The geographical coordinates (latitude and longitude) of the location where the sample was collected, if provided by the submitter.
  • Length (asm_stats_length_bp)

    Total length of the genome sequence assembly in number of base pairs (nucleotides).
    If this was submitted to GenBank by the submitter it will be from their assembly and will match the assembly stats in the assembly database (https://www.ncbi.nlm.nih.gov/assembly/). If it is from an assembly made by the Pathogen Detection system, it may not yet be in GenBank, and therefore this will be the only place to see the assembly statistics.

    When searching the Length data field, the value should be entered as an integer with no commas.
    To search for a range of values, enter a query such as:   asm_stats_length_bp:[value1 TO value2]
    with square brackets surrounding the query string, and with the word "TO" written in upper case.
    Data field names and values are case sensitive, and this data field name should be written in all lower case.

    Examples:
    • To search this field directly, enter a query such as:   asm_stats_length_bp:[value1 TO value2]
    • Search for:   asm_stats_length_bp:[4000000 TO 5000000]
      to retrieve isolates with genome assemblies that are anywhere in the range of 4,000,000 to 5,000,000 nucleotides in length.
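
    Because numeric values must be entered as integers without commas, a small helper can normalize human-formatted numbers before building the range. The sketch below is illustrative only; the function name is hypothetical, and it simply produces the query string in the square-bracket range syntax described above.

```python
def int_range_query(field, low, high):
    """Build 'field:[low TO high]' from integers or comma-formatted strings."""
    def to_int(value):
        # Strip thousands separators such as the commas in "4,000,000".
        return int(str(value).replace(",", ""))

    low_i, high_i = to_int(low), to_int(high)
    if low_i > high_i:
        raise ValueError("low must not exceed high")
    return f"{field}:[{low_i} TO {high_i}]"

length_query = int_range_query("asm_stats_length_bp", "4,000,000", "5,000,000")
contig_query = int_range_query("asm_stats_n_contig", 1, 3)
```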
  • Level (asm_level)

    Assembly level.

    The NCBI Assembly database, which includes pathogen isolates as well as eukaryotic organisms, represents genomes assembled to different levels (read more about assembly levels). This field is only present for those assemblies in the assembly database. For pathogen assemblies not yet submitted to GenBank, this field will be blank, but for all intents and purposes the Pathogen Detection assemblies will only be at contig level. The Isolates Browser uses circle icons to represent the assembly levels, as follows:

    • Complete Genome:   Complete genome assemblies, represented in the "Level" column as a completely filled black circle icon.
    • Scaffold:   Assemblies that include scaffolds and contigs, represented in the "Level" column as a 1/2 filled circle icon.
    • Contig:   Assemblies that include only contigs, represented in the "Level" column as a 1/4 filled circle icon.
  • Library Layout (LibraryLayout) list of Isolates Browser data fieldsback to top

    Sequence Read Archive (SRA) library layout (PAIRED/SINGLE)

    Data field names and values are case sensitive. The value for library layout must be entered in all upper case, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   LibraryLayout:searchterm
    • Search for:   LibraryLayout:PAIRED
  • Location (geo_loc_name) list of Isolates Browser data fieldsback to top

    Geographical origin of the sample, if provided by the submitter. This matches the /country qualifier of GenBank records. The Location data field typically has two parts: Country:Region. Country is a controlled vocabulary (http://www.insdc.org/country.html). Region is not controlled and can be anything (i.e., free text). For example, region could be a state abbreviation, province name, city name, zip code, etc.

    Data field names and values are case sensitive, as shown in the examples below. If you enter both Country and Region, surround the query string in quotes. If you only specify a country and no region, then the search system will retrieve all isolates with the specified country name, regardless of region.

    Examples:
    • To search this field directly, enter a query such as:   geo_loc_name:searchterm
    • Search for:   geo_loc_name:"USA:NY"
      with quotes around the "country:region" query string, to retrieve isolates that were collected in New York State.
    • Search for:   geo_loc_name:USA
      with no space before the country name, to retrieve isolates that were collected in the United States, regardless of region. (If you insert a space before the country name, the system converts the query to a search of the Text index, which is a case insensitive compilation of terms from many text-containing data fields. It will therefore retrieve isolates that contain your search term (in upper and/or lower case) in any data field.)
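
    The quoting rules for this field can be sketched as follows. The helper name geo_loc_query is hypothetical, and quoting a country name that contains a space is an assumption based on the general advice in this document to use quotes for phrase searches.

```python
from typing import Optional


def geo_loc_query(country: str, region: Optional[str] = None) -> str:
    """Build a geo_loc_name query for a country, optionally with a region."""
    country = country.strip()  # a leading space would turn this into a Text index search
    if region is not None:
        # Country and region together: quote the whole "Country:Region" string.
        return f'geo_loc_name:"{country}:{region.strip()}"'
    if " " in country:
        # Assumption: multi-word country names are treated as phrase searches.
        return f'geo_loc_name:"{country}"'
    return f"geo_loc_name:{country}"


print(geo_loc_query("USA", "NY"))
print(geo_loc_query("USA"))
```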
  • Method (assembly_method) list of Isolates Browser data fieldsback to top

    Assembly method.

    This field contains values exactly as they were entered by the data submitters.

    When searching this field, the query string you enter must match exactly the string that appears in the "Method" column, including capitalization, punctuation, and spaces.

    Data field names and values are case sensitive, and quotes can be used for phrase searches, as shown in the examples below.

    Examples:
    • To search this field directly, enter a query such as:   assembly_method:"search string in quotes"
    • Search for:   assembly_method:"CLC NGS Cell v. 9.0"
    • Search for:   assembly_method:"PacBio SMRT Analysis v. 2.3.0"
    • Search for:   assembly_method:"SPAdes v. 3.11.1"

  • Min-same (minsame) list of Isolates Browser data fieldsback to top

    Minimum SNP distance from this isolate to one of the same isolation type. For example, the minimum SNP distance from one clinical isolate to another clinical isolate, or from one environmental isolate to another environmental isolate.

    A value will appear in the "Min-same" column only if an isolate has been found, by the Pathogen Detection Project data processing pipeline, to belong to a SNP cluster and another isolate in that cluster has the same isolation type (and the isolation type is not NULL). If it has, the isolate will contain a "PDS*" accession number in the "SNP cluster" column of the Isolates Browser, along with a value in the "Min-same" and/or "Min-diff" columns (depending upon the composition of the SNP cluster).

    To view the SNP cluster for an isolate of interest, click on either the "PDT*" accession number in the "Isolate" column, or the "PDS*" accession number in the "SNP cluster" column. In the SNP Tree Viewer display, the branch lengths are proportional to the number of SNPs among the isolates in the cluster. Mouse over any branch to see its length.

    Note that the value of Min-same is n/a where the isolate does not have a value for isolation type. It is also n/a where there are no other isolates in the cluster with this isolate's isolation type, or if the isolate is not in any SNP cluster.

    To search for a range of values, enter a query such as:   minsame:[value1 TO value2]   with square brackets surrounding the query string, and with the word "TO" written in upper case. Data field names and values are case sensitive, and this data field name should be written in all lower case.

    Examples:
    • To search this field directly, enter a query such as:   minsame:[value1 TO value2]
    • Search for:   minsame:[0 TO 6]
      to retrieve isolates that are no more than 6 SNPs away from other isolates of the same isolation type within the same cluster. In other words, retrieve clinical isolates that have a distance of no more than 6 SNPs from other clinical isolates in the same cluster, or retrieve environmental isolates that have a distance of no more than 6 SNPs from other environmental isolates in the same cluster.
  • Min-diff (mindiff) list of Isolates Browser data fieldsback to top

    Minimum SNP distance from this isolate to one of a different isolation type. For example, the minimum SNP distance from a clinical isolate to an environmental isolate, or vice versa.

    A value will appear in the "Min-diff" column only if an isolate has been found, by the Pathogen Detection Project data processing pipeline, to belong to a SNP cluster and another isolate in that cluster has a different "Isolation type" that is not NULL. If it has, the isolate will contain a "PDS*" accession number in the "SNP cluster" column of the Isolates Browser, along with a value in the "Min-diff" and/or "Min-same" columns (depending upon the composition of the SNP cluster).

    To view the SNP cluster for an isolate of interest, click on either the "PDT*" accession number in the "Isolate" column, or the "PDS*" accession number in the "SNP cluster" column. In the SNP Tree Viewer display, the branch lengths are proportional to the number of SNPs among the isolates in the cluster. Mouse over any branch to see its length.

    Note that the value of Min-diff is n/a where the isolate does not have a value for isolation type. It is also n/a where no other isolate in the cluster has an isolation type different from this isolate's, or if the isolate is not in any SNP cluster.

    To search for a range of values, enter a query such as:   mindiff:[value1 TO value2]   with square brackets surrounding the query string, and with the word "TO" written in upper case. Data field names and values are case sensitive, and this data field name should be written in all lower case. Alternatively, Filters are a convenient way to search for ranges of values.

    Examples:
    • To search this field directly, enter a query such as:   mindiff:[value1 TO value2]
    • Search for:   mindiff:[0 TO 6]
      to retrieve isolates that are no more than 6 SNPs away from other isolates of a different isolation type within the same cluster. In other words, retrieve clinical isolates that have a distance of no more than 6 SNPs from environmental isolates in the same cluster, or vice versa.
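
    The Min-same and Min-diff definitions can be illustrated with a minimal sketch for a single SNP cluster. It assumes that pairwise SNP distances are already known and that None stands for both a NULL isolation type and an n/a result. The isolate names, distances, and function name are made up for illustration, and this is not the Pathogen Detection pipeline's own code.

```python
from typing import Dict, FrozenSet, Optional, Tuple


def min_same_diff(
    isolate: str,
    isolation_type: Dict[str, Optional[str]],
    snp_distance: Dict[FrozenSet[str], int],
) -> Tuple[Optional[int], Optional[int]]:
    """Return (min_same, min_diff) for one isolate within one cluster."""
    own_type = isolation_type[isolate]
    if own_type is None:
        # No isolation type for this isolate: both values are n/a.
        return None, None
    same, diff = [], []
    for other, other_type in isolation_type.items():
        if other == isolate or other_type is None:
            continue  # skip the isolate itself and isolates with a NULL type
        distance = snp_distance[frozenset((isolate, other))]
        if other_type == own_type:
            same.append(distance)
        else:
            diff.append(distance)
    return (min(same) if same else None, min(diff) if diff else None)


# Hypothetical cluster with four isolates.
isolation_type = {
    "iso_A": "clinical",
    "iso_B": "clinical",
    "iso_C": "environmental",
    "iso_D": None,
}
snp_distance = {
    frozenset(("iso_A", "iso_B")): 3,
    frozenset(("iso_A", "iso_C")): 7,
    frozenset(("iso_A", "iso_D")): 1,
    frozenset(("iso_B", "iso_C")): 2,
    frozenset(("iso_B", "iso_D")): 4,
    frozenset(("iso_C", "iso_D")): 5,
}

print(min_same_diff("iso_A", isolation_type, snp_distance))
```

    In this made-up cluster, iso_C has a Min-diff value but no Min-same value, because no other isolate shares its isolation type.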
  • N50 (asm_stats_contig_n50) list of Isolates Browser data fieldsback to top

    Assembly contig N50. This is a statistical measure that summarizes the contiguity of an assembly and is commonly used as an indicator of assembly quality. At least half of the bases in the assembly belong to contigs that have a length of N50 or longer.
    If the assembly was submitted to GenBank by the submitter, this value comes from their assembly and matches the assembly statistics in the Assembly database (https://www.ncbi.nlm.nih.gov/assembly/). If the assembly was made by the Pathogen Detection system, it may not yet be in GenBank, in which case this is the only place to see the assembly statistics.

    When searching the N50 data field, the value should be entered as an integer with no commas.
    To search for a range of values, enter a query such as:   asm_stats_contig_n50:[value1 TO value2]   with square brackets surrounding the query string, and with the word "TO" written in upper case. Data field names and values are case sensitive, and this data field name should be written in all lower case.

    Examples:
    • To search this field directly, enter a query such as:   asm_stats_contig_n50:[value1 TO value2]
    • Search for:   asm_stats_contig_n50:[1000000 TO 9999999]
      to retrieve isolates with genome assemblies that are highly contiguous (in this case, at least 50% of the assembly length is in contigs 1 Mbp or greater in size).
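
    The N50 definition given above can be worked through with a short sketch. It follows the standard definition (the largest contig length such that contigs of that length or longer hold at least half of the assembly's bases). The contig lengths are made up, and this is not necessarily how the Pathogen Detection pipeline computes the statistic.

```python
from typing import Sequence


def contig_n50(lengths: Sequence[int]) -> int:
    """Return the contig N50 for a list of contig lengths in base pairs."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


print(contig_n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

    For the toy lengths above, the total is 32 and the two longest contigs (8 + 8 = 16) already cover half of it, so the N50 is 8.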
  • Organism Group (taxgroup_name) list of Isolates Browser data fieldsback to top

    Organism group related by taxonomy for purposes of calculating SNP clusters.
    There is a one-to-one relationship between organism group and PDG accession. The organism group is effectively a shorthand for the predominant organism and does not list all of the organisms present. These organism groups are manually constructed and may include sister species and outgroups. To see the full list of organisms for each organism group, use the scientific_name field.

    Some organism groups are represented by the Genus species name, such as "Listeria monocytogenes," and others are represented as a phrase, such as "E.coli and Shigella."

    Data field names and values are case sensitive, and quotes can be used for phrase searches, as shown in the example below. The system will retrieve isolates that contain the exact organism group name that you entered, including capitalization, punctuation, and spaces.

    Examples:
    • To search this field directly, enter a query such as:   taxgroup_name:searchterm
    • Search for:   taxgroup_name:"Acinetobacter baumannii"

    Tips:
    Alternative ways to retrieve isolates that belong to a specific organism group include:
    • Use the "Select an organism group" menu that appears near the top of the Isolates Browser interface, OR
    • Open the complete list of Organism Groups and follow the links of interest to retrieve the isolates that belong to a group of interest.

    Technical note:
    • An organism group (PDG) contains one or more targets (PDTs). A PDT is a member of zero or one SNP cluster (PDS), and never more than one cluster. A SNP cluster is composed of two or more PDTs, and each PDS is completely contained within a PDG. (A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project.)
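
    The relationships in the technical note can be expressed as a small consistency check. This is a sketch only: the identifiers below are shortened placeholders rather than real accessions, and the function name check_group is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def check_group(targets: List[str], clusters: Dict[str, List[str]]) -> List[str]:
    """Return rule violations for one organism group (PDG)."""
    problems = []
    if not targets:
        problems.append("group has no targets")
    known = set(targets)
    membership = Counter()
    for pds, members in clusters.items():
        unique_members = sorted(set(members))
        if len(unique_members) < 2:
            problems.append(f"{pds} has fewer than two targets")
        for pdt in unique_members:
            membership[pdt] += 1
            if pdt not in known:
                # Each cluster must be completely contained within the group.
                problems.append(f"{pds} contains {pdt}, which is outside the group")
    for pdt, count in membership.items():
        if count > 1:
            problems.append(f"{pdt} belongs to {count} clusters")
    return problems


targets = ["PDT1", "PDT2", "PDT3"]
print(check_group(targets, {"PDS1": ["PDT1", "PDT2"]}))
print(check_group(targets, {"PDS1": ["PDT1", "PDT2"], "PDS2": ["PDT2", "PDT3"]}))
```

    In the first call PDT3 belongs to no cluster, which is allowed. In the second call PDT2 appears in two clusters, which violates the rule that a target is never in more than one cluster.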

  • Outbreak (outbreak) list of Isolates Browser data fieldsback to top

    The submitter-designated name for an occurrence of more cases of disease than expected in a given area or among a specific group of people over a particular period of time, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters.

    When searching this field, the query string you enter must match exactly the string that appears in the "Outbreak" column, including capitalization, punctuation, and spaces.

    Data field names and values are case sensitive, and quotes can be used for phrase searches, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   outbreak:"query string in quotes"
    • Search for:   outbreak:"1109COGX6-1 Cantaloupe"
    • Search for:   outbreak:"1203NYJAP-1"
    • To retrieve all isolates that have a value in the outbreak data field, enter a query that uses the asterisk (wildcard) as the value.

      Search for:   outbreak:*

      Once the search results are displayed, use the "Choose Columns" option at the top of the "Matched Isolates" table to add the "Outbreak" column to display, where you can browse the values that submitters entered in that data field.
  • PFGE Primary Enzyme Pattern (PFGE_PrimaryEnzyme_pattern) list of Isolates Browser data fieldsback to top

    Pulsed-field gel electrophoresis (PFGE) primary enzyme pattern, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters. When searching this field, the query string you enter must match exactly the string that appears in the "PFGE Primary Enzyme Pattern" column, including capitalization and punctuation.

    Data field names and values are case sensitive, as shown in the examples below.

    PFGE is a DNA fingerprinting technique used to differentiate bacterial strains based on the pattern of DNA fragments that are created by digesting their complete genome with a restriction enzyme. (Read about PFGE on the CDC website and in PubMed.)

    Examples:
    • To search this field directly, enter a query such as:   PFGE_PrimaryEnzyme_pattern:searchterm
    • Search for:   PFGE_PrimaryEnzyme_pattern:GX6A16.0016
    • Search for:   PFGE_PrimaryEnzyme_pattern:JFXX01.0787
    • To retrieve all isolates that have a value in the PFGE Primary Enzyme Pattern data field, enter a query that uses the asterisk (wildcard) as the value.

      Search for:   PFGE_PrimaryEnzyme_pattern:*

      Once the search results are displayed, use the "Choose Columns" option at the top of the "Matched Isolates" table to add the "PFGE Primary Enzyme Pattern" column to display, where you can browse the values that submitters entered in that data field.
  • PFGE Secondary Enzyme Pattern (PFGE_SecondaryEnzyme_pattern) list of Isolates Browser data fieldsback to top

    Pulsed-field gel electrophoresis (PFGE) secondary enzyme pattern, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters. When searching this field, the query string you enter must match exactly the string that appears in the "PFGE Secondary Enzyme Pattern" column, including capitalization and punctuation.

    Data field names and values are case sensitive, as shown in the examples below.

    PFGE is a DNA fingerprinting technique used to differentiate bacterial strains based on the pattern of DNA fragments that are created by digesting their complete genome with a restriction enzyme. (Read about PFGE on the CDC website and in PubMed.)

    Examples:
    • To search this field directly, enter a query such as:   PFGE_SecondaryEnzyme_pattern:searchterm
    • Search for:   PFGE_SecondaryEnzyme_pattern:EXHA26.0556
    • Search for:   PFGE_SecondaryEnzyme_pattern:GX6A12.0022
    • To retrieve all isolates that have a value in the PFGE Secondary Enzyme Pattern data field, enter a query that uses the asterisk (wildcard) as the value.

      Search for:   PFGE_SecondaryEnzyme_pattern:*

      Once the search results are displayed, use the "Choose Columns" option at the top of the "Matched Isolates" table to add the "PFGE Secondary Enzyme Pattern" column to display, where you can browse the values that submitters entered in that data field.
  • Platform (Platform) list of Isolates Browser data fieldsback to top

    Sequence Read Archive (SRA) sequencing platform.

    Data field names and values are case sensitive. The data field name, "Platform," should be written with a leading upper case letter, and the values are also case sensitive, as shown in the examples below.

    Examples:
    • To search this field directly, enter a query such as:   Platform:searchterm
    • Search for:   Platform:ILLUMINA

    List of supported platforms:
    • ILLUMINA
    • LS454
    • ION_TORRENT

  • PD Ref Gene Catalog version (refgene_db_version) list of Isolates Browser data fieldsback to top

    The version of the Pathogen Detection Reference Gene Catalog that was used to analyze a particular isolate.

    New isolates are analyzed using the latest version of the Pathogen Detection Reference Gene Catalog. Older isolates may have been analyzed with earlier versions of the Pathogen Detection Reference Gene Catalog. There might be occasional updates to annotation on all isolates in special circumstances, such as the identification of new genes (e.g., mobilized colistin resistance (mcr) genes).

    Because the "refgene_db_version" data field was added in February 2020, isolates that were analyzed prior to that time do not have a value in the corresponding "PD Ref Gene Catalog version" data column of the Isolates Browser display.

    (Separate sections of this document provide details about the Pathogen Detection data processing pipeline, Pathogen Detection Reference Gene Catalog help, and an overview of AMRFinderPlus that applies the Reference Gene Catalog data in the analysis of isolate genome assemblies. The AMRFinderPlus wiki provides details about installing and running the program, interpreting the results, and methods used.)

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the hyphens and periods in the examples below), and the use of wildcards such as the asterisk and question mark.

    Examples:
    • To search this field directly, enter a query such as:   refgene_db_version:searchterm
    • Search for:   refgene_db_version:2020-01-06.1
      to show all of the isolates that were analyzed with the Pathogen Detection Reference Gene Catalog version 2020-01-06.1.
    • Search for:   refgene_db_version:2020-01-22.1
      to show all of the isolates that were analyzed with the Pathogen Detection Reference Gene Catalog version 2020-01-22.1.
  • Run (Run) list of Isolates Browser data fieldsback to top

    Sequence Read Archive (SRA) accession of the sequence that was used for the genome assembly.

    Data field names and values are case sensitive. The data field name, "Run," should be written with a leading upper case letter, and the "SRR" accession prefix should be written in all upper case, as shown in the examples below. (A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project.)

    Examples:
    • To search this field directly, enter a query such as:   Run:searchterm
    • Search for:   Run:SRR3747659
    • Search for:   Run:SRR5862473 OR SRR7456389

  • Strain (strain) list of Isolates Browser data fieldsback to top

    Microbial strain name, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters.

    Data field names and values are case sensitive, as shown in the examples below.

    Separate sections of this document provide tips about the use of special characters such as the hyphen, wildcards such as the asterisk, and the use of quotes for phrase searches (for strain names that contain spaces).

    Examples:
    • To search this field directly, enter a query such as:   strain:searchterm
    • Search for:   strain:FDA00010279
    • Search for:   strain:KCRI-598A
    • Search for:   strain:PNUSA*

  • Serovar (serovar) list of Isolates Browser data fieldsback to top

    Combined field of sub-species, serotype, or serovar, if provided by the submitter.

    This field contains values exactly as they were entered by the data submitters.

    Data field names and values are case sensitive, as shown in the examples below.

    Separate sections of this document provide tips about the use of quotes for phrase searches, and special characters that appear in the sub-species, serotype, or serovar names.

    Examples:
    • To search this field directly, enter a query such as:   serovar:searchterm
    • Search for:   serovar:"4,[5],12:b:-"
    • Search for:   serovar:"Shigella sonnei"
    • Search for:   serovar:Enteritidis

  • SNP cluster (erd_group) list of Isolates Browser data fieldsback to top

    Pathogen SNP cluster accession. A SNP cluster is a group of isolates whose genome assemblies are closely related, depending on the clustering methodology used (as noted in the data processing section of this document).

    The SNP cluster accession data field name is erd_group, in which "ERD" stands for Epidemiologically Related Distance.

    Each SNP cluster can be viewed as a phylogenetic distance tree in the SNP Tree Viewer. (Read more in the SNP Tree Viewer help document, which includes an illustrated example of SNP Tree Viewer launch points and an illustrated example of a SNP Tree Viewer display.)

    Data field names and values are case sensitive, as shown in the examples below.

    The first sample search below includes an accession.version number. If you don't know the latest version number for a SNP cluster, you can use an asterisk * as a wildcard, as in the second example below. If you enter an older version number that has since been superseded by a newer version of the SNP cluster, the Isolates Browser will display a message that links to the newer version. The PDS version changes when the membership of a SNP cluster changes.

    A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project, and the data retention and history tracking section describes the use of accession.versions to track changes to the data.

    Examples:
    • To search this field directly, enter a query such as:   erd_group:searchterm
    • Search for:   erd_group:PDS000003441.73
    • Search for:   erd_group:PDS000003441.*
      with an asterisk (*) serving as a wildcard, if you don't know the version number of the SNP cluster accession.
    • Note: Because the SNP cluster accession is unique, it is not necessary to include the data field name in searches. It is sufficient to just enter the SNP cluster accession, if desired. For example the first search above can simply be entered as PDS000003441.73 into the Isolates Browser, and the second search can be entered as PDS000003441.*.
    Either one of the search examples above will retrieve isolates that belong to a SNP cluster associated with an E. coli and Shigella outbreak that was traced to All-Purpose Flour. In that tree, the short branches that connect clinical and environmental samples indicate a high degree of similarity in the genome sequences of those isolates. (For more information about the All-Purpose Flour outbreak, see the section of this document on "How to identify the possible source of an outbreak.")
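
    The accession.version convention and the wildcard search described above can be sketched as follows. The helper names are hypothetical; the sketch only manipulates the query text and does not contact the Isolates Browser.

```python
from typing import Optional, Tuple


def split_accession(acc: str) -> Tuple[str, Optional[str]]:
    """Split 'ACCESSION.version' into its two parts; version is None if absent."""
    base, sep, version = acc.partition(".")
    return base, (version if sep else None)


def any_version_query(acc: str, field: str = "erd_group") -> str:
    """Build a query that matches every version of an accession."""
    base, _ = split_accession(acc)
    # The asterisk stands in for an unknown version number.
    return f"{field}:{base}.*"


print(split_accession("PDS000003441.73"))
print(any_version_query("PDS000003441.73"))
```

    The same pattern fits other versioned accessions that accept a wildcard, such as wgs_master_acc, by passing a different field name.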
  • Scientific name (scientific_name) list of Isolates Browser data fieldsback to top

    Scientific name (in NCBI Taxonomy) of the isolate from the submitter.

    Data field names and values are case sensitive, and the genus name must begin with an upper case letter. For example, enter Escherichia coli with an upper case "E". The system will retrieve isolates that have the exact string you entered. An asterisk * can be used as a wildcard, if desired.

    Examples:
    • To search this field directly, enter a query such as:   scientific_name:searchterm
    • Search for:   scientific_name:"Escherichia coli O157:H7"
      to retrieve the isolates containing that full, exact string as the scientific name
    • Search for:   scientific_name:"Escherichia coli"
      to retrieve the isolates containing that exact string as the scientific name, with no additional characters.
    • Search for:   scientific_name:Escherichia*
      to retrieve the isolates containing Escherichia in the scientific name, followed by any other characters.
    An alternative way to search the scientific_name data field is to use the "Filters" option, which includes a "Scientific Name" text box, where you can enter the genus name (or the full genus and species name) of the pathogen, with the first letter of the genus capitalized. An autocomplete function will list the top 10 scientific names (based on number of isolates for each one) that begin with the term you entered. If your organism of interest doesn't fall within the top 10, then you can search the scientific_name data field directly for the organism of interest, as shown in the examples above.

    To retrieve all isolates that belong to a specific Organism group, use the "Select an organism group" menu on the Isolates Browser home page.
  • Source type (source_type) list of Isolates Browser data fieldsback to top

    The isolate source type. Possible values include Food, Animal, Environmental, Human, Animal feed.

    Data field names and values are case sensitive, and this data field name should be written in all lower case, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   source_type:searchterm
    • Search for:   source_type:Food
      to retrieve isolates with source_type Food.

    An alternative way to search the source_type data field is to use the "Filters" option, which includes a "Source type" text box, where you can enter the source_type string. There, it is also possible to search for null values by selecting <EMPTY>.

  • Species TaxID (species_taxid) list of Isolates Browser data fieldsback to top

    The NCBI Taxonomy identifier (TaxID) at the species level for this isolate.

    Data field names and values are case sensitive, and this data field name should be written in all lower case, as shown in the example below.

    The TaxID number for a species can be obtained from the NCBI Taxonomy database. For example, search the database for Escherichia coli, then follow the link for that species name to open its Taxonomy Browser display, which shows a TaxID of 562.

    Examples:
    • To search this field directly, enter a query such as:   species_taxid:searchterm
    • Search for:   species_taxid:562
      to retrieve all isolates that belong to the species Escherichia coli.
  • SRA Center (sra_center) list of Isolates Browser data fieldsback to top

    The name of the center that submitted the data to the Sequence Read Archive (SRA).

    Data field names and values are case sensitive, as shown in the examples below.

    The system will retrieve isolates that contain the exact query string you specified, including punctuation, capitalization, and spaces.

    Separate sections of this document provide tips about use of quotes for phrase searches and special characters (such as a hyphen) that are part of a query term.

    To browse the various values that are available in a data field, use the "Choose Columns" option at the top of the "Matched Isolates" table, select the desired column (data field) to display, then click on the column header to sort by the values in that column.

    Examples:
    • To search this field directly, enter a query such as:   sra_center:searchterm
    • Search for:   sra_center:EDLB-CDC
    • Search for:   sra_center:FDA

  • SRA Release Date (sra_release_date) list of Isolates Browser data fieldsback to top

    Sequence Read Archive (SRA) release date.

  • Stress genotypes (stress_genotypes) list of Isolates Browser data fieldsback to top

    Stress resistance genes found in the isolate during analysis with AMRFinderPlus. These can include metal, biocide, and heat resistance genes. This is a de-duplicated list, so multiple genes that share the same symbol will only be represented once. <NONE> indicates that AMRFinderPlus identified no stress resistance genes, while an empty field means AMRFinderPlus results are not yet available. See the AMRFinderPlus analysis type, PD Ref Gene Catalog version, and AMRFinderPlus version fields for more information about the AMRFinderPlus analysis of this isolate. (A separate section of this document provides an overview of AMRFinderPlus.)

    The genes that have been identified in an isolate's genome sequence are grouped into genotype categories, such as complete, partial, partial end of contig. The data processing pipeline section of this document provides more information about genotype categories.

    Data field names and values are case sensitive, as shown in the examples below.

    Examples:
    • To search this field directly, enter a query such as:   stress_genotypes:searchterm
    • Search for:   stress_genotypes:emrE
      to show all of the isolates that have the emrE gene.
    • Search for:   stress_genotypes:emrE AND stress_genotypes:merC
      to show all of the isolates that have both the emrE gene and the merC gene.
    Note: To learn more about a given gene, open the Pathogen Detection Reference Gene Catalog and search for the gene symbol of interest. For example, see the Reference Gene Catalog results of a search for emrE or merC. In the Pathogen Detection Reference Gene Catalog search results display, clicking on the gene symbol will retrieve the isolates that have been found to contain the gene.
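
    Queries that require several genes at once, as in the second example above, repeat the field name for each gene and join the terms with AND. A minimal sketch follows. The helper name is hypothetical, and gene symbols that contain special characters would need the extra handling described elsewhere in this document.

```python
from typing import Iterable


def all_genes_query(field: str, genes: Iterable[str]) -> str:
    """Build a query that requires every listed gene symbol in the given field."""
    terms = [f"{field}:{gene}" for gene in genes]
    if not terms:
        raise ValueError("at least one gene symbol is required")
    # Each term repeats the field name; AND is written in upper case.
    return " AND ".join(terms)


print(all_genes_query("stress_genotypes", ["emrE", "merC"]))
```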
  • TaxID (taxid) list of Isolates Browser data fieldsback to top

    The NCBI Taxonomy identifier (TaxID) for this isolate, which can have a classification that is narrower than species.

    Examples:
    • To search this field directly, enter a query such as:   taxid:searchterm
    • Search for:   taxid:83334
      to retrieve isolates for Escherichia coli O157:H7.
    Notes:

    Compare the TaxID data field that is described here with the "Species TaxID" data field that was described earlier.
    The Species TaxID data field contains taxonomy IDs at the Genus species level.
    The TaxID data field, in contrast, can contain classifications that are deeper than species, as shown in the example above.

    The TaxID for a species and/or for deeper nodes can be obtained from the NCBI Taxonomy database. For example, search the database for Escherichia coli, then follow the link for that species name to open its Taxonomy Browser display, which shows the TaxID for the species and lists the strains that fall under it. Follow the link for any strain name of interest to open its Taxonomy Browser display and view its TaxID.

    Some isolates might contain the same value in both fields, such as the E. coli isolates that are retrieved by a search for:
    species_taxid:562 AND taxid:562. Those isolates have just been classified at the Genus species level, and not any deeper.
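
    The comparison between the two fields can be sketched with a few made-up isolate records. The record layout (a dictionary with taxid and species_taxid keys) is an assumption for illustration. The TaxID values 562 and 83334 are the ones used in the examples above.

```python
from typing import Dict, List, Tuple

Record = Dict[str, int]


def split_by_depth(isolates: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate isolates classified at species level from those classified deeper."""
    at_species = [rec for rec in isolates if rec["taxid"] == rec["species_taxid"]]
    below_species = [rec for rec in isolates if rec["taxid"] != rec["species_taxid"]]
    return at_species, below_species


isolates = [
    {"taxid": 562, "species_taxid": 562},    # classified as Escherichia coli only
    {"taxid": 83334, "species_taxid": 562},  # classified as Escherichia coli O157:H7
]
at_species, below_species = split_by_depth(isolates)
print(len(at_species), len(below_species))
```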

  • Virulence genotypes (virulence_genotypes) list of Isolates Browser data fieldsback to top

    Virulence genes found in the isolate during analysis with AMRFinderPlus. This is a de-duplicated list, so multiple genes that share the same symbol will only be represented once. <NONE> indicates that AMRFinderPlus identified no virulence genes, while an empty field means AMRFinderPlus results are not yet available. See the AMRFinderPlus analysis type, PD Ref Gene Catalog version, and AMRFinderPlus version fields for more information about the AMRFinderPlus analysis of this isolate. (A separate section of this document provides an overview of AMRFinderPlus.)

    The genes that have been identified in an isolate's genome sequence are grouped into genotype categories, such as complete, partial, partial end of contig. The data processing pipeline section of this document provides more information about genotype categories.

    Data field names and values are case sensitive, as shown in the examples below.

    Examples:
    • To search this field directly, enter a query such as:   virulence_genotypes:searchterm
    • Search for:   virulence_genotypes:fdeC
      to show all of the isolates that have the fdeC gene.
    • Search for:   virulence_genotypes:fdeC AND virulence_genotypes:iroE
      to show all of the isolates that have both the fdeC gene and the iroE gene.
    Note: To learn more about a given gene, open the Pathogen Detection Reference Gene Catalog and search for the gene symbol of interest. For example, see the Reference Gene Catalog results of a search for fdeC or iroE. In the Pathogen Detection Reference Gene Catalog search results display, clicking on the gene symbol will retrieve the isolates that have been found to contain the gene.
  • WGS Accession (wgs_master_acc) list of Isolates Browser data fieldsback to top

    The Whole Genome Shotgun (WGS) accession for the master record. The WGS master record contains no sequence data, and instead lists all of the accession numbers for the individual sequence records that compose the genome assembly for the isolate.

    Tips:
    The WGS master accession should be entered in the form of Accession.version, as in the first example below.
    If you enter only the accession, no hits will be returned.
    If you don't know the version number, then you can use an asterisk (*) to serve as a wildcard, as in the second example below.
    Data field names and values are case sensitive, and the accession prefix must be in upper case, as shown in the examples below.

    A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project, and the data retention and history tracking section describes the use of accession.versions to track changes to the data.

    Examples:
    • To search this field directly, enter a query such as:   wgs_master_acc:searchterm
    • Search for:   wgs_master_acc:JZAA00000000.1
    • Search for:   wgs_master_acc:JZAA00000000.*
      with an asterisk (*) serving as a wildcard, if you don't know the version number of the WGS master record.
    A separate page provides more information about WGS data.
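
    The tips above can be sketched as a small helper that appends the wildcard when no version is given. This is an illustration only: the function name wgs_master_query is hypothetical, and the accession pattern (upper-case letters followed by digits) is an assumption modeled on the example JZAA00000000, not an official specification.

```python
import re

# Assumption: an upper-case letter prefix followed by digits, with an
# optional ".version" suffix, as in the example JZAA00000000.1.
_WGS_ACCESSION = re.compile(r"^([A-Z]+)(\d+)(?:\.(\d+))?$")


def wgs_master_query(accession):
    """Build a wgs_master_acc query, using * when the version is unknown."""
    match = _WGS_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits, version = match.groups()
    suffix = version if version is not None else "*"
    return f"wgs_master_acc:{prefix}{digits}.{suffix}"


print(wgs_master_query("JZAA00000000.1"))
print(wgs_master_query("JZAA00000000"))
```

    A lower-case prefix is rejected rather than corrected, since the field values are case sensitive.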

  • WGS Prefix (wgs_acc_prefix) list of Isolates Browser data fieldsback to top

    The stable accession prefix that is assigned to a Whole Genome Shotgun (WGS) project.

    Examples:
    • To search this field directly, enter a query such as:   wgs_acc_prefix:searchterm
    • Search for:   wgs_acc_prefix:JZAA
      to retrieve the isolate whose Whole Genome Shotgun (WGS) sequencing project was assigned the prefix JZAA.
    Background: A separate page provides more information about WGS data.

    (This is the end of the Isolates Browser data field descriptions.)

Examples of SOLR queries Isolates Browser Advanced Search, topic listback to top


Output from Isolates Browser Isolates Browser, topic listback to top

Tabular list of isolates | Exceptions table | Filters to refine results | Sort order
Customize the display (choose columns, default columns, additional columns)
SNP Tree Viewer link for each isolate that belongs to a SNP cluster
Show all AMR genotypes / Hide plus AMR genotypes button
"Share" function in the Isolates Browser
Illustrated example of Isolates Browser search results
Download data from the Isolates Browser web display (metadata, assemblies)
Isolates Browser in Google Cloud BigQuery

Tabular list of isolates Isolates Browser Output, topic listback to top

  • Upon opening the Isolates Browser, a table displays data for all available isolates, with the most recently added data at the top.
  • Every row in the Isolates Browser is an assembled isolate, possibly with antimicrobial resistance (AMR), virulence, and/or stress response genotype data, and antibiotic susceptibility (AST) phenotype data, as available.
  • The data for each isolate can also include strain name, geographic origin, isolation type (environmental or clinical), BioSample UID, K-mer group/organism group (PDG* accession), and more, as available. (See the Pathogens Isolates Browser data fields for a complete list.) Some of the data elements, such as accessions for corresponding BioSample and GenBank Assembly records, link to additional information in the source databases. The data in this table are either supplied by the submitter of the data into the BioProject, BioSample, SRA, and GenBank databases, and then collected from there by the Pathogen Detection system for display, or calculated by the Pathogen Detection system once the data are analyzed.
  • The isolates can be sorted by clicking on column headers, faceted by using filters (e.g., Property: has antimicrobial resistance (AMR) genotypes), or searched using basic or advanced queries (see examples of SOLR queries and an illustrated example of search results).
  • Tree Viewer links: If an isolate has a "PDS*" accession number in the "SNP Cluster" column, that indicates it is part of a SNP cluster, and you can click on the PDS* accession to launch the Tree Viewer and examine the relationships among your isolate of interest and other closely related isolates.

Exceptions table Isolates Browser Output, topic list back to top

  • The results of a search for certain isolates in an organism group may include isolates that failed quality control (QC) and so are not used for analysis. Isolates having "QC exceptions" are listed in an "Exceptions Table" along with QC details above the main grid display. Users and submitters can find out why specific isolates are not being used.
  • There are three "consequences" of QC failure:
    • Not published - The isolate will not appear in any published organism group (PDG).
    • Not clustered - The isolate will appear in a published organism group (PDG) but will be presented as a singleton (i.e., no clustering attempted).
    • Not submitted - The isolate will appear in a published organism group (PDG) and will be clustered, but its assembled sequence will not be submitted to GenBank.
  • There are several exception "types":
    • ANI species check - When aligned against a database of type strains using average nucleotide identity (ANI) on the assembled sequence, the biosample's species could not be verified.
    • Readset validation failure - The SRA run was not valid and could not be used for assembly.
    • Assembly validation failure - The pathogen assembly was not valid and could not be used for analysis.
    • wgMLST validation failure - The GenBank assembly could not be used for clustering.
    • Bad triples - The assembly failed a triangle inequality test in the legacy k-mer (i.e., non-wgMLST) clustering step.
  • The Exceptions table is published to both the Pathogen Isolates Browser and the FTP site. Further documentation about the FTP Exceptions file can be found in the FTP README file.
  • Exception columns are defined as follows:
    • exception type - The category of error
    • exception - Descriptive text for this category of error
    • consequence - The result of the error
    • lower limit - Lower allowed limit of the value if numeric
    • upper limit - Expected value, or upper limit of value if numeric
    • actual value - The value of the QC check for this isolate
    • BioSample - Biosample accession
    • run(s) - SRA accession for the sequencing run representing this isolate.
    • Isolate - Pathogen target accession for this isolate
    • Assembly - GenBank assembly accession for this isolate
    • organism - Organism this isolate was submitted with
    • strain - Strain this isolate was submitted with
    • sra center - SRA center that submitted the sequencing run
  • Click the download link to download the table in comma-delimited (.csv) or tab-delimited (.tsv) format.
  • Special note about assembly size validation: NCBI now validates the assembly size of most pathogenic bacterial organisms against fixed upper and lower bounds, which are set by species. The thresholds are the same for Pathogen Detection and GenBank. To check the thresholds for a given species, see assembly size cutoffs. A table of min/max values is also available as a downloadable TSV file.
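
Once the Exceptions table has been downloaded, it can be summarized with a short script. The sketch below assumes that the header row of the downloaded file uses the column names listed above (for example "exception type" and "consequence"); the actual file may spell them differently. The sample rows and accessions are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_column(handle, column, delimiter=","):
    """Count rows per distinct value of one column in a delimited file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return Counter(row[column] for row in reader)


# Made-up sample data; real files come from the download link.
SAMPLE = """exception type,consequence,Isolate
ANI species check,Not published,PDT000000001
Bad triples,Not clustered,PDT000000002
Bad triples,Not clustered,PDT000000003
"""

print(count_by_column(io.StringIO(SAMPLE), "consequence"))
print(count_by_column(io.StringIO(SAMPLE), "exception type"))
```

For a tab-delimited (.tsv) download, pass delimiter="\t" instead of the default comma.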

Filters to refine results Isolates Browser Output, topic listback to top

Sort order Isolates Browser Output, topic listback to top

  • The default sort order in the Isolates Browser is by Create Date (also known as target_creation_date). That is the date on which the isolate was first seen by the Pathogen Detection system. The isolates are shown in reverse chronological order, with the newest ones appearing at the top.
  • To change the sort order, click on a column header to sort by that criterion.
  • Example:
    • Open the Isolates Browser home page, which displays all available isolates in the default sort order.
    • Enter a search for strawberr*   (The asterisk is a wild card. The system therefore searches for the word stem and will retrieve isolates that contain terms such as strawberry, strawberries, etc. in any data field.)
    • By default, the isolates are sorted by Create Date.
    • Click on the "Organism" column header to sort alphabetically by organism name.
    • Each subsequent click on the same column header inverts the sort order. (The column header acts as a toggle switch to sort in ascending or descending order by the values in that column.)
    • To return to the original, default sort order, refresh the page (i.e., reload the Isolates Browser, or, if you have done a search, re-run the search).
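
The sorting behavior described above can be mimicked on downloaded data. This is a minimal sketch with made-up rows; the keys mirror the Organism column and the target_creation_date field named in this section, and the function names are hypothetical.

```python
from datetime import date

# Made-up rows for illustration only.
ISOLATES = [
    {"organism": "Salmonella enterica", "target_creation_date": date(2018, 7, 1)},
    {"organism": "Escherichia coli", "target_creation_date": date(2018, 9, 4)},
    {"organism": "Listeria monocytogenes", "target_creation_date": date(2018, 8, 15)},
]


def default_order(rows):
    """Default order: newest Create Date first."""
    return sorted(rows, key=lambda row: row["target_creation_date"], reverse=True)


def click_header(rows, column, descending=False):
    """Sort by a column and return the direction the next click would use."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=descending)
    return ordered, not descending


newest_first = default_order(ISOLATES)
by_name, next_descending = click_header(newest_first, "organism")
reversed_by_name, _ = click_header(by_name, "organism", next_descending)
print([row["organism"] for row in by_name])
print([row["organism"] for row in reversed_by_name])
```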

Customize the Isolates Browser display Isolates Browser Output, topic listback to top

The Isolates Browser output table initially displays the default columns (highlighted in the isolates browser column list), but you can use the "Choose Columns" option at the top of the "Matched Isolates" table to remove columns, add columns, or change the order of the columns. You can also drag column headers right and left to reorder them. Clicking on a column title will sort the list based on that column.

The options you select will persist within a given browser (e.g., Chrome, Edge, Internet Explorer, Firefox, Safari) until that browser's cookies are cleared/reset. To reset the column display and sort order to the default, click the "Choose Columns" button, then click Default and OK.

SNP Tree Viewer link for each isolate that belongs to a SNP cluster

  • If an isolate has a "PDS*" accession number in the "SNP Cluster" column of the Isolates Browser, this means the isolate's genome assembly has been found, via the Pathogens data processing pipeline, to be closely related to other isolate genome sequences in that SNP cluster.
  • Click on either the "PDS*" (Pathogen Detection SNP Cluster) accession number or the isolate's "PDT*" (Pathogen Detection Target) accession number to open the SNP Tree Viewer, which displays an interactive phylogenetic tree of all the isolates in the SNP cluster. (A separate section of this document provides more information about the SNP Tree Viewer.)
  • If the SNP Cluster column is blank for a given isolate, that means the isolate's genome assembly has not been found, by the Pathogens data processing pipeline, to be similar to any other isolate that is currently in the Pathogen Detection Project.

Show all AMR genotypes / Hide plus AMR genotypes button Isolates Browser Output, topic listback to top

"Share" function in the Isolates Browser Isolates Browser Output, topic listback to top

  • A "Share" button is available in the Isolates Browser search results display. It produces a URL that captures your search strategy, which can then be copied and shared with others to execute the search. The results of the search, however, will change over time as new data become available.

Illustrated example of Isolates Browser search results Isolates Browser Output, topic listback to top

Illustrated example of Pathogens Isolates Browser display, showing the results of a search for isolates that contain the terms escherichia, and FDA or CDC or USDA, and that have any value in the AST_phenotypes data field. The image shows the results as of July 24, 2018. Click on the image to open the current, live results for the search.
  • The illustration above shows the Pathogens Isolates Browser results (as of July 24, 2018) of a search for:
    escherichia AND (FDA CDC USDA) AND AST_phenotypes:*
    That search retrieves isolates that contain the term "escherichia" in any data field, and contain the term FDA or CDC or USDA in any data field, and contain any value in the AST_phenotypes data field.
  • Click on the illustration, or enter the query above into the Isolates Browser, to open the current, live results for that search. Once the results are displayed, use the "Choose Columns" option to customize the display, for example, by adding the "AST Phenotypes" column to the display.
  • The Isolates Browser help section of this document provides additional information about searching, including basic searches, advanced searches, available data fields, and additional examples of SOLR queries.

Cross-browser selection - display isolates in MicroBIGG-E Isolates Browser Output, topic listback to top

  • Selected isolates can be displayed in MicroBIGG-E, the Microbial Browser for Identification of Genetic and Genomic Elements, which displays the results of AMRFinderPlus analyses.
  • It is possible to view the full results in MicroBIGG-E for the isolates you have identified.
  • Click the Cross-browser selection button to the right of the Expand all button (you must be logged into your myNCBI account for this functionality). By default, all of the isolates from your Isolates Browser search will be selected, as indicated by the checkbox column; however, you can deselect rows manually.
  • Then click the Show in MicroBIGG-E button. A new tab will open with the MicroBIGG-E results for the selected isolates.
  • For example, having identified isolates that contain a blaKPC gene and a blaTEM-1 gene, you might want to use MicroBIGG-E to determine whether these genes co-occur on the same contig. After searching for AMR_genotypes:blaKPC* AND AMR_genotypes:blaTEM-1, click the Cross-browser selection button to the right of the Choose Columns button. By default, all of the isolates from your Isolates Browser search will be selected, as indicated by the checkbox column; however, you can deselect rows manually. Then click the "Show in MicroBIGG-E" button. A new tab will open with the MicroBIGG-E results for the selected isolates.

Isolates Browser data at Google Cloud Platform in BigQuery Isolates Browser Output, topic listback to top

Isolates browser and exceptions information is available on Google Cloud in BigQuery. See the Getting Started with BigQuery documentation for help getting started, and Isolates Browser data at Google Cloud Platform for details on the isolates and isolate_exceptions tables at Google BigQuery. From there the data can be analyzed and downloaded in bulk as well as linked to the microbigge table using SQL syntax.

Download data from the Isolates Browser web display Isolates Browser Output, topic listback to top

The Download button in the Pathogens Isolates Browser allows you to download either the metadata or the assemblies for all of the genomes currently displayed by the Isolates Browser, as described below. Please note that metadata can be downloaded for any isolate, whether or not it has been submitted to GenBank. In contrast, assemblies can only be downloaded for isolates that have been submitted to GenBank (i.e., for isolates that display an accession in the "Assembly" column). For bulk SQL access to table data see Isolates Browser data at Google Cloud Platform.
  • Metadata
    • Metadata can be downloaded for any isolate, whether or not it has been submitted to GenBank.
    • The Isolates Browser will download the data that are currently displayed into a comma separated value (*.csv) file.
    • For example, if you have chosen to customize the Isolates Browser display, only the columns you have chosen to display will be downloaded into the file.
    • Bulk data in tab-delimited format per organism group can also be downloaded from the FTP site. See the ReadMe.txt on the FTP site for more information.
    • To use SQL to query or to download >100,000 rows see also Isolates Browser data at Google Cloud Platform.
  • Assemblies
    • Assemblies can only be downloaded for isolates that have been submitted to GenBank:
      • The "Assembly" column will display an accession if an isolate's assembled genome sequence has been submitted to GenBank (because assemblies that have been submitted to GenBank are also represented in the Assembly Database).
      • The Assembly column will be blank if an isolate's genome sequence has not yet been submitted to GenBank. (The deposit of >500,000 isolates from the Pathogens Project into the GenBank database is an ongoing project. Many, but not all, of the isolates have been submitted to GenBank. Once the data for a given isolate have been deposited into GenBank, an accession will appear in the Assembly column, and the genomic data will be available for download at that time.)
    • Assembly data are downloaded as a Generic Feature Format (GFF) file. This is a tabular, nine-column file that contains the annotations generated by the Assembly Database API. The Assembly Database home page includes a link to the Genomes Download FAQ, which provides more information about data downloads.
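
Because assemblies are only available for isolates that show an accession in the "Assembly" column, it can be useful to split a downloaded metadata file on that column before requesting assemblies. The sketch below assumes the "Assembly" column was included in the display when the CSV file was downloaded; the sample rows and accessions are made up, and the function name is hypothetical.

```python
import csv
import io


def split_by_assembly(handle):
    """Separate rows that have an Assembly accession from rows that do not."""
    with_assembly, without_assembly = [], []
    for row in csv.DictReader(handle):
        has_accession = bool((row.get("Assembly") or "").strip())
        (with_assembly if has_accession else without_assembly).append(row)
    return with_assembly, without_assembly


# Made-up sample data; a real file comes from the Download button.
SAMPLE = """Isolate,Organism,Assembly
PDT000000001,Escherichia coli,GCA_000000001.1
PDT000000002,Escherichia coli,
"""

downloadable, pending = split_by_assembly(io.StringIO(SAMPLE))
print(len(downloadable), "isolate(s) with an assembly accession")
print(len(pending), "isolate(s) not yet submitted to GenBank")
```

If the "Assembly" column was hidden when the file was downloaded, every row falls into the second list, so check the column selection first.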

SNP Tree Viewer help back to top

What is the SNP Tree Viewer? Tree Viewer, topic listback to top

For all pathogen isolates that are clustered together as part of the data processing pipeline, a phylogenetic tree is built for each cluster. The trees can be used to: (1) identify the possible source of an outbreak based on the sequence similarity of the clinical and environmental isolates in a tree, (2) select isolates of interest and examine their relationships to other isolates in the SNP cluster, or to each other, and (3) retrieve metadata about the pathogen isolate.

The information below provides details on real time analysis, how to access the SNP Tree Viewer, scope of data in a tree and output (four panels in a tree viewer display), which include: [A] description of tree (organism group and number of isolates), [B] isolates selected (navigation panel), [C] table of all isolates in tree, [D] interactive phylogenetic distance tree.

Real time analysis Tree Viewer, topic listback to top

Unlike other NCBI systems such as BLAST, the Pathogen Detection project is not built with an interactive interface that allows users to upload their data and immediately obtain an answer. Instead, this project was set up to facilitate interactive analyses of large-scale surveillance projects that automatically submit real-time data to the NCBI archives. Those data are then routed to an automated pipeline that generates interactive web reports on a daily basis. The web displays allow users to search, browse, and filter the automatically analyzed data that have already been submitted.

How to access the SNP Tree Viewer Tree Viewer, topic listback to top

The SNP Tree Viewer can be accessed from the Isolates Browser. Any isolate that has a "PDS*" accession number in the "SNP Cluster" column has a link to the SNP Tree Viewer. ("PDS" is the accession number prefix for a Pathogen Detection SNP cluster.)

Example: The FDA's GenomeTrakr project (BioProject PRJNA230969) for the surveillance and rapid detection of foodborne contamination events includes a subset of E. coli isolates that belong to the SNP cluster "PDS000003441" and that were associated with a 2016 outbreak linked to all-purpose flour.

In the Isolates Browser display, you can click on the "PDS*" accession number that appears in the "SNP Cluster" column for any one of those isolates to open the SNP Tree Viewer display for the SNP cluster and interactively examine the phylogenetic distance tree. (Below is an illustrated example of SNP Tree Viewer launch points.)

The resulting SNP Tree Viewer display shows a number of clinical and environmental samples that are very closely related, and therefore sheds light on the possible source of the outbreak. The SNP Tree Viewer output section of this document includes an illustrated example of a SNP Tree Viewer display that includes isolates from the E. coli outbreak. (Read more on the CDC website about that outbreak.)

Illustration of Pathogens Isolates Browser output, showing launch points for the SNP Tree Viewer. Each SNP cluster (PDS*) accession opens a SNP Tree Viewer display.

Scope of data in a tree Tree Viewer, topic listback to top

The data processing pipeline section describes what data is available in the SNP Tree Viewer.

Individual phylogenetic trees for each SNP cluster are available on FTP as well as the NCBI Pathogen Detection Isolates Browser. (Separate sections of this file provide Isolates Browser help documentation and an overview of the data available on the FTP site.)

Output: four panels in a SNP Tree Viewer display Tree Viewer, topic listback to top

Description of tree (organism group and number of isolates)
Isolates selected (navigation panel)
Table of all isolates in tree
Interactive phylogenetic distance tree
Tree Viewer display controls:
Labels
Load Labels
Expand
Collapse
Subtree
Neighbors
Search & Highlight in Tree
"Share" function
Illustrated example of a SNP Tree Viewer display
"Watch" function to receive automatic e-mail notifications about new data related to selected isolate(s)
Illustrated example of automatic e-mail notification for a watched isolate

Description of tree Tree Viewer, topic listback to top

  • The top of a SNP Tree Viewer display provides summary information about the phylogenetic distance tree currently being displayed, such as:
    • Organism group, such as E. coli and Shigella, and the corresponding PDG accession.version for the group. (The "PDG" prefix = Pathogen Detection Group.)
    • Number of isolates in the tree, and the corresponding PDS accession.version for the tree. (The "PDS" prefix = Pathogen Detection SNP cluster.)
  • An example is shown in part A of the illustrated example of a SNP Tree Viewer display.
  • The composition of a tree can change over time as new data are added to the Pathogen Detection Project.
    (A separate section of this document on data retention and history tracking provides additional information about the ways in which data and analysis results continue to evolve.)

Isolates selected (navigation panel) Tree Viewer, topic listback to top

  • The navigation panel, which has the header "Isolates Selected" in the SNP Tree Viewer interface, allows for easy tree navigation based on the selection of isolates. Clicking on ANY isolate in the navigation panel will shift the focus of the tree to where that isolate is. This is especially critical for larger trees, where the number of isolates may be several thousand, or where the number of selected isolates is large.
  • The navigation panel also provides critical information on the similarity of isolates when there is more than one isolate selected, including min, max, and avg. SNP distances and the creation date range of the isolate(s), providing a quick and easy-to-use summary.
  • The number of items that are listed in the "Isolates Selected" section depends upon which link you followed from the Isolates Browser output to the SNP Tree Viewer display.
    • For example, the illustration of SNP Tree Viewer launch points (in the section on "how to access the SNP Tree Viewer") shows the Isolates Browser results from a search for the phrase "all-purpose flour" (as of September 4, 2018).
      • In the "Matched Clusters" section of the Isolates Browser results, clicking on the Pathogen Detection SNP cluster (PDS) accession would open a SNP Tree Viewer display with "10 Isolates Selected" out of the total 136 isolates in the tree. This is because 10 of the isolates that contain your search term have been found to belong to a SNP cluster. When you view the SNP cluster in SNP Tree Viewer, those 10 isolates will be automatically selected, and will be shown in red font in the interactive phylogenetic distance tree.
      • In the "Matched Isolates" section of the Isolates Browser results, clicking on an individual item (i.e., on an individual isolate's Pathogen Detection SNP cluster (PDS) or Pathogen Detection Target (PDT) accession) would open a SNP Tree Viewer display with only "1 Isolate Selected" out of the total 136 isolates in the tree.
    You can add or remove isolates from that list by clicking on isolates of interest in the phylogenetic tree to select/deselect them, by activating/deactivating their checkboxes in the table of all isolates in the tree, etc.
  • An example of the "Isolates Selected" navigation panel is shown as part B of the illustrated example of a SNP Tree Viewer display. It features six isolates: four clinical isolates, and two environmental isolates.
  • The selected isolates are also shown at the top of the table that lists all of the isolates in the SNP cluster, with their check boxes activated (as shown in part C of the illustrated example of a SNP Tree Viewer display).
  • The selected isolates are displayed in red font in the phylogenetic distance tree (as shown in part D of the illustrated example of a SNP Tree Viewer display).
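
The summary statistics shown in the navigation panel can be approximated offline. The sketch below assumes that the min, max, and avg. values are taken over all pairwise SNP distances among the selected isolates; that interpretation, the function name, and the distances are illustrative assumptions rather than a description of how the SNP Tree Viewer computes them.

```python
from itertools import combinations
from statistics import mean


def snp_distance_summary(selected, distances):
    """Summarize pairwise SNP distances among the selected isolates.

    distances maps frozenset({a, b}) to the SNP distance between a and b.
    Returns None when fewer than two isolates are selected.
    """
    values = [distances[frozenset(pair)] for pair in combinations(selected, 2)]
    if not values:
        return None
    return {"min": min(values), "max": max(values), "avg": mean(values)}


# Made-up distances between three hypothetical isolates.
DISTANCES = {
    frozenset({"PDT000000001", "PDT000000002"}): 2,
    frozenset({"PDT000000001", "PDT000000003"}): 4,
    frozenset({"PDT000000002", "PDT000000003"}): 6,
}

print(snp_distance_summary(["PDT000000001", "PDT000000002", "PDT000000003"], DISTANCES))
```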

Table of all isolates in tree Tree Viewer, topic listback to top

  • A table that lists all of the isolates in the SNP cluster appears above the phylogenetic distance tree. An example is shown in part C of the illustrated example of a SNP Tree Viewer display.
  • The table has the same data content as the Isolates Browser, but only for the subset of isolates in the currently viewed SNP cluster. The only additional data is a checkbox column that allows selections in the table to be reflected by selections in the tree and the navigation panel. Conversely, selections in the tree are reflected by selections in the table. The table can be hidden from view and customized in the same way as in the Isolates Browser. (A separate section of this document describes Isolates Browser output and provides information on display controls such as choose columns.)
  • The "Share" button at the top of the table produces a URL that captures your customized view of the tree, which can then be shared with others to reproduce the same view. Critically, this allows the user to highlight selected isolates, collapse certain parts of the tree, and generate a view that can be shared in a document or via email with collaborators. The URL is temporary; the customized display remains available for one month. (Read more about the "share" function and data retention.)

Interactive phylogenetic distance tree Tree Viewer, topic listback to top

  • The bottom section of a SNP Tree Viewer display shows an interactive phylogenetic distance tree, as shown in part D of the illustrated example of a SNP Tree Viewer display.
  • Isolates that you have selected are shown in red font. Click on any isolate of interest in a live SNP Tree Viewer display in order to open a menu that allows you to select/deselect it.
  • Display controls above a phylogenetic distance tree in the SNP Tree Viewer enable you to customize the view. Mouse over a control button in a live SNP Tree Viewer display to read about its function. Some of the controls include:

    • Labels button (at the top of the table that lists all of the isolates in the tree) allows you to determine which labels are displayed for the isolates in the tree view, from the set of labels that are available in the SNP Tree Viewer. The selections you make will persist within a given browser (e.g., Chrome, Edge, Internet Explorer, Firefox, Safari) until that browser's cookies are cleared/reset.
    • Load Labels button allows you to add custom labels to one or more isolates in the tree view. To do this:
      • On your local computer, create a tab-delimited text file (*.txt) that lists which isolates to label (by specifying their PDT* accessions), and which label(s) to add to a given isolate.
        • The text file should contain one line per PDT accession and label-value pair.
        • The text file can contain multiple lines with the same PDT accession. For example, if you want to add two custom labels to a given PDT, the file should contain two lines for that accession, with one label and value pair in each line.
        • The contents of a sample tab-delimited text file for loading custom labels could look like:
          PDT000123456 YourLabelName1 ValueA
          PDT000123456 YourLabelName2 ValueB
          PDT000456789 YourLabelName1 ValueC
          PDT000456789 YourLabelName3 ValueD
      • Save the text file on your local computer.
      • Click on the "Load Labels" button and choose the file you want to load.
      • A message will appear that says, Add N labels, where N is the number of properly formatted rows in your text file. (Each properly formatted row contains three items in a tab-delimited format: the PDT accession, a label name, and the value. If any item is missing from a row, that row will not be counted, and the information it contains will not be displayed in the tree view.)
        • In the case of the sample text file above, the message would say: Add 4 labels. The SNP Tree Viewer would then display ValueA and ValueB for PDT000123456, and ValueC and ValueD for PDT000456789, in addition to the other labels that were already shown for those isolates.
      • Note: the Share function will not capture the custom labels you added to the display. However, you can use the "Export" option to save the customized tree in Newick, PNG, or PDF format.
    • Expand button expands all branches (default)
    • Collapse button collapses branches to show 100 nodes. Clusters with fewer nodes will not be collapsed.
  • A Subtree menu appears if you click on the circle that represents a node in the tree. The Subtree menu includes options such as:
    • Subtree view opens only the subtree you have selected in a new tab.
    • Collapse subtree reduces the isolates in the branch into a blue cloud. Click on the collapsed node to open the menu and "Expand subtree" again, if desired.
    • As an example, see part D of the illustrated example of a SNP Tree Viewer display. The lower left hand corner includes an inset showing the Subtree menu.
  • The SNP Tree Viewer offers options to highlight or select groups of isolates in a single action, whether you are viewing all isolates in the tree or only a subtree. For example:
    • The "Neighbors" button (at the top of the table that lists all of the isolates in the tree) allows you to instantly select (i.e., show in red font in the tree and add to the list of "Selected isolates") all isolates that fall within a given SNP distance of your originally selected isolate(s).
    • "Search & Highlight in Tree" searches all labels that are currently displayed by the SNP Tree Viewer, including custom labels you might have added to the tree.
      • The browser will highlight (display in bold font) isolates that contain your search term in the tree.
      • The check mark icon that appears in the right hand side of the "Search & Highlight in Tree" text box allows you to select all of the highlighted isolates with a single click. Selected isolates are displayed in red font in the tree, and are added to the list of "Selected isolates" at the top of the SNP Tree Viewer display.
      • If you prefer to select individual isolates, rather than the complete set of highlighted isolates, simply left click on an isolate of interest and choose "select" from the pop-up menu.
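
The custom label file described under the Load Labels button above can be written and checked with a short script before it is loaded. This is a minimal sketch: it reads "properly formatted" as exactly three non-empty tab-separated items per row, which is an interpretation of the description above, and the function name is hypothetical.

```python
import csv
import io


def count_label_rows(handle):
    """Count rows that contain exactly three non-empty tab-separated items."""
    valid = 0
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) == 3 and all(field.strip() for field in fields):
            valid += 1
    return valid


# The sample rows from the Load Labels description above.
ROWS = [
    ("PDT000123456", "YourLabelName1", "ValueA"),
    ("PDT000123456", "YourLabelName2", "ValueB"),
    ("PDT000456789", "YourLabelName1", "ValueC"),
    ("PDT000456789", "YourLabelName3", "ValueD"),
]

buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(ROWS)
LABEL_TEXT = buffer.getvalue()

print(count_label_rows(io.StringIO(LABEL_TEXT)))  # prints 4
```

To create the file on disk, write LABEL_TEXT to a *.txt file and choose that file after clicking the "Load Labels" button.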

"Share" function in the SNP Tree Viewer Tree Viewer, topic listback to top

  • A "Share" button is available in the SNP Tree Viewer display (as shown in part C of the illustrated example of a SNP Tree Viewer display). It produces a URL that captures your customized view of the tree, which can then be copied and shared with others to reproduce the same view.
  • The URL is temporary, remaining valid for 60 days.
  • For the first 30 days, the URL will open the customized display, showing the isolates you selected and any other customizations you made to the view.
  • For the second 30 days, the URL continues to be valid, but during that time, it will only show a link to the default display for the most recent version of the SNP cluster. That is, the URL will not open the original customized view, but instead will redirect to a version of the phylogenetic distance tree that reflects the most recent data for the tree.

    (As noted above, under description of tree, the composition of a tree can change over time as new data are added to the Pathogen Detection Project. A separate section of this document describes the data retention and history tracking policy and examples of the ways in which data and analysis results continue to evolve.)
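
The lifetime of a shared URL described above can be expressed as a simple date calculation. The sketch below is an illustration only: the exact day on which each period ends is an assumption (days 0 through 29 are treated as the first period and days 30 through 59 as the second), and the function name and state labels are hypothetical.

```python
from datetime import date


def share_url_state(created, today):
    """Classify a shared SNP Tree Viewer URL by its age in days."""
    age = (today - created).days
    if age < 0:
        raise ValueError("today is earlier than the creation date")
    if age < 30:
        return "customized view"
    if age < 60:
        return "default view of most recent cluster version"
    return "expired"


created = date(2018, 9, 4)
print(share_url_state(created, date(2018, 9, 10)))
print(share_url_state(created, date(2018, 10, 10)))
print(share_url_state(created, date(2018, 12, 1)))
```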

Illustrated example of SNP Tree Viewer display Tree Viewer, topic listback to top

Each tree displays all members of a SNP cluster, defined as a group of isolates whose genome assemblies are closely related, depending on the clustering methodology used (as noted in the data processing section of this document). The "Filters" option can be used, if desired, to display a subset. The interactive phylogenetic distance tree is at the bottom of a SNP Tree Viewer display, and selected isolates are shown in red font in the tree.

Illustrated example of Pathogens SNP Tree Viewer display, showing the phylogenetic distance tree for a SNP cluster that contains isolates associated with an E. coli outbreak from all-purpose flour, reflecting data as of September 4, 2018. A footnote under the illustration describes how to open a live SNP Tree Viewer display for the most current data in that SNP cluster.
  • The illustration above shows the SNP Tree Viewer display (as of September 4, 2018) for the Pathogen Detection Group (organism group) PDG000000004.997 and the SNP cluster PDS000003441.80, which includes isolates associated with an E. coli outbreak from all-purpose flour. (Read about that outbreak on the CDC website.)
  • As noted above, under description of tree, the composition of a tree can change over time as new data are added to the Pathogen Detection Project.
  • To open a live display of the most recent data for the SNP cluster, you can search for PDS000003441 in the Isolates Browser. That will retrieve all isolates that currently belong to that SNP cluster. Then click on the PDS000003441 accession number in the SNP Cluster column for any isolate in the search results to open the SNP Tree Viewer display for the current data. (see illustrated example of SNP Tree Viewer launch points)
  • The SNP Tree Viewer help section of this document provides additional information about using the tool.
    A "Share" button on the SNP Tree Viewer display can be used to copy a URL that captures your customized view of the tree, which can then be shared with others to reproduce the same view. The URL is temporary; the customized display remains available for one month (read more about the "share" function).

Automatic E-mail Notifications of New Data back to top

Background Isolates Browser Automated Searches, topic listback to top

  • The NCBI Pathogen Detection Project data are updated frequently. The project includes a feature for automatic e-mail notifications of new data. It is a current awareness service to inform you about new data as it becomes available, for pathogens that are of interest to you. This feature is designed to allow users to search once, and then get automatic notifications if any pathogen isolates match their search criteria.
  • Components of the automatic e-mail notifications system include:

    • A "Save" button in the Isolates Browser interface,
      which allows you to save a search and automatically notifies you about new isolates that match the criteria of the saved search. (Read more and view an illustrated example.)
    • A "Watch" button in the SNP Tree Viewer interface,
      which allows you to watch one or more selected isolates in a tree, and automatically notifies you about new isolates that are similar to the isolate(s) you have chosen to watch, because they fall within the SNP distance that you have specified. (Read more and view an illustrated example.)

Limitations Isolates Browser Automated Searches, topic listback to top

  • Searches are triggered for every organism group update that is delivered to the Pathogen Browser, and an e-mail is sent for each set of hits per organism group. As a result, if a search is not specific to a certain organism (for example, a search for a particular antimicrobial resistance gene), search results may be delivered multiple times per day. This is considered a feature and not a bug. There are currently 22 organism groups, and more are expected in the future. Not all searches can currently be done.
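
As a rough illustration of why a search that is not organism-specific can produce several messages, the sketch below groups hits by organism group, with one notification per group. The function name, the hit tuples, and the isolate labels are made up for this example and do not reflect the actual notification system.

```python
from collections import defaultdict


def group_hits_by_organism_group(hits):
    """Return {organism_group: [isolate, ...]}; one notification per key."""
    notifications = defaultdict(list)
    for organism_group, isolate in hits:
        notifications[organism_group].append(isolate)
    return dict(notifications)


# Made-up hits for a saved search that is not specific to one organism.
hits = [
    ("Salmonella", "isolate_A"),
    ("E. coli and Shigella", "isolate_B"),
    ("Salmonella", "isolate_C"),
]
notifications = group_hits_by_organism_group(hits)
print(len(notifications))  # 2 organism groups, so 2 notifications
```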

Requirements for automatic e-mail notifications Isolates Browser Automated Searches, topic listback to top

My NCBI login | Perform search in Pathogens Isolates Browser
  1. My NCBI login
    • Searches are tied to an email address. The only way to do this is to use your My NCBI login. If you do not yet have a My NCBI account, it is easy to set one up and there is no cost.
    • You will need to be logged in to My NCBI in order to save searches, which will then be run in an automated way on a daily basis. The system will send e-mail notifications when new data arrive for a saved search.
    • You do not need to be logged in to receive the e-mail notifications. The notifications will be sent to the My NCBI email address you used when creating the account.
    • More information about My NCBI is available in the My NCBI help document and video overview (YouTube).
      • The main function of My NCBI for the Pathogens Isolates Browser is to associate your e-mail address with the searches that you save, so you can receive e-mail notifications about new data.
      • The My NCBI help document and video overview, above, provide general information about My NCBI and are included here as a general reference.
      • Some of the features described in the help document and video overview apply to NCBI databases that are within the Entrez search system, but might not apply to Pathogens, which is outside of that system because it uses a different search engine (SOLR).
      • For example, the Pathogens saved searches will not appear directly on your My NCBI account page, but are instead accessible through the "Saved Searches" link in the Pathogens Isolates Browser or the "Watched Isolates" link in the SNP Tree Viewer.

SAVE a search in the Pathogens Isolates Browser Isolates Browser Automated Searches, topic listback to top

  • After you have: (1) logged into your My NCBI account, and (2) performed a text search in the Pathogens Isolates Browser, you can use the "Save" button to store the search strategy.
  • Your search will then be run in an automated way on a daily basis.
  • You will receive automatic e-mail notifications only if/when new isolates become available that match your search criteria.
  • Use the "Saved Searches" link on the Pathogens Isolates Browser interface to view the list of your saved searches, and to edit or delete the searches.
  • The illustrated example below shows the "Save" button, the "Saved Searches" link, and a sample automatic e-mail for a saved Search.

Illustrated example of automatic e-mail notification for a Saved Search Tree Viewer, topic listback to top

Illustrated example of the Pathogens Isolates Browser SAVE function, including an example of an automatic e-mail message that contains a notification of new isolates that match the saved search.

WATCH an isolate in the SNP Tree Viewer Isolates Browser Automated Searches, topic listback to top

  • After you have (1) logged into your My NCBI account, (2) performed a search in the Pathogens Isolates Browser, and (3) launched the SNP Tree Viewer for any isolate retrieved by your search, you can use the "Watch" button to store the isolate in your My NCBI account and receive automatic e-mail notifications of closely related new isolates as they become available in the system.
    (A separate section of this document provides more details about how to access the SNP Tree Viewer as well as an illustrated example of SNP Tree Viewer launch points.)
  • If you select multiple isolates in the SNP Tree Viewer and then press the "Watch" button, all of the selected isolates will be added to your list of watched isolates.
  • The system will prompt you to enter a name for the watched isolate(s), and to specify the maximum SNP distance for receiving reports of new data.
  • Each isolate will be watched on a daily basis in an automated way.
  • You will receive automatic e-mail notifications only if/when new isolates become available that fall within the specified SNP distance of the isolate(s) that you selected in that tree view.
  • Use the "Watched Isolates" link on the SNP Tree Viewer interface to view your list of watched isolates, and to rename a watch, edit the SNP cutoff, or delete it from your list.
  • The illustrated example below shows the "Watch" button, the "Watched Isolates" link, and a sample automatic e-mail for a watched isolate.
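
The SNP distance cutoff described above can be pictured with a minimal sketch. The function name, the isolate labels, and the distances are hypothetical, and treating the cutoff as inclusive is an assumption; the actual notification logic is not documented here.

```python
def isolates_to_report(distances, max_snp_distance):
    """Return new isolates within max_snp_distance of at least one watched isolate.

    distances maps each new isolate to its SNP distances from the watched isolates.
    """
    return sorted(
        isolate
        for isolate, per_watched in distances.items()
        if any(d <= max_snp_distance for d in per_watched)
    )


# Made-up SNP distances from three new isolates to the watched isolate(s).
distances = {
    "new_isolate_1": [3, 40],
    "new_isolate_2": [25, 31],
    "new_isolate_3": [10],
}
print(isolates_to_report(distances, max_snp_distance=10))
```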

Illustrated example of automatic e-mail notification for a Watched Isolate Tree Viewer, topic listback to top

Illustrated example of the Pathogens SNP Tree Viewer WATCH function, including an example of an automatic e-mail message that contains a notification of new isolates that fall within the SNP distance you specified from an isolate that you are watching.


Antimicrobial Resistance (AMR) Resources back to top

AMR Overview Antimicrobial resistance (AMR) resources, topic listback to top

In response to the rising threat of antimicrobial resistance (AMR) in pathogens, the White House developed the National Action Plan for Combating Antibiotic-Resistant Bacteria in 2015 and updated that plan with the 2020-2025 National Action Plan for Combating Antibiotic-Resistant Bacteria. NCBI has built several resources and tools to achieve specific project goals, including comparing newly isolated pathogens to existing pathogen data to identify relationships, and analyzing the AMR repertoire of each isolate. The schematic illustration below shows the antimicrobial resistance resources in the NCBI Pathogen Detection project, including data sets and tools, as well as the relationships among them.

Schematic illustration showing the antimicrobial resistance (AMR) resources in the NCBI Pathogen Detection project.
Additional details about each resource are available:

AMR Landing page Antimicrobial resistance (AMR) resources, topic listback to top

The AMR landing page provides information about the NCBI National Database of Antibiotic Resistant Organisms (NDARO), a collaborative, cross-agency, centralized hub for researchers to access AMR data to facilitate real-time surveillance of pathogenic organisms. (Read more in the antimicrobial resistance factsheet.)

AMR Resources page Antimicrobial resistance (AMR) resources, topic listback to top

The AMR Resources page provides a list of available resources, with a brief description and sample searches or links to additional information about each one.

Pathogen Detection Reference Gene Catalog help Antimicrobial resistance (AMR) resources, topic listback to top

What is the Pathogen Detection Reference Gene Catalog? Pathogen Detection Reference Gene Catalog help, topic listback to top

The NCBI Pathogen Detection Reference Gene Catalog is a non-redundant database of bacterial genes related to antimicrobial resistance, biocide and stress resistance, general efflux, virulence, or antigenicity. A graphical user interface (GUI) allows you to browse and search the database.

Every row in the Pathogen Detection Reference Gene Catalog display is a reference gene or a point mutation.

Scope: the Reference Gene Catalog includes two data subsets: Pathogen Detection Reference Gene Catalog help, topic listback to top
  1. "Core": this subset includes highly curated, AMR-specific genes and proteins from the Bacterial Antimicrobial Resistance Reference Gene Database (BioProject PRJNA313047), plus point mutations. The sources of input for this curated database include: 1) allele assignments, 2) exchanges with other external curated resources, 3) reports of novel antimicrobial resistance proteins in the literature.
  2. "Plus": this subset includes genes related to biocide and stress resistance, general efflux, virulence, or antigenicity.

    The Pathogen Detection Reference Gene Catalog supersedes the previously available "AMR Reference Gene Browser," which encompassed only the protein-coding genes of the "core" set.
Non-redundant Pathogen Detection Reference Gene Catalog help, topic listback to top
  • The definition of redundant (or 'non-unique') will differ, depending on the type of data element (allele, gene, or point mutation). For example:
  • An ALLELE should only ever show up once in the table. An allele is a unique protein sequence that corresponds to a unique gene symbol, and so, by definition, should occur only once.
  • An allele name for a POINT MUTATION can occur in multiple rows of the Reference Gene Catalog, if the point mutation is found in different organisms, and if the proteins in those organisms are not identical.
    • For example, the allele name gyrA_D82G occurs in both E. coli and Salmonella. Each of those organisms has its own reference sequence protein (WP_* accession), because the protein sequences are not identical. The E. coli gyrA protein sequence is WP_001281243.1, and the Salmonella gyrA protein is WP_001281271.1.
    • If, on the other hand, two or more organisms have an identical protein sequence for a given gene, and the same allele has been found in all of those organisms, there will be a single row in the Reference Gene Catalog, showing the allele name and the Reference Sequence WP_* accession. In such a case, the organism field will have no value, because the organism field is populated only if an allele or gene has been found to be unique to an organism.
  • A given GENE SYMBOL can have multiple rows in the table, as multiple proteins can be assigned the same gene symbol, but each WP_* accession should be unique.

    Details about WP_* accessions are provided on the web pages that describe the RefSeq non-redundant proteins, the Prokaryotic RefSeq Genome Re-annotation Project, and the New RefSeq protein product and data model.
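
As an illustration of the point-mutation case above, the sketch below groups rows by allele name and lists the distinct WP_* accessions for each. The two rows restate the gyrA_D82G example from the text; the dictionary keys and the function name are chosen for this example and are not a documented export format.

```python
from collections import defaultdict

# The gyrA_D82G example from the text: one allele name, two non-identical proteins.
rows = [
    {"allele": "gyrA_D82G", "organism": "E. coli",
     "refseq_protein_accession": "WP_001281243.1"},
    {"allele": "gyrA_D82G", "organism": "Salmonella",
     "refseq_protein_accession": "WP_001281271.1"},
]


def accessions_by_allele(rows):
    """Map each allele name to the sorted list of distinct protein accessions."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["allele"]].add(row["refseq_protein_accession"])
    return {allele: sorted(accessions) for allele, accessions in grouped.items()}


print(accessions_by_allele(rows))
```

An allele name that maps to more than one accession corresponds to the situation described above, in which the same point mutation is recorded against different reference proteins.
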
Relationship between the Pathogen Detection Reference Gene Catalog and Pathogens Isolates Browser Pathogen Detection Reference Gene Catalog help, topic listback to top
Relationship between the Pathogen Detection Reference Gene Catalog and the Reference Gene Hierarchy Pathogen Detection Reference Gene Catalog help, topic listback to top

Where to access the Pathogen Detection Reference Gene Catalog Pathogen Detection Reference Gene Catalog help, topic listback to top

The Pathogen Detection Reference Gene Catalog is accessible from the Pathogen Detection Project home page (as a link in the right hand margin under "Data Resources"), from the AMR landing page (National Database of Antibiotic Resistant Organisms (NDARO)), and from the AMR Resources page.

Browse/Search the Reference Gene Catalog:
/pathogens/refgene/.

Download Reference Gene Catalog data:

Data from the Reference Gene Catalog can be downloaded in multiple formats. From the web interface, you can download the sequence and table data that you see by clicking the Download button at the top of the table (see the Output section for more information).

To get the data in table format, click Download, select the File type: Table, choose either tab-delimited (.tsv) or comma-delimited (.csv), and specify a filename for the download. Only the rows and columns that are visible in the table view on the web interface will be included in the downloaded file.
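
A downloaded table can be read with standard tools. The sketch below uses Python's csv module on a small made-up sample; the column headers and the row values are assumptions for illustration, since the actual headers depend on the columns visible in your table view.

```python
import csv
import io


def read_catalog_table(text, delimiter="\t"):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Made-up sample standing in for the contents of a downloaded .tsv file.
sample = "Allele\tType\tScope\nblaB-1\tAMR\tcore\n"
rows = read_catalog_table(sample)
print(rows[0]["Allele"])
```

For a comma-delimited (.csv) download, the same function could be called with delimiter=",".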

To get sequence data from the web interface, click the Download button, then select the File type: Dataset. Choose Reference nucleotide, Reference nucleotide with flanks, and/or the Reference protein sequence to download in FASTA format. Note that reference sequences for point mutations will be the "wildtype" references not including the mutations, and that RNA genes or promoter region references will not have protein sequences. Flanking nucleotide sequences may be limited to 100 bp or less, depending on the source sequences in GenBank or RefSeq. The downloaded .zip file will be in the "Datasets" format, which includes the metadata for the included sequences in JSON format. See the NCBI Datasets documentation for more information on metadata file formats.



Search tips for the Pathogen Detection Reference Gene Catalog Pathogen Detection Reference Gene Catalog help, topic listback to top

Allowable search terms (Pathogen Detection Reference Gene Catalog) Pathogen Detection Reference Gene Catalog help, topic listback to top
Basic search (Pathogen Detection Reference Gene Catalog) Pathogen Detection Reference Gene Catalog help, topic listback to top
Advanced search (Pathogen Detection Reference Gene Catalog) Pathogen Detection Reference Gene Catalog help, topic listback to top
Filters (Pathogen Detection Reference Gene Catalog) Pathogen Detection Reference Gene Catalog help, topic listback to top
  • The "Filters" menu options in the Pathogen Detection Reference Gene Catalog enable you to facet or subset the data in a variety of ways, and therefore can be used to refine your results, whether you have done a basic search or an advanced search.
  • By default, each filter displays the top 100 terms (ranked by the number of items that each term retrieves), listed by count within that set of 100. Note that:
    • A Boolean "OR" is applied if multiple items are checked in the same filter field. This way you can choose multiple values in the same filter. For example:
      • Open the "Filters" tab of the Pathogen Detection Reference Gene Catalog, then check the boxes for "Stress" and for "Virulence" in the "Type" filter. The system will retrieve genes that are associated with either stress resistance or with virulence.
    • A Boolean "AND" is applied if you select items in several different filter fields (Type, Class, etc). For example:
      • Open the "Filters" tab of the Pathogen Detection Reference Gene Catalog, then check the boxes for "Point" in the "Subtype" filter and "Quinolone" in the "Class" filter. The system will retrieve alleles that meet both of your specified criteria (in this case, point mutations that confer resistance to quinolones).
  • As explained in the Isolates Browser help, Filters are generated on the fly. As a result, the terms that are listed under each filter will depend on the data set you are currently displaying in the browser. That is also true for the filters in the Pathogen Detection Reference Gene Catalog.
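
The way filter selections combine can be sketched as a small predicate: values checked within one filter field are joined with OR, and different filter fields are joined with AND. The rows and the function name below are made up for illustration, and the upper-case values follow the examples in the data field descriptions rather than the checkbox labels.

```python
def matches_filters(row, checked):
    """Return True if the row satisfies every filter field that has checked values.

    checked maps a filter field to the set of values ticked in that field.
    Values within one field are combined with OR; fields are combined with AND.
    """
    return all(
        row.get(field) in values
        for field, values in checked.items()
        if values  # a field with nothing checked does not restrict the results
    )


# Made-up rows, for illustration only.
rows = [
    {"id": "r1", "type": "AMR", "subtype": "POINT", "class": "QUINOLONE"},
    {"id": "r2", "type": "AMR", "subtype": "AMR", "class": "QUINOLONE"},
    {"id": "r3", "type": "STRESS", "subtype": "HEAT", "class": "HEAT"},
    {"id": "r4", "type": "VIRULENCE", "subtype": "VIRULENCE", "class": "INTIMIN"},
]

# OR within one field: stress or virulence.
or_hits = [r["id"] for r in rows if matches_filters(r, {"type": {"STRESS", "VIRULENCE"}})]
# AND across fields: point mutations in the quinolone class.
and_hits = [
    r["id"] for r in rows
    if matches_filters(r, {"subtype": {"POINT"}, "class": {"QUINOLONE"}})
]
print(or_hits, and_hits)
```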

Data Fields in the Pathogen Detection Reference Gene Catalog Pathogen Detection Reference Gene Catalog help, topic listback to top

The data fields listed below have been indexed by the Pathogen Detection project and are therefore directly searchable, using the advanced search techniques that are described in the Isolates Browser help, because both the Pathogen Detection Reference Gene Catalog and the Isolates Browser use the SOLR query language. Note that the data field names and values are case sensitive, as described in the Isolates Browser help.

Each data field reflects an available column in the Pathogen Detection Reference Gene Catalog web interface. The output section of this document provides tips on how to customize the display, using the "choose columns" function.

Please note: in the list of available data fields below:
  • The term shown in the regular font is the display name (column header) shown by the Pathogen Detection Reference Gene Catalog web interface. The term shown in (italics) is the name of the corresponding data field, if you want to search that field directly.
  • For example, one data field is listed as: gene family (gene_family). The phrase "gene family" (with a space between the words) appears in the Reference Gene Catalog column header, and "gene_family" (with an underscore bar instead of a space) is the string you should use if you want to search that data field directly.
  • Brief italicized search examples are also provided for each data field, when possible, showing how to query the data field directly. The values represent text strings exactly as they appear in the data fields, including upper case and lower case letters, special characters such as hyphens, etc. The data field names and values are case sensitive.
The available data fields in the Pathogen Detection Reference Gene Catalog include the following: Pathogen Detection Reference Gene Catalog help, topic listback to top
Note that each field is written in this format:   display name (data_field_name)
The "display name" is the column header that appears in the Reference Gene Catalog web interface, and the "data_field_name" is the case-sensitive string you should enter if you want to search the data field directly using a SOLR query:
Allele (allele)
Gene family (gene_family)
Product name (product_name)
Scope (scope)
Type (type)
Subtype (subtype)
Class (class)
Subclass (subclass)
RefSeq protein accession (refseq_protein_accession)
RefSeq nucleotide accession (refseq_nucleotide_accession)
GenBank protein accession (genbank_protein_accession)
GenBank nucleotide accession (genbank_nucleotide_accession)
organism fields:
Curated refseq start (curated_refseq_start)
GenBank start (genbank_start)
GenBank stop (genbank_stop)
GenBank strand (genbank_strand)
RefSeq start (refseq_start)
RefSeq stop (refseq_stop)
RefSeq strand (refseq_strand)
PubMed reference (pubmed_reference)
synonyms (synonyms)
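
For the fields listed above, the data field name appears to be the display name in lower case with spaces replaced by underscores. The sketch below relies on that observation, which is an assumption rather than a documented rule; the helper name and the subset of fields are chosen for this example.

```python
# A subset of the data field names listed above.
SEARCHABLE_FIELDS = {
    "allele", "gene_family", "product_name", "scope", "type", "subtype",
    "class", "subclass", "refseq_protein_accession",
    "refseq_nucleotide_accession", "pubmed_reference", "synonyms",
}


def data_field_name(display_name):
    """Convert a column header such as 'Gene family' to 'gene_family'."""
    candidate = "_".join(display_name.split()).lower()
    if candidate not in SEARCHABLE_FIELDS:
        raise KeyError(f"unknown data field: {display_name!r}")
    return candidate


print(data_field_name("Gene family") + ":bla2")
```
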
  • Allele (allele) list of Reference Gene Catalog data fieldsback to top

    Gene or allele. If the data element is an allele (e.g., 23S_C2627A), its name reflects the name of the gene family in which a point mutation was found, the location coordinate of the mutation, and the wild-type and mutated nucleotides/amino acids.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   allele:searchterm
    • Search for:   allele:gyrA_D82G
      to show all alleles with that name.
      (A separate section of this document describes the non-redundant nature of the Reference Gene Catalog, and how the definition of redundant (or 'non-unique') will differ, depending on the type of data element (allele, gene, or point mutation).)
    • Search for:   allele:blaB-1
      to show the reference gene for the blaB-1 allele: subclass B1 metallo-beta-lactamase BlaB-1.
    • Search for:   allele:blaB-*
      to show the reference genes for all blaB alleles.
  • Gene family (gene_family) list of Reference Gene Catalog data fieldsback to top

    Gene symbol, or, if a point mutation, the reference gene symbol.

    Data field names and values are case sensitive. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   gene_family:searchterm
    • Search for:   gene_family:bla2
      to show members of the bla2 gene family: BcII family subclass B1 metallo-beta-lactamases. Each hit will correspond to a unique protein sequence, and corresponding unique nucleotide sequence. That is, each hit will have a unique WP_* accession (refseq_protein_accession), and/or a corresponding unique NG_* accession (refseq_nucleotide_accession). (A separate section of this document describes the non-redundant nature of the Reference Gene Catalog, and how the definition of redundant (or 'non-unique') will differ, depending on the type of data element (allele, gene, or point mutation).)
  • Product name (product_name) list of Reference Gene Catalog data fieldsback to top

    Name of gene product or genomic region.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of quotes to search for a phrase.

    Examples:
    • To search this field directly, enter a query such as:   product_name:searchterm
    • Search for:   product_name:"BcII family subclass B1 metallo-beta-lactamase"
      to show all entries in the Reference Gene Catalog that have the exact product name that you specified, including upper and lower case letters as well as special characters (in this case, hyphens). As of July 5, 2019, the search retrieves 6 hits.

      Note: If the search is entered without quotes surrounding the product name, such as:
      Search for:   product_name:BcII family subclass B1 metallo-beta-lactamase then each space is interpreted by the search system as a Boolean OR. As of July 5, 2019, the search retrieves 1,466 hits.
      (read more about SOLR operators)
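
To avoid the unquoted-phrase pitfall described above when building queries programmatically, a value that contains spaces can be wrapped in double quotes. This is a minimal sketch with a hypothetical helper name; it does not cover escaping of other special characters, which is described in other sections of this document.

```python
def field_query(field, value):
    """Build a field:value query, quoting the value if it contains whitespace."""
    if '"' in value:
        raise ValueError("values containing double quotes are not handled here")
    if any(ch.isspace() for ch in value):
        return f'{field}:"{value}"'
    return f"{field}:{value}"


print(field_query("product_name", "BcII family subclass B1 metallo-beta-lactamase"))
print(field_query("scope", "plus"))
```
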
  • Scope (scope) list of Reference Gene Catalog data fieldsback to top

    This field specifies the data subset to which an allele or gene belongs, and the value can either be core (curated for relevance to resistance, usually AMR-specific genes and point mutations) or plus (genes related to biocide and stress resistance, general efflux, virulence, or antigenicity, or where the presence of this gene may not be informative as to resistance phenotype or the relationship is not clear).

    Data field names and values are case sensitive. In this case, both the data field name and the value are written in all lower case, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   scope:searchterm
    • Search for:   scope:plus
      to show the genes in the "plus" subset of the Pathogen Detection Reference Gene Catalog. That subset includes genes related to biocide and stress resistance, general efflux, virulence, or antigenicity.
  • Type (type) list of Reference Gene Catalog data fieldsback to top

    Classification for the type of gene found, such as AMR, STRESS, or VIRULENCE. A more detailed description of the type and subtype fields is available on the AMRFinderPlus wiki.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the phenotype associated with the genetic element.)

    Data field names and values are case sensitive, and the values for this data field are written in upper case, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   type:searchterm
    • Search for:   type:STRESS
      to show genes that confer stress resistance.
      As an alternative method for retrieving those genes, you can open the "Filters" function of the Reference Gene Catalog and check the box for the desired Type. By doing so, the Filters function will refresh itself to show the subtype values that are available for the type you have selected, enabling you to further narrow your search results, if desired. For example, the subtype values under STRESS currently include BIOCIDE, HEAT, and METAL. (As noted below, filters are generated on the fly and reflect the attributes of the data that you are currently viewing.)
  • Subtype (subtype) list of Reference Gene Catalog data fieldsback to top

    Classification for the subtype of gene found. A more detailed description of the type and subtype fields is available on the AMRFinderPlus wiki.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the substrate.)

    Data field names and values are case sensitive, and the values for this data field are written in upper case, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   subtype:searchterm
    • Search for:   subtype:HEAT
      to show genes that confer heat resistance.
      As an alternative method for retrieving those genes, you can open the "Filters" function of the Reference Gene Catalog and check the box for the desired Subtype. By doing so, the Filters function will refresh itself to show the corresponding type under which the selected subtype falls. For example, the subtype value of HEAT falls under the type STRESS. (As noted below, filters are generated on the fly and reflect the attributes of the data that you are currently viewing.)
  • Class (class) list of Reference Gene Catalog data fieldsback to top

    "Class" provides a broad definition of the phenotype affected by the gene or allele, and includes resistance phenotypes such as antimicrobial and stress resistance, virulence, and antigenicity. For some virulence genes this field contains typing information. More information about class and subclass fields can be found on the AMRFinderPlus wiki.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the substrate.)

    Data field names and values are case sensitive, and the values for this data field are written in upper case, as shown in the example below.

    Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   class:searchterm
    • Search for:   class:BETA-LACTAM
      to show all genes classified as BETA-LACTAM.
      As an alternative method for retrieving those genes, you can open the "Filters" function of the Reference Gene Catalog and check the box for the desired Class. By doing so, the Filters function will refresh itself to show the subclass values that are available for the class you have selected, enabling you to further narrow your search results, if desired. For example, the subclass values under BETA-LACTAM currently include BETA-LACTAM, CARBAPENEM, CEPHALOSPORIN, CEPHALOTHIN, and METHICILLIN. (As noted below, filters are generated on the fly and reflect the attributes of the data that you are currently viewing.)
  • Subclass (subclass) list of Reference Gene Catalog data fieldsback to top

    Where it is known, "Subclass" provides a more specific definition of the particular antibiotics or classes that are affected by the gene or point mutation (e.g., that are resisted by the gene/allele). While most subclass designations are self-explanatory, a few others have particular meanings. Specifically, "CEPHALOSPORIN" is equivalent to the Lahey 2be definition; "CARBAPENEM" means the protein has carbapenemase activity, but it might or might not confer resistance to other beta-lactams; "QUATERNARY AMMONIUM" are quaternary ammonium compounds. In addition, stx subtypes (e.g., STX2E) and intimin subtypes (e.g., ALPHA) are defined for Shiga toxin proteins (class of STX1 or STX2) and intimins (class of INTIMIN) respectively. Where the phenotypic information is incomplete, contradictory, or unclear, the "Class" value is used for the "Subclass" value.

    More information about the class and subclass fields can be found on the AMRFinderPlus wiki.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the substrate.)

    Data field names and values are case sensitive, and the values for this data field are written in upper case, as shown in the example below.

    Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   subclass:searchterm
    • Search for:   subclass:CEPHALOSPORIN
      to show genes that confer resistance to cephalosporin antibiotics.
      As an alternative method for retrieving those genes, you can open the "Filters" function of the Reference Gene Catalog and check the box for the desired subclass. The Filters function will then refresh itself to show the corresponding class under which the selected subclass falls. For example, the subclass value of CEPHALOSPORIN falls under the class BETA-LACTAM. (As noted below, filters are generated on the fly and reflect the attributes of the data that you are currently viewing.)
  • RefSeq protein accession (refseq_protein_accession) list of Reference Gene Catalog data fieldsback to top

    Accession of the RefSeq protein sequence record in which the gene or allele is found. It generally has a WP_* prefix. (Read more about RefSeq, the distinct format of RefSeq accessions, and the various accession prefixes that appear in the Pathogen Detection project.)

    Enter the sequence record identifier in the accession.version format, as shown in the first example below.
    If you don't know the version of the sequence record, enter just the accession, as shown in the second example below.

    Examples:
    • To search this field directly, enter a query such as:   refseq_protein_accession:searchterm
    • Search for:   refseq_protein_accession:WP_001281243.1
      to show the Reference Gene Catalog entries associated with this RefSeq protein sequence record. If multiple alleles have been found to exist in this protein, there will be a separate entry for each allele. (A separate section of this document describes the non-redundant nature of the Reference Gene Catalog, and how the definition of redundant (or 'non-unique') differs depending on the type of data element: allele, gene, or point mutation.)
    • Search for:   refseq_protein_accession:WP_001281243
      to show the Reference Gene Catalog entries associated with this RefSeq protein accession, regardless of its version number. (The Reference Gene Catalog contains the latest version of a sequence record; if you don't know what version number is the latest, enter only the accession as your query, without any dot or version number.)
  • RefSeq nucleotide accession (refseq_nucleotide_accession) list of Reference Gene Catalog data fieldsback to top

    Accession of the RefSeq nucleotide sequence record in which the gene or allele is found. It generally has an NG_* prefix. (Read more about RefSeq, the distinct format of RefSeq accessions, and the various accession prefixes that appear in the Pathogen Detection project.)

    Enter the sequence record identifier in the accession.version format, as shown in the first example below.
    If you don't know the version of the sequence record, enter just the accession, as shown in the second example below.

    Examples:
    • To search this field directly, enter a query such as:   refseq_nucleotide_accession:searchterm
    • Search for:   refseq_nucleotide_accession:NG_047553.1
      to show the Reference Gene Catalog entry associated with this RefSeq nucleotide sequence record.
    • Search for:   refseq_nucleotide_accession:NG_047553
      to show the Reference Gene Catalog entries associated with this RefSeq nucleotide accession, regardless of its version number. (The Reference Gene Catalog contains the latest version of a sequence record; if you don't know what version number is the latest, enter only the accession as your query, without any dot or version number.)
  • GenBank protein accession (genbank_protein_accession) list of Reference Gene Catalog data fieldsback to top

    Accession of the GenBank protein sequence record in which the gene or allele is found. (Read more about the format of GenBank accessions, and about the various accession prefixes that appear in the Pathogen Detection project.)

    Enter the sequence record identifier in the accession.version format, as shown in the first example below.
    If you don't know the version of the sequence record, enter just the accession, as shown in the second example below.

    Examples:
    • To search this field directly, enter a query such as:   genbank_protein_accession:searchterm
    • Search for:   genbank_protein_accession:AAB00464.1
      to show the Reference Gene Catalog entries associated with this GenBank protein.
    • Search for:   genbank_protein_accession:AAB00464
      to show the Reference Gene Catalog entries associated with this GenBank protein accession, regardless of its version number. (The Reference Gene Catalog contains the latest version of a sequence record; if you don't know what version number is the latest, enter only the accession as your query, without any dot or version number.)
  • GenBank nucleotide accession (genbank_nucleotide_accession) list of Reference Gene Catalog data fieldsback to top

    Accession of the GenBank nucleotide sequence record in which the gene or allele is found. (Read more about the format of GenBank accessions, and about the various accession prefixes that appear in the Pathogen Detection project.)

    Enter the sequence record identifier in the accession.version format, as shown in the first example below.
    If you don't know the version of the sequence record, enter just the accession, as shown in the second example below.

    Examples:
    • To search this field directly, enter a query such as:   genbank_nucleotide_accession:searchterm
    • Search for:   genbank_nucleotide_accession:L26954.1
      to show the Reference Gene Catalog entries associated with this GenBank nucleotide sequence.
    • Search for:   genbank_nucleotide_accession:L26954
      to show the Reference Gene Catalog entries associated with this GenBank nucleotide sequence, regardless of its version number. (The Reference Gene Catalog contains the latest version of a sequence record; if you don't know what version number is the latest, enter only the accession as your query, without any dot or version number.)
  • organism fields: list of Reference Gene Catalog data fieldsback to top

    The whitelisted_taxa and blacklisted_taxa data fields below are used for retrieving organism-specific results. Specifically, they are used to screen for known resistance-causing point mutations within an organism group, and for common, non-informative genes, respectively.

    Point mutations are currently limited to one of Campylobacter, Escherichia, or Salmonella. Note that rRNA mutations will not be screened if only a protein file is provided. To screen known Shigella mutations use Escherichia as the organism. See Organism option below for more details.

    • Whitelisted taxa (whitelisted_taxa) list of Reference Gene Catalog data fieldsback to top

      The whitelisted_taxa data field indicates the taxa for which this element is curated for mutational resistance mechanisms.

      An example of a whitelisted sequence is the 16S_A1055G point mutation in E. coli.

      See the AMRFinderPlus documentation for a list of taxa where resistance mechanisms based on mutations are curated. Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

      Examples:
      • To search this field directly, enter a query such as:   whitelisted_taxa:searchterm
      • Search for:   whitelisted_taxa:Escherichia
        to list the resistance-causing point mutations found in the Escherichia taxonomic group (i.e., Escherichia coli and Shigella spp., Escherichia fergusonii).
      Additional note:
      • The AMRFinderPlus software automatically looks for whitelisted sequences if an organism is specified during a search. For example, if AMRFinderPlus is run with Escherichia in the organism field, then your isolate will be screened for the presence of point mutations that confer antimicrobial resistance in this taxonomic group (such as the 16S_A1055G point mutation). If AMRFinderPlus is run without Escherichia in the organism field, then your isolate will not be screened for the presence of this point mutation.
    • Blacklisted taxa (blacklisted_taxa) list of Reference Gene Catalog data fieldsback to top

      The blacklisted_taxa data field screens for genes that are common within a taxonomic group, and are therefore non-informative with regard to antimicrobial resistance.

      An example of a blacklisted sequence is fieF which is blacklisted for both E. coli and Salmonella.

      The available values in blacklisted_taxa currently include:
      • Escherichia > Escherichia coli and Shigella spp., Escherichia fergusonii
      • Klebsiella > Klebsiella pneumoniae and Klebsiella oxytoca
      • Salmonella > Salmonella enterica
      • Staphylococcus > Staphylococcus pseudintermedius
      • Vibrio > Vibrio cholerae
      Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

      Examples:
      • To search this field directly, enter a query such as:   blacklisted_taxa:searchterm
      • Search for:   blacklisted_taxa:Klebsiella
        to list genes that have been blacklisted in the Klebsiella taxonomic group (i.e., Klebsiella pneumoniae and Klebsiella oxytoca).
      • Search for:   blacklisted_taxa:Escherichia AND blacklisted_taxa:Salmonella
        to list genes that have been blacklisted in both the Escherichia taxonomic group (i.e., Escherichia coli and Shigella spp., Escherichia fergusonii), and in Salmonella.
      Additional note:
      • The AMRFinderPlus software automatically excludes blacklisted sequences if an organism is specified during a search. For example, if AMRFinderPlus is run with either Escherichia or Salmonella in the organism field, then your isolate will be screened for the presence of common genes in the taxonomic group, and those common genes will be eliminated from the AMRFinderPlus results. For example, the fieF gene will not be reported even if it is present in your isolate, since fieF is ubiquitous in both of these taxa and reporting it does not provide useful information.
  • Curated RefSeq start (curated_refseq_start) list of Reference Gene Catalog data fieldsback to top

    Did curators alter the start coordinate from the GenBank record when making the RefSeq record? The allowable values for this field are Yes or No, and must be written with a leading upper case letter.

    A "Yes" indicates that NCBI RefSeq curators either changed the translation start site (in the NG_* genomic sequence record) from what was shown on the corresponding GenBank record, or provided start and stop coordinates that the GenBank record lacked.

    The data field name is also case sensitive and should be written in all lower case, as shown in the example below. (Separate sections of this document provide additional details about case sensitive searches and accession prefixes that appear in the Pathogen Detection project.)

    Examples:
    • To search this field directly, enter a query such as:   curated_refseq_start:searchterm
    • Search for:   curated_refseq_start:Yes
      to show all genes and alleles that have a curated RefSeq start.
  • GenBank start (genbank_start) list of Reference Gene Catalog data fieldsback to top

    The start coordinate of the reference sequence for this element on the GenBank nucleotide sequence record. This field should always be lower than the GenBank stop field regardless of the GenBank strand.

  • GenBank stop (genbank_stop) list of Reference Gene Catalog data fieldsback to top

    The stop coordinate of this reference sequence for this element on the GenBank nucleotide sequence record. This field should always be higher than the GenBank start field regardless of the GenBank strand.

  • GenBank strand (genbank_strand) list of Reference Gene Catalog data fieldsback to top

    The strand (+/-) on which the reference sequence occurs, relative to the nucleotide sequence that appears in the genbank_nucleotide_accession listed for the gene or allele.

  • RefSeq start (refseq_start) list of Reference Gene Catalog data fieldsback to top

    The start coordinate of this reference sequence for this element on the RefSeq nucleotide sequence record. This field should always be lower than the RefSeq stop field regardless of the RefSeq strand.

  • RefSeq stop (refseq_stop) list of Reference Gene Catalog data fieldsback to top

    The stop coordinate of this reference sequence for this element on the RefSeq nucleotide sequence record. This field should always be higher than the RefSeq start field regardless of the RefSeq strand.

  • RefSeq strand (refseq_strand) list of Reference Gene Catalog data fieldsback to top

    The strand (+/-) of the reference sequence for this element, relative to the nucleotide sequence that appears in the refseq_nucleotide_accession listed for the gene or allele.

  • PubMed reference (pubmed_reference) list of Reference Gene Catalog data fieldsback to top

    Links to references describing the gene, if available. The value in the data field is a PubMed identifier (PMID). Clicking on an entry in this field will take you to the page for that paper in PubMed.

  • Synonyms (synonyms) list of Reference Gene Catalog data fieldsback to top

    Other symbols used to refer to this element / gene in the literature.
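
The start, stop, and strand fields described above can be combined to pull an element's sequence out of a nucleotide record. The following Python sketch is only an illustration: it assumes that the coordinates are 1-based and inclusive (the field descriptions above do not state the coordinate convention), the function name extract_element is hypothetical, and the sequence is made up.

```python
# Hypothetical helper: slice an element out of a nucleotide sequence using
# start/stop/strand values like those in the genbank_* or refseq_* fields.
# Assumption: coordinates are 1-based and inclusive.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_element(sequence: str, start: int, stop: int, strand: str) -> str:
    """Return the element sequence, reverse-complemented on the minus strand."""
    if start > stop:
        # Start is documented as always lower than stop, whatever the strand.
        raise ValueError("start must not exceed stop")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    region = sequence[start - 1:stop]
    if strand == "-":
        # Complement each base, then reverse to read the opposite strand.
        region = region.translate(COMPLEMENT)[::-1]
    return region


if __name__ == "__main__":
    record = "AAACCCGGGTTT"  # made-up sequence, not a real record
    print(extract_element(record, 4, 9, "+"))  # CCCGGG
    print(extract_element(record, 1, 6, "-"))  # GGGTTT
```

Because start is always the lower coordinate, the slice is taken the same way for both strands, and the strand value alone decides whether the result is reverse-complemented.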

Output from the Pathogen Detection Reference Gene Catalog Pathogen Detection Reference Gene Catalog help, topic listback to top

Tabular list of genes Pathogen Detection Reference Gene Catalog help, topic listback to top
  • Upon opening the Pathogen Detection Reference Gene Catalog, a table displays data for all genes and alleles that are currently in the catalog.
  • Every row in the Pathogen Detection Reference Gene Catalog display is a reference gene or a point mutation.
  • The data available for each item can include gene or allele name, product name, type, subtype, class, subclass, and more, as available. (See the Pathogen Detection Reference Gene Catalog data fields for a complete list.) Some of the data elements, such as accessions for corresponding protein and nucleotide sequence records and publications, link to additional information in related databases such as RefSeq, GenBank, and PubMed.
  • The genes and point mutations can be sorted by clicking on column headers, faceted by using filters (e.g., class:AMINOGLYCOSIDE), or searched using basic or advanced search techniques.
  • Download the list of elements and their metadata shown. Click on the Download button just above the main data table and select File type: Table. From there you can select Tab-delimited (.tsv) or Comma-delimited (.csv) and set the filename. Clicking Download will download the data shown in the table filtered by the search and with the visible columns included. See the Download the Reference Gene Catalog data section for more information and how to download sequences.
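
Once a table has been downloaded in Tab-delimited (.tsv) format, it can be read with standard tools. The Python sketch below is a minimal example under stated assumptions: the column headers (gene_family, class, subclass) and the rows are made up, so check the header row of your own download before reusing it. The function name rows_with_value is hypothetical.

```python
import csv
import io

# Made-up stand-in for a downloaded Tab-delimited (.tsv) table. The column
# headers here are assumptions; check the header row of your own download.
SAMPLE_TSV = (
    "gene_family\tclass\tsubclass\n"
    "geneA\tBETA-LACTAM\tCARBAPENEM\n"
    "geneB\tBETA-LACTAM\tCEPHALOSPORIN\n"
    "geneC\tAMINOGLYCOSIDE\tAMINOGLYCOSIDE\n"
)


def rows_with_value(handle, column: str, value: str) -> list[dict]:
    """Return rows whose column exactly equals value (case sensitive)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row.get(column) == value]


if __name__ == "__main__":
    matches = rows_with_value(io.StringIO(SAMPLE_TSV), "class", "BETA-LACTAM")
    for row in matches:
        print(row["gene_family"], row["subclass"])
```

For a real file, pass an open file handle instead of the in-memory sample, for example open("my_download.tsv", newline=""), where the filename is whatever you set in the Download dialog.
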
Filters to refine results Pathogen Detection Reference Gene Catalog help, topic listback to top

  • The "Filters" menu options in the Reference Gene Catalog enable you to facet or subset the data in a variety of ways, and therefore can be used to refine your results, whether you have done a basic search or an advanced search.
  • By default, each filter displays the top 10 terms (based on the number of genes/alleles retrieved by a term). The "more [+]" option displays up to the top 100 terms, listed alphabetically within that set of top 100.
  • Filters are generated on the fly. The choices listed in the "Filters" tab depend on the data set you are currently displaying in the browser, and reflect the attributes of the genes and alleles in that data set.
  • A separate section of this document provides additional information about Filters.
Customize the Reference Gene Catalog display Pathogen Detection Reference Gene Catalog help, topic listback to top
  • The columns displayed by the Reference Gene Catalog reflect the data fields. By default, the Reference Gene Catalog displays only a subset of the available data fields.
  • You can use the "Choose Columns" option at the top of the tabular list of genes in order to remove columns, select additional columns to display, and/or change the order of the columns.
  • The options you select will persist within a given browser (e.g., Chrome, Edge, Internet Explorer, Firefox, Safari) until that browser's cookies are cleared/reset.

Use cases/sample searches of the Pathogen Detection Reference Gene Catalog Pathogen Detection Reference Gene Catalog help, topic listback to top

Find multidrug resistant genes Pathogen Detection Reference Gene Catalog help, topic listback to top

As an example:
  • Open the NCBI Pathogen Detection Reference Gene Catalog.
  • Open the "Filters" function.
  • By default, each filter shows the top 100 terms (based on the number of genes/alleles retrieved by a term).
  • In the Class section of the filters, scroll down to find MULTIDRUG or search for MULTIDRUG in the search box.
  • "MULTIDRUG" now appears as an option under Class. Select that option. Upon this action, the Filters display will refresh itself to show only the set of filters that apply to that class of antibiotics, and the tabular list of genes and alleles will refresh itself to show only the items that fall in that class.
An alternative method for retrieving those genes and alleles is to search the class data field directly. To do this, open the Pathogen Detection Reference Gene Catalog and enter the query:
class:MULTIDRUG
Separate sections of this file provide details about filters, and about how to directly search specific data fields, such as the class and subclass fields, and case sensitive searches.


Find carbapenem resistant genes Pathogen Detection Reference Gene Catalog help, topic listback to top

As an example:
  • Open the NCBI Pathogen Detection Reference Gene Catalog.
  • Open the "Filters" function.
  • In the Class section of the filters, select "BETA-LACTAM." Upon this action, the Filters display will refresh itself to show only the set of filters that apply to the Beta-Lactam class.
  • The Subclass section of the filters will now list "CARBAPENEM" as an option. Check the box for CARBAPENEM to show the genes that confer resistance to that subclass of antibiotics.
An alternative method for retrieving those genes and alleles is to search the subclass data field directly. To do this, open the Pathogen Detection Reference Gene Catalog and enter the query:
subclass:CARBAPENEM
Separate sections of this file provide details about filters, and about how to directly search specific data fields, such as the class and subclass fields, and case sensitive searches.


Find point mutations in Escherichia that confer resistance to quinolones Pathogen Detection Reference Gene Catalog help, topic listback to top

As an example:
  • Open the NCBI Pathogen Detection Reference Gene Catalog.
  • Open the "Filters" function.
  • By default, each filter shows the top 100 terms (based on the number of genes/alleles retrieved by a term).
  • In the Organism section of the filters, select "Escherichia." Upon this action, the Filters display will refresh itself to show only the set of filters that apply to Escherichia.
  • In the Subtype section of the filters, select "POINT." Upon this action, the Filters display will refresh itself to show only the set of filters that apply to Escherichia point mutations.
  • In the Subclass section of the filters, scroll to "QUINOLONE" or type that term in the search box. It now appears as an option. Select that option.
  • The resulting output is a list of Escherichia point mutations that confer resistance to quinolone antibiotics.
An alternative method for retrieving those point mutations is to search the relevant data fields directly. To do this, open the Pathogen Detection Reference Gene Catalog and enter the query:
organism:Escherichia AND subtype:POINT AND subclass:QUINOLONE
Separate sections of this file provide details about filters, and about how to directly search specific data fields, such as the organism, type, subtype, class, and subclass fields, and case sensitive searches.
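
Field-restricted queries like the ones in these use cases follow a regular field:value pattern joined by Boolean operators, so they are easy to compose programmatically. The Python sketch below only builds the query string to paste into the search box; it does not contact NCBI, and the function name build_query is hypothetical.

```python
# Hypothetical helper for composing field-restricted queries like the ones
# shown above. It only builds the query string; it does not contact NCBI.


def build_query(terms: dict[str, str], operator: str = "AND") -> str:
    """Join field:value pairs with a Boolean operator, preserving case."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    parts = []
    for field, value in terms.items():
        # Quote values containing spaces so they are searched as a phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{field}:{value}")
    return f" {operator} ".join(parts)


if __name__ == "__main__":
    print(build_query({
        "organism": "Escherichia",
        "subtype": "POINT",
        "subclass": "QUINOLONE",
    }))
    # organism:Escherichia AND subtype:POINT AND subclass:QUINOLONE
```

The helper leaves field names and values untouched because both are case sensitive.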


AMRFinderPlus Antimicrobial resistance (AMR) resources, topic listback to top

What is AMRFinderPlus? | Install software | Download data files: Reference Gene Catalog data, Hidden Markov Models (HMMs), AMRFinder Hierarchy (illustrated example of a hierarchy) | Interpret AMRFinderPlus results | Read more | Publication/citation

What is AMRFinderPlus? AMRFinderPlus, topic listback to top

AMRFinderPlus - Identifies antimicrobial resistance (AMR) genes and point mutations in assembled nucleotide and protein sequences. AMRFinderPlus also identifies select virulence and stress resistance genes. AMRFinderPlus compares isolate genomes against the reference protein set using BLAST and against the HMM set using HMMER, and uses the gene hierarchy to provide the most specific protein assignment to antimicrobial resistant protein or family, if present in the query set of proteins. The original AMRFinder identifies acquired antimicrobial resistance (AMR) genes, as well as point mutations that confer antimicrobial resistance, in either protein datasets or nucleotide data, including genomic data. AMRFinderPlus identifies the AMR genes and point mutations that are found by the original AMRFinder, plus it identifies select members of additional classes of genes such as virulence factors, biocide, heat, acid, and metal resistance genes. Unlike other AMR gene detection methods that report the best hit, AMRFinderPlus reports the specific gene symbol based on the available evidence. For example, when presented with a novel blaKPC allele that is nearly identical to blaKPC-2, closest hit tools might return blaKPC-2, but AMRFinderPlus would call it as blaKPC (illustrated example). More details about the tool are provided in a publication by Feldgarden M, et al., 2021.
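
The naming behavior described above can be pictured with a toy example. The Python sketch below is not the AMRFinderPlus algorithm: the two-level hierarchy, the function name report_symbol, and the simple "exact match or step up one level" rule are simplifying assumptions, used only to show why a novel allele is reported under its parent symbol rather than under the closest known allele.

```python
# Toy illustration only; not the AMRFinderPlus implementation.
# Hypothetical child -> parent hierarchy built from symbols used in this document.
PARENT = {
    "blaKPC-2": "blaKPC",
    "blaKPC": "bla",
}


def report_symbol(closest_allele: str, exact_match: bool) -> str:
    """Report the allele only on an exact match; otherwise step up one level."""
    if exact_match:
        return closest_allele
    # Without an exact match, fall back to the parent node if one is known.
    return PARENT.get(closest_allele, closest_allele)


if __name__ == "__main__":
    print(report_symbol("blaKPC-2", exact_match=True))   # blaKPC-2
    print(report_symbol("blaKPC-2", exact_match=False))  # blaKPC
```

A closest-hit tool corresponds to always returning closest_allele, which is what would mislabel a novel allele as blaKPC-2.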

MicroBIGG-E (Microbial Browser for Identification of Genetic and Genomic Elements) Antimicrobial resistance (AMR) resources, topic listback to top

What is MicroBIGG-E? MicroBIGG-E help, topic listback to top

"MicroBIGG-E" is the Microbial Browser for Identification of Genetic and Genomic Elements. MicroBIGG-E help, topic listback to top
  • MicroBIGG-E contains genetic and genomic elements identified in assemblies analyzed by AMRFinderPlus as part of the Pathogen Detection Pipeline. See the AMRFinderPlus wiki for more information on how AMRFinderPlus works and the Pathogen Detection Reference Gene Catalog for a list of the elements that AMRFinderPlus is searching for.
  • MicroBIGG-E will be updated each time an organism group is updated in the Isolates Browser.
  • It contains the genetic and genomic elements that have been found in isolate genomes that have been published in GenBank. (This is in contrast to the Isolates Browser, which contains isolates that have been published in GenBank as well as those awaiting submission to GenBank.) The output is the results of AMRFinderPlus analyses, as described in the data processing pipeline section of this document.
  • The MicroBIGG-E will initially include genes, alleles, and point mutations.
  • Every row in the MicroBIGG-E display is an anti-microbial resistance (AMR), stress response, and/or virulence gene that has been identified in an isolate assembly by the data processing pipeline, with information about the method used to identify it, supporting evidence, and the element's type, subtype, class, subclass, and more.
  • The purpose of MicroBIGG-E is to enable researchers to obtain detailed information about the element as well as the actual contigs that contain a genetic or genomic element of interest, in order to conduct further analysis.
  • The Pathogen Detection pipeline uses two assemblers, a de novo assembler and a targeted assembler (SAUTE), to increase assembly sensitivity and accuracy for AMR genes. A region of the genome may therefore appear in two contigs, so that genes look duplicated. For this reason, the copy number for AMR genes in MicroBIGG-E will often be higher than the copy number in the actual isolate.
Relationship between MicroBIGG-E and Pathogens Isolates Browser MicroBIGG-E help, topic listback to top

Where to access MicroBIGG-E MicroBIGG-E help, topic listback to top

"MicroBIGG-E," the Microbial Browser for Identification of Genetic and Genomic Elements, is accessible from the Pathogen Detection Project home page (as a link in the right hand margin under "Data Resources"), from the AMR landing page (National Database of Antibiotic Resistant Organisms (NDARO)), and from the AMR Resources page.

The raw data behind MicroBIGG-E is available at Google Cloud. You can also access MicroBIGG-E directly from the links below:

Browse/Search MicroBIGG-E:
/pathogens/isolates#/microbigge/.

Download the MicroBIGG-E data:
Click the "Download" button in the header of the MicroBIGG-E table to download data. You can either download a tab-delimited or csv formatted representation of the table view or a set of sequences under the "Dataset" selection.
  • Table downloads can be in either Tab-delimited (.tsv) format or Excel comma-delimited format (.csv), and have a maximum of 100,000 rows.
  • Dataset downloads contain protein or nucleotide data related to the elements shown in the table. These can be the DNA sequence of the elements, the elements plus flanks (up to 2,000 bp), the entire contig containing the elements (max 1,000 contigs), or the amino-acid sequences of the protein elements.
  • GCP BigQuery Full table access using SQL. See MicroBIGG-E data at Google Cloud Platform for more information on how to get full MicroBIGG-E data on Google Cloud in BigQuery.
Bulk access for MicroBIGG-E data is under active development. Table data is now available on GCP. Email NCBI at pd-help@ncbi.nlm.nih.gov if the current functionality does not meet your needs.

Search tips for MicroBIGG-E MicroBIGG-E help, topic listback to top

Allowable search terms (MicroBIGG-E) MicroBIGG-E help, topic listback to top
  • MicroBIGG-E can be searched by the terms that appear in any of the data fields described below. A search example is provided after each data field description, when possible.
Basic search (MicroBIGG-E) MicroBIGG-E help, topic listback to top
Advanced search (MicroBIGG-E) MicroBIGG-E help, topic listback to top
Filters (MicroBIGG-E) MicroBIGG-E help, topic listback to top
  • The "Filters" menu options in the MicroBIGG-E enable you to facet or subset the data in a variety of ways, and therefore can be used to refine your results, whether you have done a basic search or an advanced search.
  • Each filter displays counts of elements next to each term in the filter. Note that these counts are for elements in the browser, and may not accurately describe the number of genes in actual isolates because Pathogen Detection assemblies use both de novo and guided assemblies which may represent the same gene in an assembly multiple times.
  • By default, each filter displays the top 100 terms (based on the number of isolates retrieved by a term) listed by count of value within that set of top 100. Use the search box to search for filters not in the top 100. Note that:
    • A Boolean "OR" is applied if multiple items are checked in the same filter field. This way you can choose multiple values in the same filter. For example:
      • Open the "Filters" tab of the MicroBIGG-E, then check the boxes for "Stress" and for "Virulence" in the "Type" filter. The system will retrieve genetic/genomic elements that are associated with either stress resistance or with virulence.
    • A Boolean "AND" is applied if you select items in several different filter fields (Type, Class, etc). For example:
      • Open the "Filters" tab of the MicroBIGG-E web interface, then check the boxes for "Point" in the "Subtype" filter and "Quinolone" in the "Class" filter. The system will retrieve genetic/genomic elements that meet both of your specified criteria (in this case, point mutations that confer resistance to quinolones).
  • As explained in the Isolates Browser help, Filters are generated on the fly. As a result, the terms that are listed under each filter will depend on the data set you are currently displaying in the browser. That is also true for the filters in the MicroBIGG-E.
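
The way filter selections combine (OR within a single filter, AND across different filters) can be sketched on a few in-memory rows. The Python example below is an illustration only: the rows, the field names, and the function name apply_filters are made up, and the real filters run on the server rather than in your own code.

```python
# Sketch of the filter logic described above, applied to in-memory rows:
# values checked within one filter are ORed, separate filters are ANDed.
# Field names and rows are made up for illustration.

SAMPLE_ROWS = [
    {"symbol": "elementA", "type": "STRESS", "subtype": "METAL"},
    {"symbol": "elementB", "type": "VIRULENCE", "subtype": "VIRULENCE"},
    {"symbol": "elementC", "type": "AMR", "subtype": "POINT"},
]


def apply_filters(rows: list[dict], selections: dict[str, set]) -> list[dict]:
    """Keep rows matching any checked value in every filter that has checks."""
    return [
        row for row in rows
        if all(row.get(field) in values
               for field, values in selections.items() if values)
    ]


if __name__ == "__main__":
    either = apply_filters(SAMPLE_ROWS, {"type": {"STRESS", "VIRULENCE"}})
    print([r["symbol"] for r in either])  # ['elementA', 'elementB']
    both = apply_filters(SAMPLE_ROWS, {"type": {"AMR"}, "subtype": {"POINT"}})
    print([r["symbol"] for r in both])  # ['elementC']
```

A filter with nothing checked is skipped, which mirrors leaving that filter untouched in the web interface.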

Data Fields in MicroBIGG-E MicroBIGG-E help, topic listback to top

The MicroBIGG-E data fields listed below have been indexed by the Pathogen Detection project and are therefore directly searchable, using the advanced search techniques that are described in the Isolates Browser help, because both MicroBIGG-E and the Isolates Browser use the SOLR query language. Note that the data field names and values are case sensitive, as described in the Isolates Browser help.

Each data field reflects an available column in the MicroBIGG-E web interface. The output section of this document provides tips on how to customize the display, using the "choose columns" function.

Please note: in the list of available data fields below:
  • The term shown in the regular font is the display name (column header) shown by the MicroBIGG-E web interface. The term shown in (italics) is the name of the corresponding data field, if you want to search that field directly.
  • For example, one data field is listed as: Method (amr_method). The term "Method" appears in the MicroBIGG-E column header, and "amr_method" (with an underscore bar instead of a space) is the string you should use if you want to search that data field directly.
  • Brief italicized search examples are also provided for each data field, when possible, showing how to query the data field directly. The values represent text strings exactly as they appear in the data fields, including upper case and lower case letters, special characters such as hyphens, etc. The data field names and values are case sensitive.
The available data fields in the MicroBIGG-E include the following: MicroBIGG-E help, topic listback to top
Note that each field is written in this format:   Display name (data_field_name)
The "Display name" is the column header that appears in the MicroBIGG-E web interface, and the "data_field_name" is the case-sensitive string you should enter if you want to search the data field directly using a SOLR query:
Isolate data fields:
Element data fields:
Reference data fields:
Analysis results (Element vs Reference) data fields:
Analysis log data fields:

Isolate data fields: list of MicroBIGG-E data fieldsback to top

Element data fields: list of MicroBIGG-E data fieldsback to top

  • Element symbol (element_symbol) list of MicroBIGG-E data fieldsback to top

    The symbol assigned to the element by AMRFinderPlus. Examples include an allele symbol (blaKPC-2), a protein symbol (blaKPC), or a point mutation symbol (gyrA_G81D). It can also be a very broad symbol representing a large family of proteins (bla) that you would not find in the reference gene catalog. This happens when AMRFinderPlus lacks evidence to use a more specific element symbol.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   element_symbol:searchterm
    • Search for:   element_symbol:blaKPC
      to show all genetic/genomic elements with that exact symbol.
    • Search for:   element_symbol:blaKPC OR element_symbol:blaKPC-2
      to show all genetic/genomic elements that have either of those exact symbols.
  • Element name (element_name) list of MicroBIGG-E data fieldsback to top

    The name of the element assigned by AMRFinderPlus.

    Data field names and values are case sensitive, as shown in the examples below. Use quotes to search for phrases, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   element_name:searchterm
    • Search for:   element_name:"KPC family carbapenem-hydrolyzing class A beta-lactamase"
      to show genetic/genomic elements with that name.
    • Search for:   element_name:"KPC family carbapenem-hydrolyzing class A beta-lactamase" OR element_name:"carbapenem-hydrolyzing class A beta-lactamase KPC-2"
      to show all genetic/genomic elements that have either of those names.
  • Element length (element_length)

    The length of this element in amino acids (AA) for protein elements, and in base pairs (bp) for nucleotide elements.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   element_length:searchterm
    • To search for a range of values, enter a query such as:  element_length:[value1 TO value2]
    • Search for:   element_length:234
      to show genetic/genomic elements that have a length of 234 amino acids (or 234 nucleotides).
    • Search for:   element_length:[200 TO 250]
      to show genetic/genomic elements that range in length between 200 and 250 amino acids (or between 200 and 250 nucleotides).
  • Protein (protein_acc)

    The accession of the protein sequence record for this element.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   protein_acc:searchterm
    • Search for:   protein_acc:WP_004199234.1
      to show genetic/genomic elements that have the protein sequence shown in the RefSeq record WP_004199234.1. This search retrieves genetic/genomic elements from a large number of isolates, because the sequence has been found to be a multispecies protein.
    • Search for:   protein_acc:WP_124042569.1
      to show the genetic/genomic elements that have the protein sequence shown in the RefSeq record WP_124042569.1. As of May 23, 2020, this search retrieves a single element, from the E. coli isolate PDT000411318.1.
  • Contig (contig_acc)

    The accession of the contig sequence record on which this element appears.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   contig_acc:searchterm
    • Search for:   contig_acc:NZ_UWVC01000003.1
      to show the genetic/genomic elements that have been identified on the sequence of contig NZ_UWVC01000003.1.
  • Start (start_on_contig)

    The start coordinate for the element on the contig sequence.
  • Stop (end_on_contig)

    The stop coordinate for the element on the contig sequence.
  • Strand (strand)

    The strand (+/-) on which the genetic or genomic element appears, relative to the nucleotide sequence that appears in the contig accession listed for the element.
  • Type (type)

    Classification for the type of gene found, such as AMR, STRESS, or VIRULENCE.

    A more detailed description of the type and subtype fields is available on the AMRFinderPlus wiki

    This data field also appears in the Pathogen Detection Reference Gene Catalog; a description of Type and examples of queries for that field appear in the Reference Gene Catalog data fields help section.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the substrate.)
  • Subtype (subtype)

    Classification for the subtype of gene found. A more detailed description of the type and subtype fields is available on the AMRFinderPlus wiki

    This data field also appears in the Pathogen Detection Reference Gene Catalog; a description of Subtype and examples of queries for that field appear in the Reference Gene Catalog data fields help section.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the substrate.)
  • Class (class)

    Class of resistance for "core" genes (see scope), and typing information for some virulence genes.

    This data field also appears in the Pathogen Detection Reference Gene Catalog; a description of Class and examples of queries for that field appear in the Reference Gene Catalog data fields help section.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the substrate.)
  • Subclass (subclass)

    Where it is known, "Subclass" provides a more specific definition of the particular antibiotics or classes that are affected by the gene or point mutation (e.g., that are resisted by the gene/allele). While most subclass designations are self-explanatory, a few have particular meanings. Specifically, "CEPHALOSPORIN" is equivalent to the Lahey 2be definition; "CARBAPENEM" means the protein has carbapenemase activity, but it might or might not confer resistance to other beta-lactams; "QUATERNARY AMMONIUM" refers to quaternary ammonium compounds. In addition, stx subtypes (e.g., STX2E) and intimin subtypes (e.g., ALPHA) are defined for Shiga toxin proteins (class of STX1 or STX2) and intimins (class of INTIMIN), respectively. Where the phenotypic information is incomplete, contradictory, or unclear, the "Class" value is used for the "Subclass" value.

    More information about the class and subclass fields can be found on the AMRFinderPlus wiki

    This data field also appears in the Pathogen Detection Reference Gene Catalog; a description of Subclass and examples of queries for that field appear in the Reference Gene Catalog data fields help section.

    (In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to the substrate.)
  • Scope (scope)

    This data field also appears in the Pathogen Detection Reference Gene Catalog; a description of Scope and examples of queries for that field appear in the Reference Gene Catalog data fields help section.
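The field-specified queries shown for the element data fields follow a regular pattern, so they can be assembled programmatically. The Python sketch below is illustrative only: the helper names `field_query` and `any_of` are hypothetical, and the rule of quoting any value that contains whitespace is an assumption based on the phrase-search examples in this section. It does not escape special characters or handle wildcards, which are covered in other sections of this document.

```python
def field_query(field, value):
    """Return a field-specified query such as element_symbol:blaKPC."""
    text = str(value)
    # Assumption: phrases (values containing whitespace) are wrapped in double quotes.
    if any(ch.isspace() for ch in text):
        text = '"' + text + '"'
    return f"{field}:{text}"


def any_of(field, values):
    """Join several exact-value queries on the same field with OR."""
    return " OR ".join(field_query(field, v) for v in values)


print(field_query("element_symbol", "blaKPC"))
print(any_of("element_symbol", ["blaKPC", "blaKPC-2"]))
print(field_query("element_name",
                  "KPC family carbapenem-hydrolyzing class A beta-lactamase"))
```

Because field names and values are case sensitive, the helpers pass both through unchanged rather than normalizing case.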
Reference data fields:

  • Closest reference accession (closest_reference_acc)

    The accession of the closest reference sequence. Note that only one reference will be chosen if the BLAST hit is equidistant from multiple references (the value is NA for an HMM-only hit). For point mutations, the reference is the sensitive "wild-type" allele, and the element symbol describes the specific mutation. Check the Reference Gene Catalog for more information on specific mutations or reference genes.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   closest_reference_acc:searchterm
    • Search for:   closest_reference_acc:WP_001083725.1
      to show genetic/genomic elements whose protein sequence is most closely related to the sequence in RefSeq record https://www.ncbi.nlm.nih.gov/protein/WP_001083725.1.

      Note that some elements retrieved by the search above will list that accession in both the closest_reference_acc and protein_acc columns, while other proteins will list it only in the closest_reference_acc column. You can retrieve either subset with the following searches:
      Search for:   closest_reference_acc:WP_001083725.1 AND protein_acc:WP_001083725.1
      Search for:   closest_reference_acc:WP_001083725.1 NOT protein_acc:WP_001083725.1
  • Closest reference name (closest_reference_name)

    The name of the closest reference sequence.

    Data field names and values are case sensitive, as shown in the examples below. Use quotes to search for phrases, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   closest_reference_name:searchterm
    • Search for:   closest_reference_name:"trimethoprim-resistant dihydrofolate reductase DfrA12"
      to show genetic/genomic elements whose closest RefSeq protein sequence is named "trimethoprim-resistant dihydrofolate reductase DfrA12."
  • Reference element length (reference_element_length)

    Length of the reference sequence in amino acids (AA) for protein elements, and in base pairs (bp) for nucleotide elements.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   reference_element_length:searchterm
    • To search for a range of values, enter a query such as:  reference_element_length:[value1 TO value2]
    • Search for:   reference_element_length:284
      to show genetic/genomic elements whose reference elements have a length of 284 amino acids (or 284 nucleotides).
    • Search for:   reference_element_length:[200 TO 250]
      to show genetic/genomic elements whose reference elements range in length between 200 and 250 amino acids (or between 200 and 250 nucleotides).
  • HMM Accession (hmm_acc)

    The accession of the Hidden Markov Model (HMM) that hits this element above cutoff (if any). Clicking the HMM accession will take you to the HMM page in the Protein Family Models database. From that page you can download the HMM itself and get additional information including the curated cutoffs, the seed alignment, and RefSeq sequences identified by this HMM.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   hmm_acc:searchterm
    • Search for:   hmm_acc:NF000053.2
      to show genetic/genomic elements that have a match to the Hidden Markov Model with accession NF000053.2 (trimethoprim-resistant dihydrofolate reductase DfrA12).
  • HMM Description (hmm_description)

    The name of the Hidden Markov Model (HMM) that hits this element (if any).

    Data field names and values are case sensitive, as shown in the examples below. Use quotes to search for phrases, as shown in the example below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   hmm_description:searchterm
    • Search for:   hmm_description:"trimethoprim-resistant dihydrofolate reductase DfrA12"
      to show genetic/genomic elements that have a match to the Hidden Markov Model with the name "trimethoprim-resistant dihydrofolate reductase DfrA12."
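The closest_reference_acc examples in this section split a result set into elements whose protein_acc matches the reference accession and elements whose protein_acc does not. As a minimal sketch, the two queries can be generated for any accession. The function name `reference_subset_queries` and the dictionary keys are hypothetical labels chosen for illustration.

```python
def reference_subset_queries(accession):
    """Build the two subset queries for one closest-reference accession."""
    base = f"closest_reference_acc:{accession}"
    same = f"protein_acc:{accession}"
    return {
        # Accession appears in both the closest_reference_acc and protein_acc columns.
        "same_accession": f"{base} AND {same}",
        # Accession appears only in the closest_reference_acc column.
        "reference_only": f"{base} NOT {same}",
    }


for label, query in reference_subset_queries("WP_001083725.1").items():
    print(label, "->", query)
```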
Analysis results (Element vs Reference) data fields:

  • Method (amr_method)

    The method used by AMRFinderPlus to identify this element. A separate section of this document provides a table that summarizes AMRFinderPlus methods that are used by the tool to analyze pathogen isolate genome assemblies and identify genetic and genomic elements. The AMRFinderPlus Wiki provides additional details about the methods.

    Data field names and values are case sensitive, as shown in the examples below. Additional sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards (such as the asterisk or question mark).

    Examples:
    • To search this field directly, enter a query such as:   amr_method:searchterm
    • Search for:   amr_method:HMM
      to show proteins that were found by HMM only, that is, proteins that are more distant from the reference proteins than the BLAST cutoffs allow.
    • Search for:   amr_method:POINTN OR amr_method:POINTP OR amr_method:POINTX
      to show point mutations that were identified using nucleotide BLAST (BLASTN), protein BLAST (BLASTP), or translated BLAST (BLASTX).
  • Alignment length (align_length)

    The length of the alignment between the genetic/genomic element, which was identified by AMRFinderPlus in the isolate genome assembly, and the reference element. The length is measured in amino acids (AA) for protein elements, and in base pairs (bp) for nucleotide elements.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   align_length:searchterm
    • To search for a range of values, enter a query such as:  align_length:[value1 TO value2]
    • Search for:   align_length:[200 TO 250]
      to show genetic/genomic elements whose alignment to the closest reference element ranges in length between 200 and 250 amino acids (or between 200 and 250 nucleotides).
  • % Identity (pct_ref_identity)

    The percent of identical amino acids or base pairs within the aligned region of the genetic/genomic element (identified by AMRFinderPlus in the isolate genome assembly) and the reference element.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   pct_ref_identity:searchterm
    • To search for a range of values, enter a query such as:  pct_ref_identity:[value1 TO value2]
    • Search for:   pct_ref_identity:100
      to show genetic/genomic elements that have a 100% identity to the reference element, within the aligned region.
    • Search for:   pct_ref_identity:[98 TO 100]
      to show genetic/genomic elements that have an identity that ranges from 98% to 100% to the reference element, within the aligned region.
  • % Coverage (pct_ref_coverage)

    The proportion of the reference sequence covered by the alignment between the target element and the reference element.
    For example, a coverage of 90% means that the alignment between the target element and the reference element covers 90% of the reference sequence's length.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Examples:
    • To search this field directly, enter a query such as:   pct_ref_coverage:searchterm
    • To search for a range of values, enter a query such as:  pct_ref_coverage:[value1 TO value2]
    • Search for:   pct_ref_coverage:100
      to show genetic/genomic elements whose alignment to the closest reference element covers 100% of the reference element's length.
    • Search for:   pct_ref_coverage:[50 TO 75]
      to show genetic/genomic elements whose alignment to the closest reference element covers 50% to 75% of the reference element's length.
  • Contig coverage (contig_coverage)

    Contig coverage is the mean coverage of aligned reads for the contig containing this hit. This is a decimal (floating point) number > 0, not a percentage.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Example:
    • To search this field directly, enter a query such as:   contig_coverage:searchterm
    • To search for a range of values, enter a query such as:  contig_coverage:[value1 TO value2]
    • Search for:   contig_coverage:[96 TO 106]
      to show genetic/genomic elements that have a contig coverage between 96 and 106.
  • Relative assembly coverage (rel_asm_cov)

    This is the mean coverage by aligned reads of the entire contig divided by the mean coverage by aligned reads of the entire assembly. Mathematically the value is contig_coverage / asm_coverage. This is a ratio, a decimal (floating point) number > 0, not a percentage.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Example:
    • To search this field directly, enter a query such as:   rel_asm_cov:searchterm
    • To search for a range of values, enter a query such as:  rel_asm_cov:[value1 TO value2]
    • Search for:   rel_asm_cov:[1.190 TO 1.202]
      to show genetic/genomic elements that have relative assembly coverage between 1.190 and 1.202.
    • Search for:   rel_asm_cov:[1.1 TO 1.2]
      to show genetic/genomic elements that have relative assembly coverage between 1.100 and 1.200.
  • Assembly coverage (asm_coverage)

    Assembly coverage is the mean coverage of aligned reads for the entire assembly. This is a decimal (floating point) number > 0, not a percentage.

    Data field names and values are case sensitive, as shown in the examples below.
    This data field can be queried by a range search, as shown in the example below.

    Example:
    • To search this field directly, enter a query such as:   asm_coverage:searchterm
    • To search for a range of values, enter a query such as:  asm_coverage:[value1 TO value2]
    • Search for:   asm_coverage:[98 TO 110]
      to show genetic/genomic elements that have assembly coverage between 98 and 110.
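The relationship among the three coverage fields (rel_asm_cov = contig_coverage / asm_coverage) and the range-search syntax can be illustrated with a short Python sketch. The function names and the coverage values 120.0 and 100.0 are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def relative_assembly_coverage(contig_coverage, asm_coverage):
    """Return contig_coverage / asm_coverage, a ratio rather than a percentage."""
    if asm_coverage <= 0:
        raise ValueError("asm_coverage must be greater than 0")
    return contig_coverage / asm_coverage


def range_query(field, low, high):
    """Return a range search such as rel_asm_cov:[1.1 TO 1.2]."""
    return f"{field}:[{low} TO {high}]"


# Hypothetical contig with mean coverage 120.0 in an assembly with mean coverage 100.0.
ratio = relative_assembly_coverage(120.0, 100.0)
print(ratio)
print(range_query("rel_asm_cov", 1.1, 1.2))
```

A ratio well above 1 means the contig has more aligned-read coverage than the assembly average, and a ratio below 1 means it has less.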
Analysis log data fields:

Output from MicroBIGG-E

Tabular list of genes and genetic elements
  • Upon opening the MicroBIGG-E web interface, a table displays data for all genetic and genomic elements that have been identified in isolate genomes that have been deposited into GenBank.
  • Every row in the MicroBIGG-E display is an antimicrobial resistance (AMR), stress response, and/or virulence gene that has been identified in an isolate by the data processing pipeline.
  • The data available for each item can include gene name, type, subtype, class, subclass, method used to identify the element, supporting evidence, and more, as available. (See the MicroBIGG-E data fields for a complete list.) Some of the data elements, such as accessions for BioSample, nucleotide sequence, and protein sequence records, link to additional information in the corresponding databases.
  • The genes can be sorted by clicking on column headers, faceted by using filters (e.g., class:AMINOGLYCOSIDE), or searched using basic or advanced search techniques.
Filters to refine results
  • The "Filters" menu options in the MicroBIGG-E web interface enable you to facet or subset the data in a variety of ways, and therefore can be used to refine your results, whether you have done a basic search or an advanced search.
  • By default, each filter displays the top 100 terms, ranked by the number of genes/alleles that each term retrieves.
  • Filters are generated on the fly. The choices listed in the "Filters" tab depend on the data set you are currently displaying in the browser, and reflect the attributes of the genes and alleles in that data set.
  • A separate section of this document provides additional information about Filters.
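To make the idea of a filter concrete, the Python sketch below counts the values of one data field across a set of rows and keeps the most frequent terms. It is an illustration of faceting in general, not a description of how the MicroBIGG-E web interface is implemented. The function name `facet_counts` and the three sample rows are hypothetical.

```python
from collections import Counter


def facet_counts(rows, field, limit=100):
    """Count the values of one field and return the most frequent terms first."""
    counts = Counter(row[field] for row in rows if row.get(field))
    return counts.most_common(limit)


# Hypothetical rows standing in for the data set currently displayed.
rows = [
    {"class": "AMINOGLYCOSIDE"},
    {"class": "BETA-LACTAM"},
    {"class": "AMINOGLYCOSIDE"},
]
print(facet_counts(rows, "class"))
```

Because the counts are computed from whatever rows are passed in, the terms change whenever the underlying result set changes, which mirrors the on-the-fly behavior described above.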
Customize the MicroBIGG-E display
  • The columns displayed by MicroBIGG-E reflect the data fields. By default, the MicroBIGG-E displays only a subset of the available data fields.
  • You can use the "Choose Columns" option at the top of the tabular list of genes in order to remove columns, select additional columns to display, and/or change the order of the columns.
  • The options you select will persist within a given browser (e.g., Chrome, Edge, Internet Explorer, Firefox, Safari) until that browser's cookies are cleared/reset.
Cross-browser selection
  • It is possible to view isolates that you have identified in MicroBIGG-E in the Isolates Browser
  • Click the Cross-browser selection button to the right of the Download button (you must be logged into your myNCBI account for this functionality). By default, all of the isolates for every row of your MicroBIGG-E search will be selected, as indicated by the checkbox column; however, you can deselect rows manually.
  • Then click the Show in Isolates button. A new tab will open with the Isolates browser results for the selected elements in MicroBIGG-E. You can then examine clusters of interest using the SNP Tree Viewer, or perform other tasks in the Isolates Browser.

Use cases/sample searches of MicroBIGG-E

Identify hits from isolates with specific genes that co-occur on the same contig

As an example, identify hits from contigs that have a set of genes (e.g., blaTEM-1 and blaKPC*) co-occurring on the same contig. A researcher might want to know which contigs (likely plasmids) have TEM-1 and a KPC allele, as opposed to a specific allele, since a single mutational event can alter the KPC allele and its clinical phenotype (such as KPC-3 and KPC-28), in order to understand the co-transmission and co-evolution of these two gene families.
  • Open the MicroBIGG-E: Microbial Browser for Identification of Genetic and Genomic Elements.
  • Search for contigs with genes of interest (e.g., blaTEM-1 and blaKPC*)
  • To do this, enter a search such as:
    genes_on_contig:blaTEM-1 AND genes_on_contig:blaKPC*
    (Note that field-specified searches are case-sensitive, and separate sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards such as the asterisk.)
  • Examine whether the genes of interest co-occur on the same contig, either by clicking Download or by visual inspection.
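If the results are downloaded, the co-occurrence check can also be done with a short script. The Python sketch below assumes each row has been read into a dictionary keyed by the data field names contig_acc and element_symbol; the column headers in an actual download may differ, and the contig names used here are hypothetical placeholders.

```python
from collections import defaultdict


def contigs_with_both(rows, exact_symbol, symbol_prefix):
    """Return contigs carrying exact_symbol and at least one symbol with symbol_prefix."""
    symbols_by_contig = defaultdict(set)
    for row in rows:
        symbols_by_contig[row["contig_acc"]].add(row["element_symbol"])

    hits = []
    for contig, symbols in symbols_by_contig.items():
        has_exact = exact_symbol in symbols
        has_prefix = any(s.startswith(symbol_prefix) for s in symbols)
        if has_exact and has_prefix:
            hits.append(contig)
    return sorted(hits)


# Hypothetical rows: only contig_A carries both blaTEM-1 and a blaKPC allele.
rows = [
    {"contig_acc": "contig_A", "element_symbol": "blaTEM-1"},
    {"contig_acc": "contig_A", "element_symbol": "blaKPC-3"},
    {"contig_acc": "contig_B", "element_symbol": "blaTEM-1"},
    {"contig_acc": "contig_C", "element_symbol": "blaKPC-2"},
]
print(contigs_with_both(rows, "blaTEM-1", "blaKPC"))
```

The prefix test plays the role of the blaKPC* wildcard in the search, and the comparison is case sensitive, as in field-specified searches.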
Identify partial gene sequences in the middle of contigs

As an example, identify partial gene sequences in the middle of contigs, as that form of partiality might imply loss or alteration of function, and might need to be excluded or treated differently:
  • Open the MicroBIGG-E: Microbial Browser for Identification of Genetic and Genomic Elements.
  • For a given isolate and gene sequence (i.e., row), keep the rows where the method indicates a partial sequence, and exclude the rows where the method equals "PARTIAL_CONTIG_ENDP" or "PARTIAL_CONTIG_ENDX".
  • To do this, enter a search such as:
    amr_method:PARTIAL* AND NOT amr_method:PARTIAL_CONTIG_END*
    (Note that field-specified searches are case-sensitive, and separate sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards such as the asterisk.)
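The same selection logic can be applied to downloaded rows. The Python sketch below mirrors the query with case-sensitive wildcard matching from the standard library. The function name `is_internal_partial` is hypothetical, and the method values PARTIALP and PARTIALX are included as illustrative inputs; consult the table of AMRFinderPlus methods for the authoritative list.

```python
from fnmatch import fnmatchcase


def is_internal_partial(method):
    """True for PARTIAL* methods that are not PARTIAL_CONTIG_END* methods."""
    return (fnmatchcase(method, "PARTIAL*")
            and not fnmatchcase(method, "PARTIAL_CONTIG_END*"))


# Illustrative method values; only the first two should pass the filter.
methods = ["PARTIALP", "PARTIALX", "PARTIAL_CONTIG_ENDP", "PARTIAL_CONTIG_ENDX", "HMM"]
print([m for m in methods if is_internal_partial(m)])
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case sensitive on every platform.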
Display isolates in isolates browser that have the same set of genes co-occurring on the same contig

Having identified which contigs (likely plasmids) have TEM-1 and a KPC allele, a researcher might want to see their phylogenetic context in the SNP Tree Viewer.
  • Open the MicroBIGG-E: Microbial Browser for Identification of Genetic and Genomic Elements.
  • Search for contigs that have a blaKPC gene and a blaTEM-1 allele.
  • To do this, enter the following search:
    genes_on_contig:blaTEM-1 AND genes_on_contig:blaKPC*
    (Note that field-specified searches are case-sensitive, and separate sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards such as the asterisk.)
  • Click the Cross-browser selection button to the right of the Download button (you must be logged into your myNCBI account for this functionality).
  • A new tab will open in the Isolates Browser containing all of the isolates meeting your search criteria. You can then examine clusters of interest using the SNP Tree Viewer, or perform other tasks in the Isolates Browser.
Display hits from isolates with co-occurring genes

Get all hits from isolates that share a set of genes. You can then use Cross-browser selection to link to the Isolates Browser (and subsequently the SNP Tree Viewer) to get more information about those isolates. For example, get the set of hits from all isolates that share a blaTEM-1 and a blaKPC gene.
  • Open the MicroBIGG-E: Microbial Browser for Identification of Genetic and Genomic Elements.
  • Search for hits from isolates that have a blaKPC gene and a blaTEM-1 allele.
  • To do this, enter the following search:
    genes_on_isolate:blaTEM-1 AND genes_on_isolate:blaKPC*
    (Note that field-specified searches are case-sensitive, and separate sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards such as the asterisk.)
  • Click the Cross-browser selection button to the right of the Download button (you must be logged into your myNCBI account for this functionality).
  • A new tab will open in the Isolates Browser containing all of the isolates meeting your search criteria. You can then examine clusters of interest using the SNP Tree Viewer, or perform other tasks in the Isolates Browser.
Find the isolates and alleles described by a publication of interest

As an example, the paper by Shields et al. describes a two-amino-acid deletion in blaACT alleles that confers resistance to ceftazidime-avibactam (PubMed ID 32236408). Use the publication and MicroBIGG-E to retrieve and examine the isolates and specific alleles reported in the paper:
  • Identify the isolates, "Surv196" and "ENT630," from the paper that have these blaACT variants.
  • Open the MicroBIGG-E: Microbial Browser for Identification of Genetic and Genomic Elements.
  • Search for the isolates in the strain data field by entering the following query:
    strain:Surv196 OR strain:ENT630
    (Note that field-specified searches are case-sensitive, and separate sections of this document provide tips about search terms that contain special characters (such as the parentheses, hyphens, and apostrophes), and the use of wildcards such as the asterisk.)
  • Identify the blaACT alleles among the genetic/genomic elements that are retrieved by MicroBIGG-E.
  • For the allele of interest, retrieve the corresponding WP_* accession from the Protein database to view the RefSeq protein sequence record. In this case, the accessions for the blaACT proteins that were identified on the isolate genomes are: WP_154123408.1 (on the Surv196 isolate) and WP_152819218.1 (on the ENT630 isolate).
  • Note that a WP_* accession can appear in the protein_acc column and/or the closest_reference_acc column.
    • Use the choose columns function to display the desired data fields, as only a subset are displayed by default.
    • The protein_acc column and closest_reference_acc column might contain the same value (if the protein sequence of the element that was annotated on the isolate genome is identical to the reference protein sequence), or different values (if the protein sequence of the element that was annotated on the isolate genome is not identical to the reference protein sequence).
    • If a WP_* accession is not linked to the Protein database, you can search for the accession number directly in the Protein database.

Submit sequence and phenotype data related to AMR

Download AMR Raw Data

The AMR subdirectory of the Pathogens FTP site allows Raw Data Download. It includes AMRFinderPlus data files and the Bacterial Antimicrobial Resistance Reference Gene Database (BioProject PRJNA313047).
(An overview of the Pathogens FTP site is provided below.)


FTP Site help

What data are on the Pathogens FTP site?

The NCBI Pathogen Detection analysis pipeline artifacts are copied to FTP for bulk downloading. The data that are available include metadata tables, cluster lists, and individual SNP trees, and mimic what is available in the Pathogen Browser. In addition, there are files that support efforts for antimicrobial resistance gene detection including reference tables, and files used by NCBI AMRFinderPlus.

How are the Pathogens data organized on the FTP site?

Results directory | Reference directory | Antimicrobial Resistance directory | Other directories
  • Results directory:
    Individual phylogenetic trees for each SNP cluster are available in the Pathogens FTP "Results" directory.

    Note: Individual phylogenetic trees for each SNP cluster are also accessible from the NCBI Pathogen Detection Isolates Browser. In the Isolates Browser, isolates that have a "PDS*" accession number in the "SNP Cluster" column have a link to the SNP Tree Viewer, which provides an interactive display of the SNP cluster.

    Each folder in this directory contains the data analysis results, such as phylogenetic distance trees, for a given organism group. The folders contain the results of the most current data analyses, as well as archival results from previous analyses. The results for a given organism group are updated daily for each taxgroup, only if new data arrives. Archived results are stored according to the data retention policy.
    • Organism group folders - These folders contain the results of data analyses, such as phylogenetic distance trees, that were done on the genome assemblies of isolates within each organism group in the Pathogen Detection Project. Within a given organism group, the subfolder named with the most recent Pathogen Detection Group accession.version number (PDGxxxxxxxxxx.xxx*) contains the most recent results. The results for a given organism group are updated daily for each taxgroup, only if new data arrives. The "latest_kmer" and "latest_snps" links point to the most recent results for kmer and SNP analyses, respectively. Because those analyses may be produced asynchronously, the two links may point to different PDG versions; otherwise they both point to the most recent PDG version.
      • Rapid_reports for select organisms - This directory is a pilot phase test of rapid reporting based solely on wgMLST allele differences and is only operational for a few submitters for a few organisms. The FTP Rapid Reports for a given organism are updated on average within an hour of receiving sequence read submissions for a new isolate.
  • Reference directory:
    This directory does NOT include real-time analysis results, and is only based on genomes available in GenBank that are not submitted as part of surveillance networks to SRA.
  • Antimicrobial_resistance directory:
    This directory contains the reference table for AMR genes, and the data files used for AMRFinderPlus.
    For more information on NCBI's efforts on antimicrobial resistance, see this page:
    /pathogens/antimicrobial-resistance/.
    For more information on AMRFinderPlus see this page:
    /pathogens/antimicrobial-resistance/AMRFinder/.
  • Other directories:
    For descriptions of the other subdirectories see the FTP README file.
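As a rough illustration of how the most recent results subfolder might be picked from a folder listing, the sketch below compares PDG version numbers numerically. The function name, the regular expression, and the sample folder names are assumptions made for this example; in practice the "latest_kmer" and "latest_snps" links described above already point to the most recent results.

```python
import re

# Assumed naming pattern: a PDG accession, a dot, then an integer version.
PDG_PATTERN = re.compile(r"^(PDG\d+)\.(\d+)$")


def latest_pdg_folder(folder_names):
    """Return the PDG folder name with the highest version, or None."""
    best_name = None
    best_version = -1
    for name in folder_names:
        match = PDG_PATTERN.match(name)
        if match is None:
            continue  # skip entries such as latest_snps or latest_kmer
        version = int(match.group(2))  # compare as numbers, not as text
        if version > best_version:
            best_name, best_version = name, version
    return best_name


# Hypothetical listing of one organism group folder.
folders = [
    "PDG000000002.999",
    "PDG000000002.1210",
    "PDG000000002.1212",
    "latest_kmer",
    "latest_snps",
]
print(latest_pdg_folder(folders))
```

Comparing the versions as integers matters here: a plain text sort would rank version 999 above version 1212.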

FTP Readme File



Data Submissions

  • Please refer to these instructions to submit data to the NCBI Pathogen Detection resource.



  • Data Processing Pipeline

    Overview

    NCBI has developed a multi-stage pipeline with two goals: 1) clustering closely related pathogen isolates and 2) identifying antimicrobial resistance genes/proteins in pathogen genomes. The pipeline first assembles the short read sequence data for an isolate into a genome sequence. This includes targeted assembly for certain genes of interest (such as AMR genes) for increased sensitivity. Second, the pipeline clusters the genomes from the assembly process along with the genomes found in GenBank for each organism (see Organism Table for current list). Third, phylogenetic trees are reconstructed after SNP calling within each cluster. The fourth step involves annotation and identification of AMR genes. Details on the full pipeline will be published at a later date. Note: there is a small pilot pipeline that simply assembles the reads and uses the wgMLST scheme to generate a table of nearest neighbors. That pipeline currently runs only for Listeria and Salmonella.

    I.   Assembly pipeline.

    • The assembly pipeline uses SKESA to generate de novo assemblies as well as the guided assembler SAUTE to sensitively and comprehensively catalog antimicrobial resistance genes. The current pipeline only assembles Illumina data; assemblies from other sequencing technologies are included when they are uploaded to GenBank. Note that the de novo and guided assembler pipelines may both independently assemble the same region of the genome, so there will often be duplicated sequence in the final assembly.
    For any BioProject that is flagged to monitor for incoming data, the assembly process starts automatically as data are submitted. Not all BioProjects are flagged, and not all SRA data are automatically added to the system. Note that the assemblies generated by this process are submitted to GenBank when possible.

    II.   Clustering.

    There are two different clustering pipelines in operation. Clustering starts automatically once a day for each organism, and only if new data have been submitted.
    1. The first uses a reference wgMLST scheme (one for each organism if one exists), identifies the loci and alleles in each assembled genome, and uses a 25-allele cut-off to cluster related isolates. This system is gradually being rolled out. Most of the taxgroups with large numbers of submitted isolates are using the wgMLST method. A hard cut-off of 1,000 isolates is in place before a reference wgMLST scheme is developed, so not all organisms will be switched to this system.
    2. The second uses k-mer distances to first cluster related isolates, followed by a first-pass SNP analysis. Clusters are created using 50-SNP single-linkage clustering. This system is gradually being replaced by the wgMLST method but will remain for those organisms that have fewer than 1,000 isolates.
    For BOTH pipelines, once clusters are created, within each cluster of closely related isolates, a reference assembly is chosen, assemblies are aligned, SNPs are called, and phylogenetic trees are inferred. For each organism group there will be isolates that do not end up in a cluster. For those that do end up in a cluster, the cluster sizes can be from size two to several thousand.
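The idea of single-linkage clustering with a distance cut-off can be sketched as follows. This is a minimal illustration and not the pipeline's implementation: the function name, the union-find approach, the toy distances, and the choice to treat a distance equal to the cut-off as linked are all assumptions made for the example.

```python
from itertools import combinations


def single_linkage_clusters(isolates, distance, threshold=50):
    """Group isolates so that any pair within `threshold` shares a cluster.

    Isolates are linked transitively (single linkage). Isolates with no
    neighbor within the threshold are left out, since a cluster needs at
    least two members.
    """
    parent = {iso: iso for iso in isolates}

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(isolates, 2):
        if distance(a, b) <= threshold:  # assumption: cut-off is inclusive
            parent[find(a)] = find(b)

    groups = {}
    for iso in isolates:
        groups.setdefault(find(iso), []).append(iso)
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)


# Toy pairwise SNP distances (invented for illustration).
toy = {
    ("A", "B"): 10, ("B", "C"): 40, ("A", "C"): 70,
    ("A", "D"): 300, ("B", "D"): 310, ("C", "D"): 290,
}


def toy_distance(a, b):
    return toy.get((a, b), toy.get((b, a)))


clusters = single_linkage_clusters(["A", "B", "C", "D"], toy_distance)
print(clusters)  # [['A', 'B', 'C']]
```

In the toy data, A and C are 70 SNPs apart yet land in the same cluster because each is within the cut-off of B. Isolate D has no close neighbor and remains unclustered, mirroring the isolates that do not end up in a cluster.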

    III.   Phylogenetic tree reconstruction.

    For each cluster, a phylogenetic tree is reconstructed from the SNPs for that cluster by using the maximum compatibility criterion.

    IV.   Annotation and antimicrobial gene/protein identification.

    Annotation of assembled genomes uses the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) system. Antimicrobial resistance (AMR) genes are identified using AMRFinderPlus (additional details are provided in an overview about AMRFinderPlus and a publication by Feldgarden M, et al., 2019). Genes are grouped into genotype categories, as described below.

    Each assembled genome that passes validation criteria will end up in the NCBI Pathogen Detection Isolates Browser. Each SNP cluster is also available, both on FTP as well as in the NCBI Pathogen Detection Isolates Browser. AMR results are available both on FTP and in the browser as a separate column. Rapid Reports are only available on FTP.

    New isolates are analyzed using the latest version of the AMRFinderPlus software and the latest version of the Pathogen Detection Reference Gene Catalog (read about the Reference Gene Catalog). Older isolates may have been analyzed with earlier versions of the AMRFinderPlus software and the Reference Gene Catalog. Annotation may occasionally be updated on all isolates in special circumstances, such as the identification of new genes (e.g., mobilized colistin resistance (mcr) genes). Data fields in the Isolates Browser indicate the analysis type (amrfinderplus_analysis_type), AMRFinderPlus version (amrfinderplus_version), and Reference Gene Catalog version (refgene_db_version) that were used in the analysis of a given isolate.

    (Separate sections of this file provide Isolates Browser help documentation and an overview of the data available on the FTP site. The AMRFinderPlus wiki provides details about installing and running the program, interpreting the results, and methods used for isolate genome analysis.)

    Genotype Categories

    The genes identified in an isolate's genome by the NCBI Pathogen Detection data processing pipeline are grouped into genotype categories.

    The stand-alone AMRFinderPlus software produces a detailed categorization, based on the method used to identify the genotypes. (The AMRFinderPlus wiki provides details about the methods, under "Running AMRFinderPlus > Output Format > Fields > Method".)

    The Isolates Browser web interface displays a simplified categorization of genotypes. (The genotype categories appear when you use the choose columns function to display data such as AMR genotypes (AMR_genotypes), Stress genotypes (stress_genotypes), and/or Virulence genotypes (virulence_genotypes).)

    The table below shows the correspondences between the AMRFinderPlus methods used to identify genotypes and the simplified genotype categories displayed by the Isolates Browser web interface:

    • AMRFinderPlus Methods: ALLELEP, ALLELEX, BLASTP, BLASTX, EXACTP, EXACTX
      Genotype Category in the Isolates Browser web display: COMPLETE
      Notes: "Complete" genes are sequences that have BLAST alignments that cover ≥ 90% of the reference protein in the Pathogen Detection Reference Gene Catalog (sometimes referred to as the AMRFinderPlus database).
      Specifically:
      • Those identified by the ALLELEP or ALLELEX method have a 100% sequence match to 100% of length to a protein named at the allele level in the Pathogen Detection Reference Gene Catalog.
      • Those identified by the EXACTP or EXACTX method have a 100% sequence match to 100% of length to a protein in the Pathogen Detection Reference Gene Catalog that is not a named allele.
      • Those identified by the BLASTP or BLASTX method have a BLAST alignment that covers > 90% of the length, and a sequence identity of > 90% (default cutoff), to a protein in the Pathogen Detection Reference Gene Catalog. For some genes, however, the sequence identity cutoff may be higher or lower, based on manual curation.
      The suffix "P" refers to Protein BLAST (protein vs protein sequence comparisons), and the suffix "X" refers to Translated BLAST (nucleotide vs protein sequence comparisons).

    • AMRFinderPlus Method: HMM
      Genotype Category in the Isolates Browser web display: HMM
      Notes: These are proteins that were found by HMM only, more distant to reference proteins than our BLAST cutoffs. (The HMM was hit above the cutoff, but there was not a BLAST hit that met standards for BLAST or PARTIAL. This does not have a suffix of "P" or "X" because only protein sequences are searched by HMM.)

    • AMRFinderPlus Method: INTERNAL_STOP
      Genotype Category in the Isolates Browser web display: MISTRANSLATION
      Notes: Indicates a stop codon was found within the BLASTX alignment of the nucleotide sequence to the reference protein. In the future this may be extended to include frame shifts (which are currently not directly detected by AMRFinderPlus).

    • AMRFinderPlus Methods: PARTIALP, PARTIALX
      Genotype Category in the Isolates Browser web display: PARTIAL
      Notes: "Partial" genes are identified by BLAST to cover > 50% but < 90% of the length of the reference sequence, and the BLAST alignment does not end at a contig boundary. The aligned region has > 90% identity to the reference protein (default cutoff). For some genes, however, the sequence identity cutoff may be higher or lower, based on manual curation.

    • AMRFinderPlus Methods: PARTIAL_CONTIG_ENDP, PARTIAL_CONTIG_ENDX
      Genotype Category in the Isolates Browser web display: PARTIAL_END_OF_CONTIG
      Notes: "Partial end of contig" genes are "partial" alignments that end at contig boundaries, indicating that they are more likely to have been split by a sequencing or assembly issue. Like "partial" genes, these are identified by BLAST to cover > 50% but < 90% of the length of the reference sequence. The aligned region has > 90% sequence identity to the reference (default cutoff). For some genes, however, the sequence identity cutoff may be higher or lower, based on manual curation.

    • AMRFinderPlus Methods: POINTN, POINTP, POINTX
      Genotype Category in the Isolates Browser web display: POINT
      Notes: Point mutation identified by BLAST:
      • POINTN mutations were identified by nucleotide BLAST (BLASTN)
      • POINTP mutations were identified by protein BLAST (BLASTP)
      • POINTX mutations were identified by translated BLAST (BLASTX)
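For readers who post-process stand-alone AMRFinderPlus output, the correspondence above can be written as a simple lookup. This is a sketch only: the dictionary and function names are hypothetical and are not part of AMRFinderPlus, and the mapping covers only the methods listed in this section.

```python
# Method names and categories are taken from the correspondence table above.
METHOD_TO_CATEGORY = {
    "ALLELEP": "COMPLETE",
    "ALLELEX": "COMPLETE",
    "BLASTP": "COMPLETE",
    "BLASTX": "COMPLETE",
    "EXACTP": "COMPLETE",
    "EXACTX": "COMPLETE",
    "HMM": "HMM",
    "INTERNAL_STOP": "MISTRANSLATION",
    "PARTIALP": "PARTIAL",
    "PARTIALX": "PARTIAL",
    "PARTIAL_CONTIG_ENDP": "PARTIAL_END_OF_CONTIG",
    "PARTIAL_CONTIG_ENDX": "PARTIAL_END_OF_CONTIG",
    "POINTN": "POINT",
    "POINTP": "POINT",
    "POINTX": "POINT",
}


def genotype_category(method):
    """Return the simplified category for a method, or None if unlisted."""
    return METHOD_TO_CATEGORY.get(method.strip().upper())


print(genotype_category("BLASTX"))               # COMPLETE
print(genotype_category("PARTIAL_CONTIG_ENDP"))  # PARTIAL_END_OF_CONTIG
```

Returning None for an unlisted method, rather than guessing a category, keeps the lookup honest if other method names appear in the output.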


    Data Retention and History Tracking

    Pathogen Reference Data and Analysis Results Continue to Evolve

    • Unlike many other databases and resources at NCBI, the Pathogen Detection Project was designed to provide updated analyses in real time, so the content of the resource may be updated multiple times per day. For any given pathogen isolate, organism group, or SNP cluster, the Pathogen Detection Browsers display, by default, the most current data and analysis results, including the relationships among isolates that have been calculated by the data processing pipeline. For most uses of the browsers, therefore, the latest data are presented. Of the browsers, only the Isolates Browser retains some tracking of history, as described below. The Reference Browsers (Reference Gene Catalog, Reference Gene Hierarchy, and Reference HMM Catalog) show data only for the most recent release. A complete history is maintained on the FTP site. See AMRFinderPlus Reference Data Retention for details.

    Isolates Browser data retention

    Three critical Isolates Browser data objects are tracked

    • The system tracks versions for three critical data objects:
      1. the assembly of any isolate ("PDT")
      2. the SNP cluster of an isolate if it belongs to one ("PDS")
      3. the organism group ("PDG")
      The organism group is the entire package of new isolate updates, which could consist of both new or updated assemblies as well as new or updated clusters. Updates of each organism group could occur as frequently as every 24 hours, and as each organism group is independent of one another, multiple organism groups could be updated in a given day.

    Accession.Versions are used to track changes

    • The Pathogen Detection Project assigns an accession.version to each isolate genome assembly, organism group, and SNP cluster in order to track changes to the pathogens data and analysis results. For example:

      • The Pathogen Detection Target ("PDT" accession.version) is the genome assembly for an individual isolate.
        A new version of a PDT record indicates a change in the assembly.
      • The Pathogen Detection SNP cluster ("PDS" accession.version) is a group of isolates that are closely related, based on the SNP distance between their genome assemblies as calculated by the Pathogen Detection Project data processing pipeline.
        A new version of a PDS record may indicate changes such as the following:
        • The SNP cluster changed its membership.
        • Some of its SNP distances have changed among the isolates that are members of the SNP cluster

      • The Pathogen Detection Group ("PDG" accession.version) is also known as an organism group.
        A new version of a PDG record reflects additions or deletions of isolates, or changes to isolate assemblies. These isolate assembly changes may or may not be accompanied by changes to SNP clusters (additions, deletions, modifications). The Pathogen Detection Project retains the most recent 300 versions of a PDG.
        • Technical note: An organism group (PDG) contains one or more targets (PDTs). A PDT is a member of zero or one SNP cluster (PDS), and never more than one cluster. A SNP cluster is composed of two or more PDTs, and each PDS is completely contained within a PDG.

      As the data and analysis results evolve, the Pathogen Detection Project applies data retention and history tracking policies as described below.
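The containment rules in the technical note above can be expressed as a small consistency check. This is an illustrative sketch: the function name, the dictionary layout, and the accession values are invented for the example and do not correspond to any NCBI tool or real record.

```python
from collections import Counter


def check_organism_group(pdt_to_pds):
    """Check one organism group (PDG) against the containment rules.

    `pdt_to_pds` maps each PDT accession.version in the group to the PDS
    it belongs to, or to None when the isolate is unclustered. Using a
    mapping means a PDT can never be assigned to more than one PDS.
    """
    problems = []
    if not pdt_to_pds:
        problems.append("an organism group must contain at least one PDT")
    cluster_sizes = Counter(
        pds for pds in pdt_to_pds.values() if pds is not None
    )
    for pds, size in sorted(cluster_sizes.items()):
        if size < 2:
            problems.append(
                f"{pds} has only {size} member(s); "
                "a SNP cluster needs two or more"
            )
    return problems


# Hypothetical accessions, for illustration only.
group = {
    "PDT000000001.1": "PDS000000001.1",
    "PDT000000002.1": "PDS000000001.1",
    "PDT000000003.1": None,              # unclustered isolate
    "PDT000000004.1": "PDS000000002.1",  # a cluster of one: invalid
}
for problem in check_organism_group(group):
    print(problem)
```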

    Two interactions that may not present the most up to date information

    • As noted under Pathogen Data and Analysis Results Continue to Evolve, the latest data are presented by default for most uses of the browser. There are only two specific user interactions with the browser that may not present the most up-to-date information: 1) searches with specific accession.versions of one of the three objects mentioned above (PDT, PDG, PDS) that are from older analyses, and 2) the "share URL" button on the SNP Tree Viewer. For both of these cases there is a data retention policy in place that culls older data (i.e., removes versions of isolates, organism groups, and SNP clusters that were retired more than 30 days ago) so that the system does not need to retain every single piece of data ever calculated.

    Two states for data retention besides the most up to date version

    • There are two states for data retention besides the most up to date version. These include a window of 30 days where older data can be viewed, including the SNP tree as it looked for that particular version, and beyond that, the interface will present the user with links to the most current versions of that data.

    Example scenarios: requests for previous versions of data that are older than 30 days

    • If you try to view previous versions of the data, the following scenarios can occur:
      • If you request an accession.version that is older than the 30-day retention period, you can no longer see the content (e.g., phylogenetic tree, SNP distances, metadata) for a target or cluster. However, the Pathogen Browser will indicate the current version of a requested target or cluster. It can also help you find successor target or cluster(s) if the requested target or cluster no longer exists. These hints are displayed at the top of the Browser.
        • For example, if you enter PDT000000625.5 in the Search Isolates box, you get a message like this:
          Record PDT000000625.5 replaced by PDG000000002.1212/PDT000000625.6. The system is directing you to the newer version PDT000000625.6 published in PDG version PDG000000002.1212.

      • If the requested cluster no longer exists, then a list of one or more successor clusters may be presented. The Pathogen Browser determines the past target membership of the requested cluster and traces forward to the current clusters that contain those targets. This allows forward tracking of a cluster when the cluster has split or merged over time, or has been completely replaced.
        • For example, if you enter PDS000029842.1 in the Search Isolates box, you get a message like this:
          SNP cluster(s) succeeded by PDG000000002.1212/PDS000032550.9.

      • Occasionally a target may be withdrawn (taken out of service) usually as a result of data retraction by a submitter. When you request such a target, the Pathogen Browser will try to direct you to the cluster (or its successor(s)) that once included the target as a member.
        • For example, if you enter PDT000111278.1 you will get a message like this:
          Record removed: PDT000111278.1 SNP cluster(s) succeeded by PDG000000002.1212/PDS000028815.20.

      • Using a shared URL that you either made in the past or got from a collaborator could result in any one of the following, depending on the age of the shared URL and whether the URL refers to actual content within the 30d retention period:
        • A tree viewer display (if the URL refers to current data, or to data that is still available as a result of the 30 day retention policy)
          OR
        • A history tracking message such as the ones in the examples above (if the URL is less than 60 days old and refers to data that is no longer available in its previous form)
          OR
        • A message saying the URL has expired (if the URL is more than 60 days old). In that case, if you are still interested in viewing the isolate, organism group, or SNP cluster that is cited in the URL, you can enter the corresponding PDT*, PDG*, or PDS* accession number in the Isolates Browser to access the most recent version of the data.

    Shared URLs are valid for 60 days

    • A "Share" button is available in the SNP Tree Viewer display (as shown in part C of the illustrated example of a SNP Tree Viewer display). It produces a URL that captures your customized view of the tree, which can then be copied and shared with others to reproduce the same view.
    • The URL is temporary, remaining valid for 60 days:

      • For the first 30 days, the URL will open the customized display, showing the isolates you selected and any other customizations you made to the view.
      • For the second 30 days, the URL continues to be valid, but during that time, it will only show a link to the default display for the most recent version of the SNP cluster. That is, the URL will not open the original customized view, but instead will redirect to a version of the phylogenetic distance tree that reflects the most recent data for the tree.
      As mentioned near the top of this section on data retention and history tracking, the composition of a tree can change over time as new data are added to the Pathogen Detection Project. Even if a tree remains unchanged, however, a saved URL is only retained in the system for 60 days.
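The two 30-day windows for a shared URL can be summarized in a short sketch. The function name and the returned labels are invented for this example, and treating day 30 and day 60 as still inside their windows is an assumption, since the text above does not state the exact boundaries.

```python
from datetime import date


def shared_url_state(created, today):
    """Classify a shared SNP Tree Viewer URL by its age in days."""
    age_days = (today - created).days
    if age_days < 0:
        raise ValueError("creation date is in the future")
    if age_days <= 30:   # assumption: day 30 still opens the custom view
        return "customized view"
    if age_days <= 60:   # assumption: day 60 is still valid
        return "link to most recent default view"
    return "expired"


created = date(2024, 1, 1)
print(shared_url_state(created, date(2024, 1, 31)))  # customized view
print(shared_url_state(created, date(2024, 2, 1)))   # link to most recent default view
print(shared_url_state(created, date(2024, 3, 2)))   # expired
```

The sketch covers URL age only. As described in the example scenarios above, a URL in its first 30 days may still lead to a history tracking message if the underlying data are no longer available in their previous form.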

    Isolates browser data published to FTP are also subject to retention policies

    • Progressive retention policy:

      • Every publication within 30 days
      • One publication per week after 30 days but within 6 months
      • One publication per month after 6 months but within 1 year
      • One publication per year thereafter
      For latest details please consult the FTP ReadMe.txt file.
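The progressive retention policy above can be read as a mapping from the age of a publication to how densely publications of that age are kept. The sketch below is illustrative only: the function name is hypothetical, and the use of 30-day months and a 365-day year, along with inclusive boundaries, are assumptions that the policy text does not specify.

```python
def ftp_retention_tier(age_days, days_per_month=30, days_per_year=365):
    """Return the retention tier that applies to a publication's age."""
    if age_days < 0:
        raise ValueError("age cannot be negative")
    if age_days <= 30:
        return "every publication"
    if age_days <= 6 * days_per_month:  # assumption: 6 months = 180 days
        return "one publication per week"
    if age_days <= days_per_year:
        return "one publication per month"
    return "one publication per year"


for age in (7, 45, 200, 800):
    print(age, ftp_retention_tier(age))
```

With these assumptions, a 7-day-old publication is always kept, a 45-day-old one is kept only if it is that week's retained publication, and so on down to one per year.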

    MicroBIGG-E data retention

    Reference browser data retention


    Log of Changes to Pathogen Detection Project

    AMRFinderPlus database change log

    References

    Citing the Pathogen Detection Project
    NCBI Publications/Methods used by the Pathogen Detection Project
    Third party Publications/Methods used by the Pathogen Detection Project
    Publications from Other Sources using the Pathogen Detection Browser
    Publications from External Labs using the Pathogen Detection Browser
    Presentations about the Pathogen Detection Project
    References about the Genomics for Food Safety (GenFS) initiative
    References about the FDA GenomeTrakr project and WGS activities
    References about the CDC PulseNet network and WGS activities
    References about Public Health England WGS activities
    Other related references
    References on antimicrobial resistance, including AMRFinder

    Citing the NCBI Pathogen Detection Project

    • The NCBI Pathogen Detection Project [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information. 2016 May [cited YYYY MMM DD]. Available from: https://www.ncbi.nlm.nih.gov/pathogens/

    NCBI Publications/Methods used by the Pathogen Detection Project

    The SAUTE reference-guided assembler is used in the Pathogen Detection pipeline in conjunction with gene sequences from the AMRFinderPlus data release. Software is available at https://github.com/ncbi/skesa/releases
    • Souvorov A and Agarwala R. SAUTE: sequence assembly using target enrichment. BMC Bioinformatics. 2021 Jul 21;22(1):375. doi: 10.1186/s12859-021-04174-9. PubMed PMID: 34289805; Full text at BMC.
    The SKESA assembler is used in the Pathogen Detection pipeline. Software is available at https://github.com/ncbi/SKESA
    The maximum compatibility algorithm is used to create the SNP trees in the Pathogen Detection browser. Software is available at https://ftp.ncbi.nih.gov/pub/jcherry/compat/

    The AMRFinderPlus software is used to identify antimicrobial resistance genes plus select virulence, biocide, metal, and stress resistance genes. Software is available at https://github.com/ncbi/amr/wiki

    • Feldgarden M, Brover V, Fedorov B, Haft DH, Prasad AB, Klimke W. Curation of the AMRFinderPlus databases: applications, functionality and impact. Microb Genom. 2022 Jun;8(6). doi: 10.1099/mgen.0.000832. PubMed PMID: 35675101; Full text at Microbial Genomics.
    • Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, Hoffmann M, Pettengill JB, Prasad AB, Tillman GE, Tyson GH, Klimke W. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021 Jun 16;11(1):12728. https://doi.org/10.1038/s41598-021-91456-0. PubMed PMID: 34135355; Full text at Nature Scientific Reports.
    • Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu C-H, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster JP, Klimke W. Validating the NCBI AMRFinder Tool and Resistance Gene Database Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of NARMS Isolates. Antimicrobial Agents and Chemotherapy. 2019 Nov 1;63(11). doi: 10.1128/AAC.00483-19. PubMed PMID: 31427293; Full text in PubMed Central PMCID: PMC6811410; Full text at AAC.
    The PGAP annotation pipeline is used to annotate bacterial assemblies. Software is available at: https://github.com/ncbi/pgap

    Third Party Publications/Methods used by the Pathogen Detection Project

    SeqSero2 is run on Salmonella assemblies to populate the serotype and antigen_formula values in the computed_types field of the Pathogen Detection Isolates Browser

    Publications from other sources using the Pathogen Detection Browser

    Pro Publica Used Genomic Sequencing Data to Track an Ongoing Salmonella Outbreak.

    Publications from External Labs using the Pathogen Detection Browser

    Scientists use the Pathogen Detection System to link isolates from Southeast Asia to clinical cases in England and the US, some with a history of travel.
    • Schwan CL, Dallman TJ, Cook PW, Vipham J (2022) A case report of Salmonella enterica serovar Corvallis from environmental isolates from Cambodia and clinical isolates in the UK. Access Microbiology: Vol4(1) https://doi.org/10.1099/acmi.0.000315
    Economic evaluation of whole genome sequence analysis using the publicly available data in the Pathogen Detection System.
    • Brown B, Allard M, Bazaco MC, Blankenship J, Minor T (2021) An economic evaluation of the Whole Genome Sequencing source tracking program in the U.S. PLoS ONE 16(10): e0258262. https://doi.org/10.1371/journal.pone.0258262
    Scientists in Oregon public health institutions use NCBI Pathogen Detection Browser to identify pathogenic Escherichia coli O157:H7 from venison from harvested deer and clinical cases from hunters in the same area.
    • Ladd-Wilson SG, Morey K, Turpen L, DeMarco K, Van Der Veen G, Fontana JL, Dannenhoffer RL, Tenney K, Kutumbaka KK, Samadpour M, Cieslak PR. Escherichia coli O157:H7 Cluster Associated With Deer Harvested at a Single Wildlife Hunting Area, Oregon, 2017. Full text at Public Health Reports.
    Scientists at multiple institutions use NCBI Pathogen Detection Browser for evaluation of Shigella isolates. The analysis includes evaluation of macrolide resistance and plasmid structure, and it identified multiple outbreaks in the United States and evidence of intercontinental transmission.
    • Worley JN, Javkar K, Hoffmann M, Hysell K, Garcia-Williams A, Tagg K, Kanjilal S, Strain E, Pop M, Allard M, Francois Watkins L, Bry L. Genomic Drivers of Multidrug-Resistant Shigella Affecting Vulnerable Patient Populations in the United States and Abroad. PubMed PMID: 33500335; Full text at mBio.
    Scientists at New York State Department of Health use NCBI Pathogen Detection Browser for retrospective analysis of Clostridium perfringens outbreaks
    • Carey J, Cole J, Venkata SLG, Hoyt H, Mingle L, Nicholas D, Musser KA, Wolfgang WJ. Genomic Epidemiology of Historical Clostridium perfringens Outbreaks in New York State Using Two Web-based Platforms: National Center for Biotechnology Information-Pathogen Detection and FDA-GalaxyTrakr. PubMed PMID: 33177125; Full text at Journal of Clinical Microbiology.
    Scientists at multiple public health agencies use NCBI Pathogen Detection Browser for international Listeria outbreak
    • Pettengill J, Markell A, Conrad A, Carleton H, Beal J, Rand H, Musser S, Brown E, Allard M, Huffman J, Harris S, Wise M, Locas A. A multinational listeriosis outbreak and the importance of sharing genomic data. Full text at The Lancet.
    Scientists at BWH use NCBI Pathogen Detection Browser to examine C. difficile transmission
    Oregon Health Authority uses NCBI Pathogen Detection Browser to Uncover Outbreak
    Israeli Ministry of Health uses Pathogen Detection Browser and AMRFinder Results to Examine Multi-Drug Resistant Shigella spp.
    Scientists at University of Pretoria, South Africa, use antimicrobial resistance data from the NCBI Pathogen Detection Browser to examine the genomic epidemiology of African Gram-negative bacteria
    • Sekyere JO, and Reta MA. Genomic and Resistance Epidemiology of Gram-Negative Bacteria in Africa: a Systematic Review and Phylogenomic Analyses from a One Health Perspective. PubMed PMID: 33234606; Full text at mSystems.
    Department of Civil Engineering, Hawaii, Uses Pathogen Browser for Analysis of Municipal Wastewater Salmonella Isolates

    Presentations about the Pathogen Detection Project

    Freely available: The APHL webinar below requires registration:
    • APHL Webinar: "Using the NCBI Pathogen Detection Portal to Aid in Surveillance of Enteric Pathogens," March 8, 2018, hosted by the Association of Public Health Laboratories (WebEx presentation: slides and audio (1 hour 6 minutes; the archived program is available until March 7, 2019), slides only)

    References about the Genomics for Food Safety (GenFS) initiative

    The Genomics for Food Safety Interagency Collaboration (CDC, FDA, USDA-FSIS, and NCBI-NLM-NIH) is described including Pathogen Detection.
    • Stevens EL, Carleton HA, Beal J, Tillman GE, Lindsey RL, Lauer AC, Pightling A, Jarvis KG, Ottesen A, Ramachandran P, Hintz L, Katz LS, Folster JP, Whichard JM, Trees E, Timme RE, McDermott P, Wolpert B, Bazaco M, Zhao S, Lindley S, Bruce BB, Griffin PM, Brown E, Allard M, Tallent S, Irvin K, Hoffmann M, Wise M, Tauxe R, Gerner-Smidt P, Simmons M, Kissler B, Defibaugh-Chavez S, Klimke W, Agarwala R, Lindsay J, Cook K, Austerman SR, Goldman D, McGarry S, Hale KR, Dessai U, Musser SM, Braden C. Use of Whole Genome Sequencing by the Federal Interagency Collaboration for Genomics for Food and Feed Safety in the United States. J Food Prot. 2022 May 1;85(5):755-772. doi: 10.4315/JFP-21-437. PubMed PMID: 35259246.
    A publication describing datasets for phylogenetic validation based on WGS of four foodborne pathogens from the data standards working group:

    References about the FDA GenomeTrakr project and WGS activities

    FDA Podcast on Food Safety and WGS:
    FDA 2021 Focus on Regulatory Science:
    GenomeTrakr proficiency testing:
    GenomeTrakr proficiency testing:
    Demonstration of the value of WGS and data sharing:
    FDA uses Isolates Browser for Listeria ice cream outbreak analysis
    • Allard MW, Strain E, Rand H, Melka D, Correll WA, Hintz L, Stevens E, Timme R, Lomonaco S, Chen Y, Musser SM, Brown EW. Whole genome sequencing uses for foodborne contamination and compliance: Discovery of an emerging contamination event in an ice cream facility using whole genome sequencing. Infect Genet Evol. 2019 Sep;73:214-220. doi: 10.1016/j.meegid.2019.04.026. Epub 2019 Apr 27. PubMed PMID: 31039448; Full text at Infection, Genetics and Evolution
    FDA uses Isolates Browser for Salmonella enterica Analyses
    • Trinetta V, Magossi G, Allard MW, Tallent SM, Brown EW, Lomonaco S. Characterization of Salmonella enterica Isolates From Selected U.S. Swine Feed Mills by Whole-Genome Sequencing. Foodborne Pathog Dis. 2020 Feb;17(2):126-136. doi: 10.1089/fpd.2019.2701. Epub 2019 Nov 8. PubMed PMID: 31702400; Full text at Foodborne Pathog Dis.
    FDA GenomeTrakr Protocols IO

    References about the CDC PulseNet network and WGS activities

    Pathogen Genomes in Public Health - Cites NCBI Pathogen Detection Isolates Browser as a Model for Open Information
    PulseNet vision statement:
    • Nadon C, Van Walle I, Gerner-Smidt P, Campos J, Chinen I, Concepcion-Acevedo J, Gilpin B, Smith AM, Man Kam K, Perez E, Trees E, Kubota K, Takkinen J, Nielsen EM, Carleton H; FWD-NEXT Expert Panel. PulseNet International: Vision for the implementation of whole genome sequencing (WGS) for global food-borne disease surveillance. Euro Surveill. 2017 Jun 8;22(23). pii: 30544. doi: 10.2807/1560-7917.ES.2017.22.23.30544. Review. PubMed PMID: 28662764; Full text in PubMed Central PMCID: PMC5479977; Full text at Eurosurveillance
    PulseNet 20th anniversary announcement:
    • Announcement: 20th Anniversary of PulseNet: the National Molecular Subtyping Network for Foodborne Disease Surveillance - United States, 2016. MMWR Morb Mortal Wkly Rep. 2016 Jun 24;65(24):636. doi: 10.15585/mmwr.mm6524a5. PubMed PMID: 27337605; Full text at CDC
    Showing that the switch to WGS results in decreased cluster sizes and more outbreaks solved:
    • Jackson BR, Tarr C, Strain E, Jackson KA, Conrad A, Carleton H, Katz LS, Stroika S, Gould LH, Mody RK, Silk BJ, Beal J, Chen Y, Timme R, Doyle M, Fields A, Wise M, Tillman G, Defibaugh-Chavez S, Kucerova Z, Sabol A, Roache K, Trees E, Simmons M, Wasilenko J, Kubota K, Pouseele H, Klimke W, Besser J, Brown E, Allard M, Gerner-Smidt P. Implementation of Nationwide Real-time Whole-genome Sequencing to Enhance Listeriosis Outbreak Detection and Investigation. Clin Infect Dis. 2016 Aug 1;63(3):380-6. doi: 10.1093/cid/ciw242. Epub 2016 Apr 18. PubMed PMID: 27090985; Full text in PubMed Central PMCID: PMC4946012; Full text at Oxford Academic.

    References about Public Health England WGS activities

    Public Health England Describes Switch to Whole Genome Sequencing for Pathogen Surveillance for Salmonella - Cites NCBI Pathogen Detection as a Model for Open Information

    Other related references

    References on antimicrobial resistance

    Using the NCBI AMRFinderPlus Tool:
    • Feldgarden M, Brover V, Fedorov B, Haft DH, Prasad AB, Klimke W. Curation of the AMRFinderPlus databases: applications, functionality and impact. Microb Genom. 2022 Jun;8(6). doi: 10.1099/mgen.0.000832. PubMed PMID: 35675101; Full text at Microbial Genomics.
    • Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, Hoffmann M, Pettengill JB, Prasad AB, Tillman GE, Tyson GH, Klimke W. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021 Jun 16;11(1):12728. https://doi.org/10.1038/s41598-021-91456-0. PubMed PMID: 34135355; Full text at Nature Scientific Reports.
    • Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu CH, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster JP, Klimke W. Validating the NCBI AMRFinder Tool and Resistance Gene Database Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of NARMS Isolates. Antimicrob Agents Chemother 2019 Aug 19. pii: AAC.00483-19. doi: 10.1128/AAC.00483-19. [Epub ahead of print] PubMed PMID: 31427293; Full text at AAC.
    AMRFinderPlus used to confirm putative virally-encoded beta-lactamases
    AMRFinderPlus implemented in SeqSphere+
    Using AMRFinderPlus to identify metal resistance genes
    Uses AMRFinderPlus and MicroBIGG-E output to identify health risks of antibiotic resistance genes
    NIAID Funded Bioinformatics Resource Center PATRIC uses AMRFinderPlus
    Veterinary Laboratory Information and Response Network of FDA Identifies NDM-5 in E. coli from Companion Animals Using Isolates Browser and AMRFinderPlus Results
    • Cole SD, Peak L, Tyson GH, Reimschuessel R, Ceric O, Rankin SC. New Delhi Metallo-beta-Lactamase-5-producing Escherichia coli in Companion Animals, United States. Emerg Infect Dis. 2020 Feb. https://doi.org/10.3201/eid2602.191221. Full text at Emerging Infectious Diseases.
    FDA Center for Veterinary Medicine uses Pathogen Browser and AMRFinder Results to Examine Fluoroquinolone Resistance in E. coli
    Review of beta lactamases and nomenclature:
    • Bush K. Past and Present Perspectives on β-Lactamases. Antimicrob Agents Chemother 2018 Sep 24;62(10). pii: e01076-18. doi: 10.1128/AAC.01076-18. Print 2018 Oct. Review. PubMed PMID: 30061284; Full text in PubMed Central PMCID: PMC6153792.
    • Mack AR, Barnes MD, Taracila MA, Hujer AM, Hujer KM, Cabot G, Feldgarden M, Haft DH, Klimke W, van den Akker F, Vila AJ, Smania A, Haider S, Papp-Wallace KM, Bradford PA, Rossolini GM, Docquier JD, Frère JM, Galleni M, Hanson ND, Oliver A, Plésiat P, Poirel L, Nordmann P, Palzkill TG, Jacoby GA, Bush K, Bonomo RA. A standard numbering scheme for class C β-Lactamases. Antimicrob Agents Chemother 2019 Nov 11. pii: AAC.01841-19. doi: 10.1128/AAC.01841-19. [Epub ahead of print]. PubMed PMID: 31712217; Full text in Antimicrobial Agents and Chemotherapy.
    Proposal for assignment of allele numbers for mobile colistin resistance (mcr) genes:
    • Partridge SR, Di Pilato V, Doi Y, Feldgarden M, Haft DH, Klimke W, Kumar-Singh S, Liu JH, Malhotra-Kumar S, Prasad A, Rossolini GM, Schwarz S, Shen J, Walsh T, Wang Y, Xavier BB. Proposal for assignment of allele numbers for mobile colistin resistance (mcr) genes. J Antimicrob Chemother. 2018 Oct 1;73(10):2625-2630. doi: 10.1093/jac/dky262. PubMed PMID: 30053115; Full text in PubMed Central PMCID: PMC6148208.
    The NCBI AMRFinder tool helps identify the fourth mcr-1 resistant isolate in the US:
    • Vasquez AM, Montero N, Laughlin M, Dancy E, Melmed R, Sosa L, Watkins LF, Folster JP, Strockbine N, Moulton-Meissner H, Ansari U, Cartter ML, Walters MS. Investigation of Escherichia coli Harboring the mcr-1 Resistance Gene - Connecticut, 2016. MMWR Morb Mortal Wkly Rep. 2016 Sep 16;65(36):979-80. doi: 10.15585/mmwr.mm6536e3. PubMed PMID: 27631346; Full text at CDC.
    The NCBI AMRFinder tool helps uncover a novel fosfomycin resistance gene:
    The Comprehensive Antibiotic Resistance Database:
    • Jia B, Raphenya AR, Alcock B, Waglechner N, Guo P, Tsang KK, Lago BA, Dave BM, Pereira S, Sharma AN, Doshi S, Courtot M, Lo R, Williams LE, Frye JG, Elsayegh T, Sardar D, Westman EL, Pawlowski AC, Johnson TA, Brinkman FS, Wright GD, McArthur AG. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 2017 Jan 4;45(D1):D566-D573. doi: 10.1093/nar/gkw1004. Epub 2016 Oct 26. PubMed PMID: 27789705; Full text in PubMed Central PMCID: PMC5210516; Full text at Oxford Academic.
    Resfinder:
    Pointfinder:

    Contact information

    If you would like to contact the NCBI Pathogen Detection team, please send an email to: pd-help@ncbi.nlm.nih.gov


    Revised 10 August 2021 Pathogen Detection Project help: pd-help@ncbi.nlm.nih.gov