Reference Gene Hierarchy Documentation

Beta Release

What is the Reference Gene Hierarchy?
How to search the Reference Gene Hierarchy
- Viewing search results
Data fields in the Reference Gene Hierarchy
Output

What is the Reference Gene Hierarchy?

The Reference Gene Hierarchy is a web-based view into the hierarchy of genes, families, and upstream nodes that our curators use to organize and relate the genes and HMMs in the Pathogen Detection Reference Gene Catalog and Pathogen Detection Reference HMM Catalog. This hierarchy drives the gene identification and naming algorithm of AMRFinderPlus.

The Reference Gene Hierarchy provides the link between proteins, HMMs, and protein names for proteins that do not have an exact match in the Reference Gene Catalog.

Every row in the Pathogen Detection Reference Gene Hierarchy is a node in a tree that organizes the sequences and HMMs in our reference database, and the columns are metadata about that node. Links are provided to the proteins and HMMs that are assigned to that node. Some nodes higher up in the tree are purely organizational, such as the AMR node or the Virulence_E._coli node, while other nodes correspond to gene families, such as the blaTEM node, which contains the TEM family beta-lactamases.

Scope: The Reference Gene Hierarchy includes two data subsets noted in the "scope" column:

"Core": this is a more narrowly curated AMR-specific subset of genes and proteins that are considered more likely to be informative about AMR phenotype.
"Plus": this subset includes genes related to biocide and stress resistance, general efflux, virulence, or antigenicity as well as AMR genes whose presence or absence are not likely to be informative about phenotype.

Relationships between the Reference Gene Hierarchy and other Pathogen Detection Browsers

NCBI Pathogen Detection provides five table-based browsers to provide easy web-based access to the databases we curate and the results of our analysis. All are related resources and integrated with each other.
The main similarities between the resources are their shared search engines and similar search techniques.
- All use the SOLR query language and allow searches by a wide variety of text terms.
- The search tips provided by the [Isolates Browser help documentation](/pathogens/pathogens_help/#isolates-browser) therefore also apply to the Reference Gene Hierarchy, such as basic search techniques, advanced search techniques, case sensitive versus case insensitive searches, and the availability of "filters" to refine search results.
The main difference between the resources is the scope of data being searched, the hierarchical nature of the data for the Hierarchy Browser, the set of data fields (and filters which are based on data fields) that are available for searching, and the columns that are shown in the display of search results.
- Every row in the Reference Gene Hierarchy is a node in a tree organizing the HMMs and protein sequences in the database. Every column is data about that node in the hierarchy.
- Nodes can be expanded to show their children by clicking the expand button next to the Node ID, and expanded fully by holding down Ctrl or Command while clicking it.
- When a search or filter is enabled the hierarchy table is reduced to only show nodes that contain the search terms or their ancestors. Rows that match the search term have checks to the left of the Node ID.
- Because of this, sister nodes to those shown may be hidden. To view any sister nodes to those shown, click Show all... next to the Node ID. Non-selected nodes can be hidden again by clicking Hide....
- Clicking on the checkbox for a node will select that node and all shown children. Hidden nodes will not be selected.
- Ctrl-click or Cmd-click will select/deselect only the associated node with no effect on its children.
- The children of a node can be expanded or collapsed by clicking the expand or collapse buttons next to the Node ID.

Relationship between the Reference Gene Hierarchy and the Reference HMM Catalog and Reference Gene Catalog

The Reference Gene Hierarchy provides the higher level organization integrating protein sequences included in the Pathogen Detection Reference Gene Catalog with the HMMs included in the Pathogen Detection Reference HMM Catalog. Clicking on the number in the Proteins field takes you to all the protein sequences at that node in the Reference Gene Catalog. Clicking on an HMM accession takes you to that HMM in the Pathogen Detection Reference HMM Catalog.

The Reference Gene Hierarchy, Pathogen Detection Reference HMM Catalog, and Pathogen Detection Reference Gene Catalog together represent the data used by AMRFinderPlus to identify genes. The hierarchy relates both the gene sequences in the Reference Gene Catalog and the HMMs in the Reference HMM Catalog to each other. Decisions on gene symbol and gene name are made based on the placement and arrangement of the BLAST and HMMER matches in the hierarchy along with cutoffs and a set of rules to identify the correct name and symbol for a gene. See Feldgarden et al., 2021 for details.

Linking from the Reference Gene Hierarchy to other browsers

Next to each Node ID is a checkbox (checked by default). Clicking Show in Isolates, Show in MicroBIGG-E, or Show in RefGene catalog opens the entries associated with the selected nodes in the Isolates Browser, MicroBIGG-E, or the Reference Gene Catalog, respectively.

Above the table there is a "Select all" / "Select none" toggle that checks or unchecks all of the checkboxes.

NOTE: Links to MicroBIGG-E and the Isolates browser are based on gene symbols and may not be exact when gene symbols do not match the AMRFinderPlus Hierarchy. Links to the Reference Gene Catalog are based on Node ID and will accurately reflect the hierarchy.

Where to access the Reference Gene Hierarchy

The Reference Gene Hierarchy is accessible from a link in the right margin of the Pathogen Detection Project home page, from the AMR landing page, and the AMR Resources page.

You can also access the Reference Gene Hierarchy directly from the links below:

Browse/search the Reference Gene Hierarchy:
https://www.ncbi.nlm.nih.gov/pathogens/genehierarchy/
Download the Reference Gene Hierarchy data in tab-delimited text file format:
https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/ReferenceGeneHierarchy.txt
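
Once downloaded, the tab-delimited file can be read with standard tools. The following is a minimal Python sketch, assuming the file begins with a header row. The column names in the sample (node_id, parent_node_id) and the "ROOT" value are placeholders for illustration, so check the header of the actual file before relying on them.

```python
import csv
import io


def parse_hierarchy(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical sample; the real file's column names may differ.
sample = (
    "node_id\tparent_node_id\n"
    "copR_gen\tROOT\n"
    "czcR\tcopR_gen\n"
)

rows = parse_hierarchy(sample)
children = [r["node_id"] for r in rows if r["parent_node_id"] == "copR_gen"]
print(children)
```

To read a downloaded copy instead of the sample, pass the file contents to the same function, for example parse_hierarchy(open("ReferenceGeneHierarchy.txt", encoding="utf-8").read()).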

How to search the Reference Gene Hierarchy

Search terms: The Pathogen Detection Reference Hierarchy Browser can be searched by the terms that appear in any of the data fields described below.
Basic search: The query tips described in the Isolates Browser help > basic search section also apply to the Reference Gene Hierarchy, such as searches for multiple terms, special characters, phrase searches, case sensitive vs. case insensitive searches, etc.
Advanced search: The query tips described in the Isolates Browser help > Advanced search section also apply to the Reference Gene Hierarchy, because both resources use the SOLR query language. The main difference is the data fields that are available to be searched, because each resource has its own set of data fields. (See the list of data fields in the Reference Gene Hierarchy below.)
Filters: The "Filters" options in the Pathogen Detection Reference Gene Hierarchy enable you to subset the data in a variety of ways, and therefore can be used to refine your results, whether you have done a basic search or an advanced search.
- By default, each filter displays the top 100 terms (based on the number of items retrieved by a term), ordered by count within that set of 100.
  - A Boolean "OR" is applied if multiple items are checked in the same filter field. This way you can choose multiple values in the same filter. For example:
    - Click the "Filters" bar of the Pathogen Detection Reference Hierarchy Browser, then click the "Subclass" filter. In the new panel select "CARBAPENEM" to filter for nodes associated with carbapenem resistance.
  - A Boolean "AND" is applied if you select items in several different filter panels (Type, Scope, etc.). For example:
    - Click the "Filters" bar of the Pathogen Detection Reference Hierarchy Browser, then click the "Scope" filter. In the new panel select the "core" scope. Go back to the "Available Filters" panel and select the "Class" filter, then select the "BETA-LACTAM" class to filter for nodes that are thought to represent beta-lactamase genes that are more likely to have an effect on phenotype.
- As explained in the Isolates Browser Help, Filters are generated on the fly. As a result, the terms that are listed under each filter will depend on the data you are currently displaying in the browser.
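
The OR-within-a-field and AND-across-fields behavior described above can be sketched as a small Python helper that assembles a query string in standard Lucene/SOLR syntax. This is an illustration only: the function names are hypothetical, and the query string the browser itself displays may be formatted differently.

```python
def quote(value):
    """Wrap a value in double quotes if it contains whitespace."""
    return f'"{value}"' if any(ch.isspace() for ch in value) else value


def build_query(selections):
    """Combine filter selections: OR within a field, AND across fields."""
    clauses = []
    for field, values in selections.items():
        if not values:
            continue  # an empty filter contributes nothing
        terms = [quote(v) for v in values]
        if len(terms) == 1:
            clauses.append(f"{field}:{terms[0]}")
        else:
            clauses.append(f"{field}:({' OR '.join(terms)})")
    return " AND ".join(clauses)


print(build_query({"scope": ["core"], "class": ["BETA-LACTAM"]}))
print(build_query({"subclass": ["CARBAPENEM", "QUATERNARY AMMONIUM"]}))
```

Field names and values are passed through unchanged because both are case sensitive.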

Viewing search results

When a search or filter is enabled the hierarchy table is reduced to only show nodes that contain the search terms or their ancestors. Rows that match the search term have checks to the left of the Node ID.
Because of this, sister nodes to those shown may be hidden. To view any sister nodes to those shown, click Show all... next to the Node ID. Non-selected nodes can be hidden again by clicking Hide....
Nodes can be expanded to show their children by clicking the expand button next to the Node ID, and expanded fully by holding down Ctrl or Command while clicking it.
Clicking on the checkbox for a node will select that node and all shown children. Hidden nodes will not be selected.
Ctrl-click or Cmd-click will select/deselect only the associated node with no effect on its children.
The children of a node can be expanded or collapsed by clicking the expand or collapse buttons next to the Node ID.
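
The way a search reduces the table to matching nodes plus their ancestors can be sketched in Python as follows. This is not the browser's implementation. The tree uses the copR_gen children mentioned in this document, and "ROOT" is a hypothetical top-level node added for illustration.

```python
def visible_nodes(parent_of, matches):
    """Return the matching nodes plus all of their ancestors."""
    visible = set()
    for node in matches:
        # Walk up the tree until the root or an already-visited node.
        while node is not None and node not in visible:
            visible.add(node)
            node = parent_of.get(node)
    return visible


# Toy tree as a child -> parent map. "ROOT" is hypothetical.
parent_of = {
    "copR_gen": "ROOT",
    "copR": "copR_gen",
    "czcR": "copR_gen",
    "silR": "copR_gen",
}

shown = visible_nodes(parent_of, {"czcR"})
print(sorted(shown))
```

In this sketch the sister nodes copR and silR are absent from the result, which corresponds to the rows that Show all... would reveal.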

Data fields in the Reference Gene Hierarchy

The data fields listed below have been indexed by the Pathogen Detection project and are therefore directly searchable, using the advanced search techniques that are described in the Isolates Browser help, because both use the SOLR query language. Note that the data field names and values are case sensitive, as described in the Isolates Browser help.

Each data field reflects an available column in the Reference Gene Hierarchy web interface. The output section of this document describes the use of filters as an alternate way of searching through the data.

Please note: in the list of available data fields below:

The term shown in the regular font is the display name (column header) shown by the Reference Gene Hierarchy. The term shown in (italics) is the name of the corresponding data field if you want to search that field directly.
For example, one data field is listed as: Node ID (node_id) (in lower case, with an underscore instead of a space). This is the case sensitive string you should use if you want to search the data field directly using the query box.
Brief italicized search examples are also provided for some of the data fields, showing how to query the field directly. The values represent text strings exactly as they appear in the data fields, including upper and lower case letters and special characters such as hyphens. The data field names are case sensitive.

In the table below, columns shown by default are highlighted in blue.

Node ID (node_id)
Symbol (symbol)
Synonyms (synonyms)
Proteins (num_prots)
HMM accession (hmm_acc)
Scope (scope)

Class (class)
Subclass (subclass)
Name (name)
Type (type)
Subtype (subtype)

Node ID (node_id)

The identifier for this node. Node IDs are unique to each node. They are often the same as the gene symbol, but because the Reference Gene Hierarchy contains additional structure that is not reflected in gene nomenclature, many Node IDs are not gene symbols.

AMR proteins often share relationships by function, by sequence homology, or by both, and our database encodes these relationships hierarchically. Leaf nodes in the hierarchy typically simply repeat a gene symbol or allele assignment, but internal nodes also receive identifiers that must be unique. Those identifiers are chosen to be as brief, informative, human-readable, unambiguous, and consistent in structure and style as we could make them.

Typically, an internal node has the suffix "_fam" if it represents a family of homologous proteins, sequences belonging to that node or its children are named consistently, and a single annotation rule such as an HMM can cleanly separate the family of proteins represented by the node from all other proteins.

The suffix "_gen" is used in a node identifier to mean "the general family of." It is chosen in two situations. First, when more specific names for various branches of the family differ substantially from each other. For example, 'copR_gen' (CopR family heavy metal response regulator) has children that include 'copR', 'czcR', 'silR', and so on. Second, when the node has no linked annotation rule (typically an HMM) because additional homologs exist that are not AMR proteins. For example, several related families of mercury resistance transcriptional regulators are organized under the node 'merR_gen', a node that has no HMM attached to it because no HMM could be built that would find MerR proteins comprehensively without also finding regulators for other types of biological processes.
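
As a rough illustration of these suffix conventions, the Python sketch below labels a node identifier by its suffix. The function name and the identifier "example_fam" are hypothetical, and because the conventions are described as typical rather than strict, the result should be treated as a hint only.

```python
def describe_node_id(node_id):
    """Describe a node ID based on its suffix, per the conventions above."""
    if node_id.endswith("_fam"):
        return "family of homologous proteins"
    if node_id.endswith("_gen"):
        return "general family"
    return "no suffix convention applies"


# "example_fam" is a made-up identifier used only to show the "_fam" case.
for node_id in ["copR_gen", "merR_gen", "example_fam", "blaTEM"]:
    print(node_id, "->", describe_node_id(node_id))
```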

Examples:

To search this field directly, enter a query such as: node_id:searchterm
Search for: node_id:AME

Symbol (symbol)

A gene symbol for this node. If AMRFinderPlus were to identify a protein at this level (by HMM only, or because the BLAST parameters were outside the cutoff for a more specific node), this is the gene symbol AMRFinderPlus would give. Note that this is often the same as node_id, but because the AMRFinderPlus hierarchy has more structure than the accepted nomenclature, node_id may not reflect an accepted nomenclature.

Examples:

To search this field directly, enter a query such as: symbol:searchterm
To search for all nodes that get the gene symbol 'bla': symbol:bla

Synonyms (synonyms)

Other gene symbols that have been used for this gene. These may appear in the literature, but are not the standard symbols used in AMRFinderPlus or the Reference Gene Catalog.

Proteins (num_prots)

The number of proteins in the Pathogen Detection Reference Gene Catalog that are associated with this node or any of its children. Clicking this number will open the Reference Gene Catalog entries for those proteins in a new window.
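
A count that covers a node and the nodes beneath it can be sketched in Python as a recursive sum. This sketch assumes that "any of its children" extends to all descendants, which this document does not state explicitly. The per-node counts are made up for illustration and are not real catalog numbers.

```python
def rolled_up_count(children_of, direct_count, node):
    """Count proteins at a node plus those at all of its descendants."""
    total = direct_count.get(node, 0)
    for child in children_of.get(node, []):
        total += rolled_up_count(children_of, direct_count, child)
    return total


# Parent -> children map using node names mentioned in this document.
children_of = {"copR_gen": ["copR", "czcR", "silR"]}
# Made-up numbers of proteins assigned directly to each node.
direct_count = {"copR": 2, "czcR": 1, "silR": 3}

print(rolled_up_count(children_of, direct_count, "copR_gen"))
```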

HMM accession (hmm_acc)

The accession of the Hidden Markov Model (HMM) that defines this node. Not all nodes will have HMMs associated with them. Clicking the HMM accession will open a new window showing that HMM in the Pathogen Detection Reference HMM Catalog.

Scope (scope)

The subset of the AMRFinderPlus database that this node belongs to. Briefly, 'core' nodes are highly curated AMR-specific genes and proteins from the Bacterial Antimicrobial Resistance Reference Gene Database (BioProject PRJNA313047), while 'plus' nodes include genes related to biocide and stress resistance, general efflux, virulence, antigenicity, and/or non-expressed or universal genes unlikely to be informative for phenotype.

Examples:

To search this field directly, enter a query such as: scope:searchterm
Search for: scope:core
to show the nodes in the "core" subset of the Pathogen Detection Reference Gene Catalog.

Class (class)

Resistance target for genes of type AMR or STRESS, or typing information for some virulence genes.

This data field also appears in the Pathogen Detection Reference Gene Catalog; a description of Class and examples of queries for that field appear in the Reference Gene Catalog data fields help section. A more detailed description of the class and subclass fields is available on the AMRFinderPlus wiki.

(In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to a phenotype associated with the genetic element.)

Subclass (subclass)

Where it is known, "subclass" provides a more specific definition of the particular antibiotics or classes of stressors that are affected by the genes identified by this HMM (e.g., that are resisted by the gene). While most subclass designations are self-explanatory, a few others have particular meanings. Specifically, "CEPHALOSPORIN" is equivalent to the Lahey 2be definition; "CARBAPENEM" means the protein has carbapenemase activity, but it might or might not confer resistance to other beta-lactams; "QUATERNARY AMMONIUM" are quaternary ammonium compounds. In addition, stx subtypes (e.g., STX2E) and intimin subtypes (e.g., ALPHA) are defined for Shiga toxin proteins (class of STX1 or STX2) and intimins (class of INTIMIN) respectively. Where the phenotypic information is incomplete, contradictory, or unclear, the "Class" value is used for the "Subclass" value.

This data field also appears in the Pathogen Detection Reference Gene Catalog; a description of Subclass and examples of queries for that field appear in the Reference Gene Catalog data fields help section.

(In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to a phenotype associated with the genetic element.)

Examples:

To search this field directly, enter a query such as: subclass:searchterm
Search for: subclass:CARBAPENEM
to show nodes for genes that contribute to carbapenem resistance.
As an alternative method for retrieving those genes, you can open the "Filters" function of the Reference Gene Hierarchy and check the box for the desired subclass. You can search through available subclasses by using the Search field at the top of the filter box. Note that searches are case-sensitive, so to identify QUINOLONE resistance nodes you could type QUINOLONE in the Search field and the Filters function will refresh itself to show the subclass values that contain that substring (currently QUINOLONE and PHENICOL/QUINOLONE). (As noted below, filters are generated on the fly and reflect the attributes of the data that you are currently viewing.)

Name (name)

The name a gene at this node would be given by AMRFinderPlus.

Type (type)

Classification for the type of gene found, such as AMR, STRESS, or VIRULENCE. Note that the value here is actually assigned by the placement of this node in the hierarchy (shown in the Node ID column), and is included here for convenience of searching and filtering. A more detailed description of the type and subtype fields is available on the AMRFinderPlus wiki.

(In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to a phenotype associated with the genetic element.)

Data field names and values are case sensitive, and the values for this data field are written in upper case, as shown in the example below.

Examples:

To search this field directly, enter a query such as: type:searchterm
Search for: type:STRESS
to show genes that confer stress resistance.
As an alternative method for retrieving those genes, you can open the "Filters" function of the Reference Gene Hierarchy and check the box for the desired Type. By doing so, the Filters function will refresh itself to show the subtype values that are available for the type you have selected, enabling you to further narrow your search results, if desired. For example, the subtype values under STRESS currently include BIOCIDE, METAL, and ACID. (As noted below, filters are generated on the fly and reflect the attributes of the data that you are currently viewing.)

Subtype (subtype)

More specific type for element if available. Otherwise contents will be identical to Type. As with the 'Type' field this is actually assigned by the placement of this node in the hierarchy (shown in the Node ID column), and is included here for convenience of searching and filtering. A more detailed description of the type and subtype fields is available on the AMRFinderPlus wiki.

(In general, type and subtype refer to the category of gene or genetic element, while class and subclass refer to a phenotype associated with the genetic element.)

Examples:

To search this field directly, enter a query such as: subtype:searchterm
Search for: subtype:METAL
to show genes that contribute to metal resistance.
As an alternative method for retrieving those genes, you can open the "Filters" function of the Reference Gene Hierarchy and check the box for the desired Type. By doing so, the Filters function will refresh itself to show the subtype values that are available for the type you have selected, enabling you to further narrow your search results, if desired. For example, the subtype values under STRESS currently include BIOCIDE, METAL, and ACID. (As noted below, filters are generated on the fly and reflect the attributes of the data that you are currently viewing.)

Output

Tabular list of nodes

Upon opening the Reference Gene Hierarchy, a table displays data for all nodes in the AMRFinderPlus reference gene hierarchy.
Every row in the Reference Gene Hierarchy is a node and every column is metadata about that node.
The relationships between nodes are indicated in the first "Node ID" column.
Nodes can be expanded to show their children by clicking the expand button next to the Node ID, and expanded fully by holding down Ctrl or Command while clicking it.
Rows can be filtered by clicking on the filters bar, or searched using basic and advanced search techniques.
When a search or filter is enabled the hierarchy table is reduced to only show nodes that contain the search terms or their ancestors. Rows that match the search term have checks to the left of the Node ID.
Because of this, sister nodes to those shown may be hidden. To view any sister nodes to those shown, click Show all... next to the Node ID. Non-selected nodes can be hidden again by clicking Hide....
Clicking on the checkbox for a node will select that node and all shown children. Hidden nodes will not be selected.
Ctrl-click or Cmd-click will select/deselect only the associated node with no effect on its children.
The children of a node can be expanded or collapsed by clicking the expand or collapse buttons next to the Node ID.

Filters to refine results

Filters are activated by clicking on the bar labeled "Filters" just under the search box. This allows you to facet or subset the data in a variety of ways, and therefore can be used to refine your results, whether you have done a basic or advanced search.

The filter menu allows all data fields in the column chooser to be filtered. By default, each filter displays the top 100 terms (based on the number of rows retrieved by a term).

A Boolean "OR" is applied if multiple items are checked in the same filter field. This way you can choose multiple values in the same filter.
A Boolean "AND" is applied if you select items in multiple different filter fields (e.g., Scope, Class, etc.).
If you prefer to apply a Boolean "AND" to multiple terms in the same filter field, you can enter a SOLR query.
Filters are generated "on the fly" for a given dataset.
- The choices listed in the "Filters" panels reflect the attributes of the nodes that you are currently viewing in the browser.
- By default, only the top 100 terms (based on the number of rows retrieved by a term) are shown.
- Numbers of rows for each filter term are displayed to the right of that term.
- The total number of values in the filter is displayed at the bottom of the filter tab.
The list of values within each filter tab can be searched using the controls at the top of the tab. This can reveal values not in the top 100.
- Text fields can be searched by typing exact substrings of the values in the field.
- Numeric fields have ranges that can be selected using the check buttons and ranges listed.
- Date fields can be searched using date ranges with some commonly used presets listed as buttons.
The search box can be reset with the reset button beside the search box. The entire filter can be removed with the 'X' at the top right corner.
Filters can be collapsed if more than one is shown with the double left hand arrow at the bottom left, and opened again after collapse with the double right hand arrow on collapsed tabs. Each tab is labeled with the filter name in the left margin.
Clicking the filter bar again will collapse the filter and show a SOLR query string that can be used in the search box.
Note that the filters match exact strings, so capitalization and punctuation must match. Use multiple synonyms where needed in fields that don't have a controlled vocabulary.
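
The case-sensitive substring search within a filter tab can be illustrated with a short Python sketch. The function name is hypothetical, and the value list is a small sample built from subclass values mentioned in this document, not the full set shown in the browser.

```python
def search_filter_values(values, substring):
    """Return values containing the substring; matching is case sensitive."""
    return [v for v in values if substring in v]


# Small sample of subclass values taken from this document.
subclass_values = ["CARBAPENEM", "QUINOLONE", "PHENICOL/QUINOLONE"]

print(search_filter_values(subclass_values, "QUINOLONE"))
print(search_filter_values(subclass_values, "quinolone"))
```

The second call returns an empty list because the lower-case string does not match the upper-case values.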

Data Retention Policy

The Reference Gene Hierarchy web interface only shows the most recent release of the AMRFinderPlus database, but all previous releases are retained on our FTP site. See Reference data retention for details.