SPARCLE Help Document

SPARCLE Help

This help document describes how to use SPARCLE, the Subfamily Protein Architecture Labeling Engine, a resource for protein classification. The Conserved Domains resources page describes additional, related resources and provides "How To" guides that illustrate how those resources can be used.

DETAILED TABLE OF CONTENTS:

What is SPARCLE?

Overview

What is a conserved domain architecture?
Types of architectures
Architectures with single conserved domain footprint
Each architecture receives a unique and stable architecture ID

How can SPARCLE be used to learn more about proteins?

Classify a protein based on its conserved domain architecture
Retrieve conserved domain architectures whose descriptions contain the keywords you specify
Retrieve proteins that have the same conserved domain architecture, regardless of the extent of their overall sequence similarity
Infer the biological function of a hypothetical protein

Compare CDD, CDART, and SPARCLE

Input Options

Enter a query sequence into CD-Search

Illustrated example
Note about ongoing research
CD-Search help document provides additional details

Search the SPARCLE database by keyword

Illustrated example
Scope of keyword search
Search tips to narrow or broaden search

How to limit your query to a specific search field
How to use quotes to force a phrase search
How to use an asterisk (*) for truncation
Compare some sample search strategies

Search fields

Output

Output from a sequence search
Output from a keyword search

Sample SPARCLE Record

Classification of proteins by domain architecture
Description of architecture

Name of architecture
Label (description of function)
Architecture ID
Version
Date Published
Review Level

Sequences with this architecture

Folder tabs

All
Protein with PubMed Reference
3D Structure
Gene
RefSeq
Swiss-Prot

Filters

Tags
Source
Organism
Description
Gene Symbol

Note: Empty Set

Curated Names and Labels

Taxonomic Scope
Name
Label
Supporting evidence

Protein sequences
Conserved domains
Publications
Other

Conserved domains in this architecture
Functional sites in this architecture

Data Processing

Data processing overview
Three tiers of data:

Curated architectures
Autonamed architectures
NamedByDomain architectures

Two types of architectures:

Superfamily architectures
Subfamily architectures

Architectures with single conserved domain footprint
Each architecture receives a unique and stable architecture ID
Ongoing Research
Links from architectures to other data types

Log of changes to SPARCLE

References

Citing SPARCLE
Additional References

What is SPARCLE?

Overview

SPARCLE, the Subfamily Protein Architecture Labeling Engine, is a resource for the functional characterization and labeling of protein sequences that have been grouped by their characteristic conserved domain architecture.

A conserved domain architecture is defined as the sequential order of conserved domains in a protein sequence.

To give an example of proteins that have similar function but different domain architectures:

DNA gyrase B (NP_387887), an antibiotic target, has a conserved domain architecture that includes a histidine kinase-like ATPase domain, a transducer domain, a topoisomerase-primase domain, followed by a type II topoisomerase carboxy domain.

In contrast, enzymes of similar function, such as topoisomerase IV (Q45066), have a different conserved domain architecture.

Note: In each of the examples above, the default graphical summary that appears when you click on the "conserved domain architecture" link depicts the full-length protein model. To see the individual conserved domains that compose the full-length protein model, click on the "View: Full Results" display option in the upper right-hand corner of the display. (The CD-Search help document provides additional information about the features and options on the search results display.)

Regardless of which display option you use, the "Protein Classification" section that appears above the graphical summary includes a "domain architecture ID xxxxxx" link, which opens the corresponding SPARCLE record.

The SPARCLE record (illustrated example), also referred to as the conserved domain architecture's "summary page," shows the architecture's name and functional label (description), the supporting evidence that was used to name the architecture, and links to other protein sequences with the same architecture and to the individual conserved domains in the architecture.

There are two types of conserved domain architectures:

Superfamily architectures

Superfamily architectures consist solely of conserved domain superfamilies. This allows a general functional category to be inferred for the proteins that have that architecture.

Additional details about superfamily architectures are provided in the data processing section of this document.

Note: Superfamily architectures are currently found only in the CDART resource. A brief description of CDART is provided in the "Compare CDD, CDART, and SPARCLE" section of this document.

Subfamily architectures

Subfamily architectures either contain a mix of conserved domain superfamilies and subfamilies, or consist solely of conserved domain subfamilies.

A subfamily is represented by a conserved domain model that gets a specific hit to the protein query sequence. The specific hits represent a high confidence that the query sequence belongs to the same protein family as the sequences used to create each conserved domain model, and therefore a high confidence level for the inferred function of the protein query sequence.

To see if a conserved domain is a superfamily or subfamily, mouse over a conserved domain's footprint in the architecture's graphical display. A superfamily will have a "cl" prefix in the accession number; the "cl" stands for superfamily cluster. A subfamily will have an accession number prefix other than "cl".
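
The accession-prefix rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this document, not code used by SPARCLE; the function names are made up for the example, and the "cl00001" accession is a hypothetical placeholder (cd00400 is the voltage-gated chloride channel model mentioned elsewhere in this document).

```python
from typing import Sequence


def is_superfamily(accession: str) -> bool:
    """Return True if the accession carries the "cl" (superfamily cluster) prefix."""
    return accession.strip().lower().startswith("cl")


def architecture_type(accessions: Sequence[str]) -> str:
    """Classify an architecture from the accessions of its domain footprints.

    Follows the definitions in this document: an architecture that consists
    solely of superfamilies is a superfamily architecture; an architecture
    that contains at least one subfamily is a subfamily architecture.
    """
    if not accessions:
        raise ValueError("an architecture needs at least one domain footprint")
    if all(is_superfamily(acc) for acc in accessions):
        return "superfamily"
    return "subfamily"


# "cl00001" is a hypothetical accession used only for illustration.
print(architecture_type(["cl00001"]))             # superfamily
print(architecture_type(["cl00001", "cd00400"]))  # subfamily
```

The same check also covers architectures with a single conserved domain footprint, since a one-element list is classified by the prefix of its only accession.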

Additional details about subfamily architectures are provided in the data processing section of this document.

Note: Subfamily architectures are currently found only in the SPARCLE resource. A brief description of SPARCLE is provided in the Compare CDD, CDART, and SPARCLE section of this document.

Architectures with single conserved domain footprint:

It is also possible for a domain architecture to consist of a single conserved domain footprint. That footprint can represent either a superfamily architecture or a subfamily architecture.

Each architecture receives a unique and stable architecture ID:

Each conserved domain architecture receives a unique and stable architecture ID, which reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. Architectures that consist of a single conserved domain footprint also receive an architecture ID.
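
To make the three properties listed above concrete (the set of domain models, their sequential order, and the type of hit), the following Python sketch models an architecture as an ordered tuple of (accession, hit type) pairs. It does not reproduce how SPARCLE actually assigns architecture IDs, which this document does not describe; the class and function names, the accessions "cd_A" and "cd_B", and the hit-type strings are hypothetical placeholders.

```python
from typing import Iterable, NamedTuple, Tuple


class DomainHit(NamedTuple):
    accession: str  # accession of the top-scoring conserved domain model
    hit_type: str   # e.g. "specific" or "superfamily" (placeholder labels)


def architecture_key(hits: Iterable[Tuple[str, str]]) -> Tuple[DomainHit, ...]:
    """Build a comparable key that preserves domain order and hit type."""
    return tuple(DomainHit(accession, hit_type) for accession, hit_type in hits)


# Hypothetical accessions, listed in sequential order along the protein.
first = architecture_key([("cd_A", "specific"), ("cd_B", "specific")])
reordered = architecture_key([("cd_B", "specific"), ("cd_A", "specific")])
other_hit_type = architecture_key([("cd_A", "superfamily"), ("cd_B", "specific")])

print(first == reordered)       # False: same domains, different order
print(first == other_hit_type)  # False: same domains, different hit type
```

Under this model, two proteins share an architecture only if their keys are equal, which is why proteins with the same domains in a different order would be treated as having different architectures.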

Additional information about conserved domains:

The Conserved Domain Database (CDD) help document provides additional information about domain family hierarchies, including superfamilies and subfamilies. It also provides additional information about the companion CD-Search tool, including the hit types displayed in CD-Search results, such as specific hits, non-specific hits, the superfamily(ies) to which those hits belong, and multi-domain models. Each superfamily on a CD-Search results page is represented by a cartoon with a distinct color/shape combination, in order to distinguish domains from each other.

How can SPARCLE be used to learn more about proteins?

Classify a protein based on its conserved domain architecture

If you enter a query sequence into CD-Search, the results page will include a "Protein Classification" section, if the query protein has a hit to a curated domain architecture in the SPARCLE database. (See an illustrated example that uses NP_387887, Bacillus subtilis DNA gyrase subunit B, as the protein query sequence.)

Retrieve conserved domain architectures whose descriptions contain the keywords you specify

You can search the SPARCLE database by keyword to retrieve conserved domain architectures that contain the term(s) of interest in their descriptions. (See an illustrated example that looks for the words "chloride" and "channel" and limits the results to curated domain architecture records by adding curated[ReviewLevel] to the search.)

Retrieve proteins that have the same conserved domain architecture, regardless of the extent of their overall sequence similarity

Use either of the search methods described in the Input Options section of this document to retrieve conserved domain architectures. Then click on an architecture of interest to open its summary page. (As an example, open the summary page for domain architecture ID 10002697, cytochrome c biogenesis protein DipZ.) Scroll to the section of the record labeled "Sequences with this architecture." There, you can view all sequences with that architecture or a pre-defined subset. Subsets include protein sequences that have links to corresponding literature references in PubMed, 3D structures, genes, and reference sequence (RefSeq) records. (A separate section of this document provides additional details about the "sequences with this architecture" section of a SPARCLE record.)

Infer the biological function of a hypothetical protein

The examples below, from bacterial genome sequencing projects, have been named "hypothetical protein" by the data submitters. SPARCLE can be used to infer a rather precise biological function for these proteins with good confidence:

Hypothetical protein CNBA2510 [Cryptococcus neoformans var. neoformans B-3501A], Accession.Version: EAL23604.1
View the sequence data in FASTA format
View the CD-Search results, which include protein classification

Hypothetical protein ACD_46C00685G0010 [uncultured bacterium], Accession.Version: EKD69980.1
View the sequence data in FASTA format
View the CD-Search results, which include protein classification

Hypothetical protein CKR_0411 [Clostridium kluyveri NBRC 12016], Accession.Version: BAH05462.1
View the sequence data in FASTA format
View the CD-Search results, which include protein classification

Hypothetical protein CARG_04030 [Corynebacterium argentoratense DSM 44202], Accession.Version: AGU14953.1
View the sequence data in FASTA format
View the CD-Search results, which include protein classification

Compare CDD, CDART, and SPARCLE

What is the association among the CDD, CDART, and SPARCLE resources?
How are they related to each other, and how do they differ?
For what purpose would you use one versus another?
These questions are answered below.

Conserved Domain Database (CDD)

The Conserved Domain Database (CDD) is the foundation upon which CDART and SPARCLE are built.

CDD is a repository of conserved domain models from a variety of source databases, including NCBI-curated conserved domain models, which use 3D-structure information to explicitly define domain boundaries and aligned blocks, and to amend alignment details. Sets of conserved domain models that generate overlapping annotation on the same protein sequences are grouped into superfamilies.

The individual conserved domain models and superfamilies are used by CD-Search (RPS-BLAST) to identify conserved domains in protein sequences, and thereby infer the function of the proteins. Each conserved domain model can fall into one of four types of RPS-BLAST hits, and CD-Search offers three levels of detail in the search results (concise, standard, full results).

In addition to being accessible through CD-Search, the conserved domain models in CDD can also be searched by text term.

Additional details are provided in the CDD help document, CD-Search help document, and CDD publications.

Examples of how CDD can be used and the types of information it displays:

Retrieve a list of conserved domain models that contain a specific keyword or phrase.

Example: retrieve domains that have the phrase "chloride channel" in their description.

If desired, restrict the search to NCBI-curated domain models by adding cdd[database] to the query. For example, a search for:
"chloride channel" AND cdd[database]
will retrieve the NCBI-curated domain models that contain the phrase "chloride channel."

Read more: the CDD help document provides search tips, including details about allowable search terms, examples of basic and advanced search methods, a list of available search fields, tips about use of quotes and truncation, and more.

View the details of a conserved domain model, such as its description, multiple sequence alignment, conserved features/sites, and corresponding 3D structures.

Example: view the conserved domain summary page for the voltage-gated chloride channel, cd00400.

Read more: the CDD help document describes the types of information shown on a summary page for a conserved domain model.

Infer the putative function of a query protein by identifying its conserved domains.

Example: identify the conserved domains in the Arabidopsis thaliana chloride channel E protein sequence (NP_001190924).

Read more: the CD-Search help document describes how to use the CD-Search tool, including allowable types of input and display controls for the output.

Conserved Domain Architecture Retrieval Tool (CDART)

The Conserved Domain Architecture Retrieval Tool (CDART) is built upon CDD.

CDART is a database of conserved domain architectures and a tool for finding protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity, focusing on the overall conserved domain architecture of the protein rather than on individual conserved domains.

A domain architecture is defined as the sequential order of conserved domains in a protein.
CDART uses purely automated techniques to identify the conserved domain architecture of each sequence in the Entrez Protein database.

CDART then uses automated methods to identify domain architectures that are similar to each other.
A similar domain architecture must include at least one of the conserved domain superfamilies in the query sequence. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in the query protein, and is used to rank the search results.
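
The scoring rule described above can be sketched as follows. This is a simplified Python illustration, not CDART's implementation: it assumes that each superfamily is counted once (a set intersection), and the function names, architecture names, and "cl_..." accessions are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Tuple


def similarity_score(query_superfamilies: Iterable[str],
                     architecture_superfamilies: Iterable[str]) -> int:
    """Count the superfamilies in an architecture that also occur in the query."""
    return len(set(query_superfamilies) & set(architecture_superfamilies))


def rank_similar(query_superfamilies: Iterable[str],
                 candidates: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Keep architectures sharing at least one superfamily; rank by score."""
    query = list(query_superfamilies)
    scored = [(name, similarity_score(query, sfs)) for name, sfs in candidates.items()]
    similar = [(name, score) for name, score in scored if score >= 1]
    # sorted() is stable, so architectures with equal scores keep their input order.
    return sorted(similar, key=lambda item: item[1], reverse=True)


# Hypothetical superfamily accessions and architecture names.
query = ["cl_A", "cl_B", "cl_C"]
candidates = {
    "arch1": ["cl_A"],
    "arch2": ["cl_A", "cl_B", "cl_X"],
    "arch3": ["cl_Y"],
}
print(rank_similar(query, candidates))  # [('arch2', 2), ('arch1', 1)]
```

In this sketch "arch3" is dropped because it shares no superfamily with the query, which mirrors the requirement that a similar architecture include at least one of the query's superfamilies.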

Through these methods, CDART makes it possible to retrieve all of the protein sequences with a given conserved domain architecture, and to retrieve proteins with similar domain architectures.

Additional details are provided on the "About CDART" page, in the CDART Help Document, and in the CDART publication.

Examples of how CDART can be used and the types of information it displays:

View the conserved domain architecture of a query protein, followed by a list of similar conserved domain architectures:

Example: View the conserved domain architecture for the Arabidopsis thaliana chloride channel E (NP_001190924) protein sequence.

Read more: the CDART help document describes how to use the tool, including a quick start guide, input options, and output.

View a list of similar conserved domain architectures and retrieve proteins that have an architecture of interest:

Example: Starting with the CDART display for the Arabidopsis thaliana chloride channel E (NP_001190924), scroll through the list of similar conserved domain architectures that appears beneath the query protein's architecture. Click on any architecture of interest to retrieve all sequences from the non-redundant ("nr") protein database that possess the architecture.

Read more: the CDART help document describes how to use the tool, and provides details about the list of similar conserved domain architectures as well as options to filter your results.

Subfamily Protein Architecture Labeling Engine (SPARCLE)

The Subfamily Protein Architecture Labeling Engine (SPARCLE) is built upon CDART.

SPARCLE contains the subset of domain architectures that include at least one conserved domain model that is a specific hit to at least one protein sequence in the non-redundant ("nr") protein database.

SPARCLE then assigns a name and label (a description of the architecture's biological function) to each conserved domain architecture. As noted in the data processing section of this document, names are assigned to the architectures either by a manual curation process, or by automated processes that use algorithms to autoname an architecture, or to name an architecture based on the domains it contains. Each SPARCLE record includes a list of the supporting evidence that was used in assigning a name to the architecture.

In this way, SPARCLE is used to classify proteins, based on the functional characterization and labeling of protein sequences that have been grouped by their characteristic conserved domain architecture.

Additional details are provided on the "About SPARCLE" page, in this SPARCLE Help Document, and in the SPARCLE publication.

Examples of how SPARCLE can be used and the types of information it displays:

Find the protein classification of a query sequence:

Example: Enter the Bacillus subtilis DNA gyrase protein as a query sequence into the CD-Search tool, either as an accession number (NP_387887) or as FASTA-formatted sequence data for NP_387887. The "Protein Classification" section of the CD-Search results for NP_387887 will show the protein's domain architecture, including a link to the corresponding SPARCLE record.

Read more: A separate section of this document provides additional details and an illustrated example showing the classification of a protein query sequence.

Retrieve the domain architectures that contain a keyword or phrase of interest in their description:

Example: Retrieve domain architectures that contain the terms "chloride" and "channel", and limit the results to curated domain architecture records.

Read more: A separate section of this document provides additional details and an illustrated example showing the retrieval of domain architectures using a keyword search.

Retrieve a non-redundant list of protein sequences that have a domain architecture of interest:

Example: Open the SPARCLE record for domain architecture ID 10002697, for cytochrome c biogenesis protein DipZ. Scroll down to the blue header, "Sequences with this architecture," which by default shows all non-redundant sequences with that architecture. If desired, use the folder tabs in that section to view a pre-defined subset of proteins, such as those from the RefSeq or SwissProt databases, or those which have resolved 3D structures.

Read more: A separate section of this document describes the information and options available on a domain architecture summary page.

Input Options

To access SPARCLE, you can either:

Enter a query sequence into CD-Search (illustrated example & note about ongoing research)
-OR-
Search the SPARCLE database by keyword (illustrated example)

With either approach, the corresponding SPARCLE record(s) will display the name and functional label of the protein's conserved domain architecture, supporting evidence, and links to other proteins with the same architecture. Details about each approach are below.

Enter a query sequence into CD-Search

The most common way to access SPARCLE is to enter a query sequence into CD-Search, either as FASTA-formatted sequence data, or as an accession number of a sequence that is in the protein or nucleotide databases. The search results will include a "Protein Classification" section if the query protein has a hit to a curated domain architecture in the SPARCLE database. In the protein classification section, click on the domain architecture ID in order to open the corresponding SPARCLE record.

The illustration below provides an example, using NP_387887, DNA gyrase subunit B, as the protein query sequence.

You can click on the individual panels of the illustration to open the corresponding live web page:

the 1st panel opens a blank CD-Search page, where you can either paste the FASTA-formatted sequence for NP_387887 or enter your own query sequence

the 2nd panel* opens a live view of the search results for NP_387887

the 3rd panel* opens the SPARCLE record for the conserved domain architecture

* Please note that the 2nd and 3rd panels of the illustration reflect the search results as of January 2017. The corresponding live web pages will show a slightly different result, because the annotation of domain architectures on proteins continues to evolve as new data and publications become available. (See the note about ongoing research beneath the illustration.)

Step 1 in using SPARCLE: Enter a query protein sequence into the CD-Search tool. Click on this graphic to open the CD-Search tool and input your own query protein sequence.

Step 2 in using SPARCLE: The CD-Search results page will display a Protein Classification section above the graphic summary of conserved domains, if a SPARCLE record exists for the domain architecture in the query protein sequence. Click on this graphic to open the CD-Search results for NP_387887, DNA gyrase subunit B from Bacillus subtilis.

Step 3 in using SPARCLE: The Protein Classification section of the CD-Search results links to the corresponding SPARCLE record, illustrated here. The SPARCLE record shows the name and functional label of the architecture, supporting evidence, and links to other proteins with the same architecture. Click on this graphic to open the SPARCLE record for the domain architecture (architecture ID 10647733) that was found in the protein query sequence, NP_387887, DNA gyrase subunit B from Bacillus subtilis.

Ongoing research: The Conserved Domain Database (CDD), as well as the conserved domain architecture annotated on proteins by SPARCLE, continue to evolve as new data become available and as research progresses. Therefore, the live web page views might differ from the illustration above.

For example, in January 2017, the protein sequence NP_387887 was initially annotated with architecture ID 10647733 (as shown in the illustration above). That architecture is named "DNA gyrase subunit B" and includes four distinct conserved domains.

In March 2017, when a new build of CDD/SPARCLE was released, the conserved domain architecture annotation for NP_387887 was revised to architecture ID 11481348 (which is a multi-domain model that encompasses the four original conserved domains, and which can be seen in the current CD-Search results for NP_387887). That architecture has a more specific and precise name, "type IIA DNA topoisomerase subunit B," and reflects the full-length protein model.

To see the four distinct conserved domains that compose the full length protein model, simply change the CD-Search display option on the live CD-Search results for NP_387887 from "Concise Results" to "Full Results" (using the "View" menu near the upper right hand corner of the CD-Search results page). The Full Results display will show the four conserved domains that compose the full length protein model.

As the available data and understanding of conserved domain architectures continue to evolve, the domain architectures that are annotated on proteins may evolve as well, as shown in this example. Comments about the data are welcome and can be sent to the NCBI Support Center/Help Desk, which is accessible as a link in the footer of NCBI web pages.

Additional details about using the CD-Search tool are provided in the CD-Search Help Document.

Search the SPARCLE database by keyword

The SPARCLE database can be searched by keyword. That will retrieve domain architectures that contain the term(s) of interest in their descriptions.

The illustration below provides an example. It searches the SPARCLE database for conserved domain architecture records that contain the terms "chloride" and "channel", and limits the results to curated domain architecture records by adding curated[ReviewLevel] to the search.

Click on the individual panels of the illustration below to open the corresponding live web page:

the 1st panel opens the SPARCLE database home page, where you can either enter the example query:
chloride channel AND curated[ReviewLevel]
or enter your own search terms.

the 2nd panel opens a live view of the search results
(Please note this panel shows the search results as of March 2, 2017. The corresponding live web page will retrieve a larger number of records, as the SPARCLE database continues to grow.)

the 3rd panel opens a conserved domain architecture record for the chloride channel protein.

Beneath the illustration are additional details about:

the scope of a search, describing which fields of a database record are searched

search tips for narrowing or broadening your search:

How to limit your query to a specific search field

How to use quotes to force a phrase search

How to use an asterisk (*) for truncation

Compare some sample search strategies

a tabular list of search fields, including a description and sample search for each field

Step 1 in searching the SPARCLE database by keyword: Enter the desired search terms in the query box, adding curated[ReviewLevel], if desired, to limit results to curated domain architectures. Click on this graphic to open the SPARCLE home page and input your own search terms.

Step 2 in searching the SPARCLE database by keyword: View the search results and click on the architecture ID of any domain architecture of interest to open its summary page. Click on this graphic to open the results of a SPARCLE search for chloride channel AND curated[ReviewLevel].

Step 3 in searching the SPARCLE database by keyword: View the SPARCLE record for the domain architecture of interest. Click on this graphic to open the SPARCLE record for architecture ID 10087058, chloride channel protein. From there, you can view evidence used to curate the domain architecture, retrieve all protein sequences which contain that architecture, and more.

Scope of a keyword search:

When you search the SPARCLE database by keyword (e.g., gyrase), All Fields are searched by default. This includes looking for your keyword(s) in the name & functional label (description) of the conserved domain architecture. This also includes looking for your keyword(s) in the entities that were used as evidence to give a name to the architecture, such as gene names (names of genes whose protein products have that architecture), protein names (definition lines of proteins used as evidence to support the domain architecture, such as SwissProt records, where protein sequences are named based on literature), conserved domain names (including the short and long names of conserved domains that are present in the architecture), Enzyme Commission (EC) numbers and corresponding EC text descriptions.

Search tips for keyword searches:

By default, All Fields are searched in the SPARCLE database.

Limit a query to a specific search field:
If you prefer to narrow your search to a specific field, you can:

Use the "Limits" page or the "Advanced" search page to view a list of available search fields, and select the field of interest from a pull-down menu.

Alternatively, you can type the field name, surrounded by square brackets [], directly after your search term, with or without a space between your term and the first bracket. For example:

a search for: curated[ReviewLevel] looks for the term "curated" in the "ReviewLevel" search field

a search for: bacteria[Organism] looks for the term "bacteria" only in the "Organism" search field. This will retrieve conserved domain architectures whose names and labels are applicable within bacteria but not within other taxonomic nodes.

The available search fields are listed in a table below, including a description and search example for each field.

A footnote under the table shows how search fields can be specified using either their full spelling or an abbreviation, and in upper case, lower case, or mixed case.

The "Show Index" link on the SPARCLE Advanced Search page allows you to browse the index of each search field, where you can see the available terms, the number of records containing each term or phrase, as well as the syntax for entering values in search fields such as CreateDate.
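
If you assemble queries in a script before pasting them into the search box, the bracket syntax described above can be generated with a small helper. The Python sketch below is only a string-building convenience; the function names are made up for this example and it does not check that a field name actually exists in SPARCLE.

```python
def field_term(term: str, field: str) -> str:
    """Attach a search-field tag to a term, e.g. curated[ReviewLevel]."""
    term = term.strip()
    field = field.strip().strip("[]")  # accept "Organism" or "[Organism]"
    if not term or not field:
        raise ValueError("both a search term and a field name are required")
    return f"{term}[{field}]"


def and_terms(*terms: str) -> str:
    """Join already-formatted terms with a Boolean AND."""
    return " AND ".join(terms)


print(field_term("curated", "ReviewLevel"))   # curated[ReviewLevel]
print(field_term("bacteria", "[Organism]"))   # bacteria[Organism]
print(and_terms('"chloride channel"', field_term("curated", "ReviewLevel")))
# "chloride channel" AND curated[ReviewLevel]
```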

Use quotes to search for a phrase:
Another way to narrow your search is to enclose multiple terms in quotes (e.g., search for "chloride channel").

Using quotes will require the system to search for the terms as a phrase. It will therefore only retrieve records where the two words occur together, adjacent to each other.

If quotes are not used, the Entrez system may still recognize and handle the terms as a phrase, if they are present in a phrase dictionary used by the search engine. If the terms are not present in the phrase dictionary and are not surrounded by quotes, Entrez will insert a Boolean AND between the terms; in that case, they may or may not appear adjacent to each other in the retrieved records.

The "Details" section in the right hand margin of a search results page will show you exactly how the Entrez system parsed your query. More search tips are provided in the PubMed help document and Entrez help document.
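
The difference between quoted and unquoted terms can be mimicked with the rough Python sketch below. It is a deliberate simplification and not the Entrez parser: it assumes that no phrase dictionary applies, that the query contains no Boolean operators or field tags of its own, and that every term is searched in All Fields. The function name is hypothetical. For an authoritative translation of a real query, rely on the "Details" section of the search results page.

```python
import re


def expand_query(query: str) -> str:
    """Approximate how a simple keyword query is expanded.

    Quoted phrases are kept together as single units; each remaining word
    becomes its own term, and all terms are joined with a Boolean AND.
    """
    tokens = re.findall(r'"[^"]*"|\S+', query)
    return " AND ".join(f"{token}[All Fields]" for token in tokens)


print(expand_query("chloride channel"))
# chloride[All Fields] AND channel[All Fields]
print(expand_query('"chloride channel"'))
# "chloride channel"[All Fields]
```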

Use an asterisk (*) for truncation:
To broaden a search, you can use an asterisk (*) as a wild card to search for a word stem.

For example, a search for chlori* will retrieve records with terms such as chloride, chlorin, chlorinate, chlorinated, chlorinating, chlorination, chlorine, chlorite, and chloritidismutans.

As another example, a search for arachidon* will retrieve records with terms such as arachidonate, arachidonic, arachidonoyl, and arachidonyl.

The Entrez Help document provides additional information about truncating search terms in this way.
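
Conceptually, truncation behaves like a prefix match against the terms in the search index. The Python sketch below illustrates that idea on a small hand-made list of terms; it is not how Entrez implements truncation, and the function name and the term list are assumptions made for the example.

```python
from typing import Iterable, List


def truncation_matches(pattern: str, index_terms: Iterable[str]) -> List[str]:
    """Return the index terms that start with the stem of a pattern such as chlori*."""
    if not pattern.endswith("*"):
        raise ValueError("a truncation pattern must end with an asterisk")
    stem = pattern[:-1].lower()
    return sorted(term for term in index_terms if term.lower().startswith(stem))


# A small, hand-made stand-in for a search index.
terms = ["chloride", "channel", "chlorine", "arachidonate", "chlorite", "arachidonic"]
print(truncation_matches("chlori*", terms))     # ['chloride', 'chlorine', 'chlorite']
print(truncation_matches("arachidon*", terms))  # ['arachidonate', 'arachidonic']
```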

Compare some sample search strategies:
As examples of various search strategies, compare the results of the following searches:

chlori*
If an asterisk is used to truncate a search term, the system will retrieve all records that contain the specified word stem. The word stem can appear in any field of the record, unless you specify a desired search field.

chloride channel
If no search field is specified, [All Fields] are searched by default. Also, the keywords are not necessarily searched as a phrase, but can occur separately in different parts of the record.

"chloride channel"
Use quotes to search for the terms as a phrase.

"chloride channel"[Name]
Limit the query to a specific search field, such as the [Name] field shown here, to narrow the search results.

"chloride channel" AND curated[ReviewLevel]
Add a [ReviewLevel] criterion to the query, as shown above, to limit retrieval to a specified subset of architectures (e.g., architectures that have been curated, autonamed, or namedByDomain).
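The sample strategies above can also be written out programmatically. The sketch below only builds request URLs for the NCBI E-utilities ESearch endpoint; it does not send them. It assumes that SPARCLE is reachable through E-utilities under the database name "sparcle", which this document does not state, so treat the db value and the helper name esearch_url as assumptions.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The sample search strategies compared in this section.
SAMPLE_QUERIES = [
    "chlori*",
    "chloride channel",
    '"chloride channel"',
    '"chloride channel"[Name]',
    '"chloride channel" AND curated[ReviewLevel]',
]


def esearch_url(term, db="sparcle"):
    """Build (but do not send) an ESearch URL; db="sparcle" is an assumption."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    for query in SAMPLE_QUERIES:
        print(esearch_url(query))
```

Building the URL with urlencode keeps the quotes, spaces, and square brackets in a query correctly escaped.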

Search Fields:

As noted in the Search Tips above, when you search the SPARCLE database by keyword, All Fields are searched by default. If you prefer to restrict your search to a specific data field, you can use the pull-down menus on either the "Limits" or the "Advanced" search page to select the desired field. Alternatively, you can type the desired field directly in your query, surrounding the field name with square brackets [].*

The available search fields include:

All Fields
BiosystemsDescription
CDDDescription
CDDShortname
CDDTitle
Comment
CreateDate
Defline
ECNumber
Filter
GeneDescription
GeneSymbol
Label
Name
Organism
PDBTitle
ReviewLevel
Status
UID

Field name Abbreviation* Description Sample Search

All Fields [All]
[All Fields] Searches all of the indexed fields in the SPARCLE database.

If no field specifier is included in a query, the system searches [All] fields by default, as happens with the first sample search shown at the right. Click on that search to open the corresponding results page. The "Search Details" box that appears in the right hand margin of the search results page shows that the query was translated by the system to:
chloride[All Fields] AND channel[All Fields]

chloride channel

The basic search above, in which the query terms are entered without quotes, will retrieve the architecture(s) that contain the word "chloride" and the word "channel" in any field of the record. The words do not have to be adjacent to each other in the record (i.e., they do not have to appear as a phrase), and they do not have to appear in the same field.

"chloride channel"[all]

The search above, which surrounds the search terms with quotes, will retrieve the architecture(s) that contain the phrase "chloride channel" in any field of the record. (The quotes surrounding the search terms ensure they are searched as a phrase.)

Note: Compare the results of the above search, which looks for the phrase "chloride channel" in any field of the record, with the more specific results obtained by the sample [Name] field search:
"chloride channel"[Name]
which retrieves records containing the phrase "chloride channel" only in the name of the conserved domain architecture.
(The data processing section of this document describes how architectures are named.)

BiosystemsDescription [BiosystemsDescription] Descriptions of BioSystems that are listed as supporting evidence for conserved domain architectures in the SPARCLE database.

As noted on the About BioSystems page, a biosystem is a group of molecules that interact in a biological system. One type of biosystem is a biological pathway, which can consist of interacting genes, proteins, and small molecules. Another type of biosystem is a disease, which can involve components such as genes, biomarkers, and drugs.

"folate biosynthesis"[BiosystemsDescription]

will retrieve architecture(s) that list, as supporting evidence, biosystems whose descriptions contain the phrase "folate biosynthesis."

CDDDescription [CDDDescription] Description of conserved domain models that are components of, or that are listed as supporting evidence for, conserved domain architectures in the SPARCLE database.

"transport proteins"[CDDDescription]

will retrieve architecture(s) that contain conserved domain models whose description includes the phrase "transport proteins."

CDDShortname [CDDShortname] Short names of conserved domain models that are components of, or that are listed as supporting evidence for, conserved domain architectures in the SPARCLE database.

The short name is the label that appears on the conserved domain's cartoon in a CD-Search results display.

Note: This field can only be searched by entering the complete short name, surrounded by quotes. Entering a single term or other fragment from the short name will not retrieve results. (See examples below.)

Because of this restriction, it is generally better to search the [CDDDescription] field, which supports more comprehensive searches.

--------------------
Examples: To illustrate the use of the [CDDShortname] field:

A search for the following complete string: "voltage gated clc"[CDDShortname] will retrieve architectures that contain a conserved domain model with that short name.

However, a search for the single word: voltage[CDDShortname] will not retrieve any records, because there are no conserved domains that have a short name consisting of the single word "voltage."
--------------------

Tip: The Advanced search page can be used to browse the available terms in any index.
For example, to see a list of short names, use the "Builder" section of the advanced search page, select the CDDShortname search field from the pull-down menu, then click on "Show index list."

Note: If you do not enter any term in the text box beside the selected search field, the system will automatically take you to the top of the index for the selected search field, and you can then scroll through the terms.

If you enter a term in the text box before clicking on "Show index list," the search system will jump to the part of the index that contains your term, then you can scroll up or down.

"voltage gated clc"[CDDShortname]

will retrieve architecture(s) that contain a conserved domain model whose short name is "voltage gated clc".

(The quotes surrounding the search terms ensure they are searched as a phrase.)

CDDTitle [CDDTitle] Title of conserved domain models that are components of, or that are listed as supporting evidence for, conserved domain architectures in the SPARCLE database.

Note: Some older conserved domain models do not have a title. For example, the conserved domain model with accession cd00400 has a short name of "Voltage_gated_ClC" and an extensive description, but it doesn't have a separate title. As a result, those records will not be retrieved by a search of the [CDDTitle] field.

Therefore, it is generally better to search the [CDDDescription] field, rather than the [CDDTitle] field, because the [CDDDescription] field provides a more comprehensive search.

For example, compare the results of the [CDDTitle] and [CDDDescription] searches:

voltage[CDDTitle]
vs.
voltage[CDDDescription]

voltage[CDDTitle]

will retrieve architecture(s) that contain a conserved domain model whose title includes the word "voltage".

"voltage gated chloride channel"[CDDTitle]

will retrieve architecture(s) that contain a conserved domain model whose title includes the phrase "voltage gated chloride channel".

Note: It is generally better to search the [CDDDescription] field, rather than the [CDDTitle] field, because the [CDDDescription] field provides a more comprehensive search. See the note and examples in the preceding column.

Comment [Comment] The [Comment] field contains free text that was written by curators in the supporting evidence fields of SPARCLE records. It represents something the curators wanted to note about the conserved domain architecture, based on the research they did in curating and naming the architecture.

chloride[Comment]

will retrieve the architectures that contain the word "chloride" in the comments section of a conserved domain architecture's supporting evidence.

CreateDate [CreateDate]
[CDAT]
[PDAT]
[DP] The date on which the current version of a conserved domain architecture record was published in the SPARCLE curation system.

This is referred to as the Create Date [CDAT]. Alternatively, it is sometimes referred to as the Publication Date, or Date of Publication, hence the alternative abbreviations of [PDAT] or [DP].

The architecture subsequently becomes available in the public SPARCLE database, although that might happen a bit later.

Examples:
--------------------
To search for a specific day, month, or year, enter it in any one of the following formats:

YYYY/MM/DD
will retrieve all architectures that were published in the SPARCLE curation system on the specified day

or

YYYY/MM
will retrieve all architectures that were published in the SPARCLE curation system in the specified month

or

YYYY
will retrieve all architectures that were published in the SPARCLE curation system in the specified year

--------------------
To search for a range of dates, enter your query in the following format, using the colon (:) as the range operator:

YYYY/MM/DD[CDAT]:YYYY/MM/DD[CDAT]
will retrieve all architectures that were published in the SPARCLE curation system between the two dates you specified

Single date:

2017/04/20[CDAT]

will retrieve all architectures that were published in the SPARCLE curation system on 20 April 2017.

Date range:

2017/04/20[CreateDate] : 2017/05/18[CreateDate]

will retrieve all architectures that were published in the SPARCLE curation system between 20 April 2017 and 18 May 2017.

In the query above, the colon (:) serves as the range operator.
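The date formats above are easy to get wrong by hand, so here is a minimal Python sketch that formats a single date or a date range as a [CDAT] query. The function name create_date_query is hypothetical; only the YYYY/MM/DD format, the [CDAT] field, and the colon range operator come from this document.

```python
from datetime import date


def create_date_query(start, end=None):
    """Format one date, or a start:end range, as a [CDAT] query string."""
    fmt = "%Y/%m/%d"
    if end is None:
        return f"{start.strftime(fmt)}[CDAT]"
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    # The colon (:) is the range operator.
    return f"{start.strftime(fmt)}[CDAT]:{end.strftime(fmt)}[CDAT]"


if __name__ == "__main__":
    print(create_date_query(date(2017, 4, 20)))
    print(create_date_query(date(2017, 4, 20), date(2017, 5, 18)))
```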

Defline [Defline] The definition line (description) of any protein sequence that was used as supporting evidence for a conserved domain architecture.

chloride[defline]

will retrieve the architectures that list, as supporting evidence, any proteins that have the term "chloride" in their definition line.

ECNumber [ECNumber] The Enzyme Commission (EC) number that is found in the sequence record of any protein that was used as evidence for a conserved domain architecture, or the EC number that is found in a high quality (e.g., curated) sequence record that belongs to the group of proteins annotated with the architecture.

The Enzyme Nomenclature and Classification system is based on the reactions catalyzed by the enzymes. The system is developed by one of the Nomenclature Committees of the International Union of Biochemistry and Molecular Biology (IUBMB). Separate websites enable you to browse enzymes by class, or to search the enzyme nomenclature database by text word or number.

--------------------
Method for assigning EC numbers to conserved domain architecture records in SPARCLE:

Typically, the EC numbers are taken from Swiss-Prot records that belong to the cluster of proteins that have a given architecture.

In addition, the EC number from a Swiss-Prot record might also be applied to other, similar protein clusters that essentially represent the same architecture. Those architectures might have been split into separate SPARCLE records only because they contain slightly different domain models. For example, two or more protein clusters might have top-scoring hits to overlapping/redundant conserved domain models from different source databases, but their architectures are essentially the same, as in the hypothetical example below.

--------------------
As a hypothetical example of how an EC Number from one architecture might be annotated on other architectures:

a) Let's say you have three architectures that are similar to each other:

They each have their own SPARCLE record because their top scoring domain models are slightly different from each other:

------[pfam01]------[pfam05]------

------[pfam01]------[COG12]------

------[pfam01]------[cd0008]------

b) Let's also say that:

domain models pfam05, COG12, and cd0008 are redundant (i.e., they come from different source databases, but they overlap with each other on protein sequences and are therefore redundant)

architecture #2 maps to protein sequence SwissProt P0321

SwissProt P0321 has been annotated with an EC number.

c) As a result:

architectures #1, 2, and 3 above are essentially the same (due to the redundant nature of pfam05, COG12, and cd0008)

all three architectures (all three SPARCLE records) will be indexed with the same EC number that was annotated on SwissProt P0321
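The hypothetical example above can be traced in a few lines of Python. This is a sketch of the idea only, not the SPARCLE pipeline: the function name propagate_ec_numbers, the data layout, and the placeholder string standing in for the EC number of SwissProt P0321 are all invented for illustration.

```python
def propagate_ec_numbers(architectures, redundant_groups, ec_by_architecture):
    """Share EC numbers among architectures that differ only by redundant models."""
    # Map every redundant domain model to one representative of its group.
    representative = {}
    for group in redundant_groups:
        rep = min(group)
        for model in group:
            representative[model] = rep

    def canonical(domains):
        # Same domain order, with redundant models replaced by their representative.
        return tuple(representative.get(d, d) for d in domains)

    # Collect the EC numbers known for each canonical architecture.
    ec_by_canonical = {}
    for arch_id, ec in ec_by_architecture.items():
        ec_by_canonical.setdefault(canonical(architectures[arch_id]), set()).add(ec)

    # Index every architecture with the EC numbers of its canonical form.
    return {
        arch_id: sorted(ec_by_canonical.get(canonical(domains), set()))
        for arch_id, domains in architectures.items()
    }


architectures = {
    1: ["pfam01", "pfam05"],
    2: ["pfam01", "COG12"],
    3: ["pfam01", "cd0008"],
}
redundant_groups = [{"pfam05", "COG12", "cd0008"}]
# Architecture #2 maps to SwissProt P0321; the value is a placeholder, not a real EC number.
ec_by_architecture = {2: "EC-of-P0321"}

indexed = propagate_ec_numbers(architectures, redundant_groups, ec_by_architecture)

if __name__ == "__main__":
    print(indexed)
```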

3.6.4.13[ECNumber]

will retrieve architectures that have the Enzyme Commission number of 3.6.4.13, RNA helicase.

Filter [Filter] The [Filter] field can be used to limit your search to conserved domain architectures that have links to another Entrez database of interest, as shown in the search examples to the right.

NCBI uses the following methods to create links between conserved domain architectures and records in other databases:

The SPARCLE data processing pipeline calculates two types of direct links:

sparcle_protein: each conserved domain architecture in the SPARCLE database links to all protein sequences that have the architecture.

sparcle_cdd: each conserved domain architecture in the SPARCLE database links to all of the conserved domain models (specific hits and superfamilies) that compose the architecture. For example, if an architecture contains one specific hit and one superfamily, that SPARCLE record will link to two Conserved Domain Database (CDD) records -- one for the specific hit and one for the superfamily.

All other links between SPARCLE and other Entrez databases are indirect, created by a join between the proteins that contain the architecture and the other data types.

For example, links from SPARCLE architectures to Gene records are created by a join between the following:

sparcle_protein AND protein_gene → sparcle_gene
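The join shown above can be sketched in Python with plain dictionaries. The function name join_links and all identifiers in the sample data are made up for illustration; the actual link tables are built by NCBI's data processing pipeline.

```python
def join_links(sparcle_protein, protein_gene):
    """Derive architecture-to-gene links by joining through shared proteins."""
    sparcle_gene = {}
    for arch_id, proteins in sparcle_protein.items():
        genes = set()
        for protein in proteins:
            # A protein with no gene link contributes nothing.
            genes |= protein_gene.get(protein, set())
        if genes:
            sparcle_gene[arch_id] = genes
    return sparcle_gene


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    sparcle_protein = {"arch_A": {"prot_1", "prot_2"}, "arch_B": {"prot_3"}}
    protein_gene = {"prot_1": {"gene_X"}, "prot_2": {"gene_X", "gene_Y"}}
    print(join_links(sparcle_protein, protein_gene))
```

In this sketch, an architecture whose proteins have no gene links receives no sparcle_gene link at all, which is why a [Filter] search can be used to exclude such architectures.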

"chloride channel"[All] AND "sparcle_gene"[Filter]

will retrieve conserved domain architectures that have the phrase "chloride channel" in any field of the record, and have links to records in the Gene database.

"chloride channel"[All] AND "sparcle_biosystems"[Filter]

will retrieve conserved domain architectures that have the phrase "chloride channel" in any field of the record, and have links to records in the Biosystems database.

(Note: To view the biosystems that are linked to an architecture, click on an architecture of interest in the SPARCLE search results, then click on the "pathways" link in the right hand margin of the architecture's summary page to open the corresponding Biosystems records.)

GeneDescription [GeneDescription] The description of Gene records that were used as supporting evidence for conserved domain architectures.

The [GeneDescription] index includes text terms from the gene's official full name, official symbol, alternative symbols, and gene summary.

"chloride channel"[GeneDescription]

will retrieve the architecture that lists, as supporting evidence, genes that include the phrase "chloride channel" in their description.

GeneSymbol [GeneSymbol] The gene symbol of Gene records that were used as supporting evidence for conserved domain architectures.

nat16[GeneSymbol]

will retrieve the architecture that lists, as supporting evidence, genes whose symbol is "nat16."

Label [Label] The functional label (description) of a conserved domain architecture.

"chloride channel"[Label]

will retrieve the architecture(s) that contain the phrase "chloride channel" in the functional Label (description) of the architecture.

Name [Name]
[NM] The name of a conserved domain architecture.

The data processing section of this document describes the three different methods by which conserved domain architectures are named:

Curated architectures

Autonamed architectures

NamedByDomain architectures

These represent three tiers of SPARCLE records, which can be retrieved, if desired, using the [ReviewLevel] search field.

"chloride channel"[Name]

will retrieve the architecture(s) that contain the phrase "chloride channel" in the name of the architecture.

Organism [Organism]
[Orgn] The taxonomic node to which the name and label of the conserved domain architecture apply.

By default, conserved domain architectures are associated with the root of the taxonomic tree (i.e., all organisms). When an architecture is associated with the root, it means the name/label of the architecture is not specific to any node of the full taxonomic tree. This is true of most architectures in the SPARCLE database.

If the [Organism] classification of an architecture is not root, but is instead a more specific taxonomic node, that means the curator is asserting that the name/label chosen for the architecture is applicable within the specified node, but not necessarily within other taxonomic branches.

--------------------

For example, the total number of architectures in the SPARCLE database was 129405 as of July 13, 2017. (Note: the current total number of architectures might be larger or smaller, if more architectures have been added or removed since that date as a result of ongoing research).

Most of those architectures are assigned, by default, to the root of the taxonomic tree. (As an example, retrieve the architectures that have a taxonomic scope of all organisms.)

A small number of architectures are assigned to more specific taxonomic nodes, as follows:

archaea

bacteria

eukaryota

fungi

metazoa

viridiplantae

viruses

The next column provides examples of search strategies that will retrieve conserved domain architectures that have a taxonomic scope of interest.

The SPARCLE record for each architecture contains a section entitled "Curated names and labels," which includes the architecture's taxonomic scope.

bacteria[Organism]

will retrieve the architectures whose names and labels are applicable within bacteria but not within other taxonomic nodes.

viruses[Organism]

will retrieve the architectures whose names and labels are applicable within viruses but not within other taxonomic nodes.

guanylate cyclase AND bacteria[Organism]

will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within bacteria but not within other taxonomic nodes.

guanylate cyclase AND eukaryota[Organism]

will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within eukaryota but not within other taxonomic nodes.

PDBTitle [PDBTitle]
[PDBTL] The title of any Protein Data Bank (PDB) record (3D macromolecular structure) that was used as supporting evidence for the conserved domain architecture.

"DNA modification"[PDBTitle]

will retrieve the SPARCLE record that contains the phrase "DNA modification" in the title of any 3D structure record that was used as supporting evidence for the conserved domain architecture.

ReviewLevel [ReviewLevel]
[REV] The SPARCLE database has three tiers (review levels) of conserved domain architecture records:

Curated architectures

Autonamed architectures

NamedByDomain architectures

The data processing section of this document describes the methods by which architectures in each tier are handled.

The [ReviewLevel] search field can be used to limit retrieval to a specific tier of records, if desired, as shown in the search examples in the next column.

(Note: The [ReviewLevel] field is similar to the [Status] field, described below.)

curated[ReviewLevel]
will retrieve all of the curated architectures from the SPARCLE database.

autonamed[ReviewLevel]
will retrieve all of the autonamed architectures from the SPARCLE database.

namedbydomain[ReviewLevel]
will retrieve all of the architectures from the SPARCLE database that were named by domain.

"chloride channel" AND curated[ReviewLevel]
will retrieve all architectures that contain the phrase "chloride channel" in any field of the record, and will then limit the retrieval to curated architectures.

Status [Status] The [Status] field is similar to the [ReviewLevel] field (described above).

The [Status] field divides the SPARCLE database into two broad subsets:

Reviewed (which represents curated records)

Provisional (which represents all other SPARCLE records, such as those that were autonamed or namedByDomain)

In contrast, the [ReviewLevel] field divides the SPARCLE database based on the method by which the data have been processed (i.e., curated, autonamed, or namedByDomain).

Because of this, a search for curated[ReviewLevel] will retrieve the same subset of architectures as reviewed[Status].
A search for provisional[Status] will retrieve all architectures that have not been curated.
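The relationship between the two fields can be summarized as a small lookup table. The Python sketch below simply restates the mapping described above; the names STATUS_BY_REVIEW_LEVEL and status_for are hypothetical.

```python
# Each review level falls into exactly one of the two status subsets.
STATUS_BY_REVIEW_LEVEL = {
    "curated": "reviewed",
    "autonamed": "provisional",
    "namedbydomain": "provisional",
}


def status_for(review_level):
    """Return the [Status] value that corresponds to a [ReviewLevel] value."""
    return STATUS_BY_REVIEW_LEVEL[review_level.lower()]


if __name__ == "__main__":
    for level in ("curated", "autonamed", "namedByDomain"):
        print(f"{level}[ReviewLevel] -> {status_for(level)}[Status]")
```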

reviewed[Status]

will retrieve all of the reviewed (i.e., curated) architectures from the SPARCLE database.

"chloride channel" AND reviewed[Status]

will retrieve all architectures that contain the phrase "chloride channel" in any field of the record, and will then limit the retrieval to reviewed (i.e., curated) architectures.

UID [UID]
[ArchID] The unique identification number (UID) of a conserved domain architecture. It is also referred to as an architecture ID, or archid.

If you enter an integer as a query, the search system will interpret the query by default as a search of the [UID] field.

Additional information about architecture IDs is provided in the section of this document that describes the contents of a conserved domain architecture's summary page.

10087058[UID]

The search above, which uses the [UID] field specifier, will retrieve the architecture that has the unique identification number (UID) 10087058.

10087058

If you enter the query as just the integer, as shown above, without the [UID] field specifier, the search system will search the [UID] field by default.

Therefore, both of the searches above will retrieve the same architecture.
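A short Python sketch of this default follows. It models only the integer rule stated above and leaves every other query unchanged; the function name apply_default_field is hypothetical.

```python
def apply_default_field(query):
    """Append [UID] to a query that consists only of an integer."""
    q = query.strip()
    if q.isdigit():
        # A bare integer is interpreted as an architecture ID.
        return f"{q}[UID]"
    return q


if __name__ == "__main__":
    print(apply_default_field("10087058"))
    print(apply_default_field("10087058[UID]"))
```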

* In a query, the field name may be typed as the full name or abbreviation, and may be in upper, lower, or mixed case. If more than one abbreviation is shown, any one of them can be used. The field name must be surrounded by square brackets []. A space between the search term and the field specifier is optional. If desired, surround a phrase with quotes to force an adjacency search. For example, all of the sample queries below will work equally:
      "chloride channel"[NAME]
      "chloride channel" [NAME]
      "chloride channel"[name]
      "chloride channel" [name]
      "chloride channel" [NM]
      "chloride channel"[nm]

** The quotes surrounding the query terms in some of the sample searches force the terms to be searched as a phrase. If quotes are not used, the Entrez system may still recognize and handle the terms as a phrase, if they are present in a phrase dictionary used by the search engine. If the terms are not present in the phrase dictionary and are not surrounded by quotes, Entrez will insert a Boolean AND between the terms; in that case, they may or may not appear adjacent to each other in the retrieved records. The "Details" section in the right hand margin of a search results page will show you exactly how the Entrez system parsed your query. More search tips are provided in the PubMed help document and Entrez help document.

It is also possible to search for a word stem by using an asterisk (*) as a wild card; for example, arachidon* will retrieve records with terms such as arachidonate, arachidonic, arachidonoyl. The Entrez Help document provides additional information about truncating search terms in this way.
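The field-specifier rules in the first footnote (*) can be sketched in Python: case is ignored, a space before the bracket is optional, and any listed abbreviation is equivalent to the full field name. The alias table below covers only the abbreviations shown in this document, and the names FIELD_ALIASES and normalize_field are hypothetical.

```python
import re

# Abbreviations listed in this document, keyed in lower case.
FIELD_ALIASES = {
    "all": "All Fields", "all fields": "All Fields",
    "name": "Name", "nm": "Name",
    "organism": "Organism", "orgn": "Organism",
    "pdbtitle": "PDBTitle", "pdbtl": "PDBTitle",
    "reviewlevel": "ReviewLevel", "rev": "ReviewLevel",
    "uid": "UID", "archid": "UID",
    "createdate": "CreateDate", "cdat": "CreateDate",
    "pdat": "CreateDate", "dp": "CreateDate",
}

_FIELD_RE = re.compile(r"^(?P<term>.*?)\s*\[(?P<field>[^\[\]]+)\]\s*$")


def normalize_field(query):
    """Rewrite a single-term query so its field specifier uses the full field name."""
    match = _FIELD_RE.match(query.strip())
    if match is None:
        return query.strip()  # no field specifier present
    typed = match.group("field").strip()
    # Unknown field names are kept as typed rather than rejected.
    field = FIELD_ALIASES.get(typed.lower(), typed)
    return f"{match.group('term')}[{field}]"


if __name__ == "__main__":
    for q in ('"chloride channel"[NAME]', '"chloride channel" [name]', '"chloride channel" [NM]'):
        print(normalize_field(q))
```

All six sample queries in the footnote normalize to the same string, which is one way to see why they return the same results.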

Output

Output from a sequence search

If you have entered a query sequence into the CD-Search tool, the CD-Search results page will include a "Protein Classification" section if the query sequence maps to a conserved domain architecture in the SPARCLE database. (If a query sequence does not map to any conserved domain architecture in the SPARCLE database, then the CD-Search results will not include a Protein Classification section.)

A sample protein classification section is shown in the illustration at the right, which displays the CD-Search results for the query sequence DNA gyrase B (NP_387887), an antibiotic target. Click on the illustration to open the live CD-Search results.

Please note that the "Graphical Summary" on the live CD-Search results page might look different from the illustration at the right because conserved domain architecture records in the SPARCLE database continue to evolve with ongoing research.

For example, in January 2017, the protein sequence NP_387887 was initially annotated with architecture ID 10647733 (as shown in the illustration). That architecture is named "DNA gyrase subunit B" and includes four distinct conserved domains.

In March 2017, when a new build of CDD/SPARCLE was released, the conserved domain architecture annotation for NP_387887 was revised to architecture ID 11481348, which is a multi-domain that encompasses the four original conserved domains, and which can be seen in the current CD-Search results for NP_387887. That architecture has a more specific and precise name, "type IIA DNA topoisomerase subunit B," and reflects the full length protein model.

To see the four conserved domains that compose the multi-domain, simply change the CD-Search display option on the live CD-Search results for NP_387887 from "Concise Results" to "Full Results" (using the "View" menu near the upper right hand corner). The Full Results display will show the four conserved domains that compose the multi-domain.

As the available data and understanding of conserved domain architectures continue to evolve, the domain architectures that are annotated on proteins may evolve as well, as shown in this example. Comments about the data are welcome and can be sent to the NCBI Support Center/Help Desk, which is accessible as a link in the footer of NCBI web pages.

Output from a keyword search

If you are searching for keywords in the SPARCLE database, the SPARCLE search results will display a list of the conserved domain architectures that contain the keyword(s) you specified.

Depending on how you entered the search, the search terms can either appear in any field of a conserved domain architecture record, or in a search field you specify, and they can either appear together as a phrase or separate from each other.

The Search Tips section of this document provides details about the scope of a keyword search, as well as tips on how to limit your query to specific search fields, use quotes to force a phrase search, and use an asterisk (*) for truncation. It also includes a comparison of some sample search strategies.

The illustration at the right shows the results of a sample search for the words chloride and channel in any field of an architecture record, and limited to the subset of architecture records that meet the criterion of curated[ReviewLevel].

Click on the illustration to open the corresponding live search results in the SPARCLE database. (Please note that the second panel of the illustration shows the search results as of March 2, 2017; the corresponding live web page will retrieve a larger number of records, as the SPARCLE database continues to grow.)

A comparison of some sample search strategies shows other ways of constructing the query, with links to the search results in each case.

Sample SPARCLE Record

Classification of proteins by domain architecture

A SPARCLE database record is also referred to as a conserved domain architecture's "summary page."

An individual SPARCLE record shows a unique architecture that has been observed in at least one protein sequence.

The summary page displays the name and label of the architecture, along with evidence used to assign that name and label.

Additionally, because SPARCLE is used to classify proteins by their characteristic conserved domain architecture, the summary page includes a list of protein sequences with this architecture.

As noted in the section of the document about ongoing research, the conserved domain models, architectures, and the resulting protein sequence clusters, continue to evolve as new data become available and as research progresses.

The complete contents of a SPARCLE record include the following. Click on any item to read more about it:

Description of architecture

Name
Label (description of function)
Architecture ID
Version
Date Published
Review Level

Sequences with this architecture

Folder tabs

All
Protein with PubMed Reference
3D Structure
Gene
RefSeq
Swiss-Prot

Filters

Tags
Source
Organism
Description
Gene Symbol

Note: Empty Set

Curated Names and Labels

Taxonomic Scope
Name
Label
Supporting evidence

Protein sequences
Conserved domains
Publications
Other

Conserved domains in this architecture
Functional sites in this architecture

Description of the conserved domain architecture

Name of architecture:

The name of a conserved domain architecture is either assigned manually by curation, or computationally by the autoname algorithm or the namedByDomain algorithm.
The architecture name is displayed in two places on a SPARCLE record: near the top of the record (in bold font), and in the "Curated names and labels" section of the record.

For example, the name of the conserved domain architecture shown in the illustrated example of a SPARCLE record is:
"DNA gyrase subunit B." (You can also see this name in the live SPARCLE record for architecture ID 10647733.)

Label (description of function):

The label provides a description of the conserved domain architecture's biological function.
The label is displayed in two places on a SPARCLE record: near the top of the record (beneath the bold font that shows the architecture's name), and in the "Curated names and labels" section of the record.

For example, the label of the conserved domain architecture shown in the illustrated example of a SPARCLE record is:
"DNA gyrase is a type 2 topoisomerase that relaxes supercoils but can also introduce negative supercoils into DNA in an ATP-dependent manner." (You can also see this label in the live SPARCLE record for architecture ID 10647733.)

Architecture ID:

An integer, assigned by NCBI, that uniquely identifies a conserved domain architecture.
The architecture ID is also referred to as a unique identifier (UID) and can be searched directly in the SPARCLE database.

Each architecture ID reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. (As noted in the Overview section of this document, it is also possible for a domain architecture to consist of a single conserved domain footprint. Such architectures also receive an architecture ID.)

The conserved domain models that compose an architecture are shown in two places on the architecture's summary page: (a) in the graphical display at the top of the page (illustrated example), and (b) in the section labeled "Conserved domains in this architecture."

In the graphical display of a conserved domain architecture, you can mouse over a conserved domain's cartoon in order to see its accession number, or click on the cartoon to see detailed information about that domain model, including a multiple sequence alignment of its member proteins.

The accession number prefix for each conserved domain model in the architecture reflects the type of hit it has on the proteins that possess the architecture. Accession numbers that begin with the "cl" prefix indicate a superfamily hit (the "cl" prefix stands for superfamily cluster). All other types of accession numbers (i.e., accessions that begin with any prefix other than "cl") indicate specific hits.
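The prefix rule above amounts to a one-line test, sketched here in Python. The function name hit_type is hypothetical, and the "cl" and "pfam" accessions in the usage example are invented for illustration (cd00400 is mentioned elsewhere in this document).

```python
def hit_type(accession):
    """Classify a domain model's hit type from its accession prefix."""
    # Only the "cl" prefix marks a superfamily hit; everything else is a specific hit.
    if accession.strip().startswith("cl"):
        return "superfamily"
    return "specific hit"


if __name__ == "__main__":
    for acc in ("cd00400", "cl00000", "pfam00000"):
        print(acc, "->", hit_type(acc))
```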

As an example of the unique composition of each architecture, search the SPARCLE database for:
tumor[Name]
That will retrieve conserved domain architectures which contain the term "tumor" in the architecture name, including a number of architectures named "P53 and SAM_tumor-p63 domain-containing protein."
At first glance, some of the architectures appear similar to each other. On closer inspection, however, you will see that each architecture is composed of a unique series of conserved domain accession numbers. (To see the accession numbers, open the SPARCLE record for any domain architecture of interest, then either mouse over the cartoon for each domain in the architecture's graphic, or view the tabular list of "Conserved domains in this architecture.") As a result, each architecture receives its own architecture ID.

Version:

Each SPARCLE record is assigned a version of 1 when it is first published (i.e., first released into the public SPARCLE database). If a SPARCLE record is later revised in any way, the version number is incremented when the revised record is published.
Details: The information within a SPARCLE architecture record can change over time, as new data and publications become available about a given conserved domain architecture. Each time a change is made to a SPARCLE record, and the revised record is then published (i.e., released into the public database), it receives a new version number. The majority of changes are minor, such as corrections of typing errors or the addition of punctuation, such as dashes, to protein names. Other changes might be more important, such as the addition of new evidence in support of the domain architecture, or the correction of a protein name.

Date Published:

The date on which the current version of a conserved domain architecture record was published in the SPARCLE curation system.
The architecture subsequently becomes available in the public SPARCLE database, although that might happen a bit later.

Search tip: To retrieve architectures by their publication date, use the search field called [CreateDate] on the SPARCLE Advanced Search page.

Review Level:

The SPARCLE database has three tiers (review levels) of conserved domain architecture records: curated architectures, autonamed architectures, and named by domain architectures.
Additional details about each tier are provided in the data processing section of this document, including a description of the method by which the architectures in each tier are named.

Search tip: When doing a keyword search of the SPARCLE database, you can limit your search results to architectures that belong to a given tier by using the search field called [ReviewLevel] on the SPARCLE Advanced Search page. Alternatively, you can simply use the "Filter your results" options in the upper right-hand margin of a SPARCLE search results page (illustrated example) to select the desired tier.

Sequences with this architecture

Introductory note
Folder tabs: All | Protein with PubMed Reference | 3D Structure | Gene | RefSeq | Swiss-Prot
Filters: Tags | Source | Organism | Description | Gene Symbol
Note: Empty Set

Introductory note:

The "Sequences with this architecture" table lists the sequences from the NCBI Protein database that have the conserved domain architecture currently being viewed.

A conserved domain architecture is defined as the sequential order of conserved domains in a protein sequence. Additionally, each domain within the architecture can get any one of several hit types against a query protein sequence (e.g., specific hit, non-specific hit, superfamily, multi-domain), as determined by the CD-Search service.

In order to be listed in the "Sequences with this architecture" table, a protein must have the exact order of conserved domains shown in the graphic at the top of a conserved domain architecture's summary page. Additionally, each conserved domain shown in the graphic must be the top-scoring hit for the corresponding region of the protein sequence, and must be of the same hit type as shown in the architecture's graphic. For example, when you mouse over a conserved domain cartoon in the architecture's graphic and see a conserved domain accession number that begins with the "cl" prefix, that indicates a superfamily hit. A conserved domain accession number that begins with any other prefix indicates a specific hit.

Therefore, every protein listed in the "sequences with this architecture" table has the exact order of conserved domains, and the exact hit type to each domain, as shown in the graphic at the top of a conserved domain architecture's summary page.
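The membership rule described above can be sketched as an exact, ordered comparison. This is an illustration only: the function name has_architecture and the accessions are hypothetical, and the sketch assumes that a protein's top-scoring hits are already available as a list of accessions in N-terminal to C-terminal order.

```python
from typing import Sequence


def has_architecture(top_hits: Sequence[str], architecture: Sequence[str]) -> bool:
    """Return True only if the protein's top-scoring hits equal the
    architecture's accessions, in the same order.

    Because the accession prefix encodes the hit type ("cl" = superfamily,
    anything else = specific), equal accessions also mean equal hit types.
    """
    return list(top_hits) == list(architecture)


# Hypothetical architecture: a specific hit followed by a superfamily hit.
architecture = ["cd00001", "cl00002"]

same = has_architecture(["cd00001", "cl00002"], architecture)
reordered = has_architecture(["cl00002", "cd00001"], architecture)
other_hit_type = has_architecture(["cd00001", "cd00003"], architecture)
print(same, reordered, other_hit_type)
```

Only the first protein would be listed in the table; the second has the domains in a different order, and the third has a specific hit where the architecture shows a superfamily hit.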

You can choose to view all proteins that have the architecture, or a pre-defined subset, using the folder tabs and filters described below:

Folder Tabs:

All | Protein with PubMed Reference | 3D Structure | Gene | RefSeq | Swiss-Prot
The folder tabs under "sequences with this architecture" provide quick access to some commonly used data subsets. (A complete list of available data subsets is provided under "Filters.")

"All" folder tab - All proteins in the Protein database that have the conserved domain architecture.

"Protein with PubMed Reference" folder tab - The subset of protein sequences that have this conserved domain architecture, and that include reference to a published article in PubMed.

"3D Structure" folder tab - The subset of protein sequences that have this conserved domain architecture, and that have an experimentally resolved 3-dimensional structure.

"Gene" folder tab - A subset of protein sequences that have this conserved domain architecture, and that have a link to a Gene record. This folder tab shows only one representative protein for each gene to which the architecture is linked, in order to provide a non-redundant view of the genes associated with the architecture.
Note: The "Gene" folder tab lists the same subset of protein sequences that can be retrieved using the option for "Filters:Tags:Gene Representative." However, the displays are slightly different. The "Gene" folder tab provides a gene-centric view that displays the gene ID, gene symbol, and gene description associated with each protein. In contrast, the "Filters:Tags:Gene Representative" displays the protein ID and description of each sequence that is linked to a gene. Both views include the source organism, protein length (in amino acids), and an "Actions" column that provides access to the protein sequences in FASTA format and links to other tools/resources.
"RefSeq" folder tab - The subset of protein sequences that have this conserved domain architecture, and that are from the RefSeq database.

"Swiss-Prot" folder tab - The subset of protein sequences that have this conserved domain architecture, and that are from the UniProtKB/Swiss-Prot database.

Filters:

Tags: Annotated | BioAssay | Gene | Gene Representative | NR Representative | PubMed | Reference
Source | Organism | Description | Gene Symbol
The "Filters" under "sequences with this architecture" enable you to view a number of pre-defined data subsets. Click on the down arrow (V) beside "Filters" to see the complete list and to activate the check box(es) of the desired filter(s). (Some of the commonly used filters are shown as folder tabs near the top of the "sequences with this architecture" section.)

After a filter is selected, it will remain active unless/until you deactivate its checkbox or dismiss the filter(s) by clicking the red X in the "Filters" tab.

The number and types of filters that appear in a SPARCLE record depend on the set of protein sequences with that architecture, and on the information/data links that are available for those proteins. An example architecture that has a wide variety of filters is type IIA DNA topoisomerase subunit B (architecture ID 11481348). Other architectures might have only a small number of filters.
The various filters you might see on a page are described below:

Tags - This filter enables you to view the subset of proteins that have been tagged with various attributes. The available tags include: Annotated, BioAssay, Gene, Gene Representative, NR Representative, PubMed, and Reference.

The "Annotated" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that either link to PubMed, BioAssay, Structure, or OMIM, or are considered to be "landmark" sequences by Smart BLAST.

The SmartBLAST help document includes a section on the "Landmark Database," which describes how the landmark sequences are selected. Excerpt:

"The landmark database includes proteomes from 27 genomes spanning a wide taxonomic range. This search set is produced using the best available genomic assemblies for each organism with the following procedure. First, the most recent representative assembly from each organism is identified. Second, all proteins annotated on each assembly are downloaded and compiled into the landmark BLAST database. The result is a taxonomically diverse non-redundant set of proteins supported by genomic assemblies."

The "BioAssay" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that are the targets of BioAssay experiments.

These are identified by an automated process that looks at the proteins that have the architecture in question, and finds the subset of proteins whose sequence identifiers are listed as the targets of BioAssay experiments.
The section of this document on data processing: links from architectures to other data types describes how links are identified between conserved domain architectures in the SPARCLE database and other data types.

The "Gene" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that are linked to a record in the Gene database.

This option lists all of the protein sequences that have the architecture in question, and that have links to Gene records.
If several protein sequence records have links to the same gene record, all of those protein sequence records will be listed in this view.
For example, if 10 protein sequence records link to 2 genes, the "gene" tag will display all 10 protein sequences.
The section of this document on data processing: links from architectures to other data types describes how links are identified between conserved domain architectures in the SPARCLE database and other data types.

The "Gene representative" tag shows only one representative protein sequence for each gene that is linked to the architecture, in order to provide a non-redundant view of the genes associated with that architecture.

For example, if 10 protein sequence records link to 2 genes, the "gene representative" tag will display only 2 proteins -- one representative protein sequence for each gene.
Note: The "Gene" folder tab lists the same subset of protein sequences that is retrieved by the "Gene Representative" tag. However, the displays are slightly different. The "Gene" folder tab provides a gene-centric view that displays the gene ID, gene symbol, and gene description associated with each protein. In contrast, the "Gene Representative" filter tag displays the protein ID and description of each sequence that is linked to a gene. Both views include the source organism, protein length (in amino acids), and an "Actions" column that provides access to the protein sequences in FASTA format and links to other tools/resources.
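The difference between the "Gene" and "Gene Representative" tags can be illustrated with the example of 10 proteins linked to 2 genes. The sketch below uses hypothetical protein and gene identifiers. This document does not say how the representative protein for a gene is chosen, so the sketch simply assumes the first protein encountered for each gene.

```python
def gene_tag(links):
    """Return every protein that links to a gene (redundant view)."""
    return [protein for protein, gene in links if gene is not None]


def gene_representative_tag(links):
    """Return one protein per gene (non-redundant view).

    Assumption: the first protein seen for each gene is the representative.
    """
    representatives = {}
    for protein, gene in links:
        if gene is not None and gene not in representatives:
            representatives[gene] = protein
    return list(representatives.values())


# Ten hypothetical proteins: six link to gene_1, four link to gene_2.
links = [(f"prot_{i}", "gene_1" if i <= 6 else "gene_2") for i in range(1, 11)]

print(len(gene_tag(links)))
print(gene_representative_tag(links))
```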

An "NR representative" is a protein sequence that has been selected as the representative of a group of identical sequences, for the purpose of creating a protein non-redundant (NR) database. The "NR Representative" tag therefore retrieves a non-redundant list of protein sequences that have this conserved domain architecture. Technical details:

To create a non-redundant protein database, the NCBI data processing pipeline organizes protein sequences into protein identity groups (PIGs). A protein identity group contains protein sequences that are identical in length and composition, regardless of taxonomic source (i.e., regardless of TaxID). Each group is given a stable identification number (PIG ID).
One protein sequence from each PIG is selected as the representative. If the PIG includes a RefSeq record, that is selected as the representative. If no RefSeq record is present, then a representative is selected from one of the following databases, in order of preference: Swiss-Prot, PIR, PDB, GenPept (protein translations of nucleotide sequence records in GenBank that have been annotated with a coding sequence, or CDS, feature), or PRF.
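The grouping and selection steps above can be sketched as follows. This is not the NCBI pipeline: the function and record identifiers are hypothetical, identical sequences are detected by simple string equality, and the database list is read as an order of preference (RefSeq first), which is an interpretation of the text above.

```python
from collections import defaultdict

# Assumed order of preference, taken from the list in the text above.
SOURCE_PRIORITY = ["RefSeq", "Swiss-Prot", "PIR", "PDB", "GenPept", "PRF"]


def source_rank(source: str) -> int:
    """Lower rank = preferred; unknown sources sort last."""
    if source in SOURCE_PRIORITY:
        return SOURCE_PRIORITY.index(source)
    return len(SOURCE_PRIORITY)


def pick_representatives(records):
    """records: iterable of (protein_id, source_database, sequence).

    Groups records with identical sequences (a stand-in for protein
    identity groups) and returns one representative ID per group.
    """
    groups = defaultdict(list)
    for protein_id, source, sequence in records:
        groups[sequence].append((protein_id, source))
    return [
        min(members, key=lambda member: source_rank(member[1]))[0]
        for members in groups.values()
    ]


# Hypothetical records; the sequences are toy strings.
records = [
    ("sp_1", "Swiss-Prot", "MKT"),
    ("refseq_1", "RefSeq", "MKT"),
    ("gp_1", "GenPept", "MAAA"),
    ("pdb_1", "PDB", "MAAA"),
    ("prf_1", "PRF", "MGG"),
]
print(pick_representatives(records))
```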

The "PubMed" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that have a link to published literature represented in the PubMed database.

The "Reference" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that are considered to be "landmark" sequences by Smart BLAST.

The SmartBLAST help document includes a section on the "Landmark Database," which describes how the landmark sequences are selected. Excerpt:

"The landmark database includes proteomes from 27 genomes spanning a wide taxonomic range. This search set is produced using the best available genomic assemblies for each organism with the following procedure. First, the most recent representative assembly from each organism is identified. Second, all proteins annotated on each assembly are downloaded and compiled into the landmark BLAST database. The result is a taxonomically diverse non-redundant set of proteins supported by genomic assemblies."

Source - The source database from which a protein sequence record came

The NCBI Protein database pulls together sequences from a variety of source databases, such as the protein translations of nucleotide sequence records in the GenBank, European Molecular Biology Laboratory (EMBL), DNA Data Bank of Japan (DDBJ), and NCBI Reference Sequence (RefSeq) databases, as well as the Swiss-Prot, Protein Information Resource (PIR), Protein Research Foundation (PRF), and the NCBI Third Party Annotation (TPA) databases.
You can use the "Filters:Source" check boxes on a conserved domain architecture's summary page to retrieve the subset of proteins that are from the source database(s) of interest.
Note: The folder tabs that appear under the blue header for "Sequences with this architecture" provide an alternative way to retrieve proteins from some of the commonly used source databases, such as RefSeq and Swiss-Prot.

Organism - Enter the scientific name of any organism that appears in the list of "sequences with this architecture" to display only the protein sequences that have this conserved domain architecture, and that come from the organism you have specified.

Note: You can enter a taxonomic node other than Genus species. However, the system will currently retrieve only the proteins that have been classified down to the specified level of the taxonomic tree, but no deeper.
For example, open the "Sequences with this architecture" table for architecture ID 11481348: type IIA DNA topoisomerase subunit B.
Enter "Pseudobutyrivibrio ruminis" (without the quotes) in the "Filters:Organism" text box. The system will display only the proteins that have been classified with that exact genus and species.
Now clear/dismiss the filter you just entered (or simply reload the SPARCLE record for architecture ID 11481348) in order to once again display all of the proteins, before doing the next step below.
Enter the taxonomic node "Clostridiales" (without quotes) in the "Organism" filter. The system will display only the proteins that have been classified down to that node of the taxonomic tree, and no deeper.
Note: The Organism filter will be enhanced in the future to allow retrieval by any node in an organism's lineage.
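The current behavior of the Organism filter, as described in the example above, amounts to an exact match on the node at which each protein is classified. The sketch below is illustrative only: the protein identifiers are hypothetical, and case-insensitive matching is an assumption.

```python
def filter_by_organism(proteins, node):
    """proteins: list of (protein_id, node_at_which_protein_is_classified).

    Returns only the proteins classified at exactly the requested node;
    proteins classified deeper in the tree are not returned.
    """
    wanted = node.strip().lower()
    return [pid for pid, classified in proteins if classified.lower() == wanted]


# Hypothetical proteins with the taxonomic node each one is classified at.
proteins = [
    ("prot_1", "Pseudobutyrivibrio ruminis"),
    ("prot_2", "Clostridiales"),
    ("prot_3", "Pseudobutyrivibrio ruminis"),
]

print(filter_by_organism(proteins, "Pseudobutyrivibrio ruminis"))
print(filter_by_organism(proteins, "Clostridiales"))
```

The second call returns only the protein classified at the "Clostridiales" node itself, mirroring the filter's behavior of going down to the specified node and no deeper.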

Description - This filter retrieves the subset of proteins that have this conserved domain architecture, and that have a description (definition line) containing the keyword(s) that you type in the textbox.

For example, open the "Sequences with this architecture" table for architecture ID 12201410: hybrid sensor histidine kinase/response regulator.
The "Sequences with this architecture" table includes some proteins with the description of "Signal transduction histidine kinase."
To view only those proteins, you can enter a single keyword such as "signal" or a phrase such as "signal transduction" (with or without quotes) in the text box beside the "Description" filter.
Note: if you enter two or more terms, they must be adjacent to each other in the description of a protein in order for the protein to be retrieved. That is, if you enter two or more words, the system will search for them as a phrase, whether or not you surround them with quotes.
For example, the protein sequences with the description "Signal transduction histidine kinase" will not be retrieved if you enter the words "signal histidine" (with or without quotes) in the "description" filter.
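The phrase behavior of the Description filter can be sketched as a search for adjacent words. The function name is hypothetical, and the matching details (case-insensitive, whole words split on whitespace, quotes ignored) are assumptions chosen to reproduce the examples above rather than a description of the actual implementation.

```python
def description_matches(description: str, query: str) -> bool:
    """Return True if all query terms occur adjacent to each other,
    in order, within the description."""
    words = description.lower().split()
    terms = query.replace('"', " ").lower().split()
    if not terms:
        return False
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


description = "Signal transduction histidine kinase"
for query in ("signal", "signal transduction", '"signal transduction"', "signal histidine"):
    print(repr(query), description_matches(description, query))
```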

Gene Symbol - This filter retrieves the subset of proteins that have this conserved domain architecture, and that are linked to genes that have the symbol you specified in the textbox.

If a Gene record lists an official symbol as well as aliases (alternative gene symbols), and that gene is associated with the architecture, you can type any one of those symbols into the textbox to retrieve the subset of protein sequences linked to the gene.
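As a minimal sketch of this behavior, a protein is retained when the entered symbol equals either the official symbol or one of the aliases of its linked gene. All identifiers and symbols below are hypothetical, and case-insensitive comparison is an assumption.

```python
def filter_by_gene_symbol(links, symbol):
    """links: list of (protein_id, official_symbol, list_of_aliases)."""
    wanted = symbol.strip().lower()
    matches = []
    for protein_id, official, aliases in links:
        known = {official.lower()} | {alias.lower() for alias in aliases}
        if wanted in known:
            matches.append(protein_id)
    return matches


# Hypothetical links from proteins to genes.
links_to_genes = [
    ("prot_1", "GENE1", ["ALIAS1A", "ALIAS1B"]),
    ("prot_2", "GENE1", ["ALIAS1A", "ALIAS1B"]),
    ("prot_3", "GENE2", []),
]

print(filter_by_gene_symbol(links_to_genes, "GENE1"))
print(filter_by_gene_symbol(links_to_genes, "alias1b"))
```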

Note: Empty set (no links to protein sequences): Occasionally, the "sequences with this architecture" table might display a message that says, "This architecture currently does not link to any protein sequence records." This might be true for either of the following reasons:

The original sequence(s) in which the architecture was found are no longer in the public database (e.g., they might have been found to be erroneous and were therefore withdrawn).
-- or --

The scoring used by the CDD/CD-Search systems might have been refined, and the sequences that were originally linked to this architecture are now linked to a different architecture that achieves a higher score. (An example of this is provided in the section of this document about ongoing research.)

In either case, however, the SPARCLE record for the domain architecture is retained in the database, and its architecture ID is also retained (and not re-used for any other architecture), because it is possible that another sequence in the future will map to the architecture.

Curated Names and Labels

The Curated Names and Labels section of a conserved domain architecture's summary page lists the architecture's:
Taxonomic Scope | Name | Label | Supporting evidence: Protein sequences, Conserved domains, Publications, Other

Taxonomic Scope

The taxonomic scope column indicates the taxonomic node to which the architecture name and label apply.

By default, conserved domain architectures are associated with the root of the taxonomic tree (i.e., all organisms). When an architecture is associated with the root, it means the name/label of the architecture is not specific to any node of the full taxonomic tree. This is true of most architectures in the SPARCLE database.

For example, search the SPARCLE database to retrieve the architectures that have a taxonomic scope of all organisms.

If the taxonomic classification of an architecture is not root, but is instead a more specific taxonomic node, that means the curator is asserting that the name/label chosen for the architecture is applicable within the specified node, but not necessarily within other taxonomic branches.

For example, a search of the SPARCLE database for:

guanylate cyclase AND bacteria[Organism]

will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within bacteria but not within other taxonomic nodes.

guanylate cyclase AND eukaryota[Organism]

will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within eukaryota but not within other taxonomic nodes.

The section of this document about Search fields: [Organism] provides additional information about the taxonomic classification of conserved domain architectures and search tips on how to restrict your search to a specific taxonomic node, if desired.

Name

The name of the conserved domain architecture.
The name is displayed in two places on a SPARCLE record: near the top of the record (in bold font), and in the "Curated names and labels" section of the record.

As an example, the name of the conserved domain architecture shown in the illustration of a sample SPARCLE record is "DNA gyrase subunit B." (You can also see this name in the live SPARCLE record for architecture ID 10647733.)

The name of a conserved domain architecture is either assigned manually by curation, or computationally by the autoname algorithm or the namedByDomain algorithm.

Label

The label provides a description of the conserved domain architecture's biological function.
The label is displayed in two places on a SPARCLE record: near the top of the record (beneath the bold font that shows the architecture's name), and in the "Curated names and labels" section of the record.

As an example, the label of the conserved domain architecture shown in the illustration of a sample SPARCLE record is "DNA gyrase is a type 2 topoisomerase that relaxes supercoils but can also introduce negative supercoils into DNA in an ATP-dependent manner." (You can also see this label in the live SPARCLE record for architecture ID 10647733.)

Supporting Evidence:

The "Curated Names and Labels: Supporting Evidence" section of a conserved domain architecture's summary page lists the evidence that was used by NCBI curators, or by the "autonamed" or "namedbydomain" algorithms, to assign a name to the architecture. Some types of supporting evidence include:

Protein sequences

As described in the data processing section of this document, the names of high quality protein sequences are used by NCBI curators and by the "autonamed" algorithm in assigning a name to the conserved domain architecture (if those proteins are representative of the overall group of sequences that have the architecture in question). The Supporting Evidence: Protein Sequences section of a conserved domain architecture's summary page lists the protein sequence records that were used to name the architecture.

Conserved domains

As described in the data processing section of this document, the names of conserved domain models are used by NCBI curators and by the "namedbydomain" algorithm in assigning a name to the conserved domain architecture.

The "Supporting Evidence: Conserved Domains" section of a SPARCLE record might list one or more of the domains that are present in the architecture (i.e., one or more of the domains that are listed in the "Conserved domains in this architecture" section of the SPARCLE record). It might also list domain models that are not direct components of the architecture, but that belong to the same superfamily clusters as the components and are useful in helping to name the architecture.

As an example, see the conserved domain architecture for the PAS and AAA domain-containing protein (architecture ID 11530124), which was namedByDomain. The "Conserved domains in this architecture" section of the SPARCLE record lists the top-scoring domain models (as determined by CD-Search) on the proteins that have the architecture:

pfam00126: HTH_1 - Bacterial regulatory helix-turn-helix protein, lysR family
pfam00158: Sigma54_activat - Sigma-54 interaction domain
smart00091: PAS - PAS domain
smart00116: CBS - Domain in cystathionine beta-synthase and other proteins
The "Supporting Evidence: Conserved Domains" section of that SPARCLE record lists one of the domain models above, as well as three other domain models (with conserved domain accession numbers that begin with a "cd" prefix):

cd02205: CBS_pair
cd00130: PAS
cd00009: AAA
pfam00126: HTH_1

The domain models with "cd" accessions are not direct components of the architecture (i.e., they are not the top-scoring hits), but they belong to the same clusters as the component domains and are useful in helping to name the architecture because they are curated domains whose names were carefully selected based on published research about protein functions.

Publications

Published articles that describe the function of proteins that contain the conserved domains in the architecture, and that were used in naming the architecture.

Other

Other types of evidence, as available, might also influence the name and functional label that is assigned to a conserved domain architecture. An example of additional evidence could be the biological pathway (biosystem) of which the protein is a part.

Conserved domains in this architecture

Each conserved domain architecture reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. Each architecture is given a unique, stable architecture ID. (As noted in the Overview section of this document, it is also possible for a domain architecture to consist of a single conserved domain footprint.)

The "Conserved domains in this architecture" section of a SPARCLE record provides a tabular list of the conserved domain models that compose the architecture.

The type of accession number in the architecture reflects the type of hit it has on the proteins that possess the architecture.
Accession numbers that begin with the "cl" prefix indicate a superfamily hit (the "cl" prefix stands for superfamily cluster).
All other types of accession numbers indicate specific hits.

The order in which the domains are listed in the table does not necessarily reflect their N-terminal to C-terminal order on the proteins that contain the architecture. The graphic near the top of a conserved domain architecture's summary page, however, does show the N-terminal to C-terminal order of the domains (illustrated example).

Note: one or more of the conserved domains that compose the architecture might also be listed as supporting evidence that was used in assigning a name to the architecture. However, the supporting evidence might also (or might instead) list related conserved domains, as explained in the section of this document that describes that part of a SPARCLE record.

Functional sites in this architecture

Functional sites are also referred to as conserved features/sites, and typically describe sites such as catalytic residues, binding sites, or motifs commonly referred to in the literature.

They are generally identified in NCBI-curated domains.

Functional sites are listed on a SPARCLE record only if the proteins possessing that conserved domain architecture have a specific hit to the NCBI-curated domain model in which the conserved features/sites have been annotated.

Data Processing

data processing overview | three tiers of data: curated architectures, autonamed architectures, named by domain architectures | two types of architectures: superfamily architectures, subfamily architectures | single domain architectures | each architecture receives a unique and stable architecture ID | ongoing research | links from architectures to other data types

Data processing overview

As the number of publicly available protein sequences continues to grow exponentially, efforts are underway to organize that data in a biologically meaningful way. This includes identifying relationships among proteins with similar composition and function.

It is possible to cluster proteins by sequence similarity; however, it is computationally costly to compare all proteins against each other. So an all versus all comparison is not an efficient strategy.

Conserved domain annotations on proteins, on the other hand, provide a simple alternative strategy for clustering proteins. NCBI already computes conserved domain annotation on protein sequences as part of the standard data processing pipeline, using the Conserved Domain Database (CDD) and CD-Search tool.

Building on that effort, the Conserved Domain Architecture Retrieval Tool (CDART) identifies all proteins that have the same domain annotation (the same order of conserved domains, and the same type of hit for each domain, i.e., specific or non-specific) and clusters them together in a group.

As noted in the "Compare CDD, CDART, and SPARCLE" section of this document, SPARCLE is built upon CDART. Specifically, SPARCLE contains the subset of domain architectures that include at least one conserved domain model that is a specific hit to at least one protein sequence in the non-redundant ("nr") protein database.
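The two steps described above (grouping proteins by identical domain annotation, then keeping the architectures that include at least one specific hit) can be sketched in a few lines. This is an illustration under stated assumptions, not the CDART or SPARCLE implementation: the function names, protein identifiers, and accessions are hypothetical, and each annotation is assumed to be an ordered list of (accession, hit type) pairs.

```python
from collections import defaultdict


def cluster_by_architecture(annotations):
    """annotations: dict of protein_id -> ordered list of (accession, hit_type).

    Proteins with the same domains, in the same order, with the same hit
    types share one key. This takes a single pass over the proteins and
    avoids an all-versus-all sequence comparison.
    """
    clusters = defaultdict(list)
    for protein_id, domains in annotations.items():
        clusters[tuple(domains)].append(protein_id)
    return dict(clusters)


def with_specific_hit(clusters):
    """Keep only the architectures that include at least one specific hit."""
    return {
        architecture: members
        for architecture, members in clusters.items()
        if any(hit == "specific" for _, hit in architecture)
    }


# Hypothetical annotations.
annotations = {
    "prot_A": [("cd00001", "specific"), ("cl00002", "superfamily")],
    "prot_B": [("cd00001", "specific"), ("cl00002", "superfamily")],
    "prot_C": [("cl00002", "superfamily")],
}

clusters = cluster_by_architecture(annotations)
subset = with_specific_hit(clusters)
print(len(clusters), "architectures in total;", len(subset), "with a specific hit")
```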

SPARCLE then assigns a name and functional label (a description of the function of the protein family that has the architecture) to each conserved domain architecture. As noted below, names are assigned to the architectures either by a manual curation process, or by automated processes that use algorithms to autoname an architecture, or to name an architecture based on the domains it contains. Curated domain architecture records are supported with, and linked to, evidence from high quality sequence data and literature.

In this way, SPARCLE is used to classify proteins, based on the functional characterization and labeling of protein sequences that have been grouped by their characteristic conserved domain architecture.

Three tiers ("review levels") of conserved domain architectures are present in the SPARCLE database:

curated architectures

autonamed architectures

named by domain architectures

To compare these types of records at a glance, click on the following links to retrieve conserved domain architectures that include the term "kinase" in their [Name], and whose names were assigned by the method indicated in [ReviewLevel]:

kinase[Name] AND curated[ReviewLevel]

kinase[Name] AND autonamed[ReviewLevel]

kinase[Name] AND namedbydomain[ReviewLevel]
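The three queries above follow one pattern, a term followed by a search field in square brackets, joined with AND. The sketch below builds those strings and percent-encodes them for use as a URL query parameter. The helper names are hypothetical, and the parameter name "term" is an assumption; no base URL is given here because this section does not specify one.

```python
from urllib.parse import urlencode

REVIEW_LEVELS = ("curated", "autonamed", "namedbydomain")


def fielded(term: str, field: str) -> str:
    """Attach a search field to a term, e.g. kinase[Name]."""
    return f"{term}[{field}]"


def and_query(*parts: str) -> str:
    """Join fielded terms with the AND operator."""
    return " AND ".join(parts)


def encode_term(query: str) -> str:
    """Percent-encode the query as a 'term' parameter (assumed name)."""
    return urlencode({"term": query})


queries = [
    and_query(fielded("kinase", "Name"), fielded(level, "ReviewLevel"))
    for level in REVIEW_LEVELS
]
for query in queries:
    print(query)
    print("  ", encode_term(query))
```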

Below are details about the data processing methods for each subset, including descriptions of the methods used to name the architectures, followed by a note about ongoing research.

Curated architectures

The manual curation process for conserved domain architectures is carried out by the Conserved Domain Database (CDD) curators and includes the steps noted below. Various types of evidence that were used by the curators in naming an architecture and describing its biological function are listed in the supporting evidence section of a SPARCLE record.

Describe protein function:

The domain architecture curation process begins by looking at a cluster of proteins that have the same architecture and asking the question, can we describe them functionally?

The answer depends on how diverse the sequences are and whether scientific experiments have revealed anything about the functions of the proteins in the cluster.

To arrive at an answer, the curators look at the existing names of protein sequences in the set and at the available evidence for the function of that set, such as publications linked to individual sequences, associated 3D structures, and other types of evidence the curators might find in additional NCBI databases such as BioSystems, BioAssay, etc.

Much of the curation process, however, is based on the availability of published literature, 3D structures, and the presence of high quality sequences (e.g., Swiss-Prot, RefSeq) in the cluster that have been functionally characterized.

Assign a name to the architecture: The curators then make a judgement call in assigning a name to the architecture, with the goal of selecting a name that is representative of the whole cluster of proteins. Below are some examples of situations the curators encounter in the process of naming architectures, and how the names are chosen in each case:

The set of protein sequences with the architecture have a SPECIFIC HIT to a conserved domain model that results in a HIGH CONFIDENCE of the architecture's biological function:

In the process of examining the group of sequences that have a given architecture, the curators might find that the proteins have a specific hit to one or more conserved domain models within the architecture. This represents a high confidence level for the inferred function of the protein query sequence, and therefore can influence the name of the architecture.

For example, the query protein sequence human guanylyl cyclase (AAA74451) has a "retinal guanylyl cyclase 2" domain architecture. The domain architecture includes a specific hit to the NCBI-curated domain cd06371, which has a short name and title of: "PBP1_sensory_GC_DEF_like: Ligand-binding domain of membrane guanylyl cyclases (GC-D, GC-E, and GC-F) that are specifically expressed in sensory tissues." The specific hit to the NCBI-curated domain therefore influenced the name that was given to the architecture: "retinal guanylyl cyclase 2."

A subset of high quality protein sequences that have the SAME NAME and ARE REPRESENTATIVE of the whole protein sequence cluster:

In the process of examining the group of sequences that have a given architecture, the curators might find a subset of high quality sequences that all have the same name, and that are representative of the whole cluster of protein sequences of which they are a part. In that case, the conserved domain architecture is given the same name as the subset of high quality sequences.

For example, let's say a cluster of ~200 protein sequences includes a subset of five sequences from a curated database such as Swiss-Prot or RefSeq. If those five sequences have the same name, and if they are reliable and representative of the whole cluster, then their name is given to the domain architecture as well.

A subset of high quality protein sequences that have the SAME NAME but ARE NOT REPRESENTATIVE of the whole protein sequence cluster:

In the process of examining the group of sequences that have a given architecture, the curators might find a subset of high quality sequences that all have the same name, but that do not represent the overall cluster (e.g., the subset represents only a specific taxon or other subgroup within the larger cluster). In that case, we cannot conclude that all of the sequences in the cluster will share the same function.

This is a common situation, and in such a case, the curators try to find a name for the domain architecture that is more generic and that is derived from the types of conserved domain signatures that are present in the protein family.

For example, let's say a big protein family has a hit to an NAD-dependent dehydrogenase, and one of the high quality sequences has been named as an alpha-ketoglutarate dehydrogenase. Extrapolating that very specific name to all of the sequences which have the same architecture might be a stretch, because we don't have evidence to support such an extrapolation (e.g., the substrates for the other proteins in the family might not yet be known). So the curators make a judgement call to apply a more general name to the domain architecture, and they might simply call the family a dehydrogenase, rather than an alpha-ketoglutarate dehydrogenase.

A subset of high quality protein sequences that have DIFFERENT NAMES

In the process of examining the group of sequences that have a given architecture, the curators might find a subset of high quality sequences that have different names, indicating functional diversity in that family. In such a case, the CDD curators look for commonalities among the high quality protein names and identify generalities that can be applied to the architecture overall.

For example, if the names of Swiss-Prot records indicate that the proteins are a valine transporter, isoleucine transporter, and threonine transporter, then the curators would apply the general name of "amino acid transporter" to the domain architecture.

The curators also take naming rules and standards into consideration. They attempt to find compromises among naming standards (e.g., upper case, lower case, dash, no dash, etc.) that are used by UniProt, Swiss-Prot, and RefSeq, and apply those compromises to the names they apply to SPARCLE domain architectures. There are sometimes differences between American and European naming conventions, and the aim is to minimize or erase those differences over time as naming conventions and standards continue to evolve.

Seemingly redundant SPARCLE architectures that all have the SAME NAME AND FUNCTIONAL LABEL, but they are in fact FUNCTIONALLY DIFFERENT

Sometimes several SPARCLE architectures share the same name and functional label, and so appear redundant, but are in fact functionally different.

For example, many architectures might have the name "sensor histidine kinase" but each of those might be functionally different from the others. The architectures contain the same basic domain signature (i.e., a catalytic domain, an accessory domain that is phosphorylated by the kinase, and a PAS domain), but we don't yet know what signaling pathways they are involved in, and some of the architectures might contain an additional domain whose specific function is unknown. In such cases, if the curators do not have the experimental evidence needed to give the domain architectures a more specific name, they apply the general name to the architectures. The architecture names and descriptions are later refined as additional data and experimental evidence become available.

Search tip:
To retrieve all curated domain architectures, search the SPARCLE database for:
curated[ReviewLevel]
or add that search criterion to a keyword search to limit your retrieval to the desired type of records. For example, a search for:
kinase[Name] AND curated[ReviewLevel]
will retrieve conserved domain architectures that include the term "kinase" in their name, and whose names were assigned manually by NCBI curators.

Autonamed architectures

The names of autonamed conserved domain architectures are generated automatically by an algorithm, based on the frequency of terms that are present in the definition lines of the proteins that have the architecture. Proteins that were used in naming the architecture are listed in the supporting evidence section of a SPARCLE record.

The automatically generated name will begin with the phrase "similar to..." followed by a cleaned-up definition line (e.g., with taxonomy information removed) from the set of high quality proteins that were used to generate the name. The algorithm includes:

Protein name analysis:

The definition lines of all protein sequences that have a given architecture are analyzed.

First, the protein names are tokenized into word terms.

Next, the most popular terms will be selected as representatives to form a voting committee.

The voting committee will vote for the most representative name in this architecture.

Consistency score:

A post-processing step subsequently calculates a consistency score, which measures the extent to which the terms in the name of the representative protein are shared by the other proteins that have the architecture.

The consistency score will be used to decide if this computed name can be selected with enough confidence.

Please note that only a small fraction of architectures can be autonamed in this fashion due to the high confidence level required.

Additionally, architecture names are recalculated with each release of the Conserved Domain Database (CDD). This is because new sequence data are continually added to the Protein database. As a result, the number of protein sequences that have a given architecture might increase, which in turn increases the set of protein names from which an architecture name is computed.
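The autonaming steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the algorithm NCBI actually runs: the function names, the committee size, the vote rule (one vote per committee term found in a name), the consistency formula, and the 0.5 threshold are all hypothetical choices made for the example.

```python
import re
from collections import Counter


def clean_defline(defline):
    """Drop a trailing "[Organism name]" suffix from a definition line."""
    return re.sub(r"\s*\[[^\]]*\]\s*$", "", defline).strip()


def tokenize(name):
    """Split a protein name into a set of lower-case word terms."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def autoname(deflines, committee_size=3, min_consistency=0.5):
    """Return (name or None, consistency score) for one architecture.

    All parameters and scoring rules here are assumptions for illustration.
    """
    names = [clean_defline(d) for d in deflines]
    if not names:
        return None, 0.0
    term_sets = [tokenize(n) for n in names]

    # Count how many protein names contain each term.
    counts = Counter(term for terms in term_sets for term in terms)

    # The most popular terms form the voting committee
    # (ties are broken alphabetically so the result is deterministic).
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    committee = set(ranked[:committee_size])

    # Each committee term votes for every name that contains it.
    votes = [len(committee & terms) for terms in term_sets]
    best = max(range(len(names)), key=lambda i: votes[i])

    # Consistency: average fraction of the winner's terms found in the other names.
    winner = term_sets[best]
    others = [terms for i, terms in enumerate(term_sets) if i != best]
    if not winner or not others:
        return None, 0.0
    consistency = sum(len(winner & terms) / len(winner) for terms in others) / len(others)

    # Only accept the computed name if the confidence is high enough.
    if consistency < min_consistency:
        return None, consistency
    return "similar to " + names[best], consistency
```

In this sketch, a set of definition lines with no shared terms yields no name at all, which mirrors the point above that only a fraction of architectures can be autonamed. Rerunning the function after adding more definition lines can change the result, in the same way that names are recalculated with each CDD release.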

Search tip:
To retrieve all autonamed domain architectures, search the SPARCLE database for:
autonamed[ReviewLevel]
or add that search criterion to a keyword search to limit your retrieval to the desired type of records. For example, a search for:
kinase[Name] AND autonamed[ReviewLevel]
will retrieve conserved domain architectures that include the term "kinase" in their name, and whose names were assigned computationally by the Autonamed algorithm.

NamedByDomain architectures

NamedByDomain conserved domain architectures use an algorithm to automatically generate an architecture name based on the highest scoring conserved domains that are present in the architecture. Domains that were used in naming the architecture are listed in the supporting evidence section of a SPARCLE record.

Architectures that aren't curated, and couldn't be autonamed, are assigned a name based on up to two conserved domain models that are present in the architecture.

While the architecture's name is based on up to two conserved domains, the functional label can be based on up to four of the conserved domains in the architecture.

Sort conserved domains by e-value

If an architecture contains more than two conserved domains, then an algorithm is used to select the two highest-scoring conserved domain models, with a priority given to NCBI-curated domain models.

All of the conserved domain models that appear in the concise view of the architecture are scored based on their E-value.

Technical note: The E-value of a given conserved domain can vary among the proteins that have the architecture in question, because the composition of the protein sequences may vary outside of the conserved domain architecture. To address this issue, the "NamedByDomain" algorithm uses the E-values of the conserved domain models against protein sequences in the oldest protein identity group ("PIG," described below) that has the architecture in question. That is, the algorithm uses the E-values of the domain models on the proteins that have the lowest/oldest PIG ID number.

A protein identity group (PIG) is a cluster of protein sequences that are identical to each other in composition and length, regardless of their taxonomic source. The PIGs are automatically generated by the data processing pipeline at NCBI, which identifies all proteins that are identical to each other, regardless of TaxID, places them together in a protein identity group, and gives each PIG a stable identification number (PIG ID).
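As a rough illustration of the idea, identical proteins can be grouped by exact sequence match, and the group with the lowest ID can then be picked as the reference for E-values. The sketch below is an assumption-laden toy, not the NCBI pipeline: the record layout, the function names, the accessions, and the ID numbers are all hypothetical.

```python
from collections import defaultdict


def group_identical_proteins(proteins):
    """Group (accession, taxid, sequence) records by exact sequence.

    Two records fall in the same group only if their sequences are identical
    in composition and length; the TaxID is deliberately ignored.
    """
    groups = defaultdict(list)
    for accession, _taxid, sequence in proteins:
        groups[sequence].append(accession)
    return list(groups.values())


def oldest_group(pig_members):
    """Return (PIG ID, members) for the lowest numeric PIG ID.

    pig_members maps a hypothetical integer PIG ID to its member accessions.
    """
    pig_id = min(pig_members)
    return pig_id, pig_members[pig_id]
```

For example, two records with the same sequence but different TaxIDs land in one group, while a record that differs by a single residue forms its own group.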

Prioritize specific hits and NCBI-curated domain models

Conserved domain models that have specific hits on the proteins are given priority in naming the architecture.

A specific hit represents a very high confidence that the query sequence belongs to the same protein family as the sequences used to create the domain model, and therefore a high confidence level for the inferred function of the protein query sequence.

The CD-Search help document provides additional information about hit types, including details and an illustration about the domain-specific E-value threshold that is used to identify a specific hit.

NCBI-curated domain models are given priority:

If an NCBI-curated domain model and a domain model from an external source database both have a hit that satisfies the domain-specific threshold for a specific hit, then the NCBI-curated domain model is given priority.
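The selection rules above can be expressed as a sort followed by a cut-off. The sketch below is one possible reading of those rules, not the actual NamedByDomain implementation: the DomainHit fields, the exact order of the tie-breakers, and the model names in the usage example are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One conserved domain model in the concise view of an architecture."""
    accession: str       # hypothetical label for the domain model
    evalue: float        # E-value on the reference proteins (lower is better)
    specific: bool       # True if the hit qualifies as a specific hit
    ncbi_curated: bool   # True if the model is NCBI-curated


def select_naming_domains(hits, max_domains=2):
    """Pick up to max_domains models to base the architecture name on.

    Assumed ranking: specific hits first, then NCBI-curated models,
    then ascending E-value.
    """
    ranked = sorted(hits, key=lambda h: (not h.specific, not h.ncbi_curated, h.evalue))
    return ranked[:max_domains]
```

With this ordering, an NCBI-curated specific hit outranks an external specific hit even when the external model has the better E-value, and a non-specific hit is considered only after all specific hits.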

Search tip:
To retrieve all domain architectures that have been named by domain, search the SPARCLE database for:
namedbydomain[ReviewLevel]
or add that search criterion to a keyword search to limit your retrieval to the desired type of records. For example, a search for:
kinase[Name] AND namedbydomain[ReviewLevel]
will retrieve conserved domain architectures that include the term "kinase" in their name, and whose names were assigned computationally by the NamedByDomain algorithm.
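If you build such queries often, a small helper can assemble the search string for any of the three review levels described in this section. The sketch below only composes text to paste into the SPARCLE search box; it makes no network calls, and the function name and its parameters are hypothetical.

```python
REVIEW_LEVELS = ("curated", "autonamed", "namedbydomain")


def sparcle_query(keyword=None, field="Name", review_level=None):
    """Compose a SPARCLE keyword query such as 'kinase[Name] AND curated[ReviewLevel]'."""
    parts = []
    if keyword:
        # Limit the keyword to one search field, e.g. kinase[Name].
        parts.append(f"{keyword}[{field}]")
    if review_level:
        level = review_level.lower()
        if level not in REVIEW_LEVELS:
            raise ValueError(f"unknown review level: {review_level!r}")
        parts.append(f"{level}[ReviewLevel]")
    return " AND ".join(parts)
```

For example, sparcle_query("kinase", review_level="namedbydomain") produces the second query shown in the search tip above, and sparcle_query(review_level="namedbydomain") produces the first.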

Two types of conserved domain architectures:

Superfamily architectures

Superfamily architectures consist solely of conserved domain superfamilies. This implies a general functional category for the proteins which have that architecture.

That is, each conserved domain footprint in the architecture has an RPS-BLAST superfamily hit to every protein that has been classified with the architecture. This is designated by the "cl" prefix in the accession number of each conserved domain in the architecture. The "cl" stands for superfamily cluster. (To see the accession numbers, mouse over the conserved domain footprints in the architecture's graphical display.)

One example of a superfamily architecture is:

N-terminus------[cl21514]-------[cl00388]------C-terminus

Proteins with this architecture have an RPS-BLAST hit to accession number cl21514 (TauE Superfamily: Sulfite exporter TauE/SafE), followed by a hit to cl00388 (Thioredoxin_like Superfamily: Protein Disulfide Oxidoreductases and Other Proteins with a Thioredoxin fold).

Specifically, the N-terminal region of each protein with this architecture achieved a statistically significant hit to a conserved domain model that belongs to the TauE superfamily, and the C-terminal region achieved a statistically significant hit to a conserved domain model that belongs to the Thioredoxin_like Superfamily. However, neither hit had a high enough score to be considered a specific hit.

As a result, only the superfamily classification is shown for each region of the protein, and the architecture is therefore regarded as a superfamily architecture.

Note: Superfamily architectures are currently found only in the CDART resource. A brief description of CDART is provided in the "Compare CDD, CDART, and SPARCLE" section of this document.

Subfamily architectures

Subfamily architectures either contain a mix of conserved domain superfamilies and subfamilies, or consist solely of conserved domain subfamilies.

A subfamily is represented by a conserved domain model that gets a specific hit to the protein query sequence. The specific hits represent a high confidence that the query sequence belongs to the same protein family as the sequences used to create each conserved domain model, and therefore a high confidence level for the inferred function of the protein query sequence.

To see if a conserved domain is a superfamily or subfamily, mouse over a conserved domain's footprint in the architecture's graphical display. A superfamily will have a "cl" prefix in the accession number; the "cl" stands for superfamily cluster. A subfamily will have an accession number prefix other than "cl".

One example of a subfamily architecture that consists solely of subfamilies is:

N-terminus------[COG0785]-------[cd03012]------C-terminus

Here, the accession number prefixes are "COG" and "cd," indicating that both conserved domains are specific hits. This architecture can be seen, for example, in the CD-Search results for the query protein NP_217390: integral membrane C-type cytochrome biogenesis protein DipZ [Mycobacterium tuberculosis H37Rv]. In the "Protein Classification" section of the CD-Search results, click on the link for "domain architecture ID 10002697" to open the corresponding SPARCLE record for that conserved domain architecture, if desired.

Whether you view the CD-Search results for NP_217390, or the SPARCLE record for domain architecture ID 10002697, you will see that each conserved domain in the architecture achieves a specific hit to the query protein. This can be viewed on the CD-Search results page, in the "Specific Hits" line of the "Graphical Summary." It can also be viewed in the corresponding architecture record, by mousing over the conserved domain cartoons in the architecture's graphic to see that the accession number of each graphic begins with a prefix other than "cl".

Note: Subfamily architectures are currently found only in the SPARCLE resource. A brief description of SPARCLE is provided in the "Compare CDD, CDART, and SPARCLE" section of this document.

Architectures with single conserved domain footprint:

It is also possible for a domain architecture to consist of a single conserved domain footprint. That footprint can represent either a superfamily architecture or a subfamily architecture.
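The "cl" prefix rule lends itself to a simple check. The sketch below classifies an architecture from the accession numbers of its footprints, using only the rule stated in this section; the function names are hypothetical and the check is a convenience for illustration, not an NCBI tool.

```python
def is_superfamily(accession):
    """A superfamily cluster has the "cl" prefix in its accession number."""
    return accession.startswith("cl")


def architecture_type(accessions):
    """Return "superfamily" or "subfamily" for an ordered list of accessions.

    An architecture made up solely of superfamilies is a superfamily
    architecture; any architecture that contains at least one subfamily
    (alone or mixed with superfamilies) is a subfamily architecture.
    """
    if not accessions:
        raise ValueError("an architecture needs at least one conserved domain footprint")
    if all(is_superfamily(acc) for acc in accessions):
        return "superfamily"
    return "subfamily"
```

Applied to the examples in this section, ["cl21514", "cl00388"] is classified as a superfamily architecture and ["COG0785", "cd03012"] as a subfamily architecture. A single footprint works the same way.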

Each architecture receives a unique and stable architecture ID:

Each conserved domain architecture receives a unique and stable architecture ID, which reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. Architectures that consist of a single conserved domain footprint also receive an architecture ID.
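One way to picture what an architecture ID stands for is as a lookup key built from the ordered list of domains and their hit types. The sketch below is purely illustrative: the sequential integers it hands out are made up and are not real SPARCLE architecture IDs, and the function names and hit type labels are assumptions.

```python
def architecture_key(domains):
    """Build a key from (accession, hit_type) pairs listed from N- to C-terminus."""
    return tuple((accession, hit_type) for accession, hit_type in domains)


def architecture_id(domains, registry):
    """Return the ID for an architecture, assigning a new one on first sight.

    registry is a dict that maps architecture keys to hypothetical integer IDs.
    """
    key = architecture_key(domains)
    if key not in registry:
        registry[key] = len(registry) + 1
    return registry[key]
```

Because the key preserves order and hit type, the same two domains in reverse order, or the same position filled by a superfamily hit instead of a specific hit, map to different IDs, while repeated lookups of the same architecture always return the same ID.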

Additional information about conserved domains:

The Conserved Domain Database (CDD) help document provides additional information about domain family hierarchies, including superfamilies and subfamilies. It also provides additional information about the companion CD-Search tool, including the hit types displayed in CD-Search results, such as specific hits, non-specific hits, the superfamily(ies) to which those hits belong, and multi-domain models. Each superfamily on a CD-Search results page is represented by a cartoon with a distinct color/shape combination, in order to distinguish domains from each other.

Ongoing Research

Please note that conserved domain models, architectures, and the resulting protein sequence clusters, continue to evolve as new data become available and as research progresses. As a result, the domain architecture annotated on a protein sequence, and the members of a protein sequence cluster, might change over time.

Specifically, the CDD curation project refines conserved domain models as new protein sequences and publications become available, and through closer analysis of existing clusters.

For example, when the CDD curators see a cluster of protein sequences in SPARCLE that is functionally diverse and that can be broken up into subclusters with more precise function, they do that by creating the appropriate domain models that will reflect the diverse functions. The refined domain models are then added to the data processing pipeline that defines conserved domain architectures and corresponding groups of protein sequences.

Additionally, an architecture that is composed of several individual conserved domain models might later be superseded by a multi-domain model that represents the full-length protein.

As an example, in January 2017, the protein sequence NP_387887 was initially annotated with architecture ID 10647733 (as shown in the illustrated example in the "input sequence data" section of this document). That architecture is named "DNA gyrase subunit B" and includes four distinct conserved domains.

In March 2017, when a new build of CDD/SPARCLE was released, the conserved domain architecture annotation for NP_387887 was revised to architecture ID 11481348, which is a multi-domain model that encompasses the four original conserved domains, and which can be seen in the current CD-Search results for NP_387887. That architecture has a more specific and precise name, "type IIA DNA topoisomerase subunit B," and reflects the full length protein model.

To see the four distinct conserved domains that compose the full length protein model, simply change the CD-Search display option on the live CD-Search results for NP_387887 from "View Concise Results" to "View Full Results" (using the "View" menu near the upper right hand corner of the CD-Search results page). The Full Results display will show the four conserved domains that compose the full length protein model.

As the available data and understanding of conserved domain architectures continue to evolve, the domain architectures that are annotated on proteins may evolve as well, as shown in this example. Comments about the data are welcome and can be sent to the NCBI Support Center/Help Desk, which is accessible as a link in the footer of NCBI web pages.

Links from architectures to other data types

The SPARCLE data processing pipeline calculates two types of direct links:

sparcle_protein: each conserved domain architecture in the SPARCLE database links to all protein sequences that have the architecture.

sparcle_cdd: each conserved domain architecture in the SPARCLE database links to all of the conserved domain models (specific hits and superfamilies) that compose the architecture. For example, if an architecture contains one specific hit and one superfamily, that SPARCLE record will link to two Conserved Domain Database (CDD) records -- one for the specific hit and one for the superfamily.

All other links between SPARCLE and other Entrez databases are indirect, created by a join between the proteins that contain the architecture and the other data types. For example:

links from SPARCLE architectures to Gene records are created by a join between the following:
sparcle_protein AND protein_gene → sparcle_gene

links from SPARCLE architectures to BioAssay records are created by a join between the following:
sparcle_protein AND protein_pcassay_target → sparcle_pcassay_target
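The joins above amount to following two sets of links in sequence. The sketch below shows the idea with plain dictionaries; it is a toy model rather than the Entrez link machinery, and the function name and all identifiers in the usage example are hypothetical.

```python
def join_links(first, second):
    """Compose two link tables: A -> B and B -> C give A -> C.

    Each table maps a source ID to a set of target IDs. Sources with no
    reachable targets are left out of the result.
    """
    joined = {}
    for source, middles in first.items():
        targets = set()
        for middle in middles:
            # Follow the second link from each intermediate record.
            targets |= second.get(middle, set())
        if targets:
            joined[source] = targets
    return joined
```

For example, if sparcle_protein links architecture "arch_1" to proteins "prot_1" and "prot_2", and protein_gene links those proteins to genes, then join_links(sparcle_protein, protein_gene) gives the set of genes reachable from "arch_1", which plays the role of sparcle_gene.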

Log of Changes to SPARCLE

12 OCT 2016 Initial release of the Subfamily Protein Architecture Labeling Engine (SPARCLE).
SPARCLE is a resource for the functional characterization and labeling of protein sequences that have been grouped by their characteristic domain architecture. To use SPARCLE, you can either: (1) enter a query protein sequence into CD-Search, which will display a "Protein Classification" on the results page if the query protein has a hit to a curated domain architecture in the SPARCLE database, or (2) search the SPARCLE database by keyword to retrieve domain architectures that contain the term(s) of interest in their descriptions. With either approach, the corresponding SPARCLE record(s) will display the name and functional label of the architecture, supporting evidence, and links to other proteins with the same architecture. Additional information and illustrated examples are provided on the "About SPARCLE" page and in this help document.

References

Citing SPARCLE:

Marchler-Bauer A, Bo Y, Han L, He J, Lanczycki CJ, Lu S, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Lu F, Marchler GH, Song JS, Thanki N, Wang Z, Yamashita RA, Zhang D, Zheng C, Geer LY, Bryant SH. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 2017 Jan 4;45(D1):D200-D203. doi: 10.1093/nar/gkw1129. Epub 2016 Nov 29. [PubMed PMID: 27899674] [Full Text at Oxford Academic]

Additional references:

Lu S, Wang J, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Marchler GH, Song JS, Thanki N, Yamashita RA, Yang M, Zhang D, Zheng C, Lanczycki CJ, Marchler-Bauer A. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 2019 Nov 28. pii: gkz991. doi: 10.1093/nar/gkz991. [Epub ahead of print] [PubMed PMID: 31777944] [Full Text at Oxford Academic]

(NOTE: The above reference is for the e-publication ahead of print, and will be updated to reflect the volume, issue, pages, and publication date of the print version, once it becomes available in January 2020.)

A separate page lists all publications about NCBI's Conserved Domains and Protein Classification Resources.

Revised 02 December 2019