SPARCLE Help
 
 
   

This help document describes how to use SPARCLE, the Subfamily Protein Architecture Labeling Engine, a resource for protein classification. The Conserved Domains resources page describes additional, related resources and provides "How To" guides that illustrate how those resources can be used.

 
     
 
TABLE OF CONTENTS
 
  What is SPARCLE?
Overview
What is a conserved domain architecture?
Types of architectures
How can SPARCLE be used?
Compare CDD, CDART, and SPARCLE
Input options
Enter a query sequence into CD-Search
Illustrated example
Note about ongoing research
CD-Search help
Search SPARCLE database by keyword
Illustrated example
Scope of search
Search tips
Search fields
Output
Sequence search
Keyword search
Sample SPARCLE Record
Classification of proteins by architecture
Description of architecture
Sequences with this architecture
Curated names and labels
Taxonomic Scope
Name
Label
Supporting evidence
Conserved domains in this architecture
Functional sites in this architecture
Data Processing
Data processing overview
Three tiers of data:
Curated architectures
Autonamed architectures
NamedByDomain architectures
Two types of architectures:
Superfamily architectures
Subfamily architectures
Ongoing research
Links from architectures to other data
Log of changes to SPARCLE
References
 
 
 


SAMPLE SPARCLE RECORD
 
Sample SPARCLE record, showing the name and functional label of the conserved domain architecture found in the protein query sequence, NP_387887, DNA gyrase subunit B from Bacillus subtilis. The SPARCLE record also lists supporting evidence and links to other proteins with the same architecture. Click on the image to read more about SPARCLE records.
 
 


 
What is SPARCLE?

overview | what is a conserved domain architecture? | two types of architectures: superfamily architectures, subfamily architectures | single domain architectures | each architecture receives a unique and stable architecture ID |
how can SPARCLE be used to learn more about proteins? | compare CDD, CDART, and SPARCLE

Overview

SPARCLE, the Subfamily Protein Architecture Labeling Engine, is a resource for the functional characterization and labeling of protein sequences that have been grouped by their characteristic conserved domain architecture.

A conserved domain architecture is defined as the sequential order of conserved domains in a protein sequence.

To give an example of proteins that have similar function but different domain architectures:
There are two types of conserved domain architectures:
Architectures with single conserved domain footprint:
  • It is also possible for a domain architecture to consist of a single conserved domain footprint. That footprint can represent either a superfamily architecture or a subfamily architecture.

Each architecture receives a unique and stable architecture ID:
  • Each conserved domain architecture receives a unique and stable architecture ID, which reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. Architectures that consist of a single conserved domain footprint also receive an architecture ID.
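As a minimal sketch of the definition above, an architecture can be modeled as the ordered sequence of conserved domain hits on a protein, where each hit carries the domain model and its hit type. The class and function names, the placeholder accessions ("domA", "domB"), and the hit-type strings are illustrative assumptions. This is not how SPARCLE computes architecture IDs; it only shows that the set of domains, their order, and their hit types all contribute to an architecture's identity.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DomainHit:
    accession: str  # conserved domain model (placeholder names below)
    hit_type: str   # e.g. "specific" or "superfamily" (illustrative labels)


def architecture_of(hits: Iterable[Tuple[int, str, str]]) -> Tuple[DomainHit, ...]:
    """Return the domains in sequence order, ignoring exact coordinates.

    Each input hit is (start_position, accession, hit_type).
    """
    return tuple(DomainHit(acc, kind) for _, acc, kind in sorted(hits))


# Same domains, same order, different coordinates: same architecture.
protein_1 = [(10, "domA", "specific"), (200, "domB", "superfamily")]
protein_2 = [(35, "domA", "specific"), (410, "domB", "superfamily")]

# Same domains in the opposite order: a different architecture.
reordered = [(10, "domB", "superfamily"), (200, "domA", "specific")]

# Same domains and order, but a different hit type: a different architecture.
different_hit_type = [(10, "domA", "superfamily"), (200, "domB", "superfamily")]

print(architecture_of(protein_1) == architecture_of(protein_2))
print(architecture_of(protein_1) == architecture_of(reordered))
```

A single-domain architecture is simply a tuple of length one in this model.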

Additional information about conserved domains:

How can SPARCLE be used to learn more about proteins?

Compare CDD, CDART, and SPARCLE

What is the association among the CDD, CDART, and SPARCLE resources?
How are they related to each other, and how do they differ?
For what purpose would you use one versus another?
These questions are answered below.

Conserved Domain Database (CDD)
  Examples of how CDD can be used and the types of information it displays:  
   
     
Conserved Domain Architecture Retrieval Tool (CDART)
  • The Conserved Domain Architecture Retrieval Tool (CDART) is built upon CDD.

  • CDART is a database of conserved domain architectures and a tool for finding protein similarities across significant evolutionary distances. It uses sensitive domain profiles rather than direct sequence similarity, and focuses on the overall conserved domain architecture of the protein rather than on individual conserved domains.

  • A domain architecture is defined as the sequential order of conserved domains in a protein.
    CDART uses purely automated techniques to identify the conserved domain architecture of each sequence in the Entrez Protein database.

  • CDART then uses automated methods to identify domain architectures that are similar to each other.
    A similar domain architecture must include at least one of the conserved domain superfamilies in the query sequence. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in the query protein, and is used to rank the search results.

  • Through these methods, CDART makes it possible to retrieve all of the protein sequences with a given conserved domain architecture, and to retrieve proteins with similar domain architectures.

  • Additional details are provided on the "About CDART" page, in the CDART Help Document, and in the CDART publication.
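The similarity score described above (the number of domain superfamilies in an architecture that match superfamilies in the query, with at least one match required) can be sketched as a set intersection. The function names and the superfamily and architecture identifiers below are hypothetical, and the tie-breaking rule is an assumption made only to keep the output deterministic; CDART's actual implementation is not described in this document.

```python
def similarity_score(query_superfamilies, architecture_superfamilies):
    """Count the superfamilies shared by the query and an architecture."""
    return len(set(query_superfamilies) & set(architecture_superfamilies))


def rank_similar(query_superfamilies, architectures):
    """Return (score, name) pairs for similar architectures, best first.

    An architecture is "similar" only if it shares at least one
    superfamily with the query. Ties are broken by name (an assumption).
    """
    scored = [
        (similarity_score(query_superfamilies, sfs), name)
        for name, sfs in architectures.items()
    ]
    similar = [(score, name) for score, name in scored if score >= 1]
    return sorted(similar, key=lambda pair: (-pair[0], pair[1]))


query = {"sfA", "sfB", "sfC"}
candidates = {
    "arch1": {"sfA", "sfB"},
    "arch2": {"sfC"},
    "arch3": {"sfX"},  # shares nothing with the query, so it is dropped
    "arch4": {"sfA", "sfB", "sfC", "sfD"},
}

print(rank_similar(query, candidates))
```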

  Examples of how CDART can be used and the types of information it displays:  
   
     
Subfamily Protein Architecture Labeling Engine (SPARCLE)
  Examples of how SPARCLE can be used and the types of information it displays:  
   
     
 
Input Options


To access SPARCLE, you can either:

With either approach, the corresponding SPARCLE record(s) will display the name and functional label of the protein's conserved domain architecture, supporting evidence, and links to other proteins with the same architecture. Details about each approach are below.



Enter a query sequence into CD-Search

  • The most common way to access SPARCLE is to enter a query sequence into CD-Search, either as FASTA-formatted sequence data, or as an accession number of a sequence that is in the protein or nucleotide databases. The search results will include a "Protein Classification" section if the query protein has a hit to a curated domain architecture in the SPARCLE database. In the protein classification section, click on the domain architecture ID in order to open the corresponding SPARCLE record.

  • The illustration below provides an example, using NP_387887, DNA gyrase subunit B, as the protein query sequence.

  • You can click on the individual panels of the illustration to open the corresponding live web page:



Step 1 in using SPARCLE: Enter a query protein sequence into the CD-Search tool. Click on this graphic to open the CD-Search tool and input your own query protein sequence.

Step 2 in using SPARCLE: The CD-Search results page will display a Protein Classification section above the graphic summary of conserved domains, if a SPARCLE record exists for the domain architecture in the query protein sequence. Click on this graphic to open the CD-Search results for NP_387887, DNA gyrase subunit B from Bacillus subtilis.

Step 3 in using SPARCLE: The Protein Classification section of the CD-Search results links to the corresponding SPARCLE record, illustrated here. The SPARCLE record shows the name and functional label of the architecture, supporting evidence, and links to other proteins with the same architecture. Click on this graphic to open the SPARCLE record for the domain architecture (architecture ID 10647733) that was found in the protein query sequence, NP_387887, DNA gyrase subunit B from Bacillus subtilis.

  • Ongoing research: The Conserved Domain Database (CDD), as well as the conserved domain architecture annotated on proteins by SPARCLE, continue to evolve as new data become available and as research progresses. Therefore, the live web page views might differ from the illustration above.

    For example, in January 2017, the protein sequence NP_387887 was initially annotated with architecture ID 10647733 (as shown in the illustration above). That architecture is named "DNA gyrase subunit B" and includes four distinct conserved domains.

    In March 2017, when a new build of CDD/SPARCLE was released, the conserved domain architecture annotation for NP_387887 was revised to architecture ID 11481348 (which is a multi-domain that encompasses the four original conserved domains, and which can be seen in the current CD-Search results for NP_387887). That architecture has a more specific and precise name, "type IIA DNA topoisomerase subunit B," and reflects the full length protein model.

    To see the four distinct conserved domains that compose the full length protein model, simply change the CD-Search display option on the live CD-Search results for NP_387887 from "Concise Results" to "Full Results" (using the "View" menu near the upper right hand corner of the CD-Search results page). The Full Results display will show the four conserved domains that compose the full length protein model.

    As the available data and understanding of conserved domain architectures continue to evolve, the domain architectures that are annotated on proteins may evolve as well, as shown in this example. Comments about the data are welcome and can be sent to the NCBI Support Center/Help Desk, which is accessible as a link in the footer of NCBI web pages.

  • Additional details about using the CD-Search tool are provided in the CD-Search Help Document.

Search the SPARCLE database by keyword

Step 1 in searching the SPARCLE database by keyword: Enter the desired search terms in the query box, adding curated[ReviewLevel], if desired, to limit results to curated domain architectures. Click on this graphic to open the SPARCLE home page and input your own search terms.

Step 2 in searching the SPARCLE database by keyword: View the search results and click on the architecture ID of any domain architecture of interest to open its summary page. Click on this graphic to open the results of a SPARCLE search for chloride channel AND curated[ReviewLevel].

Step 3 in searching the SPARCLE database by keyword: View the SPARCLE record for the domain architecture of interest. Click on this graphic to open the SPARCLE record for architecture ID 10087058, chloride channel protein. From there, you can view the evidence used to curate the domain architecture, retrieve all protein sequences that contain that architecture, and more.

  • Scope of a keyword search:

    • When you search the SPARCLE database by keyword (e.g., gyrase), All Fields are searched by default. This includes looking for your keyword(s) in the name & functional label (description) of the conserved domain architecture. This also includes looking for your keyword(s) in the entities that were used as evidence to give a name to the architecture, such as gene names (names of genes whose protein products have that architecture), protein names (definition lines of proteins used as evidence to support the domain architecture, such as SwissProt records, where protein sequences are named based on literature), conserved domain names (including the short and long names of conserved domains that are present in the architecture), Enzyme Commission (EC) numbers and corresponding EC text descriptions.

  • Search tips for keyword searches:

    all fields are searched by default | how to limit your query to a specific search field | use quotes to force a phrase search | use an asterisk (*) for truncation | compare some sample search strategies

    • By default, All Fields are searched in the SPARCLE database.

    • Limit a query to a specific search field:
      If you prefer to narrow your search to a specific field, you can:

      • Use the "Limits" page or the "Advanced" search page to view a list of available search fields, and select the field of interest from a pull-down menu.

      • Alternatively, you can type the field name, surrounded by square brackets [], directly after your search term, with or without a space between your term and the first bracket. For example:
        • a search for: curated[ReviewLevel] looks for the term "curated" in the "ReviewLevel" search field
        • a search for: bacteria[Organism] looks for the term "bacteria" only in the "Organism" search field. This will retrieve conserved domain architectures whose names and labels are applicable within bacteria but not within other taxonomic nodes.

      • The available search fields are listed in a table below, including a description and search example for each field.
        • A footnote under the table shows how search fields can be specified using either their full spelling or an abbreviation, and in upper case, lower case, or mixed case.

      • The "Show Index" link on the SPARCLE Advanced Search page allows you to browse the index of each search field, where you can see the available terms, the number of records containing each term or phrase, as well as the syntax for entering values in search fields such as CreateDate.
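The field-qualifier syntax described above (a term followed by a field name in square brackets) is easy to assemble programmatically. The helper names below are hypothetical and exist only for illustration; the field names come from the search fields table in this document. This is a small string-building sketch and does not contact any NCBI service.

```python
def field_term(term, field=None, phrase=False):
    """Build a query clause such as curated[ReviewLevel].

    With phrase=True the term is wrapped in double quotes so that it is
    searched as a phrase. Without a field, no qualifier is appended and
    the search applies to All Fields by default.
    """
    text = f'"{term}"' if phrase else term
    return f"{text}[{field}]" if field else text


def all_of(*clauses):
    """Combine clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


print(field_term("curated", "ReviewLevel"))
print(field_term("chloride channel", "Name", phrase=True))
print(all_of("guanylate cyclase", field_term("bacteria", "Organism")))
```

The three printed strings correspond to queries that appear elsewhere in this document and can be pasted directly into the SPARCLE search box.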

    • Use quotes to search for a phrase:
      Another way to narrow your search is to enclose multiple terms in quotes (e.g., search for "chloride channel").

      • Using quotes will require the system to search for the terms as a phrase. It will therefore only retrieve records where the two words occur together, adjacent to each other.

      • If quotes are not used, the Entrez system may still recognize and handle the terms as a phrase, if they are present in a phrase dictionary used by the search engine. If the terms are not present in the phrase dictionary and are not surrounded by quotes, Entrez will insert a Boolean AND between the terms; in that case, they may or may not appear adjacent to each other in the retrieved records.

      • The "Details" section in the right hand margin of a search results page will show you exactly how the Entrez system parsed your query. More search tips are provided in the PubMed help document and Entrez help document.
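The difference between an unquoted query and a quoted phrase can be illustrated with a toy model. The sketch below assumes the terms are not in the phrase dictionary, so the unquoted query is expanded with a Boolean AND; the record layout and all function names are hypothetical, and the substring check is a deliberately crude stand-in for real phrase indexing.

```python
def translate_unquoted(query):
    """Mimic how an unquoted multi-word query is expanded with AND.

    Assumes the words are not recognized as a phrase by the search engine.
    """
    return " AND ".join(f"{word}[All Fields]" for word in query.split())


def matches_all_terms(record, terms):
    """True if every term occurs somewhere in the record, in any field."""
    words = set()
    for value in record.values():
        words.update(value.lower().split())
    return all(term.lower() in words for term in terms)


def matches_phrase(record, phrase):
    """True if the words occur together, adjacent, within a single field."""
    return any(phrase.lower() in value.lower() for value in record.values())


# Toy records: field name -> text (hypothetical content).
scattered = {"Name": "channel protein", "Comment": "transports chloride ions"}
adjacent = {"Name": "chloride channel protein"}

print(translate_unquoted("chloride channel"))
print(matches_all_terms(scattered, ["chloride", "channel"]))  # terms in different fields
print(matches_phrase(scattered, "chloride channel"))
print(matches_phrase(adjacent, "chloride channel"))
```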

    • Use an asterisk (*) for truncation
      To broaden a search, you can use an asterisk (*) as a wild card to search for a word stem.

      • For example, a search for chlori* will retrieve records with terms such as chloride, chlorin, chlorinate, chlorinated, chlorinating, chlorination, chlorine, chlorite, and chloritidismutans.
      • As another example, a search for arachidon* will retrieve records with terms such as arachidonate, arachidonic, arachidonoyl, and arachidonyl.
      • The Entrez Help document provides additional information about truncating search terms in this way.
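Truncation with an asterisk behaves like a prefix match against the terms in the index. The sketch below uses a small, hypothetical list of index terms; the function name is illustrative, and the real Entrez index is far larger and may limit how many variants are expanded.

```python
def expand_truncation(pattern, index_terms):
    """Return the index terms matched by a pattern such as chlori*.

    A trailing asterisk matches any term that starts with the stem;
    without an asterisk, only an exact (case-insensitive) match counts.
    """
    if pattern.endswith("*"):
        stem = pattern[:-1].lower()
        return [term for term in index_terms if term.lower().startswith(stem)]
    return [term for term in index_terms if term.lower() == pattern.lower()]


# Hypothetical excerpt of an index.
index_terms = [
    "arachidonate", "arachidonic", "channel",
    "chloride", "chlorine", "chlorite", "chlorophyll",
]

print(expand_truncation("chlori*", index_terms))
print(expand_truncation("arachidon*", index_terms))
```

Note that "chlorophyll" is not matched by chlori*, because it does not begin with the stem "chlori".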

    • Compare some sample search strategies:
      As examples of various search strategies, compare the results of the following searches:

  • Search Fields:

    As noted in the Search Tips above, when you search the SPARCLE database by keyword, All Fields are searched by default. If you prefer to restrict your search to a specific data field, you can use the pull-down menus on either the "Limits" or the "Advanced" search page to select the desired field. Alternatively, you can type the desired field directly in your query, surrounding the field name with square brackets [].*
  The available search fields include:

  All Fields
BiosystemsDescription
CDDDescription
CDDShortname
CDDTitle
Comment
CreateDate
Defline
ECNumber
Filter
GeneDescription
GeneSymbol
Label
Name
Organism
PDBTitle
ReviewLevel
Status
UID
 
Field name | Abbreviation* | Description | Sample Search
All Fields [All]
[All Fields]
Searches all of the indexed fields in the SPARCLE database.

If no field specifier is included in a query, the system searches [All] fields by default, as happens with the first sample search shown at the right. Click on that search to open the corresponding results page. The "Search Details" box that appears in the right hand margin of the search results page shows that the query was translated by the system to:
chloride[All Fields] AND channel[All Fields]
Sample search: chloride channel

The basic search above, in which the query terms are entered without quotes, will retrieve the architecture(s) that contain the word "chloride" and the word "channel" in any field of the record. The words do not have to be adjacent to each other in the record (i.e., they do not have to appear as a phrase), and they do not have to appear in the same field.


"chloride channel"[all]

The search above, which surrounds the search terms with quotes, will retrieve the architecture(s) that contain the phrase "chloride channel" in any field of the record. (The quotes surrounding the search terms ensure they are searched as a phrase.)


Note: Compare the results of the above search, which looks for the phrase "chloride channel" in any field of the record, with the more specific results obtained by the sample [Name] field search:
"chloride channel"[Name]
which retrieves records containing the phrase "chloride channel" only in the name of the conserved domain architecture.
(The data processing section of this document describes how architectures are named.)
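The sample queries above can also be URL-encoded for scripted use. The sketch below only builds a request URL for the NCBI E-utilities ESearch endpoint and does not send it. The database name "sparcle", the default retmax value, and the helper name are assumptions that are not stated in this document and should be checked against the E-utilities documentation before use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="sparcle", retmax=20):
    """Build (but do not send) an ESearch URL for a query string.

    db="sparcle" is an assumed Entrez database name.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url('"chloride channel"[Name]')
print(url)

# Decoding the query string recovers the original search term, including
# the quotes and the square-bracketed field qualifier.
print(parse_qs(urlsplit(url).query)["term"][0])
```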


BiosystemsDescription [BiosystemsDescription] Descriptions of BioSystems that are listed as supporting evidence for conserved domain architectures in the SPARCLE database.

As noted on the About BioSystems page, a biosystem is a group of molecules that interact in a biological system. One type of biosystem is a biological pathway, which can consist of interacting genes, proteins, and small molecules. Another type of biosystem is a disease, which can involve components such as genes, biomarkers, and drugs.

Sample search: "folate biosynthesis"[BiosystemsDescription]

will retrieve architecture(s) that list, as supporting evidence, biosystems whose descriptions contain the phrase "folate biosynthesis."



CDDDescription [CDDDescription] Description of conserved domain models that are components of, or that are listed as supporting evidence for, conserved domain architectures in the SPARCLE database.

Sample search: "transport proteins"[CDDDescription]

will retrieve architecture(s) that contain conserved domain models whose description includes the phrase "transport proteins."

CDDShortname [CDDShortname] Short names of conserved domain models that are components of, or that are listed as supporting evidence for, conserved domain architectures in the SPARCLE database.

The short name is the label that appears on the conserved domain's cartoon in a CD-Search results display.

Note: This field can only be searched by entering the complete short name, surrounded by quotes. Entering a single term or other fragment from the short name will not retrieve results. (See examples below.)

For this reason, the [CDDDescription] field is usually the better choice, because it supports more comprehensive searches.

--------------------
Examples: To illustrate the use of the [CDDShortname] field:

A search for the following complete string: "voltage gated clc"[CDDShortname] will retrieve architectures that contain a conserved domain model with that short name.

However, a search for the single word: voltage[CDDShortname] will not retrieve any records, because no conserved domain has a short name that consists of the single word "voltage."
--------------------

Tip: The Advanced search page can be used to browse the available terms in any index.
For example, to see a list of short names, use the "Builder" section of the advanced search page, select the CDDShortname search field from the pull-down menu, then click on "Show index list."

Note: If you do not enter any term in the text box beside the selected search field, the system will automatically take you to the top of the index for the selected search field, and you can then scroll through the terms.

If you enter a term in the text box before clicking on "Show index list," the search system will jump to the part of the index that contains your term, then you can scroll up or down.

Sample search: "voltage gated clc"[CDDShortname]

will retrieve architecture(s) that contain a conserved domain model whose short name is "voltage gated clc".

(The quotes surrounding the search terms ensure they are searched as a phrase.)

CDDTitle [CDDTitle] Title of conserved domain models that are components of, or that are listed as supporting evidence for, conserved domain architectures in the SPARCLE database.

Note: Some older conserved domain models do not have a title. For example, the conserved domain model with accession cd00400 has a short name of "Voltage_gated_ClC" and an extensive description, but it doesn't have a separate title. As a result, those records will not be retrieved by a search of the [CDDTitle] field.

Therefore, it is generally better to search the [CDDDescription] field, rather than the [CDDTitle] field, because the [CDDDescription] field provides a more comprehensive search.

For example, compare the results of the [CDDTitle] and [CDDDescription] searches:

voltage[CDDTitle]
vs.
voltage[CDDDescription]

Sample search: voltage[CDDTitle]

will retrieve architecture(s) that contain a conserved domain model whose title includes the word "voltage".


"voltage gated chloride channel"[CDDTitle]

will retrieve architecture(s) that contain a conserved domain model whose title includes the phrase "voltage gated chloride channel".

Note: It is generally better to search the [CDDDescription] field, rather than the [CDDTitle] field, because the [CDDDescription] field provides a more comprehensive search. See the note and examples in the preceding column.

Comment [Comment] The [Comment] field contains free text that was written by curators in the supporting evidence fields of SPARCLE records. It represents something the curators wanted to note about the conserved domain architecture, based on the research they did in curating and naming the architecture.

Sample search: chloride[Comment]

will retrieve the architectures that contain the word "chloride" in the comments section of a conserved domain architecture's supporting evidence.

CreateDate [CreateDate]
[CDAT]
[PDAT]
[DP]
The date on which the current version of a conserved domain architecture record was published in the SPARCLE curation system.

This is referred to as the Create Date [CDAT]. Alternatively, it is sometimes referred to as the Publication Date, or Date of Publication, hence the alternative abbreviations of [PDAT] or [DP].

The architecture subsequently becomes available in the public SPARCLE database, although that might happen a bit later.

Examples:
--------------------
To search for a specific day, month, or year, enter it in any one of the following formats:

YYYY/MM/DD
will retrieve all architectures that were published in the SPARCLE curation system on the specified day

or

YYYY/MM
will retrieve all architectures that were published in the SPARCLE curation system in the specified month

or

YYYY
will retrieve all architectures that were published in the SPARCLE curation system in the specified year

--------------------
To search for a range of dates, enter your query in the following format, using the colon (:) as the range operator:

YYYY/MM/DD[CDAT]:YYYY/MM/DD[CDAT]
will retrieve all architectures that were published in the SPARCLE curation system between the two dates you specified

Sample searches:

Single date:

2017/04/20[CDAT]

will retrieve all architectures that were published in the SPARCLE curation system on 20 April 2017.


Date range:

2017/04/20[CreateDate] : 2017/05/18[CreateDate]

will retrieve all architectures that were published in the SPARCLE curation system between 20 April 2017 and 18 May 2017.

In the query above, the colon (:) serves as the range operator.
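When date queries are generated by a script, the YYYY/MM/DD format and the colon range operator shown above can be produced from date objects. The function names below are hypothetical; the output strings follow the formats given in this section.

```python
from datetime import date


def create_date_term(day):
    """Format a date as a [CreateDate] query term, e.g. 2017/04/20[CreateDate]."""
    return f"{day:%Y/%m/%d}[CreateDate]"


def create_date_range(start, end):
    """Build a date-range query using the colon (:) as the range operator."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return f"{create_date_term(start)} : {create_date_term(end)}"


print(create_date_term(date(2017, 4, 20)))
print(create_date_range(date(2017, 4, 20), date(2017, 5, 18)))
```

The zero-padded month and day matter: formatting through a date object avoids producing strings such as 2017/4/20.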

Defline [Defline] The definition line (description) of any protein sequence that was used as supporting evidence for a conserved domain architecture.

Sample search: chloride[defline]

will retrieve the architectures that list, as supporting evidence, any proteins that have the term "chloride" in their definition line.

ECNumber [ECNumber] The Enzyme Commission (EC) number that is found in the sequence record of any protein that was used as evidence for a conserved domain architecture, or the EC number that is found in a high quality (e.g., curated) sequence record that belongs to the group of proteins annotated with the architecture.

The Enzyme Nomenclature and Classification system is based on the reactions catalyzed by the enzymes. The system is developed by one of the Nomenclature Committees of the International Union of Biochemistry and Molecular Biology (IUBMB). Separate websites enable you to browse enzymes by class, or to search the enzyme nomenclature database by text word or number.

--------------------
Method for assigning EC numbers to conserved domain architecture records in SPARCLE:

Typically, the EC numbers are taken from Swiss-Prot records that belong to the cluster of proteins that have a given architecture.

In addition, the EC number from a Swiss-Prot record might also be applied to other, similar protein clusters that essentially represent the same architecture. Those architectures might have been split into separate SPARCLE records only because they contain slightly different domain models. For example, two or more protein clusters might have top-scoring hits to overlapping/redundant conserved domain models from different source databases, but their architectures are essentially similar, as in the hypothetical example below.

--------------------
As a hypothetical example of how an EC Number from one architecture might be annotated on other architectures:

a) Let's say you have three architectures that are similar to each other:
  • They each have their own SPARCLE record because their top scoring domain models are slightly different from each other:
  1. ------[pfam01]------[pfam05]------
  2. ------[pfam01]------[COG12]------
  3. ------[pfam01]------[cd0008]------
b) Let's also say that:
  • domain models pfam05, COG12, and cd0008 are redundant (i.e., they come from different source databases, but they overlap with each other on protein sequences and are therefore redundant)
  • architecture #2 maps to protein sequence SwissProt P0321
  • SwissProt P0321 has been annotated with an EC number.
c) As a result:
  • architectures #1, 2, and 3 above are essentially the same (due to the redundant nature of pfam05, COG12, and cd0008)
  • all three architectures (all three SPARCLE records) will be indexed with the same EC number that was annotated on SwissProt P0321
Sample search: 3.6.4.13[ECNumber]

will retrieve architectures that have the Enzyme Commission number of 3.6.4.13, RNA helicase.
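The hypothetical EC number example in the description above (three architectures that differ only in redundant domain models) can be sketched as follows. All identifiers are the hypothetical ones from that example, the EC value is a placeholder string, and the grouping logic is an illustrative assumption rather than the actual SPARCLE pipeline.

```python
# Redundant domain models map to one shared group label (hypothetical).
REDUNDANT_GROUP = {"pfam05": "groupX", "COG12": "groupX", "cd0008": "groupX"}

# The three hypothetical architectures, as ordered tuples of domain models.
architectures = {
    1: ("pfam01", "pfam05"),
    2: ("pfam01", "COG12"),
    3: ("pfam01", "cd0008"),
}

# Architecture #2 maps to SwissProt P0321, which carries an EC number.
# The value is a placeholder, not a real EC number.
ec_evidence = {2: "EC-number-of-P0321"}


def canonical(models):
    """Replace each redundant model with its group label."""
    return tuple(REDUNDANT_GROUP.get(model, model) for model in models)


def propagate_ec(architectures, ec_evidence):
    """Index every architecture with the EC numbers of its equivalents."""
    groups = {}
    for arch_id, models in architectures.items():
        groups.setdefault(canonical(models), []).append(arch_id)

    indexed = {}
    for arch_ids in groups.values():
        shared = {ec_evidence[i] for i in arch_ids if i in ec_evidence}
        for arch_id in arch_ids:
            indexed[arch_id] = set(shared)
    return indexed


print(propagate_ec(architectures, ec_evidence))
```

All three architectures end up indexed with the same EC value, so an [ECNumber] search would retrieve all three SPARCLE records in this toy model.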

Filter [Filter] The [Filter] field can be used to limit your search to conserved domain architectures that have links to another Entrez database of interest, as shown in the search examples to the right.

NCBI uses the following methods to create links between conserved domain architectures and records in other databases:

The SPARCLE data processing pipeline calculates two types of direct links:
  1. sparcle_protein: each conserved domain architecture in the SPARCLE database links to all protein sequences that have the architecture.

  2. sparcle_cdd: each conserved domain architecture in the SPARCLE database links to all of the conserved domain models (specific hits and superfamilies) that compose the architecture. For example, if an architecture contains one specific hit and one superfamily, that SPARCLE record will link to two Conserved Domain Database (CDD) records -- one for the specific hit and one for the superfamily.
All other links between SPARCLE and other Entrez databases are indirect, created by a join between the proteins that contain the architecture and the other data types.
  • For example, links from SPARCLE architectures to Gene records are created by a join between the following:

    sparcle_protein  AND  protein_gene  →  sparcle_gene


Sample search: "chloride channel"[All] AND "sparcle_gene"[Filter]

will retrieve conserved domain architectures that have the phrase "chloride channel" in any field of the record, and have links to records in the Gene database.


"chloride channel"[All] AND "sparcle_biosystems"[Filter]

will retrieve conserved domain architectures that have the phrase "chloride channel" in any field of the record, and have links to records in the Biosystems database.

(Note: To view the biosystems that are linked to an architecture, click on an architecture of interest in the SPARCLE search results, then click on the "pathways" link in the right hand margin of the architecture's summary page to open the corresponding Biosystems records.)
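The indirect links described for the [Filter] field (sparcle_protein AND protein_gene → sparcle_gene) amount to a join through the shared protein identifiers. The sketch below uses hypothetical architecture, protein, and gene identifiers, and the function name is illustrative; it shows the general join idea, not NCBI's link-building code.

```python
# Direct links: architecture -> proteins that have the architecture.
sparcle_protein = {
    "archA": {"prot1", "prot2"},
    "archB": {"prot3"},
}

# Existing links in another database pair: protein -> genes.
protein_gene = {
    "prot1": {"gene1"},
    "prot2": {"gene1", "gene2"},
    # prot3 has no gene link
}


def join_links(first, second):
    """Join two link tables through their shared middle identifiers."""
    joined = {}
    for source, middles in first.items():
        targets = set()
        for middle in middles:
            targets |= second.get(middle, set())
        if targets:  # sources with no targets get no link at all
            joined[source] = targets
    return joined


sparcle_gene = join_links(sparcle_protein, protein_gene)
print(sparcle_gene)
```

In this toy data, archB has no entry in sparcle_gene, which mirrors why a query that includes "sparcle_gene"[Filter] retrieves only architectures that have at least one Gene link.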

GeneDescription [GeneDescription] The description of Gene records that were used as supporting evidence for conserved domain architectures.

The [GeneDescription] index includes text terms from the gene's official full name, official symbol, alternative symbols, and gene summary.

Sample search: "chloride channel"[GeneDescription]

will retrieve the architecture that lists, as supporting evidence, genes that include the phrase "chloride channel" in their description.

GeneSymbol [GeneSymbol] The gene symbol of Gene records that were used as supporting evidence for conserved domain architectures.

Sample search: nat16[GeneSymbol]

will retrieve the architecture that lists, as supporting evidence, genes whose symbol is "nat16."

Label [Label] The functional label (description) of a conserved domain architecture.

Sample search: "chloride channel"[Label]

will retrieve the architecture(s) that contain the phrase "chloride channel" in the functional Label (description) of the architecture.

Name [Name]
[NM]
The name of a conserved domain architecture.

The data processing section of this document describes the three different methods by which conserved domain architectures are named:
  1. Curated architectures
  2. Autonamed architectures
  3. NamedByDomain architectures
These represent three tiers of SPARCLE records, which can be retrieved, if desired, using the [ReviewLevel] search field.

Sample search: "chloride channel"[Name]

will retrieve the architecture(s) that contain the phrase "chloride channel" in the name of the architecture.
Organism [Organism]
[Orgn]
The taxonomic node to which the name and label of the conserved domain architecture apply.

By default, conserved domain architectures are associated with the root of the taxonomic tree (i.e., all organisms). When an architecture is associated with the root, it means the name/label of the architecture is not specific to any node of the full taxonomic tree. This is true of most architectures in the SPARCLE database.

If the [Organism] classification of an architecture is not root, but is instead a more specific taxonomic node, that means the curator is asserting that the name/label chosen for the architecture is applicable within the specified node, but not necessarily within other taxonomic branches.

--------------------

For example, the total number of architectures in the SPARCLE database was 129405 as of July 13, 2017. (Note: the current total number of architectures might be larger or smaller, if more architectures have been added or removed since that date as a result of ongoing research).

Most of those architectures are assigned, by default, to the root of the taxonomic tree. (As an example, retrieve the architectures that have a taxonomic scope of all organisms.)

A small number of architectures are assigned to more specific taxonomic nodes, as follows: The next column provides examples of search strategies that will retrieve conserved domain architectures that have a taxonomic scope of interest.

The SPARCLE record for each architecture contains a section entitled "Curated names and labels," which includes the architecture's taxonomic scope.

Sample search: bacteria[Organism]

will retrieve the architectures whose names and labels are applicable within bacteria but not within other taxonomic nodes.


viruses[Organism]

will retrieve the architectures whose names and labels are applicable within viruses but not within other taxonomic nodes.


guanylate cyclase AND bacteria[Organism]

will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within bacteria but not within other taxonomic nodes.


guanylate cyclase AND eukaryota[Organism]

will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within eukaryota but not within other taxonomic nodes.
PDBTitle [PDBTitle]
[PDBTL]
The title of any Protein Data Bank (PDB) record (3D macromolecular structure) that was used as supporting evidence for the conserved domain architecture.

"DNA modification"[PDBTitle]

will retrieve the SPARCLE record that contains the phrase "DNA modification" in the title of any 3D structure record that was used as supporting evidence for the conserved domain architecture.
ReviewLevel [ReviewLevel]
[REV]
The SPARCLE database has three tiers (review levels) of conserved domain architecture records:
  1. Curated architectures
  2. Autonamed architectures
  3. NamedByDomain architectures
The data processing section of this document describes the methods by which architectures in each tier are handled.

The [ReviewLevel] search field can be used to limit retrieval to a specific tier of records, if desired, as shown in the search examples in the next column.

(Note: The [ReviewLevel] field is similar to the [Status] field, described below.)

curated[ReviewLevel]
will retrieve all of the curated architectures from the SPARCLE database.


autonamed[ReviewLevel]
will retrieve all of the autonamed architectures from the SPARCLE database.


namedbydomain[ReviewLevel]
will retrieve all of the architectures from the SPARCLE database that were named by domain.


"chloride channel" AND curated[ReviewLevel]
will retrieve all architectures that contain the phrase "chloride channel" in any field of the record, and will then limit the retrieval to curated architectures.

Status [Status]
The [Status] field is similar to the [ReviewLevel] field (described above).

The [Status] field divides the SPARCLE database into two broad subsets:
  • Reviewed (which represents curated records)
  • Provisional (which represents all other SPARCLE records, such as those that were autonamed or namedByDomain)
(In contrast, the [ReviewLevel] field divides the SPARCLE database based on the method by which the data have been processed, i.e., curated, autonamed, or namedByDomain.)

Because of this, a search for curated[ReviewLevel] will retrieve the same subset of architectures as reviewed[Status].
A search for provisional[Status] will retrieve all architectures that have not been curated.

reviewed[Status]

will retrieve all of the reviewed (i.e., curated) architectures from the SPARCLE database.


"chloride channel" AND reviewed[Status]

will retrieve all architectures that contain the phrase "chloride channel" in any field of the record, and will then limit the retrieval to reviewed (i.e., curated) architectures.
UID [UID]
[ArchID]
The unique identification number (UID) of a conserved domain architecture. It is also referred to as an architecture ID, or archid.

If you enter an integer as a query, the search system will interpret the query by default as a search of the [UID] field.

Additional information about architecture IDs is provided in the section of this document that describes the contents of a conserved domain architecture's summary page.
10087058[UID]

The search above, which uses the [UID] field specifier, will retrieve the architecture that has the unique identification number (UID) 10087058.


10087058

If you enter the query as just the integer, as shown above, without the [UID] field specifier, the search system will search the [UID] field by default.

Therefore, both of the searches above will retrieve the same architecture.
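
As a rough illustration of how the [ReviewLevel] tiers relate to the [Status] subsets described in the search fields above, the following Python sketch encodes the mapping in a small lookup table. The names REVIEW_LEVEL_TO_STATUS and status_for are hypothetical and exist only for this example; they are not part of any NCBI tool.

```python
# Mapping taken from the [ReviewLevel] and [Status] field descriptions.
# The dictionary and function names are illustrative assumptions.
REVIEW_LEVEL_TO_STATUS = {
    "curated": "reviewed",
    "autonamed": "provisional",
    "namedbydomain": "provisional",
}


def status_for(review_level: str) -> str:
    """Return the [Status] subset that a [ReviewLevel] tier falls into."""
    return REVIEW_LEVEL_TO_STATUS[review_level.lower()]


for tier in ("curated", "autonamed", "namedByDomain"):
    print(f"{tier}[ReviewLevel] falls within {status_for(tier)}[Status]")
```

Only the curated tier maps one-to-one onto a [Status] value, which is why curated[ReviewLevel] and reviewed[Status] retrieve the same records, while provisional[Status] covers the other two tiers together.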

* In a query, the field name may be typed as the full name or abbreviation, and may be in upper, lower, or mixed case. If more than one abbreviation is shown, any one of them can be used. The field name must be surrounded by square brackets []. A space between the search term and the field specifier is optional. If desired, surround a phrase with quotes to force an adjacency search. For example, all of the sample queries below will work equally:
      "chloride channel"[NAME]
      "chloride channel" [NAME]
      "chloride channel"[name]
      "chloride channel" [name]
      "chloride channel" [NM]
      "chloride channel"[nm]

** The quotes surrounding the query terms in some of the sample searches force the terms to be searched as a phrase. If quotes are not used, the Entrez system may still recognize and handle the terms as a phrase, if they are present in a phrase dictionary used by the search engine. If the terms are not present in the phrase dictionary and are not surrounded by quotes, Entrez will insert a Boolean AND between the terms; in that case, they may or may not appear adjacent to each other in the retrieved records. The "Details" section in the right hand margin of a search results page will show you exactly how the Entrez system parsed your query. More search tips are provided in the PubMed help document and Entrez help document.
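
The field-specifier conventions in the first note above (full name or abbreviation, any letter case, optional space before the brackets) and the default [UID] handling of a bare integer can be sketched in Python as follows. This is a simplified illustration, not the Entrez parser: the function parse_query, the alias table, and the lower-case canonical field names are assumptions made for the example, and the sketch handles only a single term with at most one field specifier (no Boolean operators).

```python
import re
from typing import Optional, Tuple

# Field names and abbreviations as they appear in this document.
# The lower-case canonical names on the right are this sketch's own convention.
FIELD_ALIASES = {
    "name": "name", "nm": "name",
    "organism": "organism", "orgn": "organism",
    "pdbtitle": "pdbtitle", "pdbtl": "pdbtitle",
    "reviewlevel": "reviewlevel", "rev": "reviewlevel",
    "status": "status",
    "uid": "uid", "archid": "uid",
}

# A term, optional whitespace, then an optional [field] specifier.
_QUERY = re.compile(r"^\s*(?P<term>.*?)\s*(?:\[(?P<field>[^\]]+)\])?\s*$")


def parse_query(query: str) -> Tuple[str, Optional[str]]:
    """Split 'term[Field]' into (term, canonical field or None)."""
    match = _QUERY.match(query)
    term, field = match.group("term"), match.group("field")
    if field is not None:
        return term, FIELD_ALIASES[field.strip().lower()]
    if term.isdigit():
        # A bare integer is searched in the [UID] field by default.
        return term, "uid"
    return term, None  # no specifier: the term is searched in any field


variants = [
    '"chloride channel"[NAME]',
    '"chloride channel" [NAME]',
    '"chloride channel"[name]',
    '"chloride channel" [name]',
    '"chloride channel" [NM]',
    '"chloride channel"[nm]',
]
print({parse_query(v) for v in variants})
print(parse_query("10087058"), parse_query("10087058[UID]"))
```

All six sample queries reduce to the same (term, field) pair, and the bare integer parses to the same result as the explicit [UID] form.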

It is also possible to search for a word stem by using an asterisk (*) as a wild card; for example, arachidon* will retrieve records with terms such as arachidonate, arachidonic, arachidonoyl. The Entrez Help document provides additional information about truncating search terms in this way.
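
To make the effect of truncation concrete, the short Python sketch below uses the standard-library fnmatch module to mimic a trailing-asterisk match against a list of terms. It only approximates the behavior described above; the helper name matches_truncation and the sample term list (including the non-matching "arachis") are assumptions for illustration, and Entrez itself may apply additional rules.

```python
from fnmatch import fnmatchcase


def matches_truncation(pattern: str, term: str) -> bool:
    """Case-insensitive wildcard match, where * stands for any characters."""
    return fnmatchcase(term.lower(), pattern.lower())


terms = ["arachidonate", "arachidonic", "arachidonoyl", "arachis"]
hits = [t for t in terms if matches_truncation("arachidon*", t)]
print(hits)
```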

 
Output

  Output from a sequence search

If you have entered a query sequence into the CD-Search tool, the CD-Search results page will include a "Protein Classification" section if the query sequence maps to a conserved domain architecture in the SPARCLE database. (If a query sequence does not map to any conserved domain architecture in the SPARCLE database, then the CD-Search results will not include a Protein Classification section.)

A sample protein classification section is shown in the illustration at the right, which displays the CD-Search results for the query sequence DNA gyrase B (NP_387887), an antibiotic target. Click on the illustration to open the live CD-Search results.

Please note that the "Graphical Summary" on the live CD-Search results page might look different from the illustration at the right because conserved domain architecture records in the SPARCLE database continue to evolve with ongoing research.

For example, in January 2017, the protein sequence NP_387887 was initially annotated with architecture ID 10647733 (as shown in the illustration). That architecture is named "DNA gyrase subunit B" and includes four distinct conserved domains.

In March 2017, when a new build of CDD/SPARCLE was released, the conserved domain architecture annotation for NP_387887 was revised to architecture ID 11481348, which is a multi-domain that encompasses the four original conserved domains, and which can be seen in the current CD-Search results for NP_387887. That architecture has a more specific and precise name, "type IIA DNA topoisomerase subunit B," and reflects the full length protein model.

To see the four conserved domains that compose the multi-domain, simply change the CD-Search display option on the live CD-Search results for NP_387887 from "Concise Results" to "Full Results" (using the "View" menu near the upper right hand corner). The Full Results display will show the four conserved domains that compose the multi-domain.

As the available data and understanding of conserved domain architectures continue to evolve, the domain architectures that are annotated on proteins may evolve as well, as shown in this example. Comments about the data are welcome and can be sent to the NCBI Support Center/Help Desk, which is accessible as a link in the footer of NCBI web pages.

Step 2 in using SPARCLE: The CD-Search results page will display a Protein Classification section above the graphic summary of conserved domains, if a SPARCLE record exists for the domain architecture in the query protein sequence.  Click on this graphic to open the CD-Search results for NP_387887, DNA gyrase subunit B from Bacillus subtilis.


  Output from a keyword search

If you are searching for keywords in the SPARCLE database, the SPARCLE search results will display a list of the conserved domain architectures that contain the keyword(s) you specified.

Depending on how you entered the search, the search terms can either appear in any field of a conserved domain architecture record, or in a search field you specify, and they can either appear together as a phrase or separate from each other.

The Search Tips section of this document provides details about the scope of a keyword search, as well as tips on how to limit your query to specific search fields, use quotes to force a phrase search, and use an asterisk (*) for truncation. It also includes a comparison of some sample search strategies.

The illustration at the right shows the results of a sample search for the words chloride and channel in any field of an architecture record, and limited to the subset of architecture records that meet the criterion of curated[ReviewLevel].

Click on the illustration to open the corresponding live search results in the SPARCLE database. (Please note that the second panel of the illustration shows the search results as of March 2, 2017; the corresponding live web page will retrieve a larger number of records, as the SPARCLE database continues to grow.)

A comparison of some sample search strategies shows other ways of constructing the query, with links to the search results in each case.

Step 2 in searching the SPARCLE database by keyword: View the search results and click on the architecture ID of any domain architecture of interest to open its summary page.  Click on this graphic to open the results of a SPARCLE search for chloride channel AND curated[ReviewLevel].

 
Sample SPARCLE Record

  Classification of proteins by domain architecture

A SPARCLE database record is also referred to as a conserved domain architecture's "summary page."

An individual SPARCLE record shows a unique architecture that has been observed in at least one protein sequence.

The summary page displays the name and label of the architecture, along with evidence used to assign that name and label.

Additionally, because SPARCLE is used to classify proteins by their characteristic conserved domain architecture, the summary page includes a list of protein sequences with this architecture.

As noted in the section of the document about ongoing research, the conserved domain models, architectures, and the resulting protein sequence clusters, continue to evolve as new data become available and as research progresses.

The complete contents of a SPARCLE record include the following. Click on any item to read more about it:

Step 3 in using SPARCLE: The Protein Classification section of the CD-Search results links to the corresponding SPARCLE record, illustrated here. The SPARCLE record shows the name and functional label of the architecture, supporting evidence, and links to other proteins with the same architecture. Click on this graphic to open the SPARCLE record for the domain architecture (architecture ID 10647733) that was found in the protein query sequence, NP_387887, DNA gyrase subunit B from Bacillus subtilis.

  Description of the conserved domain architecture

  • Name of architecture:

    The name of a conserved domain architecture is either assigned manually by curation, or computationally by the autoname algorithm or the namedByDomain algorithm.
    The architecture name is displayed in two places on a SPARCLE record: near the top of the record (in bold font), and in the "Curated names and labels" section of the record.

    For example, the name of the conserved domain architecture shown in the illustrated example of a SPARCLE record is:
    "DNA gyrase subunit B." (You can also see this name in the live SPARCLE record for architecture ID 10647733.)

  • Label (description of function):

    The label provides a description of the conserved domain architecture's biological function.
    The label is displayed in two places on a SPARCLE record: near the top of the record (beneath the bold font that shows the architecture's name), and in the "Curated names and labels" section of the record.

    For example, the label of the conserved domain architecture shown in the illustrated example of a SPARCLE record is:
    "DNA gyrase is a type 2 topoisomerase that relaxes supercoils but can also introduce negative supercoils into DNA in an ATP-dependent manner." (You can also see this label in the live SPARCLE record for architecture ID 10647733.)

  • Architecture ID:

    An integer, assigned by NCBI, that uniquely identifies a conserved domain architecture.
    The architecture ID is also referred to as a unique identifier (UID) and can be searched directly in the SPARCLE database.

    Each architecture ID reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. (As noted in the Overview section of this document, it is also possible for a domain architecture to consist of a single conserved domain footprint. Such architectures also receive an architecture ID.)

    The conserved domain models that compose an architecture are shown in two places on the architecture's summary page: (a) in the graphical display at the top of the page (illustrated example), and (b) in the section labeled "Conserved domains in this architecture."

    In the graphical display of a conserved domain architecture, you can mouse over a conserved domain's cartoon in order to see its accession number, or click on the cartoon to see detailed information about that domain model, including a multiple sequence alignment of its member proteins.

    The accession number prefix for each conserved domain model in the architecture reflects the type of hit it has on the proteins that possess the architecture. Accession numbers that begin with the "cl" prefix indicate a superfamily hit (the "cl" prefix stands for superfamily cluster). All other types of accession numbers (i.e., accessions that begin with any prefix other than "cl") indicate specific hits.
    As an example of the unique composition of each architecture, search the SPARCLE database for:
         tumor[Name]
    That will retrieve conserved domain architectures which contain the term "tumor" in the architecture name, including a number of architectures named "P53 and SAM_tumor-p63 domain-containing protein."
    At first glance, some of the architectures appear similar to each other. On closer inspection, however, you will see that each architecture is composed of a unique series of conserved domain accession numbers. (To see the accession numbers, open the SPARCLE record for any domain architecture of interest, then either mouse over the cartoon for each domain in the architecture's graphic, or view the tabular list of "Conserved domains in this architecture.") As a result, each architecture receives its own architecture ID.

  • Version:

    Each SPARCLE record is assigned a version of 1 when it is first published (i.e., first released into the public SPARCLE database). If a SPARCLE record is later revised in any way, the version number is incremented when the revised record is published.
    Details: The information within a SPARCLE architecture record can change over time, as new data and publications become available about a given conserved domain architecture. Each time a change is made to a SPARCLE record, and the revised record is then published (i.e., released into the public database), it receives a new version number. Most changes are minor, such as the correction of a typing error or the addition of punctuation, such as dashes, to protein names. Other changes might be more important, such as the addition of new evidence in support of the domain architecture, or the correction of a protein name.

  • Date Published:

    The date on which the current version of a conserved domain architecture record was published in the SPARCLE curation system.
    The architecture subsequently becomes available in the public SPARCLE database, although that might happen a bit later.

    Search tip: To retrieve architectures by their publication date, use the search field called [CreateDate] on the SPARCLE Advanced Search page.

  • Review Level:

    The SPARCLE database has three tiers (review levels) of conserved domain architecture records:

    1. Curated architectures
    2. Autonamed architectures
    3. NamedByDomain architectures

    Additional details about each tier are provided in the data processing section of this document, including a description of the method by which the architectures in each tier are named.

    Search tip: When doing a keyword search of the SPARCLE database, you can limit your search results to architectures that belong to a given tier by using the search field called [ReviewLevel] on the SPARCLE Advanced Search page. Alternatively, you can simply use the "Filter your results" options in the upper right hand margin of a SPARCLE search results page (illustrated example) to select the desired tier.
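
The accession-prefix convention described under "Architecture ID" above ("cl" indicates a superfamily hit, any other prefix a specific hit) can be sketched in Python as follows. The sketch also shows why two architectures that contain the same domains in a different order, or with a different hit type, are distinct. The function names and the accession numbers used here are made up for illustration and do not refer to real CDD records.

```python
from typing import NamedTuple, Sequence, Tuple


class DomainHit(NamedTuple):
    accession: str
    hit_type: str


def hit_type_for(accession: str) -> str:
    """Infer the hit type from the accession prefix ('cl' = superfamily)."""
    return "superfamily" if accession.lower().startswith("cl") else "specific"


def architecture_key(accessions: Sequence[str]) -> Tuple[DomainHit, ...]:
    """Ordered tuple of (accession, hit type); order is significant."""
    return tuple(DomainHit(acc, hit_type_for(acc)) for acc in accessions)


# Hypothetical accessions, listed in their order along the protein.
arch_a = architecture_key(["cd00001", "cl00002"])
arch_b = architecture_key(["cl00002", "cd00001"])  # same domains, other order
print(arch_a)
print(arch_a == arch_b)
```

Because the key is an ordered tuple, changing either the order of the domains or the type of any hit yields a different key, which mirrors the way each distinct combination receives its own architecture ID.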

 
  Sequences with this architecture
Introductory note
Folder tabs: All | Protein with PubMed Reference | 3D Structure | Gene | RefSeq | Swiss-Prot
Filters: Tags | Source | Organism | Description | Gene Symbol
Note: Empty Set
  • Introductory note:

    The "Sequences with this architecture" table lists the sequences from the NCBI Protein database that have the conserved domain architecture currently being viewed.

    A conserved domain architecture is defined as the sequential order of conserved domains in a protein sequence. Additionally, each domain within the architecture can get any one of several hit types against a query protein sequence (e.g., specific hit, non-specific hit, superfamily, multi-domain), as determined by the CD-Search service.

    In order to be listed in the "Sequences with this architecture" table, a protein must have the exact order of conserved domains shown in the graphic at the top of a conserved domain architecture's summary page. Additionally, each conserved domain shown in the graphic must be the top-scoring hit for the corresponding region of the protein sequence, and must be of the same hit type as shown in the architecture's graphic. For example, when you mouse over a conserved domain cartoon in the architecture's graphic and see a conserved domain accession number that begins with the "cl" prefix, that indicates a superfamily hit. A conserved domain accession number that begins with any other prefix indicates a specific hit.

    Therefore, every protein listed in the "sequences with this architecture" table has the exact order of conserved domains, and the exact hit type to each domain, as shown in the graphic at the top of a conserved domain architecture's summary page.

    You can choose to view all proteins that have the architecture, or a pre-defined subset, using the folder tabs and filters described below:

  • Folder Tabs:
    All | Protein with PubMed Reference | 3D Structure | Gene | RefSeq | Swiss-Prot
    The folder tabs under "sequences with this architecture" provide quick access to some commonly used data subsets. (A complete list of available data subsets is provided under "Filters.")

    • "All" folder tab - All proteins in the Protein database that have the conserved domain architecture.

    • "Protein with PubMed Reference" folder tab - The subset of protein sequences that have this conserved domain architecture, and that include a reference to a published article in PubMed.

    • "3D Structure" folder tab - The subset of protein sequences that have this conserved domain architecture, and that have an experimentally resolved 3-dimensional structure.

    • "Gene" folder tab - A subset of protein sequences that have this conserved domain architecture, and that have a link to a Gene record. This folder tab shows only one representative protein for each gene to which the architecture is linked, in order to provide a non-redundant view of the genes associated with the architecture.
      Note: The "Gene" folder tab lists the same subset of protein sequences that can be retrieved using the option for "Filters:Tags:Gene Representative." However, the displays are slightly different. The "Gene" folder tab provides a gene-centric view that displays the gene ID, gene symbol, and gene description associated with each protein. In contrast, the "Filters:Tags:Gene Representative" option displays the protein ID and description of each sequence that is linked to a gene. Both views include the source organism, protein length (in amino acids), and an "Actions" column that provides access to the protein sequences in FASTA format and links to other tools/resources.
    • "RefSeq" folder tab - The subset of protein sequences that have this conserved domain architecture, and that are from the RefSeq database.

    • "Swiss-Prot" folder tab - The subset of protein sequences that have this conserved domain architecture, and that are from the UniProtKB/Swiss-Prot database.


  • Filters:
    Tags: Annotated | BioAssay | Gene | Gene Representative | NR Representative | PubMed | Reference
    Source | Organism | Description | Gene Symbol
    The "Filters" under "sequences with this architecture" enable you to view a number of pre-defined data subsets. Click on the down arrow (V) beside "Filters" to see the complete list and to activate the check box(es) of the desired filter(s). (Some of the commonly used filters are shown as folder tabs near the top of the "sequences with this architecture" section.)

    After a filter is selected, it will remain active unless/until you deactivate its checkbox or dismiss the filter(s) by clicking the red X in the "Filters" tab.

    The number and types of filters that appear in a SPARCLE record depend on the set of protein sequences with that architecture, and on the information/data links that are available for those proteins. An example architecture that has a wide variety of filters is type IIA DNA topoisomerase subunit B (architecture ID 11481348). Other architectures might have only a small number of filters.
    The various filters you might see on a page are described below:

    • Tags - This filter enables you to view the subset of proteins that have been tagged with various attributes. The available tags include: Annotated, BioAssay, Gene, Gene Representative, NR Representative, PubMed, and Reference.

      • The "Annotated" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that either link to PubMed, BioAssay, Structure, or OMIM, or are considered to be "landmark" sequences by Smart BLAST.

        • The SmartBLAST help document includes a section on the "Landmark Database," which describes how the landmark sequences are selected. Excerpt:

          "The landmark database includes proteomes from 27 genomes spanning a wide taxonomic range. This search set is produced using the best available genomic assemblies for each organism with the following procedure. First, the most recent representative assembly from each organism is identified. Second, all proteins annotated on each assembly are downloaded and compiled into the landmark BLAST database. The result is a taxonomically diverse non-redundant set of proteins supported by genomic assemblies."

      • The "BioAssay" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that are the targets of BioAssay experiments.

        • These are identified by an automated process that looks at the proteins that have the architecture in question, and finds the subset of proteins whose sequence identifiers are listed as the targets of BioAssay experiments.
        • The section of this document on data processing: links from architectures to other data types describes how links are identified between conserved domain architectures in the SPARCLE database and other data types.

      • The "Gene" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that are linked to a record in the Gene database.

        • This option lists all of the protein sequences that have the architecture in question, and that have links to Gene records.
        • If several protein sequence records have links to the same gene record, all of those protein sequence records will be listed in this view.
        • For example, if 10 protein sequence records link to 2 genes, the "gene" tag will display all 10 protein sequences.
        • The section of this document on data processing: links from architectures to other data types describes how links are identified between conserved domain architectures in the SPARCLE database and other data types.

      • The "Gene representative" tag shows only one representative protein sequence for each gene that is linked to the architecture, in order to provide a non-redundant view of the genes associated with that architecture.

        • For example, if 10 protein sequence records link to 2 genes, the "gene representative" tag will display only 2 proteins -- one representative protein sequence for each gene.
        • Note: The "Gene" folder tab lists the same subset of protein sequences that is retrieved by the "Gene Representative" tag. However, the displays are slightly different. The "Gene" folder tab provides a gene-centric view that displays the gene ID, gene symbol, and gene description associated with each protein. In contrast, the "Gene Representative" filter tag displays the protein ID and description of each sequence that is linked to a gene. Both views include the source organism, protein length (in amino acids), and an "Actions" column that provides access to the protein sequences in FASTA format and links to other tools/resources.

      • An "NR representative" is a protein sequence that has been selected as the representative of a group of identical sequences, for the purpose of creating a protein non-redundant (NR) database. The "NR Representative" tag therefore retrieves a non-redundant list of protein sequences that have this conserved domain architecture. Technical details:

        • To create a non-redundant protein database, the NCBI data processing pipeline organizes protein sequences into protein identity groups (PIGs). A protein identity group contains protein sequences that are identical in length and composition, regardless of taxonomic source (i.e., regardless of TaxID). Each group is given a stable identification number (PIG ID).
        • One protein sequence from each PIG is selected as the representative. If the PIG includes a RefSeq record, that is selected as the representative. If no RefSeq record is present, then a representative is selected from one of the following databases, in the order listed: Swiss-Prot, PIR, PDB, GenPept (protein translations of nucleotide sequence records in GenBank that have been annotated with a coding sequence, or CDS, feature), or PRF.

      • The "PubMed" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that have a link to published literature represented in the PubMed database.

      • The "Reference" tag retrieves the subset of protein sequences that have this conserved domain architecture, and that are considered to be "landmark" sequences by Smart BLAST.

        • The SmartBLAST help document includes a section on the "Landmark Database," which describes how the landmark sequences are selected. Excerpt:

          "The landmark database includes proteomes from 27 genomes spanning a wide taxonomic range. This search set is produced using the best available genomic assemblies for each organism with the following procedure. First, the most recent representative assembly from each organism is identified. Second, all proteins annotated on each assembly are downloaded and compiled into the landmark BLAST database. The result is a taxonomically diverse non-redundant set of proteins supported by genomic assemblies."

    • Source - The source database from which a protein sequence record came


    • Organism - Enter the scientific name of any organism that appears in the list of "sequences with this architecture" to display only the protein sequences that have this conserved domain architecture, and that come from the organism you have specified.

      • Note: You can enter a taxonomic node other than Genus species. However, the system will currently retrieve only the proteins that have been classified down to the specified level of the taxonomic tree, but no deeper.
      • For example, open the "Sequences with this architecture" table for architecture ID 11481348: type IIA DNA topoisomerase subunit B.
      • Enter "Pseudobutyrivibrio ruminis" (without the quotes) in the "Filters:Organism" text box. The system will display only the proteins that have been classified with that exact genus and species.
      • Now clear/dismiss the filter you just entered (or simply reload the SPARCLE record for architecture ID 11481348) in order to once again display all of the proteins, before doing the next step below.
      • Enter the taxonomic node "Clostridiales" (without quotes) in the "Organism" filter. The system will display only the proteins that have been classified down to that node of the taxonomic tree, and no deeper.
      • Note: The Organism filter will be enhanced in the future to allow retrieval by any node in an organism's lineage.

    • Description - This filter retrieves the subset of proteins that have this conserved domain architecture, and that have a description (definition line) containing the keyword(s) that you type in the textbox.

      • For example, open the "Sequences with this architecture" table for architecture ID 12201410: hybrid sensor histidine kinase/response regulator.
      • The "Sequences with this architecture" table includes some proteins with the description of "Signal transduction histidine kinase."
      • To view only those proteins, you can enter a single keyword such as "signal" or a phrase such as "signal transduction" (with or without quotes) in the text box beside the "Description" filter.
      • Note: if you enter two or more terms, they must be adjacent to each other in the description of a protein in order for the protein to be retrieved. That is, if you enter two or more words, the system will search for them as a phrase, whether or not you surround them with quotes.
      • For example, the protein sequences with the description "Signal transduction histidine kinase" will not be retrieved if you enter the words "signal histidine" (with or without quotes) in the "description" filter.

    • Gene Symbol - This filter retrieves the subset of proteins that have this conserved domain architecture, and that are linked to genes that have the symbol you specified in the textbox.

      • If a Gene record lists an official symbol as well as aliases (alternative gene symbols), and that gene is associated with the architecture, you can type any one of those symbols into the textbox to retrieve the subset of protein sequences linked to the gene.
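
The selection of an "NR representative," described under the Tags filter above, can be sketched in Python as follows. This is a minimal illustration, not the NCBI pipeline: it assumes that the databases named in the text are consulted in the order in which they are listed, it groups proteins by identical sequence string as a stand-in for a protein identity group (PIG), and the record identifiers and sequences are made up.

```python
from collections import defaultdict

# Assumed order of preference, following the order given in the text.
SOURCE_PRIORITY = ["RefSeq", "Swiss-Prot", "PIR", "PDB", "GenPept", "PRF"]


def nr_representatives(records):
    """Pick one representative protein ID per group of identical sequences.

    records: iterable of (protein_id, source_db, sequence) tuples.
    Returns a dict mapping each distinct sequence to its representative ID.
    """
    groups = defaultdict(list)
    for protein_id, source_db, sequence in records:
        groups[sequence].append((protein_id, source_db))
    rank = {db: i for i, db in enumerate(SOURCE_PRIORITY)}
    # Within each group, the member from the highest-priority source wins.
    return {
        sequence: min(members, key=lambda m: rank[m[1]])[0]
        for sequence, members in groups.items()
    }


# Hypothetical records: two identity groups with two members each.
records = [
    ("P1", "GenPept", "MKT"),
    ("P2", "RefSeq", "MKT"),
    ("P3", "PDB", "MAA"),
    ("P4", "Swiss-Prot", "MAA"),
]
print(nr_representatives(records))
```

The result lists one protein per identity group, which is the non-redundant view that the "NR Representative" tag is described as providing.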

Note: Empty set (no links to protein sequences): Occasionally, the "sequences with this architecture" table might display a message that says, "This architecture currently does not link to any protein sequence records." This might be true for either of the following reasons:

  • The original sequence(s) in which the architecture was found are no longer in the public database (e.g., they might have been found to be erroneous and were therefore withdrawn).
    -- or --
  • The scoring used by the CDD/CD-Search systems might have been refined, and the sequences that were originally linked to this architecture are now linked to a different architecture that achieves a higher score. (An example of this is provided in the section of this document about ongoing research.)

In either case, however, the SPARCLE record for the domain architecture is retained in the database, and its architecture ID is also retained (and not re-used for any other architecture), because it is possible that another sequence in the future will map to the architecture.

 

  Curated Names and Labels

The Curated Names and Labels section of a conserved domain architecture's summary page lists the architecture's:
Taxonomic Scope | Name | Label | Supporting evidence: Protein sequences, Conserved domains, Publications, Other

  • Taxonomic Scope

    • The taxonomic scope column indicates the taxonomic node to which the architecture name and label apply.

    • By default, conserved domain architectures are associated with the root of the taxonomic tree (i.e., all organisms). When an architecture is associated with the root, it means the name/label of the architecture is not specific to any node of the full taxonomic tree. This is true of most architectures in the SPARCLE database.

    • If the taxonomic classification of an architecture is not root, but is instead a more specific taxonomic node, that means the curator is asserting that the name/label chosen for the architecture is applicable within the specified node, but not necessarily within other taxonomic branches.

      • For example, a search of the SPARCLE database for:

        guanylate cyclase AND bacteria[Organism]

        will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within bacteria but not within other taxonomic nodes.

        guanylate cyclase AND eukaryota[Organism]

        will retrieve the architectures that contain the terms "guanylate" and "cyclase" in any field of the SPARCLE architecture record, and whose names and labels are applicable within eukaryota but not within other taxonomic nodes.

    • The section of this document about Search fields: [Organism] provides additional information about the taxonomic classification of conserved domain architectures and search tips on how to restrict your search to a specific taxonomic node, if desired.


  • Name

    • The name of the conserved domain architecture.
      The name is displayed in two places on a SPARCLE record: near the top of the record (in bold font), and in the "Curated names and labels" section of the record.

    • As an example, the name of the conserved domain architecture shown in the illustration of a sample SPARCLE record is "DNA gyrase subunit B." (You can also see this name in the live SPARCLE record for architecture ID 10647733.)

    • The name of a conserved domain architecture is either assigned manually by curation, or computationally by the autoname algorithm or the namedByDomain algorithm.


  • Label

    • The label provides a description of the conserved domain architecture's biological function.
      The label is displayed in two places on a SPARCLE record: near the top of the record (beneath the bold font that shows the architecture's name), and in the "Curated names and labels" section of the record.

    • As an example, the label of the conserved domain architecture shown in the illustration of a sample SPARCLE record is "DNA gyrase is a type 2 topoisomerase that relaxes supercoils but can also introduce negative supercoils into DNA in an ATP-dependent manner." (You can also see this label in the live SPARCLE record for architecture ID 10647733.)

  • Supporting Evidence:

    The "Curated Names and Labels: Supporting Evidence" section of a conserved domain architecture's summary page lists the evidence that was used by NCBI curators, or by the "autonamed" or "namedbydomain" algorithms, to assign a name to the architecture. Some types of supporting evidence include:

    • Protein sequences

      As described in the data processing section of this document, the names of high quality protein sequences are used by NCBI curators and by the "autonamed" algorithm in assigning a name to the conserved domain architecture (if those proteins are representative of the overall group of sequences that have the architecture in question). The Supporting Evidence: Protein Sequences section of a conserved domain architecture's summary page lists the protein sequence records that were used to name the architecture.

    • Conserved domains

      As described in the data processing section of this document, the names of conserved domain models are used by NCBI curators and by the "namedbydomain" algorithm in assigning a name to the conserved domain architecture.

      The "Supporting Evidence: Conserved Domains" section of a SPARCLE record might list one or more of the domains that are present in the architecture (i.e., one or more of the domains that are listed in the "Conserved domains in this architecture" section of the SPARCLE record). It might also list domain models that are not direct components of the architecture, but that belong to the same superfamily clusters as the components and are useful in helping to name the architecture.

      As an example, see the conserved domain architecture for the PAS and AAA domain-containing protein (architecture ID 11530124), which was namedByDomain. The "Conserved domains in this architecture" section of the SPARCLE record lists the top-scoring domain models (as determined by CD-Search) on the proteins that have the architecture:
      pfam00126: HTH_1 - Bacterial regulatory helix-turn-helix protein, lysR family
      pfam00158: Sigma54_activat - Sigma-54 interaction domain
      smart00091: PAS - PAS domain
      smart00116: CBS - Domain in cystathionine beta-synthase and other proteins
      The "Supporting Evidence: Conserved Domains" section of that SPARCLE record lists one of the domain models above, as well as three other domain models (with conserved domain accession numbers that begin with a "cd" prefix):
      cd02205: CBS_pair
      cd00130: PAS
      cd00009: AAA
      pfam00126: HTH_1
      The domain models with "cd" accessions are not direct components of the architecture (i.e., they are not the top-scoring hits), but they belong to the same clusters as the component domains and are useful in helping to name the architecture because they are curated domains whose names were carefully selected based on published research about protein functions.

    • Publications

      Published articles that describe the function of proteins that contain the conserved domains in the architecture, and that were used in naming the architecture.

    • Other

      Other types of evidence, as available, might also influence the name and functional label that is assigned to a conserved domain architecture. An example of additional evidence could be the biological pathway (biosystem) of which the protein is a part.

 
  Conserved domains in this architecture

  • Each conserved domain architecture reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. Each architecture is given a unique, stable architecture ID. (As noted in the Overview section of this document, it is also possible for a domain architecture to consist of a single conserved domain footprint.)

  • The "Conserved domains in this architecture" section of a SPARCLE record provides a tabular list of the conserved domain models that compose the architecture.

  • The type of accession number in the architecture reflects the type of hit it has on the proteins that possess the architecture.
    Accession numbers that begin with the "cl" prefix indicate a superfamily hit (the "cl" prefix stands for superfamily cluster).
    All other types of accession numbers indicate specific hits.

  • The order in which the domains are listed in the table does not necessarily reflect their N-terminal to C-terminal order on the proteins that contain the architecture. The graphic near the top of a conserved domain architecture's summary page, however, does show the N-terminal to C-terminal order of the domains (illustrated example).

  • Note: one or more of the conserved domains that compose the architecture might also be listed as supporting evidence that was used in assigning a name to the architecture. However, the supporting evidence might also (or might instead) list related conserved domains, as explained in the section of this document that describes that part of a SPARCLE record.

 
  Functional sites in this architecture

  • Functional sites are also referred to as conserved features/sites, and typically describe sites such as catalytic residues, binding sites, or motifs commonly referred to in the literature.

  • They are generally identified in NCBI-curated domains.

  • Functional sites are listed on a SPARCLE record only if the proteins possessing that conserved domain architecture have a specific hit to the NCBI-curated domain model in which the conserved features/sites have been annotated.

 

 
Data Processing

data processing overview | three tiers of data: curated architectures, autonamed architectures, named by domain architectures | two types of architectures: superfamily architectures, subfamily architectures | single domain architectures | each architecture receives a unique and stable architecture ID | ongoing research | links from architectures to other data types

Data processing overview

As the number of publicly available protein sequences continues to grow exponentially, efforts are underway to organize that data in a biologically meaningful way. This includes identifying relationships among proteins with similar composition and function.

It is possible to cluster proteins by sequence similarity; however, comparing all proteins against each other is computationally costly, so an all-versus-all comparison is not an efficient strategy.

Conserved domain annotations on proteins, on the other hand, provide a simple alternative strategy for clustering proteins. NCBI already computes conserved domain annotation on protein sequences as part of the standard data processing pipeline, using the Conserved Domain Database (CDD) and CD-Search tool.

Building on that effort, the Conserved Domain Architecture Retrieval Tool (CDART) identifies all proteins that have the same domain annotation (the same order of conserved domains, and the same type of hit for each domain, i.e., specific or non-specific) and clusters them together in a group.

As noted in the "Compare CDD, CDART, and SPARCLE" section of this document, SPARCLE is built upon CDART. Specifically, SPARCLE contains the subset of domain architectures that include at least one conserved domain model that is a specific hit to at least one protein sequence in the non-redundant ("nr") protein database.

SPARCLE then assigns a name and functional label (a description of the function of the protein family that has the architecture) to each conserved domain architecture. As noted below, names are assigned to the architectures either by a manual curation process, or by automated processes that use algorithms to autoname an architecture or to name an architecture based on the domains it contains. Curated domain architecture records are supported with, and linked to, evidence from high quality sequence data and literature.

In this way, SPARCLE is used to classify proteins, based on the functional characterization and labeling of protein sequences that have been grouped by their characteristic conserved domain architecture.
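
The grouping idea described in this overview can be sketched in Python as follows. This is only an illustration under stated assumptions, not NCBI's pipeline: the protein names are hypothetical, the domain accessions are borrowed from examples elsewhere in this document, and the hit types assigned to them here are chosen to match those examples.

```python
from collections import defaultdict

# Hypothetical annotations: each protein maps to its ordered list of
# (domain accession, hit type) pairs, from N-terminus to C-terminus.
annotations = {
    "protA": [("COG0785", "specific"), ("cd03012", "specific")],
    "protB": [("COG0785", "specific"), ("cd03012", "specific")],
    "protC": [("cl21514", "superfamily"), ("cl00388", "superfamily")],
}


def group_by_architecture(annotations):
    """Cluster proteins that share the same domain order and hit types."""
    groups = defaultdict(list)
    for protein, domains in annotations.items():
        # The ordered tuple of (accession, hit type) acts as the architecture key.
        groups[tuple(domains)].append(protein)
    return dict(groups)


def has_specific_hit(architecture):
    """True if at least one domain in the architecture is a specific hit."""
    return any(hit_type == "specific" for _, hit_type in architecture)


groups = group_by_architecture(annotations)
# Keep only architectures with at least one specific hit, mirroring the
# subset of architectures that this document says SPARCLE contains.
sparcle_subset = {arch: prots for arch, prots in groups.items() if has_specific_hit(arch)}
print(sparcle_subset)
```

No all-versus-all sequence comparison is needed here: each protein is visited once and placed in a group by its precomputed domain annotation.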

Three tiers ("review levels") of conserved domain architectures are present in the SPARCLE database:

  1. curated architectures
  2. autonamed architectures
  3. named by domain architectures
To compare these types of records at a glance, search the SPARCLE database with each of the following queries, which retrieve conserved domain architectures that include the term "kinase" in their [Name] and whose names were assigned by the method indicated in [ReviewLevel]:
  • kinase[Name] AND curated[ReviewLevel]
  • kinase[Name] AND autonamed[ReviewLevel]
  • kinase[Name] AND namedbydomain[ReviewLevel]

Below are details about the data processing methods for each subset, including descriptions of the methods used to name the architectures, followed by a note about ongoing research.

Curated architectures

The manual curation process for conserved domain architectures is carried out by the Conserved Domain Database (CDD) curators and includes the steps noted below. Various types of evidence that were used by the curators in naming an architecture and describing its biological function are listed in the supporting evidence section of a SPARCLE record.

Describe protein function:
  • The domain architecture curation process begins by looking at a cluster of proteins that have the same architecture and asking the question, can we describe them functionally?

  • The answer depends on how diverse the sequences are and whether scientific experiments have revealed anything about the functions of the proteins in the cluster.

  • To arrive at an answer, the curators look at the existing names of protein sequences in the set and at the available evidence for the function of that set, such as publications linked to individual sequences, associated 3D structures, and other types of evidence the curators might find in additional NCBI databases such as BioSystems, BioAssay, etc.

  • Much of the curation process, however, is based on the availability of published literature, 3D structures, and the presence of high quality sequences (e.g., Swiss-Prot, RefSeq) in the cluster that have been functionally characterized.

Assign a name to the architecture: The curators then make a judgement call in assigning a name to the architecture, with the goal of selecting a name that is representative of the whole cluster of proteins. Below are some examples of situations the curators encounter in the process of naming architectures, and how the names are chosen in each case:
  • The set of protein sequences with the architecture have a SPECIFIC HIT to a conserved domain model that results in a HIGH CONFIDENCE of the architecture's biological function:

    • In the process of examining the group of sequences that have a given architecture, the curators might find that the proteins have a specific hit to one or more conserved domain models within the architecture. This represents a high confidence level for the inferred function of the protein query sequence, and therefore can influence the name of the architecture.

    • For example, the query protein sequence human guanylyl cyclase (AAA74451) has a "retinal guanylyl cyclase 2" domain architecture. The domain architecture includes a specific hit to the NCBI-curated domain cd06371, which has a short name and title of: "PBP1_sensory_GC_DEF_like: Ligand-binding domain of membrane guanylyl cyclases (GC-D, GC-E, and GC-F) that are specifically expressed in sensory tissues." The specific hit to the NCBI-curated domain therefore influenced the name that was given to the architecture: "retinal guanylyl cyclase 2."

  • A subset of high quality protein sequences that have the SAME NAME and ARE REPRESENTATIVE of the whole protein sequence cluster:

    • In the process of examining the group of sequences that have a given architecture, the curators might find a subset of high quality sequences that all have the same name, and that are representative of the whole cluster of protein sequences of which they are a part. In that case, the conserved domain architecture is given the same name as the subset of high quality sequences.

    • For example, let's say a cluster of ~200 protein sequences includes a subset of five sequences from a curated database such as Swiss-Prot or RefSeq. If those five sequences have the same name, and if they are reliable and representative of the whole cluster, then their name is given to the domain architecture as well.

  • A subset of high quality protein sequences that have the SAME NAME but ARE NOT REPRESENTATIVE of the whole protein sequence cluster:

    • In the process of examining the group of sequences that have a given architecture, the curators might find a subset of high quality sequences that all have the same name, but that do not represent the overall cluster (e.g., the subset represents only a specific taxon or other subgroup within the larger cluster). In that case, we cannot conclude that all of the sequences in the cluster will share the same function.

    • This is a common situation, and in such a case, the curators try to find a name for the domain architecture that is more generic and that is derived from the types of conserved domain signatures that are present in the protein family.

    • For example, let's say a big protein family has a hit to an NAD dependent dehydrogenase, and one of the high quality sequences has been named as an alpha ketoglutarate dehydrogenase. Extrapolating that very specific name to all of the sequences which have the same architecture might be a stretch, because we don't have evidence to support such an extrapolation (e.g., the substrates for the other proteins in the family might not yet be known). So the curators make a judgement call to apply a more general name to the domain architecture, and they might simply call the family a dehydrogenase, rather than an alpha ketoglutarate dehydrogenase.

  • A subset of high quality protein sequences that have DIFFERENT NAMES

    • In the process of examining the group of sequences that have a given architecture, the curators might find a subset of high quality sequences that have different names, indicating functional diversity in that family. In such a case, the CDD curators look for commonalities among the high quality protein names and identify generalities that can be applied to the architecture overall.

    • For example, if the names of Swiss-Prot records indicate that the proteins are a valine transporter, isoleucine transporter, and threonine transporter, then the curators would apply the general name of "amino acid transporter" to the domain architecture.

    • The curators also take naming rules and standards into consideration. They attempt to find compromises among the naming standards (e.g., upper case, lower case, dash, no dash, etc.) that are used by UniProt, Swiss-Prot, and RefSeq, and follow those compromises in the names they assign to SPARCLE domain architectures. There are sometimes differences between American and European naming conventions, and the aim is to minimize or erase those differences over time as naming conventions and standards continue to evolve.

  • Seemingly redundant SPARCLE architectures that all have the SAME NAME AND FUNCTIONAL LABEL, but are in fact FUNCTIONALLY DIFFERENT

    • Sometimes there are seemingly redundant SPARCLE architectures that all have the same name and functional label, but they are in fact functionally different.

    • For example, many architectures might have the name "sensor histidine kinase" but each of those might be functionally different from the others. The architectures contain the same basic domain signature (i.e., a catalytic domain, an accessory domain that is phosphorylated by the kinase, and a PAS domain), but we don't yet know what signaling pathways they are involved in, and some of the architectures might contain an additional domain whose specific function is unknown. In such cases, if the curators do not have the experimental evidence needed to give the domain architectures a more specific name, they apply the general name to the architectures. The architecture names and descriptions are later refined as additional data and experimental evidence become available.
Search tip:
To retrieve all curated domain architectures, search the SPARCLE database for:
curated[ReviewLevel]
or add that search criterion to a keyword search to limit your retrieval to the desired type of records. For example, a search for:
kinase[Name] AND curated[ReviewLevel]
will retrieve conserved domain architectures that include the term "kinase" in their name, and whose names were assigned manually by NCBI curators.
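
Queries of this form can also be composed programmatically. The following Python sketch builds the query strings shown in the search tips and wraps one in an NCBI E-utilities ESearch URL. The helper names are hypothetical, and the database name "sparcle" is an assumption that this document does not confirm; the sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def sparcle_query(term, field=None, review_level=None, organism=None):
    """Compose a query such as 'kinase[Name] AND curated[ReviewLevel]'."""
    parts = [f"{term}[{field}]" if field else term]
    if review_level:
        parts.append(f"{review_level}[ReviewLevel]")
    if organism:
        parts.append(f"{organism}[Organism]")
    return " AND ".join(parts)


def esearch_url(query, db="sparcle"):
    """Build an ESearch URL; db='sparcle' is an assumption, not from this document."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': query})}"


for level in ("curated", "autonamed", "namedbydomain"):
    print(sparcle_query("kinase", field="Name", review_level=level))

print(esearch_url(sparcle_query("kinase", field="Name", review_level="curated")))
```

The same helper covers the taxonomic examples given earlier, for instance `sparcle_query("guanylate cyclase", organism="bacteria")`.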


Autonamed architectures


Autonamed conserved domain architectures are named by an algorithm that automatically generates an architecture name based on the frequency of terms that are present in the definition lines of the proteins that have the architecture. Proteins that were used in naming the architecture are listed in the supporting evidence section of a SPARCLE record.

The automatically generated name will begin with the phrase "similar to..." followed by a cleaned up definition line (e.g., removal of taxonomy information, etc.) from the set of high quality proteins that were used to generate the name. The algorithm includes:
  • Protein name analysis:

    • The definition lines of all protein sequences that have a given architecture are analyzed.
    • First, the protein names are tokenized into word terms.
    • Next, the most popular terms will be selected as representatives to form a voting committee.
    • The voting committee will vote for the most representative name in this architecture.

  • Consistency score:

    • A post-processing step subsequently calculates a consistency score to determine the extent to which the name of the representative protein shares terms with the names of the other proteins.
    • The consistency score will be used to decide if this computed name can be selected with enough confidence.
Please note that only a small fraction of architectures can be autonamed in this fashion due to the high confidence level required.
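
The steps above can be illustrated with a rough Python sketch. It is not the actual autoname algorithm, whose details are not given in this document. The committee size, the voting rule, the consistency formula, the 0.8 threshold, and the clean-up rule (dropping a trailing bracketed organism name) are all assumptions made for illustration.

```python
import re
from collections import Counter


def clean_defline(line):
    # Assumed clean-up: drop a trailing "[Organism name]" tag.
    return re.sub(r"\s*\[[^\]]*\]\s*$", "", line).strip()


def tokenize(name):
    # Split a protein name into lower-case word terms.
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def autoname(deflines, committee_size=3, min_consistency=0.8):
    names = [clean_defline(line) for line in deflines]
    token_sets = [tokenize(name) for name in names]
    if not names or not any(token_sets):
        return None, 0.0
    # The most popular terms (by number of names containing them) form the committee.
    term_counts = Counter(term for tokens in token_sets for term in tokens)
    committee = [term for term, _ in term_counts.most_common(committee_size)]
    # Each committee term votes for every name that contains it.
    votes = [sum(term in tokens for term in committee) for tokens in token_sets]
    best = votes.index(max(votes))
    winner = token_sets[best]
    others = token_sets[:best] + token_sets[best + 1:]
    # Assumed consistency score: mean fraction of the winner's terms
    # that also appear in each of the other names.
    if others:
        consistency = sum(len(winner & t) / len(winner) for t in others) / len(others)
    else:
        consistency = 1.0
    name = f"similar to {names[best]}" if consistency >= min_consistency else None
    return name, consistency


deflines = [
    "DNA gyrase subunit B [Bacillus subtilis]",
    "DNA gyrase subunit B [Escherichia coli]",
    "DNA gyrase subunit B",
    "gyrase B subunit",
]
print(autoname(deflines))
```

In this sketch, a set of names with little shared vocabulary yields no name at all, which reflects why only a small fraction of architectures can be autonamed.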

Additionally, architecture names are recalculated with each release of the Conserved Domain Database (CDD). This is because new sequence data are continually added to the Protein database. As a result, the number of protein sequences that have a given architecture might increase, which in turn increases the set of protein names from which an architecture name is computed.

Search tip:
To retrieve all autonamed domain architectures, search the SPARCLE database for:
autonamed[ReviewLevel]
or add that search criterion to a keyword search to limit your retrieval to the desired type of records. For example, a search for:
kinase[Name] AND autonamed[ReviewLevel]
will retrieve conserved domain architectures that include the term "kinase" in their name, and whose names were assigned computationally by the Autonamed algorithm.

NamedByDomain architectures


NamedByDomain conserved domain architectures are named by an algorithm that automatically generates an architecture name based on the highest scoring conserved domains that are present in the architecture. Domains that were used in naming the architecture are listed in the supporting evidence section of a SPARCLE record.
  • Architectures that aren't curated, and couldn't be autonamed, are assigned a name based on up to two conserved domain models that are present in the architecture.

    • While the architecture's name is based on up to two conserved domains, the functional label can be based on up to four of the conserved domains in the architecture.

  • Sort conserved domains by E-value

    • If an architecture contains more than two conserved domains, then an algorithm is used to select the two highest-scoring conserved domain models, with a priority given to NCBI-curated domain models.

    • All of the conserved domain models that appear in the concise view of the architecture are scored based on their E-value.

      Technical note: The E-value of a given conserved domain can vary among the proteins that have the architecture in question, because the composition of the protein sequences may vary outside of the conserved domain architecture. To address this issue, the "NamedByDomain" algorithm uses the E-values of the conserved domain models against protein sequences in the oldest protein identity group ("PIG," described below) that has the architecture in question. That is, the algorithm uses the E-values of the domain models on the proteins that have the lowest/oldest PIG ID number.

      A protein identity group (PIG) is a cluster of protein sequences that are identical to each other in composition and length, regardless of their taxonomic source. The PIGs are automatically generated by the data processing pipeline at NCBI, which identifies all proteins that are identical to each other, regardless of TaxID, places them together in a protein identity group, and gives each PIG a stable identification number (PIG ID).

  • Prioritize specific hits and NCBI-curated domain models
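
The domain selection described above can be sketched in Python as follows. This is an illustration only. The exact way the NamedByDomain algorithm combines E-values with the priority for specific hits and NCBI-curated models is not spelled out in this document, so the ranking rule below is an assumption, and the PIG IDs and E-values in the example are invented.

```python
def pick_naming_domains(hits_by_pig, max_domains=2):
    """Choose up to two domain models for naming.

    hits_by_pig maps a PIG ID to {domain accession: E-value}.
    """
    # Use the E-values observed on the oldest (lowest-numbered) PIG.
    oldest_pig = min(hits_by_pig)
    evalues = hits_by_pig[oldest_pig]

    def rank(accession):
        is_superfamily = accession.startswith("cl")  # "cl" marks a superfamily hit
        is_curated = accession.startswith("cd")      # "cd" marks an NCBI-curated model
        # Assumed order: specific hits first, then curated models,
        # then the lowest (best) E-value.
        return (is_superfamily, not is_curated, evalues[accession])

    return sorted(evalues, key=rank)[:max_domains]


# Invented example data for two protein identity groups.
hits_by_pig = {
    1002: {"pfam00158": 1e-60, "smart00091": 1e-12, "cd00009": 1e-20, "cl00388": 1e-80},
    57: {"pfam00158": 1e-50, "smart00091": 1e-10, "cd00009": 1e-15, "cl00388": 1e-70},
}
print(pick_naming_domains(hits_by_pig))
```

Taking the E-values from a single, fixed protein identity group keeps the selection stable even though the same domain can score differently on different proteins.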

Search tip:
To retrieve all domain architectures that have been named by domain, search the SPARCLE database for:
namedbydomain[ReviewLevel]
or add that search criterion to a keyword search to limit your retrieval to the desired type of records. For example, a search for:
kinase[Name] AND namedbydomain[ReviewLevel]
will retrieve conserved domain architectures that include the term "kinase" in their name, and whose names were assigned computationally by the NamedByDomain algorithm.


Two types of conserved domain architectures:

  • Superfamily architectures

    Superfamily architectures consist solely of conserved domain superfamilies. This implies a general functional category for the proteins that have that architecture.

    That is, each conserved domain footprint in the architecture has an RPS-BLAST superfamily hit to every protein that has been classified with the architecture. This is designated by the "cl" prefix in the accession number of each conserved domain in the architecture. The "cl" stands for superfamily cluster. (To see the accession numbers, mouse over the conserved domain footprints in the architecture's graphical display.)

    One example of a superfamily architecture is:

         N-terminus------[cl21514]-------[cl00388]------C-terminus

    Proteins with this architecture have an RPS-BLAST hit to accession number cl21514 (TauE Superfamily: Sulfite exporter TauE/SafE), followed by a hit to cl00388 (Thioredoxin_like Superfamily: Protein Disulfide Oxidoreductases and Other Proteins with a Thioredoxin fold).

    Specifically, the N-terminal region of each protein with this architecture achieved a statistically significant hit to a conserved domain model that belongs to the TauE superfamily, and the C-terminal region achieved a statistically significant hit to a conserved domain model that belongs to the Thioredoxin_like Superfamily. However, neither hit had a high enough score to be considered a specific hit.

    As a result, only the superfamily classification is shown for each region of the protein, and is therefore regarded as a superfamily architecture.

    Note: Superfamily architectures are currently found only in the CDART resource. A brief description of CDART is provided in the "Compare CDD, CDART, and SPARCLE" section of this document.


  • Subfamily architectures

    Subfamily architectures either contain a mix of conserved domain superfamilies and subfamilies, or consist solely of conserved domain subfamilies.

    A subfamily is represented by a conserved domain model that gets a specific hit to the protein query sequence. The specific hits represent a high confidence that the query sequence belongs to the same protein family as the sequences used to create each conserved domain model, and therefore a high confidence level for the inferred function of the protein query sequence.

    To see if a conserved domain is a superfamily or subfamily, mouse over a conserved domain's footprint in the architecture's graphical display. A superfamily will have a "cl" prefix in the accession number; the "cl" stands for superfamily cluster. A subfamily will have an accession number prefix other than "cl".

    One example of a subfamily architecture that consists solely of subfamilies is:

         N-terminus------[COG0785]-------[cd03012]------C-terminus

    Here, the accession number prefixes are "COG" and "cd," indicating that both conserved domains are specific hits. This architecture can be seen, for example, in the CD-Search results for the query protein NP_217390: integral membrane C-type cytochrome biogenesis protein DipZ [Mycobacterium tuberculosis H37Rv]. In the "Protein Classification" section of the CD-Search results, click on the link for "domain architecture ID 10002697" to open the corresponding SPARCLE record for that conserved domain architecture, if desired.

    Whether you view the CD-Search results for NP_217390, or the SPARCLE record for domain architecture ID 10002697, you will see that each conserved domain in the architecture achieves a specific hit to the query protein. This can be viewed on the CD-Search results page, in the "Specific Hits" line of the "Graphical Summary." It can also be viewed in the corresponding architecture record, by mousing over the conserved domain cartoons in the architecture's graphic to see that the accession number of each graphic begins with a prefix other than "cl".

    Note: Subfamily architectures are currently found only in the SPARCLE resource. A brief description of SPARCLE is provided in the "Compare CDD, CDART, and SPARCLE" section of this document.
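
The accession prefix rule described above is simple enough to express as a short Python sketch. The function names are hypothetical; the accessions are the ones used in the two examples above.

```python
def hit_type(accession):
    """A "cl" prefix marks a superfamily hit; any other prefix marks a specific hit."""
    return "superfamily" if accession.startswith("cl") else "specific"


def architecture_type(accessions):
    """Classify an architecture from the accessions of its domain footprints."""
    if all(hit_type(acc) == "superfamily" for acc in accessions):
        return "superfamily architecture"
    # A mix of superfamilies and subfamilies, or subfamilies only.
    return "subfamily architecture"


print(architecture_type(["cl21514", "cl00388"]))
print(architecture_type(["COG0785", "cd03012"]))
print(architecture_type(["cl21514", "cd03012"]))
```

The first call reports a superfamily architecture, and the other two report subfamily architectures, because a single specific hit is enough to make an architecture a subfamily architecture.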

Architectures with a single conserved domain footprint:
  • It is also possible for a domain architecture to consist of a single conserved domain footprint. That footprint can represent either a superfamily architecture or a subfamily architecture.

Each architecture receives a unique and stable architecture ID:
  • Each conserved domain architecture receives a unique and stable architecture ID, which reflects the set of conserved domain models that are top-scoring hits (as determined by the CD-Search service) on the proteins that possess the architecture, the sequential order of those domains, and the type of hit each domain has to the proteins. Architectures that consist of a single conserved domain footprint also receive an architecture ID.

Additional information about conserved domains:

Ongoing Research

Please note that conserved domain models, architectures, and the resulting protein sequence clusters, continue to evolve as new data become available and as research progresses. As a result, the domain architecture annotated on a protein sequence, and the members of a protein sequence cluster, might change over time.
  • Specifically, the CDD curation project refines conserved domain models as new protein sequences and publications become available, and through closer analysis of existing clusters.

    • For example, when the CDD curators see a cluster of protein sequences in SPARCLE that is functionally diverse and that can be broken up into subclusters with more precise function, they do that by creating the appropriate domain models that will reflect the diverse functions. The refined domain models are then added to the data processing pipeline that defines conserved domain architectures and corresponding groups of protein sequences.

  • Additionally, an architecture that is composed of several individual conserved domain models might later be superseded by a multi-domain model that represents the full-length protein.

    • As an example, in January 2017, the protein sequence NP_387887 was initially annotated with architecture ID 10647733 (as shown in the illustrated example in the "input sequence data" section of this document). That architecture is named "DNA gyrase subunit B" and includes four distinct conserved domains.

    • In March 2017, when a new build of CDD/SPARCLE was released, the conserved domain architecture annotation for NP_387887 was revised to architecture ID 11481348, which is a multi-domain model that encompasses the four original conserved domains, and which can be seen in the current CD-Search results for NP_387887. That architecture has a more specific and precise name, "type IIA DNA topoisomerase subunit B," and reflects the full-length protein model.

    • To see the four distinct conserved domains that compose the full-length protein model, change the display option on the live CD-Search results for NP_387887 from "View Concise Results" to "View Full Results" (using the "View" menu near the upper right-hand corner of the CD-Search results page).

In this way, as the available data and understanding of conserved domain architectures continue to evolve, the domain architectures that are annotated on proteins may evolve as well.

Comments about the data are welcome and can be sent to the NCBI Support Center/Help Desk, which is accessible as a link in the footer of NCBI web pages.

Links from architectures to other data types

The SPARCLE data processing pipeline calculates two types of direct links:
  1. sparcle_protein: each conserved domain architecture in the SPARCLE database links to all protein sequences that have the architecture.

  2. sparcle_cdd: each conserved domain architecture in the SPARCLE database links to all of the conserved domain models (specific hits and superfamilies) that compose the architecture. For example, if an architecture contains one specific hit and one superfamily, that SPARCLE record will link to two Conserved Domain Database (CDD) records -- one for the specific hit and one for the superfamily.
All other links between SPARCLE and other Entrez databases are indirect: they are derived by joining the sparcle_protein links (the proteins that contain the architecture) with the links from those proteins to the other data type. For example:
  • links from SPARCLE architectures to Gene records are created by a join between the following:
    sparcle_protein  AND  protein_gene  →  sparcle_gene

  • links from SPARCLE architectures to BioAssay records are created by a join between the following:
    sparcle_protein  AND  protein_pcassay_target  →  sparcle_pcassay_target
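
The indirect links described above amount to composing two link tables through their shared protein identifiers. The following is a minimal in-memory sketch of that idea, not the actual SPARCLE pipeline: the join_links function and every identifier in the toy data ("arch1", "protA", "gene1", and so on) are hypothetical placeholders, and only the link names sparcle_protein, protein_gene, and sparcle_gene come from this document.

```python
from collections import defaultdict


def join_links(first, second):
    """Compose two link tables: A -> B and B -> C give A -> C."""
    joined = defaultdict(set)
    for source, middles in first.items():
        for middle in middles:
            # Follow each intermediate record to its own targets, if any.
            joined[source].update(second.get(middle, ()))
    # Drop sources that end up with no targets.
    return {source: targets for source, targets in joined.items() if targets}


# Toy data; every identifier below is a made-up placeholder.
sparcle_protein = {"arch1": {"protA", "protB"}, "arch2": {"protC"}}
protein_gene = {"protA": {"gene1"}, "protB": {"gene1", "gene2"}}

# sparcle_protein AND protein_gene -> sparcle_gene
sparcle_gene = join_links(sparcle_protein, protein_gene)
print(sparcle_gene)
```

In this toy data, "arch2" gets no gene link because its only protein has no entry in protein_gene, which illustrates that an indirect link exists only where both underlying links exist. The same function would apply unchanged to a protein_pcassay_target table to produce sparcle_pcassay_target.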


 
Log of Changes to SPARCLE
12 OCT 2016 Initial release of the Subfamily Protein Architecture Labeling Engine (SPARCLE).
SPARCLE is a resource for the functional characterization and labeling of protein sequences that have been grouped by their characteristic domain architecture. To use SPARCLE, you can either: (1) enter a query protein sequence into CD-Search, which will display a "Protein Classification" on the results page if the query protein has a hit to a curated domain architecture in the SPARCLE database, or (2) search the SPARCLE database by keyword to retrieve domain architectures that contain the term(s) of interest in their descriptions. With either approach, the corresponding SPARCLE record(s) will display the name and functional label of the architecture, supporting evidence, and links to other proteins with the same architecture. Additional information and illustrated examples are provided on the "About SPARCLE" page and in this help document.

 
References


Citing SPARCLE:

Marchler-Bauer A, Bo Y, Han L, He J, Lanczycki CJ, Lu S, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Lu F, Marchler GH, Song JS, Thanki N, Wang Z, Yamashita RA, Zhang D, Zheng C, Geer LY, Bryant SH. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 2017 Jan 4;45(D1):D200-D203. doi: 10.1093/nar/gkw1129. Epub 2016 Nov 29. [PubMed PMID: 27899674] [Full Text at Oxford Academic]

Additional references:

Lu S, Wang J, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Marchler GH, Song JS, Thanki N, Yamashita RA, Yang M, Zhang D, Zheng C, Lanczycki CJ, Marchler-Bauer A. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 2019 Nov 28. pii: gkz991. doi: 10.1093/nar/gkz991. [Epub ahead of print] [PubMed PMID: 31777944] [Full Text at Oxford Academic]

(NOTE: The above reference is for the e-publication ahead of print, and will be updated to reflect the volume, issue, pages, and publication date of the print version, once it becomes available in January 2020.)
A separate page lists all publications about NCBI's Conserved Domains and Protein Classification Resources.

 
Revised 02 December 2019