CDART Help
 
 
   

This help document describes how to use CDART, including detailed descriptions of the input required, output displays, and the program's features and functions. The Conserved Domains resources page describes additional, related resources and provides "How To" guides that illustrate how those resources can be used.

 
     
 
BRIEF TABLE OF CONTENTS
 
  What is CDART?
Conserved Domain Architecture Retrieval Tool

Quick start guide
1-2-3 step process (illustration)

Input options
Enter query into CDART home page
- protein sequence
- set of conserved domains (CDs)
- multiple queries
Retrieve Entrez Protein sequence record
- follow "Domain Relatives" link

Output Display
Graphical summary of similar architectures
Filter your results
Details for individual domain architecture

References
 
 
 

 


 
What is CDART?

The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture. A domain architecture is defined as the sequential order of conserved domains (functional units) in a protein sequence.

In this way, CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity.

Given a query sequence, CDART shows the conserved domains that make up a protein and then lists proteins with a similar domain architecture. The conserved domains in a sequence are found by RPS-BLAST, which represents each domain as a position-specific scoring matrix (PSSM), a set of scores reflecting how likely each amino acid is to occur at each position of the domain. RPS-BLAST performs a "profile" search, which is a sensitive way to look for sequence homologues. Proteins similar to the query are then grouped and scored by domain architecture.
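To make the idea of a profile concrete, the sketch below scores every window of a short sequence against a toy position-specific matrix using log-odds scores. It is only an illustration of the general technique, not RPS-BLAST's actual algorithm. The four-letter alphabet, the probabilities in PROFILE, and the uniform background are all invented for the example.

```python
import math

# Toy profile for a 3-residue motif over a reduced 4-letter alphabet.
# The probabilities are illustrative values, not taken from any real domain model.
PROFILE = [
    {"A": 0.7, "G": 0.1, "K": 0.1, "L": 0.1},
    {"A": 0.1, "G": 0.7, "K": 0.1, "L": 0.1},
    {"A": 0.1, "G": 0.1, "K": 0.7, "L": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background frequency for the toy alphabet


def window_score(window):
    """Sum the log-odds score of each residue at its position in the profile."""
    return sum(
        math.log2(column[residue] / BACKGROUND)
        for column, residue in zip(PROFILE, window)
    )


def best_match(sequence):
    """Return (score, start) for the highest-scoring window in the sequence."""
    width = len(PROFILE)
    scored = [
        (window_score(sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


score, start = best_match("LLAGKLL")
print(f"best window starts at index {start} with score {score:.2f}")
```

A window that matches the preferred residue at every position scores well above the background, which is why a profile can detect distant relatives that share only the conserved positions.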

You can either search CDART directly with a query protein sequence, or retrieve a protein sequence record from the Entrez Protein database and select "Domain Relatives" from the "Related Information" menu in the right margin of the page to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

(A related tool, SPARCLE, the Subfamily Protein Architecture Labeling Engine, is a resource for the functional characterization and labeling of protein sequences that have been grouped by their characteristic conserved domain architecture. The SPARCLE Help document provides a comparison of CDD, CDART, and SPARCLE and includes examples of how each resource can be used.)


 
 
Quick Start Guide

  The illustration below shows the easy, 1-2-3 step process for using the Conserved Domain Architecture Retrieval Tool (CDART).



  Illustration of the CDART home page, where you can input a query as a protein sequence, as a set of conserved domains, or as multiple queries.

  Illustration of a sample protein sequence record (mouse DNA mismatch repair protein Mlh1, NP_081086) from the Entrez Protein database, where you can follow the "Domain Relatives" link to view a list of proteins with similar domain architectures.

  Illustration of CDART search results, which list proteins that have domain architectures similar to your query protein sequence (NP_081086, mouse DNA mismatch repair protein, in this example).

  Illustration of the expanded view of a domain architecture, which displays a list of representative, non-redundant protein sequences that have that architecture.

 
If you would like to try this example yourself, open the CDART home page and enter NP_081086 (mouse DNA mismatch repair protein Mlh1) as the query, or retrieve the sequence record from the Entrez Protein database and then follow the link for "Domain Relatives" that appears under "Related Information" in the right margin. Subsequent sections in this help document provide additional details about the input options and output display.



 
 
Input Options

| enter query directly into CDART home page as a  protein sequence, set of conserved domains, or multiple queries |
| retrieve sequence record from Entrez Protein → follow "Domain Relatives" link |

Enter query directly into CDART home page


Illustration of the CDART home page, where you can input a query as a protein sequence, as a set of conserved domains, or as multiple queries.
One way to retrieve proteins with similar domain architectures is to enter your query as a protein sequence, or as a set of conserved domains, directly into the CDART home page in any of the following formats:
  • Protein sequence


  • You can submit a protein sequence as a protein accession or GI number, as FASTA-formatted sequence data, or as a bare sequence.
    The CDART results will show the functional domains found in the query protein and will list proteins with a similar domain architecture. The similar proteins must include at least one of the conserved domain superfamilies in the query sequence. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in the query protein, and is used to rank the search results.
  • Set of conserved domains (CDs)


  • As an alternative to submitting a protein sequence as a query, you can specify a query as a set of one or more conserved domains*, using any of the identifiers below to specify your domains of interest. They should be entered on a single line, separated by commas, and surrounded by square brackets [], as in the examples below:

    • conserved domain superfamily cluster IDs - As explained in the Conserved Domain Database help document, a superfamily cluster is a set of conserved domain models that generate overlapping annotation on the same protein sequences. These models are assumed to represent evolutionarily related domains and may be redundant with each other.

      A superfamily ID (accession number) begins with the prefix "cl" for "cluster," and can be entered in CDART as the complete alphanumeric accession or as digits only (with or without the leading zeros). For example, a query to retrieve proteins with domain architectures that include superfamilies cl00075 (HATPase_c Superfamily) and cl02783 (TopoII_MutL_Trans Superfamily) can be entered in any of the following ways:

      [cl00075,cl02783]
      [00075,02783]
      [75,2783]
      [cl00075,2783]
      etc.


    • Accession numbers or PSSM IDs for specific domain models - If you are interested in a specific conserved domain model, you can enter its conserved domain accession number or position specific scoring matrix ID (PSSM ID). If you enter a PSSM ID, be sure to include a leading "p" so it won't be interpreted as a cluster ID. Note: The PSSM ID is displayed in the "Statistics" box of a domain model's summary page in the Conserved Domain Database. For example, a query for the domain models pfam02518 (whose current PSSM ID is 190334) and cd03483 (whose current PSSM ID is 48471) can be entered as:

      [pfam02518,cd03483]
      or
      [p190334,p48471]


    • mix of superfamily cluster IDs, conserved domain accessions, PSSM IDs - Use the same syntax rules as above. For example, a query to retrieve proteins with domain architectures that include superfamily cl00075 and domain model cd03483 (whose current PSSM ID is 48471) can be entered in any of the following ways:

      [cl00075,cd03483]
      [cl00075,p48471]
      [00075,p48471]
      [75,p48471]
      etc.
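The identifier rules above can be summarized in a short sketch that classifies each entry of a bracketed domain set and shows that the different spellings of a superfamily cluster ID are equivalent. This is an illustration only, not CDART's own parser. The function names are hypothetical, and the five-digit zero padding and lowercase prefixes are assumptions drawn from the examples in this document.

```python
import re


def classify_domain_token(token: str) -> tuple[str, str]:
    """Classify one entry of a bracketed domain query.

    Returns a (kind, value) pair. Assumes lowercase prefixes as in the examples.
    """
    token = token.strip()
    # A leading "p" followed only by digits marks a PSSM ID.
    if re.fullmatch(r"p\d+", token):
        return ("pssm_id", token[1:])
    # Digits alone, or "cl" plus digits, denote a superfamily cluster ID.
    if re.fullmatch(r"(cl)?\d+", token):
        digits = token[2:] if token.startswith("cl") else token
        # Five-digit padding is an assumption based on IDs such as cl00075.
        return ("superfamily", f"cl{int(digits):05d}")
    # Anything else of the form letters+digits is treated as a domain accession.
    if re.fullmatch(r"[A-Za-z]+\d+", token):
        return ("domain_accession", token)
    raise ValueError(f"unrecognized domain identifier: {token!r}")


def parse_domain_set(query: str) -> list[tuple[str, str]]:
    """Parse a query such as '[75,p48471]' into classified identifiers."""
    text = query.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("a domain set must be surrounded by square brackets")
    return [classify_domain_token(tok) for tok in text[1:-1].split(",")]


for example in ("[cl00075,cl02783]", "[75,2783]", "[pfam02518,cd03483]", "[75,p48471]"):
    print(example, "->", parse_domain_set(example))
```

Checking for the leading "p" before checking for bare digits mirrors the caution in the text: without the "p", a PSSM ID would be read as a cluster ID.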


 

Note: The proteins that are returned by CDART will include at least one of the domains you have specified. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in the query protein, and is used to rank the search results.

Regardless of how you specify the conserved domains in your query (as superfamily cluster IDs or as the accessions or PSSM IDs of individual domain models), the CDART search results will display the superfamilies to which those models belong, and not the individual domain models themselves. However, you can see superfamilies and individual domain models by following the "domain details" link that appears in the expanded view of any domain architecture.

If you enter a single conserved domain as a query, you will retrieve all the domain architectures that contain the domain, ranked by the number of non-redundant proteins that have a given architecture.

 

  • Multiple queries


    • You can enter multiple queries using any of the formats above (i.e., as protein sequences or sets of conserved domains) or a mix of those formats. Note that:


      • Each protein accession or GI number should be on a separate line.


      • FASTA-formatted sequence data and bare sequences can occupy multiple lines. (The FASTA definition line, however, should occupy a single line.)


      • If one of your queries is a set of conserved domains, they should be entered on a single line, separated by commas, and surrounded by square brackets [], as in the third line of the example below.


      • If you include a bare sequence as one of the queries, use a blank line to separate it from the query that precedes it, as in the last part of the example below.


    • Example: The input below includes six queries, in the following order: (1) protein GI number; (2) protein accession number; (3) set of conserved domain superfamily cluster IDs; (4) protein sequence in FASTA format; (5) another protein sequence in FASTA format; (6) protein as bare sequence data:

         269849668
         EDV04934
         [75,2783]
         >gi|239592572|gb|EEQ75153.1| asparaginase [Ajellomyces dermatitidis SLH14081]
         MSPPIPQPRQRTRSQPLFKPAVILHGGAGNIQHSRLPPELYKQYRTSLLTYLRSTTALLNADIEEEEPSI
         NAKNDAVDDNMRISPASALNAAVHAVSLMEDNELFNCGRGSVFTSAGTIEMEASVMVASLLNDEDSVDDF
         NNSEVNCLASEKTPGSIKRGAGVMLVRNVRHPIQLAKEVLLRTGYASDGDGDGGNMHSQLSGEYVEGLAR
         DWGMEFCPDDWFWTKKRWDEHRRGLKKGKTRGRMTDGRNMGADVEVRGEGEADDGDGLYLSQGTVGCVCL
         DRWGNIAVATSTGGLTNKCPGRIGDTPTLGAGFWAEAWDVEGVEGLSNMSDSSNSVCASGRDRSKGCIQL
         KRDTMNYQTQDGRDNLLAYQASSSTTTTTSSYRMGSQWRSDFDSNSAFTLIRDCFSSSPPPPGYAALEPS
         KYPVEKFPLGKSTSSPHTDFNPHRYSQPQRRRILALSGTGNGDSFLRTAATRTAAAMVRFGSAQNSISLA
         QAVTAVAGPGGELQRSAGRRWGKTGEGEGGIIGIEAEVETDEQTLGEGKLRRGKVVFDFNSTGMFRAWME
         EKDGKDVERMMVFRDDYE
         >gi|336020358|ref|NP_001229488.1| mitogen-activated protein kinase kinase kinase kinase 4 isoform 4 [Homo sapiens]
         MANDSPAKSLVDIDLSSLRDPAGIFELVEVVGNGTYGQVYKGRHVKTGQLAAIKVMDVTEDEEEEIKLEI
         NMLKKYSHHRNIATYYGAFIKKSPPGHDDQLWLVMEFCGAGSITDLVKNTKGNTLKEDWIAYISREILRG
         LAHLHIHHVIHRDIKGQNVLLTENAEVKLVDFGVSAQLDRTVGRRNTFIGTPYWMAPEVIACDENPDATY
         DYRSDLWSCGITAIEMAEGAPPLCDMHPMRALFLIPRNPPPRLKSKKWSKKFFSFIEGCLVKNYMQRPST
         EQLLKHPFIRDQPNERQVRIQLKDHIDRTRKKRGEKDETEYEYSGSEEEEEEVPEQEGEPSSIVNVPGES
         TLRRDFLRLQQENKERSEALRRQQLLQEQQLREQEEYKRQLLAERQKRIEQQKEQRRRLEEQQRREREAR
         RQQEREQRRREQEEKRRLEELERRRKEEEERRRAEEEKRRVEREQEYIRRQLEEEQRHLEVLQQQLLQEQ
         AMLLECRWREMEEHRQAERLQRQLQQEQAYLLSLQHDHRRPHPQHSQQPPPPQQERSKPSFHAPEPKAHY
         EPADRAREVEDRFRKTNHSSPEAQSKQTGRVLEPPVPSRSESFSNGNSESVHPALQRPAEPQVPVRTTSR
         SPVLSRRDSPLQGSGQQNSQAGQRNSTSIEPRLLWERVEKLVPRPGSGSSSGSSNSGSQPGSHPGSQSGS
         GERFRVRSSSKSEGSPSQRLENAVKKPEDKKEVFRPLKPADLTALAKELRAVEDVRPPHKVTDYSSSSEE
         SGTTDEEDDDVEQEGADESTSGPEDTRAASSLNLSNGETESVKTMIVHDDVESEPAMTPSKEGTLIVRQT
         QSASSTLQKHKSSSSFTPFIDPRLLQISPSSGTTVTSVVGFSCDGMRPEAIRQDPTRKGSVVNVNPTNTR
         PQSDTPEIRKYKKRFNSEILCAALWGVNLLVGTESGLMLLDRSGQGKVYPLINRRRFQQMDVLEGLNVLV
         TISGKKDKLRVYYLSWLRNKILHNDPEVEKKQGWTTVGDLEGCVHYKVVKYERIKFLVIALKSSVEVYAW
         APKPYHKFMAFKSFGELVHKPLLVDLTVEEGQRLKVIYGSCAGFHAVDVDSGSVYDIYLPTHIQCSIKPH
         AIIILPNTDGMELLVCYEDEGVYVNTYGRITKDVVLQWGEMPTSVAYIRSNQTMGWGEKAIEIRSVETGH
         LDGVFMHKRAQRLKFLCERNDKVFFASVRSGGSSQVYFMTLGRTSLLSW
      
         MEQDPKPPRLRLWALIPWLPRKQRPRISQTSLPVPGPGSGPQRDSDEGVLKEISITHHVKAGSEKADPSH
         FELLKVLGQGSFGKVFLVRKVTRPDSGHLYAMKVLKKATLKVRDRVRTKMERDILADVNHPFVVKLHYAF
         QTEGKLYLILDFLRGGDLFTRLSKEVMFTEEDVKFYLAELALGLDHLHSLGIIYRDLKPENILLDEEGHI
         KLTDFGLSKEAIDHEKKAYSFCGTVEYMAPEVVNRQGHSHSADWWSYGVLMFEMLTGSLPFQGKDRKETM
         TLILKAKLGMPQFLSTEAQSLLRALFKRNPANRLGSGPDGAEEIKRHVFYSTIDWNKLYRREIKPPFKPA
         VAQPDDTFYFDTEFTSRTPKDSPGIPPSAGAHQLFRGFSFVATGLMEDDGKPRAPQAPLHSVVQQLHGKN
         LVFSDGYVVKETIGVGSYSECKRCVHKATNMEYAVKVIDKSKRDPSEEIEILLRYGQHPNIITLKDVYDD
         GKHVYLVTELMRGGELLDKILRQKFFSEREASFVLHTIGKTVEYLHSQGVVHRDLKPSNILYVDESGNPE
         CLRICDFGFAKQLRAENGLLMTPCYTANFVAPEVLKRQGYDEGCDIWSLGILLYTMLAGYTPFANGPSDT
         PEEILTRIGSGKFTLSGGNWNTVSETAKDLVSKMLHVDPHQRLTAKQVLQHPWVTQKDKLPQSQLSHQDL
         QLVKGAMAATYSALNSSKPTPQLKPIESSILAQRRVRKLPSTTL
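If you assemble a multi-query input programmatically, the layout rules above can be applied as in the following sketch. The function name, the query "kinds", and the 70-character line width are assumptions made for illustration. The sequences in the usage example are shortened placeholders, not real records.

```python
def wrap_sequence(sequence: str, width: int = 70) -> list[str]:
    """Split a sequence into lines of at most `width` residues."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def build_multi_query(queries) -> str:
    """Assemble text for the CDART query box from (kind, value) pairs.

    kind is one of (hypothetical labels):
      "id"      - value is an accession or GI number (one per line)
      "domains" - value is a list of domain identifiers (one bracketed line)
      "fasta"   - value is a (definition_line, sequence) pair
      "bare"    - value is a sequence with no definition line
    """
    lines = []
    for kind, value in queries:
        if kind == "id":
            lines.append(str(value).strip())
        elif kind == "domains":
            lines.append("[" + ",".join(value) + "]")
        elif kind == "fasta":
            definition, sequence = value
            # The definition line must stay on a single line.
            lines.append(">" + definition.lstrip(">").replace("\n", " "))
            lines.extend(wrap_sequence(sequence))
        elif kind == "bare":
            # A blank line separates a bare sequence from the preceding query.
            if lines:
                lines.append("")
            lines.extend(wrap_sequence(value))
        else:
            raise ValueError(f"unknown query kind: {kind!r}")
    return "\n".join(lines)


print(build_multi_query([
    ("id", "269849668"),
    ("id", "EDV04934"),
    ("domains", ["75", "2783"]),
    ("fasta", ("example_protein placeholder definition line", "MSPPIPQPRQ")),
    ("bare", "MEQDPKPPRL"),
]))
```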
         

Search Entrez Protein → link to "Domain Relatives"


Illustration of a sample protein sequence record (mouse DNA mismatch repair protein Mlh1, NP_081086) from the Entrez Protein database, where you can follow the "Domain Relatives" link to view a list of proteins with similar domain architectures.
A second way to access CDART is to start by retrieving a record of interest from the Entrez Protein database, then follow the "Domain Relatives" link in the right margin of the sequence record. That will open the precalculated CDART results for the protein.

Note that "Domain Relatives" is one of four links available from a protein sequence record to conserved domain annotations, allowing you to choose: (a) the format in which you want to view the conserved domains (e.g., in graphical format as domain footprints aligned to the protein sequence; as a list of records from the Conserved Domain Database, each of which includes a multiple sequence alignment of the proteins used to create the domain model; or as a list of proteins with similar domain architectures), and (b) the level of redundancy in the list of conserved domain models (e.g., a concise list of the top scoring models or a full list of all models that have a statistically significant RPS-BLAST hit to the protein).

The number of conserved domain models retrieved, and the order in which they are sorted and presented, depend on the view you select:

  • Domain Relatives -- opens a graphical display of similar domain architectures, as determined by the CDART tool. A domain architecture is defined as the sequential order of conserved domains in a protein query sequence. The score for each CDART hit represents the number of domains that match those found in the query protein. (The CDART paper provides additional details.)

  • CDD Search Results -- opens a graphical display (illustrated example) of conserved domain model footprints on the query protein, ranked by their RPS-BLAST score and hit type. A model may appear more than once if it aligns to multiple regions of the query sequence. A concise display showing only the top-scoring hits is presented by default, and it can be changed to a full display of all hits if desired. (The CDD help document provides additional details.)


  • Conserved Domains (Concise) -- opens a concise list of the conserved domain models that are the top-scoring RPS-BLAST hits to the query protein. Each domain model is listed only once, even if a model had a hit to more than one region on the query sequence. (The CDD help document provides additional details.)


  • Conserved Domains (Full) -- opens a full list of all the conserved domain models that have a statistically significant RPS-BLAST hit to the query protein. Each domain model is listed only once, even if a model had a hit to more than one region on the query sequence. (The CDD help document provides additional details.)


 
 
Output

Graphical summary of similar domain architectures


| query | list of similar domain architectures | filter your results | information provided for each domain architecture |
Illustration of CDART search results, which list proteins that have domain architectures similar to your query protein sequence (NP_081086, mouse DNA mismatch repair protein, in this example).

  • Query


    • The query you entered is displayed in a yellow background at the top of the CDART search results.


      • If your query was a protein sequence, the graphic shows the length of the protein in amino acids and the footprints of the highest scoring conserved domain superfamily(ies) found in the query sequence by RPS-BLAST. On the left side of the graphic is the description of the query and the total number of domain architectures found in the Entrez Protein database that contain at least one superfamily from the query. If you used any search result filters, a second number will show the number of remaining architectures after applying the filter. The two numbers may be the same.


      • If your query was a set of conserved domains rather than a protein sequence, the domains will be shown in the same order in which they were input, without a scale showing length. The "total architectures" statistic will indicate the number of domain architectures found in the Entrez Protein database that contain all of the domain superfamilies in your query.


  • List of similar domain architectures


    • A domain architecture is defined as the sequential order of conserved domains in a protein sequence. Each domain architecture displayed by CDART therefore represents a unique set and order of conserved domain superfamilies found among sequences in the Entrez Protein database.

    • The CDART results list the proteins with a similar domain architecture to your query. The similar proteins must include at least one of the conserved domain superfamilies in the query sequence. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in the query protein, and is used to rank the search results.

    • Architectures are displayed as graphs. Different superfamilies are displayed as different shape/color combinations. Mouse over a superfamily footprint to open a pop-up that shows the superfamily accession number (cluster ID), title, and description. Click on a footprint to open the Conserved Domain Database summary page for that superfamily, which lists the individual conserved domain models that belong to the superfamily.

    • Protein sequences that share the same domain architecture are grouped together, and only one from each group is shown in the default display. Additional proteins can be viewed by clicking on the [+] beside a domain architecture of interest. (See additional details under "Information provided for each domain architecture," which includes an illustration of the expanded view of the architecture circled in red in the image above.)

  • Filter your results


    • The "Filter your results" bar at the top of a CDART search results page allows you to refine the results by including/excluding proteins from specific taxonomic groups, or including/excluding architectures that contain specific domain superfamily(ies). The contents of the "Filter your results" dialog box are generated dynamically and represent the organisms and domain superfamilies that were found in your CDART search results. Select the desired taxa or superfamilies from the lists provided in the dialog box to include or exclude specific taxa or superfamilies from your CDART search results display.


      For taxonomic filtering, you can choose to see only the sentinel taxonomy nodes, or a tree view of all taxonomy nodes, that were found in your CDART results:


      • Sentinel Taxonomy Nodes -- are generally ancient taxonomic nodes which are represented by a large number and diversity of sequences in the Entrez Protein database. A sentinel tax node needs to be separated in evolutionary time from the other sentinel tax nodes by several hundred million years. In addition, a sentinel tax node needs to have a sufficient amount and diversity of sequence data representing a wide range of proteins within the node, and preferably at least one complete genome. Sentinel tax nodes are used by CDD curators to estimate the evolutionary age of a protein subfamily. (The main CDD help document provides additional details about NCBI-curated domain models and domain family hierarchies.) Note that the "Sentinel Taxonomy Nodes" option in the "Filter your results" menu provides access to the subset of proteins in your CDART search results that belong to a sentinel tax node; any proteins in your search results that do not belong to a sentinel tax node will be accessible only through the "NCBI Taxonomy Tree" option in the "Filter your results" menu.

      • NCBI Taxonomy Tree -- provides a tree view of all the taxonomic nodes (from the NCBI taxonomy project) that are represented by proteins in your CDART search results. Selecting a tax node from the tree means that you are selecting all of its sub-nodes.


    • As an alternative to selecting taxa and/or domain superfamilies from the lists and then using the "Include" or "Exclude" buttons, you can type the desired parameters directly into the "Filter your results" text box, in a format such as:

      • INCLTAX[xxx,yyy,zzz]
        Include only sequences that belong to taxonomy ID (TaxID) xxx OR yyy OR zzz.

      • EXCLTAX[xxx,yyy,zzz]
        Exclude sequences that belong to taxonomy ID (TaxID) xxx OR yyy OR zzz.

      • INCLDOM[xxx,yyy,zzz]
        Include only sequences that contain the conserved domain superfamilies with cluster IDs xxx AND yyy AND zzz.

      • EXCLDOM[xxx,yyy,zzz]
        Exclude sequences that contain the conserved domain superfamilies with cluster IDs xxx AND yyy AND zzz.

      The above operations can be combined using the logical operators NOT, AND and OR, as well as parentheses. For example:

      INCLTAX[aaa,bbb] AND NOT INCLDOM[xxx]

      will filter your search results so they include only sequences that belong to taxonomy ID (TaxID) aaa or bbb, and exclude sequences that contain the conserved domain superfamily with cluster ID xxx.

      The logical operators are executed with the following precedence:

      () > NOT > AND > OR
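The filter syntax and operator precedence described above can be modeled with a small evaluator, sketched below. It is one plausible reading of the rules in this section and not CDART's implementation. The function names are hypothetical, IDs are assumed to be entered as digits only, and each sequence is assumed to be described by the set of TaxIDs in its lineage and the set of superfamily cluster IDs it contains.

```python
import re

TOKEN = re.compile(
    r"\s*(?:(INCLTAX|EXCLTAX|INCLDOM|EXCLDOM)\[([^\]]*)\]|(NOT|AND|OR)\b|([()]))"
)


def tokenize(expression):
    """Split a filter expression into (symbol, ids) pairs; ids is None for operators."""
    tokens, pos = [], 0
    expression = expression.rstrip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected text: {expression[pos:]!r}")
        if match.group(1):
            ids = frozenset(int(x) for x in match.group(2).split(","))
            tokens.append((match.group(1), ids))
        else:
            tokens.append((match.group(3) or match.group(4), None))
        pos = match.end()
    return tokens


def term_matches(op, ids, taxids, domains):
    """Evaluate one INCL/EXCL term for a single sequence."""
    if op == "INCLTAX":
        return bool(ids & taxids)      # belongs to xxx OR yyy OR zzz
    if op == "EXCLTAX":
        return not (ids & taxids)
    if op == "INCLDOM":
        return ids <= domains          # contains xxx AND yyy AND zzz
    return not (ids <= domains)        # EXCLDOM


def passes_filter(expression, taxids, domains):
    """Return True if a sequence is kept; precedence is () > NOT > AND > OR."""
    tokens, pos = tokenize(expression), 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def parse_or():
        nonlocal pos
        value = parse_and()
        while peek() == "OR":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_not()
        while peek() == "AND":
            pos += 1
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        nonlocal pos
        if peek() == "NOT":
            pos += 1
            return not parse_not()
        return parse_atom()

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        op, ids = tokens[pos]
        pos += 1
        if op == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if ids is None:
            raise ValueError(f"unexpected token: {op!r}")
        return term_matches(op, ids, taxids, domains)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# A sequence from TaxID 9606 that contains superfamily 2783 but not 75 is kept.
print(passes_filter("INCLTAX[9606,10090] AND NOT INCLDOM[75]", {9606}, {2783}))
```

Because AND binds more tightly than OR, an expression such as INCLTAX[1] OR INCLTAX[2] AND INCLDOM[5] is read as INCLTAX[1] OR (INCLTAX[2] AND INCLDOM[5]); add parentheses if you intend the other grouping.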

  • Information provided for each domain architecture


    • Title - The title of the representative protein sequence that contains this domain architecture.

    • Taxonomy span - The highest taxonomic node common to all sequences with this domain architecture.

    • Similarity score - The number of conserved domain superfamilies in this domain architecture that match superfamilies in the query sequence. The score does not count repeats of superfamilies within a domain architecture; rather, it counts each superfamily only once, regardless of how many times that superfamily appears in a protein sequence.

      The domain architectures shown on a CDART search results page are ranked by score. If two or more architectures have the same score, they are ranked by the number of non-redundant protein sequences that contain the architecture, so that architectures supported by few sequences (which are more likely to be spurious hits) appear at the bottom.

    • Total nr (non-redundant) sequences -

      • Protein sequences that share the same domain architecture are grouped together.

      • If two or more of the proteins are identical in length and composition, they are placed in the same protein identity group ("PIG"). CDART displays one representative sequence from each PIG in order to produce a non-redundant (nr) list of sequences that contain a specific domain architecture.

      • The "total nr sequences" statistic represents the total number of PIGs with the domain architecture.

    • Lookup sequences in Entrez - Retrieves the non-redundant set of sequence records that contain the domain architecture.

    • [+]/[-] to expand/contract the display for an individual domain architecture - Press the [+] beside a domain architecture to view:

      • list of representative, non-redundant protein sequences that have this domain architecture. (See additional details under "total nr sequences," above.)

        • Each sequence in the list may be a representative of a larger group of identical proteins, called a protein identity group ("PIG").

        • To retrieve all proteins that contain the domain architecture, follow the "Lookup sequences in Entrez" link for a domain architecture of interest (which will retrieve the non-redundant set of proteins). Then follow the link for "Find related data|Database:Protein|Option:Identical Proteins" in the right margin of the resultant Entrez Protein search results page (or follow the "Related information:Identical Proteins" link in the right-hand margin of an individual protein sequence record to see all members of its PIG).

      • "domain details" link opens the interactive CD-Search results for the sequence, which shows not only conserved domain superfamilies that have been mapped onto the protein, but also the individual conserved domain models that align to the protein, with options to view those results as a concise display or full display.
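The similarity score, the "total nr sequences" statistic, and the ranking order described in this section can be tied together in a short sketch. It is a simplified model, not CDART's code: the function name and the superfamily labels (sfA, sfB, and so on) are hypothetical, identical proteins are detected by exact sequence match, and the first member of each group is arbitrarily taken as its representative.

```python
from collections import defaultdict


def summarize_architectures(query_superfamilies, proteins):
    """Group proteins by architecture, then score and rank the architectures.

    proteins is an iterable of (accession, sequence, architecture), where
    architecture is the ordered sequence of superfamily IDs in the protein.
    """
    query = set(query_superfamilies)

    # architecture -> sequence -> accessions; each distinct sequence is one
    # protein identity group (PIG).
    pigs = defaultdict(lambda: defaultdict(list))
    for accession, sequence, architecture in proteins:
        pigs[tuple(architecture)][sequence].append(accession)

    rows = []
    for architecture, groups in pigs.items():
        # Each shared superfamily counts once, however often it repeats.
        score = len(query & set(architecture))
        if score == 0:
            continue  # must share at least one superfamily with the query
        rows.append({
            "architecture": architecture,
            "score": score,
            "total_nr": len(groups),
            "representatives": [members[0] for members in groups.values()],
        })

    # Rank by score, then by the number of non-redundant sequences.
    rows.sort(key=lambda row: (-row["score"], -row["total_nr"]))
    return rows


example_proteins = [
    ("P1", "MKTAA", ("sfA", "sfB")),
    ("P2", "MKTAA", ("sfA", "sfB")),          # identical to P1: same PIG
    ("P3", "MKTAV", ("sfA", "sfB")),
    ("P4", "MAAAA", ("sfB", "sfA", "sfC")),
    ("P5", "MCCCC", ("sfA", "sfA", "sfD")),   # repeat of sfA counts once
    ("P6", "MDDDD", ("sfE",)),                # shares nothing with the query
]
for row in summarize_architectures(["sfA", "sfB"], example_proteins):
    print(row)
```

In this example the architecture (sfA, sfB) and the architecture (sfB, sfA, sfC) both score 2, so the tie is broken by the number of non-redundant sequences, and the architecture containing only sfE is not reported at all.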

Illustration of the expanded view of a domain architecture, which displays a list of representative, non-redundant protein sequences that have that architecture.
 
 
References


Citing CDART:

Geer LY, Domrachev M, Lipman DJ, Bryant SH. CDART: protein homology by domain architecture. Genome Res. 2002 Oct;12(10):1619-23.
 
Revised 09 August 2017