CDTree: protein domain hierarchy viewer & editor
 
 
 
Fasta-to-CD Converter
 
   
 
 
Overview
 
  Also distributed with CDTree is a utility called 'fa2cd'.

This is a stand-alone command-line utility that can convert a multiple alignment in FASTA format (also called 'mFASTA' format) into a conserved domain 'CD file' that can be used as input to CDTree or Cn3D. The standard file extension for such files is '.cn3'. The mFASTA input file should contain, sequentially, one FASTA formatted record for each aligned row and a gap is indicated with the '-' character.
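
As an illustration of the input format described above, the following minimal Python sketch reads mFASTA text and checks that every aligned row has the same length. It is not part of fa2cd; the function names `read_mfasta` and `check_alignment` and the example sequences are made up for this illustration.

```python
def read_mfasta(text):
    """Parse mFASTA text into a list of (defline, aligned_row) pairs."""
    records = []
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the previous one, if any.
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:].strip(), []
        elif defline is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


def check_alignment(records):
    """Raise ValueError unless every aligned row has the same length."""
    lengths = {len(row) for _, row in records}
    if len(lengths) > 1:
        raise ValueError(f"rows have differing lengths: {sorted(lengths)}")


# Invented three-row alignment; '-' marks a gap.
example = """>seq1 first row
MKT-LV
>seq2 second row
MRTALV
>seq3 third row
M-TALI
"""
rows = read_mfasta(example)
check_alignment(rows)
```

Running such a check before conversion can catch rows that were truncated or padded inconsistently when the mFASTA file was produced.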

By default, fa2cd performs a formal conversion of the input mFASTA alignment, assigns a sequential, numeric identifier to each input sequence and chooses a 'master' sequence. It does not attempt to validate accessions or sequence data in the input against NCBI or other database resources.

The following sections provide some information on modifying the program's default functionality, and hints for the more prominent command-line options. A description of all command-line options is obtained via the '-help' flag, and a short summary is available with '-h'.

NOTE: fa2cd does not do the reverse conversion and create a FASTA file from a CD file. However, you may open a CD file in Cn3D and in the alignment window use the 'View->Export' menu command for this purpose.

 
 
 
Specifying a Master Sequence (the '-mr' option)
 
  For a file read by CDTree or Cn3D, the first sequence in the alignment is the reference, or 'master' sequence. Whenever possible, CDD domain models choose a sequence with an experimentally determined 3D structure as the master so as to better inform the resulting multiple alignment.

By default, the converter chooses the 'best' possible master sequence as determined by a heuristic that seeks to maximize the number of residues on the master aligned to all sequences, while minimizing the number of gaps. The '-mr' and '-keepMaster' options (the latter being equivalent to -mr 1) allow the user to force a specific master to be chosen. If both are provided, the -mr flag is used.
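
The exact heuristic used by fa2cd is not documented here. The Python sketch below shows one plausible reading of it, under stated assumptions: each row is scored by the number of its residues that pair with a residue in every other row (summed over the other rows), and ties are broken in favor of the row with fewer gaps. The function name `pick_master`, the scoring rule, and the 1-based return value (chosen to mirror '-mr 1' referring to the first row) are all assumptions of this sketch.

```python
def pick_master(rows):
    """Return the 1-based number of the row with the best score.

    Assumed score: total residue-to-residue pairings with all other rows,
    with fewer gaps in the candidate row used as a tie-breaker.
    """
    def score(i):
        cand = rows[i]
        paired = sum(
            sum(1 for a, b in zip(cand, other) if a != "-" and b != "-")
            for j, other in enumerate(rows)
            if j != i
        )
        return (paired, -cand.count("-"))

    best = max(range(len(rows)), key=score)
    return best + 1


# Invented example: the second row has no gaps and pairs best with the others.
aligned = ["MK--LV", "MKTALV", "M-TALI"]
master_row = pick_master(aligned)
```

If the alignment includes a sequence with a known 3D structure, forcing that row with '-mr' may still be preferable to any automatic choice.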

 
 
 
Sequence Identifiers (the '-parseIds' option)
 
  By default, the converter simply assigns a numeric identifier to each sequence in the input file, and the defline for each sequence (the preamble line starting with a '>' character) is preserved as a plain-text description elsewhere in the output CD file. If you provide the '-parseIds' option, the converter will instead try to interpret the identifiers it finds in the defline based on the standard formatting used at NCBI. (Additional non-NCBI sequence identifier formats will not be supported by this utility.)

When using '-parseIds', if an identifier is not parsed as a recognizable type, it will simply be treated as free-form text. Should there be no text found that can be used as an identifier, one will be assigned as in the default situation. You may encounter parser warnings in this mode; if the output is unacceptable in such cases you may have to use the default behavior instead.
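
The fallback order described above (recognized identifier, then free-form text, then an assigned number) can be sketched as follows. This is a hedged illustration only: the set of bar-delimited tags, the function name `classify_defline`, and the returned labels are assumptions, and fa2cd's actual parser may recognize a different set of formats. The deflines in the example are invented.

```python
# A few database tags used in NCBI-style bar-delimited deflines (assumed subset).
KNOWN_TAGS = {"gi", "gb", "emb", "dbj", "ref", "sp", "pdb", "pir", "prf", "lcl", "gnl"}


def classify_defline(defline, row_number):
    """Return (kind, identifier) for a defline given without its leading '>'."""
    tokens = defline.split()
    first = tokens[0] if tokens else ""
    fields = first.split("|")
    if len(fields) >= 2 and fields[0] in KNOWN_TAGS and fields[1]:
        # Looks like a tagged identifier such as 'tag|value|...'.
        return ("ncbi", first)
    if first:
        # Not a recognized format: keep the first word as free-form text.
        return ("text", first)
    # Nothing usable: fall back to a sequential number, as in the default mode.
    return ("numeric", str(row_number))


results = [
    classify_defline("gi|12345|ref|XX_000000.1| invented protein", 1),
    classify_defline("my_seq_A kinase domain", 2),
    classify_defline("", 3),
]
```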

To reiterate the above caution: no validation is performed that an identifier in fact corresponds to a valid accession for any sequence database, nor that the input sequence data is in fact the same sequence associated with the identifier in the source sequence database.

 
 
 
Intersection by Master (the '-ibm' option)
 
  One important distinction between mFASTA-represented alignments and the alignment representation in a CD file should be kept in mind when using this converter. In general, an mFASTA alignment may have a residue in the first (i.e., reference) sequence aligned to a gap in one or more of the other sequences in the alignment. All conserved domains in the CDD, however, have a 'block structure' in which every aligned residue in the reference sequence is aligned to a residue on each sequence in the multiple alignment. An algorithm termed 'intersection by master' (IBM) can generate a multiple alignment obeying the CDD constraints from one that does not by simply truncating any alignment column containing a gap. (Clearly, the reverse transformation is not possible.)
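
The column truncation described above can be sketched in a few lines of Python. This is a minimal illustration of the idea as stated in this section, not fa2cd's implementation; the function names `intersect_by_master` and `column_blocks` and the example rows are assumptions made for the sketch.

```python
def intersect_by_master(rows):
    """Drop every alignment column in which any row has a gap ('-')."""
    keep = [c for c in range(len(rows[0])) if all(r[c] != "-" for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]


def column_blocks(rows):
    """Return the gap-free columns as (start, end) runs, 0-based inclusive."""
    blocks = []
    for c in range(len(rows[0])):
        if any(r[c] == "-" for r in rows):
            continue
        if blocks and blocks[-1][1] == c - 1:
            # Extend the current run of adjacent gap-free columns.
            blocks[-1] = (blocks[-1][0], c)
        else:
            blocks.append((c, c))
    return blocks


# Invented example: columns 1 and 3 each contain a gap and are removed.
aligned = ["MKT-LV", "MRTALV", "M-TALI"]
trimmed = intersect_by_master(aligned)
blocks = column_blocks(aligned)
```

Because the removed columns cannot be recovered from the trimmed rows, the sketch also shows why the reverse transformation is not possible.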

The fa2cd converter can output a CD file from the input mFASTA with or without running the IBM algorithm. By default, IBM is not run and the input alignment is preserved without truncation. The '-ibm' flag truncates the input alignment to remove gap-containing columns, consistent with CDD conventions. Neither CDTree nor Cn3D requires that an input multiple alignment have a valid block structure in the sense described. Note, however, that a) Cn3D's alignment viewer displays alignments as if IBM were run, and b) those CDTree functions that assume a CDD-style alignment may fail and generate warning messages.

 
 
 
 
Revised 26 September 2016