Submission of Annotation Using a Table

Introduction

The five-column, tab-delimited feature table format allows different kinds of features (e.g., gene, mRNA, coding region, tRNA) and qualifiers (e.g., /product, /note) to be annotated. The valid features and qualifiers are restricted to those approved by the International Nucleotide Sequence Database Collaboration. The command-line utility tbl2asn, available via ftp, automates parts of the submission process: it reads a template, along with the sequence and table files, and outputs ASN.1 for submission to GenBank.

When submitting an annotated prokaryotic or eukaryotic genome, please review the genome guidelines and appropriate annotation details for prokaryotes or eukaryotes.

Table Layout

The five-column, tab-delimited feature table specifies the location and type of each feature. The first line of the table contains the following basic information:

>Feature SeqId table_name

The sequence identifier (SeqId) must be the same as that used on the sequence. The table_name is optional. Subsequent lines of the table list the features. Each feature starts on a separate line. Qualifiers describing that feature follow on the lines below it, one qualifier per line. Columns are separated by tabs.

Line 1:
Column 1: Start location of feature
Column 2: Stop location of feature
Column 3: Feature key
Line 2:
Column 4: Qualifier key
Column 5: Qualifier value
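
As a rough illustration of this layout, the following Python sketch reads table text into a simple list of features. The function name parse_feature_table and the dictionary layout are hypothetical choices made for this example; they are not part of tbl2asn. The sketch assumes well-formed input, does not check feature or qualifier keys against the approved list, and simply skips bracketed directive lines such as [offset=2000].

```python
def parse_feature_table(text):
    """Parse five-column feature table text into (seq_id, features).

    Hypothetical helper for illustration only; it performs no validation.
    """
    seq_id = None
    features = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        if raw.startswith(">Feature"):
            # Header line: >Feature SeqId [table_name]
            seq_id = raw.split()[1]
            continue
        if raw.startswith("["):
            continue  # directive lines such as [offset=2000] are skipped here
        cols = raw.split("\t")
        cols += [""] * (5 - len(cols))  # pad short lines to five columns
        start, stop, key, qual_key, qual_val = cols[:5]
        if start and stop:
            if key:
                # Columns 1-3: a new feature with its first interval
                features.append(
                    {"key": key, "intervals": [(start, stop)], "qualifiers": []}
                )
            else:
                # Columns 1-2 only: another interval of the current feature
                features[-1]["intervals"].append((start, stop))
        elif qual_key:
            # Columns 4-5: a qualifier of the current feature
            features[-1]["qualifiers"].append((qual_key, qual_val))
    return seq_id, features
```

Start and stop values are kept as strings so that any partial-feature symbols in columns 1 and 2 survive parsing.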

Figure 1 shows a sample table and illustrates a number of points about the table format. The GenBank flatfile corresponding to this table is shown in Figure 2.

Features that are on the complementary strand, such as the gene YPR027C and tRNA-Phe, are indicated by reversing the interval locations.
Locations of partial (incomplete) features are indicated with a ">" or "<" in front of the nucleotide location. The "<" symbol always appears in column 1 and ">" always appears in column 2, regardless of the strandedness of the feature. In this example, the first gene, CDS, and mRNA all begin upstream of the start of the nucleotide sequence. The "<" symbol indicates that they are 5' partial features.
For a CDS that is partial at its 5' end to translate correctly, the position of the first base of the first complete codon must be given with the qualifier "codon_start". This is not the reading frame of the entire sequence; it is the nucleotide position within the CDS. In the example, nucleotide 2 begins the first complete codon of the acid trehalase CDS. The default codon_start is 1, so there is no need to indicate codon_start on complete CDSs, where translation always begins at the first nucleotide of the interval.
If a feature contains multiple intervals, like the spliced tRNA-Phe or the Yip2p CDS, each interval is listed on a separate line by its start and stop position before subsequent qualifier lines.
Gene features are always a single interval, and their location should cover the intervals of all the relevant features. For example, the gene YIP2 is as long as its mRNA, and is thus longer than its CDS.
If the gene feature spans the intervals of the CDS or mRNA features for that gene, there is no need to include gene qualifiers on those features in the table, because they will be picked up by overlap. For example, in the flatfile, the gene names ATH1 and YPR027C are present as /gene on the overlapping CDS, even though they are not explicitly listed as gene qualifiers on those CDSs in the table. This option can be suppressed by adding a gene qualifier with the value '-' to the feature. Suppressing the overlapping /gene is important when, for example, a tRNA is encoded within an intron of a housekeeping gene.
If a protein has more than one name, each name can be listed as a separate product qualifier on the CDS in the table. The value of the first product qualifier will become the /product on the CDS in the flatfile, and any additional product qualifiers will be shown as a /note on the CDS in the flatfile. See the first CDS, which has two product qualifiers, acid trehalase and Ath1p. All CDS features must have at least one product.
A flatfile /note can be added to any feature using the qualifier note in the table. A note has been added to the second CDS.
Published citations are added using the REFERENCE feature. For most publications, the start and stop of the feature are the first and last nucleotides of the sequence. The qualifier key is PubMed, and the value is the PubMed Identifier (PMID), which can be found in PubMed.
The [offset] line adds a specified number to all subsequent nucleotide intervals. In this example, the record was annotated in two pieces, each piece starting from residue number 1. The sequences themselves were joined together in the FASTA file. The [offset=2000] adds 2000 nt to the location of all features that follow it, sparing the submitter the need to recalculate the location of each feature. This option could be used if the feature intervals for two arms of a chromosome or adjacent contigs are stored separately, but need to be joined for the final submission.
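
The interval rules above (reversed coordinates for the complementary strand, "<" and ">" for partial ends, multiple intervals, and [offset]) can be sketched in Python as a conversion from table intervals to a flatfile-style location string. The function name to_location is hypothetical, and the code is an illustration under stated assumptions rather than the logic tbl2asn actually uses. In particular, it assumes that a partial symbol stays with its coordinate when a minus-strand interval is written in ascending order, and it treats an interval as minus-strand only when start is greater than stop, so a single-base interval is always read as plus-strand.

```python
def to_location(intervals, offset=0):
    """Build a flatfile-style location from table intervals.

    intervals: list of (start, stop) strings from columns 1 and 2.
    offset: value of the most recent [offset=N] line, or 0 if none.
    Hypothetical helper for illustration only.
    """
    spans = []
    minus = False
    for start, stop in intervals:
        partial_start = start.startswith("<")  # "<" only occurs in column 1
        partial_stop = stop.startswith(">")    # ">" only occurs in column 2
        a = int(start.lstrip("<")) + offset
        b = int(stop.lstrip(">")) + offset
        if a > b:
            # Reversed interval: feature is on the complementary strand
            minus = True
            low, high = b, a
            low_partial, high_partial = partial_stop, partial_start
        else:
            low, high = a, b
            low_partial, high_partial = partial_start, partial_stop
        spans.append(
            f"{'<' if low_partial else ''}{low}..{'>' if high_partial else ''}{high}"
        )
    if minus:
        spans.reverse()  # list minus-strand intervals in ascending order
    text = ",".join(spans)
    if len(spans) > 1:
        text = "join(" + text + ")"
    if minus:
        text = "complement(" + text + ")"
    return text
```

With the values from the sample table, to_location([("1253", "420")], offset=2000) gives complement(2420..3253), and the two tRNA-Phe intervals give complement(join(4535..4570,4590..4626)).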

Figure 1 : Feature Table Format

>Feature Sc_16
1    7000    REFERENCE
                        PubMed         8849441
<1    1050    gene
                        gene           ATH1
<1    1009    CDS
                        product        acid trehalase
                        product        Ath1p
                        codon_start    2
<1    1050    mRNA
                        product        acid trehalase
[offset=2000]
1253    420    gene
                        gene           YPR027C
1253    420    CDS
                        product        Ypr027cp
                        note           hypothetical protein
1253    420    mRNA
                        product        Ypr027cp
2626    2535    gene
                        gene           trnF
2626    2590    tRNA
2570    2535
                        product        tRNA-Phe
2626    2590    exon
                        number         1
2570    2535    exon
                        number         2
3450    4536    gene
                        gene           YIP2
3522    3572    CDS
3706    4197
                        product        Yip2p
                        prot_desc      similar to human polyposis locus protein 1 (YPD)
3450    3572    mRNA
3706    4536
                        product        Yip2p
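
None of the CDS features in the table above carries an explicit gene qualifier; the /gene values come from the overlap rule described earlier. A minimal Python sketch of that rule follows. The function gene_for and its arguments are hypothetical, the sketch ignores strand, and when several genes contain the feature it returns the first one listed, which is an assumption rather than documented behavior.

```python
def gene_for(feature_span, feature_qualifiers, genes):
    """Return the /gene value for a feature, or None if it has none.

    feature_span: (low, high) covering all intervals of the feature.
    feature_qualifiers: list of (key, value) pairs from the table.
    genes: list of (low, high, name) for the gene features.
    Hypothetical helper for illustration only; strand is ignored.
    """
    for key, value in feature_qualifiers:
        if key == "gene":
            # An explicit qualifier wins; the value "-" suppresses /gene
            return None if value == "-" else value
    low, high = feature_span
    for gene_low, gene_high, name in genes:
        # The gene must span the whole feature to be picked up by overlap
        if gene_low <= low and high <= gene_high:
            return name
    return None


genes = [(1, 1050, "ATH1"), (2420, 3253, "YPR027C")]
print(gene_for((1, 1009), [("product", "acid trehalase")], genes))
print(gene_for((2420, 3253), [("gene", "-")], genes))
```

The first call picks up ATH1 by overlap; the second returns None because the '-' value suppresses the overlapping gene.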

Figure 2 : GenBank Flatfile

LOCUS       Sc_16        7000 bp    DNA             PLN       08-MAY-2000
DEFINITION  Saccharomyces cerevisiae strain S288C chromosome XVI, partial sequence.
ACCESSION   Sc_16
VERSION
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 7000)
  AUTHORS   Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
            Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,
            Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and
            Oliver,S.G.
  TITLE     Life with 6000 genes
  JOURNAL   Science 274 (5287), 546 (1996)
   PUBMED   8849441
REFERENCE   2  (bases 1 to 7000)
  AUTHORS   Ouellette,B.F.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-MAY-2000) NCBI/NLM, National Institutes of Health,
            Building 38A, Room 8N805, Bethesda, MD 20894, USA
FEATURES             Location/Qualifiers
     source          1..7000
                     /organism="Saccharomyces cerevisiae"
                     /strain="S288C"
                     /chromosome="XVI"
     mRNA            <1..1050
                     /gene="ATH1"
                     /product="acid trehalase"
     gene            <1..1050
                     /gene="ATH1"
     CDS             <1..1009
                     /gene="ATH1"
                     /note="Ath1p"
                     /codon_start=2
                     /product="acid trehalase"
                     /translation="DHNGTIVHKSGDVPIHIKIPNRSLIHDQDINFYNGSENERKPNL
                     ERRDVDRVGDPMRMDRYGTYYLLKPKQELTVQLFKPGLNARNNIAENKQITNLTAGVP
                     GDVAFSALDGNNYTHWQPLDKIHRAKLLIDLGEYNEKEITKGMILWGQRPAKNISISI
                     LPHSEKVENLFANVTEIMQNSGNDQLLNETIGQLLDNAGIPVENVIDFDGIEQEDDES
                     LDDVQALLHWKKEDLAKLIEQIPRLNFLKRKFVKILDNVPVSPSEPYYEASRNQSLIE
                     ILPSNRTTFTIDYDKLQVGDKGNTDWRKTRYIVVAVQGVYDDYDDDNKGATIKEIVLN
                     D"
     mRNA            complement(2420..3253)
                     /gene="YPR027C"
                     /product="Ypr027cp"
     gene            complement(2420..3253)
                     /gene="YPR027C"
     CDS             complement(2420..3253)
                     /gene="YPR027C"
                     /note="hypothetical protein"
                     /codon_start=1
                     /product="Ypr027cp"
                     /translation="MVGIYRILASFVPLLGLLFAFHDDDMIDTVTIIKTVYETVTSTS
                     TAPAPAATKSVSEKKLDDTKLTLQVIQTMVSCFSVGENPANMISCGLGVVILMFSLII
                     ELINKLENDGINEPQRLYDLIKPKYVELPSNYVNEKIKTTFEPLDLYLGVNMNTSGSE
                     LNQNCLILKLGEKTALPFPGLAQQICYTKGASNEFTNYKLSDIQGNLNENSQGIANGV
                     FQKISNIRKISGNFKSQLYQISEKITDENWDGSAVGFTAHGREKGPNKSQISVSFYRD
                     N"
     gene            complement(4535..4626)
                     /gene="trnF"
     tRNA            complement(join(4535..4570,4590..4626))
                     /product="tRNA-Phe"
                     /gene="trnF"
     exon            complement(4535..4570)
                     /number=2
     exon            complement(4590..4626)
                     /number=1
     mRNA            join(5450..5572,5706..6536)
                     /gene="YIP2"
                     /product="Yip2p"
     gene            5450..6536
                     /gene="YIP2"
     CDS             join(5522..5572,5706..6197)
                     /gene="YIP2"
                     /note="similar to human polyposis locus protein 1 (YPD)"
                     /codon_start=1
                     /product="Yip2p"
                     /translation="MSEYASSIHSQMKQFDTKYSGNRILQQLENKTNLPKSYLVAGLG
                     FAYLLLIFINVGGVGEILSNFAGFVLPAYLSLVALKTPTSTDDTQLLTYWIVFSFLSV
                     IEFWSKAILYLIPFYWFLKTVFLIYIALPQTGGARMIYQKIVAPLTDRYILRDVSKTE
                     KDEIRASVNEASKATGASVH"
BASE COUNT     2201 a   1276 c   1255 g   2268 t
ORIGIN
        1 cgaccacaat ggtacgattg ttcataaatc aggagatgtt cctattcata taaagatacc
       61 aaacagatct ctaatacatg accaggatat caacttctat aatggttccg aaaacgaaag
      121 aaaaccaaat ctagagcgta gagacgtcga ccgtgttggt gatccaatga ggatggatag [etc.]
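
The effect of codon_start on the 5' partial ATH1 CDS can be checked against the first bases of the sequence above. The Python sketch below only splits a CDS into complete codons; the function name codons is hypothetical, and no translation table is included.

```python
def codons(cds_nucleotides, codon_start=1):
    """Split a CDS into complete codons, beginning at codon_start.

    codon_start is the 1-based position, within the CDS, of the first
    base of the first complete codon. Hypothetical helper for illustration.
    """
    seq = cds_nucleotides[codon_start - 1:]
    usable = len(seq) - len(seq) % 3  # drop a trailing incomplete codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# First 16 bases of the sequence in Figure 2, read with codon_start=2
print(codons("cgaccacaatggtacg", codon_start=2))
```

With codon_start=2 the leading c is skipped and the codons are gac, cac, aat, ggt, acg, which under the standard genetic code encode D, H, N, G, T, the first residues of the /translation shown for the acid trehalase CDS.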
