Gapped Format for Genome Submissions
If your contig sequences include runs of N's that represent gaps, you will need to include assembly_gap features with the appropriate linkage evidence. Generally, you can generate a gapped submission with table2asn (the replacement for the now-obsolete tbl2asn), available from FTP, as described below. Note that the table2asn arguments to convert Ns to gaps are different from those of tbl2asn.
Note that every run of 10 or more Ns is recognized as a gap when the assembly statistics are calculated in NCBI's Assembly resource.
Requirements
- Each record must represent a sequence that occurs biologically in the organism. Do NOT manually use N's to randomly combine the contigs to create a single sequence; you must know the order and orientation of the contigs.
- Do not include any artificial sequences, such as linkers with multiple stop codons, in the submitted genome.
- Do not add assembly_gaps for Ns that represent ambiguous base calls. You may need to check the parameters of the assembler that was used to determine what the N's represent.
To convert the runs of Ns to assembly_gaps, you need to know:
- the linkage evidence for each gap
- the minimum number of N's in a row (i.e., 'run of Ns') that represents a gap
- whether any runs of Ns represent gaps of unknown size
- if the sequences also include N's that are ambiguous base calls, the length of the longest run of ambiguous bases. To use these simple instructions, the maximum number of Ns in a row that are ambiguous bases must be less than the minimum number of N's in a row that represents a gap.
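To answer the questions above, it can help to tally how often each run length of Ns occurs in the contigs. The following is a minimal sketch, not part of table2asn; the function names `read_fasta` and `n_run_lengths` are hypothetical, and the input is assumed to be plain FASTA text.

```python
import re
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_run_lengths(sequence):
    """Return a Counter mapping run length -> number of runs of N with that length."""
    return Counter(len(m.group()) for m in re.finditer(r"[Nn]+", sequence))


# Hypothetical two-line record: one run of 2 Ns, one run of 12 Ns, one single N.
example = [">contig1", "ACGTNNACGT", "N" * 12 + "ACGTNACGT"]
for name, seq in read_fasta(example):
    print(name, sorted(n_run_lengths(seq).items()))
```

A tally with a clear separation between short runs (likely ambiguous bases) and longer runs (likely gaps) suggests where the minimum gap length lies, but the assembler's parameters remain the authoritative source.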
Gap Details
There are two types of gap lengths:
- Estimated length: The approximate gap size is known. This is also used if the gap is known to be small (e.g., the gap could be between 10 and 50 N's).
- Unknown length: The gap size is not known (e.g. gap could be 50 or 50000 N's) but the order and orientation of the contigs are known. We suggest using 100 N's to represent gaps of unknown length when possible.
To interpret what the runs of Ns in a sequence represent, use these arguments:
- -gaps-min : the minimum number of Ns in a row that represents a gap
- -gaps-unknown : exact number of Ns in a row that represents a gap of completely unknown length
- -l (lowercase 'l' as in 'linkage') : type of evidence used to assert linkage across the gaps. These are the available options (they correspond to the options for column 9 of an AGP file):
  - paired-ends (i.e., for paired ends or mate pairs, e.g., from Illumina or PacBio)
  - map
  - proximity-ligation (i.e., from Hi-C)
  - align-genus
  - align-xgenus
  - align-trnscpt (i.e., the evidence is a transcript)
- -linkage-evidence-file : to call a 2-column file of gap lengths and their linkage evidences when gaps of specific lengths have different linkage evidences
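The way -gaps-min and -gaps-unknown partition runs of Ns, as described above, can be modeled with a small function. This is only a sketch of the rules as stated in this document, not table2asn's actual implementation; the name `classify_n_run` is hypothetical, and giving the unknown-length value precedence over the minimum is an assumption.

```python
def classify_n_run(length, gaps_min, gaps_unknown=None):
    """Classify a run of Ns according to the rules described in the text.

    length       -- number of consecutive Ns in the run
    gaps_min     -- minimum run length that represents a gap (-gaps-min)
    gaps_unknown -- exact run length that represents a gap of unknown
                    length (-gaps-unknown), or None if there are none
    """
    # Assumption: an exact match to gaps_unknown is checked first.
    if gaps_unknown is not None and length == gaps_unknown:
        return "unknown-length gap"
    if length >= gaps_min:
        return "estimated-length gap"
    return "ambiguous bases"


for run in (3, 50, 100):
    print(run, classify_n_run(run, gaps_min=10, gaps_unknown=100))
```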
Common Cases
- All the gaps are of estimated lengths
- Both estimated length and unknown length gaps are present
- Different linkage evidences have unique estimated lengths
- Complex cases
1. All the gaps are of estimated lengths
Use -gaps-min to indicate the minimum run of N's to be converted to an estimated length gap. For example, if all of the gaps are estimated length (there are no unknown length gaps) and runs of 5 or more N's are estimated gaps and shorter runs of N's are ambiguous bases, then use -gaps-min 5. Similarly, if every N represents an estimated length gap, use -gaps-min 1.
Example: Every run of 5 or more Ns represents a gap of estimated length, and the linkage evidence is paired-ends:
- table2asn -indir path_to_fsa_files -t template -M n -Z -gaps-min 5 -l paired-ends
Note that you should only include an assembly_gap for runs of N's that represent gaps. Do not add assembly_gaps for single or short runs of N's that represent ambiguous bases. You will need to check your assembly parameters to determine what the N's represent.
2. Both estimated length and unknown length gaps are present
When there are both estimated length and unknown length gaps with the same linkage evidence and the unknown length gaps are a single unique length, you need to include -l for the linkage evidence, -gaps-unknown for the length of runs of N's that represent unknown length gaps, and -gaps-min for the minimum runs of N's to convert to estimated length gaps.
Example: if runs of 10 or more N's are estimated gaps, and shorter runs of N's are just ambiguous bases, and all runs of exactly 100 N's are unknown gaps, and the linkage evidence is align-genus:
- table2asn -indir path_to_fsa_files -t template -M n -Z -gaps-min 10 -gaps-unknown 100 -l align-genus
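If the command lines for cases 1 and 2 are assembled by a script, a helper along these lines could reduce typing mistakes such as an invalid linkage evidence value. This is a sketch that only uses the arguments shown in this document; the function name `table2asn_args` is hypothetical, and the paths are the same placeholders used in the examples.

```python
import shlex

# Linkage evidence values listed in this document for the -l argument.
LINKAGE_EVIDENCE = {
    "paired-ends", "map", "proximity-ligation",
    "align-genus", "align-xgenus", "align-trnscpt",
}


def table2asn_args(indir, template, gaps_min, linkage, gaps_unknown=None):
    """Build the argument list for a gapped table2asn run (cases 1 and 2)."""
    if linkage not in LINKAGE_EVIDENCE:
        raise ValueError(f"unsupported linkage evidence: {linkage!r}")
    args = ["table2asn", "-indir", indir, "-t", template,
            "-M", "n", "-Z", "-gaps-min", str(gaps_min)]
    if gaps_unknown is not None:
        # Only needed when runs of one exact length are unknown-length gaps.
        args += ["-gaps-unknown", str(gaps_unknown)]
    args += ["-l", linkage]
    return args


# Case 2 from the text; prints the command line without running it.
print(shlex.join(table2asn_args("path_to_fsa_files", "template", 10,
                                "align-genus", gaps_unknown=100)))
```

The returned list could be passed to `subprocess.run` once table2asn is installed and on the PATH.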
3. Different linkage evidences have unique estimated lengths
If there are gaps of specific lengths with different linkage evidences, then you can make a 2-column file of those lengths and their linkage evidences, and call that file with the -linkage-evidence-file argument. Note that -gaps-min and -l can also be included to provide a single linkage evidence for all runs of Ns other than the lengths in that file.
For example, if runs of 100 Ns have proximity-ligation evidence, and runs of 150 Ns are linked by map evidence, and all other runs of 10 or more Ns have paired-end evidence, then you would have a file like this:
gaps_file:
100 proximity-ligation
150 map
and a command line like this:
- table2asn -indir path_to_fsa_files -t template -M n -Z -gaps-min 10 -linkage-evidence-file gaps_file -l paired-ends
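The 2-column file for case 3 could also be generated from a mapping of gap lengths to linkage evidence. The sketch below is illustrative only: the function name `linkage_evidence_lines` is hypothetical, and the tab separator is an assumption, since this document only says the file has two columns.

```python
# Linkage evidence values listed in this document for the -l argument.
LINKAGE_EVIDENCE = {
    "paired-ends", "map", "proximity-ligation",
    "align-genus", "align-xgenus", "align-trnscpt",
}


def linkage_evidence_lines(evidence_by_length, sep="\t"):
    """Return the text of a 2-column file: gap length, then linkage evidence.

    Assumption: columns are separated by a tab; adjust `sep` if needed.
    """
    lines = []
    for length in sorted(evidence_by_length):
        evidence = evidence_by_length[length]
        if evidence not in LINKAGE_EVIDENCE:
            raise ValueError(f"unsupported linkage evidence: {evidence!r}")
        lines.append(f"{length}{sep}{evidence}")
    return "\n".join(lines) + "\n"


# Same content as the gaps_file example in the text.
text = linkage_evidence_lines({100: "proximity-ligation", 150: "map"})
print(text, end="")
```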
4. Complex cases
If there are different kinds of linkage evidence for gaps of the same lengths, then the assembly_gap features need to be included in an annotation .tbl file. They are set up with the appropriate gap-type and linkage evidence as in the example, where the first gap of 100 Ns has align-genus evidence and the second run of 100 Ns has paired-ends evidence:
100	201	assembly_gap
			gap_type	within scaffold
			linkage_evidence	align-genus
420	521	assembly_gap
			gap_type	within scaffold
			linkage_evidence	paired-ends
In this case, the command line does not include any gap information because all of the information is in the .tbl file.
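For assemblies with many gaps, the assembly_gap entries could be written by a script rather than by hand. The following sketch makes several assumptions that are not stated in this document: coordinates are 1-based and inclusive, columns are tab-separated in the usual 5-column feature table layout, and each sequence's features follow a `>Feature` header line. The function names `find_gaps` and `format_assembly_gaps` are hypothetical, and the linkage evidence for each gap must still be supplied by the submitter.

```python
import re


def find_gaps(sequence, gaps_min):
    """Return 1-based inclusive (start, stop) for each run of Ns >= gaps_min."""
    return [(m.start() + 1, m.end())
            for m in re.finditer(r"[Nn]+", sequence)
            if m.end() - m.start() >= gaps_min]


def format_assembly_gaps(seq_id, gaps_with_evidence):
    """Format (start, stop, linkage_evidence) tuples as feature table text."""
    lines = [f">Feature {seq_id}"]  # assumed standard feature table header
    for start, stop, evidence in gaps_with_evidence:
        lines.append(f"{start}\t{stop}\tassembly_gap")
        lines.append("\t\t\tgap_type\twithin scaffold")
        lines.append(f"\t\t\tlinkage_evidence\t{evidence}")
    return "\n".join(lines) + "\n"


# Hypothetical contig with two gaps of the same length but different evidence.
seq = "ACGT" + "N" * 10 + "ACGT" + "N" * 10 + "AC"
evidence = ["align-genus", "paired-ends"]  # one entry per gap, in order
gaps = [(start, stop, ev)
        for (start, stop), ev in zip(find_gaps(seq, 10), evidence)]
print(format_assembly_gaps("contig1", gaps), end="")
```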
Annotation FYI
Annotation is not required. However, if you would like to annotate the gapped sequences, you need to be careful about crossing gaps.
The exon(s) of a CDS may not cross the gap if the gap size is unknown. Instead, you could have two partial CDS features (and mRNAs in eukaryotes) that abut the gap, with a single gene over the whole locus. Alternatively, one of the partial CDS/mRNA features may be deleted if it is very short and there is little or no supporting evidence. If you have a single gene and two partial CDS/mRNA features, you should: (1) add a note to each CDS referencing the other half of the gene, (2) add a note to the gene and CDS features stating, "gap found within coding sequence."
A CDS can cross the gap if the gap size is estimated; however, a CDS (or mRNA) should not cross a gap such that over 50% of the translation is X (i.e., in the gap). This situation will generate an error. Again, the CDS/mRNA should either be partial up to the gap or split into two partial CDS/mRNA features on either side of the gap, depending upon your confidence in the translation on each side of the gap.
In addition, no feature should begin or end inside a gap. Instead, the feature should abut the gap and be partial.
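The two checks above can be approximated before submission. This is a rough sketch, not the validator's logic: it uses the fraction of CDS bases that lie inside gaps as a stand-in for the fraction of X in the translation, assumes 1-based inclusive coordinates and non-overlapping gaps, and uses the hypothetical function names `gap_fraction` and `endpoint_in_gap`.

```python
def overlap(a_start, a_stop, b_start, b_stop):
    """Number of positions shared by two 1-based inclusive intervals."""
    return max(0, min(a_stop, b_stop) - max(a_start, b_start) + 1)


def gap_fraction(cds_exons, gaps):
    """Fraction of CDS bases that fall inside gaps (gaps must not overlap)."""
    total = sum(stop - start + 1 for start, stop in cds_exons)
    in_gap = sum(overlap(es, ee, gs, ge)
                 for es, ee in cds_exons for gs, ge in gaps)
    return in_gap / total


def endpoint_in_gap(start, stop, gaps):
    """True if a feature begins or ends inside any gap."""
    return any(gs <= pos <= ge for pos in (start, stop) for gs, ge in gaps)


# Hypothetical example: a 300-base CDS crossing a 180-base estimated gap.
gaps = [(101, 280)]
print(gap_fraction([(1, 300)], gaps) > 0.5)   # more than half lies in the gap
print(endpoint_in_gap(150, 400, gaps))        # feature starts inside the gap
```

A CDS flagged by either check would need to be made partial at the gap or split into two partial features, as described above.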
For more information about splitting CDS features, see either the eukaryotic annotation guidelines or the prokaryotic annotation guidelines.
table2asn arguments
-indir path : Path to Files
-t template.sbt : Template File
-M n : Genome Flags Normal
-Z : Discrepancy Report
-l (lowercase 'l', as in 'linkage') Evidence : type of evidence used to assert linkage across assembly_gaps. Must be one of the following:
paired-ends
map
proximity-ligation
align-genus
align-xgenus
align-trnscpt
-linkage-evidence-file File : File listing linkage evidence for gaps of different unique lengths