Discrepancy Report

Introduction

The Discrepancy Report evaluates one or more ASN.1 files for suspicious annotation or annotation discrepancies that NCBI staff have found to occur commonly in genome submissions, both complete and incomplete (WGS). Problems that this function was written to find include inconsistent locus_tag prefixes, missing gene features, and suspect product names. The function is available in a specially configured GenomeWorkbench, as an argument for table2asn_gff and tbl2asn, or with the command-line program asndisc. Note that the test names were improved in 2019, so those in the output from the newer tools differ slightly from those produced by tbl2asn.

Categories prefaced with FATAL should almost always be corrected before submitting to GenBank to avoid processing delays. (The exceptions are FATALs about bacteria when the genome is not bacterial.) Some of the categories are informational. Reports that are not flagged as fatal should be examined to determine if they represent annotation artifacts that need to be corrected or if they are acceptable due to the biology of the genome.

If you have questions about the Discrepancy Report, please contact us by email at genomes@ncbi.nlm.nih.gov prior to sending us your submission.

For more information about annotation requirements, be sure to read the appropriate annotation guidelines (Prokaryotic or Eukaryotic).

You may be interested to know that NCBI has a publicly available Prokaryotic Genomes Annotation Pipeline, which you can request during submission of the genome(s) to GenBank or run yourself.

Using GenomeWorkbench
Using tbl2asn or table2asn_GFF
Using asndisc
Evaluating the output
- Fatal reports
Common Reports
- From GenomeWorkbench, table2asn_GFF, and Submission Portal
- From tbl2asn

Using Genome Workbench

Genome Workbench is available here. To use the Discrepancy Report in Genome Workbench, you must enable the Sequence Editing Package, as described in these instructions.

Once the Sequence Editing Package has been enabled and you have restarted Genome Workbench, open your files and choose Submission->Reports->Submitter Report to generate the report. In the Submitter Report dialog, you can double-click a test name in the left panel to launch a bulk editor for fixing problems, or double-click an item in the right panel to launch an individual editor.

Using table2asn_GFF

To run the discrepancy report using table2asn_gff, include the argument -Z. When run on a single file, the output will be a file with the same base name as the input file but with the suffix ".dr". When run on a directory, the output will be a file with the suffix ".dr" and the basename of the directory on which it was run. For example, a typical command when there is no annotation or when the annotation is in .tbl files would look like this:

table2asn -indir path_to_fsa_files -t template -M n -Z

and would produce an output file named "path_to_fsa_files.dr".
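The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here; the function name expected_report_name and the file name "genome.fsa" are hypothetical, and the sketch returns the file name only, not the directory in which the tool writes it.

```python
from pathlib import PurePosixPath


def expected_report_name(input_path: str, is_directory: bool) -> str:
    """Return the report file name described above (name only, not its location)."""
    path = PurePosixPath(input_path)
    if is_directory:
        # Directory run: basename of the directory plus ".dr".
        return path.name + ".dr"
    # Single-file run: same base name, with the suffix replaced by ".dr".
    return path.with_suffix(".dr").name


print(expected_report_name("path_to_fsa_files", is_directory=True))  # path_to_fsa_files.dr
print(expected_report_name("genome.fsa", is_directory=False))        # genome.dr
```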

For more information, see the table2asn_gff instructions. Then examine the contents of the ".dr" output file as described in the "Evaluating the output" section.

Using tbl2asn

To run the discrepancy report using tbl2asn, include the argument -Z with the name of the output file. For example, a typical command would look like this:

tbl2asn -p path_to_fsa_files -t template -M n -Z discrep

For more information, see the tbl2asn instructions. Then examine the contents of the output file, "discrep", as described in the "Evaluating the output" section.

Using asndisc

The commandline program asndisc is available by anonymous FTP. Copy the version for your platform, uncompress the file, rename it to "asndisc", and set the permissions as necessary for the platform.

asndisc examines all the files with a common suffix in a directory and collates all the discrepancies into an output file. The standard usage runs all of the tests, but specific tests can be enabled or disabled. In addition, expanded reports of particular tests can be generated. Running "asndisc -help" provides the list of arguments and tests. We are actively updating asndisc to reflect what we see in submitted annotation. Please download the most recent version to be sure all of the latest tests are included. Note that the format of the version changed in the fall of 2019.

This is the recommended usage for ASN files that were created using table2asn or GenomeWorkbench:

asndisc  -P u -indir path_to_files -x .file_suffix -o output_file

        -indir    Path to Files [String]
        -x    File Selection Substring [String] (default = '.sqn')
        -o    Single Output File [File Out]
        -X    Expand Report Categories (comma-delimited list of test names or ALL)
        -P u  Run 'genome submitter' set of tests and include FATAL tags in output

For example, the following commandline will run asndisc on all the .sqn files in the directory named DIR and will put the output in a file named discrep:

asndisc -P u -indir DIR/ -x .sqn -o discrep

NOTE: The output file will have the same content as the output file from the table2asn_gff commandline "table2asn -indir path_to_fsa_files -t template -M n -Z"
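If you run asndisc from a script, the recommended commandline can be assembled programmatically. The following Python sketch uses only the arguments listed above (-P u, -indir, -x, -o); the helper names build_asndisc_command and run_asndisc are hypothetical, and the sketch assumes that an executable named "asndisc" is on your PATH.

```python
import subprocess
from typing import List


def build_asndisc_command(indir: str, suffix: str = ".sqn",
                          output: str = "discrep") -> List[str]:
    """Assemble the recommended asndisc commandline as an argument list."""
    return ["asndisc", "-P", "u", "-indir", indir, "-x", suffix, "-o", output]


def run_asndisc(indir: str, suffix: str = ".sqn", output: str = "discrep") -> None:
    """Run asndisc; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_asndisc_command(indir, suffix, output), check=True)


# Show the commandline without running it.
print(" ".join(build_asndisc_command("DIR/")))
# asndisc -P u -indir DIR/ -x .sqn -o discrep
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.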

Examine the contents of the output file, "discrep", as described in the "Evaluating the output" section.

Evaluating the output

In the output file, test results are sorted by category. The top-level categories will be listed in the summary at the top of the file. Some of the reports also have subcategories that contain more descriptive information.

The Discrepancy Report is something of a blunt instrument that reports everything that fails its tests; it does not consider whether those failures are real problems or just a reflection of the biology. Look at the problematic features in the output file and examine those features in the .sqn files to determine whether the problems are real and need to be corrected, or can be ignored because the situation reflects the biology.
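As a starting point for triage, the summary lines can be separated into FATAL and non-FATAL reports with a short script. The Python sketch below assumes that each summary line has the form "TEST_NAME:message", optionally prefixed with "FATAL: ", as in the sample summary shown later on this page; the names ReportLine and parse_summary are hypothetical and are not part of any NCBI tool.

```python
import re
from typing import List, NamedTuple


class ReportLine(NamedTuple):
    fatal: bool
    category: str
    message: str


# Assumed line shape: optional "FATAL: " prefix, TEST_NAME, colon, message.
_LINE = re.compile(r"^(FATAL:\s*)?([A-Z][A-Z0-9_]*):(.*)$")


def parse_summary(text: str) -> List[ReportLine]:
    """Parse top-level summary lines; other lines are skipped."""
    results = []
    for raw in text.splitlines():
        match = _LINE.match(raw.strip())
        if match:  # skips headings, blank lines, and "- " subcategory lines
            results.append(ReportLine(bool(match.group(1)),
                                      match.group(2),
                                      match.group(3).strip()))
    return results


sample = """\
SOURCE_QUALS:strain (all present, all same)
CONTAINED_CDS:3 coding regions are completely contained in another coding region.
- 6 product names contain 'Brackets or parenthesis [] ()'
FATAL: OVERLAPPING_RRNAS:3 rRNA features overlap another rRNA feature.
"""

for line in parse_summary(sample):
    if line.fatal:
        print(f"{line.category}: {line.message}")
# OVERLAPPING_RRNAS: 3 rRNA features overlap another rRNA feature.
```

A script like this only sorts the reports; each one still has to be checked against the features in the .sqn files.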

Fatal reports

LIST of FATAL categories:

BACTERIAL_PARTIAL_NONEXTENDABLE_PROBLEMS *
BAD_LOCUS_TAG_FORMAT
CITSUBAFFIL_CONFLICT
CONTAINED_CDS
EC_NUMBER_ON_UNKNOWN_PROTEIN
EUKARYOTE_SHOULD_HAVE_MRNA
INCONSISTENT_PROTEIN_ID
MAP_CHROMOSOME_CONFLICT
MICROSATELLITE_REPEAT_TYPE
MISSING_AFFIL
MISSING_GENES
MISSING_GENOMEASSEMBLY_COMMENTS
MISSING_PROTEIN_ID
MRNA_SHOULD_HAVE_PROTEIN_TRANSCRIPT_IDS
N_RUNS
OVERLAPPING_RRNAS
PARTIAL_PROBLEMS
PSEUDO_MISMATCH
RBS_WITHOUT_GENE
RIBOSOMAL_SLIPPAGE
RNA_CDS_OVERLAP (when coding regions are completely contained in RNAs)
RRNA_NAME_CONFLICTS
SHORT_RRNA
SHOW_HYPOTHETICAL_CDS_HAVING_GENE_NAME
SOURCE_QUALS (eg, when taxname differs)
SUSPECT_PRODUCT_NAMES
- "Remove organism from product name" category
- "Possible parsing error or incorrect formatting; remove inappropriate symbols" category
TERMINAL_NS
TITLE_AUTHOR_CONFLICT
UNCULTURED_NOTES
UNPUB_PUB_WITHOUT_TITLE

* These categories are FATAL only for prokaryotes. However, the report appears with the FATAL tag whenever the files do not include the full taxonomy lookup, which often does not happen until processing at NCBI. We have kept them as FATAL so that submitters see the error and can decide whether it is relevant for that particular submission. The prokaryote-specific categories are:

BACTERIAL_JOINED_FEATURES_NO_EXCEPTION
BACTERIAL_PARTIAL_NONEXTENDABLE_PROBLEMS
BACTERIA_SHOULD_NOT_HAVE_MRNA

The three categories below find CDS and/or RNA features that overlap or are contained within one another. They are suspicious but weren't marked as FATAL because the situation is sometimes valid. They will always require confirmation from the submitter that they are biologically correct:

OVERLAPPING_CDS
CONTAINED_CDS
RNA_CDS_OVERLAP
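The difference between "overlapping" and "contained" can be illustrated with simple interval arithmetic. The Python sketch below is only an illustration, not the logic used by the Discrepancy Report: it treats each feature as a single (start, end) interval with 1-based inclusive coordinates and ignores strand, joined locations, and feature type. The function name relate is hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def relate(a: Interval, b: Interval) -> str:
    """Classify two intervals as 'disjoint', 'contained', or 'overlapping'."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start or b_end < a_start:
        return "disjoint"
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    if a_in_b or b_in_a:
        # One feature lies completely within the other.
        return "contained"
    # The features share some bases, but neither is inside the other.
    return "overlapping"


print(relate((100, 500), (200, 300)))  # contained
print(relate((100, 500), (400, 800)))  # overlapping
print(relate((100, 200), (300, 400)))  # disjoint
```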

Common reports

The discrepancy report test names were improved in the newest asndisc, so those in the output from the newer tools differ slightly from those in tbl2asn. We hope that the newer names are easier to understand. Here are explanations of some common discrepancy report categories, for the newer and older test names:

Newer test names, from GenomeWorkbench, table2asn_GFF, and Submission Portal
Older test names, from tbl2asn

Here is a summary of the analysis of a submission, performed with the default settings of asndisc:

Summary

SOURCE_QUALS:strain (all present, all same)
SOURCE_QUALS:taxname (all present, all same)
FEATURE_COUNT:gene: 15712 present
FEATURE_COUNT:CDS: 15708 present
FEATURE_COUNT:mRNA: 15708 present
FEATURE_COUNT:misc_RNA: 1 present
FEATURE_COUNT:rRNA: 3 present
JOINED_FEATURES:14502 features have joined locations
COUNT_NUCLEOTIDES:209 nucleotide Bioseqs are present
EC_NUMBER_NOTE:2 features have EC numbers in notes or products.
FEATURE_LOCATION_CONFLICT:13007 features have inconsistent gene locations.
CONTAINED_CDS:3 coding regions are completely contained in another coding region.
SUSPECT_PRODUCT_NAMES:25 product_names contain 'suspect phrase or characters'
- 6 product names contain 'Brackets or parenthesis [] ()'
- 4 product names contain 'Mitochondrial'
- 4 product names contain 'N-term'
- 2 product names contain 'Related to'
- 2 product names contain 'Similar to'
- 2 product names contain 'gene'
- 2 product names contain 'partial'
- 2 product names contain 'similar'
- 1 product names end with like
SHORT_INTRON:221 introns are shorter than 10 nt
NO_ANNOTATION:12 bioseqs have no features
UNUSUAL_MISC_RNA:1 unexpected misc_RNA features found.
FATAL: OVERLAPPING_RRNAS:3 rRNA features overlap another rRNA feature.
FATAL: SHOW_HYPOTHETICAL_CDS_HAVING_GENE_NAME:1 hypothetical coding regions have a gene name
QUALITY_SCORES:Quality scores are missing on all sequences.

Since this was a eukaryotic organism with introns, the "features have joined locations" report is expected. Similarly, since the submitters have UTR information for some mRNAs, those mRNAs (and, therefore, their genes) will extend beyond their CDS, generating "features have inconsistent gene locations" reports. However, the other reports need to be investigated to determine whether they indicate a real problem with the annotation. For example, EC numbers need to be placed in the EC_number qualifier, unless they are within a note about similarity to another protein. Because that test looks for the pattern #.#.#.# in product names and notes, non-EC numbers that have that format will also appear in that report. Similarly, short introns may be an indication that artificial introns were inserted to correct a frameshift, which is not biologically valid.
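To see why non-EC numbers can appear in the EC_NUMBER_NOTE report, consider a simple pattern match for #.#.#.#. The Python sketch below is an assumption about what such a match might look like, not the expression used by the Discrepancy Report; the name find_ec_like and the example strings are hypothetical.

```python
import re
from typing import List

# Four dot-separated runs of digits, e.g. 1.1.1.1 (all-digit form only).
EC_LIKE = re.compile(r"\b\d+\.\d+\.\d+\.\d+\b")


def find_ec_like(text: str) -> List[str]:
    """Return every substring of text that has the #.#.#.# shape."""
    return EC_LIKE.findall(text)


print(find_ec_like("alcohol dehydrogenase (EC 1.1.1.1)"))     # ['1.1.1.1']
print(find_ec_like("predicted with software version 2.0.1.5"))  # ['2.0.1.5']
print(find_ec_like("hypothetical protein"))                    # []
```

The second string contains no EC number, yet it matches the pattern, which is why each reported feature has to be reviewed individually.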

The discrepancy report also looks for common errors in product names. Review the names and fix those that are incorrect. Since this is a eukaryote, it is possible that some of these are nuclear genes encoding organellar proteins, so perhaps the mitochondrial reports should be ignored. In contrast, no product name should contain the word 'partial'. Keep in mind that the discrepancy report will not catch every bad product name. See the product name guidelines in the Prokaryotic and Eukaryotic annotation guidelines for recommended and inappropriate product name formats.
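Before running the full report, you may want to screen product names yourself for the phrases shown in the sample summary. The Python sketch below checks only the phrases that appear in that sample, using case-insensitive substring matching; both the matching method and the names SUSPECT_PHRASES and suspect_hits are assumptions made for illustration, and the real test covers many more cases.

```python
from typing import List

# Phrases taken from the sample summary above; not the tool's full list.
SUSPECT_PHRASES = ["mitochondrial", "n-term", "related to", "similar to",
                   "gene", "partial", "similar"]


def suspect_hits(product: str) -> List[str]:
    """Return the sample suspect phrases found in a product name."""
    lowered = product.lower()
    hits = [phrase for phrase in SUSPECT_PHRASES if phrase in lowered]
    if lowered.endswith("like"):
        hits.append("ends with like")
    return hits


print(suspect_hits("ATP synthase subunit alpha"))     # []
print(suspect_hits("hypothetical protein, partial"))  # ['partial']
print(suspect_hits("Similar to kinase"))              # ['similar to', 'similar']
print(suspect_hits("kinase-like"))                    # ['ends with like']
```

Plain substring matching will also flag harmless names that merely contain one of the phrases, so treat the hits as candidates for review rather than as errors.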

If a subset of the submission is not annotated, as in this sample (reported as "NO_ANNOTATION:12 bioseqs have no features"), please let us know when you submit. We occasionally find sequences with missing annotation caused by incorrectly formatted table headers. Therefore, we will ask you to verify that unannotated records are expected, particularly if they are large sequences.

With regard to the QUALITY_SCORES output, we encourage all submitters to provide quality scores for contig sequences when possible, for example when all the sequencing was done by Sanger-style sequencing on an ABI machine. We realize that for NextGen sequencing technologies (e.g., 454 or Illumina) the normalization of quality scores across different platforms is still being discussed, so we are not currently expecting quality scores on those submissions. In addition, scaffold objects do not contain quality scores. For more information on how to include quality scores, see the genome submission instructions.

After you've run the Discrepancy Report and fixed the problem annotation, let us know when you submit your genome about reports that you think can be ignored and why. If you are not certain whether a particular test is important for your genome, please ask us.
