NCBI Dendrobium catenatum Annotation Release 100

The RefSeq genome records for Dendrobium catenatum were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Dendrobium catenatum Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Apr 11 2017
Date of submission of annotation to the public databases: Apr 18 2017
Software version: 7.3

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM160598v1	GCF_001605985.1	The National Orchid Conservation Center of China, and The Orchid Conservation and Research Center of Shenzhen	02-26-2016	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and length of annotated features are provided below for each assembly.

Feature counts

Feature	ASM160598v1
Genes and pseudogenes	24,994
protein-coding	23,131
non-coding	1,198
pseudogenes	665
genes with variants	6,261
mRNAs	34,451
fully-supported	29,338
with > 5% ab initio	4,246
partial	2,291
with filled gap(s)	1,146
known RefSeq (NM_)	0
model RefSeq (XM_)	34,451
Other RNAs	3,150
fully-supported	2,934
with > 5% ab initio	0
partial	27
with filled gap(s)	27
known RefSeq (NR_)	0
model RefSeq (XR_)	2,934
CDSs	34,451
fully-supported	29,338
with > 5% ab initio	4,361
partial	2,169
with major correction(s)	271
known RefSeq (NP_)	0
model RefSeq (XP_)	34,451
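
The gene counts above can be cross-checked with simple arithmetic. The following is a minimal sketch, with the numbers copied by hand from the Feature counts table; the dictionary keys and the helper name `gene_total` are illustrative and not part of any NCBI tool.

```python
# Counts copied from the "Feature counts" table for ASM160598v1.
counts = {
    "genes_and_pseudogenes": 24_994,
    "protein_coding": 23_131,
    "non_coding": 1_198,
    "pseudogenes": 665,
}

def gene_total(c):
    """Sum the three categories listed under 'Genes and pseudogenes'."""
    return c["protein_coding"] + c["non_coding"] + c["pseudogenes"]

total = gene_total(counts)
# Genes without pseudogenes, for comparison with the "Genes" row
# of the "Feature lengths" table.
genes_only = counts["protein_coding"] + counts["non_coding"]

print(total == counts["genes_and_pseudogenes"])
print(genes_only)
```

The sum of the three categories equals the reported total of 24,994, and the count without pseudogenes (24,329) matches the Genes row of the Feature lengths table.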

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	24,329	11,210	4,264	71	245,532
All transcripts	37,601	1,773	1,517	71	16,919
mRNA	34,451	1,795	1,541	126	16,919
misc_RNA	1,576	2,126	1,888	151	7,494
tRNA	216	74	73	71	84
lncRNA	1,358	1,071	687	98	7,863
Single-exon transcripts	4,670	1,172	964	126	9,051
coding transcripts (NM_/XM_ )	4,670	1,172	964	126	9,051
CDSs	34,451	1,372	1,128	108	16,236
Exons	141,247	308	162	2	9,051
in coding transcripts (NM_/XM_ )	135,625	308	162	2	9,051
in non-coding transcripts (NR_/XR_ )	11,116	266	137	2	5,262
Introns	113,667	2,456	312	30	116,533
in coding transcripts (NM_/XM_ )	109,762	2,427	309	30	116,533
in non-coding transcripts (NR_/XR_ )	9,123	2,978	360	31	86,335

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.55	1	1	29
Number of exons per transcript	6.4	5	1	78
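
The mean number of transcripts per gene can be reproduced from the Feature lengths table, assuming it is the count of all transcripts divided by the count of genes (the report does not state the formula, so this is an assumption). A minimal sketch:

```python
# Counts copied from the "Feature lengths" table.
all_transcripts = 37_601
genes = 24_329

# Assumed definition: total transcripts divided by total genes.
mean_transcripts_per_gene = all_transcripts / genes

print(round(mean_transcripts_per_gene, 2))
```

Under that assumption the result rounds to the reported 1.55. The mean number of exons per transcript cannot be derived the same way from the table, because the Exons row appears to count each exon once even when several transcripts share it.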

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,131 coding genes, 20,373 genes had a protein with an alignment covering 50% or more of the query and 7,680 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
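
As an illustration of these definitions, the sketch below computes query and target coverage for one alignment. The function name `coverage_percent` and the example coordinates are hypothetical; coordinates are assumed to be 1-based and inclusive, and the aligned span is taken as a single interval, ignoring gaps. The pipeline's exact computation is not described here, so this is only an approximation of the idea.

```python
def coverage_percent(start, end, seq_len):
    """Percentage of a sequence of length seq_len covered by the
    aligned interval [start, end] (1-based, inclusive)."""
    aligned = end - start + 1
    return 100.0 * aligned / seq_len

# Hypothetical alignment: residues 11-360 of a 400-residue annotated
# protein (query) aligned to residues 1-350 of a 500-residue target.
query_cov = coverage_percent(11, 360, 400)
target_cov = coverage_percent(1, 350, 500)
print(query_cov, target_cov)

# Share of coding genes passing each query-coverage threshold,
# using the gene counts stated in this section.
coding_genes = 23_131
share_50 = 100.0 * 20_373 / coding_genes
share_95 = 100.0 * 7_680 / coding_genes
print(round(share_50, 1), round(share_95, 1))
```

With the counts given in this section, roughly 88% of coding genes reach 50% query coverage and roughly 33% reach 95%.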

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
ASM160598v1	GCF_001605985.1	2.23%	37.20%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	207	206 (99.52%)	197 (95.17%)	99.16%	93.83%
Same-species EST	800	618 (77.25%)	582 (72.75%)	98.99%	98.73%
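
The percentages in parentheses are relative to the number of sequences retrieved from Entrez. A minimal sketch that recomputes them from the counts in the table above (the helper name `pct` and the data layout are illustrative):

```python
def pct(n, total):
    """Format n as a percentage of total with two decimals."""
    return f"{100 * n / total:.2f}"

# (source, retrieved, aligned by Splign, passed to Gnomon)
rows = [
    ("Same-species GenBank", 207, 206, 197),
    ("Same-species EST", 800, 618, 582),
]

for source, retrieved, aligned, passed in rows:
    print(source, pct(aligned, retrieved), pct(passed, retrieved))
```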

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	NA	Aggregate of all aligned samples	853,438,544	71%	24%	142,303
SAMN03389126	26904032	young root,stem and leaf (Dendrobium catenatum, young, SAMN03389126)	114,970,718	87%	18%	120,142
SAMN03389127	26904032	mature root,stem and leaf (Dendrobium catenatum, mature, SAMN03389127)	102,340,806	85%	15%	116,108
SAMN03453574	NA	seed (Dendrobium catenatum, SAMN03453574)	51,663,326	89%	18%	110,244
SAMN03610523	NA	stem (Dendrobium catenatum, SAMN03610523)	158,301,429	10%	0%	64,646
SAMN03610524	NA	leaf (Dendrobium catenatum, SAMN03610524)	165,671,543	7%	1%	87,492
SAMN03610525	NA	flower (Dendrobium catenatum, SAMN03610525)	128,826,447	19%	4%	113,618
SAMN04534727	NA	leaf, cold acclimation (Dendrobium catenatum, 120 d, SAMN04534727)	61,948,262	87%	25%	116,988
SAMN04534728	NA	leaf, cold acclimation (Dendrobium catenatum, 120 d, SAMN04534728)	55,567,960	86%	25%	116,045
SAMN04534729	NA	leaf, cold acclimation (Dendrobium catenatum, 120 d, SAMN04534729)	59,680,636	86%	24%	115,156
SAMN04534730	NA	leaf, control (Dendrobium catenatum, 120 d, SAMN04534730)	60,239,910	86%	23%	116,475
SAMN04534731	NA	leaf, control (Dendrobium catenatum, 120 d, SAMN04534731)	58,048,160	85%	24%	115,409
SAMN04534732	NA	leaf, control (Dendrobium catenatum, 120 d, SAMN04534732)	58,881,372	86%	24%	116,294

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR1909494	SRX951120	SRP055853	SAMN03389126	114,970,718	87%	18%
SRR1909493	SRX951121	SRP055853	SAMN03389127	102,340,806	85%	15%
SRR1951787	SRX977847	SRP056844	SAMN03453574	51,663,326	89%	18%
SRR2012531	SRX1020452	SRP058073	SAMN03610523	19,639,265	85%	7%
SRR2012580	SRX1020467	SRP058073	SAMN03610524	11,429,449	81%	7%
SRR2014297	SRX1021786	SRP058073	SAMN03610524	77,740,646	4%	0%
SRR2014396	SRX1021811	SRP058073	SAMN03610525	66,698,392	23%	5%
SRR2014476	SRX1021811	SRP058073	SAMN03610525	54,589,642	17%	3%
SRR3210613	SRX1618941	SRP071172	SAMN04534727	61,948,262	87%	25%
SRR3210621	SRX1618948	SRP071172	SAMN04534728	55,567,960	86%	25%
SRR3210626	SRX1618950	SRP071172	SAMN04534729	59,680,636	86%	24%
SRR3210630	SRX1618954	SRP071172	SAMN04534730	60,239,910	86%	23%
SRR3210635	SRX1618957	SRP071172	SAMN04534731	58,048,160	85%	24%
SRR3210636	SRX1618958	SRP071172	SAMN04534732	58,881,372	86%	24%
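
The read counts of the runs listed above add up to the aggregate figure reported in the by-sample table (853,438,544 reads). A minimal sketch of that check, with the counts copied by hand from the by-run table; the variable names are illustrative:

```python
# Number of reads per run, copied from the by-run table.
run_reads = {
    "SRR1909494": 114_970_718,
    "SRR1909493": 102_340_806,
    "SRR1951787": 51_663_326,
    "SRR2012531": 19_639_265,
    "SRR2012580": 11_429_449,
    "SRR2014297": 77_740_646,
    "SRR2014396": 66_698_392,
    "SRR2014476": 54_589_642,
    "SRR3210613": 61_948_262,
    "SRR3210621": 55_567_960,
    "SRR3210626": 59_680_636,
    "SRR3210630": 60_239_910,
    "SRR3210635": 58_048_160,
    "SRR3210636": 58_881_372,
}

total_reads = sum(run_reads.values())
print(f"{total_reads:,}")
```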

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	173	168 (97.11%)	168 (97.11%)	78.55%	86.17%
Arabidopsis thaliana GenBank	53,436	47,007 (87.97%)	47,007 (87.97%)	67.61%	69.39%
Arabidopsis thaliana known RefSeq (NP_)	48,148	39,365 (81.76%)	39,365 (81.76%)	65.23%	64.12%
Liliopsida GenBank	45,538	39,052 (85.76%)	39,052 (85.76%)	69.11%	72.55%
Liliopsida known RefSeq (NP_)	176	161 (91.48%)	161 (91.48%)	71.36%	73.94%
Oryza sativa GenBank	20,378	17,428 (85.52%)	17,428 (85.52%)	68.15%	69.09%
Oryza sativa high-quality model RefSeq (XP_)	24,853	20,734 (83.43%)	20,734 (83.43%)	65.64%	63.53%
Zea mays GenBank	50,477	40,268 (79.77%)	40,268 (79.77%)	69.33%	70.43%
Zea mays known RefSeq (NP_)	21,098	18,182 (86.18%)	18,182 (86.18%)	67.49%	67.79%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
