NCBI Nicotiana tabacum Annotation Release 100

The RefSeq genome records for Nicotiana tabacum were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Nicotiana tabacum Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Apr 26 2016
Date of submission of annotation to the public databases: May 4 2016
Software version: 7.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Ntab-TN90	GCF_000715135.1	Philip Morris International R&D	05-29-2014	Reference	2 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Ntab-TN90
Genes and pseudogenes	73,946
  protein-coding	61,526
  non-coding	9,019
  pseudogenes	3,401
  genes with variants	14,549
mRNAs	84,001
  fully-supported	68,794
  with > 5% ab initio	13,558
  partial	5,502
  with filled gap(s)	2,006
  known RefSeq (NM_)	0
  model RefSeq (XM_)	84,001
Other RNAs	17,907
  fully-supported	16,461
  with > 5% ab initio	0
  partial	21
  with filled gap(s)	20
  known RefSeq (NR_)	0
  model RefSeq (XR_)	16,461
CDSs	84,001
  fully-supported	68,794
  with > 5% ab initio	13,760
  partial	5,317
  with major correction(s)	267
  known RefSeq (NP_)	0
  model RefSeq (XP_)	84,001
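
As a quick consistency check on the feature counts, the gene total appears to be the sum of the protein-coding, non-coding and pseudogene rows, and the mRNA and other-RNA counts add up to the "All transcripts" count reported under Feature lengths. The sketch below reproduces that arithmetic. It assumes that "genes with variants" is a subset of the genes rather than a separate category; the variable names are illustrative only.

```python
# Counts copied from the Feature counts table for Ntab-TN90.
gene_counts = {
    "protein-coding": 61_526,
    "non-coding": 9_019,
    "pseudogenes": 3_401,
}
reported_gene_total = 73_946  # "Genes and pseudogenes" row

# Sum the three gene categories and compare with the reported total.
computed_gene_total = sum(gene_counts.values())

# mRNAs plus other RNAs should match the "All transcripts" count.
mrna_count = 84_001
other_rna_count = 17_907
all_transcripts = mrna_count + other_rna_count

print("gene total matches:", computed_gene_total == reported_gene_total)
print("all transcripts:", f"{all_transcripts:,}")
```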

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	70,545	4,229	2,986	70	88,899
All transcripts	101,908	1,744	1,476	70	14,970
  mRNA	84,001	1,768	1,527	138	14,950
  misc_RNA	4,805	2,338	1,980	129	14,549
  tRNA	1,446	74	73	70	88
  lncRNA	11,656	1,533	999	94	14,970
Single-exon transcripts	10,397	1,092	873	138	8,129
  coding transcripts (NM_/XM_)	10,397	1,092	873	138	8,129
CDSs	84,001	1,254	1,041	96	14,630
Exons	371,350	334	177	1	11,970
  in coding transcripts (NM_/XM_)	336,746	323	171	1	11,970
  in non-coding transcripts (NR_/XR_)	46,207	395	203	2	11,096
Introns	289,737	775	291	30	52,250
  in coding transcripts (NM_/XM_)	266,880	734	278	30	52,250
  in non-coding transcripts (NR_/XR_)	33,719	1,110	413	32	37,584

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.45	1	1	45
Number of exons per transcript	5.63	4	1	81

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 61,526 coding genes, 51,484 genes had a protein with an alignment covering 50% or more of the query and 23,115 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
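
The two coverage definitions above reduce to the same ratio applied to different sequence lengths. The following is a minimal sketch of that calculation, not the pipeline's own code: the function name `percent` and the alignment coordinates are hypothetical, and only the gene counts are taken from this report.

```python
def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    if not 0 <= part <= whole:
        raise ValueError("part must be between 0 and whole")
    return 100.0 * part / whole


# Hypothetical alignment (illustrative numbers, not from the report):
# residues 11-290 of a 300-residue annotated protein (the query) align to
# residues 1-280 of a 400-residue reference protein (the target).
query_coverage = percent(290 - 11 + 1, 300)   # aligned query length / query length
target_coverage = percent(280 - 1 + 1, 400)   # aligned target length / target length

# Gene counts quoted in the report, restated as shares of all coding genes.
coding_genes = 61_526
share_query_cov_50 = percent(51_484, coding_genes)
share_query_cov_95 = percent(23_115, coding_genes)

print(f"query coverage:  {query_coverage:.1f}%")
print(f"target coverage: {target_coverage:.1f}%")
print(f"genes with >= 50% query coverage: {share_query_cov_50:.1f}%")
print(f"genes with >= 95% query coverage: {share_query_cov_95:.1f}%")
```

In this made-up case the same alignment covers most of the query but a smaller share of the target, which is why the report tracks the two values separately.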

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Ntab-TN90	GCF_000715135.1	3.00%	51.43%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	4,609	4,058 (88.05%)	3,586 (77.80%)	99.25%	95.23%
Same-species EST	332,909	291,296 (87.50%)	263,564 (79.17%)	99.32%	98.55%
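
The percentages in the table above are consistent with each count being divided by the number of sequences retrieved from Entrez. The sketch below recomputes them from the raw counts; the dictionary layout and the helper name `pct` are assumptions made for illustration.

```python
# (retrieved from Entrez, aligned by Splign, passed to Gnomon), from the table.
transcript_sources = {
    "Same-species GenBank": (4_609, 4_058, 3_586),
    "Same-species EST": (332_909, 291_296, 263_564),
}


def pct(part: int, total: int) -> float:
    """Percentage of total represented by part, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Recompute both percentage columns for every source.
recomputed = {
    source: {"aligned": pct(aligned, retrieved), "passed": pct(passed, retrieved)}
    for source, (retrieved, aligned, passed) in transcript_sources.items()
}

for source, values in recomputed.items():
    print(f"{source}: aligned {values['aligned']:.2f}%, passed to Gnomon {values['passed']:.2f}%")
```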

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	4,593,804,793	91%	22%	346,198
SAMEA3338465	Nicotiana tabacum; WTmock (Nicotiana tabacum, SAMEA3338465)	34,100,756	92%	29%	242,634
SAMEA3338466	Nicotiana tabacum; TGmock (Nicotiana tabacum, SAMEA3338466)	33,576,210	92%	29%	243,077
SAMEA3338467	Nicotiana tabacum; WTinfect (Nicotiana tabacum, SAMEA3338467)	34,990,926	92%	28%	244,130
SAMEA3338468	Nicotiana tabacum; TGinfect (Nicotiana tabacum, SAMEA3338468)	37,536,016	92%	28%	245,343
SAMN00849941	Nicotiana tabacum L. cv. SR1 stem mRNA 5 developmental stages pooling (Nicotiana tabacum, SAMN00849941)	51,688,888	91%	17%	242,446
SAMN00854692	Nicotiana tabacum L.cv.SR1 stem mRNA 5 developmental stages before flowering (RNA-seq) (Nicotiana tabacum, SAMN00854692)	38,831,377	92%	8%	217,736
SAMN02217015	eukaryotic cell, (Nicotiana tabacum, SAMN02217015)	65,666,670	90%	18%	232,447
SAMN02217020	eukaryotic cell, (Nicotiana tabacum, SAMN02217020)	65,666,670	90%	18%	233,604
SAMN02316609	Leaf (Nicotiana tabacum, SAMN02316609)	109,139,490	89%	22%	278,126
SAMN02316610	Leaf (Nicotiana tabacum, SAMN02316610)	164,957,186	91%	20%	284,585
SAMN02316611	Leaf (Nicotiana tabacum, SAMN02316611)	119,836,094	92%	23%	277,996
SAMN02316612	Root (Nicotiana tabacum, SAMN02316612)	98,714,710	90%	21%	280,154
SAMN02316613	Root (Nicotiana tabacum, SAMN02316613)	91,788,176	89%	20%	282,993
SAMN02316614	Root (Nicotiana tabacum, SAMN02316614)	104,367,224	89%	21%	281,753
SAMN02316615	Leaf (Nicotiana tabacum, SAMN02316615)	87,830,686	91%	22%	268,229
SAMN02316616	Leaf (Nicotiana tabacum, SAMN02316616)	118,106,170	91%	23%	267,394
SAMN02316617	Leaf (Nicotiana tabacum, SAMN02316617)	182,490,838	91%	23%	274,716
SAMN02316618	Root (Nicotiana tabacum, SAMN02316618)	134,638,766	88%	20%	287,825
SAMN02316619	Root (Nicotiana tabacum, SAMN02316619)	140,027,656	89%	21%	288,148
SAMN02316620	Root (Nicotiana tabacum, SAMN02316620)	134,701,754	88%	20%	287,200
SAMN02316621	Leaf (Nicotiana tabacum, SAMN02316621)	115,007,276	91%	22%	262,378
SAMN02316622	Leaf (Nicotiana tabacum, SAMN02316622)	99,846,164	91%	22%	265,789
SAMN02316623	Leaf (Nicotiana tabacum, SAMN02316623)	107,517,004	90%	22%	263,590
SAMN02316624	Root (Nicotiana tabacum, SAMN02316624)	166,749,150	89%	21%	288,443
SAMN02316625	Root (Nicotiana tabacum, SAMN02316625)	141,555,306	89%	21%	284,460
SAMN02316626	Root (Nicotiana tabacum, SAMN02316626)	176,144,178	89%	21%	290,013
SAMN02645674	Immature Flower (Nicotiana tabacum, missing, SAMN02645674)	97,903,872	94%	24%	291,801
SAMN02645675	Mature Flower (Nicotiana tabacum, missing, SAMN02645675)	100,846,380	94%	22%	288,829
SAMN02645676	Senescent Flower (Nicotiana tabacum, missing, SAMN02645676)	57,299,122	94%	20%	255,038
SAMN02645677	Dry Capsule (Nicotiana tabacum, missing, SAMN02645677)	43,539,758	91%	21%	226,543
SAMN02645678	Stem (Nicotiana tabacum, missing, SAMN02645678)	52,035,046	94%	21%	244,430
SAMN02645679	Root (Nicotiana tabacum, missing, SAMN02645679)	77,425,142	90%	20%	259,021
SAMN02645680	Young Leaf (Nicotiana tabacum, missing, SAMN02645680)	54,262,284	94%	24%	247,697
SAMN02645681	Mature Leaf (Nicotiana tabacum, missing, SAMN02645681)	69,984,574	94%	23%	250,838
SAMN02645682	Senescent Leaf (Nicotiana tabacum, missing, SAMN02645682)	80,142,054	94%	23%	261,072
SAMN02645683	Immature Flower (Nicotiana tabacum, missing, SAMN02645683)	92,768,572	93%	24%	288,434
SAMN02645684	Mature Flower (Nicotiana tabacum, missing, SAMN02645684)	86,486,460	94%	22%	276,503
SAMN02645685	Senescent Flower (Nicotiana tabacum, missing, SAMN02645685)	46,017,156	93%	21%	254,554
SAMN02645686	Dry Capsule (Nicotiana tabacum, missing, SAMN02645686)	57,343,858	91%	21%	239,634
SAMN02645687	Stem (Nicotiana tabacum, missing, SAMN02645687)	54,634,530	94%	19%	251,028
SAMN02645688	Root (Nicotiana tabacum, missing, SAMN02645688)	23,162,812	89%	21%	229,830
SAMN02645689	Young Leaf (Nicotiana tabacum, missing, SAMN02645689)	69,627,388	93%	23%	251,818
SAMN02645690	Mature Leaf (Nicotiana tabacum, missing, SAMN02645690)	57,260,944	93%	24%	248,801
SAMN02645691	Senescent Leaf (Nicotiana tabacum, missing, SAMN02645691)	79,186,800	94%	23%	257,889
SAMN02645692	Immature Flower (Nicotiana tabacum, missing, SAMN02645692)	81,592,536	94%	25%	283,837
SAMN02645693	Mature Flower (Nicotiana tabacum, missing, SAMN02645693)	88,309,184	93%	23%	285,013
SAMN02645694	Dry Capsule (Nicotiana tabacum, missing, SAMN02645694)	43,118,862	91%	20%	223,206
SAMN02645695	Stem (Nicotiana tabacum, missing, SAMN02645695)	86,850,900	94%	21%	272,063
SAMN02645696	Root (Nicotiana tabacum, missing, SAMN02645696)	86,243,630	90%	19%	269,032
SAMN02645697	Young Leaf (Nicotiana tabacum, missing, SAMN02645697)	73,650,890	94%	24%	259,161
SAMN02645698	Mature Leaf (Nicotiana tabacum, missing, SAMN02645698)	76,538,320	94%	24%	257,573
SAMN02645699	Senescent Leaf (Nicotiana tabacum, missing, SAMN02645699)	93,215,816	94%	22%	251,709
SAMN03280119	Whole plant (Nicotiana tabacum, 21-day-old, SAMN03280119)	7,613,340	66%	23%	165,789
SAMN04009965	young primary leaf, Wov transgenic (Nicotiana tabacum, SAMN04009965)	17,367,232	73%	15%	174,902
SAMN04009966	young primary leaf, Wov transgenic (Nicotiana tabacum, SAMN04009966)	20,029,812	81%	17%	190,998
SAMN04009967	young primary leaf, wild-type (Nicotiana tabacum, SAMN04009967)	20,992,196	81%	18%	191,221
SAMN04009968	young primary leaf, wild-type (Nicotiana tabacum, SAMN04009968)	19,779,274	81%	18%	189,085
SAMN04386829	leaf (Nicotiana tabacum, 21 days, SAMN04386829)	23,100,708	93%	30%	216,199

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
ERR850841	ERX931570	ERP010141	SAMEA3338465	34,100,756	92%	29%
ERR850842	ERX931571	ERP010141	SAMEA3338466	33,576,210	92%	29%
ERR850843	ERX931572	ERP010141	SAMEA3338467	34,990,926	92%	28%
ERR850844	ERX931573	ERP010141	SAMEA3338468	37,536,016	92%	28%
SRR458498	SRX136990	SRP012108	SAMN00849941	51,688,888	91%	17%
SRR475644	SRX139858	SRP012144	SAMN00854692	38,831,377	92%	8%
SRR924285	SRX316946	SRP026451	SAMN02217015	65,666,670	90%	18%
SRR924313	SRX317021	SRP026451	SAMN02217020	65,666,670	90%	18%
SRR955761	SRX338101	SRP029183	SAMN02316609	109,139,490	89%	22%
SRR955762	SRX338102	SRP029183	SAMN02316610	164,957,186	91%	20%
SRR955763	SRX338103	SRP029183	SAMN02316611	119,836,094	92%	23%
SRR955765	SRX338104	SRP029183	SAMN02316612	98,714,710	90%	21%
SRR955766	SRX338105	SRP029183	SAMN02316613	91,788,176	89%	20%
SRR955767	SRX338106	SRP029183	SAMN02316614	104,367,224	89%	21%
SRR1199197	SRX495602	SRP029183	SAMN02645674	97,903,872	94%	24%
SRR1199069	SRX495520	SRP029183	SAMN02645675	100,846,380	94%	22%
SRR1199124	SRX495530	SRP029183	SAMN02645676	57,299,122	94%	20%
SRR1199063	SRX495517	SRP029183	SAMN02645677	43,539,758	91%	21%
SRR1199130	SRX495598	SRP029183	SAMN02645678	52,035,046	94%	21%
SRR1199121	SRX495526	SRP029183	SAMN02645679	77,425,142	90%	20%
SRR1199200	SRX495606	SRP029183	SAMN02645680	54,262,284	94%	24%
SRR1199072	SRX495523	SRP029183	SAMN02645681	69,984,574	94%	23%
SRR1199127	SRX495532	SRP029183	SAMN02645682	80,142,054	94%	23%
SRR1199198	SRX495603	SRP029183	SAMN02645683	92,768,572	93%	24%
SRR1199070	SRX495521	SRP029183	SAMN02645684	86,486,460	94%	22%
SRR1199125	SRX495531	SRP029183	SAMN02645685	46,017,156	93%	21%
SRR1199066	SRX495518	SRP029183	SAMN02645686	57,343,858	91%	21%
SRR1199132	SRX495600	SRP029183	SAMN02645687	54,634,530	94%	19%
SRR1199122	SRX495527	SRP029183	SAMN02645688	23,162,812	89%	21%
SRR1199202	SRX495607	SRP029183	SAMN02645689	69,627,388	93%	23%
SRR1199073	SRX495524	SRP029183	SAMN02645690	57,260,944	93%	24%
SRR1199128	SRX495534	SRP029183	SAMN02645691	79,186,800	94%	23%
SRR1199199	SRX495605	SRP029183	SAMN02645692	81,592,536	94%	25%
SRR1199071	SRX495522	SRP029183	SAMN02645693	88,309,184	93%	23%
SRR1199068	SRX495519	SRP029183	SAMN02645694	43,118,862	91%	20%
SRR1199135	SRX495601	SRP029183	SAMN02645695	86,850,900	94%	21%
SRR1199123	SRX495529	SRP029183	SAMN02645696	86,243,630	90%	19%
SRR1199203	SRX495608	SRP029183	SAMN02645697	73,650,890	94%	24%
SRR1199074	SRX495525	SRP029183	SAMN02645698	76,538,320	94%	24%
SRR1199129	SRX495535	SRP029183	SAMN02645699	93,215,816	94%	22%
SRR955772	SRX338110	SRP029184	SAMN02316615	87,830,686	91%	22%
SRR955773	SRX338111	SRP029184	SAMN02316616	118,106,170	91%	23%
SRR955774	SRX338112	SRP029184	SAMN02316617	182,490,838	91%	23%
SRR955776	SRX338113	SRP029184	SAMN02316618	134,638,766	88%	20%
SRR955777	SRX338114	SRP029184	SAMN02316619	140,027,656	89%	21%
SRR955778	SRX338115	SRP029184	SAMN02316620	134,701,754	88%	20%
SRR955783	SRX338120	SRP029185	SAMN02316621	115,007,276	91%	22%
SRR955784	SRX338121	SRP029185	SAMN02316622	99,846,164	91%	22%
SRR955785	SRX338122	SRP029185	SAMN02316623	107,517,004	90%	22%
SRR955786	SRX338123	SRP029185	SAMN02316624	166,749,150	89%	21%
SRR955787	SRX338124	SRP029185	SAMN02316625	141,555,306	89%	21%
SRR955788	SRX338125	SRP029185	SAMN02316626	176,144,178	89%	21%
SRR1747973	SRX835497	SRP051965	SAMN03280119	3,133,940	67%	24%
SRR1747974	SRX835497	SRP051965	SAMN03280119	4,479,400	66%	22%
SRR2182161	SRX1163546	SRP062809	SAMN04009965	17,367,232	73%	15%
SRR2182162	SRX1163547	SRP062809	SAMN04009966	20,029,812	81%	17%
SRR2182163	SRX1163548	SRP062809	SAMN04009967	20,992,196	81%	18%
SRR2182164	SRX1163549	SRP062809	SAMN04009968	19,779,274	81%	18%
SRR3099652	SRX1528249	SRP068346	SAMN04386829	3,331,796	93%	28%
SRR3099654	SRX1528249	SRP068346	SAMN04386829	4,806,609	94%	30%
SRR3099655	SRX1528249	SRP068346	SAMN04386829	4,806,609	94%	30%
SRR3100097	SRX1528249	SRP068346	SAMN04386829	3,411,949	92%	31%
SRR3100098	SRX1528249	SRP068346	SAMN04386829	3,411,949	92%	31%
SRR3100099	SRX1528249	SRP068346	SAMN04386829	3,331,796	93%	29%
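
Where a sample has several runs, the read count in the by-sample table matches the sum of its runs in the by-run table. The sketch below shows that aggregation for the two multi-run samples in this report. The tuple layout is an assumption for illustration; the accessions and read counts are copied from the table above.

```python
from collections import defaultdict

# (run accession, sample accession, number of reads), copied from the by-run table.
runs = [
    ("SRR1747973", "SAMN03280119", 3_133_940),
    ("SRR1747974", "SAMN03280119", 4_479_400),
    ("SRR3099652", "SAMN04386829", 3_331_796),
    ("SRR3099654", "SAMN04386829", 4_806_609),
    ("SRR3099655", "SAMN04386829", 4_806_609),
    ("SRR3100097", "SAMN04386829", 3_411_949),
    ("SRR3100098", "SAMN04386829", 3_411_949),
    ("SRR3100099", "SAMN04386829", 3_331_796),
]

# Sum the reads of all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, total in sorted(reads_per_sample.items()):
    print(f"{sample}\t{total:,}")
```

The totals can then be compared with the "Number of reads" column of the by-sample table.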

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Arabidopsis thaliana known RefSeq (NP_)	35,173	30,658 (87.16%)	30,658 (87.16%)	68.45%	72.86%
Solanaceae GenBank	9,563	9,381 (98.10%)	9,381 (98.10%)	75.15%	84.18%
Solanaceae known RefSeq (NP_)	2,928	2,914 (99.52%)	2,914 (99.52%)	76.21%	84.85%
Same-species GenBank	2,812	2,767 (98.40%)	2,767 (98.40%)	78.91%	87.90%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
