NCBI Populus alba Annotation Release 100

The RefSeq genome records for Populus alba were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Populus alba Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jun 17 2020
Date of submission of annotation to the public databases: Jun 23 2020
Software version: 8.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM523922v1	GCF_005239225.1	Chinese Academy of Forestry	05-08-2019	Reference	2 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ASM523922v1
Genes and pseudogenes	36,339
protein-coding	30,634
non-coding	2,833
transcribed pseudogenes	0
non-transcribed pseudogenes	2,872
genes with variants	9,289
immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	46,829
fully-supported	44,117
with > 5% ab initio	2,119
partial	187
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	46,829
non-coding RNAs	6,766
fully-supported	5,439
with > 5% ab initio	0
partial	1
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	6,103
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	46,966
fully-supported	44,117
with > 5% ab initio	2,191
partial	188
with major correction(s)	557
known RefSeq (NP_)	0
model RefSeq (XP_)	46,966
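
As a quick consistency check on the table above, the gene categories can be summed and compared with the reported total. The sketch below is illustrative only: the dictionary and function names are hypothetical, and it assumes the listed categories are mutually exclusive. "Genes with variants" is left out because it overlaps the other categories.

```python
# Gene counts copied from the "Feature counts" table for ASM523922v1.
# "genes with variants" is deliberately omitted: it is assumed to be a
# property of genes already counted in the categories below.
gene_counts = {
    "protein-coding": 30634,
    "non-coding": 2833,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 2872,
    "immunoglobulin/T-cell receptor gene segments": 0,
    "other": 0,
}
reported_total = 36339  # "Genes and pseudogenes" row


def category_sum(counts):
    """Sum all gene categories, pseudogenes included."""
    return sum(counts.values())


def non_pseudogene_count(counts):
    """Sum only the categories that are not pseudogenes."""
    return sum(n for name, n in counts.items() if "pseudogene" not in name)


print(category_sum(gene_counts) == reported_total)
print(non_pseudogene_count(gene_counts))
```

Under these assumptions the categories add up to the reported total, and the non-pseudogene sum equals the gene count given in the Feature lengths table.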

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	33,467	3,874	3,002	63	514,548
All transcripts	53,595	2,000	1,769	63	16,778
mRNA	46,829	2,048	1,799	198	16,778
misc_RNA	3,040	2,414	2,145	130	9,071
tRNA	652	74	73	70	90
lncRNA	2,399	1,543	1,101	129	10,839
snoRNA	478	104	97	63	223
snRNA	97	150	156	99	198
rRNA	100	1,312	156	103	3,401
Single-exon transcripts	4,396	1,463	1,289	222	10,836
coding transcripts (NM_/XM_)	4,396	1,463	1,289	222	10,836
CDSs	46,966	1,487	1,242	90	16,359
Exons	212,597	334	174	1	10,836
in coding transcripts (NM_/XM_)	201,908	330	171	1	10,836
in non-coding transcripts (NR_/XR_)	22,141	311	163	2	7,971
Introns	169,662	469	194	30	88,881
in coding transcripts (NM_/XM_)	162,968	462	192	30	88,881
in non-coding transcripts (NR_/XR_)	17,906	514	211	30	61,750

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.62	1	1	50
Number of exons per transcript	6.79	5	1	80

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 30,497 coding genes, 28,952 had a protein with an alignment covering 50% or more of the query, and 14,295 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
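
The two definitions above can be sketched as a small calculation. This is a minimal illustration rather than the pipeline's own code: the function and parameter names are hypothetical, and the example lengths are made up.

```python
def coverage_percent(aligned_length, sequence_length):
    """Percentage of a sequence that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 0 <= aligned_length <= sequence_length:
        raise ValueError("aligned_length must lie between 0 and sequence_length")
    return 100.0 * aligned_length / sequence_length


def query_and_target_coverage(query_len, query_aligned, target_len, target_aligned):
    """Return (query coverage, target coverage) for one alignment."""
    return (
        coverage_percent(query_aligned, query_len),
        coverage_percent(target_aligned, target_len),
    )


# Hypothetical alignment: 380 residues aligned on each side, between a
# 400-residue annotated protein (query) and a 500-residue target protein.
q_cov, t_cov = query_and_target_coverage(400, 380, 500, 380)
print(q_cov, t_cov)
```

In this made-up case the same alignment gives a high query coverage and a lower target coverage, because the target protein is longer than the query.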

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins
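
A cumulative count of this kind can be sketched as follows. The per-gene coverage values are invented for illustration and the function names are hypothetical. Only the three gene counts passed to percent_of_genes come from this report.

```python
def genes_at_or_above(coverages, thresholds):
    """For each threshold, count genes whose coverage is at or above it."""
    return {t: sum(1 for c in coverages if c >= t) for t in thresholds}


def percent_of_genes(count, total):
    """Share of genes, as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Made-up query coverage values (percent), one per gene.
example_coverages = [100.0, 97.5, 80.0, 50.0, 12.0]
print(genes_at_or_above(example_coverages, thresholds=(50, 95)))

# Counts reported above: 30,497 coding genes, 28,952 with an alignment
# covering 50% or more of the query, 14,295 with 95% or more.
print(percent_of_genes(28952, 30497))
print(percent_of_genes(14295, 30497))
```

The comparison uses "at or above" to match the "50% or more" and "95% or more" wording of the report.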

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
ASM523922v1	GCF_005239225.1	3.34%	37.37%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	87	86 (98.85%)	86 (98.85%)	99.29%	99.77%
Same-species EST	162	154 (95.06%)	150 (92.59%)	99.31%	99.55%
Same-species long SRA	10,063,437	8,473,371 (84.20%)	3,989,654 (39.65%)	89.89%	85.99%
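
The percentages in parentheses appear to be relative to the number of sequences retrieved from Entrez. The sketch below recomputes them under that assumption. The variable and function names are hypothetical, and the counts are copied from the table above.

```python
# (source, retrieved from Entrez, aligned by Splign, passed to Gnomon)
transcript_rows = [
    ("Same-species GenBank", 87, 86, 86),
    ("Same-species EST", 162, 154, 150),
    ("Same-species long SRA", 10063437, 8473371, 3989654),
]


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


for source, retrieved, aligned, passed in transcript_rows:
    print(source, percent(aligned, retrieved), percent(passed, retrieved))
```

The same calculation applies to the Protein alignments table, with ProSplign in place of Splign.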

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	6,199,575,120	84%	30%	204,121
SAMN06463933	NA	shoots (Populus alba var. alba, one month old, SAMN06463933)	48,203,412	86%	34%	165,101
SAMN06463934	NA	shoots (Populus alba var. alba, one month old, SAMN06463934)	58,482,858	87%	35%	169,692
SAMN06463937	NA	shoots (Populus alba var. alba, one month old, SAMN06463937)	46,598,650	88%	35%	165,350
SAMN06463980	NA	shoots (Populus alba var. alba, one month old, SAMN06463980)	51,558,408	87%	34%	166,875
SAMN06603334	NA	Chinese Academy of Forestry (Populus alba, 1 month, SAMN06603334)	237,026,514	87%	29%	174,726
SAMN06603335	NA	Chinese Academy of Forestry (Populus alba, 1 month, SAMN06603335)	184,842,974	85%	30%	173,700
SAMN06603336	NA	Chinese Academy of Forestry (Populus alba, 1 month, SAMN06603336)	166,688,388	82%	30%	171,284
SAMN06603350	NA	Chinese Academy of Forestry (Populus alba, 1 month, SAMN06603350)	177,341,190	79%	32%	166,289
SAMN06603366	NA	Chinese Academy of Forestry (Populus alba, 1 month, SAMN06603366)	277,563,678	85%	29%	176,505
SAMN06603367	NA	Chinese Academy of Forestry (Populus alba, 1 month, SAMN06603367)	326,070,846	87%	31%	178,490
SAMN06603368	NA	Chinese Academy of Forestry (Populus alba, 1 month, SAMN06603368)	172,579,958	80%	29%	169,192
SAMN07485202	29534547	Leaf (Populus alba var. pyramidalis, SAMN07485202)	796,287,448	68%	37%	164,198
SAMN07604232	NA	leaf, phloem, xylem and root (Populus alba, 2 years old, SAMN07604232)	362,749,552	85%	29%	187,128
SAMN08222892	NA	leaf, phloem, xylem and root (Populus alba, 2, SAMN08222892)	3,211,244,994	87%	28%	196,537
SAMN10068022	NA	Plant sample from Populus alba (Populus alba, SAMN10068022)	82,336,250	76%	35%	168,953

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR5310944	SRX2610912	SRP101304	SAMN06463933	48,203,412	86%	34%
SRR5310943	SRX2610911	SRP101304	SAMN06463934	58,482,858	87%	35%
SRR5310941	SRX2610910	SRP101304	SAMN06463937	46,598,650	88%	35%
SRR5310942	SRX2610851	SRP101304	SAMN06463980	51,558,408	87%	34%
SRR5347109	SRX2643414	SRP101893	SAMN06603334	83,370,164	85%	29%
SRR5347155	SRX2643414	SRP101893	SAMN06603334	78,006,990	88%	29%
SRR5347156	SRX2643414	SRP101893	SAMN06603334	75,649,360	88%	29%
SRR5347157	SRX2643418	SRP101893	SAMN06603335	59,444,714	85%	31%
SRR5347158	SRX2643418	SRP101893	SAMN06603335	50,825,768	84%	30%
SRR5347160	SRX2643418	SRP101893	SAMN06603335	74,572,492	87%	30%
SRR5347161	SRX2643419	SRP101893	SAMN06603336	52,148,218	79%	29%
SRR5347162	SRX2643419	SRP101893	SAMN06603336	48,841,074	80%	29%
SRR5347163	SRX2643419	SRP101893	SAMN06603336	65,699,096	86%	30%
SRR5347164	SRX2643420	SRP101893	SAMN06603350	47,901,038	74%	26%
SRR5347165	SRX2643420	SRP101893	SAMN06603350	54,272,158	82%	34%
SRR5347166	SRX2643420	SRP101893	SAMN06603350	75,167,994	80%	33%
SRR5347167	SRX2643421	SRP101893	SAMN06603366	60,421,122	87%	29%
SRR5347168	SRX2643421	SRP101893	SAMN06603366	76,407,532	88%	30%
SRR5347169	SRX2643421	SRP101893	SAMN06603366	140,735,024	83%	29%
SRR5347170	SRX2643422	SRP101893	SAMN06603367	123,475,156	87%	31%
SRR5347171	SRX2643422	SRP101893	SAMN06603367	120,699,396	87%	30%
SRR5347475	SRX2643422	SRP101893	SAMN06603367	81,896,294	87%	31%
SRR5347506	SRX2643424	SRP101893	SAMN06603368	54,033,642	79%	29%
SRR5347531	SRX2643424	SRP101893	SAMN06603368	51,544,416	78%	29%
SRR5347568	SRX2643424	SRP101893	SAMN06603368	67,001,900	81%	30%
SRR5984090	SRX3140063	SRP116302	SAMN07485202	42,248,380	69%	38%
SRR5984086	SRX3140067	SRP116302	SAMN07485202	39,795,332	69%	37%
SRR5984085	SRX3140068	SRP116302	SAMN07485202	66,456,714	68%	36%
SRR5984083	SRX3140070	SRP116302	SAMN07485202	66,706,786	65%	38%
SRR5984082	SRX3140071	SRP116302	SAMN07485202	75,926,010	68%	37%
SRR5984079	SRX3140074	SRP116302	SAMN07485202	109,470,584	70%	37%
SRR5984078	SRX3140075	SRP116302	SAMN07485202	59,014,678	70%	38%
SRR5984077	SRX3140076	SRP116302	SAMN07485202	62,112,734	66%	38%
SRR5984075	SRX3140078	SRP116302	SAMN07485202	88,780,204	67%	38%
SRR5984074	SRX3140079	SRP116302	SAMN07485202	73,458,950	69%	38%
SRR5984073	SRX3140080	SRP116302	SAMN07485202	74,871,338	69%	37%
SRR5984067	SRX3140086	SRP116302	SAMN07485202	37,445,738	69%	38%
SRR6003833	SRX3159115	SRP116769	SAMN07604232	80,964,060	81%	27%
SRR6003834	SRX3159115	SRP116769	SAMN07604232	94,298,520	92%	29%
SRR6003835	SRX3159115	SRP116769	SAMN07604232	90,938,550	92%	29%
SRR6003836	SRX3159115	SRP116769	SAMN07604232	96,548,422	77%	30%
SRR6411264	SRX3504248	SRP116769	SAMN08222892	94,325,484	83%	30%
SRR6411263	SRX3504249	SRP116769	SAMN08222892	84,579,942	91%	28%
SRR6411262	SRX3504250	SRP116769	SAMN08222892	88,157,036	93%	30%
SRR6411261	SRX3504251	SRP116769	SAMN08222892	92,501,072	93%	31%
SRR6411260	SRX3504252	SRP116769	SAMN08222892	117,094,120	93%	31%
SRR6411259	SRX3504253	SRP116769	SAMN08222892	83,682,058	91%	27%
SRR6411258	SRX3504254	SRP116769	SAMN08222892	85,616,296	88%	26%
SRR6411257	SRX3504255	SRP116769	SAMN08222892	92,299,922	92%	30%
SRR6411256	SRX3504256	SRP116769	SAMN08222892	85,817,306	92%	31%
SRR6411255	SRX3504257	SRP116769	SAMN08222892	81,802,490	77%	29%
SRR6411254	SRX3504258	SRP116769	SAMN08222892	97,707,442	85%	30%
SRR6411253	SRX3504259	SRP116769	SAMN08222892	90,938,550	92%	29%
SRR6411252	SRX3504260	SRP116769	SAMN08222892	83,919,196	92%	28%
SRR6411251	SRX3504261	SRP116769	SAMN08222892	85,882,842	79%	29%
SRR6411250	SRX3504262	SRP116769	SAMN08222892	85,055,162	82%	30%
SRR6411249	SRX3504263	SRP116769	SAMN08222892	96,091,952	87%	26%
SRR6411248	SRX3504264	SRP116769	SAMN08222892	94,298,520	92%	29%
SRR6411247	SRX3504265	SRP116769	SAMN08222892	90,612,178	90%	26%
SRR6411246	SRX3504266	SRP116769	SAMN08222892	88,795,828	85%	30%
SRR6411245	SRX3504267	SRP116769	SAMN08222892	106,468,792	86%	29%
SRR6411244	SRX3504268	SRP116769	SAMN08222892	79,207,514	82%	28%
SRR6411243	SRX3504269	SRP116769	SAMN08222892	88,843,710	81%	28%
SRR6411242	SRX3504270	SRP116769	SAMN08222892	67,174,528	90%	28%
SRR6411241	SRX3504271	SRP116769	SAMN08222892	80,964,060	81%	27%
SRR6411240	SRX3504272	SRP116769	SAMN08222892	89,744,764	87%	28%
SRR6411239	SRX3504273	SRP116769	SAMN08222892	75,990,860	74%	28%
SRR6411238	SRX3504274	SRP116769	SAMN08222892	78,564,340	75%	29%
SRR6411237	SRX3504275	SRP116769	SAMN08222892	87,939,468	90%	25%
SRR6411236	SRX3504276	SRP116769	SAMN08222892	95,749,212	93%	30%
SRR6411235	SRX3504277	SRP116769	SAMN08222892	96,548,422	77%	30%
SRR6411234	SRX3504278	SRP116769	SAMN08222892	90,169,234	87%	25%
SRR6411233	SRX3504279	SRP116769	SAMN08222892	93,381,956	92%	28%
SRR6411232	SRX3504280	SRP116769	SAMN08222892	96,363,588	92%	28%
SRR6411231	SRX3504281	SRP116769	SAMN08222892	90,491,342	92%	27%
SRR6411230	SRX3504282	SRP116769	SAMN08222892	81,898,824	90%	25%
SRR6411229	SRX3504283	SRP116769	SAMN08222892	92,566,984	91%	29%
SRR7906166	SRX4742319	SRP162694	SAMN10068022	82,336,250	76%	35%
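
The by-run rows can be rolled up to the by-sample figures. The sketch below does this for the three runs of SAMN06603334 and is only an illustration: the function name is hypothetical, the input keeps only some columns of the table, and the aligned percentage is an estimate because the per-run percentages are already rounded.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the by-run table (all runs of SAMN06603334),
# keeping only the columns needed here.
RUN_TABLE = (
    "Run\tSample\tNumber of reads\tPercent aligned reads\n"
    "SRR5347109\tSAMN06603334\t83,370,164\t85%\n"
    "SRR5347155\tSAMN06603334\t78,006,990\t88%\n"
    "SRR5347156\tSAMN06603334\t75,649,360\t88%\n"
)


def reads_per_sample(table_text):
    """Return {sample: (total reads, read-weighted percent aligned)}."""
    totals = defaultdict(int)
    aligned = defaultdict(float)
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        reads = int(row["Number of reads"].replace(",", ""))
        pct = float(row["Percent aligned reads"].rstrip("%"))
        totals[row["Sample"]] += reads
        aligned[row["Sample"]] += reads * pct / 100.0
    return {s: (totals[s], 100.0 * aligned[s] / totals[s]) for s in totals}


summary = reads_per_sample(RUN_TABLE)
print(summary["SAMN06603334"])
```

For this sample the summed read count equals the 237,026,514 reads in the by-sample table, and the weighted percentage rounds to the reported 87%.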

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Salicaceae GenBank	9,443	8,421 (89.18%)	8,421 (89.18%)	78.47%	88.90%
Salicaceae known RefSeq (NP_)	39	39 (100.00%)	39 (100.00%)	71.76%	85.35%
Arabidopsis thaliana GenBank	53,546	12,931 (24.15%)	12,931 (24.15%)	68.78%	78.11%
Arabidopsis thaliana known RefSeq (NP_)	48,147	35,356 (73.43%)	35,356 (73.43%)	67.02%	73.28%
Same-species GenBank	85	77 (90.59%)	77 (90.59%)	79.14%	90.20%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
