NCBI Prunus avium Annotation Release 100

The RefSeq genome records for Prunus avium were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Prunus avium Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jul 25 2017
Date of submission of annotation to the public databases: Jul 26 2017
Software version: 7.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
PAV_r1.0	GCF_002207925.1	Kazusa DNA Research Institute	06-12-2017	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	PAV_r1.0
Genes and pseudogenes	30,405
protein-coding	25,841
non-coding	2,959
pseudogenes	1,605
genes with variants	6,033
mRNAs	35,009
fully-supported	30,339
with > 5% ab initio	3,500
partial	1,454
with filled gap(s)	214
known RefSeq (NM_)	0
model RefSeq (XM_)	35,009
Other RNAs	5,224
fully-supported	4,805
with > 5% ab initio	0
partial	9
with filled gap(s)	9
known RefSeq (NR_)	0
model RefSeq (XR_)	4,805
CDSs	35,009
fully-supported	30,339
with > 5% ab initio	3,623
partial	1,426
with major correction(s)	306
known RefSeq (NP_)	0
model RefSeq (XP_)	35,009
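
The counts in this table can be cross-checked with simple arithmetic. The snippet below is a minimal sketch of such a check; the dictionary keys and the helper name `gene_total` are illustrative labels chosen here, not identifiers used by the annotation pipeline.

```python
# Counts copied from the feature-count table for PAV_r1.0.
# The dictionary keys are illustrative labels, not pipeline identifiers.
counts = {
    "genes_and_pseudogenes": 30405,
    "protein_coding": 25841,
    "non_coding": 2959,
    "pseudogenes": 1605,
    "mrnas": 35009,
    "known_nm": 0,
    "model_xm": 35009,
    "cdss": 35009,
}


def gene_total(c):
    """Sum the three gene categories reported in the table."""
    return c["protein_coding"] + c["non_coding"] + c["pseudogenes"]


# The three gene categories add up to the reported total.
assert gene_total(counts) == counts["genes_and_pseudogenes"]
# Known (NM_) plus model (XM_) records account for all mRNAs.
assert counts["known_nm"] + counts["model_xm"] == counts["mrnas"]
# The mRNA and CDS counts are equal in this release.
assert counts["mrnas"] == counts["cdss"]

print(gene_total(counts))
```

All three assertions hold for the numbers reported above.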


Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	28,800	3,237	2,418	71	89,252
All transcripts	40,233	1,756	1,523	62	16,804
mRNA	35,009	1,849	1,613	165	16,804
misc_RNA	1,410	1,979	1,728	121	8,486
tRNA	419	74	73	71	84
lncRNA	3,395	915	647	62	10,921
Single-exon transcripts	4,542	1,311	1,134	208	7,394
coding transcripts (NM_/XM_ )	4,541	1,311	1,134	208	7,394
non-coding transcripts (NR_/XR_ )	1	681	681	681	681
CDSs	35,009	1,404	1,170	114	16,074
Exons	162,955	320	170	1	7,929
in coding transcripts (NM_/XM_ )	152,626	322	169	1	7,929
in non-coding transcripts (NR_/XR_ )	14,769	261	152	2	5,407
Introns	128,778	411	175	30	84,564
in coding transcripts (NM_/XM_ )	122,185	405	172	30	84,564
in non-coding transcripts (NR_/XR_ )	10,958	476	206	30	54,080
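
The tables in this report label records by accession prefix: NM_, NR_ and NP_ for known RefSeq records, and XM_, XR_ and XP_ for model RefSeq records. As a rough sketch, the lookup below maps a prefix to those labels. The function name `classify_accession` and the example accession are hypothetical and are used only for illustration.

```python
# Prefix meanings as labelled in the tables of this report.
PREFIXES = {
    "NM_": ("known", "mRNA"),
    "XM_": ("model", "mRNA"),
    "NR_": ("known", "non-coding RNA"),
    "XR_": ("model", "non-coding RNA"),
    "NP_": ("known", "protein"),
    "XP_": ("model", "protein"),
}


def classify_accession(accession):
    """Return (status, molecule) for an accession, or None if the
    three-character prefix is not one of those listed in this report."""
    return PREFIXES.get(accession[:3])


# "XM_000000001.1" is a made-up accession used only as an example.
print(classify_accession("XM_000000001.1"))
```

Since this release reports zero known RefSeq records, every transcript and protein accession it produced would fall into one of the three model categories.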

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.4	1	1	32
Number of exons per transcript	6.02	4	1	79
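
The mean number of transcripts per gene is consistent with the counts in the feature-length table. The sketch below assumes the mean is the plain ratio of all transcripts to genes; the report does not state how the pipeline computes it, and the variable and function names are illustrative.

```python
# Counts copied from the feature-length table.
genes = 28800
transcripts_by_type = {
    "mRNA": 35009,
    "misc_RNA": 1410,
    "tRNA": 419,
    "lncRNA": 3395,
}
all_transcripts = 40233  # "All transcripts" row


def mean_per_gene(n_transcripts, n_genes):
    """Assumed definition: total transcripts divided by total genes."""
    return n_transcripts / n_genes


# The four transcript types add up to the "All transcripts" count.
assert sum(transcripts_by_type.values()) == all_transcripts

print(round(mean_per_gene(all_transcripts, genes), 1))
```

Under that assumption the ratio rounds to the reported mean of 1.4.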

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Of the 25,841 protein-coding genes, 23,676 had a protein with an alignment covering 50% or more of the query, and 10,588 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
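
The coverage definitions above can be illustrated with a short sketch. It assumes a single contiguous alignment described by 1-based, inclusive coordinates; BLASTP can report several alignment segments per protein pair, which this simplified version does not merge. The function names and the example coordinates are hypothetical.

```python
def coverage(aln_start, aln_end, seq_length):
    """Percentage of a sequence spanned by one alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    return 100.0 * (aln_end - aln_start + 1) / seq_length


def count_at_or_above(coverages, threshold):
    """Number of coverage values that meet a threshold, as in the 50% and 95% counts."""
    return sum(1 for c in coverages if c >= threshold)


# Hypothetical alignment: residues 11-200 of a 200-residue annotated protein
# (query) aligned to residues 1-190 of a 380-residue target protein.
query_cov = coverage(11, 200, 200)
target_cov = coverage(1, 190, 380)
print(query_cov, target_cov)

# Made-up query coverages for four genes.
example = [99.0, 95.0, 72.5, 40.0]
print(count_at_or_above(example, 50), count_at_or_above(example, 95))
```

In this made-up case the query coverage is 95% while the target coverage is only 50%, which shows why the two measures are reported separately.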

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are used only for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
PAV_r1.0	GCF_002207925.1	2.23%	24.40%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	171	168 (98.25%)	155 (90.64%)	99.57%	97.83%
Same-species EST	6,035	3,583 (59.37%)	3,041 (50.39%)	99.36%	95.67%
Rosaceae known RefSeq (NM_/NR_)	954	835 (87.53%)	347 (36.37%)	89.60%	90.20%
Rosaceae Genbank	7,117	5,385 (75.66%)	2,194 (30.83%)	92.02%	92.86%
Rosaceae TSA	345,285	214,660 (62.17%)	63,115 (18.28%)	97.45%	98.23%
Rosaceae EST	533,541	209,992 (39.36%)	153,894 (28.84%)	92.46%	97.53%
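
The percentages in this table are relative to the number of sequences retrieved from Entrez. The sketch below parses cells of the form "168 (98.25%)" and recomputes each percentage for three of the rows. The helper names `parse_cell` and `recomputed_percent` are illustrative, and rounding to two decimals is an assumption about how the report formats its values.

```python
import re

# Matches cells such as "3,583 (59.37%)".
CELL = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_cell(cell):
    """Split a 'count (percent%)' cell into (int count, float percent)."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


def recomputed_percent(count, retrieved):
    """Percentage of retrieved sequences, rounded to two decimals."""
    return round(100.0 * count / retrieved, 2)


# (source, retrieved, aligned by Splign, passed to Gnomon), copied from the table.
rows = [
    ("Same-species Genbank", 171, "168 (98.25%)", "155 (90.64%)"),
    ("Same-species EST", 6035, "3,583 (59.37%)", "3,041 (50.39%)"),
    ("Rosaceae TSA", 345285, "214,660 (62.17%)", "63,115 (18.28%)"),
]

mismatches = []
for source, retrieved, aligned_cell, passed_cell in rows:
    for cell in (aligned_cell, passed_cell):
        count, reported = parse_cell(cell)
        if abs(recomputed_percent(count, retrieved) - reported) >= 0.01:
            mismatches.append((source, cell))

print(mismatches)
```

An empty list means the recomputed percentages agree with the reported ones for the rows checked.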

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,865,011,170	76%	22%	155,084
SAMD00016758	NA	Pollen from Prunus avium cv. Satonishiki (Prunus avium, SAMD00016758)	45,228,316	69%	17%	60,527
SAMD00052828	28541388	Prunus avium RNA-Seq (Prunus avium, SAMD00052828)	60,788,404	7%	54%	110,008
SAMN00722067	NA	Bing/Mazzard developing floral buds (Prunus avium, SAMN00722067)	72,325	24%	5%	714
SAMN00722068	NA	Bing/Mazzard Developing floral buds Sept (Prunus avium, SAMN00722068)	80,160	24%	5%	819
SAMN00722069	NA	Rainier/Mazzard Developing floral buds May (Prunus avium, SAMN00722069)	105,510	33%	5%	1,420
SAMN00722070	NA	Rainier/Mazzard Developing floral buds Sep (Prunus avium, SAMN00722070)	40,238	33%	8%	675
SAMN00722071	NA	Bing/Gisela6 Developing floral buds May (Prunus avium, SAMN00722071)	105,128	34%	5%	1,405
SAMN00722072	NA	Bing/Gisela6 Developing floral buds Sep (Prunus avium, SAMN00722072)	46,069	29%	4%	447
SAMN00722073	NA	Rainier/Gisela6 Developing floral buds May (Prunus avium, SAMN00722073)	71,286	37%	6%	1,205
SAMN00722074	NA	Rainier/Gisela6 Developing floral buds Sep (Prunus avium, SAMN00722074)	59,736	23%	3%	499
SAMN01925710	NA	Prunus avium Bing (Prunus avium, SAMN01925710)	374,511	68%	58%	64,190
SAMN01925711	NA	Prunus avium Rainier (Prunus avium, SAMN01925711)	294,129	65%	56%	52,171
SAMN02146682	22610921	ovary (Prunus avium, SAMN02146682)	22,151,715	58%	18%	88,101
SAMN02146683	22610921	exocarp (Prunus avium, SAMN02146683)	43,614,848	65%	15%	85,445
SAMN02146684	22610921	exocarp (Prunus avium, SAMN02146684)	57,130,257	64%	15%	77,868
SAMN02146685	22610921	exocarp (Prunus avium, SAMN02146685)	38,051,582	67%	14%	79,769
SAMN02146686	22610921	mesocarp (Prunus avium, SAMN02146686)	23,645,943	62%	16%	71,445
SAMN02146687	22610921	exocarp (Prunus avium, SAMN02146687)	48,366,434	52%	13%	64,723
SAMN02146688	22610921	exocarp (Prunus avium, SAMN02146688)	67,560,698	51%	15%	92,344
SAMN02146689	22610921	exocarp (Prunus avium, SAMN02146689)	33,357,061	48%	16%	83,584
SAMN02146690	22610921	exocarp (Prunus avium, SAMN02146690)	58,941,025	44%	17%	87,800
SAMN02146691	22610921	exocarp (Prunus avium, SAMN02146691)	28,792,331	51%	16%	73,668
SAMN02146692	22610921	exocarp (Prunus avium, SAMN02146692)	67,833,336	53%	17%	87,090
SAMN02146693	22610921	exocarp (Prunus avium, SAMN02146693)	25,232,045	59%	16%	54,132
SAMN02146694	22610921	exocarp (Prunus avium, SAMN02146694)	56,580,086	69%	15%	84,290
SAMN02146695	22610921	mesocarp (Prunus avium, SAMN02146695)	25,865,574	64%	18%	72,982
SAMN02146696	22610921	exocarp (Prunus avium, SAMN02146696)	25,653,752	68%	15%	74,178
SAMN02146697	22610921	exocarp (Prunus avium, SAMN02146697)	31,068,305	69%	14%	80,010
SAMN02316507	NA	Time Series RNA-Seq for Prunus avium L. cv. F12/1 (Prunus avium, SAMN02316507)	111,148,542	84%	8%	110,095
SAMN02941158	NA	fruit (Prunus avium, SAMN02941158)	70,446,308	90%	23%	118,104
SAMN02941159	NA	fruit (Prunus avium, SAMN02941159)	67,133,404	89%	22%	118,734
SAMN06284304	NA	day 6, hydrogen cyanamide treatment (Prunus avium, SAMN06284304)	33,687,530	93%	27%	122,031
SAMN06284305	NA	day 6, hydrogen cyanamide treatment (Prunus avium, SAMN06284305)	41,185,698	93%	27%	126,130
SAMN06284306	NA	day 6, hydrogen cyanamide treatment (Prunus avium, SAMN06284306)	36,691,426	92%	27%	125,502
SAMN06284307	NA	day 3, hydrogen cyanamide treatment (Prunus avium, SAMN06284307)	39,534,236	92%	27%	126,711
SAMN06284308	NA	day 3, hydrogen cyanamide treatment (Prunus avium, SAMN06284308)	41,703,060	92%	27%	125,793
SAMN06284309	NA	day 3, hydrogen cyanamide treatment (Prunus avium, SAMN06284309)	41,620,576	92%	27%	125,863
SAMN06284310	NA	day 1, hydrogen cyanamide treatment (Prunus avium, SAMN06284310)	42,725,290	92%	26%	123,533
SAMN06284311	NA	day 1, hydrogen cyanamide treatment (Prunus avium, SAMN06284311)	40,747,708	92%	26%	123,351
SAMN06284312	NA	day 1, hydrogen cyanamide treatment (Prunus avium, SAMN06284312)	39,058,558	91%	26%	123,265
SAMN06284313	NA	day 6, water treatment (Prunus avium, SAMN06284313)	48,030,508	91%	26%	128,258
SAMN06284314	NA	day 6, water treatment (Prunus avium, SAMN06284314)	44,590,342	90%	26%	127,490
SAMN06284315	NA	day 6, water treatment (Prunus avium, SAMN06284315)	47,646,660	89%	26%	126,942
SAMN06284316	NA	day 3, water treatment (Prunus avium, SAMN06284316)	43,432,664	90%	26%	125,593
SAMN06284317	NA	day 3, water treatment (Prunus avium, SAMN06284317)	46,769,998	90%	26%	126,660
SAMN06284318	NA	day 3, water treatment (Prunus avium, SAMN06284318)	35,896,830	91%	26%	122,026
SAMN06284319	NA	day 1, water treatment (Prunus avium, SAMN06284319)	38,364,180	90%	26%	123,721
SAMN06284320	NA	day 1, water treatment (Prunus avium, SAMN06284320)	40,484,608	91%	26%	124,099
SAMN06284321	NA	day 1, water treatment (Prunus avium, SAMN06284321)	36,319,592	92%	26%	122,201
SAMN06284322	NA	day 0, before treatment (Prunus avium, SAMN06284322)	37,038,386	90%	25%	119,862
SAMN06284323	NA	day 0, before treatment (Prunus avium, SAMN06284323)	38,778,346	91%	25%	119,495
SAMN06284324	NA	day 0, before treatment (Prunus avium, SAMN06284324)	40,865,916	92%	25%	120,491

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
DRR002282	DRX001700	DRP000624	SAMD00016758	45,228,316	69%	17%
DRR061600	DRX055841	DRP003671	SAMD00052828	8,204,192	7%	53%
DRR061601	DRX055842	DRP003671	SAMD00052828	4,329,310	7%	55%
DRR061602	DRX055843	DRP003671	SAMD00052828	6,564,850	7%	49%
DRR061603	DRX055844	DRP003671	SAMD00052828	3,723,188	5%	58%
DRR061604	DRX055845	DRP003671	SAMD00052828	3,420,010	5%	61%
DRR061605	DRX055846	DRP003671	SAMD00052828	14,010,272	8%	54%
DRR061606	DRX055847	DRP003671	SAMD00052828	8,637,484	7%	54%
DRR061607	DRX055848	DRP003671	SAMD00052828	11,899,098	6%	54%
SRR345674	SRX097690	SRP008277	SAMN00722067	72,325	24%	5%
SRR345675	SRX097691	SRP008277	SAMN00722068	80,160	24%	5%
SRR345676	SRX097694	SRP008277	SAMN00722069	105,510	33%	5%
SRR345678	SRX097695	SRP008277	SAMN00722070	40,238	33%	8%
SRR345679	SRX097692	SRP008277	SAMN00722071	105,128	34%	5%
SRR345680	SRX097693	SRP008277	SAMN00722072	46,069	29%	4%
SRR345681	SRX097696	SRP008277	SAMN00722073	71,286	37%	6%
SRR346061	SRX097697	SRP008277	SAMN00722074	59,736	23%	3%
SRR866155	SRX283805	SRP011083	SAMN02146682	22,151,715	58%	18%
SRR866211	SRX283806	SRP011083	SAMN02146683	27,657,098	65%	14%
SRR866212	SRX283807	SRP011083	SAMN02146683	15,957,750	64%	17%
SRR866215	SRX283808	SRP011083	SAMN02146684	25,452,281	67%	14%
SRR866216	SRX283812	SRP011083	SAMN02146684	31,677,976	62%	15%
SRR866217	SRX283813	SRP011083	SAMN02146685	21,085,477	68%	14%
SRR866218	SRX283815	SRP011083	SAMN02146685	16,966,105	66%	14%
SRR866219	SRX283819	SRP011083	SAMN02146686	23,645,943	62%	16%
SRR866223	SRX283833	SRP011083	SAMN02146687	23,537,712	50%	13%
SRR866224	SRX283834	SRP011083	SAMN02146687	24,828,722	53%	13%
SRR866225	SRX283835	SRP011083	SAMN02146688	32,845,285	48%	16%
SRR866227	SRX283836	SRP011083	SAMN02146688	34,715,413	53%	15%
SRR866228	SRX283837	SRP011083	SAMN02146689	33,357,061	48%	16%
SRR866229	SRX283838	SRP011083	SAMN02146690	26,040,130	47%	15%
SRR866230	SRX283839	SRP011083	SAMN02146690	32,900,895	43%	17%
SRR866231	SRX283841	SRP011083	SAMN02146691	28,792,331	51%	16%
SRR866233	SRX283842	SRP011083	SAMN02146692	28,007,802	55%	17%
SRR866234	SRX283843	SRP011083	SAMN02146692	39,825,534	51%	17%
SRR866235	SRX283844	SRP011083	SAMN02146693	25,232,045	59%	16%
SRR866236	SRX283845	SRP011083	SAMN02146694	26,279,661	61%	14%
SRR866238	SRX283846	SRP011083	SAMN02146694	30,300,425	75%	16%
SRR866239	SRX283847	SRP011083	SAMN02146695	25,865,574	64%	18%
SRR866240	SRX283848	SRP011083	SAMN02146696	25,653,752	68%	15%
SRR866241	SRX283849	SRP011083	SAMN02146697	31,068,305	69%	14%
SRR767738	SRX246438	SRP020000	SAMN01925710	165,657	66%	58%
SRR767739	SRX246439	SRP020000	SAMN01925710	208,854	70%	58%
SRR767740	SRX246440	SRP020000	SAMN01925711	143,988	60%	60%
SRR767741	SRX246441	SRP020000	SAMN01925711	150,141	69%	53%
SRR953079	SRX335689	SRP028860	SAMN02316507	30,196,062	83%	8%
SRR953080	SRX335689	SRP028860	SAMN02316507	29,285,087	85%	8%
SRR953081	SRX335689	SRP028860	SAMN02316507	26,533,808	83%	8%
SRR953082	SRX335689	SRP028860	SAMN02316507	25,133,585	84%	8%
SRR1532588	SRX666282	SRP044388	SAMN02941158	70,446,308	90%	23%
SRR1532598	SRX666283	SRP044388	SAMN02941159	67,133,404	89%	22%
SRR5219278	SRX2529419	SRP098624	SAMN06284304	33,687,530	93%	27%
SRR5219277	SRX2529418	SRP098624	SAMN06284305	41,185,698	93%	27%
SRR5219276	SRX2529417	SRP098624	SAMN06284306	36,691,426	92%	27%
SRR5219275	SRX2529416	SRP098624	SAMN06284307	39,534,236	92%	27%
SRR5219274	SRX2529415	SRP098624	SAMN06284308	41,703,060	92%	27%
SRR5219273	SRX2529414	SRP098624	SAMN06284309	41,620,576	92%	27%
SRR5219272	SRX2529413	SRP098624	SAMN06284310	42,725,290	92%	26%
SRR5219271	SRX2529412	SRP098624	SAMN06284311	40,747,708	92%	26%
SRR5219270	SRX2529411	SRP098624	SAMN06284312	39,058,558	91%	26%
SRR5219269	SRX2529410	SRP098624	SAMN06284313	48,030,508	91%	26%
SRR5219268	SRX2529409	SRP098624	SAMN06284314	44,590,342	90%	26%
SRR5219267	SRX2529408	SRP098624	SAMN06284315	47,646,660	89%	26%
SRR5219266	SRX2529407	SRP098624	SAMN06284316	43,432,664	90%	26%
SRR5219265	SRX2529406	SRP098624	SAMN06284317	46,769,998	90%	26%
SRR5219264	SRX2529405	SRP098624	SAMN06284318	35,896,830	91%	26%
SRR5219263	SRX2529404	SRP098624	SAMN06284319	38,364,180	90%	26%
SRR5219262	SRX2529403	SRP098624	SAMN06284320	40,484,608	91%	26%
SRR5219261	SRX2529402	SRP098624	SAMN06284321	36,319,592	92%	26%
SRR5219260	SRX2529401	SRP098624	SAMN06284322	37,038,386	90%	25%
SRR5219259	SRX2529400	SRP098624	SAMN06284323	38,778,346	91%	25%
SRR5219258	SRX2529399	SRP098624	SAMN06284324	40,865,916	92%	25%
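
The per-run read counts roll up to the per-sample read counts: a sample sequenced in several runs has a read count equal to the sum of its runs. The sketch below illustrates this for two samples taken from the table above; the function name `reads_per_sample` and the tuple layout are choices made for this example.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the per-run table.
runs = [
    ("SRR767738", "SAMN01925710", 165657),
    ("SRR767739", "SAMN01925710", 208854),
    ("SRR767740", "SAMN01925711", 143988),
    ("SRR767741", "SAMN01925711", 150141),
]


def reads_per_sample(run_rows):
    """Sum the read counts of all runs that belong to the same sample."""
    totals = defaultdict(int)
    for _run, sample, n_reads in run_rows:
        totals[sample] += n_reads
    return dict(totals)


totals = reads_per_sample(runs)
print(totals)
```

The two totals, 374,511 and 294,129, match the read counts listed for SAMN01925710 and SAMN01925711 in the per-sample statistics.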

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pyrus x bretschneideri high-quality model RefSeq (XP_)	18,833	18,380 (97.59%)	18,380 (97.59%)	74.00%	84.46%
Arabidopsis thaliana GenBank	53,516	49,537 (92.56%)	49,537 (92.56%)	68.62%	74.94%
Arabidopsis thaliana known RefSeq (NP_)	48,148	42,523 (88.32%)	42,523 (88.32%)	66.49%	70.22%
Rosaceae GenBank	5,421	5,314 (98.03%)	5,314 (98.03%)	73.36%	85.43%
Rosaceae known RefSeq (NP_)	857	851 (99.30%)	851 (99.30%)	72.92%	82.78%
Malus domestica high-quality model RefSeq (XP_)	18,864	18,465 (97.88%)	18,465 (97.88%)	73.74%	84.54%
Same-species GenBank	170	168 (98.82%)	168 (98.82%)	81.03%	90.29%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
