NCBI Sipha flava Annotation Release 100

The RefSeq genome records for Sipha flava were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Sipha flava Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jul 5 2018
Date of submission of annotation to the public databases: Jul 6 2018
Software version: 8.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
YSA_version1	GCF_003268045.1	USDA-ARS Center for Grain and Animal Health Research	06-26-2018	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	YSA_version1
Genes and pseudogenes	15,476
protein-coding	13,575
non-coding	976
transcribed pseudogenes	0
non-transcribed pseudogenes	925
genes with variants	4,088
immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	21,316
fully-supported	18,438
with > 5% ab initio	2,284
partial	813
with filled gap(s)	411
known RefSeq (NM_)	0
model RefSeq (XM_)	21,316
non-coding RNAs	1,591
fully-supported	1,341
with > 5% ab initio	0
partial	3
with filled gap(s)	2
known RefSeq (NR_)	0
model RefSeq (XR_)	1,410
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	21,316
fully-supported	18,438
with > 5% ab initio	2,349
partial	690
with major correction(s)	175
known RefSeq (NP_)	0
model RefSeq (XP_)	21,316
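
The gene subcategories in the table above should add up to the top-level "Genes and pseudogenes" count. The following is a minimal sketch of that check, assuming the counts are copied by hand from the table; the dictionary keys and the function name are illustrative and are not part of any NCBI tool. The immunoglobulin/T-cell receptor and "other" rows are omitted because both are 0 here.

```python
# Counts copied from the Feature counts table for YSA_version1.
# The key names are hypothetical labels chosen for this sketch.
counts = {
    "genes_and_pseudogenes": 15476,
    "protein_coding": 13575,
    "non_coding": 976,
    "transcribed_pseudogenes": 0,
    "non_transcribed_pseudogenes": 925,
    "mRNAs": 21316,
}


def gene_total(c):
    """Sum the gene subcategories that should equal the top-level count."""
    return (
        c["protein_coding"]
        + c["non_coding"]
        + c["transcribed_pseudogenes"]
        + c["non_transcribed_pseudogenes"]
    )


total = gene_total(counts)
# Average number of mRNAs per protein-coding gene, derived from the table.
mrna_per_coding_gene = counts["mRNAs"] / counts["protein_coding"]

print(total == counts["genes_and_pseudogenes"])
print(round(mrna_per_coding_gene, 2))
```

The same pattern can be extended to other rows, for example checking that the mRNA and CDS counts match.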

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	14,551	13,001	3,946	41	577,774
All transcripts	22,907	2,112	1,669	41	28,191
mRNA	21,316	2,189	1,727	185	28,191
misc_RNA	388	1,748	1,573	137	8,179
tRNA	181	74	73	71	84
lncRNA	953	1,061	827	98	9,453
snoRNA	13	111	78	41	206
snRNA	38	136	127	91	194
rRNA	18	119	119	119	119
Single-exon transcripts	1,423	1,048	777	220	7,215
coding transcripts (NM_/XM_ )	1,423	1,048	777	220	7,215
CDSs	21,316	1,695	1,260	156	27,333
Exons	107,461	273	179	2	13,352
in coding transcripts (NM_/XM_ )	104,014	272	179	2	13,352
in non-coding transcripts (NR_/XR_ )	4,897	277	171	3	6,521
Introns	90,584	2,393	159	30	477,319
in coding transcripts (NM_/XM_ )	88,170	2,402	157	30	477,319
in non-coding transcripts (NR_/XR_ )	3,770	2,441	252	47	123,328

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.58	1	1	50
Number of exons per transcript	8.33	6	1	146

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Drosophila melanogaster known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Of the 13,575 coding genes, 7,993 had a protein with an alignment covering 50% or more of the query, and 2,331 had a protein with an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
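
To make the coverage definitions concrete, here is a small sketch that computes coverage as aligned length divided by sequence length. It assumes 1-based inclusive alignment coordinates and a single ungapped aligned span, which is a simplification of real BLASTP output; the function name and the example coordinates are hypothetical. The last two lines express the gene counts quoted above as percentages of the 13,575 coding genes.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percent of a sequence included in an alignment.

    Coordinates are assumed to be 1-based and inclusive, and the alignment
    is treated as one contiguous span (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-410 of a 500-residue annotated protein
# (query) aligned to residues 1-400 of a 420-residue target protein.
query_cov = coverage_percent(11, 410, 500)
target_cov = coverage_percent(1, 400, 420)

# Share of coding genes reaching each query-coverage threshold in this report.
share_at_50 = round(100.0 * 7993 / 13575, 2)
share_at_95 = round(100.0 * 2331 / 13575, 2)

print(query_cov, round(target_cov, 2), share_at_50, share_at_95)
```

In this example the query coverage (80%) is lower than the target coverage (about 95%), which illustrates why the report tracks the two measures separately.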

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Drosophila melanogaster known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
YSA_version1	GCF_003268045.1	4.58%	43.69%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

No transcript evidence was used in this annotation.

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	2,205,510,414	78%	29%	116,908
SAMN08936755	Whole body (Sipha flava, SAMN08936755)	68,484,338	78%	29%	90,629
SAMN08936756	Whole body (Sipha flava, SAMN08936756)	51,471,254	80%	29%	88,801
SAMN08936757	Whole body (Sipha flava, SAMN08936757)	46,798,200	79%	29%	88,851
SAMN08936758	Whole body (Sipha flava, SAMN08936758)	62,085,714	80%	29%	90,699
SAMN08936759	Whole body (Sipha flava, SAMN08936759)	62,752,412	77%	29%	90,502
SAMN08936760	Whole body (Sipha flava, SAMN08936760)	60,171,142	79%	29%	90,498
SAMN08936761	Whole body (Sipha flava, SAMN08936761)	49,748,140	74%	28%	88,943
SAMN08936762	Whole body (Sipha flava, SAMN08936762)	75,351,094	77%	29%	91,705
SAMN08936763	Whole body (Sipha flava, SAMN08936763)	71,024,534	76%	29%	91,954
SAMN08936764	Whole body (Sipha flava, SAMN08936764)	65,824,072	80%	29%	90,808
SAMN08936765	Whole body (Sipha flava, SAMN08936765)	63,329,330	79%	29%	90,885
SAMN08936766	Whole body (Sipha flava, SAMN08936766)	72,079,214	80%	29%	91,331
SAMN08936767	Whole body (Sipha flava, SAMN08936767)	64,153,206	81%	30%	91,823
SAMN08936768	Whole body (Sipha flava, SAMN08936768)	61,461,326	76%	29%	91,122
SAMN08936769	Whole body (Sipha flava, SAMN08936769)	63,223,184	79%	29%	90,896
SAMN08936770	Whole body (Sipha flava, SAMN08936770)	54,162,896	79%	29%	89,983
SAMN08936771	Whole body (Sipha flava, SAMN08936771)	55,210,998	80%	29%	89,459
SAMN08936772	Whole body (Sipha flava, SAMN08936772)	74,166,982	76%	29%	91,801
SAMN08936773	Whole body (Sipha flava, SAMN08936773)	62,351,732	77%	29%	90,445
SAMN08936774	Whole body (Sipha flava, SAMN08936774)	47,615,742	77%	29%	89,309
SAMN08936775	Whole body (Sipha flava, SAMN08936775)	64,653,588	75%	29%	90,198
SAMN08936776	Whole body (Sipha flava, SAMN08936776)	70,696,568	74%	29%	91,386
SAMN08936777	Whole body (Sipha flava, SAMN08936777)	54,035,620	71%	29%	89,378
SAMN08936778	Whole body (Sipha flava, SAMN08936778)	73,786,942	77%	29%	92,307
SAMN08936779	Whole body (Sipha flava, SAMN08936779)	51,366,276	80%	29%	89,647
SAMN08936780	Whole body (Sipha flava, SAMN08936780)	63,319,976	78%	29%	91,147
SAMN08936781	Whole body (Sipha flava, SAMN08936781)	55,011,408	79%	29%	90,175
SAMN08936782	Whole body (Sipha flava, SAMN08936782)	52,607,510	79%	30%	90,057
SAMN08936783	Whole body (Sipha flava, SAMN08936783)	54,082,220	77%	29%	89,985
SAMN08936784	Whole body (Sipha flava, SAMN08936784)	62,943,156	78%	30%	91,665
SAMN08936785	Whole body (Sipha flava, SAMN08936785)	60,398,922	79%	29%	91,133
SAMN08936786	Whole body (Sipha flava, SAMN08936786)	50,828,700	78%	30%	89,405
SAMN08936787	Whole body (Sipha flava, SAMN08936787)	64,203,014	77%	30%	92,002
SAMN08936788	Whole body (Sipha flava, SAMN08936788)	60,927,946	76%	29%	91,081
SAMN08936789	Whole body (Sipha flava, SAMN08936789)	73,186,532	77%	29%	92,706
SAMN08936790	Whole body (Sipha flava, SAMN08936790)	61,996,526	75%	30%	91,091
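
The by-sample table above is tab-separated, so it can be loaded with standard tools. The sketch below parses three rows copied from the table and computes a read-weighted percentage of aligned reads. The function names are hypothetical, and because the table reports percentages rounded to whole numbers, the weighted figure is only an approximation.

```python
import csv
import io

# Three rows copied from the by-sample table; the other rows share this layout.
SAMPLE_ROWS = (
    "SAMN08936755\tWhole body (Sipha flava, SAMN08936755)\t68,484,338\t78%\t29%\t90,629\n"
    "SAMN08936756\tWhole body (Sipha flava, SAMN08936756)\t51,471,254\t80%\t29%\t88,801\n"
    "SAMN08936757\tWhole body (Sipha flava, SAMN08936757)\t46,798,200\t79%\t29%\t88,851\n"
)


def parse_sample_rows(text):
    """Turn tab-separated table rows into dictionaries with numeric fields."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        sample_id, _track, reads, pct_aligned, pct_introns, introns = fields
        rows.append({
            "sample": sample_id,
            "reads": int(reads.replace(",", "")),
            "pct_aligned": int(pct_aligned.rstrip("%")),
            "pct_with_introns": int(pct_introns.rstrip("%")),
            "introns": int(introns.replace(",", "")),
        })
    return rows


def weighted_percent_aligned(rows):
    """Percent of aligned reads, weighting each sample by its read count."""
    total_reads = sum(r["reads"] for r in rows)
    weighted_sum = sum(r["reads"] * r["pct_aligned"] for r in rows)
    return weighted_sum / total_reads


rows = parse_sample_rows(SAMPLE_ROWS)
total_reads = sum(r["reads"] for r in rows)
print(total_reads, round(weighted_percent_aligned(rows), 1))
```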

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR7009168	SRX3941661	SRP140496	SAMN08936755	68,484,338	78%	29%
SRR7009167	SRX3941662	SRP140496	SAMN08936756	51,471,254	80%	29%
SRR7009166	SRX3941663	SRP140496	SAMN08936757	46,798,200	79%	29%
SRR7009165	SRX3941664	SRP140496	SAMN08936758	62,085,714	80%	29%
SRR7009172	SRX3941657	SRP140496	SAMN08936759	62,752,412	77%	29%
SRR7009171	SRX3941658	SRP140496	SAMN08936760	60,171,142	79%	29%
SRR7009170	SRX3941659	SRP140496	SAMN08936761	49,748,140	74%	28%
SRR7009169	SRX3941660	SRP140496	SAMN08936762	75,351,094	77%	29%
SRR7009174	SRX3941655	SRP140496	SAMN08936763	71,024,534	76%	29%
SRR7009173	SRX3941656	SRP140496	SAMN08936764	65,824,072	80%	29%
SRR7009176	SRX3941653	SRP140496	SAMN08936765	63,329,330	79%	29%
SRR7009175	SRX3941654	SRP140496	SAMN08936766	72,079,214	80%	29%
SRR7009178	SRX3941651	SRP140496	SAMN08936767	64,153,206	81%	30%
SRR7009177	SRX3941652	SRP140496	SAMN08936768	61,461,326	76%	29%
SRR7009180	SRX3941649	SRP140496	SAMN08936769	63,223,184	79%	29%
SRR7009179	SRX3941650	SRP140496	SAMN08936770	54,162,896	79%	29%
SRR7009182	SRX3941647	SRP140496	SAMN08936771	55,210,998	80%	29%
SRR7009181	SRX3941648	SRP140496	SAMN08936772	74,166,982	76%	29%
SRR7009184	SRX3941645	SRP140496	SAMN08936773	62,351,732	77%	29%
SRR7009183	SRX3941646	SRP140496	SAMN08936774	47,615,742	77%	29%
SRR7009159	SRX3941670	SRP140496	SAMN08936775	64,653,588	75%	29%
SRR7009160	SRX3941669	SRP140496	SAMN08936776	70,696,568	74%	29%
SRR7009161	SRX3941668	SRP140496	SAMN08936777	54,035,620	71%	29%
SRR7009162	SRX3941667	SRP140496	SAMN08936778	73,786,942	77%	29%
SRR7009155	SRX3941674	SRP140496	SAMN08936779	51,366,276	80%	29%
SRR7009156	SRX3941673	SRP140496	SAMN08936780	63,319,976	78%	29%
SRR7009157	SRX3941672	SRP140496	SAMN08936781	55,011,408	79%	29%
SRR7009158	SRX3941671	SRP140496	SAMN08936782	52,607,510	79%	30%
SRR7009163	SRX3941666	SRP140496	SAMN08936783	54,082,220	77%	29%
SRR7009164	SRX3941665	SRP140496	SAMN08936784	62,943,156	78%	30%
SRR7009186	SRX3941643	SRP140496	SAMN08936785	60,398,922	79%	29%
SRR7009185	SRX3941644	SRP140496	SAMN08936786	50,828,700	78%	30%
SRR7009190	SRX3941639	SRP140496	SAMN08936787	64,203,014	77%	30%
SRR7009189	SRX3941640	SRP140496	SAMN08936788	60,927,946	76%	29%
SRR7009188	SRX3941641	SRP140496	SAMN08936789	73,186,532	77%	29%
SRR7009187	SRX3941642	SRP140496	SAMN08936790	61,996,526	75%	30%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Myzus persicae high-quality model RefSeq (XP_)	11,299	10,183 (90.12%)	10,183 (90.12%)	75.02%	83.26%
Insecta GenBank	95,701	64,914 (67.83%)	64,914 (67.83%)	66.24%	66.32%
Acyrthosiphon pisum high-quality model RefSeq (XP_)	11,047	9,828 (88.97%)	9,828 (88.97%)	73.03%	79.68%
Drosophila melanogaster known RefSeq (NP_)	30,031	17,808 (59.30%)	17,808 (59.30%)	62.45%	53.42%
Melanaphis sacchari high-quality model RefSeq (XP_)	9,388	8,567 (91.25%)	8,567 (91.25%)	77.28%	86.84%
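
The percentages in the table above follow from the two count columns. As a quick check, the sketch below recomputes the share of retrieved sequences aligned by ProSplign for each source; the dictionary layout and function name are illustrative assumptions, and the counts are copied from the table.

```python
# (sequences retrieved from Entrez, sequences aligned by ProSplign) per source,
# copied from the Protein alignments table.
protein_sources = {
    "Myzus persicae": (11299, 10183),
    "Insecta GenBank": (95701, 64914),
    "Acyrthosiphon pisum": (11047, 9828),
    "Drosophila melanogaster": (30031, 17808),
    "Melanaphis sacchari": (9388, 8567),
}


def percent_aligned(retrieved, aligned):
    """Percentage of retrieved sequences that were aligned, to two decimals."""
    return round(100.0 * aligned / retrieved, 2)


aligned_pct = {
    source: percent_aligned(retrieved, aligned)
    for source, (retrieved, aligned) in protein_sources.items()
}

for source, pct in aligned_pct.items():
    print(f"{source}: {pct:.2f}%")
```

The recomputed values agree with the table, with the three aphid RefSeq sets aligning at the highest rates and Drosophila melanogaster at the lowest.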

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20