NCBI Sitophilus oryzae Annotation Release 100

The RefSeq genome records for Sitophilus oryzae were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Sitophilus oryzae Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Sep 18 2019
Date of submission of annotation to the public databases: Sep 20 2019
Software version: 8.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Soryzae_2.0	GCF_002938485.1	BF2I	09-09-2019	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Soryzae_2.0
Genes and pseudogenes	17,957
protein-coding	15,057
non-coding	2,541
transcribed pseudogenes	4
non-transcribed pseudogenes	355
genes with variants	4,387
immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	23,485
fully-supported	20,331
with > 5% ab initio	2,385
partial	425
with filled gap(s)	127
known RefSeq (NM_)	0
model RefSeq (XM_)	23,485
non-coding RNAs	3,452
fully-supported	3,065
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	3,174
pseudo transcripts	4
fully-supported	3
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	4
CDSs	23,498
fully-supported	20,331
with > 5% ab initio	2,473
partial	401
with major correction(s)	331
known RefSeq (NP_)	0
model RefSeq (XP_)	23,498
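The gene counts in the table above can be cross-checked with simple arithmetic. The following is a minimal sketch, with values copied by hand from the table and variable names chosen for illustration only. It assumes that "genes with variants" is an overlapping property rather than a separate gene category, so that row is left out of the sum.

```python
# Values copied by hand from the Feature counts table (Soryzae_2.0).
# The dictionary keys and variable names are illustrative only.
gene_counts = {
    "protein-coding": 15_057,
    "non-coding": 2_541,
    "transcribed pseudogenes": 4,
    "non-transcribed pseudogenes": 355,
    "immunoglobulin/T-cell receptor gene segments": 0,
    "other": 0,
}
reported_total = 17_957  # "Genes and pseudogenes" row

# Assumption: the categories above are mutually exclusive and exhaustive.
computed_total = sum(gene_counts.values())

# Genes excluding pseudogenes, comparable to the "Genes" row of the
# Feature lengths table.
genes_without_pseudogenes = (
    gene_counts["protein-coding"] + gene_counts["non-coding"]
)

print("Categories sum to reported total:", computed_total == reported_total)
print("Genes excluding pseudogenes:", genes_without_pseudogenes)
```

Under that assumption the categories add up to the reported total, and the count without pseudogenes agrees with the gene count used in the Feature lengths table.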

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	17,598	24,578	4,663	63	1,088,837
All transcripts	26,937	2,281	1,728	52	77,249
mRNA	23,485	2,467	1,880	280	77,249
misc_RNA	345	2,628	2,119	142	25,664
tRNA	276	73	73	63	84
lncRNA	2,720	941	682	52	6,801
snoRNA	12	126	85	67	216
snRNA	52	139	140	68	194
rRNA	47	200	118	118	2,015
Single-exon transcripts	1,612	1,147	876	293	6,764
coding transcripts (NM_/XM_ )	1,612	1,147	876	293	6,764
CDSs	23,498	1,831	1,332	156	76,149
Exons	117,380	311	195	1	23,878
in coding transcripts (NM_/XM_ )	109,301	313	196	1	23,878
in non-coding transcripts (NR_/XR_ )	9,831	272	173	2	6,403
Introns	98,323	4,866	640	30	591,319
in coding transcripts (NM_/XM_ )	92,804	5,044	747	30	591,319
in non-coding transcripts (NR_/XR_ )	7,185	2,380	208	30	591,319

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.54	1	1	47
Number of exons per transcript	7.79	6	1	89

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Drosophila melanogaster known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 15,044 coding genes, 10,072 genes had a protein with an alignment covering 50% or more of the query and 3,182 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
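The two coverage definitions can be written as a single ratio applied to either sequence. The sketch below is illustrative only: the function name and the example alignment lengths are hypothetical, and the gene counts are the ones quoted in the text above.

```python
def coverage_percent(aligned_length: int, sequence_length: int) -> float:
    """Percentage of a sequence's length that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: 450 aligned residues on a 500-residue annotated
# protein (query) and 450 aligned residues on a 600-residue target protein.
query_coverage = coverage_percent(450, 500)
target_coverage = coverage_percent(450, 600)
print(f"query coverage:  {query_coverage:.1f}%")
print(f"target coverage: {target_coverage:.1f}%")

# Share of coding genes reaching each query-coverage threshold,
# using the counts quoted in the text.
coding_genes = 15_044
genes_cov_50 = 10_072
genes_cov_95 = 3_182
share_cov_50 = 100.0 * genes_cov_50 / coding_genes
share_cov_95 = 100.0 * genes_cov_95 / coding_genes
print(f"genes with >= 50% query coverage: {share_cov_50:.2f}%")
print(f"genes with >= 95% query coverage: {share_cov_95:.2f}%")
```

In a real alignment the aligned span on the query and on the target can differ because of gaps, which is why the function takes each aligned length separately.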

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Drosophila melanogaster known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Soryzae_2.0	GCF_002938485.1	1.86%	57.95%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	93	92 (98.92%)	86 (92.47%)	99.38%	97.38%
Same-species EST	25,745	22,079 (85.76%)	20,911 (81.22%)	99.46%	97.05%
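The percentages in the table above appear to be computed relative to the number of sequences retrieved from Entrez, for both the Splign and the Gnomon columns. A minimal sketch that recomputes them from the counts, with hypothetical names and values copied by hand from the table:

```python
# (source, retrieved from Entrez, aligned by Splign, passed to Gnomon)
# Values copied by hand from the Transcript alignments table.
transcript_rows = [
    ("Same-species Genbank", 93, 92, 86),
    ("Same-species EST", 25_745, 22_079, 20_911),
]


def percent_of(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


recomputed = {}
for source, retrieved, aligned, passed in transcript_rows:
    # Assumption: both percentages use the retrieved count as denominator.
    recomputed[source] = (
        percent_of(aligned, retrieved),
        percent_of(passed, retrieved),
    )
    print(source, recomputed[source])
```

The recomputed values can then be compared against the parenthesized percentages in the table.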

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,104,942,414	75%	17%	135,313
SAMEA104592433	NA	LS2-1 (Sitophilus oryzae, SAMEA104592433)	38,363,456	77%	25%	101,261
SAMEA104592484	NA	QSO335-1 (Sitophilus oryzae, SAMEA104592484)	41,499,116	77%	25%	102,330
SAMEA104592485	NA	Unselected_F6_LS2xQSO335-1 (Sitophilus oryzae, SAMEA104592485)	36,602,724	78%	24%	101,008
SAMEA104592486	NA	Selected_F6_LS2xQSO335-1 (Sitophilus oryzae, SAMEA104592486)	36,326,490	80%	24%	94,899
SAMEA104717792	NA	NNSO7525-1 (Sitophilus oryzae, SAMEA104717792)	18,912,771	157%	25%	101,954
SAMEA104717793	NA	So_Santai-1 (Sitophilus oryzae, SAMEA104717793)	20,806,967	148%	26%	101,527
SAMEA104717794	NA	So_Sangrur-1 (Sitophilus oryzae, SAMEA104717794)	18,035,264	154%	25%	101,412
SAMN00009522	21179425	midgut (Sitophilus oryzae, SAMN00009522)	926,752	66%	59%	57,321
SAMN03703435	NA	midgut (Sitophilus oryzae, SAMN03703435)	53,348,066	79%	9%	79,009
SAMN03703436	NA	midgut (Sitophilus oryzae, SAMN03703436)	43,505,760	78%	9%	75,898
SAMN03703437	NA	midgut (Sitophilus oryzae, SAMN03703437)	74,964,090	79%	9%	79,538
SAMN03703439	NA	midgut (Sitophilus oryzae, SAMN03703439)	57,385,718	79%	9%	76,889
SAMN03703440	NA	midgut (Sitophilus oryzae, SAMN03703440)	42,867,068	80%	9%	71,966
SAMN03703441	NA	midgut (Sitophilus oryzae, SAMN03703441)	41,692,220	80%	9%	71,006
SAMN05790886	NA	Adult (Sitophilus oryzae, SAMN05790886)	79,745,936	80%	38%	105,277
SAMN05790887	NA	Adult (Sitophilus oryzae, SAMN05790887)	54,379,332	80%	37%	108,376
SAMN08564083	NA	whole larvae (Sitophilus oryzae, SAMN08564083)	61,543,562	61%	8%	87,621
SAMN08564084	NA	whole larvae (Sitophilus oryzae, SAMN08564084)	66,720,240	59%	8%	86,159
SAMN08564085	NA	whole larvae (Sitophilus oryzae, SAMN08564085)	54,957,466	61%	7%	84,050
SAMN08564086	NA	whole larvae (Sitophilus oryzae, SAMN08564086)	46,663,572	61%	8%	83,753
SAMN08564087	NA	bacteriome (Sitophilus oryzae, SAMN08564087)	46,776,790	63%	8%	74,499
SAMN08668856	NA	head (Sitophilus oryzae, pooled male and female, SAMN08668856)	168,919,054	54%	17%	100,749
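The "All" row can be checked against the per-sample rows by summing the read counts. The sketch below does this with values copied by hand from the table; the variable names are illustrative.

```python
# Number of reads per sample, copied by hand from the table above.
reads_per_sample = {
    "SAMEA104592433": 38_363_456,
    "SAMEA104592484": 41_499_116,
    "SAMEA104592485": 36_602_724,
    "SAMEA104592486": 36_326_490,
    "SAMEA104717792": 18_912_771,
    "SAMEA104717793": 20_806_967,
    "SAMEA104717794": 18_035_264,
    "SAMN00009522": 926_752,
    "SAMN03703435": 53_348_066,
    "SAMN03703436": 43_505_760,
    "SAMN03703437": 74_964_090,
    "SAMN03703439": 57_385_718,
    "SAMN03703440": 42_867_068,
    "SAMN03703441": 41_692_220,
    "SAMN05790886": 79_745_936,
    "SAMN05790887": 54_379_332,
    "SAMN08564083": 61_543_562,
    "SAMN08564084": 66_720_240,
    "SAMN08564085": 54_957_466,
    "SAMN08564086": 46_663_572,
    "SAMN08564087": 46_776_790,
    "SAMN08668856": 168_919_054,
}
reported_aggregate = 1_104_942_414  # "All" row

total_reads = sum(reads_per_sample.values())
print("Number of samples:", len(reads_per_sample))
print("Sum of per-sample reads:", f"{total_reads:,}")
print("Matches aggregate row:", total_reads == reported_aggregate)
```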

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
ERR2304550	ERX2355860	ERP106257	SAMEA104592433	38,363,456	77%	25%
ERR2304601	ERX2355911	ERP106257	SAMEA104592484	41,499,116	77%	25%
ERR2304602	ERX2355912	ERP106257	SAMEA104592485	36,602,724	78%	24%
ERR2304603	ERX2355913	ERP106257	SAMEA104592486	36,326,490	80%	24%
ERR2442602	ERX2461703	ERP106257	SAMEA104717792	18,912,771	157%	25%
ERR2442603	ERX2461704	ERP106257	SAMEA104717793	20,806,967	148%	26%
ERR2442604	ERX2461705	ERP106257	SAMEA104717794	18,035,264	154%	25%
SRR037006	SRX017240	SRP002043	SAMN00009522	926,752	66%	59%
SRR2034796	SRX1034967	SRP058561	SAMN03703435	53,348,066	79%	9%
SRR2034797	SRX1034968	SRP058561	SAMN03703436	43,505,760	78%	9%
SRR2034798	SRX1034969	SRP058561	SAMN03703437	74,964,090	79%	9%
SRR2034799	SRX1034970	SRP058561	SAMN03703439	57,385,718	79%	9%
SRR2034800	SRX1034971	SRP058561	SAMN03703440	42,867,068	80%	9%
SRR2034801	SRX1034972	SRP058561	SAMN03703441	41,692,220	80%	9%
SRR6748456	SRX3721137	SRP058561	SAMN08564083	61,543,562	61%	8%
SRR6748455	SRX3721138	SRP058561	SAMN08564084	66,720,240	59%	8%
SRR6748458	SRX3721135	SRP058561	SAMN08564085	54,957,466	61%	7%
SRR6748457	SRX3721136	SRP058561	SAMN08564086	46,663,572	61%	8%
SRR6748460	SRX3721133	SRP058561	SAMN08564087	46,776,790	63%	8%
SRR4195872	SRX2147200	SRP087595	SAMN05790886	19,936,484	83%	38%
SRR4195873	SRX2147200	SRP087595	SAMN05790886	19,936,484	76%	38%
SRR4288819	SRX2147200	SRP087595	SAMN05790886	39,872,968	80%	38%
SRR4288810	SRX2147201	SRP087595	SAMN05790887	54,379,332	80%	37%
SRR6830122	SRX3786204	SRP135495	SAMN08668856	168,919,054	54%	17%
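A sample can be sequenced in more than one run, so the per-sample read counts are sums over runs. The following sketch groups a few rows of the run table by sample; the subset of rows and the names used are chosen for illustration, with values copied by hand from the table.

```python
from collections import defaultdict

# (run, sample, number of reads): a subset of the run table above,
# covering the one sample with several runs plus two single-run samples.
runs = [
    ("SRR4195872", "SAMN05790886", 19_936_484),
    ("SRR4195873", "SAMN05790886", 19_936_484),
    ("SRR4288819", "SAMN05790886", 39_872_968),
    ("SRR4288810", "SAMN05790887", 54_379_332),
    ("SRR6830122", "SAMN08668856", 168_919_054),
]

# Sum the read counts of all runs belonging to the same sample.
reads_by_sample = defaultdict(int)
for run_id, sample_id, n_reads in runs:
    reads_by_sample[sample_id] += n_reads

for sample_id, n_reads in reads_by_sample.items():
    print(sample_id, f"{n_reads:,}")
```

For SAMN05790886 the three runs add up to the read count listed for that sample in the by-sample table.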

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Acromyrmex echinatior high-quality model RefSeq (XP_)	8,722	6,792 (77.87%)	6,792 (77.87%)	61.77%	57.52%
Nicrophorus vespilloides high-quality model RefSeq (XP_)	6,488	6,282 (96.82%)	6,282 (96.82%)	65.94%	72.64%
Onthophagus taurus high-quality model RefSeq (XP_)	11,850	10,013 (84.50%)	10,013 (84.50%)	64.04%	63.46%
Insecta GenBank	102,682	75,719 (73.74%)	75,719 (73.74%)	66.28%	65.43%
Same-species GenBank	93	93 (100.00%)	93 (100.00%)	82.69%	91.55%
Tribolium castaneum high-quality model RefSeq (XP_)	7,031	6,788 (96.54%)	6,788 (96.54%)	67.19%	74.22%
Tribolium castaneum known RefSeq (NP_)	627	589 (93.94%)	589 (93.94%)	68.28%	68.24%
Apis mellifera high-quality model RefSeq (XP_)	8,880	7,014 (78.99%)	7,014 (78.99%)	62.17%	58.12%
Apis mellifera known RefSeq (NP_)	528	404 (76.52%)	404 (76.52%)	65.21%	62.52%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
