NCBI Schistocerca cancellata Annotation Release 100

The RefSeq genome records for Schistocerca cancellata were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Schistocerca cancellata Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jul 13 2022
Date of submission of annotation to the public databases: Aug 5 2022
Software version: 10.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
iqSchCanc2.1	GCF_023864275.1	Behavioral Plasticity Research Institute (BPRI)	06-28-2022	Reference	13 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	iqSchCanc2.1
Genes and pseudogenes	103,533
  protein-coding	16,907
  non-coding	80,053
  Transcribed pseudogenes	0
  Non-transcribed pseudogenes	6,571
  genes with variants	4,438
  Immunoglobulin/T-cell receptor gene segments	0
  other	2
mRNAs	26,349
  fully-supported	21,108
  with > 5% ab initio	4,171
  partial	216
  with filled gap(s)	7
  known RefSeq (NM_)	0
  model RefSeq (XM_)	26,349
non-coding RNAs	80,923
  fully-supported	2,165
  with > 5% ab initio	0
  partial	11
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	18,432
pseudo transcripts	0
  fully-supported	0
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	0
CDSs	26,362
  fully-supported	21,108
  with > 5% ab initio	4,415
  partial	217
  with major correction(s)	138
  known RefSeq (NP_)	0
  model RefSeq (XP_)	26,362
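
As a consistency check on the gene rows of the table above, the following minimal Python sketch sums the gene categories and compares the result with the "Genes and pseudogenes" total. It assumes that "genes with variants" is a subset of the other categories rather than a category of its own, which is why it is left out of the sum; the variable names are illustrative only.

```python
# Gene counts copied from the Feature counts table for iqSchCanc2.1.
# Assumption: "genes with variants" overlaps the categories below,
# so it is not included in the sum.
gene_counts = {
    "protein-coding": 16_907,
    "non-coding": 80_053,
    "Transcribed pseudogenes": 0,
    "Non-transcribed pseudogenes": 6_571,
    "Immunoglobulin/T-cell receptor gene segments": 0,
    "other": 2,
}
reported_total = 103_533  # "Genes and pseudogenes" row

category_sum = sum(gene_counts.values())
pseudogenes = (
    gene_counts["Transcribed pseudogenes"]
    + gene_counts["Non-transcribed pseudogenes"]
)
genes_without_pseudogenes = category_sum - pseudogenes

print(category_sum == reported_total)  # True
print(genes_without_pseudogenes)       # 96962
```

The second value matches the gene count of 96,962 given in the Feature lengths table, which excludes pseudogenes.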

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	96,962	25,613	75	63	2,258,738
All transcripts	107,272	865	76	63	61,354
  mRNA	26,349	2,802	2,137	90	61,354
  misc_RNA	642	2,150	1,730	214	9,350
  tRNA	62,489	74	73	63	105
  lncRNA	1,523	1,136	645	110	12,047
  snoRNA	294	199	207	63	305
  snRNA	1,126	135	119	88	200
  rRNA	14,847	738	119	117	4,288
Single-exon transcripts	1,206	1,303	912	291	11,334
  coding transcripts (NM_/XM_)	1,206	1,303	912	291	11,334
CDSs	26,362	1,704	1,299	90	60,339
Exons	147,292	325	169	2	16,023
  in coding transcripts (NM_/XM_)	142,180	324	170	2	16,023
  in non-coding transcripts (NR_/XR_)	7,475	333	156	10	9,154
Introns	128,741	21,885	8,982	30	598,784
  in coding transcripts (NM_/XM_)	125,107	21,598	9,034	30	597,304
  in non-coding transcripts (NR_/XR_)	5,757	27,214	7,460	30	598,784

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.3	1	1	50
Number of exons per transcript	5.84	3	1	157

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein for each gene and the insecta_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit, and presented in BUSCO notation.

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Drosophila melanogaster known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 16,894 coding genes, 11,866 genes had a protein with an alignment covering 50% or more of the query and 3,163 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
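
The two definitions above can be illustrated with a short Python sketch. It assumes a single alignment with 1-based, inclusive coordinates on each sequence; the function names and the example coordinates are hypothetical and are not taken from the pipeline or from this annotation run.

```python
def coverage_percent(start: int, end: int, sequence_length: int) -> float:
    """Percentage of a sequence spanned by an alignment.

    Assumes 1-based, inclusive coordinates with start <= end.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= start <= end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_length = end - start + 1
    return 100.0 * aligned_length / sequence_length


def query_and_target_coverage(q_start, q_end, q_len, t_start, t_end, t_len):
    """Return (query coverage, target coverage) for one alignment."""
    return (
        coverage_percent(q_start, q_end, q_len),
        coverage_percent(t_start, t_end, t_len),
    )


# Hypothetical example: a 400-residue annotated protein (query) aligned over
# residues 1-300 to residues 51-350 of a 500-residue target protein.
query_cov, target_cov = query_and_target_coverage(1, 300, 400, 51, 350, 500)
print(query_cov, target_cov)  # 75.0 60.0
```

In this hypothetical case the alignment would count toward the "50% or more of the query" threshold but not the "95% or more" threshold.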

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Drosophila melanogaster known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with RepeatMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
iqSchCanc2.1	GCF_023864275.1	70.93%	52.98%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species TSA	621,984	450,448 (72.42%)	306,193 (49.23%)	98.72%	99.04%
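
The percentages in the row above can be reproduced from the raw counts. The sketch below assumes that both percentages use the number of sequences retrieved from Entrez as the denominator, which is consistent with the reported values; the helper name `percent` is illustrative.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts copied from the Same-species TSA row.
retrieved = 621_984
aligned_by_splign = 450_448
passed_to_gnomon = 306_193

aligned_pct = percent(aligned_by_splign, retrieved)
passed_pct = percent(passed_to_gnomon, retrieved)

print(f"aligned by Splign: {aligned_pct:.2f}%")  # 72.42%
print(f"passed to Gnomon: {passed_pct:.2f}%")    # 49.23%
```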

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,537,668,410	78%	32%	176,814
SAMN25046303	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046303)	71,991,522	78%	30%	107,979
SAMN25046304	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046304)	76,033,850	78%	33%	105,135
SAMN25046305	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046305)	101,077,396	77%	28%	109,689
SAMN25046306	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046306)	72,498,590	79%	33%	109,327
SAMN25046307	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046307)	72,787,970	79%	33%	110,865
SAMN25046308	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046308)	83,549,580	81%	34%	109,420
SAMN25046309	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046309)	76,672,962	79%	31%	107,004
SAMN25046310	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046310)	75,665,852	79%	33%	109,513
SAMN25046311	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046311)	108,257,446	79%	31%	113,007
SAMN25046312	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046312)	86,057,888	79%	33%	110,059
SAMN25046313	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046313)	65,168,590	78%	30%	100,567
SAMN25046314	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046314)	67,661,014	79%	35%	103,607
SAMN25046315	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046315)	84,365,184	76%	30%	107,977
SAMN25046316	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046316)	68,578,328	78%	32%	111,330
SAMN25046317	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046317)	63,620,268	77%	31%	104,353
SAMN25046318	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046318)	73,158,878	79%	34%	104,466
SAMN25046319	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046319)	78,628,464	77%	31%	111,841
SAMN25046320	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046320)	81,776,848	80%	34%	115,185
SAMN25046321	Last nymphal instar, Head (Schistocerca cancellata, female, SAMN25046321)	66,630,992	78%	30%	104,395
SAMN25046322	Last nymphal instar, Thorax (Schistocerca cancellata, female, SAMN25046322)	63,486,788	79%	32%	100,145
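
The "All" row appears to be the sum of the 20 per-sample rows. A minimal Python sketch of that check follows; the sample IDs, tissues and read counts are copied from the table above, while the parsing helper and the per-tissue grouping are illustrative additions rather than part of the pipeline.

```python
# Sample ID, tissue and read count copied from the per-sample table.
SAMPLE_ROWS = """
SAMN25046303 Head 71,991,522
SAMN25046304 Thorax 76,033,850
SAMN25046305 Head 101,077,396
SAMN25046306 Thorax 72,498,590
SAMN25046307 Head 72,787,970
SAMN25046308 Thorax 83,549,580
SAMN25046309 Head 76,672,962
SAMN25046310 Thorax 75,665,852
SAMN25046311 Head 108,257,446
SAMN25046312 Thorax 86,057,888
SAMN25046313 Head 65,168,590
SAMN25046314 Thorax 67,661,014
SAMN25046315 Head 84,365,184
SAMN25046316 Thorax 68,578,328
SAMN25046317 Head 63,620,268
SAMN25046318 Thorax 73,158,878
SAMN25046319 Head 78,628,464
SAMN25046320 Thorax 81,776,848
SAMN25046321 Head 66,630,992
SAMN25046322 Thorax 63,486,788
"""

REPORTED_TOTAL = 1_537_668_410  # "All" row of the table


def parse_rows(text):
    """Parse whitespace-separated (sample, tissue, reads) rows."""
    rows = []
    for line in text.strip().splitlines():
        sample_id, tissue, reads = line.split()
        rows.append((sample_id, tissue, int(reads.replace(",", ""))))
    return rows


rows = parse_rows(SAMPLE_ROWS)
total_reads = sum(reads for _, _, reads in rows)

# Group read counts by tissue (illustrative only).
reads_by_tissue = {}
for _, tissue, reads in rows:
    reads_by_tissue[tissue] = reads_by_tissue.get(tissue, 0) + reads

print(len(rows), total_reads == REPORTED_TOTAL)  # 20 True
for tissue, reads in sorted(reads_by_tissue.items()):
    print(f"{tissue}: {reads:,} reads")
```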

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR17648061	SRX13816301	SRP355464	SAMN25046303	71,991,522	78%	30%
SRR17648060	SRX13816302	SRP355464	SAMN25046304	76,033,850	78%	33%
SRR17648049	SRX13816313	SRP355464	SAMN25046305	101,077,396	77%	28%
SRR17648048	SRX13816314	SRP355464	SAMN25046306	72,498,590	79%	33%
SRR17648047	SRX13816315	SRP355464	SAMN25046307	72,787,970	79%	33%
SRR17648046	SRX13816316	SRP355464	SAMN25046308	83,549,580	81%	34%
SRR17648045	SRX13816317	SRP355464	SAMN25046309	76,672,962	79%	31%
SRR17648044	SRX13816318	SRP355464	SAMN25046310	75,665,852	79%	33%
SRR17648043	SRX13816319	SRP355464	SAMN25046311	108,257,446	79%	31%
SRR17648042	SRX13816320	SRP355464	SAMN25046312	86,057,888	79%	33%
SRR17648059	SRX13816303	SRP355464	SAMN25046313	65,168,590	78%	30%
SRR17648058	SRX13816304	SRP355464	SAMN25046314	67,661,014	79%	35%
SRR17648057	SRX13816305	SRP355464	SAMN25046315	84,365,184	76%	30%
SRR17648056	SRX13816306	SRP355464	SAMN25046316	68,578,328	78%	32%
SRR17648055	SRX13816307	SRP355464	SAMN25046317	63,620,268	77%	31%
SRR17648054	SRX13816308	SRP355464	SAMN25046318	73,158,878	79%	34%
SRR17648053	SRX13816309	SRP355464	SAMN25046319	78,628,464	77%	31%
SRR17648052	SRX13816310	SRP355464	SAMN25046320	81,776,848	80%	34%
SRR17648051	SRX13816311	SRP355464	SAMN25046321	66,630,992	78%	30%
SRR17648050	SRX13816312	SRP355464	SAMN25046322	63,486,788	79%	32%

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Cryptotermes secundus high-quality model RefSeq (XP_)	11,039	9,288 (84.14%)	9,288 (84.14%)	67.29%	66.43%
Zootermopsis nevadensis high-quality model RefSeq (XP_)	10,233	8,660 (84.63%)	8,660 (84.63%)	67.40%	65.96%
Thrips palmi high-quality model RefSeq (XP_)	11,360	8,306 (73.12%)	8,306 (73.12%)	64.94%	59.47%
Insecta known RefSeq (NP_)	39,186	27,830 (71.02%)	27,830 (71.02%)	65.16%	56.99%
Tribolium castaneum high-quality model RefSeq (XP_)	11,487	8,784 (76.47%)	8,784 (76.47%)	63.70%	59.11%
Apis mellifera high-quality model RefSeq (XP_)	8,879	7,196 (81.05%)	7,196 (81.05%)	65.53%	62.26%
Cimex lectularius high-quality model RefSeq (XP_)	11,205	8,392 (74.90%)	8,392 (74.90%)	64.52%	60.33%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
