NCBI Homalodisca vitripennis Annotation Release 100

The RefSeq genome records for Homalodisca vitripennis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Homalodisca vitripennis Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jan 28 2022
Date of submission of annotation to the public databases: Feb 21 2022
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
UT_GWSS_2.1	GCF_021130785.1	University of Texas at Austin	01-26-2022	Reference	10 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	UT_GWSS_2.1
Genes and pseudogenes	22,591
protein-coding	19,904
non-coding	2,075
Transcribed pseudogenes	4
Non-transcribed pseudogenes	607
genes with variants	5,208
Immunoglobulin/T-cell receptor gene segments	0
other	1
mRNAs	31,150
fully-supported	25,782
with > 5% ab initio	4,045
partial	929
with filled gap(s)	225
known RefSeq (NM_)	0
model RefSeq (XM_)	31,150
non-coding RNAs	2,846
fully-supported	1,946
with > 5% ab initio	0
partial	2
with filled gap(s)	2
known RefSeq (NR_)	0
model RefSeq (XR_)	2,140
pseudo transcripts	4
fully-supported	4
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	4
CDSs	31,163
fully-supported	25,782
with > 5% ab initio	4,294
partial	915
with major correction(s)	1,080
known RefSeq (NP_)	0
model RefSeq (XP_)	31,163
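
The gene categories in the table above can be cross-checked against the reported total. The following is a minimal Python sketch with the counts copied by hand from the table. The dictionary layout and the function name are illustrative and not part of the annotation pipeline. It assumes that "genes with variants" is an overlapping property rather than a separate category, so that row is left out of the sum.

```python
# Gene counts copied from the "Feature counts" table for UT_GWSS_2.1.
# Assumption: these five rows are disjoint categories; "genes with variants"
# is treated as an overlapping property and is deliberately omitted.
GENE_COUNTS = {
    "protein-coding": 19_904,
    "non-coding": 2_075,
    "Transcribed pseudogenes": 4,
    "Non-transcribed pseudogenes": 607,
    "other": 1,
}
TOTAL_GENES_AND_PSEUDOGENES = 22_591  # "Genes and pseudogenes" row


def pseudogene_free_total(counts: dict[str, int]) -> int:
    """Sum every category whose name does not mention pseudogenes."""
    return sum(n for name, n in counts.items() if "pseudogene" not in name.lower())


if __name__ == "__main__":
    print("categories sum to total:",
          sum(GENE_COUNTS.values()) == TOTAL_GENES_AND_PSEUDOGENES)
    print("genes excluding pseudogenes:", pseudogene_free_total(GENE_COUNTS))
```

Under these assumptions the five categories add up to 22,591, and the pseudogene-free subtotal is 21,980, which is the gene count listed in the Feature lengths table.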

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	21,980	31,969	13,926	63	1,362,382
All transcripts	33,996	2,682	1,792	50	50,439
mRNA	31,150	2,799	1,897	111	50,439
misc_RNA	606	3,291	2,248	173	18,071
tRNA	704	73	72	61	87
lncRNA	1,340	1,401	708	50	48,102
snoRNA	36	146	129	63	217
snRNA	124	133	123	100	196
rRNA	35	971	156	119	4,303
Single-exon transcripts	2,480	1,148	825	111	17,956
coding transcripts (NM_/XM_ )	2,480	1,148	825	111	17,956
CDSs	31,163	1,623	1,164	111	49,788
Exons	152,606	340	168	1	32,259
in coding transcripts (NM_/XM_ )	148,244	335	167	1	22,617
in non-coding transcripts (NR_/XR_ )	7,290	413	164	2	32,259
Introns	130,813	6,074	2,145	30	597,417
in coding transcripts (NM_/XM_ )	127,900	6,067	2,142	30	597,417
in non-coding transcripts (NR_/XR_ )	5,677	5,882	2,267	30	486,374

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.56	1	1	50
Number of exons per transcript	8.68	6	1	173

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the single longest protein per gene and the hemiptera_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
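
Selecting one longest protein per gene can be sketched as below. This is an illustration only, not the pipeline's code: the Protein record, the identifiers and the tie-breaking rule (first protein seen wins when lengths are equal) are assumptions, since the report does not describe how ties are resolved.

```python
from typing import Iterable, NamedTuple


class Protein(NamedTuple):
    gene_id: str
    protein_id: str
    sequence: str


def longest_protein_per_gene(proteins: Iterable[Protein]) -> dict[str, Protein]:
    """Keep one protein per gene: the longest; the first seen wins on ties."""
    best: dict[str, Protein] = {}
    for protein in proteins:
        current = best.get(protein.gene_id)
        if current is None or len(protein.sequence) > len(current.sequence):
            best[protein.gene_id] = protein
    return best


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, for illustration only.
    demo = [
        Protein("geneA", "protA.1", "MKT"),
        Protein("geneA", "protA.2", "MKTLLV"),
        Protein("geneB", "protB.1", "MSS"),
    ]
    for gene, protein in longest_protein_per_gene(demo).items():
        print(gene, protein.protein_id, len(protein.sequence))
```

The resulting one-protein-per-gene set is the kind of input that BUSCO's protein mode would then assess.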

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Drosophila melanogaster known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 19,891 coding genes, 12,851 had a protein with an alignment covering 50% or more of the query, and 3,605 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
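
The two coverage definitions can be written out as a short calculation. The sketch below assumes a single contiguous alignment described by 1-based, inclusive coordinates. How the pipeline treats gaps or multiple alignment segments is not stated in this report, and the coordinates in the example are hypothetical.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percent of a sequence included in one alignment (1-based, inclusive)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    return 100.0 * (aligned_end - aligned_start + 1) / sequence_length


if __name__ == "__main__":
    # Hypothetical hit: residues 11-400 of a 420-residue annotated protein
    # (query) aligned to residues 1-395 of a 500-residue target protein.
    query_cov = coverage_percent(11, 400, 420)
    target_cov = coverage_percent(1, 395, 500)
    print(f"query coverage:  {query_cov:.1f}%")
    print(f"target coverage: {target_cov:.1f}%")
```

In this made-up case the hit would count toward the 50% query-coverage threshold but not the 95% one. For scale, 12851 of 19891 genes is about 64.6%, and 3605 of 19891 is about 18.1%.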

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Drosophila melanogaster known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
UT_GWSS_2.1	GCF_021130785.1	39.37%
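
A masked percentage of this kind can be approximated from a soft-masked FASTA file by counting lowercase bases. The sketch below assumes soft-masking (masked bases written in lowercase) and counts every sequence character, including N, in the denominator. The report does not state how the 39.37% figure was computed, so the result may not match it exactly.

```python
from typing import Iterable


def masked_percent(fasta_lines: Iterable[str]) -> float:
    """Percent of bases written in lowercase in FASTA text (headers skipped)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return 100.0 * masked / total if total else 0.0


if __name__ == "__main__":
    # Toy sequence, not taken from the assembly.
    toy = [">scaffold_demo", "ACGTacgtACGT", "ttttGGGG"]
    print(f"{masked_percent(toy):.2f}% masked")
```

For a real assembly, the same function could be fed an open file handle, since it only iterates over lines.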

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	25	24 (96.00%)	23 (92.00%)	99.58%	94.07%
Same-species EST	20,030	16,786 (83.80%)	16,042 (80.09%)	99.30%	99.66%
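
The percentages in parentheses appear to be relative to the number of sequences retrieved from Entrez. A small sketch that parses the "count (percent%)" cells and recomputes the percentage is shown below. The helper names are illustrative, and the two-decimal rounding is an assumption based on how the table is printed.

```python
import re

# Matches cells such as "16,786 (83.80%)".
CELL = re.compile(r"^([\d,]+) \((\d+\.\d+)%\)$")


def parse_cell(cell: str) -> tuple[int, float]:
    """Split a 'count (percent%)' cell into (count, percent)."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


def percent_of(count: int, retrieved: int) -> float:
    """Percent of retrieved sequences, rounded to two decimals."""
    return round(100.0 * count / retrieved, 2)


if __name__ == "__main__":
    # Values copied from the "Same-species EST" row.
    retrieved = 20_030
    for cell in ("16,786 (83.80%)", "16,042 (80.09%)"):
        count, reported = parse_cell(cell)
        print(count, reported, percent_of(count, retrieved))
```

For both rows of the table, the recomputed value agrees with the reported one to two decimal places.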

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	3,410,135,668	69%	26%	158,093
SAMN03379868	NA	whole organism (Homalodisca vitripennis, female, SAMN03379868)	55,564,294	79%	17%	114,525
SAMN03379869	NA	whole organism (Homalodisca vitripennis, male, SAMN03379869)	33,358,948	78%	18%	118,633
SAMN12306226	NA	Abd (Homalodisca vitripennis, female, SAMN12306226)	49,387,604	74%	21%	102,027
SAMN12306227	NA	Eye (Homalodisca vitripennis, female, SAMN12306227)	46,873,078	85%	14%	105,588
SAMN12306228	NA	Leg (Homalodisca vitripennis, female, SAMN12306228)	70,185,420	84%	25%	107,723
SAMN12306229	NA	Meso (Homalodisca vitripennis, female, SAMN12306229)	54,174,560	80%	21%	103,972
SAMN12306230	NA	Ovi (Homalodisca vitripennis, female, SAMN12306230)	59,249,024	83%	19%	101,293
SAMN12306231	NA	Pro (Homalodisca vitripennis, female, SAMN12306231)	53,740,130	82%	23%	100,404
SAMN12306232	NA	W2 (Homalodisca vitripennis, female, SAMN12306232)	58,785,770	83%	15%	106,173
SAMN12306233	NA	W3 (Homalodisca vitripennis, female, SAMN12306233)	68,344,384	84%	17%	106,758
SAMN12306234	NA	Abd (Homalodisca vitripennis, female, SAMN12306234)	61,943,902	80%	20%	107,761
SAMN12306235	NA	Eye (Homalodisca vitripennis, female, SAMN12306235)	67,353,264	83%	11%	110,493
SAMN12306236	NA	Leg (Homalodisca vitripennis, female, SAMN12306236)	52,086,098	86%	19%	104,042
SAMN12306237	NA	Meso (Homalodisca vitripennis, female, SAMN12306237)	55,465,912	83%	17%	102,149
SAMN12306238	NA	Ovi (Homalodisca vitripennis, female, SAMN12306238)	39,056,424	75%	19%	107,986
SAMN12306239	NA	Pro (Homalodisca vitripennis, female, SAMN12306239)	44,942,256	85%	11%	98,133
SAMN12306240	NA	W2 (Homalodisca vitripennis, female, SAMN12306240)	28,224,200	81%	22%	98,110
SAMN12306241	NA	W3 (Homalodisca vitripennis, female, SAMN12306241)	62,745,538	84%	17%	102,813
SAMN12306242	NA	Abd (Homalodisca vitripennis, female, SAMN12306242)	58,463,650	74%	19%	109,353
SAMN12306243	NA	Leg (Homalodisca vitripennis, female, SAMN12306243)	41,370,914	83%	20%	86,081
SAMN12306245	NA	Pro (Homalodisca vitripennis, female, SAMN12306245)	45,825,226	84%	10%	82,125
SAMN12306246	NA	W2 (Homalodisca vitripennis, female, SAMN12306246)	32,940,812	83%	11%	83,624
SAMN12667715	32076126	body (Homalodisca vitripennis, SAMN12667715)	125,450,630	76%	11%	123,319
SAMN12667716	32076126	red bacteriome (Homalodisca vitripennis, SAMN12667716)	264,273,636	15%	16%	103,262
SAMN12667717	32076126	yellow bacteriome (Homalodisca vitripennis, SAMN12667717)	223,396,392	30%	19%	111,586
SAMN18497903	33423343	Whole insect (Homalodisca vitripennis, SAMN18497903)	86,390,240	86%	41%	122,455
SAMN18497904	33423343	Whole insect (Homalodisca vitripennis, SAMN18497904)	104,839,620	78%	39%	123,527
SAMN18497905	33423343	Whole insect (Homalodisca vitripennis, SAMN18497905)	86,969,016	80%	42%	112,856
SAMN18497906	33423343	Whole insect (Homalodisca vitripennis, SAMN18497906)	87,262,478	77%	33%	118,380
SAMN18497907	33423343	Whole insect (Homalodisca vitripennis, SAMN18497907)	80,867,746	78%	36%	125,946
SAMN18497908	33423343	Whole insect (Homalodisca vitripennis, SAMN18497908)	124,930,436	79%	38%	121,991
SAMN18497909	33423343	Whole insect (Homalodisca vitripennis, SAMN18497909)	103,301,642	76%	31%	125,362
SAMN18497910	33423343	Whole insect (Homalodisca vitripennis, SAMN18497910)	51,510,280	72%	36%	122,052
SAMN18497911	33423343	Whole insect (Homalodisca vitripennis, SAMN18497911)	104,847,644	82%	37%	107,408
SAMN18497912	33423343	Whole insect (Homalodisca vitripennis, SAMN18497912)	85,873,304	80%	35%	124,482
SAMN18497913	33423343	Whole insect (Homalodisca vitripennis, SAMN18497913)	109,636,294	81%	38%	125,996
SAMN18497914	33423343	Whole insect (Homalodisca vitripennis, SAMN18497914)	66,207,018	81%	33%	101,316
SAMN18497915	33423343	Whole insect (Homalodisca vitripennis, SAMN18497915)	66,443,080	82%	39%	123,614
SAMN18497916	33423343	Whole insect (Homalodisca vitripennis, SAMN18497916)	101,730,132	73%	38%	123,709
SAMN18497917	33423343	Whole insect (Homalodisca vitripennis, SAMN18497917)	109,688,832	80%	39%	126,050
SAMN18497918	33423343	Whole insect (Homalodisca vitripennis, SAMN18497918)	79,186,970	79%	38%	120,471
SAMN22783162	NA	adult, whole bodies (Homalodisca vitripennis, pooled male and female, SAMN22783162)	207,248,870	42%	9%	95,049
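
The read count in the "All" row can be checked against the individual samples. The sketch below copies the per-sample read counts from the table in row order. It assumes the aggregate is a plain sum over samples, which the report does not state explicitly.

```python
# Read counts copied from the per-sample rows above, in table order.
SAMPLE_READS = [
    55_564_294, 33_358_948, 49_387_604, 46_873_078, 70_185_420, 54_174_560,
    59_249_024, 53_740_130, 58_785_770, 68_344_384, 61_943_902, 67_353_264,
    52_086_098, 55_465_912, 39_056_424, 44_942_256, 28_224_200, 62_745_538,
    58_463_650, 41_370_914, 45_825_226, 32_940_812, 125_450_630, 264_273_636,
    223_396_392, 86_390_240, 104_839_620, 86_969_016, 87_262_478, 80_867_746,
    124_930_436, 103_301_642, 51_510_280, 104_847_644, 85_873_304, 109_636_294,
    66_207_018, 66_443_080, 101_730_132, 109_688_832, 79_186_970, 207_248_870,
]
AGGREGATE_READS = 3_410_135_668  # "All" row


def reads_match_aggregate(per_sample: list[int], aggregate: int) -> bool:
    """True when the per-sample read counts add up to the aggregate."""
    return sum(per_sample) == aggregate


if __name__ == "__main__":
    print("samples:", len(SAMPLE_READS))
    print("sum of reads:", f"{sum(SAMPLE_READS):,}")
    print("matches aggregate:", reads_match_aggregate(SAMPLE_READS, AGGREGATE_READS))
```

The same pattern does not carry over to the "Number of introns" column, since introns found in several samples would presumably be counted once in the aggregate.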

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR1865088	SRX910971	SRP055946	SAMN03379868	55,564,294	79%	17%
SRR1865089	SRX910972	SRP055946	SAMN03379869	33,358,948	78%	18%
SRR10060919	SRX6794811	SRP089823	SAMN12667715	125,450,630	76%	11%
SRR10060918	SRX6794812	SRP089823	SAMN12667716	264,273,636	15%	16%
SRR10060917	SRX6794813	SRP089823	SAMN12667717	223,396,392	30%	19%
SRR9942940	SRX6691510	SRP152991	SAMN12306226	49,387,604	74%	21%
SRR9942943	SRX6691507	SRP152991	SAMN12306227	46,873,078	85%	14%
SRR9942959	SRX6691491	SRP152991	SAMN12306228	70,185,420	84%	25%
SRR9942933	SRX6691517	SRP152991	SAMN12306229	54,174,560	80%	21%
SRR9942950	SRX6691500	SRP152991	SAMN12306230	59,249,024	83%	19%
SRR9942964	SRX6691486	SRP152991	SAMN12306231	53,740,130	82%	23%
SRR9942949	SRX6691501	SRP152991	SAMN12306232	58,785,770	83%	15%
SRR9942957	SRX6691493	SRP152991	SAMN12306233	68,344,384	84%	17%
SRR9942972	SRX6691478	SRP152991	SAMN12306234	61,943,902	80%	20%
SRR9942932	SRX6691518	SRP152991	SAMN12306235	67,353,264	83%	11%
SRR9942935	SRX6691515	SRP152991	SAMN12306236	52,086,098	86%	19%
SRR9942954	SRX6691496	SRP152991	SAMN12306237	55,465,912	83%	17%
SRR9942960	SRX6691490	SRP152991	SAMN12306238	39,056,424	75%	19%
SRR9942973	SRX6691477	SRP152991	SAMN12306239	44,942,256	85%	11%
SRR9942971	SRX6691479	SRP152991	SAMN12306240	28,224,200	81%	22%
SRR9942963	SRX6691487	SRP152991	SAMN12306241	62,745,538	84%	17%
SRR9942941	SRX6691509	SRP152991	SAMN12306242	58,463,650	74%	19%
SRR9942944	SRX6691506	SRP152991	SAMN12306243	41,370,914	83%	20%
SRR9942967	SRX6691483	SRP152991	SAMN12306245	45,825,226	84%	10%
SRR9942952	SRX6691498	SRP152991	SAMN12306246	32,940,812	83%	11%
SRR14298121	SRX10655946	SRP315823	SAMN18497903	86,390,240	86%	41%
SRR14298120	SRX10655947	SRP315823	SAMN18497904	104,839,620	78%	39%
SRR14298113	SRX10655954	SRP315823	SAMN18497905	86,969,016	80%	42%
SRR14298112	SRX10655955	SRP315823	SAMN18497906	87,262,478	77%	33%
SRR14298111	SRX10655956	SRP315823	SAMN18497907	80,867,746	78%	36%
SRR14298110	SRX10655957	SRP315823	SAMN18497908	124,930,436	79%	38%
SRR14298109	SRX10655958	SRP315823	SAMN18497909	103,301,642	76%	31%
SRR14298108	SRX10655959	SRP315823	SAMN18497910	51,510,280	72%	36%
SRR14298107	SRX10655960	SRP315823	SAMN18497911	104,847,644	82%	37%
SRR14298106	SRX10655961	SRP315823	SAMN18497912	85,873,304	80%	35%
SRR14298119	SRX10655948	SRP315823	SAMN18497913	109,636,294	81%	38%
SRR14298118	SRX10655949	SRP315823	SAMN18497914	66,207,018	81%	33%
SRR14298117	SRX10655950	SRP315823	SAMN18497915	66,443,080	82%	39%
SRR14298116	SRX10655951	SRP315823	SAMN18497916	101,730,132	73%	38%
SRR14298115	SRX10655952	SRP315823	SAMN18497917	109,688,832	80%	39%
SRR14298114	SRX10655953	SRP315823	SAMN18497918	79,186,970	79%	38%
SRR16827083	SRX13020287	SRP344816	SAMN22783162	31,333,586	36%	12%
SRR16827082	SRX13020288	SRP344816	SAMN22783162	36,765,600	26%	13%
SRR16827081	SRX13020289	SRP344816	SAMN22783162	33,757,084	29%	13%
SRR16827080	SRX13020290	SRP344816	SAMN22783162	33,545,302	53%	8%
SRR16827079	SRX13020291	SRP344816	SAMN22783162	38,463,390	54%	6%
SRR16827078	SRX13020292	SRP344816	SAMN22783162	33,383,908	55%	7%
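
Sample-level statistics can be approximated by pooling the runs that belong to one sample. The sketch below does this for the six runs of SAMN22783162, weighting each run's percentages by its read count. Because the per-run percentages are already rounded to whole numbers, the pooled values are only approximate, and the pipeline may well compute them from raw counts instead.

```python
# (reads, % aligned, % of aligned reads with introns) for the six SRR runs of
# sample SAMN22783162, copied from the by-run table above.
RUNS = [
    (31_333_586, 36, 12),
    (36_765_600, 26, 13),
    (33_757_084, 29, 13),
    (33_545_302, 53, 8),
    (38_463_390, 54, 6),
    (33_383_908, 55, 7),
]


def pool_runs(runs: list[tuple[int, float, float]]) -> tuple[int, float, float]:
    """Pool runs into (total reads, % aligned, % of aligned reads with introns)."""
    total_reads = sum(reads for reads, _, _ in runs)
    # Estimated aligned reads per run, from the rounded percentages.
    aligned = [reads * pct / 100 for reads, pct, _ in runs]
    with_introns = sum(a * run[2] / 100 for a, run in zip(aligned, runs))
    total_aligned = sum(aligned)
    return (
        total_reads,
        100 * total_aligned / total_reads,
        100 * with_introns / total_aligned,
    )


if __name__ == "__main__":
    reads, pct_aligned, pct_introns = pool_runs(RUNS)
    print(f"{reads:,} reads, {pct_aligned:.1f}% aligned, "
          f"{pct_introns:.1f}% of aligned reads with introns")
```

The pooled read count equals the 207,248,870 reads listed for this sample, and the pooled percentages round to the 42% and 9% shown in the by-sample table.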

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Diuraphis noxia high-quality model RefSeq (XP_)	7,506	5,858 (78.04%)	5,858 (78.04%)	65.32%	63.01%
Same-species GenBank	25	25 (100.00%)	25 (100.00%)	85.63%	91.41%
Halyomorpha halys high-quality model RefSeq (XP_)	11,226	8,598 (76.59%)	8,598 (76.59%)	63.75%	58.41%
Insecta GenBank	114,872	79,607 (69.30%)	79,607 (69.30%)	68.01%	67.23%
Insecta known RefSeq (NP_)	39,088	26,198 (67.02%)	26,198 (67.02%)	67.10%	59.96%
Acyrthosiphon pisum high-quality model RefSeq (XP_)	11,742	7,770 (66.17%)	7,770 (66.17%)	63.99%	56.89%
Bemisia tabaci high-quality model RefSeq (XP_)	11,628	8,504 (73.13%)	8,504 (73.13%)	64.89%	59.78%
Cimex lectularius GenBank	29	29 (100.00%)	29 (100.00%)	75.77%	76.95%
Cimex lectularius high-quality model RefSeq (XP_)	11,205	8,418 (75.13%)	8,418 (75.13%)	66.82%	64.98%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
