NCBI Harmonia axyridis Annotation Release 100

The RefSeq genome records for Harmonia axyridis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Harmonia axyridis Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Dec 20 2021
Date of submission of annotation to the public databases: Dec 25 2021
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
icHarAxyr1.1	GCF_914767665.1	WELLCOME SANGER INSTITUTE	09-16-2021	Reference	8 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	icHarAxyr1.1
Genes and pseudogenes	18,548
protein-coding	13,899
non-coding	4,236
Transcribed pseudogenes	0
Non-transcribed pseudogenes	413
genes with variants	4,462
Immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	23,861
fully-supported	22,232
with > 5% ab initio	1,315
partial	47
with filled gap(s)	1
known RefSeq (NM_)	0
model RefSeq (XM_)	23,861
non-coding RNAs	5,194
fully-supported	2,519
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	4,247
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	23,861
fully-supported	22,232
with > 5% ab initio	1,360
partial	47
with major correction(s)	44
known RefSeq (NP_)	0
model RefSeq (XP_)	23,861
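
As a quick consistency check on the table above, the gene categories can be summed and compared with the reported total. This is a minimal sketch that assumes the listed categories partition the "Genes and pseudogenes" row (the "genes with variants" row is treated as a subset and left out); the dictionary keys and function names are illustrative only.

```python
# Counts copied from the "Feature counts" table for icHarAxyr1.1.
# Assumption: these categories are mutually exclusive and together
# make up the "Genes and pseudogenes" total.
gene_counts = {
    "protein-coding": 13_899,
    "non-coding": 4_236,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 413,
    "immunoglobulin/T-cell receptor gene segments": 0,
    "other": 0,
}
reported_total = 18_548


def total_genes(counts):
    """Sum every category, pseudogenes included."""
    return sum(counts.values())


def non_pseudogene_total(counts):
    """Sum every category whose name does not mention pseudogenes."""
    return sum(n for name, n in counts.items() if "pseudogene" not in name)


print(total_genes(gene_counts) == reported_total)
print(non_pseudogene_total(gene_counts))
```

The second figure, 18,135, is the gene count that appears in the "Feature lengths" table, which excludes pseudogenes.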

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	18,135	15,337	2,928	67	980,242
All transcripts	29,055	2,548	1,906	67	68,152
mRNA	23,861	2,856	2,104	267	68,152
misc_RNA	528	2,867	2,185	200	25,093
tRNA	947	74	73	71	86
lncRNA	1,994	1,136	847	122	19,868
snoRNA	59	137	134	67	215
snRNA	105	150	162	92	191
rRNA	1,561	1,280	155	118	4,201
Single-exon transcripts	1,777	1,378	1,083	267	6,871
coding transcripts (NM_/XM_ )	1,777	1,378	1,083	267	6,871
CDSs	23,861	2,059	1,431	198	67,410
Exons	97,118	385	220	2	24,978
in coding transcripts (NM_/XM_ )	91,314	385	220	2	24,978
in non-coding transcripts (NR_/XR_ )	7,800	359	210	2	10,638
Introns	79,389	4,089	418	30	564,007
in coding transcripts (NM_/XM_ )	75,656	4,169	439	30	564,007
in non-coding transcripts (NR_/XR_ )	5,647	2,875	223	30	263,452

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.64	1	1	50
Number of exons per transcript	7.51	5	1	210

BUSCO analysis of gene annotation

BUSCO v4.1.4 (Simão et al. 2015, PMID: 26059717) was run in "protein" mode with the endopterygota_odb10 lineage dataset on the annotated gene set, using the longest protein of each gene. Results are reported for the gene set from the primary assembly unit, and presented in BUSCO notation (C:complete [S:single-copy, D:duplicated], F:fragmented, M:missing, n:number of genes used).
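
The BUSCO notation described above can be read programmatically. The sketch below assumes the customary one-line layout `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the example string uses made-up placeholder values and is not the result for this annotation, which is not reproduced in this text. The function and pattern names are illustrative.

```python
import re

# Assumed layout of a BUSCO one-line summary.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Return the C, S, D, F, M percentages and n from a summary string."""
    match = BUSCO_PATTERN.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError("not a recognised BUSCO summary: " + summary)
    result = {key: float(match.group(key)) for key in "CSDFM"}
    result["n"] = int(match.group("n"))
    return result


# Placeholder values for illustration only, NOT taken from this report.
example = "C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:1000"
parsed = parse_busco(example)
print(parsed)
```

In this notation the complete fraction is the sum of the single-copy and duplicated fractions, and C, F and M together account for all n genes.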

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Drosophila melanogaster known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 13,899 coding genes, 9,505 genes had a protein with an alignment covering 50% or more of the query and 3,269 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
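
The two coverage definitions can be illustrated with a short calculation. This is a sketch under simplifying assumptions: each alignment is treated as a single contiguous span with 1-based inclusive coordinates, whereas a real BLASTP hit may consist of several segments. The coordinates in the example are hypothetical; only the gene counts at the end come from this report.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percent of a sequence spanned by one alignment (1-based, inclusive ends)."""
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    return 100.0 * (aligned_end - aligned_start + 1) / sequence_length


def percent_of_genes(genes_passing, total_genes):
    """Share of coding genes that meet a coverage threshold, to one decimal."""
    return round(100.0 * genes_passing / total_genes, 1)


# Hypothetical alignment: residues 11-390 of a 400-aa annotated protein (query)
# aligned to residues 21-420 of a 500-aa reference protein (target).
query_cov = coverage_percent(11, 390, 400)
target_cov = coverage_percent(21, 420, 500)
print(query_cov, target_cov)

# Gene counts quoted in the text of this section.
print(percent_of_genes(9505, 13899))
print(percent_of_genes(3269, 13899))
```

With the counts quoted above, roughly 68% of coding genes reach the 50% query-coverage threshold and roughly 24% reach the 95% threshold.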

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Drosophila melanogaster known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
icHarAxyr1.1	GCF_914767665.1	44.51%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	158	155 (98.10%)	149 (94.30%)	98.71%	99.33%
Same-species EST	25	24 (96.00%)	22 (88.00%)	95.96%	91.62%
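
The "Number (%)" cells in the table above combine a count with a percentage. The sketch below parses such cells and recomputes the percentage, assuming the denominator is the number of sequences retrieved from Entrez, which is consistent with the values shown. The helper names are illustrative.

```python
import re

# Matches cells such as "155 (98.10%)" or "6,742 (77.30%)".
CELL = re.compile(r"^\s*([\d,]+)\s*\(([\d.]+)%\)\s*$")


def parse_count_cell(cell):
    """Split a 'count (percent%)' cell into (int count, float percent)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError("unexpected cell format: " + cell)
    return int(match.group(1).replace(",", "")), float(match.group(2))


def percent_of(count, retrieved):
    """Percentage of retrieved sequences, rounded to two decimals."""
    return round(100.0 * count / retrieved, 2)


# Rows copied from the transcript alignments table:
# (source, retrieved, aligned by Splign, passed to Gnomon)
rows = [
    ("Same-species GenBank", 158, "155 (98.10%)", "149 (94.30%)"),
    ("Same-species EST", 25, "24 (96.00%)", "22 (88.00%)"),
]

for source, retrieved, aligned_cell, passed_cell in rows:
    for cell in (aligned_cell, passed_cell):
        count, reported = parse_count_cell(cell)
        recomputed = percent_of(count, retrieved)
        print(source, count, reported, abs(recomputed - reported) < 0.005)
```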

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	3,535,405,228	85%	29%	100,167
SAMD00079967	30242156	Haxy_hC_1-6R (Harmonia axyridis, SAMD00079967)	31,927,756	87%	21%	57,492
SAMD00079968	30242156	Haxy_hC_1-6B (Harmonia axyridis, SAMD00079968)	20,125,018	87%	20%	54,253
SAMD00079969	30242156	Haxy_hC_1-11R (Harmonia axyridis, SAMD00079969)	14,827,638	85%	20%	52,056
SAMD00079970	30242156	Haxy_hC_1-11B (Harmonia axyridis, SAMD00079970)	22,591,280	86%	20%	54,751
SAMD00079971	30242156	Haxy_hC_1-13R (Harmonia axyridis, SAMD00079971)	30,864,988	87%	20%	58,238
SAMD00079972	30242156	Haxy_hC_1-13B (Harmonia axyridis, SAMD00079972)	16,789,488	87%	19%	53,536
SAMD00079973	30242156	Haxy_hC_3-9R (Harmonia axyridis, SAMD00079973)	60,147,342	70%	19%	55,096
SAMD00079974	30242156	Haxy_hC_3-9B (Harmonia axyridis, SAMD00079974)	31,316,118	84%	19%	55,322
SAMD00079975	30242156	Haxy_hC_3-AR (Harmonia axyridis, SAMD00079975)	80,546,496	86%	19%	62,408
SAMD00079976	30242156	Haxy_hC_3-AB (Harmonia axyridis, SAMD00079976)	42,430,838	78%	19%	55,912
SAMD00079977	30242156	Haxy_hC_3-CR (Harmonia axyridis, SAMD00079977)	58,920,008	82%	19%	57,841
SAMD00079978	30242156	Haxy_hC_3-CB (Harmonia axyridis, SAMD00079978)	27,002,426	81%	19%	53,799
SAMEA3895879	NA	Harmonia transcriptome (Harmonia axyridis, SAMEA3895879)	116,001,642	78%	24%	76,195
SAMEA3895880	NA	Harmonia transcriptome (Harmonia axyridis, SAMEA3895880)	134,977,416	70%	19%	68,037
SAMN03799571	31740605,31830128	whole organism (Harmonia axyridis, not determined, SAMN03799571)	26,826,542	91%	25%	70,063
SAMN06706584	NA	adult's abdomen (Harmonia axyridis, female, SAMN06706584)	41,264,566	77%	27%	66,990
SAMN06706590	NA	adult's abdomen (Harmonia axyridis, female, SAMN06706590)	59,945,000	78%	27%	70,110
SAMN06706605	NA	adult's abdomen (Harmonia axyridis, female, SAMN06706605)	63,971,742	81%	28%	71,880
SAMN06706610	NA	adult's abdomen (Harmonia axyridis, female, SAMN06706610)	70,779,016	79%	29%	72,736
SAMN06706614	NA	adult's abdomen (Harmonia axyridis, female, SAMN06706614)	61,035,698	81%	28%	70,778
SAMN07437134	NA	Testis and MAGs (Harmonia axyridis, first day after emergence, male, SAMN07437134)	33,334,300	85%	28%	77,211
SAMN07437135	NA	Testis and MAGs (Harmonia axyridis, third day after emergence, male, SAMN07437135)	34,521,164	85%	29%	73,928
SAMN07437136	NA	Testis and MAGs (Harmonia axyridis, fourth day after emergence, male, SAMN07437136)	34,719,434	86%	29%	76,462
SAMN07437137	NA	Testis and MAGs (Harmonia axyridis, seventh day after emergence, male, SAMN07437137)	29,084,304	86%	29%	74,272
SAMN08162800	NA	Whole body (Harmonia axyridis, 9 days after emergence, female, SAMN08162800)	46,972,392	90%	32%	77,375
SAMN08162802	NA	Whole body (Harmonia axyridis, 9 days after emergence, female, SAMN08162802)	46,286,422	89%	33%	77,600
SAMN08162804	NA	Whole body (Harmonia axyridis, 9 days after emergence, female, SAMN08162804)	42,987,076	89%	32%	76,998
SAMN08162806	NA	Whole body (Harmonia axyridis, 9 days after emergence, female, SAMN08162806)	43,834,012	76%	32%	74,497
SAMN08162808	NA	Whole body (Harmonia axyridis, 9 days after emergence, female, SAMN08162808)	49,097,298	75%	31%	74,806
SAMN08162810	NA	Whole body (Harmonia axyridis, 9 days after emergence, female, SAMN08162810)	42,457,228	84%	31%	74,100
SAMN12071516	NA	4th instar larva (Harmonia axyridis, SAMN12071516)	52,534,696	84%	28%	76,261
SAMN12071517	NA	4th instar larva (Harmonia axyridis, SAMN12071517)	48,928,466	85%	30%	70,039
SAMN12071518	NA	4th instar larva (Harmonia axyridis, SAMN12071518)	46,078,472	85%	30%	76,304
SAMN12071520	NA	4th instar larva (Harmonia axyridis, SAMN12071520)	49,503,442	71%	30%	68,655
SAMN12071521	NA	4th instar larva (Harmonia axyridis, SAMN12071521)	40,242,122	72%	30%	67,285
SAMN13111588	NA	adult body (Harmonia axyridis, SAMN13111588)	929,471,486	88%	32%	98,447
SAMN16830458	NA	Whole body (Harmonia axyridis, SAMN16830458)	69,405,468	84%	36%	64,627
SAMN18204552	NA	Harmonia axyridis (Harmonia axyridis, SAMN18204552)	221,710,554	82%	30%	71,720
SAMN19931212	NA	whole body (Harmonia axyridis, SAMN19931212)	42,919,514	89%	29%	68,652
SAMN19931213	NA	whole body (Harmonia axyridis, SAMN19931213)	44,468,842	89%	30%	68,750
SAMN19931214	NA	whole body (Harmonia axyridis, SAMN19931214)	47,223,506	89%	31%	69,781
SAMN19931215	NA	whole body (Harmonia axyridis, SAMN19931215)	42,269,732	89%	29%	68,497
SAMN19931216	NA	whole body (Harmonia axyridis, SAMN19931216)	42,512,894	90%	31%	68,851
SAMN19931217	NA	whole body (Harmonia axyridis, SAMN19931217)	45,720,846	89%	30%	73,930
SAMN19931218	NA	whole body (Harmonia axyridis, SAMN19931218)	41,939,810	90%	32%	77,680
SAMN19931219	NA	whole body (Harmonia axyridis, SAMN19931219)	40,367,392	89%	31%	68,878
SAMN19931220	NA	whole body (Harmonia axyridis, SAMN19931220)	42,622,422	90%	29%	73,904
SAMN19931221	NA	whole body (Harmonia axyridis, SAMN19931221)	45,495,758	90%	30%	68,350
SAMN19931222	NA	whole body (Harmonia axyridis, SAMN19931222)	45,964,748	90%	29%	75,901
SAMN19931223	NA	whole body (Harmonia axyridis, SAMN19931223)	43,548,370	90%	30%	75,104
SAMN20395992	NA	fat-body (Harmonia axyridis, 4th larva, SAMN20395992)	136,892,040	91%	27%	78,263

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
DRR092246	DRX085816	DRP004428	SAMD00079967	31,927,756	87%	21%
DRR092247	DRX085817	DRP004428	SAMD00079968	20,125,018	87%	20%
DRR092248	DRX085818	DRP004428	SAMD00079969	14,827,638	85%	20%
DRR092249	DRX085819	DRP004428	SAMD00079970	22,591,280	86%	20%
DRR092250	DRX085820	DRP004428	SAMD00079971	30,864,988	87%	20%
DRR092251	DRX085821	DRP004428	SAMD00079972	16,789,488	87%	19%
DRR092252	DRX085822	DRP004428	SAMD00079973	60,147,342	70%	19%
DRR092253	DRX085823	DRP004428	SAMD00079974	31,316,118	84%	19%
DRR092254	DRX085824	DRP004428	SAMD00079975	80,546,496	86%	19%
DRR092255	DRX085825	DRP004428	SAMD00079976	42,430,838	78%	19%
DRR092256	DRX085826	DRP004428	SAMD00079977	58,920,008	82%	19%
DRR092257	DRX085827	DRP004428	SAMD00079978	27,002,426	81%	19%
ERR1309558	ERX1381181	ERP014561	SAMEA3895879	116,001,642	78%	24%
ERR1309559	ERX1381182	ERP014561	SAMEA3895880	134,977,416	70%	19%
SRR2083667	SRX1078176	SRP060061	SAMN03799571	26,826,542	91%	25%
SRR5451336	SRX2740047	SRP104023	SAMN06706584	41,264,566	77%	27%
SRR5451332	SRX2740043	SRP104023	SAMN06706590	59,945,000	78%	27%
SRR5451335	SRX2740046	SRP104023	SAMN06706605	63,971,742	81%	28%
SRR5451334	SRX2740045	SRP104023	SAMN06706610	70,779,016	79%	29%
SRR5451333	SRX2740044	SRP104023	SAMN06706614	61,035,698	81%	28%
SRR5891403	SRX3057156	SRP114719	SAMN07437134	33,334,300	85%	28%
SRR5891404	SRX3057155	SRP114719	SAMN07437135	34,521,164	85%	29%
SRR5891405	SRX3057154	SRP114719	SAMN07437136	34,719,434	86%	29%
SRR5891406	SRX3057153	SRP114719	SAMN07437137	29,084,304	86%	29%
SRR6378484	SRX3473287	SRP126831	SAMN08162800	46,972,392	90%	32%
SRR6378483	SRX3473288	SRP126831	SAMN08162802	46,286,422	89%	33%
SRR6378482	SRX3473289	SRP126831	SAMN08162804	42,987,076	89%	32%
SRR6378481	SRX3473290	SRP126831	SAMN08162806	43,834,012	76%	32%
SRR6378485	SRX3473286	SRP126831	SAMN08162808	49,097,298	75%	31%
SRR6378480	SRX3473291	SRP126831	SAMN08162810	42,457,228	84%	31%
SRR9649797	SRX6411343	SRP213364	SAMN12071516	52,534,696	84%	28%
SRR9649798	SRX6411342	SRP213364	SAMN12071517	48,928,466	85%	30%
SRR9649795	SRX6411345	SRP213364	SAMN12071518	46,078,472	85%	30%
SRR9649801	SRX6411339	SRP213364	SAMN12071520	49,503,442	71%	30%
SRR9649802	SRX6411338	SRP213364	SAMN12071521	40,242,122	72%	30%
SRR13089467	SRX9535375	SRP293318	SAMN16830458	69,405,468	84%	36%
SRR15696844	SRX11992795	SRP302325	SAMN13111588	43,004,228	87%	33%
SRR15696843	SRX11992796	SRP302325	SAMN13111588	48,573,886	87%	33%
SRR15696842	SRX11992797	SRP302325	SAMN13111588	65,084,648	91%	32%
SRR15696841	SRX11992798	SRP302325	SAMN13111588	60,103,248	91%	32%
SRR15696840	SRX11992799	SRP302325	SAMN13111588	49,305,570	88%	31%
SRR15696839	SRX11992800	SRP302325	SAMN13111588	49,024,762	88%	35%
SRR15696838	SRX11992801	SRP302325	SAMN13111588	49,691,128	89%	31%
SRR15696837	SRX11992802	SRP302325	SAMN13111588	43,008,204	91%	30%
SRR15696836	SRX11992803	SRP302325	SAMN13111588	41,348,590	91%	35%
SRR15696835	SRX11992804	SRP302325	SAMN13111588	50,515,966	91%	31%
SRR15696834	SRX11992805	SRP302325	SAMN13111588	59,637,624	85%	32%
SRR15696833	SRX11992806	SRP302325	SAMN13111588	71,886,552	90%	32%
SRR15696832	SRX11992807	SRP302325	SAMN13111588	38,933,218	91%	32%
SRR15696831	SRX11992808	SRP302325	SAMN13111588	41,785,020	91%	32%
SRR15696830	SRX11992809	SRP302325	SAMN13111588	54,890,894	84%	33%
SRR15696829	SRX11992810	SRP302325	SAMN13111588	64,279,282	84%	34%
SRR15696828	SRX11992811	SRP302325	SAMN13111588	45,194,318	84%	33%
SRR15696827	SRX11992812	SRP302325	SAMN13111588	53,204,348	91%	33%
SRR13893413	SRX10273107	SRP309816	SAMN18204552	221,710,554	82%	30%
SRR14975314	SRX11287852	SRP326063	SAMN19931212	42,919,514	89%	29%
SRR14975313	SRX11287853	SRP326063	SAMN19931213	44,468,842	89%	30%
SRR14975310	SRX11287856	SRP326063	SAMN19931214	47,223,506	89%	31%
SRR14975309	SRX11287857	SRP326063	SAMN19931215	42,269,732	89%	29%
SRR14975308	SRX11287858	SRP326063	SAMN19931216	42,512,894	90%	31%
SRR14975307	SRX11287859	SRP326063	SAMN19931217	45,720,846	89%	30%
SRR14975306	SRX11287860	SRP326063	SAMN19931218	41,939,810	90%	32%
SRR14975305	SRX11287861	SRP326063	SAMN19931219	40,367,392	89%	31%
SRR14975304	SRX11287862	SRP326063	SAMN19931220	42,622,422	90%	29%
SRR14975303	SRX11287863	SRP326063	SAMN19931221	45,495,758	90%	30%
SRR14975312	SRX11287854	SRP326063	SAMN19931222	45,964,748	90%	29%
SRR14975311	SRX11287855	SRP326063	SAMN19931223	43,548,370	90%	30%
SRR15245412	SRX11551236	SRP329855	SAMN20395992	45,699,518	91%	26%
SRR15245411	SRX11551237	SRP329855	SAMN20395992	45,502,098	91%	25%
SRR15245410	SRX11551238	SRP329855	SAMN20395992	45,690,424	90%	29%
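
The by-run and by-sample tables are related: when a sample has several runs, its read count in the by-sample table is the sum of its runs. A minimal sketch of that aggregation, using four rows copied from the by-run table; the function names are illustrative.

```python
from collections import defaultdict


def to_int(text):
    """Convert a count such as '45,699,518' to an integer."""
    return int(text.replace(",", ""))


def reads_per_sample(run_rows):
    """Sum the read counts of all runs belonging to the same sample."""
    totals = defaultdict(int)
    for _run, sample, reads in run_rows:
        totals[sample] += to_int(reads)
    return dict(totals)


# (run, sample, number of reads), copied from the by-run table.
runs = [
    ("SRR15245412", "SAMN20395992", "45,699,518"),
    ("SRR15245411", "SAMN20395992", "45,502,098"),
    ("SRR15245410", "SAMN20395992", "45,690,424"),
    ("SRR13089467", "SAMN16830458", "69,405,468"),
]

totals = reads_per_sample(runs)
for sample, reads in totals.items():
    print(sample, f"{reads:,}")
```

The three SAMN20395992 runs add up to 136,892,040 reads, the value listed for that sample in the by-sample table.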

SRA Long Read Alignment Statistics

The following long-read RNA-Seq data (PacBio, Oxford Nanopore, 454, or other long-read sequencing technologies) from the Sequence Read Archive were also used for gene prediction:

Run	Sample	Number of reads	Number (%) of sequences aligned by Minimap2	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
All	NA	203,849	194,486 (95.40%)	164,118 (80.50%)	98.13%	97.68%
SRR9021293	SAMN10643380	36,209	34,015 (93.94%)	26,739 (73.84%)	97.9%	97.5%
SRR9048326	SAMN10643380	41,486	38,957 (93.90%)	30,943 (74.58%)	97.84%	97.36%
SRR9048327	SAMN10643380	47,775	44,733 (93.63%)	37,586 (78.67%)	98.16%	98.13%
SRR9048328	SAMN10643380	78,379	76,781 (97.96%)	68,850 (87.84%)	98.44%	97.63%
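
The "All" row of the long-read table can be checked against the four runs. The sketch below sums the three count columns; the values are copied from the table, and the variable and function names are illustrative.

```python
# run -> (number of reads, aligned by Minimap2, passed to Gnomon),
# copied from the long-read table.
long_read_runs = {
    "SRR9021293": (36209, 34015, 26739),
    "SRR9048326": (41486, 38957, 30943),
    "SRR9048327": (47775, 44733, 37586),
    "SRR9048328": (78379, 76781, 68850),
}
reported_all = (203849, 194486, 164118)


def column_totals(runs):
    """Sum each count column across all runs."""
    return tuple(sum(column) for column in zip(*runs.values()))


print(column_totals(long_read_runs))
print(column_totals(long_read_runs) == reported_all)
```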

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Acromyrmex echinatior high-quality model RefSeq (XP_)	8,722	6,742 (77.30%)	6,742 (77.30%)	62.35%	60.62%
Nicrophorus vespilloides high-quality model RefSeq (XP_)	10,013	8,765 (87.54%)	8,765 (87.54%)	64.60%	68.54%
Same-species GenBank	154	150 (97.40%)	150 (97.40%)	87.59%	94.09%
Onthophagus taurus high-quality model RefSeq (XP_)	11,850	9,836 (83.00%)	9,836 (83.00%)	64.45%	67.13%
Insecta GenBank	114,308	75,918 (66.42%)	75,918 (66.42%)	67.08%	68.83%
Insecta known RefSeq (NP_)	39,088	9,585 (24.52%)	9,585 (24.52%)	63.16%	54.32%
Tribolium castaneum high-quality model RefSeq (XP_)	11,487	9,776 (85.10%)	9,776 (85.10%)	65.55%	70.16%
Apis mellifera high-quality model RefSeq (XP_)	8,880	7,041 (79.29%)	7,041 (79.29%)	63.20%	61.38%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-100
