NCBI Gigantopelta aegis Annotation Release 100

The RefSeq genome records for Gigantopelta aegis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Gigantopelta aegis Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: May 6 2021
Date of submission of annotation to the public databases: May 9 2021
Software version: 8.6

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Gae_host_genome	GCF_016097555.1	Hong Kong University of Science and Technology	12-17-2020	Reference	15 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Gae_host_genome
Genes and pseudogenes	26,793
protein-coding	22,556
non-coding	3,220
Transcribed pseudogenes	0
Non-transcribed pseudogenes	1,017
genes with variants	5,857
Immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	33,249
fully-supported	26,887
with > 5% ab initio	4,861
partial	550
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	33,249
non-coding RNAs	4,216
fully-supported	2,911
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	3,425
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	33,249
fully-supported	26,887
with > 5% ab initio	5,131
partial	550
with major correction(s)	737
known RefSeq (NP_)	0
model RefSeq (XP_)	33,249

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	25,776	23,857	13,230	65	578,322
All transcripts	37,465	2,961	2,136	65	53,076
mRNA	33,249	3,199	2,345	99	53,076
misc_RNA	700	2,809	2,161	141	18,441
tRNA	791	75	73	68	88
lncRNA	2,211	1,091	710	119	9,489
snoRNA	111	121	87	65	312
snRNA	262	149	139	105	202
guide_RNA	1	128	128	128	128
rRNA	140	404	119	117	4,200
Single-exon transcripts	2,201	1,385	1,095	261	15,698
coding transcripts (NM_/XM_)	2,201	1,385	1,095	261	15,698
CDSs	33,249	1,764	1,257	99	52,929
Exons	207,348	334	137	1	21,274
in coding transcripts (NM_/XM_)	200,278	331	137	1	21,274
in non-coding transcripts (NR_/XR_)	11,336	335	138	2	10,748
Introns	183,105	3,466	1,418	30	451,910
in coding transcripts (NM_/XM_)	178,354	3,433	1,414	30	451,910
in non-coding transcripts (NR_/XR_)	8,832	4,057	1,569	30	236,840
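
As a cross-check of the note that these counts exclude pseudogenes, the short sketch below compares the Genes count in the Feature lengths table with the Feature counts table. The figures are copied from this report; the variable names are illustrative only.

```python
# Values copied from the "Feature counts" table of this report.
genes_and_pseudogenes = 26_793
protein_coding = 22_556
non_coding = 3_220
transcribed_pseudogenes = 0
non_transcribed_pseudogenes = 1_017

# Value copied from the "Genes" row of the "Feature lengths" table.
genes_in_length_table = 25_776

# Removing pseudogenes from the total should give the Genes count.
pseudogenes = transcribed_pseudogenes + non_transcribed_pseudogenes
genes_without_pseudogenes = genes_and_pseudogenes - pseudogenes

print("genes excluding pseudogenes:", genes_without_pseudogenes)
print("protein-coding + non-coding:", protein_coding + non_coding)
print("matches Feature lengths table:", genes_without_pseudogenes == genes_in_length_table)
```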

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.47	1	1	30
Number of exons per transcript	9.51	6	1	235

BUSCO analysis of gene annotation

BUSCO v4.0.2 (Simão et al. 2015, PMID: 26059717) was run in "protein" mode with the mollusca_odb10 lineage dataset on the annotated gene set, using the longest protein of each gene. Results are reported for the gene set from the primary assembly unit and presented in BUSCO notation (C: complete [S: single-copy, D: duplicated], F: fragmented, M: missing, n: number of genes used).
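
To illustrate how the BUSCO notation described above is assembled from category counts, here is a minimal sketch. The function name and the counts are hypothetical placeholders chosen for easy arithmetic; they are not the BUSCO results for Gigantopelta aegis, and the sketch assumes that n is the sum of the four categories.

```python
def busco_notation(single: int, duplicated: int, fragmented: int, missing: int) -> str:
    """Format category counts as a BUSCO-style one-line summary."""
    n = single + duplicated + fragmented + missing
    if n == 0:
        raise ValueError("at least one gene is required")

    def pct(count: int) -> float:
        return 100.0 * count / n

    complete = single + duplicated  # C is the sum of S and D
    return (
        f"C:{pct(complete):.1f}%[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
        f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{n}"
    )

# Placeholder counts (NOT from this report).
summary = busco_notation(single=90, duplicated=4, fragmented=2, missing=4)
print(summary)
```

Because complete genes are either single-copy or duplicated, the S and D percentages always add up to C.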

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 22,556 coding genes, 15,263 had a protein with an alignment covering 50% or more of the query, and 3,367 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
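
The two definitions above can be sketched as a small calculation. This is an illustration only: it assumes a single contiguous alignment interval with 1-based inclusive coordinates, and the function name and the example alignment are hypothetical, not taken from the pipeline.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percentage of a sequence included in one alignment interval.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    if not (1 <= aln_start <= aln_end <= seq_length):
        raise ValueError("alignment interval must lie within the sequence")
    aligned_length = aln_end - aln_start + 1
    return 100.0 * aligned_length / seq_length

# Hypothetical alignment: residues 11-310 of a 400-residue annotated protein
# (query) aligned to residues 1-300 of a 300-residue Swiss-Prot protein (target).
query_cov = coverage_percent(11, 310, 400)
target_cov = coverage_percent(1, 300, 300)

print(f"query coverage:  {query_cov:.1f}%")
print(f"target coverage: {target_cov:.1f}%")
```

In this made-up case the annotated protein is longer than its Swiss-Prot match, so the target is fully covered while the query is not.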

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Gae_host_genome	GCF_016097555.1		46.19%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign, and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

No transcript evidence was used in this annotation.

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,834,942,118	74%	16%	201,230
SAMN16905113	cephalic tentacles, ctenidium, digestive gland, foot, nephridium, mantle, operculum, oesophageal gland, epipodial tentacles, testis (Gigantopelta aegis, SAMN16905113)	533,343,224	75%	20%	176,275
SAMN16909869	oesophageal gland, cephalic tentacles, ctenidium, ventricle heart, epipodial tentacles, foot, ovary, internal mantle, mantle edge, auricle heart (Gigantopelta aegis, SAMN16909869)	384,318,396	76%	16%	179,397
SAMN16909871	cephalic tentacles, ctenidium, digestive gland, foot, gonad, internal mantle, mantle edge, oesophageal gland (Gigantopelta aegis, SAMN16909871)	416,351,082	73%	16%	189,931
SAMN16909873	cephalic tentacles, ctenidium, digestive gland, foot, ovary, epipodial tentacles, mantle edge, oesophageal gland, ventricle heart (Gigantopelta aegis, SAMN16909873)	500,929,416	74%	11%	186,212
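
The aggregate row can be checked against the per-sample rows with a few lines of code. The read counts below are copied from the table above; the variable names are illustrative.

```python
# Number of reads per sample, copied from the by-sample table.
sample_reads = {
    "SAMN16905113": 533_343_224,
    "SAMN16909869": 384_318_396,
    "SAMN16909871": 416_351_082,
    "SAMN16909873": 500_929_416,
}

# Value reported in the "All" (aggregate) row.
reported_aggregate = 1_834_942_118

total_reads = sum(sample_reads.values())
print(f"sum of samples:     {total_reads:,}")
print(f"reported aggregate: {reported_aggregate:,}")
print("consistent:", total_reads == reported_aggregate)
```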

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR13131273	SRX9571883	SRP293858	SAMN16905113	37,519,994	72%	17%
SRR13131272	SRX9571884	SRP293858	SAMN16905113	86,327,678	65%	24%
SRR13131271	SRX9571885	SRP293858	SAMN16905113	46,056,578	84%	40%
SRR13131270	SRX9571886	SRP293858	SAMN16905113	36,688,636	79%	5%
SRR13131269	SRX9571887	SRP293858	SAMN16905113	37,601,686	84%	41%
SRR13131268	SRX9571888	SRP293858	SAMN16905113	62,871,456	77%	30%
SRR13131267	SRX9571889	SRP293858	SAMN16905113	33,303,586	71%	7%
SRR13131266	SRX9571890	SRP293858	SAMN16905113	101,244,992	72%	6%
SRR13131265	SRX9571891	SRP293858	SAMN16905113	46,924,072	79%	16%
SRR13131264	SRX9571892	SRP293858	SAMN16905113	44,804,546	76%	10%
SRR13131425	SRX9572034	SRP293858	SAMN16909869	38,765,348	71%	14%
SRR13131424	SRX9572035	SRP293858	SAMN16909869	42,849,664	63%	16%
SRR13131423	SRX9572036	SRP293858	SAMN16909869	61,665,090	84%	3%
SRR13131422	SRX9572037	SRP293858	SAMN16909869	40,563,874	76%	17%
SRR13131421	SRX9572038	SRP293858	SAMN16909869	38,479,932	70%	17%
SRR13131420	SRX9572039	SRP293858	SAMN16909869	45,663,166	87%	14%
SRR13131419	SRX9572040	SRP293858	SAMN16909869	35,283,212	75%	23%
SRR13131418	SRX9572041	SRP293858	SAMN16909869	40,839,008	73%	27%
SRR13131415	SRX9572044	SRP293858	SAMN16909869	40,209,102	82%	26%
SRR13131414	SRX9572025	SRP293858	SAMN16909871	39,289,210	69%	14%
SRR13131413	SRX9572026	SRP293858	SAMN16909871	40,755,872	69%	14%
SRR13131412	SRX9572027	SRP293858	SAMN16909871	39,928,094	85%	17%
SRR13131411	SRX9572028	SRP293858	SAMN16909871	38,047,576	72%	16%
SRR13131410	SRX9572029	SRP293858	SAMN16909871	42,979,606	84%	19%
SRR13131409	SRX9572030	SRP293858	SAMN16909871	40,992,658	80%	24%
SRR13131408	SRX9572031	SRP293858	SAMN16909871	38,131,842	80%	24%
SRR13131407	SRX9572032	SRP293858	SAMN16909871	136,226,224	66%	11%
SRR13131435	SRX9572045	SRP293858	SAMN16909873	45,453,406	66%	18%
SRR13131434	SRX9572046	SRP293858	SAMN16909873	41,484,232	64%	18%
SRR13131433	SRX9572047	SRP293858	SAMN16909873	50,399,428	86%	13%
SRR13131432	SRX9572048	SRP293858	SAMN16909873	43,180,126	80%	7%
SRR13131431	SRX9572049	SRP293858	SAMN16909873	40,721,278	84%	19%
SRR13131430	SRX9572050	SRP293858	SAMN16909873	41,337,662	74%	20%
SRR13131429	SRX9572051	SRP293858	SAMN16909873	40,583,218	76%	23%
SRR13131428	SRX9572052	SRP293858	SAMN16909873	158,304,352	67%	2%
SRR13131426	SRX9572054	SRP293858	SAMN16909873	39,465,714	87%	2%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Crassostrea gigas high-quality model RefSeq (XP_)	28,029	17,501 (62.44%)	17,501 (62.44%)	59.12%	45.32%
Mollusca GenBank	15,301	7,559 (49.40%)	7,559 (49.40%)	70.53%	74.13%
Mollusca known RefSeq (NP_)	484	20 (4.13%)	20 (4.13%)	68.76%	73.00%
Aplysia californica high-quality model RefSeq (XP_)	9,849	7,414 (75.28%)	7,414 (75.28%)	62.15%	57.94%
Drosophila melanogaster known RefSeq (NP_)	30,704	11,447 (37.28%)	11,447 (37.28%)	59.81%	45.89%
Strongylocentrotus purpuratus high-quality model RefSeq (XP_)	19,173	11,481 (59.88%)	11,481 (59.88%)	60.66%	47.48%
Strongylocentrotus purpuratus known RefSeq (NP_)	425	322 (75.76%)	322 (75.76%)	67.77%	64.32%
Ciona intestinalis known RefSeq (NP_)	942	626 (66.45%)	626 (66.45%)	62.49%	47.31%
Homo sapiens known RefSeq (NP_)	62,211	30,384 (48.84%)	30,384 (48.84%)	60.77%	49.31%
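
The percentages in parentheses appear to be the aligned count divided by the number of sequences retrieved from Entrez. The sketch below recomputes them for three rows copied from the table; the helper name is hypothetical, and rounding to two decimals is an assumption about how the report formats the values.

```python
# (source, retrieved from Entrez, aligned by ProSplign, reported %) from the table above.
rows = [
    ("Crassostrea gigas high-quality model RefSeq (XP_)", 28_029, 17_501, 62.44),
    ("Mollusca GenBank", 15_301, 7_559, 49.40),
    ("Mollusca known RefSeq (NP_)", 484, 20, 4.13),
]

def percent_aligned(retrieved: int, aligned: int) -> float:
    """Aligned sequences as a percentage of retrieved sequences, to 2 decimals."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    return round(100.0 * aligned / retrieved, 2)

for source, retrieved, aligned, reported in rows:
    computed = percent_aligned(retrieved, aligned)
    print(f"{source}: computed {computed:.2f}%, reported {reported:.2f}%")
```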

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-100
