NCBI Gadus chalcogrammus Annotation Release GCF_026213295.1-RS_2023_05

The genome sequence records for Gadus chalcogrammus RefSeq assembly GCF_026213295.1 (NIFS_Gcha_1.0) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, and the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_026213295.1-RS_2023_05".
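
The release name can be split into its parts programmatically. The sketch below assumes the layout "assembly accession, then -RS_, then a four-digit year and two-digit month"; that layout is inferred from this one name and is not stated in the report. The function name parse_release_name is hypothetical.

```python
import re

# Assumed layout (inferred, not documented here): <accession>-RS_<year>_<month>
RELEASE_RE = re.compile(r"^(GC[AF]_\d+\.\d+)-RS_(\d{4})_(\d{2})$")


def parse_release_name(name):
    """Split a release name into accession, year and month, or raise ValueError."""
    match = RELEASE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized release name: {name!r}")
    accession, year, month = match.groups()
    return {"accession": accession, "year": int(year), "month": int(month)}


release = parse_release_name("GCF_026213295.1-RS_2023_05")
print(release)
```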

Date of Entrez queries for transcripts and proteins: May 25 2023
Date of submission of annotation to the public databases: May 31 2023
Software version: 10.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
NIFS_Gcha_1.0	GCF_026213295.1	National Institute of Fisheries Science	11-18-2022	Reference	24 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	NIFS_Gcha_1.0
Genes and pseudogenes	37,220
  protein-coding	23,523
  non-coding	12,556
  Transcribed pseudogenes	0
  Non-transcribed pseudogenes	1,082
  genes with variants	7,552
  Immunoglobulin/T-cell receptor gene segments	45
  other	14
mRNAs	37,388
  fully-supported	33,643
  with > 5% ab initio	1,669
  partial	249
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	37,388
non-coding RNAs	14,602
  fully-supported	3,753
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	11,696
pseudo transcripts	0
  fully-supported	0
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	0
CDSs	37,446
  fully-supported	33,643
  with > 5% ab initio	1,967
  partial	270
  with major correction(s)	645
  known RefSeq (NP_)	13
  model RefSeq (XP_)	37,388
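
As a quick consistency check on the gene rows of the feature counts table, the sketch below sums the gene categories and compares the result with the "Genes and pseudogenes" total. It assumes that "genes with variants" is an overlapping attribute rather than a separate category, so that row is left out of the sum; the report does not state this explicitly.

```python
# Gene counts copied from the "Feature counts" table.
total_genes_and_pseudogenes = 37_220
gene_categories = {
    "protein-coding": 23_523,
    "non-coding": 12_556,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 1_082,
    "immunoglobulin/T-cell receptor gene segments": 45,
    "other": 14,
}
# "genes with variants" (7,552) is assumed to overlap the categories above,
# so it is deliberately not part of the sum.

category_sum = sum(gene_categories.values())
print("sum of categories:", category_sum)
print("matches reported total:", category_sum == total_genes_and_pseudogenes)

# Share of each category, as a percentage of the reported total.
for name, count in gene_categories.items():
    print(f"{name}: {100 * count / total_genes_and_pseudogenes:.2f}%")
```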

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	36,093	10,385	4,184	54	1,260,456
All transcripts	51,990	2,655	2,141	54	89,055
  mRNA	37,388	3,393	2,769	186	89,055
  misc_RNA	806	3,902	3,165	176	24,243
  tRNA	2,904	75	73	67	93
  lncRNA	2,948	2,316	1,532	113	24,202
  snoRNA	361	117	99	65	332
  snRNA	453	156	165	54	196
  rRNA	7,116	120	119	114	3,937
Single-exon transcripts	848	1,876	1,448	186	10,354
  coding transcripts (NM_/XM_)	848	1,876	1,448	186	10,354
CDSs	37,401	1,971	1,437	96	87,789
Exons	277,269	314	138	1	20,939
  in coding transcripts (NM_/XM_)	266,785	308	138	1	20,939
  in non-coding transcripts (NR_/XR_)	14,169	419	140	10	19,942
Introns	245,980	1,346	455	30	1,158,275
  in coding transcripts (NM_/XM_)	238,409	1,352	461	30	1,158,275
  in non-coding transcripts (NR_/XR_)	11,228	1,091	300	30	88,154
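
The gene and transcript counts in the feature lengths table can be reconciled with the feature counts table by simple arithmetic, sketched below. The report states only that pseudogenes are excluded here; that the 36,093 genes correspond to the protein-coding, non-coding and "other" categories (leaving out the 45 immunoglobulin/T-cell receptor gene segments as well) is an inference from the numbers, not a documented rule.

```python
# Values copied from the "Feature counts" and "Feature lengths" tables.
protein_coding, non_coding, other = 23_523, 12_556, 14
mrnas, non_coding_rnas = 37_388, 14_602
genes_in_length_table = 36_093
all_transcripts_in_length_table = 51_990

# Inferred composition of the "Genes" row (an assumption, see lead-in).
gene_sum = protein_coding + non_coding + other
# "All transcripts" as the sum of mRNAs and non-coding RNAs.
transcript_sum = mrnas + non_coding_rnas

print("genes:", gene_sum, gene_sum == genes_in_length_table)
print("transcripts:", transcript_sum,
      transcript_sum == all_transcripts_in_length_table)
```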

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.48	1	1	50
Number of exons per transcript	9.88	7	1	219

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein per gene and the actinopterygii_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and presented in BUSCO notation.
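
The "one longest protein per gene" selection can be illustrated as follows. This is a minimal sketch, not the pipeline's actual code: the function name, the tuple layout, the tie-breaking rule (first seen wins) and the identifiers and sequences in the example are all hypothetical.

```python
def longest_protein_per_gene(proteins):
    """Keep the longest protein sequence for each gene.

    proteins: iterable of (gene_id, protein_id, sequence) tuples.
    Returns {gene_id: (protein_id, sequence)}; on ties the first one seen is kept.
    """
    best = {}
    for gene_id, protein_id, sequence in proteins:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


# Hypothetical identifiers and sequences, for illustration only.
example = [
    ("geneA", "protA.1", "MKTAYIAK"),
    ("geneA", "protA.2", "MKTAYIAKQRQISFVK"),
    ("geneB", "protB.1", "MSDNE"),
]
selected = longest_protein_per_gene(example)
for gene_id, (protein_id, sequence) in selected.items():
    print(gene_id, protein_id, len(sequence))
```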

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,510 coding genes, 21,107 had a protein with an alignment covering 50% or more of the query, and 9,110 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
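
The two coverage definitions above can be written out as a short calculation. The sketch assumes a single alignment described by 1-based inclusive start and end coordinates on each sequence; the coordinates and lengths in the example are hypothetical, while the gene counts are those given in this section.

```python
def coverage_percent(start, end, length):
    """Percentage of a sequence included in an alignment (1-based, inclusive)."""
    aligned = abs(end - start) + 1
    return 100.0 * aligned / length


# Hypothetical alignment: residues 11-200 of a 200-residue annotated protein
# (query) aligned to residues 1-190 of a 380-residue curated protein (target).
query_cov = coverage_percent(11, 200, 200)
target_cov = coverage_percent(1, 190, 380)
print(f"query coverage:  {query_cov:.1f}%")
print(f"target coverage: {target_cov:.1f}%")

# Gene counts reported in this section, expressed as fractions of coding genes.
coding_genes = 23_510
genes_query_cov_ge_50 = 21_107
genes_query_cov_ge_95 = 9_110
pct_ge_50 = 100.0 * genes_query_cov_ge_50 / coding_genes
pct_ge_95 = 100.0 * genes_query_cov_ge_95 / coding_genes
print(f"genes with >= 50% query coverage: {pct_ge_50:.2f}%")
print(f"genes with >= 95% query coverage: {pct_ge_95:.2f}%")
```

In this example the same 190-residue alignment gives a high query coverage but only a moderate target coverage, which is why the two measures are reported separately.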

[Figure: cumulative graph of the number of genes with alignments above a given query or target coverage threshold, with corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline added for comparison. Query: annotated proteins. Target: UniProtKB/Swiss-Prot curated proteins.]

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
NIFS_Gcha_1.0	GCF_026213295.1	32.71%
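
A masked percentage of this kind can be estimated from a soft-masked FASTA file, in which masked bases are written in lowercase. The sketch below assumes soft-masking and counts every sequence character in the denominator; whether the pipeline computes its figure in exactly this way (for example, how gap characters are treated) is not stated in the report. The toy sequence is hypothetical.

```python
def masked_percent(fasta_lines):
    """Percentage of lowercase (soft-masked) bases in FASTA-formatted lines."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return 100.0 * masked / total if total else 0.0


# Hypothetical toy record: 10 of 28 bases are lowercase.
toy_fasta = [">toy_scaffold", "ACGTacgtacGTAC", "ggggCCCCAAAATT"]
toy_pct = masked_percent(toy_fasta)
print(f"{toy_pct:.2f}% masked")
```

For a real assembly the same function can be fed an open file handle, since it only iterates over lines.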

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	22	20 (90.91%)	20 (90.91%)	99.02%	98.94%

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,553,913,776	82%	27%	284,356
SAMN19665371	Gonad (Gadus chalcogrammus, 10 months, SAMN19665371)	50,352,598	85%	27%	202,903
SAMN19665372	Liver (Gadus chalcogrammus, 10 months, SAMN19665372)	61,087,002	86%	28%	161,113
SAMN19665373	Muscle (Gadus chalcogrammus, 10 months, SAMN19665373)	65,473,890	79%	30%	173,313
SAMN19665374	Spleen (Gadus chalcogrammus, 10 months, SAMN19665374)	56,755,536	85%	19%	173,611
SAMN19665375	Stomach (Gadus chalcogrammus, 10 months, SAMN19665375)	62,892,360	78%	26%	183,543
SAMN19665376	Gonad (Gadus chalcogrammus, 10 months, SAMN19665376)	71,849,382	82%	27%	211,665
SAMN19665377	Liver (Gadus chalcogrammus, 10 months, SAMN19665377)	54,920,392	86%	29%	156,534
SAMN19665378	Muscle (Gadus chalcogrammus, 10 months, SAMN19665378)	53,783,086	80%	32%	158,494
SAMN19665379	Liver (Gadus chalcogrammus, 10 months, SAMN19665379)	60,979,632	83%	20%	156,136
SAMN19665380	Muscle (Gadus chalcogrammus, 10 months, SAMN19665380)	57,910,938	82%	32%	165,745
SAMN19665381	Liver (Gadus chalcogrammus, 10 months, SAMN19665381)	64,789,920	86%	29%	152,211
SAMN19665382	Muscle (Gadus chalcogrammus, 10 months, SAMN19665382)	61,724,994	77%	30%	156,140
SAMN19665383	Gill (Gadus chalcogrammus, 10 months, SAMN19665383)	52,517,110	80%	16%	187,087
SAMN19665384	Gonad (Gadus chalcogrammus, 10 months, SAMN19665384)	62,399,618	80%	27%	194,578
SAMN19665385	Heart (Gadus chalcogrammus, 10 months, SAMN19665385)	59,621,074	80%	22%	177,258
SAMN19665386	Liver (Gadus chalcogrammus, 10 months, SAMN19665386)	58,762,022	86%	29%	144,309
SAMN19665387	Muscle (Gadus chalcogrammus, 10 months, SAMN19665387)	52,492,698	79%	31%	143,092
SAMN19665388	Gonad (Gadus chalcogrammus, 10 months, SAMN19665388)	50,773,456	76%	26%	195,612
SAMN19665389	Heart (Gadus chalcogrammus, 10 months, SAMN19665389)	61,854,590	79%	23%	181,237
SAMN19665390	Liver (Gadus chalcogrammus, 10 months, SAMN19665390)	55,959,556	82%	30%	143,880
SAMN19665391	Muscle (Gadus chalcogrammus, 10 months, SAMN19665391)	69,146,918	80%	30%	165,609
SAMN19665392	Gill (Gadus chalcogrammus, 10 months, SAMN19665392)	56,862,030	81%	18%	183,247
SAMN19665393	Liver (Gadus chalcogrammus, 10 months, SAMN19665393)	69,441,252	88%	30%	156,209
SAMN19665394	Muscle (Gadus chalcogrammus, 10 months, SAMN19665394)	60,807,206	81%	29%	168,840
SAMN19665395	Liver (Gadus chalcogrammus, 10 months, SAMN19665395)	54,928,456	84%	31%	144,055
SAMN19665396	Muscle (Gadus chalcogrammus, 10 months, SAMN19665396)	65,828,060	79%	30%	164,960
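
The per-sample rows can be aggregated by tissue and checked against the "All" row. The sketch below copies the tissue name and read count of each sample from the table above, in table order, and confirms that the per-sample read counts add up to the aggregate of 1,553,913,776 reads.

```python
from collections import defaultdict

# Tissue and number of reads per sample, copied from the by-sample table
# (SAMN19665371 through SAMN19665396, in table order).
SAMPLES = """\
Gonad 50,352,598
Liver 61,087,002
Muscle 65,473,890
Spleen 56,755,536
Stomach 62,892,360
Gonad 71,849,382
Liver 54,920,392
Muscle 53,783,086
Liver 60,979,632
Muscle 57,910,938
Liver 64,789,920
Muscle 61,724,994
Gill 52,517,110
Gonad 62,399,618
Heart 59,621,074
Liver 58,762,022
Muscle 52,492,698
Gonad 50,773,456
Heart 61,854,590
Liver 55,959,556
Muscle 69,146,918
Gill 56,862,030
Liver 69,441,252
Muscle 60,807,206
Liver 54,928,456
Muscle 65,828,060
"""

reads_per_tissue = defaultdict(int)
samples_per_tissue = defaultdict(int)
for line in SAMPLES.splitlines():
    tissue, reads = line.split()
    reads_per_tissue[tissue] += int(reads.replace(",", ""))
    samples_per_tissue[tissue] += 1

total_reads = sum(reads_per_tissue.values())
print("total reads:", f"{total_reads:,}")
print("matches the 'All' row:", total_reads == 1_553_913_776)
for tissue in sorted(reads_per_tissue):
    print(f"{tissue}: {samples_per_tissue[tissue]} samples, "
          f"{reads_per_tissue[tissue]:,} reads")
```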

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR14784759	SRX11118059	SRP323692	SAMN19665371	50,352,598	85%	27%
SRR14784750	SRX11118068	SRP323692	SAMN19665372	61,087,002	86%	28%
SRR14784749	SRX11118069	SRP323692	SAMN19665373	65,473,890	79%	30%
SRR14784748	SRX11118070	SRP323692	SAMN19665374	56,755,536	85%	19%
SRR14784747	SRX11118071	SRP323692	SAMN19665375	62,892,360	78%	26%
SRR14784746	SRX11118072	SRP323692	SAMN19665376	71,849,382	82%	27%
SRR14784745	SRX11118073	SRP323692	SAMN19665377	54,920,392	86%	29%
SRR14784744	SRX11118074	SRP323692	SAMN19665378	53,783,086	80%	32%
SRR14784769	SRX11118049	SRP323692	SAMN19665379	60,979,632	83%	20%
SRR14784768	SRX11118050	SRP323692	SAMN19665380	57,910,938	82%	32%
SRR14784767	SRX11118051	SRP323692	SAMN19665381	64,789,920	86%	29%
SRR14784766	SRX11118052	SRP323692	SAMN19665382	61,724,994	77%	30%
SRR14784765	SRX11118053	SRP323692	SAMN19665383	52,517,110	80%	16%
SRR14784764	SRX11118054	SRP323692	SAMN19665384	62,399,618	80%	27%
SRR14784763	SRX11118055	SRP323692	SAMN19665385	59,621,074	80%	22%
SRR14784762	SRX11118056	SRP323692	SAMN19665386	58,762,022	86%	29%
SRR14784761	SRX11118057	SRP323692	SAMN19665387	52,492,698	79%	31%
SRR14784760	SRX11118058	SRP323692	SAMN19665388	50,773,456	76%	26%
SRR14784758	SRX11118060	SRP323692	SAMN19665389	61,854,590	79%	23%
SRR14784757	SRX11118061	SRP323692	SAMN19665390	55,959,556	82%	30%
SRR14784756	SRX11118062	SRP323692	SAMN19665391	69,146,918	80%	30%
SRR14784755	SRX11118063	SRP323692	SAMN19665392	56,862,030	81%	18%
SRR14784754	SRX11118064	SRP323692	SAMN19665393	69,441,252	88%	30%
SRR14784753	SRX11118065	SRP323692	SAMN19665394	60,807,206	81%	29%
SRR14784752	SRX11118066	SRP323692	SAMN19665395	54,928,456	84%	31%
SRR14784751	SRX11118067	SRP323692	SAMN19665396	65,828,060	79%	30%

SRA Long Read Alignment Statistics

The alignments of the following long RNA-Seq reads (PacBio, Oxford Nanopore, 454, or other long-read sequencing technologies) from the Sequence Read Archive with minimap2 were used for gene prediction:

Run	Sample	Number of reads	Number (%) of sequences aligned by minimap2	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
All	NA	36,473	35,667 (97.79%)	32,196 (88.27%)	99.56%	99.04%
SRR14784770	SAMN19665370	36,473	35,667 (97.79%)	32,196 (88.27%)	99.56%	99.04%

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Betta splendens high-quality model RefSeq (XP_)	18,289	17,627 (96.38%)	17,627 (96.38%)	69.15%	77.35%
Actinopterygii GenBank	74,260	67,749 (91.23%)	67,749 (91.23%)	68.66%	79.68%
Actinopterygii known RefSeq (NP_)	25,459	23,535 (92.44%)	23,535 (92.44%)	68.50%	78.05%
Danio rerio high-quality model RefSeq (XP_)	7,712	7,172 (93.00%)	7,172 (93.00%)	67.27%	72.94%
Astyanax mexicanus high-quality model RefSeq (XP_)	19,875	18,829 (94.74%)	18,829 (94.74%)	66.97%	75.33%
Esox lucius high-quality model RefSeq (XP_)	18,508	17,574 (94.95%)	17,574 (94.95%)	68.32%	75.96%
Xiphophorus maculatus high-quality model RefSeq (XP_)	18,457	17,692 (95.86%)	17,692 (95.86%)	68.30%	76.62%
Homo sapiens known RefSeq (NP_)	66,957	54,516 (81.42%)	54,516 (81.42%)	67.37%	69.87%
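
The "Number (%)" cells in the alignment tables combine a count with a percentage of the retrieved sequences. The sketch below parses such cells and recomputes the percentage from the counts, using two rows copied from the protein table above. The helper names are hypothetical, and the cell format is assumed from the rows shown here.

```python
import re

# Matches cells such as "17,627 (96.38%)".
CELL_RE = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_percent(cell):
    """Return (count, percent) from a 'N (P%)' table cell."""
    match = CELL_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


def aligned_percent(retrieved, aligned):
    """Aligned sequences as a percentage of retrieved sequences."""
    return 100.0 * aligned / retrieved


# (source, retrieved from Entrez, aligned by ProSplign) copied from the table.
rows = [
    ("Betta splendens high-quality model RefSeq (XP_)", "18,289", "17,627 (96.38%)"),
    ("Homo sapiens known RefSeq (NP_)", "66,957", "54,516 (81.42%)"),
]

checks = []
for source, retrieved_cell, aligned_cell in rows:
    retrieved = int(retrieved_cell.replace(",", ""))
    aligned, reported = parse_count_percent(aligned_cell)
    recomputed = aligned_percent(retrieved, aligned)
    # Agreement to two decimal places, allowing for rounding in the report.
    agrees = abs(recomputed - reported) < 0.005
    checks.append(agrees)
    print(f"{source}: reported {reported:.2f}%, recomputed {recomputed:.2f}%")
```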

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
