NCBI Girardinichthys multiradiatus Annotation Release 100

The RefSeq genome records for Girardinichthys multiradiatus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Girardinichthys multiradiatus Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Apr 1 2022
Date of submission of annotation to the public databases: Apr 3 2022
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
DD_fGirMul_XY1	GCF_021462225.1	Max Planck Institute of Molecular Cell Biology and Genetics	02-24-2022	Reference	25 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	DD_fGirMul_XY1
Genes and pseudogenes	29,994
protein-coding	23,723
non-coding	5,631
Transcribed pseudogenes	0
Non-transcribed pseudogenes	453
genes with variants	11,439
Immunoglobulin/T-cell receptor gene segments	180
other	7
mRNAs	48,638
fully-supported	47,615
with > 5% ab initio	466
partial	78
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	48,638
non-coding RNAs	8,414
fully-supported	7,257
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	7,582
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	48,818
fully-supported	47,615
with > 5% ab initio	544
partial	81
with major correction(s)	241
known RefSeq (NP_)	0
model RefSeq (XP_)	48,638
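
As a quick consistency check, the gene-level rows of the table above appear to add up to the "Genes and pseudogenes" total when they are treated as mutually exclusive categories. That disjointness is an assumption made for this sketch, not something the report states; the code only redoes the arithmetic on the published counts.

```python
# Gene-level counts copied from the feature counts table above.
# Assumption: these categories do not overlap. "genes with variants" is
# left out because it describes genes already counted in other rows.
gene_counts = {
    "protein-coding": 23_723,
    "non-coding": 5_631,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 453,
    "Ig/TCR gene segments": 180,
    "other": 7,
}
reported_total = 29_994  # "Genes and pseudogenes" row

computed_total = sum(gene_counts.values())
print(computed_total, computed_total == reported_total)  # prints: 29994 True
```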

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	29,361	23,641	10,040	57	1,067,819
All transcripts	57,052	3,370	2,749	57	102,054
mRNA	48,638	3,700	3,073	195	102,054
misc_RNA	1,554	3,093	2,624	215	13,614
tRNA	832	74	73	70	87
lncRNA	5,703	1,295	922	100	12,044
snoRNA	167	116	101	64	331
snRNA	128	153	164	57	191
rRNA	23	362	119	119	3,952
Single-exon transcripts	767	2,140	1,783	195	13,084
coding transcripts (NM_/XM_)	767	2,140	1,783	195	13,084
CDSs	48,638	2,089	1,557	96	100,737
Exons	305,280	312	139	1	30,582
in coding transcripts (NM_/XM_)	287,492	306	139	1	30,582
in non-coding transcripts (NR_/XR_)	28,709	326	140	2	9,272
Introns	272,707	2,787	532	30	1,007,393
in coding transcripts (NM_/XM_)	260,683	2,743	533	30	1,007,393
in non-coding transcripts (NR_/XR_)	22,682	3,281	544	30	410,837

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.97	1	1	50
Number of exons per transcript	12.29	9	1	247

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein for each gene and the cyprinodontiformes_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and presented in BUSCO notation.
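
The selection of one longest protein per gene can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name, the record layout (gene ID, protein ID, length) and the tie-breaking rule are all assumptions, and the example records are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def longest_protein_per_gene(
    proteins: Iterable[Tuple[str, str, int]]
) -> Dict[str, Tuple[str, int]]:
    """Map each gene ID to the (protein ID, length) of its longest protein.

    Assumption: when two proteins of a gene have the same length, the one
    seen first is kept. The report does not say how ties are resolved.
    """
    best: Dict[str, Tuple[str, int]] = {}
    for gene_id, protein_id, length in proteins:
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (protein_id, length)
    return best


# Hypothetical records, not taken from this annotation.
example = [
    ("geneA", "protA.1", 310),
    ("geneA", "protA.2", 455),
    ("geneB", "protB.1", 120),
]
selected = longest_protein_per_gene(example)
print(selected)
```

Reducing the set to one protein per gene keeps alternative isoforms from being counted as duplicated BUSCO hits.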

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,723 coding genes, 21,728 had a protein with an alignment covering 50% or more of the query and 10,496 had a protein with an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
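
The two coverage definitions can be illustrated with a short sketch. It assumes a single contiguous alignment span with 1-based inclusive coordinates; real BLASTP output may contain several segments per hit, which this simplified version does not merge. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(start: int, end: int, length: int) -> float:
    """Percentage of a sequence of `length` residues covered by the
    alignment span start..end (1-based, inclusive)."""
    if length <= 0:
        raise ValueError("length must be positive")
    if not 1 <= start <= end <= length:
        raise ValueError("span must lie within the sequence")
    return 100.0 * (end - start + 1) / length


# Hypothetical alignment, not taken from this report:
# residues 1-380 of a 400-residue annotated protein (query) align to
# residues 21-400 of a 500-residue Swiss-Prot protein (target).
query_cov = coverage_percent(1, 380, 400)
target_cov = coverage_percent(21, 400, 500)
print(f"query coverage {query_cov:.1f}%, target coverage {target_cov:.1f}%")
```

In this made-up case the query coverage is 95%, so the gene would count toward both the "50% or more" and the "95% or more" tallies, while the target coverage is lower because the target protein is longer than the aligned region.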

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
DD_fGirMul_XY1	GCF_021462225.1	37.82%
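
A masked percentage of this kind could be recomputed roughly as below. This is a sketch under stated assumptions: the input is taken to be a soft-masked FASTA in which masked bases are lowercase, and every sequence character (including any N) counts toward the denominator. The report does not say how the pipeline computes the figure, and the function name and toy sequence are hypothetical.

```python
from typing import Iterable


def masked_percent(fasta_lines: Iterable[str]) -> float:
    """Percentage of sequence characters that are lowercase (soft-masked)."""
    total = 0
    masked = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    if total == 0:
        raise ValueError("no sequence characters found")
    return 100.0 * masked / total


# Toy input, not real genome sequence: 8 of 20 bases are lowercase.
toy = [">scaffold_1 (hypothetical)", "ACGTacgtAC", "ggTTAACCgg"]
print(f"{masked_percent(toy):.2f}%")  # prints: 40.00%
```

For a real assembly the same function can be fed an open file handle, since it only iterates over lines.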

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

No transcript evidence was used in this annotation.

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	2,198,856,664	85%	42%	319,701
SAMN20168748	brain (Girardinichthys multiradiatus, female, SAMN20168748)	71,352,894	82%	33%	227,918
SAMN20168749	brain (Girardinichthys multiradiatus, female, SAMN20168749)	72,191,532	82%	35%	232,172
SAMN20168750	brain (Girardinichthys multiradiatus, female, SAMN20168750)	71,886,666	83%	36%	237,684
SAMN20168751	brain (Girardinichthys multiradiatus, male, SAMN20168751)	71,820,614	82%	33%	239,845
SAMN20168752	brain (Girardinichthys multiradiatus, male, SAMN20168752)	63,897,244	81%	35%	231,614
SAMN20168753	brain (Girardinichthys multiradiatus, male, SAMN20168753)	86,722,810	81%	34%	238,708
SAMN20168754	late embryos, whole fish (Girardinichthys multiradiatus, SAMN20168754)	71,348,668	87%	51%	235,409
SAMN20168755	late embryos, whole fish (Girardinichthys multiradiatus, SAMN20168755)	71,575,968	87%	51%	240,772
SAMN20168756	late embryos, whole fish (Girardinichthys multiradiatus, SAMN20168756)	71,508,030	87%	50%	235,125
SAMN20168757	early embryos, whole fish (Girardinichthys multiradiatus, SAMN20168757)	71,970,304	87%	51%	237,016
SAMN20168758	early embryos, whole fish (Girardinichthys multiradiatus, SAMN20168758)	72,060,922	87%	50%	241,205
SAMN20168759	early embryos, whole fish (Girardinichthys multiradiatus, SAMN20168759)	71,901,168	87%	49%	235,484
SAMN20168760	early embryos, whole fish (Girardinichthys multiradiatus, SAMN20168760)	66,830,050	88%	48%	252,050
SAMN20168761	ovary (Girardinichthys multiradiatus, female, SAMN20168761)	67,057,468	88%	47%	224,646
SAMN20168762	ovary (Girardinichthys multiradiatus, female, SAMN20168762)	66,891,962	88%	50%	230,331
SAMN20168763	ovary (Girardinichthys multiradiatus, female, SAMN20168763)	67,342,360	88%	48%	222,433
SAMN20168764	testis (Girardinichthys multiradiatus, male, SAMN20168764)	67,198,020	89%	52%	232,116
SAMN20168765	testis (Girardinichthys multiradiatus, male, SAMN20168765)	67,820,164	87%	49%	234,994
SAMN20168766	testis (Girardinichthys multiradiatus, male, SAMN20168766)	66,968,618	87%	43%	236,172
SAMN20168767	trophotaenia (Girardinichthys multiradiatus, female, SAMN20168767)	66,838,590	83%	45%	204,356
SAMN20168768	adult, organmix (Girardinichthys multiradiatus, SAMN20168768)	43,808,102	90%	29%	226,896
SAMN20168769	embryo, whole fish (Girardinichthys multiradiatus, SAMN20168769)	67,945,962	87%	43%	249,703
SAMN20168770	ovary (Girardinichthys multiradiatus, female, SAMN20168770)	67,414,372	88%	39%	223,756
SAMN20168771	embryo, whole fish (Girardinichthys multiradiatus, SAMN20168771)	67,194,266	87%	42%	249,036
SAMN20168772	ovary (Girardinichthys multiradiatus, female, SAMN20168772)	68,894,754	87%	38%	244,783
SAMN20168773	trophotaenia (Girardinichthys multiradiatus, female, SAMN20168773)	69,310,510	82%	35%	196,148
SAMN20168774	embryo, whole fish (Girardinichthys multiradiatus, SAMN20168774)	68,841,208	87%	42%	250,548
SAMN20168775	ovary (Girardinichthys multiradiatus, female, SAMN20168775)	67,472,004	88%	39%	234,250
SAMN20168776	embryo, whole fish (Girardinichthys multiradiatus, SAMN20168776)	67,942,708	87%	40%	251,835
SAMN20168777	ovary (Girardinichthys multiradiatus, female, SAMN20168777)	69,674,622	89%	41%	235,042
SAMN20168778	trophotaenia (Girardinichthys multiradiatus, female, SAMN20168778)	67,963,706	80%	35%	184,274
SAMN20168779	trophotaenia (Girardinichthys multiradiatus, female, SAMN20168779)	67,210,398	71%	38%	202,326
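
For readers who want to work with the per-sample table programmatically, the rows can be parsed as tab-separated text. The sketch below is one way to do it; the column keys and function names are chosen here for illustration and are not part of the report. It strips the thousands separators and percent signs, then sums the read counts while skipping the "All" aggregate row.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical short keys for the six columns of the table above.
COLUMNS = ["sample_id", "track_name", "reads",
           "pct_aligned", "pct_with_introns", "introns"]


def parse_row(line: str) -> Dict[str, Any]:
    """Parse one tab-separated data row of the per-sample table."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row: Dict[str, Any] = dict(zip(COLUMNS, fields))
    row["reads"] = int(row["reads"].replace(",", ""))
    row["introns"] = int(row["introns"].replace(",", ""))
    row["pct_aligned"] = int(row["pct_aligned"].rstrip("%"))
    row["pct_with_introns"] = int(row["pct_with_introns"].rstrip("%"))
    return row


def total_reads(rows: Iterable[Dict[str, Any]]) -> int:
    """Sum reads over individual samples, ignoring the 'All' aggregate row."""
    return sum(r["reads"] for r in rows if r["sample_id"] != "All")


# Two rows copied from the table above.
lines = [
    "SAMN20168748\tbrain (Girardinichthys multiradiatus, female, SAMN20168748)"
    "\t71,352,894\t82%\t33%\t227,918",
    "SAMN20168749\tbrain (Girardinichthys multiradiatus, female, SAMN20168749)"
    "\t72,191,532\t82%\t35%\t232,172",
]
rows: List[Dict[str, Any]] = [parse_row(line) for line in lines]
print(total_reads(rows))  # prints: 143544426
```

Applied to all 32 sample rows, the same sum gives 2,198,856,664 reads, which matches the "All" row.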

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR15099891	SRX11409679	SRP327885	SAMN20168748	71,352,894	82%	33%
SRR15099890	SRX11409680	SRP327885	SAMN20168749	72,191,532	82%	35%
SRR15099879	SRX11409691	SRP327885	SAMN20168750	71,886,666	83%	36%
SRR15099868	SRX11409702	SRP327885	SAMN20168751	71,820,614	82%	33%
SRR15099865	SRX11409705	SRP327885	SAMN20168752	63,897,244	81%	35%
SRR15099864	SRX11409706	SRP327885	SAMN20168753	86,722,810	81%	34%
SRR15099863	SRX11409707	SRP327885	SAMN20168754	71,348,668	87%	51%
SRR15099862	SRX11409708	SRP327885	SAMN20168755	71,575,968	87%	51%
SRR15099861	SRX11409709	SRP327885	SAMN20168756	71,508,030	87%	50%
SRR15099860	SRX11409710	SRP327885	SAMN20168757	71,970,304	87%	51%
SRR15099889	SRX11409681	SRP327885	SAMN20168758	72,060,922	87%	50%
SRR15099888	SRX11409682	SRP327885	SAMN20168759	71,901,168	87%	49%
SRR15099887	SRX11409683	SRP327885	SAMN20168760	66,830,050	88%	48%
SRR15099886	SRX11409684	SRP327885	SAMN20168761	67,057,468	88%	47%
SRR15099885	SRX11409685	SRP327885	SAMN20168762	66,891,962	88%	50%
SRR15099884	SRX11409686	SRP327885	SAMN20168763	67,342,360	88%	48%
SRR15099883	SRX11409687	SRP327885	SAMN20168764	67,198,020	89%	52%
SRR15099882	SRX11409688	SRP327885	SAMN20168765	67,820,164	87%	49%
SRR15099881	SRX11409689	SRP327885	SAMN20168766	66,968,618	87%	43%
SRR15099880	SRX11409690	SRP327885	SAMN20168767	66,838,590	83%	45%
SRR15099878	SRX11409692	SRP327885	SAMN20168768	43,808,102	90%	29%
SRR15099877	SRX11409693	SRP327885	SAMN20168769	67,945,962	87%	43%
SRR15099876	SRX11409694	SRP327885	SAMN20168770	67,414,372	88%	39%
SRR15099875	SRX11409695	SRP327885	SAMN20168771	67,194,266	87%	42%
SRR15099874	SRX11409696	SRP327885	SAMN20168772	68,894,754	87%	38%
SRR15099873	SRX11409697	SRP327885	SAMN20168773	69,310,510	82%	35%
SRR15099872	SRX11409698	SRP327885	SAMN20168774	68,841,208	87%	42%
SRR15099871	SRX11409699	SRP327885	SAMN20168775	67,472,004	88%	39%
SRR15099870	SRX11409700	SRP327885	SAMN20168776	67,942,708	87%	40%
SRR15099869	SRX11409701	SRP327885	SAMN20168777	69,674,622	89%	41%
SRR15099867	SRX11409703	SRP327885	SAMN20168778	67,963,706	80%	35%
SRR15099866	SRX11409704	SRP327885	SAMN20168779	67,210,398	71%	38%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Actinopterygii GenBank	88,399	83,717 (94.70%)	83,717 (94.70%)	69.40%	81.27%
Actinopterygii known RefSeq (NP_)	25,472	24,057 (94.44%)	24,057 (94.44%)	68.59%	79.20%
Poecilia reticulata high-quality model RefSeq (XP_)	16,791	16,592 (98.81%)	16,592 (98.81%)	72.85%	82.47%
Homo sapiens known RefSeq (NP_)	63,838	53,025 (83.06%)	53,025 (83.06%)	67.22%	71.09%
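
The percentages in the table above follow directly from the retrieved and aligned counts, as the sketch below shows. The dictionary layout and the helper name are illustrative only; the counts are copied from the table.

```python
# (retrieved from Entrez, aligned by ProSplign), copied from the table above.
protein_sets = {
    "Actinopterygii GenBank": (88_399, 83_717),
    "Actinopterygii known RefSeq (NP_)": (25_472, 24_057),
    "Poecilia reticulata high-quality model RefSeq (XP_)": (16_791, 16_592),
    "Homo sapiens known RefSeq (NP_)": (63_838, 53_025),
}


def percent_aligned(retrieved: int, aligned: int) -> str:
    """Aligned sequences as a percentage of retrieved, two decimal places."""
    return f"{100 * aligned / retrieved:.2f}%"


results = {
    source: percent_aligned(retrieved, aligned)
    for source, (retrieved, aligned) in protein_sets.items()
}
for source, pct in results.items():
    print(f"{source}: {pct}")
```

The recomputed values agree with the reported 94.70%, 94.44%, 98.81% and 83.06%. In this run every aligned sequence was also passed to Gnomon, so the two percentage columns are identical.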

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
