NCBI Solea senegalensis Annotation Release 100

The RefSeq genome records for Solea senegalensis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Solea senegalensis Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Sep 30 2021
Date of submission of annotation to the public databases: Oct 3 2021
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
IFAPA_SoseM_1	GCF_019176455.1	IFAPA	07-09-2021	Reference	22 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	IFAPA_SoseM_1
Genes and pseudogenes	29,174
protein-coding	23,383
non-coding	5,440
Transcribed pseudogenes	2
Non-transcribed pseudogenes	251
genes with variants	8,777
Immunoglobulin/T-cell receptor gene segments	98
other	0
mRNAs	41,998
fully-supported	40,554
with > 5% ab initio	428
partial	422
with filled gap(s)	1
known RefSeq (NM_)	0
model RefSeq (XM_)	41,998
non-coding RNAs	6,510
fully-supported	2,867
with > 5% ab initio	0
partial	3
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	4,433
pseudo transcripts	2
fully-supported	1
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	2
CDSs	42,109
fully-supported	40,554
with > 5% ab initio	512
partial	430
with major correction(s)	270
known RefSeq (NP_)	0
model RefSeq (XP_)	42,011
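
The gene categories in the table above can be cross-checked against the reported total. The following is a minimal Python sketch with the values copied from the table; the dictionary and variable names are illustrative only and are not part of any NCBI tool.

```python
# Gene-level counts copied from the feature-count table for IFAPA_SoseM_1.
# The key names are illustrative labels, not NCBI identifiers.
gene_counts = {
    "protein-coding": 23_383,
    "non-coding": 5_440,
    "transcribed pseudogenes": 2,
    "non-transcribed pseudogenes": 251,
    "Ig/TCR gene segments": 98,
    "other": 0,
}
reported_total = 29_174  # "Genes and pseudogenes" row

computed_total = sum(gene_counts.values())
print(computed_total, computed_total == reported_total)  # 29174 True
```

The same approach applies to the transcript rows: 41,998 mRNAs plus 6,510 non-coding RNAs give the 48,508 transcripts listed under feature lengths.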

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	28,823	13,706	6,200	54	1,171,052
All transcripts	48,508	3,094	2,528	54	94,924
mRNA	41,998	3,461	2,814	189	94,924
misc_RNA	538	3,261	2,585	83	23,967
tRNA	2,075	74	73	65	88
lncRNA	2,332	1,121	782	70	7,500
snoRNA	230	128	129	63	326
snRNA	969	154	164	54	200
guide_RNA	10	217	270	131	379
rRNA	356	158	119	115	4,140
Single-exon transcripts	952	1,836	1,566	189	12,832
coding transcripts (NM_/XM_ )	952	1,836	1,566	189	12,832
CDSs	42,011	2,200	1,578	96	93,672
Exons	279,656	276	137	1	19,371
in coding transcripts (NM_/XM_ )	272,291	275	137	1	19,371
in non-coding transcripts (NR_/XR_ )	11,584	286	134	2	10,382
Introns	253,022	1,508	400	30	1,128,792
in coding transcripts (NM_/XM_ )	247,970	1,474	396	30	1,128,792
in non-coding transcripts (NR_/XR_ )	9,186	2,440	513	30	122,246

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.74	1	1	34
Number of exons per transcript	12.8	9	1	249

BUSCO analysis of gene annotation

BUSCO v4.1.4 (Simão et al. 2015, PMID: 26059717) was run in "protein" mode on the annotated gene set, using the longest protein per gene and the actinopterygii_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and presented in BUSCO notation (C: complete [S: single-copy, D: duplicated], F: fragmented, M: missing, n: number of genes used).
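
As an illustration of the BUSCO notation described above, here is a minimal Python sketch that parses a one-line summary into its fields. The function name `parse_busco` and the regular expression are assumptions made for this example, and the summary string is a made-up placeholder, not the result for this annotation release.

```python
import re

# Pattern for the one-line BUSCO notation: C:..%[S:..%,D:..%],F:..%,M:..%,n:...
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco(summary: str) -> dict:
    """Parse a BUSCO one-line summary into percentages plus the gene count n."""
    match = BUSCO_PATTERN.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    fields = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    fields["n"] = int(match.group("n"))
    return fields

# Placeholder string for demonstration only (hypothetical values).
placeholder = "C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:100"
result = parse_busco(placeholder)
print(result["C"], result["S"] + result["D"], result["n"])
```

In this notation the complete fraction is the sum of the single-copy and duplicated fractions, and C, F and M together account for all n genes.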

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,370 coding genes, 21,797 had a protein with an alignment covering 50% or more of the query, and 10,495 had an alignment covering 95% or more of the query.

Definition of query and target coverage: the query coverage is the percentage of the annotated protein length that is included in the alignment, and the target coverage is the percentage of the target length that is included in the alignment.
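
The two coverage measures can be expressed as a short calculation. The sketch below is in Python and assumes a single contiguous alignment given in 1-based, inclusive coordinates; the function name `coverage_percent` and the example coordinates are hypothetical and do not come from the pipeline.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in an alignment (1-based, inclusive)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates must lie within the sequence")
    return 100.0 * (aligned_end - aligned_start + 1) / sequence_length

# Hypothetical hit: a 400-residue annotated protein (query) aligned over
# residues 11-390 to residues 1-380 of a 500-residue Swiss-Prot protein (target).
query_cov = coverage_percent(11, 390, 400)
target_cov = coverage_percent(1, 380, 500)
print(f"query coverage: {query_cov:.1f}%")    # query coverage: 95.0%
print(f"target coverage: {target_cov:.1f}%")  # target coverage: 76.0%
```

A hit like this one would count toward both the 50% and the 95% query-coverage thresholds quoted above. Alignments split into several segments would need the aligned lengths summed without double counting, which this sketch does not attempt.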

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
IFAPA_SoseM_1	GCF_019176455.1	27.41%
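
A masked percentage of this kind can be estimated from a soft-masked FASTA file, in which masked bases are written in lower case. The Python sketch below rests on that assumption; the function name `masked_percent` and the tiny in-memory sequence are illustrative, and this is not necessarily how the pipeline itself computes the figure.

```python
import io

def masked_percent(fasta_handle) -> float:
    """Return the percentage of lower-case (soft-masked) bases in a FASTA stream."""
    masked = total = 0
    for line in fasta_handle:
        if line.startswith(">"):  # skip header lines
            continue
        sequence = line.strip()
        total += len(sequence)
        masked += sum(1 for base in sequence if base.islower())
    return 100.0 * masked / total if total else 0.0

# Toy example: 8 of 20 bases are lower case.
example = io.StringIO(">chr_demo\nACGTacgtAC\nGTACGTacgt\n")
print(f"{masked_percent(example):.2f}%")  # 40.00%
```

For a real assembly, the handle would be an opened FASTA file rather than an in-memory string.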

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	376	373 (99.20%)	356 (94.68%)	99.49%	98.65%
Same-species EST	10,681	9,817 (91.91%)	9,386 (87.88%)	99.20%	99.14%
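
The percentages in the table above are each count divided by the number of sequences retrieved from Entrez. A minimal Python sketch reproducing them from the counts in the table follows; the helper name `percent_of` is an assumption for this example.

```python
def percent_of(part: int, whole: int) -> str:
    """Format part/whole as a percentage with two decimals."""
    return f"{100 * part / whole:.2f}%"

# (source, retrieved, aligned by Splign, passed to Gnomon), copied from the table.
rows = [
    ("Same-species Genbank", 376, 373, 356),
    ("Same-species EST", 10_681, 9_817, 9_386),
]
for source, retrieved, aligned, passed in rows:
    print(source, percent_of(aligned, retrieved), percent_of(passed, retrieved))
# Same-species Genbank 99.20% 94.68%
# Same-species EST 91.91% 87.88%
```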

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	4,249,218,493	83%	19%	302,436
SAMEA5056439	NA	skin (Solea senegalensis, SAMEA5056439)	124,341,105	81%	22%	193,615
SAMEA5056440	NA	skin (Solea senegalensis, SAMEA5056440)	138,320,240	80%	23%	200,956
SAMEA5056441	NA	gut (Solea senegalensis, SAMEA5056441)	129,447,335	83%	21%	196,242
SAMEA5056442	NA	gut (Solea senegalensis, SAMEA5056442)	141,251,740	83%	20%	180,198
SAMEA5130943	NA	intestine (Solea senegalensis, 6, mixed, SAMEA5130943)	129,116,328	83%	21%	196,240
SAMEA5130944	NA	dorsal skin (Solea senegalensis, 6, mixed, SAMEA5130944)	123,841,899	81%	22%	193,613
SAMEA5130945	NA	intestine (Solea senegalensis, 6, mixed, SAMEA5130945)	139,803,815	84%	20%	180,197
SAMEA5130946	NA	dorsal skin (Solea senegalensis, 6, mixed, SAMEA5130946)	133,666,984	83%	23%	200,953
SAMN01932439	NA	Generic sample from Solea senegalensis (Solea senegalensis, SAMN01932439)	1,252,465	84%	58%	160,629
SAMN02688048	NA	Premetamorphic larvae (9 DAH) (Solea senegalensis, Premetamorphic, not determined, SAMN02688048)	45,043,118	82%	16%	203,026
SAMN02688499	NA	larvae (Solea senegalensis, Premetamorphic larvae 2, Unknow, SAMN02688499)	47,397,302	84%	16%	192,312
SAMN02688501	NA	larvae (Solea senegalensis, Premetamorphic larvae 4, Unknow, SAMN02688501)	37,327,064	84%	14%	189,527
SAMN02688502	NA	larvae (Solea senegalensis, Premetamorphic larvae 5, Unknow, SAMN02688502)	51,198,970	79%	15%	191,454
SAMN02688503	NA	larvae (Solea senegalensis, Premetamorphic larvae 6, Unknow, SAMN02688503)	56,248,442	69%	15%	182,708
SAMN02921283	NA	larvae (Solea senegalensis, not collected, not collected, SAMN02921283)	131,200,512	84%	17%	237,094
SAMN02921286	NA	larvae (Solea senegalensis, not collected, not collected, SAMN02921286)	138,546,398	82%	15%	236,820
SAMN02921289	NA	larvae (Solea senegalensis, not collected, not collected, SAMN02921289)	131,028,390	73%	14%	237,925
SAMN02921292	NA	larvae (Solea senegalensis, not collected, not collected, SAMN02921292)	141,911,806	82%	12%	233,919
SAMN04880896	NA	Upper olfactory rosette (Solea senegalensis, adult, male, SAMN04880896)	305,106,496	86%	14%	237,547
SAMN04881095	NA	Upper olfactory rosette (Solea senegalensis, adult, male, SAMN04881095)	304,640,658	86%	15%	240,158
SAMN07414637	30065724	Head-Kidney (Solea senegalensis, SAMN07414637)	122,487,006	88%	23%	214,981
SAMN07414638	30065724	Eye-brain (Solea senegalensis, SAMN07414638)	86,634,638	80%	16%	195,285
SAMN07414639	30065724	Eye-brain (Solea senegalensis, SAMN07414639)	103,508,366	79%	16%	197,354
SAMN07414640	30065724	Eye-brain (Solea senegalensis, SAMN07414640)	103,854,188	80%	15%	209,646
SAMN07414641	30065724	Eye-brain (Solea senegalensis, SAMN07414641)	108,143,248	78%	15%	203,695
SAMN07414642	30065724	Eye-brain (Solea senegalensis, SAMN07414642)	102,168,420	79%	17%	179,902
SAMN07414643	30065724	Eye-brain (Solea senegalensis, SAMN07414643)	99,839,706	78%	16%	195,938
SAMN07414644	30065724	Eye-brain (Solea senegalensis, SAMN07414644)	118,139,950	75%	17%	197,870
SAMN07414645	30065724	Eye-brain (Solea senegalensis, SAMN07414645)	83,367,368	73%	17%	192,098
SAMN07414646	30065724	Eye-brain (Solea senegalensis, SAMN07414646)	88,416,250	76%	17%	168,235
SAMN07414647	30065724	Head-Kidney (Solea senegalensis, SAMN07414647)	92,733,406	89%	24%	200,055
SAMN07414648	30065724	Head-Kidney (Solea senegalensis, SAMN07414648)	104,239,742	89%	24%	216,056
SAMN07414649	30065724	Head-Kidney (Solea senegalensis, SAMN07414649)	106,579,928	89%	23%	209,564
SAMN07414650	30065724	Head-Kidney (Solea senegalensis, SAMN07414650)	88,169,852	89%	23%	205,147
SAMN07414651	30065724	Head-Kidney (Solea senegalensis, SAMN07414651)	90,388,882	89%	23%	203,966
SAMN07414652	30065724	Head-Kidney (Solea senegalensis, SAMN07414652)	90,788,162	89%	24%	212,104
SAMN07414653	30065724	Head-Kidney (Solea senegalensis, SAMN07414653)	97,514,460	89%	24%	202,867
SAMN07414654	30065724	Head-Kidney (Solea senegalensis, SAMN07414654)	111,553,854	89%	24%	206,853

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
ERR3415800	ERX3439328	ERP111750	SAMEA5056439	124,341,105	81%	22%
ERR3415801	ERX3439329	ERP111750	SAMEA5056440	138,320,240	80%	23%
ERR3415802	ERX3439330	ERP111750	SAMEA5056441	129,447,335	83%	21%
ERR3415803	ERX3439331	ERP111750	SAMEA5056442	141,251,740	83%	20%
ERR2930177	ERX2933065	ERP112310	SAMEA5130943	129,116,328	83%	21%
ERR2930178	ERX2933066	ERP112310	SAMEA5130944	123,841,899	81%	22%
ERR2930179	ERX2933067	ERP112310	SAMEA5130945	139,803,815	84%	20%
ERR2930180	ERX2933068	ERP112310	SAMEA5130946	133,666,984	83%	23%
SRR988100	SRX246914	SRP022228	SAMN01932439	115,813	86%	64%
SRR988101	SRX246914	SRP022228	SAMN01932439	234,026	82%	53%
SRR988102	SRX246914	SRP022228	SAMN01932439	107,339	82%	55%
SRR988103	SRX246914	SRP022228	SAMN01932439	191,832	84%	59%
SRR988104	SRX246914	SRP022228	SAMN01932439	155,238	83%	57%
SRR988105	SRX246914	SRP022228	SAMN01932439	204,244	85%	60%
SRR988106	SRX246914	SRP022228	SAMN01932439	105,123	82%	58%
SRR988107	SRX246914	SRP022228	SAMN01932439	138,850	85%	59%
SRR1190284	SRX487381	SRP040065	SAMN02688048	45,043,118	82%	16%
SRR1190270	SRX487375	SRP040065	SAMN02688499	28,059,856	83%	16%
SRR1190307	SRX487375	SRP040065	SAMN02688499	19,337,446	84%	16%
SRR1190274	SRX487378	SRP040065	SAMN02688501	37,327,064	84%	14%
SRR1190275	SRX487379	SRP040065	SAMN02688502	51,198,970	79%	15%
SRR1190276	SRX487380	SRP040065	SAMN02688503	56,248,442	69%	15%
SRR1518380	SRX655595	SRP044397	SAMN02921283	43,877,074	84%	18%
SRR1518391	SRX655595	SRP044397	SAMN02921283	38,420,624	84%	17%
SRR1518392	SRX655595	SRP044397	SAMN02921283	48,902,814	84%	17%
SRR1519314	SRX655596	SRP044397	SAMN02921286	34,284,486	85%	14%
SRR1519315	SRX655596	SRP044397	SAMN02921286	53,274,302	80%	15%
SRR1519316	SRX655596	SRP044397	SAMN02921286	50,987,610	81%	15%
SRR1518385	SRX655597	SRP044397	SAMN02921289	42,705,016	62%	15%
SRR1518386	SRX655597	SRP044397	SAMN02921289	40,785,294	73%	14%
SRR1518387	SRX655597	SRP044397	SAMN02921289	47,538,080	83%	14%
SRR1518388	SRX655599	SRP044397	SAMN02921292	44,062,396	78%	12%
SRR1518389	SRX655599	SRP044397	SAMN02921292	40,594,550	82%	12%
SRR1518390	SRX655599	SRP044397	SAMN02921292	57,254,860	85%	12%
SRR3417195	SRX1719632	SRP073693	SAMN04880896	92,709,024	87%	16%
SRR3417239	SRX1719663	SRP073693	SAMN04880896	110,131,928	85%	13%
SRR3417250	SRX1719679	SRP073693	SAMN04880896	102,265,544	86%	14%
SRR3417257	SRX1719681	SRP073693	SAMN04881095	115,878,816	87%	15%
SRR3417259	SRX1719683	SRP073693	SAMN04881095	96,182,542	85%	14%
SRR3417261	SRX1719684	SRP073693	SAMN04881095	92,579,300	86%	15%
SRR5867204	SRX3034934	SRP113561	SAMN07414637	122,487,006	88%	23%
SRR5867226	SRX3034951	SRP113561	SAMN07414638	86,634,638	80%	16%
SRR5867225	SRX3034950	SRP113561	SAMN07414639	103,508,366	79%	16%
SRR5867224	SRX3034949	SRP113561	SAMN07414640	103,854,188	80%	15%
SRR5867222	SRX3034948	SRP113561	SAMN07414641	30,782,566	80%	15%
SRR5867223	SRX3034948	SRP113561	SAMN07414641	77,360,682	77%	15%
SRR5867220	SRX3034947	SRP113561	SAMN07414642	37,030,182	81%	18%
SRR5867221	SRX3034947	SRP113561	SAMN07414642	65,138,238	77%	17%
SRR5867218	SRX3034946	SRP113561	SAMN07414643	35,436,878	80%	16%
SRR5867219	SRX3034946	SRP113561	SAMN07414643	64,402,828	77%	15%
SRR5867216	SRX3034945	SRP113561	SAMN07414644	44,791,388	77%	17%
SRR5867217	SRX3034945	SRP113561	SAMN07414644	73,348,562	74%	17%
SRR5867215	SRX3034944	SRP113561	SAMN07414645	83,367,368	73%	17%
SRR5867213	SRX3034943	SRP113561	SAMN07414646	33,930,950	79%	17%
SRR5867214	SRX3034943	SRP113561	SAMN07414646	54,485,300	74%	17%
SRR5867212	SRX3034942	SRP113561	SAMN07414647	92,733,406	89%	24%
SRR5867211	SRX3034941	SRP113561	SAMN07414648	104,239,742	89%	24%
SRR5867210	SRX3034940	SRP113561	SAMN07414649	106,579,928	89%	23%
SRR5867209	SRX3034939	SRP113561	SAMN07414650	88,169,852	89%	23%
SRR5867208	SRX3034938	SRP113561	SAMN07414651	90,388,882	89%	23%
SRR5867207	SRX3034937	SRP113561	SAMN07414652	90,788,162	89%	24%
SRR5867206	SRX3034936	SRP113561	SAMN07414653	97,514,460	89%	24%
SRR5867205	SRX3034935	SRP113561	SAMN07414654	111,553,854	89%	24%
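
The per-sample read counts are the sums of the per-run counts for each BioSample. A short Python sketch of that aggregation follows, using a few rows copied from the run table; the variable names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the per-run table.
run_rows = [
    ("SRR1190270", "SAMN02688499", 28_059_856),
    ("SRR1190307", "SAMN02688499", 19_337_446),
    ("SRR1518380", "SAMN02921283", 43_877_074),
    ("SRR1518391", "SAMN02921283", 38_420_624),
    ("SRR1518392", "SAMN02921283", 48_902_814),
]

# Sum the reads of all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for run, sample, reads in run_rows:
    reads_per_sample[sample] += reads

for sample, reads in sorted(reads_per_sample.items()):
    print(sample, f"{reads:,}")
# SAMN02688499 47,397,302
# SAMN02921283 131,200,512
```

Both totals match the corresponding rows of the per-sample table.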

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Betta splendens high-quality model RefSeq (XP_)	18,343	17,818 (97.14%)	17,818 (97.14%)	70.51%	80.17%
Poecilia formosa high-quality model RefSeq (XP_)	18,503	17,632 (95.29%)	17,632 (95.29%)	69.88%	79.50%
Actinopterygii GenBank	88,820	71,851 (80.90%)	71,851 (80.90%)	69.27%	80.88%
Actinopterygii known RefSeq (NP_)	25,473	6,532 (25.64%)	6,532 (25.64%)	67.09%	76.48%
Esox lucius high-quality model RefSeq (XP_)	18,508	17,673 (95.49%)	17,673 (95.49%)	68.01%	76.84%
Xiphophorus maculatus high-quality model RefSeq (XP_)	18,457	17,410 (94.33%)	17,410 (94.33%)	69.19%	79.39%
Homo sapiens known RefSeq (NP_)	62,807	39,361 (62.67%)	39,361 (62.67%)	67.07%	70.19%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018 Sep 15;34(18):3094-3100
