NCBI Trematomus bernacchii Annotation Release 100

The RefSeq genome records for Trematomus bernacchii were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Trematomus bernacchii Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: May 1 2020
Date of submission of annotation to the public databases: May 6 2020
Software version: 8.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
fTreBer1.1	GCF_902827165.1	SC	04-15-2020	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	fTreBer1.1
Genes and pseudogenes	32,112
protein-coding	25,524
non-coding	5,527
transcribed pseudogenes	1
non-transcribed pseudogenes	735
genes with variants	7,562
immunoglobulin/T-cell receptor gene segments	325
other	0
mRNAs	40,874
fully-supported	38,381
with > 5% ab initio	1,131
partial	876
with filled gap(s)	570
known RefSeq (NM_)	0
model RefSeq (XM_)	40,874
non-coding RNAs	6,587
fully-supported	2,648
with > 5% ab initio	0
partial	3
with filled gap(s)	3
known RefSeq (NR_)	0
model RefSeq (XR_)	3,304
pseudo transcripts	1
fully-supported	1
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1
CDSs	41,199
fully-supported	38,381
with > 5% ab initio	1,283
partial	882
with major correction(s)	2,027
known RefSeq (NP_)	0
model RefSeq (XP_)	40,874
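
As a quick consistency check on the gene rows of the table above, the subcategory counts can be summed and compared with the reported total. The sketch below hard-codes the counts from this report and assumes that the listed subcategories partition the "Genes and pseudogenes" total; "genes with variants" is left out because it describes an overlapping property rather than a separate category. The dictionary and function names are illustrative only and are not part of any NCBI tool.

```python
# Gene counts copied from the "Feature counts" table for fTreBer1.1.
# Assumption: these subcategories partition the "Genes and pseudogenes" total.
gene_counts = {
    "protein-coding": 25_524,
    "non-coding": 5_527,
    "transcribed pseudogenes": 1,
    "non-transcribed pseudogenes": 735,
    "immunoglobulin/T-cell receptor gene segments": 325,
    "other": 0,
}
reported_total = 32_112  # "Genes and pseudogenes" row


def subcategory_total(counts):
    """Sum the per-category gene counts."""
    return sum(counts.values())


total = subcategory_total(gene_counts)
print(f"sum of subcategories: {total:,}; reported total: {reported_total:,}")
```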

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	31,051	15,948	6,583	57	1,109,798
All transcripts	47,461	2,803	2,283	57	88,210
mRNA	40,874	3,161	2,559	159	88,210
misc_RNA	722	2,873	2,462	158	11,749
tRNA	3,283	75	73	70	89
lncRNA	1,926	751	513	100	6,881
snoRNA	461	109	93	62	308
snRNA	161	143	141	57	200
guide_RNA	9	209	169	129	382
rRNA	25	119	119	119	120
Single-exon transcripts	1,224	1,637	1,364	243	12,586
coding transcripts (NM_/XM_)	1,224	1,637	1,364	243	12,586
CDSs	40,874	2,000	1,473	96	87,048
Exons	286,527	272	136	1	17,346
in coding transcripts (NM_/XM_)	279,654	273	136	1	17,346
in non-coding transcripts (NR_/XR_)	11,569	224	120	2	7,957
Introns	257,357	1,837	518	30	1,104,009
in coding transcripts (NM_/XM_)	252,554	1,836	521	30	1,104,009
in non-coding transcripts (NR_/XR_)	9,350	1,742	421	30	203,973

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.59	1	1	50
Number of exons per transcript	11.73	9	1	218

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 25,524 coding genes, 23,179 genes had a protein with an alignment covering 50% or more of the query and 10,655 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
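
The coverage definitions above can be illustrated with a minimal sketch. It assumes 1-based, inclusive alignment coordinates and approximates "length included in the alignment" by the span between the first and last aligned residue, ignoring internal gaps; the pipeline's exact computation may differ. The function names and the example hit are hypothetical.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence spanned by an alignment.

    Coordinates are assumed to be 1-based and inclusive; internal gaps
    are ignored, so this is an approximation of aligned length.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


def genes_at_or_above(coverages, threshold):
    """Count genes whose best coverage meets a threshold (one point of a cumulative curve)."""
    return sum(1 for value in coverages if value >= threshold)


# Hypothetical hit: query residues 11-110 of a 200-residue annotated protein,
# aligned to target residues 1-100 of a 125-residue Swiss-Prot protein.
query_cov = coverage_percent(11, 110, 200)
target_cov = coverage_percent(1, 100, 125)
print(query_cov, target_cov)
```

Evaluating `genes_at_or_above` over a range of thresholds would give the kind of cumulative counts summarized in the text, for example the number of genes at 50% or 95% query coverage.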

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
fTreBer1.1	GCF_902827165.1	3.64%	39.12%
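
A masked percentage like those in the table can be estimated from a soft-masked FASTA file. The sketch below assumes that masked bases are written in lowercase, a common convention for soft-masked genomic FASTA; this is an assumption about the input file and not a description of how the pipeline itself computes these figures. The function name and the toy record are illustrative.

```python
def masked_percent(fasta_lines):
    """Return the percentage of lowercase (soft-masked) bases in FASTA text.

    Assumption: masked positions are lowercase; header lines start with '>'.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return 100.0 * masked / total if total else 0.0


# Toy record: 6 lowercase bases out of 20.
example = [">scaffold_1 (toy record)", "ACGTacgtAC", "ggTTTTTTTT"]
print(f"{masked_percent(example):.2f}% masked")
```

For a real assembly the same function could be applied to an open file handle, since it only iterates over lines.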

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	197	186 (94.42%)	169 (85.79%)	98.76%	88.99%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,027,351,312	81%	28%	308,198
SAMN02203121	24252228	General Sample for Trematomus bernacchii (Trematomus bernacchii, SAMN02203121)	99,359	38%	34%	5,087
SAMN02203131	24252228	General Sample for Trematomus bernacchii (Trematomus bernacchii, SAMN02203131)	150,592	53%	48%	9,977
SAMN02203160	24252228	General Sample for Trematomus bernacchii (Trematomus bernacchii, SAMN02203160)	146,996	47%	19%	15,925
SAMN02203161	24252228	General Sample for Trematomus bernacchii (Trematomus bernacchii, SAMN02203161)	117,311	47%	36%	13,043
SAMN02203162	24252228	General Sample for Trematomus bernacchii - Brain Tissue Control Group (Trematomus bernacchii, SAMN02203162)	105,730	59%	13%	11,033
SAMN02203163	24252228	General Sample for Trematomus bernacchii (Trematomus bernacchii, SAMN02203163)	118,391	48%	34%	13,864
SAMN03105957	NA	kidney (Trematomus bernacchii, SAMN03105957)	75,502,118	81%	37%	186,299
SAMN03785060	26268413	head kidney (Trematomus bernacchii, SAMN03785060)	21,917,494	76%	19%	75,298
SAMN03785061	26268413	head kidney (Trematomus bernacchii, SAMN03785061)	31,728,452	56%	19%	112,218
SAMN03785062	26268413	head kidney (Trematomus bernacchii, SAMN03785062)	34,484,220	73%	17%	93,094
SAMN03857797	NA	Gill (Trematomus bernacchii, SAMN03857797)	658,771,861	86%	28%	280,103
SAMN04417070	NA	brain (Trematomus bernacchii, SAMN04417070)	20,551,500	72%	10%	41,874
SAMN04417071	NA	adult, ovary (Trematomus bernacchii, female, SAMN04417071)	30,750,436	84%	26%	127,313
SAMN04417072	NA	liver (Trematomus bernacchii, SAMN04417072)	49,208,136	75%	20%	91,286
SAMN08612675	NA	bone (Trematomus bernacchii, not determined, SAMN08612675)	24,272,744	64%	10%	49,646
SAMN09205388	NA	Hind Brain (Trematomus bernacchii, male, SAMN09205388)	3,514,170	65%	38%	137,981
SAMN09205389	NA	Pituitary gland (Trematomus bernacchii, male, SAMN09205389)	3,010,076	63%	42%	89,723
SAMN09205390	NA	Hypothalamus (Trematomus bernacchii, male, SAMN09205390)	3,261,264	69%	26%	110,591
SAMN09205391	NA	Optic tectum (Trematomus bernacchii, male, SAMN09205391)	4,260,568	65%	33%	130,287
SAMN09205392	NA	Telenceparon (Trematomus bernacchii, male, SAMN09205392)	3,435,932	66%	37%	126,814
SAMN09205393	NA	Liver (Trematomus bernacchii, male, SAMN09205393)	4,708,462	66%	36%	77,492
SAMN09205394	NA	Heart (Trematomus bernacchii, male, SAMN09205394)	3,040,694	60%	42%	113,888
SAMN09205395	NA	Head kidney (Trematomus bernacchii, male, SAMN09205395)	3,995,348	62%	43%	140,496
SAMN09205396	NA	Skin (Trematomus bernacchii, male, SAMN09205396)	3,950,980	65%	39%	112,045
SAMN09205397	NA	Lateral line muscle (Trematomus bernacchii, male, SAMN09205397)	3,044,622	66%	59%	55,003
SAMN09205398	NA	Pyloric caeca (Trematomus bernacchii, male, SAMN09205398)	4,370,722	67%	39%	106,874
SAMN09205399	NA	Intestine (Trematomus bernacchii, male, SAMN09205399)	4,997,384	61%	40%	128,506
SAMN09205400	NA	Eye (Trematomus bernacchii, male, SAMN09205400)	13,335,114	64%	36%	147,458
SAMN09205401	NA	Spleen (Trematomus bernacchii, male, SAMN09205401)	3,887,398	61%	27%	86,770
SAMN09205402	NA	Gonad (Trematomus bernacchii, male, SAMN09205402)	2,612,766	67%	49%	108,827
SAMN09205403	NA	Stomach (Trematomus bernacchii, male, SAMN09205403)	923,176	1%	41%	6,708
SAMN09205404	NA	Muscle (Trematomus bernacchii, male, SAMN09205404)	4,726,872	58%	48%	68,102
SAMN09205405	NA	Adipose tissue (Trematomus bernacchii, male, SAMN09205405)	4,509,070	59%	43%	95,421
SAMN09205406	NA	Gill (Trematomus bernacchii, male, SAMN09205406)	3,841,354	56%	41%	131,177

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR900297	SRX305406	SRP026018	SAMN02203121	99,359	38%	34%
SRR901623	SRX306432	SRP026018	SAMN02203131	150,592	53%	48%
SRR901652	SRX306459	SRP026018	SAMN02203160	146,996	47%	19%
SRR901663	SRX306462	SRP026018	SAMN02203161	117,311	47%	36%
SRR901664	SRX306463	SRP026018	SAMN02203162	105,730	59%	13%
SRR901665	SRX306464	SRP026018	SAMN02203163	118,391	48%	34%
SRR1611142	SRX731622	SRP048867	SAMN03105957	75,502,118	81%	37%
SRR2072642	SRX1067743	SRP059740	SAMN03785060	21,917,494	76%	19%
SRR2072643	SRX1067744	SRP059740	SAMN03785061	31,728,452	56%	19%
SRR2072644	SRX1067745	SRP059740	SAMN03785062	34,484,220	73%	17%
SRR2102321	SRX1096937	SRP061140	SAMN03857797	28,429,166	87%	28%
SRR2102323	SRX1096937	SRP061140	SAMN03857797	24,993,402	87%	28%
SRR2102324	SRX1096937	SRP061140	SAMN03857797	28,429,166	87%	28%
SRR2102325	SRX1096937	SRP061140	SAMN03857797	24,567,477	86%	27%
SRR2102351	SRX1096937	SRP061140	SAMN03857797	25,892,806	86%	27%
SRR2102353	SRX1096937	SRP061140	SAMN03857797	30,232,462	85%	29%
SRR2102355	SRX1096937	SRP061140	SAMN03857797	29,858,517	87%	28%
SRR2102356	SRX1096937	SRP061140	SAMN03857797	24,909,044	86%	28%
SRR2102360	SRX1096937	SRP061140	SAMN03857797	27,164,011	87%	29%
SRR2102361	SRX1096937	SRP061140	SAMN03857797	27,359,804	87%	28%
SRR2102363	SRX1096937	SRP061140	SAMN03857797	25,584,516	86%	29%
SRR2102522	SRX1096937	SRP061140	SAMN03857797	28,003,698	69%	30%
SRR2102524	SRX1096937	SRP061140	SAMN03857797	28,173,671	87%	30%
SRR2102525	SRX1096937	SRP061140	SAMN03857797	24,396,946	87%	29%
SRR2102526	SRX1096937	SRP061140	SAMN03857797	29,095,490	87%	30%
SRR2102528	SRX1096937	SRP061140	SAMN03857797	30,834,415	86%	29%
SRR2102530	SRX1096937	SRP061140	SAMN03857797	25,577,726	86%	26%
SRR2102535	SRX1096937	SRP061140	SAMN03857797	30,223,950	86%	27%
SRR2102546	SRX1096937	SRP061140	SAMN03857797	29,524,377	86%	30%
SRR2102547	SRX1096937	SRP061140	SAMN03857797	29,591,134	87%	28%
SRR2102548	SRX1096937	SRP061140	SAMN03857797	29,198,603	87%	28%
SRR2102549	SRX1096937	SRP061140	SAMN03857797	22,089,222	87%	30%
SRR2102550	SRX1096937	SRP061140	SAMN03857797	27,370,246	86%	27%
SRR2102551	SRX1096937	SRP061140	SAMN03857797	27,272,012	86%	29%
SRR3104397	SRX1532824	SRP068525	SAMN04417070	20,551,500	72%	10%
SRR3104398	SRX1532830	SRP068525	SAMN04417071	30,750,436	84%	26%
SRR3104399	SRX1532832	SRP068525	SAMN04417072	49,208,136	75%	20%
SRR6793948	SRX3752968	SRP133712	SAMN08612675	24,272,744	64%	10%
SRR7164553	SRX4082727	SRP145777	SAMN09205388	3,514,170	65%	38%
SRR7164554	SRX4082726	SRP145777	SAMN09205389	3,010,076	63%	42%
SRR7164555	SRX4082725	SRP145777	SAMN09205390	3,261,264	69%	26%
SRR7164556	SRX4082724	SRP145777	SAMN09205391	4,260,568	65%	33%
SRR7164557	SRX4082723	SRP145777	SAMN09205392	3,435,932	66%	37%
SRR7164558	SRX4082722	SRP145777	SAMN09205393	4,708,462	66%	36%
SRR7164559	SRX4082721	SRP145777	SAMN09205394	3,040,694	60%	42%
SRR7164560	SRX4082720	SRP145777	SAMN09205395	3,995,348	62%	43%
SRR7164561	SRX4082719	SRP145777	SAMN09205396	3,950,980	65%	39%
SRR7164562	SRX4082718	SRP145777	SAMN09205397	3,044,622	66%	59%
SRR7164567	SRX4082713	SRP145777	SAMN09205398	4,370,722	67%	39%
SRR7164568	SRX4082712	SRP145777	SAMN09205399	4,997,384	61%	40%
SRR7164569	SRX4082711	SRP145777	SAMN09205400	13,335,114	64%	36%
SRR7164570	SRX4082710	SRP145777	SAMN09205401	3,887,398	61%	27%
SRR7164563	SRX4082717	SRP145777	SAMN09205402	2,612,766	67%	49%
SRR7164564	SRX4082716	SRP145777	SAMN09205403	923,176	1%	41%
SRR7164565	SRX4082715	SRP145777	SAMN09205404	4,726,872	58%	48%
SRR7164566	SRX4082714	SRP145777	SAMN09205405	4,509,070	59%	43%
SRR7164571	SRX4082709	SRP145777	SAMN09205406	3,841,354	56%	41%
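
The per-sample read counts correspond to the sum of the reads of all runs belonging to that sample. A minimal sketch of that aggregation is shown below, using only three rows copied from the by-run table, so the SAMN03857797 total here is a partial sum and not the full sample count. The variable and function names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads): a small subset of the by-run table above.
runs = [
    ("SRR2102321", "SAMN03857797", 28_429_166),
    ("SRR2102323", "SAMN03857797", 24_993_402),
    ("SRR3104397", "SAMN04417070", 20_551_500),
]


def reads_per_sample(run_rows):
    """Sum read counts over all runs that share a sample accession."""
    totals = defaultdict(int)
    for _run, sample, n_reads in run_rows:
        totals[sample] += n_reads
    return dict(totals)


totals = reads_per_sample(runs)
for sample, n_reads in sorted(totals.items()):
    print(f"{sample}\t{n_reads:,}")
```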

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Larimichthys crocea high-quality model RefSeq (XP_)	18,161	17,705 (97.49%)	17,705 (97.49%)	71.60%	81.61%
Same-species GenBank	85	83 (97.65%)	83 (97.65%)	80.60%	88.39%
Cottoperca gobio high-quality model RefSeq (XP_)	13,837	13,396 (96.81%)	13,396 (96.81%)	73.62%	83.32%
Actinopterygii GenBank	85,678	52,061 (60.76%)	52,061 (60.76%)	68.68%	81.01%
Actinopterygii known RefSeq (NP_)	24,999	23,335 (93.34%)	23,335 (93.34%)	67.92%	78.44%
Danio rerio high-quality model RefSeq (XP_)	7,935	7,315 (92.19%)	7,315 (92.19%)	64.54%	72.57%
Perca flavescens high-quality model RefSeq (XP_)	16,027	15,545 (96.99%)	15,545 (96.99%)	72.72%	83.19%
Homo sapiens known RefSeq (NP_)	57,162	37,113 (64.93%)	37,113 (64.93%)	66.10%	68.00%
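
The percentages in parentheses are the number of aligned (or passed) sequences divided by the number retrieved from Entrez. The sketch below recomputes them for a few rows copied from the tables in this section, assuming rounding to two decimal places; the helper name is illustrative.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# (source, retrieved from Entrez, aligned), copied from the tables above.
rows = [
    ("Larimichthys crocea high-quality model RefSeq (XP_)", 18_161, 17_705),
    ("Same-species GenBank (proteins)", 85, 83),
    ("Actinopterygii GenBank", 85_678, 52_061),
    ("Same-species GenBank (transcripts)", 197, 186),
]

for source, retrieved, aligned in rows:
    print(f"{source}: {aligned:,} of {retrieved:,} aligned ({percent(aligned, retrieved)}%)")
```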

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22:134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
