NCBI Colossoma macropomum Annotation Release 100

The RefSeq genome records for Colossoma macropomum were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Colossoma macropomum Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Oct 8 2020
Date of submission of annotation to the public databases: Oct 13 2020
Software version: 8.5

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Colossoma_macropomum	GCF_904425465.1	WELLCOME TRUST SANGER INSTITUTE	09-25-2020	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Colossoma_macropomum
Genes and pseudogenes	31,149
  protein-coding	26,670
  non-coding	3,279
  transcribed pseudogenes	0
  non-transcribed pseudogenes	880
  genes with variants	8,274
  immunoglobulin/T-cell receptor gene segments	320
  other	0
mRNAs	43,618
  fully-supported	40,689
  with > 5% ab initio	1,500
  partial	251
  with filled gap(s)	53
  known RefSeq (NM_)	0
  model RefSeq (XM_)	43,618
non-coding RNAs	4,226
  fully-supported	2,269
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	2,983
pseudo transcripts	0
  fully-supported	0
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	0
CDSs	43,938
  fully-supported	40,689
  with > 5% ab initio	1,648
  partial	267
  with major correction(s)	1,208
  known RefSeq (NP_)	0
  model RefSeq (XP_)	43,618
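
The gene counts in the table above can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the annotation pipeline. It assumes that "genes with variants" is an overlapping attribute rather than a separate category, so it is left out of the sum; the report does not state this explicitly, but the arithmetic is consistent with it.

```python
# Gene counts copied from the "Feature counts" table.
# Assumption: "genes with variants" (8,274) overlaps the other categories
# and is therefore not included in the sum.
gene_counts = {
    "protein-coding": 26_670,
    "non-coding": 3_279,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 880,
    "immunoglobulin/T-cell receptor gene segments": 320,
    "other": 0,
}
reported_total = 31_149  # "Genes and pseudogenes" row

computed_total = sum(gene_counts.values())
print(f"computed {computed_total:,} vs reported {reported_total:,}")
print("consistent" if computed_total == reported_total else "mismatch")
```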

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	29,949	23,697	10,422	55	1,059,267
All transcripts	47,844	3,057	2,485	55	89,435
  mRNA	43,618	3,257	2,633	183	89,435
  misc_RNA	670	2,966	2,480	147	15,094
  tRNA	1,243	74	73	71	87
  lncRNA	1,599	1,230	765	127	11,056
  snoRNA	218	121	123	62	320
  snRNA	244	146	141	55	200
  guide_RNA	9	201	162	92	310
  rRNA	243	292	119	118	4,268
Single-exon transcripts	1,186	1,427	1,032	231	10,014
  coding transcripts (NM_/XM_)	1,186	1,427	1,032	231	10,014
CDSs	43,618	2,011	1,455	99	88,143
Exons	301,851	280	136	1	20,231
  in coding transcripts (NM_/XM_)	295,910	278	136	1	20,231
  in non-coding transcripts (NR_/XR_)	10,182	295	130	2	9,311
Introns	272,290	2,631	905	30	1,025,684
  in coding transcripts (NM_/XM_)	268,203	2,631	906	30	1,025,684
  in non-coding transcripts (NR_/XR_)	8,226	2,493	843	30	177,679
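
As a cross-check between the two tables, the per-type transcript counts in the "Feature lengths" table can be summed and compared with the "All transcripts" row and with the "non-coding RNAs" row of the "Feature counts" table. The sketch below only restates numbers from this report; the variable names are illustrative.

```python
# Transcript counts by type, copied from the "Feature lengths" table.
transcript_counts = {
    "mRNA": 43_618,
    "misc_RNA": 670,
    "tRNA": 1_243,
    "lncRNA": 1_599,
    "snoRNA": 218,
    "snRNA": 244,
    "guide_RNA": 9,
    "rRNA": 243,
}

total_transcripts = sum(transcript_counts.values())
# Everything except mRNA is treated here as a non-coding transcript.
non_coding_transcripts = total_transcripts - transcript_counts["mRNA"]

print(f"all transcripts: {total_transcripts:,}")       # report lists 47,844
print(f"non-coding RNAs: {non_coding_transcripts:,}")  # report lists 4,226
```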

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.62	1	1	35
Number of exons per transcript	12.02	9	1	231

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 26,670 coding genes, 24,448 had a protein with an alignment covering 50% or more of the query, and 12,341 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
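
The coverage definitions and the gene counts quoted above can be expressed as a short calculation. This is a sketch under stated assumptions: the function name `coverage_percent` and the example alignment lengths are hypothetical and do not come from the report; only the three gene counts are taken from it.

```python
def coverage_percent(aligned_length: int, sequence_length: int) -> float:
    """Percentage of a sequence's length that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment (illustrative lengths, not from the report):
# 380 of 400 query residues and 390 of 500 target residues are aligned.
query_coverage = coverage_percent(aligned_length=380, sequence_length=400)
target_coverage = coverage_percent(aligned_length=390, sequence_length=500)

# Gene counts quoted in the text.
coding_genes = 26_670
genes_at_50 = 24_448  # query coverage of 50% or more
genes_at_95 = 12_341  # query coverage of 95% or more

share_at_50 = round(100.0 * genes_at_50 / coding_genes, 2)
share_at_95 = round(100.0 * genes_at_95 / coding_genes, 2)

print(f"query coverage {query_coverage}%, target coverage {target_coverage}%")
print(f"{share_at_50}% of coding genes reach 50% query coverage")
print(f"{share_at_95}% of coding genes reach 95% query coverage")
```

In other words, roughly nine in ten coding genes have a protein aligned over at least half of its length, and a little under half have one aligned over nearly its full length.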

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Colossoma_macropomum	GCF_904425465.1	4.00%	34.42%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	34	34 (100.00%)	29 (85.29%)	99.24%	99.77%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,361,759,081	79%	43%	312,762
SAMN06014806	Adult, Skin (Colossoma macropomum, SAMN06014806)	4,719,800	70%	1%	29,144
SAMN06014807	Adult, Muscle (Colossoma macropomum, SAMN06014807)	4,087,546	86%	14%	42,412
SAMN06014866	Adult, Skin (Colossoma macropomum, SAMN06014866)	3,805,336	72%	0%	22,908
SAMN06014867	Adult, Muscle (Colossoma macropomum, SAMN06014867)	3,765,208	65%	11%	34,388
SAMN06168612	liver (Colossoma macropomum, adult, pooled male and female, SAMN06168612)	277,596	89%	34%	37,431
SAMN08554961	Juvenile, Liver (Colossoma macropomum, SAMN08554961)	87,018,342	81%	24%	170,220
SAMN10879417	juvenile, liver (Colossoma macropomum, 7 months, SAMN10879417)	9,850,968	90%	47%	89,289
SAMN10879418	juvenile, liver (Colossoma macropomum, 7 months, SAMN10879418)	9,385,714	77%	38%	50,891
SAMN10879419	juvenile, liver (Colossoma macropomum, 7 months, SAMN10879419)	8,103,212	92%	47%	84,441
SAMN10879420	juvenile, liver (Colossoma macropomum, 11 months, SAMN10879420)	11,662,012	90%	47%	96,657
SAMN10879421	juvenile, liver (Colossoma macropomum, 11 months, SAMN10879421)	9,297,996	76%	42%	86,777
SAMN10879422	juvenile, liver (Colossoma macropomum, 11 months, SAMN10879422)	8,400,730	92%	47%	84,340
SAMN10879423	juvenile, liver (Colossoma macropomum, 7 months, SAMN10879423)	12,217,980	90%	46%	95,659
SAMN10879424	juvenile, liver (Colossoma macropomum, 7 months, SAMN10879424)	9,979,556	78%	42%	82,975
SAMN10879425	juvenile, liver (Colossoma macropomum, 7 months, SAMN10879425)	12,094,826	92%	46%	94,446
SAMN10879426	juvenile, liver (Colossoma macropomum, 11 months, SAMN10879426)	11,477,378	90%	44%	87,937
SAMN10879427	juvenile, liver (Colossoma macropomum, 11 months, SAMN10879427)	9,348,310	77%	39%	80,320
SAMN10879428	juvenile, liver (Colossoma macropomum, 11 months, SAMN10879428)	7,976,976	91%	44%	79,338
SAMN11605106	trunk (Colossoma macropomum, SAMN11605106)	115,045,432	83%	49%	233,166
SAMN11605107	trunk (Colossoma macropomum, SAMN11605107)	142,224,004	79%	49%	235,876
SAMN11605108	trunk (Colossoma macropomum, SAMN11605108)	118,044,192	77%	47%	228,208
SAMN11605109	trunk (Colossoma macropomum, SAMN11605109)	94,770,760	78%	49%	242,765
SAMN11605110	trunk (Colossoma macropomum, SAMN11605110)	75,762,856	77%	52%	210,983
SAMN11605111	trunk (Colossoma macropomum, SAMN11605111)	97,412,372	80%	50%	240,927
SAMN11605112	trunk (Colossoma macropomum, SAMN11605112)	120,824,908	83%	43%	247,937
SAMN11605113	trunk (Colossoma macropomum, SAMN11605113)	120,699,622	75%	31%	235,969
SAMN11605114	trunk (Colossoma macropomum, SAMN11605114)	132,101,332	81%	36%	246,834
SAMN11605115	trunk (Colossoma macropomum, SAMN11605115)	81,354,006	79%	49%	241,450
SAMN13783598	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783598)	6,754,434	55%	7%	57,315
SAMN13783599	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783599)	4,210,463	52%	34%	89,014
SAMN13783600	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783600)	5,419,968	43%	15%	70,755
SAMN13783601	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783601)	7,614,932	62%	28%	99,469
SAMN13783602	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783602)	4,598,458	47%	19%	89,188
SAMN13783603	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783603)	2,281,003	29%	17%	33,254
SAMN13783604	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783604)	7,170,084	58%	13%	89,010
SAMN13783605	Juvenile, Brain (Colossoma macropomum, Juvenile, SAMN13783605)	2,000,769	46%	19%	46,658

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR5036134	SRX2360416	SRP093537	SAMN06014806	2,359,900	71%	1%
SRR5054068	SRX2375120	SRP093537	SAMN06014806	2,359,900	69%	2%
SRR5036139	SRX2360421	SRP093537	SAMN06014807	2,043,773	84%	12%
SRR5054072	SRX2375124	SRP093537	SAMN06014807	2,043,773	87%	16%
SRR5036137	SRX2360419	SRP093537	SAMN06014866	1,902,668	72%	0%
SRR5054070	SRX2375122	SRP093537	SAMN06014866	1,902,668	72%	0%
SRR5036136	SRX2360418	SRP093537	SAMN06014867	1,882,604	65%	11%
SRR5054067	SRX2375119	SRP093537	SAMN06014867	1,882,604	65%	11%
SRR5122711	SRX2437848	SRP095431	SAMN06168612	277,596	89%	34%
SRR6741523	SRX3714434	SRP132982	SAMN08554961	87,018,342	81%	24%
SRR9029884	SRX5807167	SRP197107	SAMN11605106	115,045,432	83%	49%
SRR9029883	SRX5807166	SRP197107	SAMN11605107	142,224,004	79%	49%
SRR9029882	SRX5807165	SRP197107	SAMN11605108	118,044,192	77%	47%
SRR9029881	SRX5807164	SRP197107	SAMN11605109	94,770,760	78%	49%
SRR9029880	SRX5807163	SRP197107	SAMN11605110	75,762,856	77%	52%
SRR9029879	SRX5807162	SRP197107	SAMN11605111	97,412,372	80%	50%
SRR9029878	SRX5807161	SRP197107	SAMN11605112	120,824,908	83%	43%
SRR9029877	SRX5807160	SRP197107	SAMN11605113	120,699,622	75%	31%
SRR9029876	SRX5807159	SRP197107	SAMN11605114	132,101,332	81%	36%
SRR9029875	SRX5807158	SRP197107	SAMN11605115	81,354,006	79%	49%
SRR10855593	SRX7525968	SRP240854	SAMN13783598	6,754,434	55%	7%
SRR10855592	SRX7525969	SRP240854	SAMN13783599	4,210,463	52%	34%
SRR10855591	SRX7525970	SRP240854	SAMN13783600	5,419,968	43%	15%
SRR10855590	SRX7525971	SRP240854	SAMN13783601	7,614,932	62%	28%
SRR10855589	SRX7525972	SRP240854	SAMN13783602	4,598,458	47%	19%
SRR10855588	SRX7525973	SRP240854	SAMN13783603	2,281,003	29%	17%
SRR10855587	SRX7525974	SRP240854	SAMN13783604	7,170,084	58%	13%
SRR10855586	SRX7525975	SRP240854	SAMN13783605	2,000,769	46%	19%
SRR11808608	SRX8359975	SRP262182	SAMN10879417	9,850,968	90%	47%
SRR11808607	SRX8359976	SRP262182	SAMN10879418	9,385,714	77%	38%
SRR11808604	SRX8359979	SRP262182	SAMN10879419	8,103,212	92%	47%
SRR11808603	SRX8359980	SRP262182	SAMN10879420	11,662,012	90%	47%
SRR11808602	SRX8359981	SRP262182	SAMN10879421	9,297,996	76%	42%
SRR11808601	SRX8359982	SRP262182	SAMN10879422	8,400,730	92%	47%
SRR11808600	SRX8359983	SRP262182	SAMN10879423	12,217,980	90%	46%
SRR11808599	SRX8359984	SRP262182	SAMN10879424	9,979,556	78%	42%
SRR11808598	SRX8359985	SRP262182	SAMN10879425	12,094,826	92%	46%
SRR11808597	SRX8359986	SRP262182	SAMN10879426	11,477,378	90%	44%
SRR11808606	SRX8359977	SRP262182	SAMN10879427	9,348,310	77%	39%
SRR11808605	SRX8359978	SRP262182	SAMN10879428	7,976,976	91%	44%
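
The by-sample read counts are the sums of the by-run read counts for each BioSample. The sketch below illustrates that roll-up for a handful of runs copied from the table above; it is an illustration of the relationship between the two tables, not the pipeline's own code.

```python
from collections import defaultdict

# (run, sample, number of reads) for a few rows of the by-run table.
runs = [
    ("SRR5036134", "SAMN06014806", 2_359_900),
    ("SRR5054068", "SAMN06014806", 2_359_900),
    ("SRR5036139", "SAMN06014807", 2_043_773),
    ("SRR5054072", "SAMN06014807", 2_043_773),
    ("SRR5122711", "SAMN06168612", 277_596),
]

# Sum the reads of all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, n_reads in runs:
    reads_per_sample[sample] += n_reads

for sample, n_reads in sorted(reads_per_sample.items()):
    print(f"{sample}\t{n_reads:,}")
```

The totals for SAMN06014806 (4,719,800), SAMN06014807 (4,087,546) and SAMN06168612 (277,596) match the by-sample table.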

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Betta splendens high-quality model RefSeq (XP_)	18,279	17,600 (96.29%)	17,600 (96.29%)	67.37%	75.82%
Actinopterygii GenBank	86,629	70,546 (81.43%)	70,546 (81.43%)	67.91%	79.83%
Actinopterygii known RefSeq (NP_)	25,473	6,727 (26.41%)	6,727 (26.41%)	67.46%	79.65%
Danio rerio high-quality model RefSeq (XP_)	7,718	7,348 (95.21%)	7,348 (95.21%)	66.81%	77.30%
Astyanax mexicanus high-quality model RefSeq (XP_)	16,692	16,394 (98.21%)	16,394 (98.21%)	71.33%	83.37%
Esox lucius high-quality model RefSeq (XP_)	18,508	17,806 (96.21%)	17,806 (96.21%)	67.76%	76.75%
Xiphophorus maculatus high-quality model RefSeq (XP_)	18,457	17,770 (96.28%)	17,770 (96.28%)	66.41%	75.29%
Homo sapiens known RefSeq (NP_)	60,129	39,112 (65.05%)	39,112 (65.05%)	66.77%	69.14%
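
The percentages in the alignment tables are the number of sequences passed to Gnomon divided by the number retrieved from Entrez. A minimal sketch that recomputes a few of them follows; the row labels are shortened for readability and the dictionary layout is an illustrative choice.

```python
# (sequences retrieved from Entrez, sequences passed to Gnomon),
# copied from the transcript and protein alignment tables.
rows = {
    "Same-species GenBank transcripts": (34, 29),
    "Betta splendens model RefSeq (XP_)": (18_279, 17_600),
    "Actinopterygii known RefSeq (NP_)": (25_473, 6_727),
    "Homo sapiens known RefSeq (NP_)": (60_129, 39_112),
}

# Format each ratio the way the report does: two decimals and a percent sign.
percent_passed = {
    source: f"{100.0 * passed / retrieved:.2f}%"
    for source, (retrieved, passed) in rows.items()
}

for source, pct in percent_passed.items():
    print(f"{source}: {pct}")
```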

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
