NCBI Notothenia coriiceps Annotation Release 100

The RefSeq genome records for Notothenia coriiceps were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Notothenia coriiceps Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Dec 18 2014
Date of submission of annotation to the public databases: Dec 23 2014
Software version: 6.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
NC01	GCF_000735185.1	Antarctic Fish Genome Project	07-29-2014	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	NC01
Genes and pseudogenes	27,613
  protein-coding	24,795
  non-coding	2,515
  pseudogenes	303
  genes with variants	4,353
mRNAs	32,169
  fully-supported	24,676
  with > 5% ab initio	2,933
  partial	4,849
  with filled gap(s)	7
  known RefSeq (NM_)	0
  model RefSeq (XM_)	32,169
Other RNAs	3,515
  fully-supported	3,006
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	3,006
CDSs	32,188
  fully-supported	24,676
  with > 5% ab initio	3,619
  partial	4,854
  with major correction(s)	439
  known RefSeq (NP_)	0
  model RefSeq (XP_)	32,169
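
The accession prefixes in the table above separate known RefSeq records (NM_, NR_, NP_) from model records (XM_, XR_, XP_). As a minimal sketch, a lookup along these lines could sort accessions into those groups. The function name and the example accession are hypothetical and used only for illustration.

```python
# Prefix meanings as listed in the feature-count table above.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "known"),
    "XM_": ("mRNA", "model"),
    "NR_": ("other RNA", "known"),
    "XR_": ("other RNA", "model"),
    "NP_": ("protein", "known"),
    "XP_": ("protein", "model"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession.

    Returns None when the three-character prefix is not one of the
    six prefixes that appear in the table.
    """
    return REFSEQ_PREFIXES.get(accession[:3])


# Hypothetical accession, not taken from this annotation release.
print(classify_refseq("XM_000000000.1"))
```

In this release every mRNA, other RNA and protein record counted under a RefSeq prefix is a model record, so all of them would fall in the "model" group.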

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	27,310	11,198	4,971	71	1,038,955
All transcripts	35,684	2,194	1,677	71	25,080
  mRNA	32,169	2,349	1,827	214	25,080
  misc_RNA	341	2,277	1,856	113	16,229
  tRNA	509	75	73	71	86
  lncRNA	2,665	717	429	80	10,917
Single-exon transcripts	966	1,682	1,353	228	12,221
  coding transcripts (NM_/XM_)	966	1,682	1,353	228	12,221
CDSs	32,169	1,448	1,041	114	23,985
Exons	208,239	268	135	1	13,388
  in coding transcripts (NM_/XM_)	200,076	269	136	1	13,388
  in non-coding transcripts (NR_/XR_)	10,127	240	111	2	10,717
Introns	179,786	1,509	433	30	962,814
  in coding transcripts (NM_/XM_)	174,405	1,491	430	30	962,814
  in non-coding transcripts (NR_/XR_)	7,289	1,866	475	30	382,370
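
The "All transcripts" row can be cross-checked against the four transcript types listed under it. The sketch below, with counts and mean lengths copied from the table, sums the counts and recomputes the overall mean as a count-weighted average. Because the per-type means in the table are already rounded, the recomputed mean is only approximate.

```python
# (count, mean length in bp) per transcript type, copied from the table above.
transcript_types = {
    "mRNA": (32169, 2349),
    "misc_RNA": (341, 2277),
    "tRNA": (509, 75),
    "lncRNA": (2665, 717),
}

# Total number of transcripts across the four types.
total_count = sum(count for count, _ in transcript_types.values())

# Count-weighted average of the per-type mean lengths.
weighted_mean = (
    sum(count * mean for count, mean in transcript_types.values()) / total_count
)

print(total_count)
print(round(weighted_mean))
```

Both values agree with the "All transcripts" row (35,684 transcripts, mean length 2,194 bp).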

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.29	1	1	50
Number of exons per transcript	8.47	6	1	128

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 24,776 coding genes, 23,225 genes had a protein with an alignment covering 50% or more of the query and 9,753 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
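
As a rough illustration of these definitions, the sketch below computes query and target coverage for one made-up alignment and then expresses the gene counts quoted above as shares of all coding genes. The function name and the alignment lengths are hypothetical. The sketch also assumes the aligned span is given as a number of residues on each sequence, which may differ from how the pipeline measures it.

```python
def coverage_pct(aligned_length, sequence_length):
    """Percentage of a sequence's length that is included in the alignment."""
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: 180 residues of a 200-residue annotated protein
# (the query) aligned to 180 residues of a 300-residue Swiss-Prot protein
# (the target).
query_cov = coverage_pct(180, 200)
target_cov = coverage_pct(180, 300)
print(query_cov, target_cov)

# Gene counts quoted in the text above.
coding_genes = 24776
genes_query_cov_50 = 23225
genes_query_cov_95 = 9753

share_50 = 100.0 * genes_query_cov_50 / coding_genes
share_95 = 100.0 * genes_query_cov_95 / coding_genes
print(f"{share_50:.2f}% of coding genes reach 50% query coverage")
print(f"{share_95:.2f}% of coding genes reach 95% query coverage")
```

With the counts from this report, roughly 94% of coding genes reach the 50% threshold and roughly 39% reach the 95% threshold.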

[Figure: cumulative graph of the number of genes with alignments above a given query or target coverage threshold, with corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline shown for comparison. Query: annotated proteins. Target: UniProtKB/Swiss-Prot curated proteins.]

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
NC01	GCF_000735185.1	2.65%	28.70%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	60	58 (96.67%)	53 (88.33%)	99.07%	83.03%

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Alignment statistics by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Number (%) of aligned reads	Number (%) of spliced reads	Number of introns
All	Aggregate of all aligned samples	946,922,953	760,035,812 (80.26%)	216,315,219 (22.84%)	232,976
SAMN02389858	blood (Notothenia coriiceps, SAMN02389858)	8,335,794	6,470,312 (77.62%)	1,873,813 (22.48%)	74,836
SAMN02389859	brain (Notothenia coriiceps, SAMN02389859)	33,616,256	27,249,898 (81.06%)	6,315,984 (18.79%)	170,261
SAMN02389860	skin (Notothenia coriiceps, SAMN02389860)	67,633,352	52,502,897 (77.63%)	16,057,951 (23.74%)	158,012
SAMN02389883	egg (Notothenia coriiceps, SAMN02389883)	50,160,667	36,854,805 (73.47%)	9,352,057 (18.64%)	139,109
SAMN02389884	kidney (Notothenia coriiceps, SAMN02389884)	60,161,028	44,226,938 (73.51%)	14,279,965 (23.74%)	158,846
SAMN02389885	muscle (Notothenia coriiceps, SAMN02389885)	60,118,476	50,523,864 (84.04%)	20,115,694 (33.46%)	117,321
SAMN02389886	stomach (Notothenia coriiceps, SAMN02389886)	5,077,628	2,233,979 (44.00%)	692,535 (13.64%)	42,402
SAMN02400674	brain (Notothenia coriiceps, SAMN02400674)	73,591,374	58,315,323 (79.24%)	14,378,796 (19.54%)	185,927
SAMN02400675	brain (Notothenia coriiceps, SAMN02400675)	71,284,436	54,948,074 (77.08%)	10,637,563 (14.92%)	171,273
SAMN02400676	brain (Notothenia coriiceps, SAMN02400676)	60,095,576	46,960,820 (78.14%)	10,350,228 (17.22%)	170,454
SAMN02400677	skin (Notothenia coriiceps, SAMN02400677)	86,889,450	72,102,550 (82.98%)	21,365,338 (24.59%)	171,623
SAMN02400678	skin (Notothenia coriiceps, SAMN02400678)	77,110,504	64,695,083 (83.90%)	19,210,792 (24.91%)	167,025
SAMN02400680	skin (Notothenia coriiceps, SAMN02400680)	77,090,498	62,953,255 (81.66%)	18,734,102 (24.30%)	164,507
SAMN02400681	whole blood (Notothenia coriiceps, SAMN02400681)	71,755,790	59,652,101 (83.13%)	16,954,111 (23.63%)	120,568
SAMN02400682	whole blood (Notothenia coriiceps, SAMN02400682)	70,335,794	59,874,105 (85.13%)	18,927,881 (26.91%)	111,507
SAMN02400683	whole blood (Notothenia coriiceps, SAMN02400683)	73,666,330	60,471,808 (82.09%)	17,068,409 (23.17%)	117,092
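
The cells of this table combine a count and a percentage, for example "2,233,979 (44.00%)". The sketch below shows one possible way to parse such a tab-separated row, recompute the aligned percentage from the counts, and confirm that the per-sample read counts add up to the aggregate row. The helper names are hypothetical. The row and the read counts are copied from the table.

```python
import re

# Matches cells such as "2,233,979 (44.00%)".
CELL = re.compile(r"([\d,]+) \(([\d.]+)%\)")


def to_int(text):
    """Convert a count with thousands separators, e.g. '5,077,628', to int."""
    return int(text.replace(",", ""))


def parse_count_pct(cell):
    """Split a 'count (pct%)' cell into (int count, float percentage)."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return to_int(match.group(1)), float(match.group(2))


# One row copied from the table (stomach sample).
row = (
    "SAMN02389886\tstomach (Notothenia coriiceps, SAMN02389886)\t"
    "5,077,628\t2,233,979 (44.00%)\t692,535 (13.64%)\t42,402"
)
fields = row.split("\t")
reads = to_int(fields[2])
aligned, reported_pct = parse_count_pct(fields[3])
recomputed_pct = round(100.0 * aligned / reads, 2)
print(reported_pct, recomputed_pct)

# Read counts of the 16 samples, in table order.
sample_reads = [
    8335794, 33616256, 67633352, 50160667, 60161028, 60118476,
    5077628, 73591374, 71284436, 60095576, 86889450, 77110504,
    77090498, 71755790, 70335794, 73666330,
]
aggregate_reads = 946922953  # "All" row
print(sum(sample_reads) == aggregate_reads)
```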

Alignment statistics by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Number (%) of aligned reads	Number (%) of spliced reads
SRR1015895	SRX372095	SRP031878	SAMN02389858	8,335,794	6,470,312 (77.62%)	1,873,813 (22.48%)
SRR1015901	SRX372096	SRP031878	SAMN02389859	33,616,256	27,249,898 (81.06%)	6,315,984 (18.79%)
SRR1015899	SRX372097	SRP031878	SAMN02389860	67,504,452	52,439,492 (77.68%)	16,019,469 (23.73%)
SRR1535142	SRX668454	SRP031878	SAMN02389860	128,900	63,405 (49.19%)	38,482 (29.85%)
SRR1015896	SRX372098	SRP031878	SAMN02389883	50,000,000	36,775,973 (73.55%)	9,309,067 (18.62%)
SRR1535133	SRX668450	SRP031878	SAMN02389883	160,667	78,832 (49.07%)	42,990 (26.76%)
SRR1015897	SRX372099	SRP031878	SAMN02389884	60,161,028	44,226,938 (73.51%)	14,279,965 (23.74%)
SRR1015898	SRX372100	SRP031878	SAMN02389885	60,000,000	50,462,071 (84.10%)	20,080,374 (33.47%)
SRR1535158	SRX668455	SRP031878	SAMN02389885	118,476	61,793 (52.16%)	35,320 (29.81%)
SRR1015900	SRX372094	SRP031878	SAMN02389886	5,077,628	2,233,979 (44.00%)	692,535 (13.64%)
SRR1015883	SRX382147	SRP031878	SAMN02400674	39,975,118	31,065,425 (77.71%)	8,062,812 (20.17%)
SRR1015884	SRX382147	SRP031878	SAMN02400674	33,616,256	27,249,898 (81.06%)	6,315,984 (18.79%)
SRR1015885	SRX382148	SRP031878	SAMN02400675	36,471,762	28,115,006 (77.09%)	5,261,205 (14.43%)
SRR1015886	SRX382148	SRP031878	SAMN02400675	34,812,674	26,833,068 (77.08%)	5,376,358 (15.44%)
SRR1015887	SRX382149	SRP031878	SAMN02400676	33,196,790	25,639,869 (77.24%)	5,726,320 (17.25%)
SRR1015888	SRX382149	SRP031878	SAMN02400676	26,898,786	21,320,951 (79.26%)	4,623,908 (17.19%)
SRR1015889	SRX382168	SRP031878	SAMN02400677	34,983,508	28,714,794 (82.08%)	8,300,313 (23.73%)
SRR1015890	SRX382168	SRP031878	SAMN02400677	51,905,942	43,387,756 (83.59%)	13,065,025 (25.17%)
SRR1015893	SRX382172	SRP031878	SAMN02400678	45,418,514	37,979,146 (83.62%)	11,111,857 (24.47%)
SRR1015894	SRX382172	SRP031878	SAMN02400678	31,691,990	26,715,937 (84.30%)	8,098,935 (25.56%)
SRR1015891	SRX382171	SRP031878	SAMN02400680	43,262,444	35,143,197 (81.23%)	10,364,008 (23.96%)
SRR1015892	SRX382171	SRP031878	SAMN02400680	33,828,054	27,810,058 (82.21%)	8,370,094 (24.74%)
SRR1015840	SRX382150	SRP031878	SAMN02400681	37,978,348	32,076,660 (84.46%)	9,262,228 (24.39%)
SRR1015842	SRX382150	SRP031878	SAMN02400681	33,777,442	27,575,441 (81.64%)	7,691,883 (22.77%)
SRR1015881	SRX382167	SRP031878	SAMN02400682	37,743,234	31,583,089 (83.68%)	9,794,774 (25.95%)
SRR1015882	SRX382167	SRP031878	SAMN02400682	32,592,560	28,291,016 (86.80%)	9,133,107 (28.02%)
SRR1015879	SRX382161	SRP031878	SAMN02400683	38,834,040	31,029,106 (79.90%)	8,007,737 (20.62%)
SRR1015880	SRX382161	SRP031878	SAMN02400683	34,832,290	29,442,702 (84.53%)	9,060,672 (26.01%)
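
Several samples were sequenced in more than one run, so the per-sample figures are sums over runs. The following sketch groups a few of the runs listed above by sample and totals their counts. It is only an illustration of the bookkeeping and not the pipeline's own code. The four rows are copied from the run table.

```python
from collections import defaultdict

# (run, sample, reads, aligned reads, spliced reads), copied from the table.
runs = [
    ("SRR1015899", "SAMN02389860", 67504452, 52439492, 16019469),
    ("SRR1535142", "SAMN02389860", 128900, 63405, 38482),
    ("SRR1015883", "SAMN02400674", 39975118, 31065425, 8062812),
    ("SRR1015884", "SAMN02400674", 33616256, 27249898, 6315984),
]

# Accumulate [reads, aligned, spliced] per sample.
totals = defaultdict(lambda: [0, 0, 0])
for _run, sample, reads, aligned, spliced in runs:
    totals[sample][0] += reads
    totals[sample][1] += aligned
    totals[sample][2] += spliced

for sample, (reads, aligned, spliced) in sorted(totals.items()):
    aligned_pct = 100.0 * aligned / reads
    print(f"{sample}\t{reads:,}\t{aligned:,} ({aligned_pct:.2f}%)\t{spliced:,}")
```

The totals for SAMN02389860 and SAMN02400674 match the corresponding rows of the per-sample table.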

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Actinopterygii GenBank	71,417	62,964 (88.16%)	62,964 (88.16%)	70.19%	72.56%
Actinopterygii known RefSeq (NP_)	22,367	20,224 (90.42%)	20,224 (90.42%)	68.87%	69.35%
Same-species GenBank	48	47 (97.92%)	47 (97.92%)	80.01%	77.23%
Homo sapiens known RefSeq (NP_)	38,277	29,590 (77.30%)	29,590 (77.30%)	65.58%	57.74%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20