NCBI Cyprinodon variegatus Annotation Release 100

The RefSeq genome records for Cyprinodon variegatus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Cyprinodon variegatus Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jan 8 2016
Date of submission of annotation to the public databases: Jan 14 2016
Software version: 6.5

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
C_variegatus-1.0	GCF_000732505.1	Aquatic Genome Models	10-01-2015	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	C_variegatus-1.0
Genes and pseudogenes	24,641
protein-coding	23,373
non-coding	1,010
pseudogenes	258
genes with variants	6,257
mRNAs	36,511
fully-supported	32,660
with > 5% ab initio	1,265
partial	1,059
with filled gap(s)	1
known RefSeq (NM_)	0
model RefSeq (XM_)	36,511
Other RNAs	1,811
fully-supported	1,368
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1,368
CDSs	36,620
fully-supported	32,660
with > 5% ab initio	1,514
partial	1,069
with major correction(s)	262
known RefSeq (NP_)	0
model RefSeq (XP_)	36,511
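
As a quick consistency check on the table above, the three gene categories should add up to the "Genes and pseudogenes" total. The following is a minimal Python sketch; the `parse_count` helper and the dictionary layout are illustrative assumptions and not part of the annotation pipeline, and the values are copied by hand from the table.

```python
def parse_count(text: str) -> int:
    """Convert a count written with thousands separators, such as '24,641', to an int."""
    return int(text.replace(",", ""))


# Values copied from the Feature counts table for C_variegatus-1.0.
gene_counts = {
    "protein-coding": "23,373",
    "non-coding": "1,010",
    "pseudogenes": "258",
}
reported_total = parse_count("24,641")  # "Genes and pseudogenes" row

# Sum the three categories and compare with the reported total.
category_sum = sum(parse_count(value) for value in gene_counts.values())
print(category_sum, category_sum == reported_total)  # prints: 24641 True
```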

Detailed reports

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	24,383	22,479	10,710	71	1,206,010
All transcripts	38,322	3,030	2,438	71	95,566
mRNA	36,511	3,129	2,514	207	95,566
misc_RNA	298	3,224	2,691	209	13,625
tRNA	443	74	73	71	84
lncRNA	1,070	798	564	119	4,809
Single-exon transcripts	817	1,616	1,394	240	10,058
coding transcripts (NM_/XM_ )	817	1,616	1,394	240	10,058
CDSs	36,511	2,061	1,485	96	94,296
Exons	258,950	263	137	1	17,310
in coding transcripts (NM_/XM_ )	255,215	263	137	1	17,310
in non-coding transcripts (NR_/XR_ )	6,230	230	130	2	7,053
Introns	232,227	2,367	486	30	665,340
in coding transcripts (NM_/XM_ )	229,666	2,354	487	30	665,340
in non-coding transcripts (NR_/XR_ )	4,994	3,240	455	30	258,741

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.56	1	1	43
Number of exons per transcript	12.54	9	1	255

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,264 coding genes, 21,748 genes had a protein with an alignment covering 50% or more of the query and 10,130 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
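
The coverage definition above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it treats the alignment as a single contiguous span with 1-based inclusive coordinates and ignores gaps, which may differ from how the pipeline computes coverage. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in an alignment.

    Coordinates are 1-based and inclusive; the alignment is assumed to be
    one contiguous span on this sequence.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical hit: a 300-residue annotated protein (query) aligned over
# residues 11-290 to residues 1-280 of a 560-residue target protein.
query_coverage = coverage_percent(11, 290, 300)   # 280 of 300 residues
target_coverage = coverage_percent(1, 280, 560)   # 280 of 560 residues

print(f"query coverage:  {query_coverage:.1f}%")   # prints: query coverage:  93.3%
print(f"target coverage: {target_coverage:.1f}%")  # prints: target coverage: 50.0%
```

In this hypothetical case the gene would count toward the "50% or more of the query" tally but not toward the "95% or more" tally.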

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
C_variegatus-1.0	GCF_000732505.1	1.63%	26.95%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	35	35 (100.00%)	35 (100.00%)	98.80%	99.51%
Same-species EST	8,099	7,053 (87.08%)	6,330 (78.16%)	98.94%	98.81%
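
The percentages in the table above appear to be taken relative to the number of sequences retrieved from Entrez. A small Python sketch, using the Same-species EST row and a hypothetical helper name, reproduces them under that assumption:

```python
def percent_of_retrieved(count: int, retrieved: int) -> str:
    """Format count as a percentage of the retrieved sequences, to two decimals."""
    return f"{100.0 * count / retrieved:.2f}%"


# Same-species EST row, copied from the Transcript alignments table.
retrieved = 8_099
aligned_by_splign = 7_053
passed_to_gnomon = 6_330

print(percent_of_retrieved(aligned_by_splign, retrieved))  # prints: 87.08%
print(percent_of_retrieved(passed_to_gnomon, retrieved))   # prints: 78.16%
```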

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	1,494,119,229	62%	32%	253,595
SAMN04327297	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327297)	44,270,407	56%	30%	181,847
SAMN04327298	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327298)	33,142,360	80%	40%	210,415
SAMN04327299	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327299)	49,351,671	69%	37%	217,641
SAMN04327300	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327300)	52,222,141	46%	24%	204,961
SAMN04327301	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327301)	55,696,427	72%	39%	192,985
SAMN04327302	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327302)	52,173,949	67%	33%	213,938
SAMN04327303	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327303)	57,673,534	73%	41%	220,245
SAMN04327304	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327304)	36,711,239	53%	28%	196,411
SAMN04327305	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327305)	44,993,613	79%	43%	188,646
SAMN04327306	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327306)	33,062,560	74%	35%	207,252
SAMN04327307	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327307)	36,398,415	75%	40%	212,247
SAMN04327308	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327308)	57,408,687	46%	23%	208,106
SAMN04327309	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327309)	46,041,780	41%	22%	176,612
SAMN04327310	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327310)	41,277,136	44%	21%	203,175
SAMN04327311	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327311)	50,607,003	58%	32%	212,539
SAMN04327312	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327312)	44,017,712	54%	26%	207,944
SAMN04327313	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327313)	47,264,253	71%	37%	189,238
SAMN04327314	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327314)	45,354,406	59%	27%	211,052
SAMN04327315	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327315)	39,018,768	44%	23%	204,172
SAMN04327316	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327316)	55,743,934	76%	40%	216,630
SAMN04327317	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327317)	48,785,471	57%	29%	185,492
SAMN04327318	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327318)	49,738,737	53%	24%	209,231
SAMN04327319	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327319)	40,848,500	79%	44%	215,728
SAMN04327320	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327320)	43,903,006	71%	37%	213,277
SAMN04327321	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327321)	38,312,801	54%	27%	178,680
SAMN04327322	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327322)	43,281,760	73%	36%	211,530
SAMN04327323	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327323)	46,312,975	75%	42%	216,658
SAMN04327324	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327324)	56,805,610	44%	22%	210,382
SAMN04327325	head, st26 (Cyprinodon variegatus, pooled male and female, SAMN04327325)	50,005,605	45%	23%	180,051
SAMN04327326	head, st29 (Cyprinodon variegatus, pooled male and female, SAMN04327326)	48,483,723	42%	20%	202,779
SAMN04327327	head, 8dpf (Cyprinodon variegatus, pooled male and female, SAMN04327327)	55,221,792	80%	44%	218,136
SAMN04327328	head, 15dpf (Cyprinodon variegatus, pooled male and female, SAMN04327328)	49,989,254	75%	38%	211,785

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR2990976	SRX1478890	SRP067182	SAMN04327297	44,270,407	56%	30%
SRR2990977	SRX1478891	SRP067182	SAMN04327298	33,142,360	80%	40%
SRR2990988	SRX1478902	SRP067182	SAMN04327299	49,351,671	69%	37%
SRR2990999	SRX1478913	SRP067182	SAMN04327300	52,222,141	46%	24%
SRR2991002	SRX1478916	SRP067182	SAMN04327301	3,283,182	28%	19%
SRR2991006	SRX1478920	SRP067182	SAMN04327301	52,413,245	75%	40%
SRR2991007	SRX1478921	SRP067182	SAMN04327302	52,173,949	67%	33%
SRR2991003	SRX1478917	SRP067182	SAMN04327303	3,411,546	31%	22%
SRR2991008	SRX1478922	SRP067182	SAMN04327303	54,261,988	75%	42%
SRR2991009	SRX1478923	SRP067182	SAMN04327304	36,711,239	53%	28%
SRR2991010	SRX1478924	SRP067182	SAMN04327305	44,993,613	79%	43%
SRR2991011	SRX1478925	SRP067182	SAMN04327306	33,062,560	74%	35%
SRR2990978	SRX1478892	SRP067182	SAMN04327307	36,398,415	75%	40%
SRR2990979	SRX1478893	SRP067182	SAMN04327308	57,408,687	46%	23%
SRR2990980	SRX1478894	SRP067182	SAMN04327309	46,041,780	41%	22%
SRR2990981	SRX1478895	SRP067182	SAMN04327310	41,277,136	44%	21%
SRR2990982	SRX1478896	SRP067182	SAMN04327311	50,607,003	58%	32%
SRR2990983	SRX1478897	SRP067182	SAMN04327312	44,017,712	54%	26%
SRR2990984	SRX1478898	SRP067182	SAMN04327313	47,264,253	71%	37%
SRR2990985	SRX1478899	SRP067182	SAMN04327314	45,354,406	59%	27%
SRR2990986	SRX1478900	SRP067182	SAMN04327315	39,018,768	44%	23%
SRR2990987	SRX1478901	SRP067182	SAMN04327316	53,585,502	78%	41%
SRR2991004	SRX1478918	SRP067182	SAMN04327316	2,158,432	33%	22%
SRR2990989	SRX1478903	SRP067182	SAMN04327317	48,785,471	57%	29%
SRR2990990	SRX1478904	SRP067182	SAMN04327318	49,738,737	53%	24%
SRR2990991	SRX1478905	SRP067182	SAMN04327319	40,848,500	79%	44%
SRR2990992	SRX1478906	SRP067182	SAMN04327320	43,903,006	71%	37%
SRR2990993	SRX1478907	SRP067182	SAMN04327321	38,312,801	54%	27%
SRR2990994	SRX1478908	SRP067182	SAMN04327322	43,281,760	73%	36%
SRR2990995	SRX1478909	SRP067182	SAMN04327323	46,312,975	75%	42%
SRR2990996	SRX1478910	SRP067182	SAMN04327324	56,805,610	44%	22%
SRR2990997	SRX1478911	SRP067182	SAMN04327325	50,005,605	45%	23%
SRR2990998	SRX1478912	SRP067182	SAMN04327326	48,483,723	42%	20%
SRR2991000	SRX1478914	SRP067182	SAMN04327327	55,221,792	80%	44%
SRR2991001	SRX1478915	SRP067182	SAMN04327328	46,288,038	78%	39%
SRR2991005	SRX1478919	SRP067182	SAMN04327328	3,701,216	36%	25%
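
Some samples have more than one run. The by-sample figures are consistent with summing the reads of a sample's runs and taking a read-weighted average of the run percentages. The Python sketch below illustrates this for two samples; it is an assumption about how the values relate, not a description of the pipeline, and because the run percentages are already rounded the result is approximate.

```python
from collections import defaultdict

# (run, sample, number of reads, percent aligned reads), copied from the
# by-run table. Only the runs of two multi-run samples are shown.
runs = [
    ("SRR2991002", "SAMN04327301", 3_283_182, 28),
    ("SRR2991006", "SAMN04327301", 52_413_245, 75),
    ("SRR2991001", "SAMN04327328", 46_288_038, 78),
    ("SRR2991005", "SAMN04327328", 3_701_216, 36),
]

total_reads = defaultdict(int)
aligned_reads = defaultdict(float)
for _run, sample, reads, percent_aligned in runs:
    total_reads[sample] += reads
    # Estimate the aligned read count from the rounded run percentage.
    aligned_reads[sample] += reads * percent_aligned / 100.0

for sample in sorted(total_reads):
    percent = 100.0 * aligned_reads[sample] / total_reads[sample]
    print(sample, f"{total_reads[sample]:,}", f"{percent:.0f}%")
# prints:
# SAMN04327301 55,696,427 72%
# SAMN04327328 49,989,254 75%
```

Both lines match the read counts and aligned percentages reported for these samples in the by-sample table.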

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Poecilia formosa high-quality model RefSeq (XP_)	16,804	16,716 (99.48%)	16,716 (99.48%)	72.08%	78.73%
Actinopterygii GenBank	74,704	69,274 (92.73%)	69,274 (92.73%)	68.19%	76.58%
Actinopterygii known RefSeq (NP_)	23,776	22,723 (95.57%)	22,723 (95.57%)	68.16%	74.84%
Danio rerio high-quality model RefSeq (XP_)	8,186	7,816 (95.48%)	7,816 (95.48%)	65.06%	67.83%
Fundulus heteroclitus high-quality model RefSeq (XP_)	13,959	13,890 (99.51%)	13,890 (99.51%)	72.81%	79.32%
Homo sapiens known RefSeq (NP_)	40,054	33,585 (83.85%)	33,585 (83.85%)	65.86%	64.81%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
