NCBI Lepisosteus oculatus Annotation Release 101

The RefSeq genome records for Lepisosteus oculatus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Lepisosteus oculatus Annotation Release 101.

Annotation release ID: 101
Date of Entrez queries for transcripts and proteins: Jan 6 2016
Date of submission of annotation to the public databases: Jan 8 2016
Software version: 6.5

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
LepOcu1	GCF_000242695.1	Broad Institute	01-13-2012	Reference	30 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	LepOcu1
Genes and pseudogenes	21,445
  protein-coding	18,843
  non-coding	2,485
  pseudogenes	117
  genes with variants	8,686
mRNAs	41,634
  fully-supported	38,798
  with > 5% ab initio	838
  partial	561
  with filled gap(s)	0
  known RefSeq (NM_)	2
  model RefSeq (XM_)	41,632
Other RNAs	4,534
  fully-supported	4,047
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	4,047
CDSs	41,706
  fully-supported	38,798
  with > 5% ab initio	1,104
  partial	564
  with major correction(s)	199
  known RefSeq (NP_)	2
  model RefSeq (XP_)	41,632
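
As a quick arithmetic check on the table above, the following minimal sketch adds up the subcategory counts and compares them with the reported totals. It assumes that the three gene classes partition the "Genes and pseudogenes" total and that known and model RefSeq mRNAs partition the mRNA total. The figures are consistent with that reading, but the report does not state it explicitly. The function name `parts_sum_to_total` is illustrative only.

```python
# Counts copied from the Feature counts table for LepOcu1.
genes_and_pseudogenes = 21_445
gene_classes = {"protein-coding": 18_843, "non-coding": 2_485, "pseudogenes": 117}

mrnas = 41_634
mrna_refseq = {"known RefSeq (NM_)": 2, "model RefSeq (XM_)": 41_632}


def parts_sum_to_total(parts, total):
    """Return True when the subcategory counts add up to the reported total."""
    return sum(parts.values()) == total


# Assumption: the subcategories listed in each dict are mutually exclusive.
genes_ok = parts_sum_to_total(gene_classes, genes_and_pseudogenes)
mrnas_ok = parts_sum_to_total(mrna_refseq, mrnas)
print(genes_ok, mrnas_ok)
```

"Genes with variants" is left out of the check because it overlaps the other gene classes rather than forming a separate class.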

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	21,328	26,981	11,935	71	830,748
All transcripts	46,168	3,771	3,066	71	97,822
  mRNA	41,634	4,003	3,273	243	97,822
  misc_RNA	903	3,256	2,673	150	16,121
  tRNA	487	74	73	71	84
  lncRNA	3,144	1,417	892	104	18,710
Single-exon transcripts	566	1,889	1,309	246	11,946
  coding transcripts (NM_/XM_)	566	1,889	1,309	246	11,946
CDSs	41,634	2,111	1,515	96	96,531
Exons	235,135	325	138	1	20,255
  in coding transcripts (NM_/XM_)	224,991	320	137	1	20,255
  in non-coding transcripts (NR_/XR_)	15,420	353	139	3	17,220
Introns	212,434	3,121	858	30	548,287
  in coding transcripts (NM_/XM_)	205,300	3,077	853	30	548,287
  in non-coding transcripts (NR_/XR_)	12,227	3,819	974	31	369,097

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.15	1	1	50
Number of exons per transcript	12.8	10	1	247

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 18771 coding genes, 17581 genes had a protein with an alignment covering 50% or more of the query and 9469 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
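
The two coverage definitions can be illustrated with a minimal sketch. It assumes a single contiguous alignment given in 1-based inclusive coordinates. How the pipeline combines multiple alignment segments is not described in this report, so that case is not handled. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence covered by one alignment (1-based, inclusive)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-400 of a 400-residue annotated protein
# (query) aligned to residues 1-390 of a 520-residue Swiss-Prot protein (target).
query_cov = coverage_percent(11, 400, 400)   # 390 of 400 residues
target_cov = coverage_percent(1, 390, 520)   # 390 of 520 residues
print(f"query coverage {query_cov:.1f}%, target coverage {target_cov:.1f}%")
```

In this example the annotated protein would count toward both the 50% and the 95% query-coverage thresholds, while its target coverage stays at 75%.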

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
LepOcu1	GCF_000242695.1	0.94%	22.32%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	2	2 (100.00%)	2 (100.00%)	100.00%	100.00%
Same-species Genbank	45	45 (100.00%)	39 (86.67%)	99.61%	98.23%
Actinopterygii known RefSeq (NM_/NR_)	24,504	7,094 (28.95%)	483 (1.97%)	89.19%	84.63%
Actinopterygii Genbank	189,818	46,668 (24.59%)	7,499 (3.95%)	86.89%	79.29%
Actinopterygii EST	5,625,891	149,935 (2.67%)	73,360 (1.30%)	87.90%	95.28%
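
The percentages in the Transcript alignments table can be recomputed from the raw counts. The sketch below assumes that both percentage columns are expressed relative to the number of sequences retrieved from Entrez, which matches the tabulated values for the rows shown. The helper name `percent_of_retrieved` is illustrative.

```python
def percent_of_retrieved(count, retrieved):
    """Format count as a percentage of the retrieved sequences, two decimals."""
    return f"{100.0 * count / retrieved:.2f}%"


# (retrieved from Entrez, aligned by Splign, passed to Gnomon), from the table.
rows = {
    "Same-species Genbank": (45, 45, 39),
    "Actinopterygii known RefSeq (NM_/NR_)": (24_504, 7_094, 483),
    "Actinopterygii EST": (5_625_891, 149_935, 73_360),
}

recomputed = {
    source: (
        percent_of_retrieved(aligned, retrieved),
        percent_of_retrieved(passed, retrieved),
    )
    for source, (retrieved, aligned, passed) in rows.items()
}

for source, (pct_aligned, pct_passed) in recomputed.items():
    print(f"{source}: aligned {pct_aligned}, passed to Gnomon {pct_passed}")
```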

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	1,647,085,306	74%	21%	277,678
SAMN02781061	liver (Lepisosteus oculatus, SAMN02781061)	91,618,066	75%	22%	164,572
SAMN02781069	stage 28 embryo (Lepisosteus oculatus, SAMN02781069)	80,557,234	71%	21%	193,425
SAMN02781072	eye (Lepisosteus oculatus, SAMN02781072)	72,541,368	78%	18%	194,254
SAMN02781078	testis (Lepisosteus oculatus, SAMN02781078)	81,272,844	68%	15%	198,365
SAMN02781081	kidney (Lepisosteus oculatus, SAMN02781081)	69,944,752	76%	19%	189,675
SAMN02781083	larvae 8dpf (Lepisosteus oculatus, SAMN02781083)	74,486,056	79%	23%	201,890
SAMN02781090	brain (Lepisosteus oculatus, SAMN02781090)	74,649,446	77%	16%	193,421
SAMN02781093	heart (Lepisosteus oculatus, SAMN02781093)	73,025,684	61%	16%	166,313
SAMN02781095	muscle (Lepisosteus oculatus, SAMN02781095)	71,564,360	78%	27%	139,433
SAMN02781105	skin (Lepisosteus oculatus, SAMN02781105)	67,889,584	75%	21%	179,978
SAMN02929473	Brain (Lepisosteus oculatus, unknown, female, SAMN02929473)	195,470,154	72%	16%	201,301
SAMN02929474	Gills (Lepisosteus oculatus, unknown, female, SAMN02929474)	52,775,242	77%	28%	135,904
SAMN02929475	Heart (Lepisosteus oculatus, unknown, female, SAMN02929475)	89,955,344	75%	24%	162,662
SAMN02929476	Muscle (Lepisosteus oculatus, unknown, female, SAMN02929476)	59,459,998	75%	27%	109,956
SAMN02929477	Liver (Lepisosteus oculatus, unknown, female, SAMN02929477)	55,461,062	74%	26%	121,621
SAMN02929478	Kidney (Lepisosteus oculatus, unknown, female, SAMN02929478)	64,244,520	75%	26%	133,865
SAMN02929479	Bones (Lepisosteus oculatus, unknown, female, SAMN02929479)	59,216,640	78%	21%	153,290
SAMN02929480	Intestine (Lepisosteus oculatus, unknown, female, SAMN02929480)	54,908,106	74%	22%	144,213
SAMN02929481	Embryo (Lepisosteus oculatus, stage 27-28, unknown, SAMN02929481)	50,857,540	80%	25%	188,741
SAMN02929482	Ovary (Lepisosteus oculatus, unknown, female, SAMN02929482)	68,897,496	69%	27%	149,607
SAMN02929483	Testis (Lepisosteus oculatus, unknown, male, SAMN02929483)	59,809,018	76%	21%	198,829
SAMN03253063	kidney,lung,brain,spleen,heart,liver (Lepisosteus oculatus, not collected, not collected, SAMN03253063)	78,480,792	73%	17%	183,259

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR1287992	SRX543376	SRP042013	SAMN02781061	45,985,684	75%	22%
SRR1288192	SRX543376	SRP042013	SAMN02781061	45,632,382	75%	22%
SRR1287998	SRX543382	SRP042013	SAMN02781069	40,155,272	71%	21%
SRR1288004	SRX543382	SRP042013	SAMN02781069	40,401,962	71%	21%
SRR1288001	SRX543385	SRP042013	SAMN02781072	36,154,820	78%	18%
SRR1288144	SRX543385	SRP042013	SAMN02781072	36,386,548	78%	18%
SRR1288131	SRX543514	SRP042013	SAMN02781078	40,494,548	68%	15%
SRR1288141	SRX543514	SRP042013	SAMN02781078	40,778,296	68%	15%
SRR1288134	SRX543517	SRP042013	SAMN02781081	35,082,888	76%	19%
SRR1288180	SRX543517	SRP042013	SAMN02781081	34,861,864	76%	19%
SRR1288136	SRX543519	SRP042013	SAMN02781083	37,115,836	79%	23%
SRR1288183	SRX543519	SRP042013	SAMN02781083	37,370,220	79%	23%
SRR1288146	SRX543528	SRP042013	SAMN02781090	37,437,850	77%	16%
SRR1288160	SRX543528	SRP042013	SAMN02781090	37,211,596	77%	16%
SRR1288148	SRX543532	SRP042013	SAMN02781093	36,628,018	61%	16%
SRR1288155	SRX543532	SRP042013	SAMN02781093	36,397,666	61%	16%
SRR1288154	SRX543535	SRP042013	SAMN02781095	35,911,514	78%	27%
SRR1288176	SRX543535	SRP042013	SAMN02781095	35,652,846	78%	27%
SRR1288186	SRX543560	SRP042013	SAMN02781105	34,072,794	75%	21%
SRR1288189	SRX543560	SRP042013	SAMN02781105	33,816,790	75%	21%
SRR1524250	SRX661015	SRP044782	SAMN02929473	195,470,154	72%	16%
SRR1524251	SRX661016	SRP044782	SAMN02929474	52,775,242	77%	28%
SRR1524252	SRX661017	SRP044782	SAMN02929475	89,955,344	75%	24%
SRR1524253	SRX661018	SRP044782	SAMN02929476	59,459,998	75%	27%
SRR1524254	SRX661019	SRP044782	SAMN02929477	55,461,062	74%	26%
SRR1524255	SRX661020	SRP044782	SAMN02929478	64,244,520	75%	26%
SRR1524256	SRX661021	SRP044782	SAMN02929479	59,216,640	78%	21%
SRR1524257	SRX661022	SRP044782	SAMN02929480	54,908,106	74%	22%
SRR1524258	SRX661023	SRP044782	SAMN02929481	50,857,540	80%	25%
SRR1524259	SRX661024	SRP044782	SAMN02929482	68,897,496	69%	27%
SRR1524260	SRX661025	SRP044782	SAMN02929483	59,809,018	76%	21%
SRR1693778	SRX796496	SRP050575	SAMN03253063	78,480,792	73%	17%
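
The per-sample read counts correspond to the sum of the reads of the runs belonging to each sample. The sketch below shows that roll-up for a few rows copied from the by-run table. It is only an illustration of the relationship between the two tables, not a description of how the pipeline itself aggregates runs.

```python
from collections import defaultdict

# (run, sample, number of reads) for a few rows of the by-run table.
runs = [
    ("SRR1287992", "SAMN02781061", 45_985_684),
    ("SRR1288192", "SAMN02781061", 45_632_382),
    ("SRR1287998", "SAMN02781069", 40_155_272),
    ("SRR1288004", "SAMN02781069", 40_401_962),
    ("SRR1524250", "SAMN02929473", 195_470_154),
]

# Sum the read counts of all runs that share a sample identifier.
reads_per_sample = defaultdict(int)
for _run, sample, n_reads in runs:
    reads_per_sample[sample] += n_reads

for sample, total in sorted(reads_per_sample.items()):
    print(f"{sample}\t{total:,}")
```

The totals obtained for SAMN02781061, SAMN02781069 and SAMN02929473 match the "Number of reads" column of the by-sample table.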

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Latimeria chalumnae GenBank	68	61 (89.71%)	61 (89.71%)	71.61%	76.99%
Latimeria chalumnae high-quality model RefSeq (XP_)	12,075	10,769 (89.18%)	10,769 (89.18%)	66.47%	70.22%
Latimeria chalumnae known RefSeq (NP_)	14	14 (100.00%)	14 (100.00%)	64.45%	64.81%
Actinopterygii GenBank	47,849	41,957 (87.69%)	41,957 (87.69%)	68.67%	75.44%
Actinopterygii known RefSeq (NP_)	8,809	8,184 (92.90%)	8,184 (92.90%)	69.66%	75.52%
Danio rerio GenBank	26,848	25,219 (93.93%)	25,219 (93.93%)	67.22%	73.74%
Danio rerio high-quality model RefSeq (XP_)	8,186	7,775 (94.98%)	7,775 (94.98%)	64.16%	67.91%
Danio rerio known RefSeq (NP_)	14,967	14,088 (94.13%)	14,088 (94.13%)	66.55%	73.83%
Homo sapiens known RefSeq (NP_)	39,983	34,045 (85.15%)	34,045 (85.15%)	65.75%	67.08%

Comparison of the current and previous annotations

The annotation produced for this release (101) was compared to the annotation in the previous release (100) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.
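
The scoring idea described above, overlap in exon sequence combined with matches in exon boundaries, can be illustrated with a minimal sketch. The exact scores used by the pipeline are not given in this report, so the two measures below (a Jaccard index over exon bases and over exon boundaries) are assumptions chosen for illustration. All function names and coordinates are hypothetical.

```python
def exon_bases(exons):
    """Set of genomic positions covered by (start, end) exons, 1-based inclusive."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def exon_boundaries(exons):
    """Set of tagged exon boundaries, keeping starts and ends distinct."""
    return {("start", s) for s, _ in exons} | {("end", e) for _, e in exons}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_score(current, previous):
    """Illustrative score: shared exon bases over all exon bases."""
    return jaccard(exon_bases(current), exon_bases(previous))


def boundary_score(current, previous):
    """Illustrative score: shared exon boundaries over all exon boundaries."""
    return jaccard(exon_boundaries(current), exon_boundaries(previous))


# Hypothetical transcript whose last exon was extended by 50 bases.
previous_tx = [(100, 199), (300, 399)]
current_tx = [(100, 199), (300, 449)]
print(overlap_score(current_tx, previous_tx), boundary_score(current_tx, previous_tx))
```

Under these assumed measures, the extended transcript keeps most of its exon sequence but loses one matching boundary. A pair like this is the sort that would be expected to fall into a "changed" category rather than "identical".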

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	LepOcu1 (Current) to LepOcu1 (Previous)
Identical	1%
Minor changes	34%
Major changes	43%
New	22%
Deprecated	2%
Other	<1%
Download the report	tabular, Genome Workbench

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22:134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
