NCBI Eumetopias jubatus Annotation Release 100

The RefSeq genome records for Eumetopias jubatus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Eumetopias jubatus Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Feb 4 2019
Date of submission of annotation to the public databases: Feb 11 2019
Software version: 8.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM402803v1	GCF_004028035.1	Canada's Genomic Enterprise	01-15-2019	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ASM402803v1
Genes and pseudogenes	30,336
protein-coding	19,668
non-coding	3,786
transcribed pseudogenes	696
non-transcribed pseudogenes	6,118
genes with variants	8,430
immunoglobulin/T-cell receptor gene segments	68
other	0
mRNAs	39,913
fully-supported	37,834
with > 5% ab initio	1,041
partial	460
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	39,913
non-coding RNAs	5,921
fully-supported	3,670
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	5,440
pseudo transcripts	699
fully-supported	639
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	699
CDSs	39,994
fully-supported	37,834
with > 5% ab initio	1,157
partial	470
with major correction(s)	1,047
known RefSeq (NP_)	13
model RefSeq (XP_)	39,913
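
As a quick sanity check on the table above, the gene subcategories appear to add up to the "Genes and pseudogenes" total. The sketch below assumes that reading, which the report does not state explicitly, and treats "genes with variants" as an overlapping property rather than a separate category. The dictionary and variable names are illustrative only.

```python
# Gene counts copied from the feature-count table for ASM402803v1.
gene_counts = {
    "protein-coding": 19_668,
    "non-coding": 3_786,
    "transcribed pseudogenes": 696,
    "non-transcribed pseudogenes": 6_118,
    "immunoglobulin/T-cell receptor gene segments": 68,
    "other": 0,
}
reported_total = 30_336  # "Genes and pseudogenes" row

# Assumption: the categories above are mutually exclusive and exhaustive.
computed_total = sum(gene_counts.values())
print(computed_total == reported_total)  # True

# Protein-coding plus non-coding genes match the "Genes" count (23,454)
# in the feature-length table, which excludes pseudogenes.
non_pseudogene_genes = gene_counts["protein-coding"] + gene_counts["non-coding"]
print(non_pseudogene_genes)  # 23454
```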

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	23,454	41,412	12,608	44	2,205,313
All transcripts	45,834	2,754	2,225	44	103,144
mRNA	39,913	2,956	2,396	108	103,144
misc_RNA	1,337	3,109	2,595	113	15,613
tRNA	479	74	73	59	85
lncRNA	2,333	1,649	1,126	63	9,379
snoRNA	603	110	104	44	329
snRNA	1,134	116	107	52	200
guide_RNA	25	164	132	79	411
rRNA	10	527	120	117	1,869
Single-exon transcripts	1,880	1,203	951	114	11,661
coding transcripts (NM_/XM_)	1,878	1,204	951	114	11,661
non-coding transcripts (NR_/XR_)	2	359	427	290	427
CDSs	39,926	1,896	1,416	96	103,044
Exons	228,100	268	134	1	17,106
in coding transcripts (NM_/XM_)	219,846	261	133	1	17,106
in non-coding transcripts (NR_/XR_)	17,086	309	138	2	10,239
Introns	204,935	5,397	1,383	30	997,067
in coding transcripts (NM_/XM_)	198,979	5,396	1,377	30	997,067
in non-coding transcripts (NR_/XR_)	14,504	4,600	1,477	37	278,480

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.97	1	1	50
Number of exons per transcript	11.22	9	1	314

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 19,655 coding genes, 19,397 genes had a protein with an alignment covering 50% or more of the query and 16,835 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
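
The two coverage definitions above can be illustrated with a short sketch. It assumes a single ungapped alignment described by 1-based, inclusive coordinates; the pipeline's actual handling of gaps or multiple alignment segments is not described in this report, and the function name and example hit are hypothetical.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence spanned by an alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical hit: residues 11-460 of a 500-residue annotated protein (query)
# align to residues 1-450 of a 450-residue Swiss-Prot protein (target).
query_coverage = coverage_percent(11, 460, 500)
target_coverage = coverage_percent(1, 450, 450)
print(query_coverage, target_coverage)  # 90.0 100.0
```

In this example the hit would count toward the "50% or more of the query" total but not toward the "95% or more" total.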

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
ASM402803v1	GCF_004028035.1	43.28%	33.18%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	8	8 (100.00%)	8 (100.00%)	99.87%	100.00%
Carnivora known RefSeq (NM_/NR_)	2,893	2,688 (92.91%)	1,978 (68.37%)	93.73%	97.52%
Carnivora GenBank	6,265	5,657 (90.30%)	2,778 (44.34%)	92.90%	96.49%
Carnivora TSA	319,728	279,783 (87.51%)	98,616 (30.84%)	97.02%	98.80%
Carnivora EST	428,476	281,667 (65.74%)	207,799 (48.50%)	92.30%	97.46%
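
The percentages in parentheses in the table above appear to be expressed relative to the number of sequences retrieved from Entrez, for both the "aligned by Splign" and "passed to Gnomon" columns. The sketch below recomputes them under that assumption; the tuple layout and function name are illustrative.

```python
# (source, retrieved from Entrez, aligned by Splign, passed to Gnomon),
# copied from the transcript alignments table.
transcript_sources = [
    ("Same-species GenBank", 8, 8, 8),
    ("Carnivora known RefSeq (NM_/NR_)", 2_893, 2_688, 1_978),
    ("Carnivora GenBank", 6_265, 5_657, 2_778),
    ("Carnivora TSA", 319_728, 279_783, 98_616),
    ("Carnivora EST", 428_476, 281_667, 207_799),
]


def percent_of_retrieved(count: int, retrieved: int) -> str:
    """Format count as a percentage of the retrieved sequences, two decimals."""
    return f"{100.0 * count / retrieved:.2f}%"


for source, retrieved, aligned, passed in transcript_sources:
    print(source,
          percent_of_retrieved(aligned, retrieved),
          percent_of_retrieved(passed, retrieved))
```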

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	964,698,701	83%	24%	212,853
SAMN07328124	whole blood (Eumetopias jubatus, male, SAMN07328124)	94,859,506	86%	24%	164,164
SAMN07328125	whole blood (Eumetopias jubatus, male, SAMN07328125)	24,443,734	56%	12%	77,398
SAMN07328126	whole blood (Eumetopias jubatus, male, SAMN07328126)	28,089,509	58%	12%	93,227
SAMN07328127	whole blood (Eumetopias jubatus, male, SAMN07328127)	30,148,916	46%	8%	56,528
SAMN07328128	whole blood (Eumetopias jubatus, male, SAMN07328128)	13,495,544	55%	7%	34,964
SAMN07328129	whole blood (Eumetopias jubatus, male, SAMN07328129)	95,285,296	87%	28%	151,357
SAMN07328130	whole blood (Eumetopias jubatus, male, SAMN07328130)	90,479,544	84%	22%	170,608
SAMN07328131	whole blood (Eumetopias jubatus, male, SAMN07328131)	97,976,398	85%	25%	160,374
SAMN07328132	whole blood (Eumetopias jubatus, male, SAMN07328132)	102,873,008	88%	28%	154,360
SAMN07328133	whole blood (Eumetopias jubatus, male, SAMN07328133)	98,695,474	89%	27%	157,535
SAMN07328134	whole blood (Eumetopias jubatus, male, SAMN07328134)	98,179,534	85%	23%	175,184
SAMN07328135	whole blood (Eumetopias jubatus, male, SAMN07328135)	102,167,302	87%	24%	167,018
SAMN07328136	whole blood (Eumetopias jubatus, male, SAMN07328136)	88,004,936	85%	24%	160,191

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR5809380	SRX2987952	SRP111289	SAMN07328124	94,859,506	86%	24%
SRR5809394	SRX2987964	SRP111289	SAMN07328125	24,443,734	56%	12%
SRR5809393	SRX2987963	SRP111289	SAMN07328126	28,089,509	58%	12%
SRR5809391	SRX2987962	SRP111289	SAMN07328127	16,139,449	47%	8%
SRR5809392	SRX2987962	SRP111289	SAMN07328127	14,009,467	46%	8%
SRR5809389	SRX2987961	SRP111289	SAMN07328128	4,570,718	56%	7%
SRR5809390	SRX2987961	SRP111289	SAMN07328128	8,924,826	54%	7%
SRR5809388	SRX2987960	SRP111289	SAMN07328129	95,285,296	87%	28%
SRR5809387	SRX2987959	SRP111289	SAMN07328130	90,479,544	84%	22%
SRR5809386	SRX2987958	SRP111289	SAMN07328131	97,976,398	85%	25%
SRR5809385	SRX2987957	SRP111289	SAMN07328132	102,873,008	88%	28%
SRR5809384	SRX2987956	SRP111289	SAMN07328133	98,695,474	89%	27%
SRR5809383	SRX2987955	SRP111289	SAMN07328134	98,179,534	85%	23%
SRR5809382	SRX2987954	SRP111289	SAMN07328135	102,167,302	87%	24%
SRR5809381	SRX2987953	SRP111289	SAMN07328136	88,004,936	85%	24%
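
Two samples (SAMN07328127 and SAMN07328128) have two runs each, and their per-sample read counts correspond to the sum of the per-run counts. A minimal sketch of that aggregation follows, using only the four relevant rows from the table above; the variable names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads) for the two samples with more than one run.
runs = [
    ("SRR5809391", "SAMN07328127", 16_139_449),
    ("SRR5809392", "SAMN07328127", 14_009_467),
    ("SRR5809389", "SAMN07328128", 4_570_718),
    ("SRR5809390", "SAMN07328128", 8_924_826),
]

# Sum read counts over all runs belonging to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, n_reads in runs:
    reads_per_sample[sample] += n_reads

print(reads_per_sample["SAMN07328127"])  # 30148916
print(reads_per_sample["SAMN07328128"])  # 13495544
```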

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Ursus arctos horribilis high-quality model RefSeq (XP_)	14,931	14,848 (99.44%)	14,848 (99.44%)	80.18%	88.39%
Neomonachus schauinslandi high-quality model RefSeq (XP_)	8,834	8,778 (99.37%)	8,778 (99.37%)	82.96%	90.05%
Carnivora GenBank	5,116	4,974 (97.22%)	4,974 (97.22%)	80.13%	88.51%
Carnivora known RefSeq (NP_)	2,377	2,352 (98.95%)	2,352 (98.95%)	80.91%	89.96%
Same-species GenBank	8	8 (100.00%)	8 (100.00%)	88.18%	83.06%
Enhydra lutris kenyoni high-quality model RefSeq (XP_)	11,211	11,148 (99.44%)	11,148 (99.44%)	81.04%	89.50%
Homo sapiens known RefSeq (NP_)	52,516	51,523 (98.11%)	51,523 (98.11%)	77.26%	84.37%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
