NCBI Scyliorhinus canicula Annotation Release 100

The RefSeq genome records for Scyliorhinus canicula were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Scyliorhinus canicula Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Dec 29 2020
Date of submission of annotation to the public databases: Jan 6 2021
Software version: 8.5

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
sScyCan1.1	GCF_902713615.1	SC	12-17-2020	Reference	32 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	sScyCan1.1
Genes and pseudogenes	29,227
  protein-coding	20,574
  non-coding	7,157
  transcribed pseudogenes	3
  non-transcribed pseudogenes	1,259
  genes with variants	10,109
  immunoglobulin/T-cell receptor gene segments	234
  other	0
mRNAs	49,332
  fully-supported	46,794
  with > 5% ab initio	1,296
  partial	401
  with filled gap(s)	5
  known RefSeq (NM_)	0
  model RefSeq (XM_)	49,332
non-coding RNAs	9,509
  fully-supported	5,570
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	6,414
pseudo transcripts	3
  fully-supported	2
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	3
CDSs	49,579
  fully-supported	46,794
  with > 5% ab initio	1,512
  partial	409
  with major correction(s)	906
  known RefSeq (NP_)	13
  model RefSeq (XP_)	49,332
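
As a quick consistency check on the gene counts above, the gene categories other than "genes with variants" add up to the reported total of genes and pseudogenes. The sketch below only illustrates that arithmetic. Treating these categories as a partition of the total is an assumption drawn from the numbers, not something the report states, and the variable names are made up for this example.

```python
# Gene-level counts copied from the feature counts table for sScyCan1.1.
# "genes with variants" is left out because it overlaps the other categories.
gene_counts = {
    "protein-coding": 20_574,
    "non-coding": 7_157,
    "transcribed pseudogenes": 3,
    "non-transcribed pseudogenes": 1_259,
    "immunoglobulin/T-cell receptor gene segments": 234,
    "other": 0,
}
reported_total = 29_227  # the "Genes and pseudogenes" row

category_sum = sum(gene_counts.values())
print(category_sum, category_sum == reported_total)
```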

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	27,731	78,823	27,993	62	4,148,867
All transcripts	58,841	3,076	2,433	53	104,710
  mRNA	49,332	3,489	2,793	87	104,710
  misc_RNA	1,560	2,792	2,218	54	19,721
  tRNA	3,093	74	73	67	88
  lncRNA	4,010	1,039	731	53	10,074
  snoRNA	220	124	125	64	318
  snRNA	574	160	164	62	200
  guide_RNA	12	183	161	87	336
  rRNA	40	386	119	119	8,318
Single-exon transcripts	1,247	1,404	1,103	162	11,091
  coding transcripts (NM_/XM_)	1,247	1,404	1,103	162	11,091
CDSs	49,345	2,074	1,473	87	103,434
Exons	250,963	293	137	1	22,888
  in coding transcripts (NM_/XM_)	238,545	290	137	1	22,888
  in non-coding transcripts (NR_/XR_)	22,566	277	139	2	9,462
Introns	227,311	12,299	3,925	30	1,163,947
  in coding transcripts (NM_/XM_)	218,758	12,268	3,915	30	1,163,947
  in non-coding transcripts (NR_/XR_)	18,221	11,485	3,989	30	869,288

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.26	1	1	50
Number of exons per transcript	12.08	9	1	289
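
The gap between the mean and the median number of transcripts per gene suggests a skewed distribution in which most genes have a single transcript and a few have many. The following minimal sketch shows how such summary statistics could be computed. The input list is hypothetical toy data, not taken from this annotation, and `summarize` is a helper name invented for this example.

```python
from statistics import mean, median

def summarize(counts):
    """Return (mean, median, min, max) for a list of per-item counts."""
    return (round(mean(counts), 2), median(counts), min(counts), max(counts))

# Hypothetical toy data: number of transcripts for each of seven genes.
transcripts_per_gene = [1, 1, 1, 1, 2, 3, 7]
summary = summarize(transcripts_per_gene)
print(summary)  # (2.29, 1, 1, 7)
```

Here the median stays at 1 while a single gene with seven transcripts pulls the mean above 2, the same pattern seen in the table.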

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 20,561 coding genes, 19,315 genes had a protein with an alignment covering 50% or more of the query and 10,260 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
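
The coverage definitions and the cumulative counts described in this section can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function names, the example lengths and the per-gene coverage list are all hypothetical. Only the three gene counts in the last lines come from the report.

```python
def coverage(aligned_length, sequence_length):
    """Percentage of a sequence that is included in an alignment."""
    return 100.0 * aligned_length / sequence_length

def count_at_or_above(coverages, threshold):
    """Number of genes whose coverage is at or above the threshold."""
    return sum(1 for c in coverages if c >= threshold)

# Hypothetical alignment: 380 aligned residues on a 400-residue annotated
# protein (query) and 380 aligned residues on a 500-residue target protein.
query_cov = coverage(380, 400)
target_cov = coverage(380, 500)
print(query_cov, target_cov)

# Hypothetical per-gene query coverages, one value per gene.
per_gene_query_cov = [100.0, 97.5, 80.0, 62.0, 49.9, 10.0]
for threshold in (50, 95):
    print(threshold, count_at_or_above(per_gene_query_cov, threshold))

# Fractions of coding genes implied by the counts quoted in the report.
pct_50 = round(100 * 19_315 / 20_561, 1)
pct_95 = round(100 * 10_260 / 20_561, 1)
print(pct_50, pct_95)
```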

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
sScyCan1.1	GCF_902713615.1	1.69%	50.63%
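
For readers who want to estimate a masked percentage themselves, the sketch below counts lowercase bases in a sequence. It assumes a soft-masked sequence in which masked bases are written in lowercase. The report does not say how its percentages are computed (for example, how gaps or N bases are handled), so this will not necessarily reproduce the figures above. The function name and the toy sequence are hypothetical.

```python
def masked_percent(sequence):
    """Percentage of bases written in lowercase (assumed soft-masked)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)

# Hypothetical 10-base toy sequence with 4 soft-masked bases.
toy_sequence = "ACGTacgtAC"
print(masked_percent(toy_sequence))  # 40.0
```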

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	582	578 (99.31%)	521 (89.52%)	99.10%	96.71%
Same-species EST	1,600	1,574 (98.38%)	1,556 (97.25%)	99.20%	99.32%
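
The "Number (%)" cells in the table above combine a count with a percentage of the sequences retrieved from Entrez. A small sketch of how such a cell could be parsed and its percentage recomputed is shown below. The helper name and the regular expression are assumptions made for this example. The values are copied from the Same-species Genbank row.

```python
import re

def parse_count_percent(cell):
    """Split a cell such as '578 (99.31%)' into (578, 99.31)."""
    match = re.fullmatch(r"([\d,]+) \(([\d.]+)%\)", cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    return count, float(match.group(2))

retrieved = 582  # sequences retrieved from Entrez
aligned, reported_pct = parse_count_percent("578 (99.31%)")

# Recompute the percentage from the raw counts and compare it with the table.
recomputed_pct = round(100 * aligned / retrieved, 2)
print(aligned, reported_pct, recomputed_pct)
```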

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,856,428,927	69%	28%	345,451
SAMN00188954	22174244,25309735	Scyliorhinus canicula pooled Stage 24-30 embryos (Scyliorhinus canicula, SAMN00188954)	50,760,902	58%	17%	172,642
SAMN02910595	NA	Liver (Scyliorhinus canicula, Adult, male, SAMN02910595)	64,212,636	82%	29%	119,953
SAMN02910596	NA	Brain (Scyliorhinus canicula, Adult, male, SAMN02910596)	24,403,364	77%	22%	166,715
SAMN02910597	NA	Pancreas (Scyliorhinus canicula, Adult, male, SAMN02910597)	12,520,796	55%	26%	55,493
SAMN06234165	NA	developing tissue from jaw/mandible (Scyliorhinus canicula, not collected, not collected, SAMN06234165)	145,444,244	68%	8%	165,523
SAMN06234166	NA	developing tissue from jaw/mandible (Scyliorhinus canicula, not collected, not collected, SAMN06234166)	144,787,984	65%	9%	165,044
SAMN10258649	NA	Multi-tissue: Brain, stomach, liver, spleen, gill, spiral valve, epigonal, leidig (Scyliorhinus canicula, 3 years old, female, SAMN10258649)	134,841,605	75%	41%	214,702
SAMN10397604	NA	testis (Scyliorhinus canicula, male, SAMN10397604)	115,395,166	73%	25%	230,642
SAMN10397675	NA	ovary (Scyliorhinus canicula, female, SAMN10397675)	52,994,410	76%	25%	190,374
SAMN11166960	NA	abdomen section including vertebral column (Scyliorhinus canicula, 3.5 months, not determined, SAMN11166960)	82,337,490	82%	33%	172,484
SAMN16422658	NA	Retina (Scyliorhinus canicula, SAMN16422658)	57,437,588	66%	31%	180,308
SAMN16422659	NA	Retina (Scyliorhinus canicula, SAMN16422659)	102,779,602	62%	38%	182,543
SAMN16422660	NA	Retina (Scyliorhinus canicula, SAMN16422660)	94,726,476	68%	35%	191,649
SAMN16422661	NA	Retina (Scyliorhinus canicula, SAMN16422661)	96,111,632	73%	37%	179,725
SAMN16422662	NA	Retina (Scyliorhinus canicula, SAMN16422662)	98,049,568	69%	39%	177,061
SAMN16422663	NA	Retina (Scyliorhinus canicula, SAMN16422663)	86,697,580	68%	36%	190,257
SAMN16422664	NA	Retina (Scyliorhinus canicula, SAMN16422664)	77,826,586	66%	27%	190,203
SAMN16422665	NA	Retina (Scyliorhinus canicula, SAMN16422665)	109,717,390	72%	29%	203,506
SAMN16422666	NA	Retina (Scyliorhinus canicula, SAMN16422666)	82,710,258	64%	29%	191,452
SAMN16422667	NA	Retina (Scyliorhinus canicula, SAMN16422667)	70,232,956	61%	27%	180,652
SAMN16422668	NA	Retina (Scyliorhinus canicula, SAMN16422668)	65,306,828	58%	28%	179,933
SAMN16422669	NA	Retina (Scyliorhinus canicula, SAMN16422669)	87,133,866	65%	28%	194,396

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR088621	SRX036537	SRP004911	SAMN00188954	15,549,984	57%	17%
SRR088622	SRX036537	SRP004911	SAMN00188954	15,341,425	58%	17%
SRR088623	SRX036537	SRP004911	SAMN00188954	19,869,493	60%	17%
SRR1514129	SRX651773	SRP044283	SAMN02910595	64,212,636	82%	29%
SRR1514130	SRX651774	SRP044283	SAMN02910596	24,403,364	77%	22%
SRR1514131	SRX651775	SRP044283	SAMN02910597	12,520,796	55%	26%
SRR5179117	SRX2495301	SRP095206	SAMN06234165	145,444,244	68%	8%
SRR5179116	SRX2495300	SRP095206	SAMN06234166	144,787,984	65%	9%
SRR8077742	SRX4904998	SRP166123	SAMN10258649	134,841,605	75%	41%
SRR8179291	SRX4999362	SRP168359	SAMN10397604	65,299,964	73%	27%
SRR8179290	SRX4999363	SRP168359	SAMN10397604	50,095,202	72%	23%
SRR8179289	SRX4999364	SRP168359	SAMN10397675	52,994,410	76%	25%
SRR8753342	SRX5544262	SRP188860	SAMN11166960	82,337,490	82%	33%
SRR12813958	SRX9282058	SRP287158	SAMN16422658	57,437,588	66%	31%
SRR12813957	SRX9282059	SRP287158	SAMN16422659	102,779,602	62%	38%
SRR12813954	SRX9282062	SRP287158	SAMN16422660	94,726,476	68%	35%
SRR12813953	SRX9282063	SRP287158	SAMN16422661	96,111,632	73%	37%
SRR12813952	SRX9282064	SRP287158	SAMN16422662	98,049,568	69%	39%
SRR12813951	SRX9282065	SRP287158	SAMN16422663	86,697,580	68%	36%
SRR12813950	SRX9282066	SRP287158	SAMN16422664	77,826,586	66%	27%
SRR12813949	SRX9282067	SRP287158	SAMN16422665	109,717,390	72%	29%
SRR12813948	SRX9282068	SRP287158	SAMN16422666	82,710,258	64%	29%
SRR12813947	SRX9282069	SRP287158	SAMN16422667	70,232,956	61%	27%
SRR12813956	SRX9282060	SRP287158	SAMN16422668	65,306,828	58%	28%
SRR12813955	SRX9282061	SRP287158	SAMN16422669	87,133,866	65%	28%
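
The by-sample read counts can be recovered from the by-run table by summing the runs that share a sample. The sketch below does this for the two samples that have more than one run. The read counts are copied from the table above, while the variable names are made up for this example.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the by-run table.
runs = [
    ("SRR088621", "SAMN00188954", 15_549_984),
    ("SRR088622", "SAMN00188954", 15_341_425),
    ("SRR088623", "SAMN00188954", 19_869_493),
    ("SRR8179291", "SAMN10397604", 65_299_964),
    ("SRR8179290", "SAMN10397604", 50_095_202),
]

# Sum the reads of all runs belonging to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, total in reads_per_sample.items():
    print(sample, f"{total:,}")
```

The totals match the by-sample table: 50,760,902 reads for SAMN00188954 and 115,395,166 reads for SAMN10397604.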

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Scleropages formosus high-quality model RefSeq (XP_)	17,879	16,321 (91.29%)	16,321 (91.29%)	67.01%	71.80%
Amblyraja radiata high-quality model RefSeq (XP_)	13,167	12,821 (97.37%)	12,821 (97.37%)	72.95%	83.58%
Latimeria chalumnae high-quality model RefSeq (XP_)	9,807	9,042 (92.20%)	9,042 (92.20%)	67.89%	75.97%
Actinopterygii GenBank	87,054	63,911 (73.42%)	63,911 (73.42%)	67.39%	74.47%
Actinopterygii known RefSeq (NP_)	25,473	5,897 (23.15%)	5,897 (23.15%)	65.62%	70.70%
Lepisosteus oculatus high-quality model RefSeq (XP_)	13,124	12,150 (92.58%)	12,150 (92.58%)	67.15%	73.72%
Danio rerio high-quality model RefSeq (XP_)	7,718	6,742 (87.35%)	6,742 (87.35%)	64.50%	66.45%
Xenopus tropicalis high-quality model RefSeq (XP_)	10,221	8,870 (86.78%)	8,870 (86.78%)	65.30%	70.15%
Xenopus tropicalis known RefSeq (NP_)	8,612	7,839 (91.02%)	7,839 (91.02%)	67.96%	76.08%
Homo sapiens known RefSeq (NP_)	60,884	39,057 (64.15%)	39,057 (64.15%)	67.87%	71.17%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
