NCBI Cervus canadensis Annotation Release 100

The RefSeq genome records for Cervus canadensis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Cervus canadensis Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Aug 31 2021
Date of submission of annotation to the public databases: Sep 6 2021
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM1932006v1	GCF_019320065.1	Iowa State University	07-26-2021	Reference	35 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ASM1932006v1
Genes and pseudogenes	35,756
  protein-coding	22,348
  non-coding	9,164
  Transcribed pseudogenes	221
  Non-transcribed pseudogenes	3,816
  genes with variants	11,418
  Immunoglobulin/T-cell receptor gene segments	207
  other	0
mRNAs	56,132
  fully-supported	54,330
  with > 5% ab initio	964
  partial	70
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	56,132
non-coding RNAs	12,597
  fully-supported	7,623
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	10,493
pseudo transcripts	221
  fully-supported	189
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	221
CDSs	56,339
  fully-supported	54,330
  with > 5% ab initio	1,085
  partial	70
  with major correction(s)	609
  known RefSeq (NP_)	0
  model RefSeq (XP_)	56,132
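
As a quick sanity check on the counts above, the sketch below recomputes the "Genes and pseudogenes" total from its sub-categories. It assumes that the total is the sum of the mutually exclusive categories and that "genes with variants" is an overlapping subset that should not be added; that reading is an interpretation of the table, not something the report states.

```python
# Counts copied from the ASM1932006v1 column of the feature-count table.
# Assumption: these categories are mutually exclusive and add up to the total;
# "genes with variants" (11,418) is treated as an overlapping subset and left out.
gene_categories = {
    "protein-coding": 22_348,
    "non-coding": 9_164,
    "transcribed pseudogenes": 221,
    "non-transcribed pseudogenes": 3_816,
    "Ig/TCR gene segments": 207,
    "other": 0,
}
reported_total = 35_756  # "Genes and pseudogenes" row

computed_total = sum(gene_categories.values())
print(f"computed={computed_total:,} reported={reported_total:,} "
      f"match={computed_total == reported_total}")
```

Under that assumption the recomputed total agrees with the reported figure.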

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	31,512	38,264	9,479	56	2,532,480
All transcripts	68,729	3,125	2,458	56	28,382
  mRNA	56,132	3,567	2,871	102	28,382
  misc_RNA	1,888	2,924	2,335	148	16,044
  tRNA	2,104	73	73	69	87
  lncRNA	5,735	1,467	800	80	18,182
  snoRNA	847	118	125	56	322
  snRNA	1,908	113	107	61	200
  guide_RNA	57	167	136	78	446
  rRNA	58	1,679	153	119	4,795
Single-exon transcripts	3,292	1,292	951	102	16,261
  coding transcripts (NM_/XM_)	3,291	1,292	951	102	16,261
  non-coding transcripts (NR_/XR_)	1	384	384	384	384
CDSs	56,132	2,004	1,455	96	26,742
Exons	261,788	322	139	1	20,272
  in coding transcripts (NM_/XM_)	243,251	316	139	1	20,272
  in non-coding transcripts (NR_/XR_)	27,977	331	135	2	15,408
Introns	231,882	6,692	1,480	30	1,138,658
  in coding transcripts (NM_/XM_)	218,522	6,547	1,444	30	1,138,658
  in non-coding transcripts (NR_/XR_)	22,467	7,620	1,837	30	869,514

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.27	1	1	50
Number of exons per transcript	11.28	8	1	173

BUSCO analysis of gene annotation

BUSCO v4.1.4 (Simão et al. 2015, PMID: 26059717) was run in "protein" mode on the annotated gene set, using the longest protein of each gene and the cetartiodactyla_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit, and presented in BUSCO notation (C:complete [S:single-copy, D:duplicated], F:fragmented, M:missing, n:number of genes used).
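
For readers who want to process a summary written in the BUSCO notation described above, here is a minimal parsing sketch. The function name `parse_busco` and the example string are hypothetical; the numbers in the example are made up for illustration and are not the results of this annotation.

```python
import re

# Pattern for the one-line BUSCO notation: C:..%[S:..%,D:..%],F:..%,M:..%,n:..
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco(summary: str) -> dict:
    """Return the percentages (floats) and n (int) from a BUSCO summary line."""
    match = BUSCO_RE.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    fields = match.groupdict()
    n = int(fields.pop("n"))
    result = {key: float(value) for key, value in fields.items()}
    result["n"] = n
    return result

# Made-up example values, NOT taken from this report.
example = "C:90.0%[S:88.0%,D:2.0%],F:4.0%,M:6.0%,n:1000"
parsed = parse_busco(example)
print(parsed)
```

In this notation the complete fraction is the sum of the single-copy and duplicated fractions, which makes a convenient consistency check on parsed values.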

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 22,348 coding genes, 21,817 had a protein with an alignment covering 50% or more of the query, and 18,823 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
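
The coverage definitions above can be sketched as a small helper. This is an illustration only: `coverage_pct` is a hypothetical name, and it assumes a single contiguous alignment span given in 1-based inclusive coordinates, which is simpler than how a real BLASTP hit with gaps or multiple segments would be summarized. The last lines convert the gene counts reported above into percentages of the 22,348 coding genes.

```python
def coverage_pct(aligned_start: int, aligned_end: int, seq_length: int) -> float:
    """Percentage of a sequence included in an alignment.

    Assumes one contiguous aligned span, 1-based inclusive coordinates.
    """
    if seq_length <= 0:
        raise ValueError("sequence length must be positive")
    if not 1 <= aligned_start <= aligned_end <= seq_length:
        raise ValueError("aligned span must lie within the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / seq_length

# Hypothetical hit: residues 11..390 of a 400-aa annotated protein (query)
# aligned to residues 1..380 of a 500-aa curated protein (target).
query_cov = coverage_pct(11, 390, 400)
target_cov = coverage_pct(1, 380, 500)
print(query_cov, target_cov)

# Gene counts from the report, expressed as a share of all coding genes.
coding_genes = 22_348
share_cov50 = round(100 * 21_817 / coding_genes, 2)
share_cov95 = round(100 * 18_823 / coding_genes, 2)
print(share_cov50, share_cov95)
```

The same alignment can therefore have a high query coverage and a much lower target coverage when the annotated protein is shorter than its curated counterpart.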

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
ASM1932006v1	GCF_019320065.1	37.29%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	7	7 (100.00%)	7 (100.00%)	99.45%	99.97%
Same-species EST	237	224 (94.51%)	215 (90.72%)	99.71%	99.46%
Homo sapiens known RefSeq (NM_/NR_)	81,350	67,294 (82.72%)	15,880 (19.52%)	89.17%	81.50%
Homo sapiens GenBank	341,681	151,857 (44.44%)	51,597 (15.10%)	89.78%	89.33%
Bos taurus known RefSeq (NM_/NR_)	15,339	14,936 (97.37%)	12,394 (80.80%)	95.58%	99.15%
Bos taurus GenBank	19,974	18,896 (94.60%)	13,274 (66.46%)	95.19%	98.61%
Bos taurus EST	1,583,270	1,350,528 (85.30%)	1,168,658 (73.81%)	94.72%	98.49%
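
The "Number (%)" cells in the table above can be read programmatically with a sketch like the one below, which uses the Homo sapiens known RefSeq row as input. The helper name `parse_count_pct` is hypothetical. Recomputing the percentages suggests that both the aligned and the passed-to-Gnomon percentages are relative to the number of sequences retrieved from Entrez, although the report does not state the denominator explicitly.

```python
import re

def parse_count_pct(cell: str) -> tuple[int, float]:
    """Split a cell such as '67,294 (82.72%)' into (count, percent)."""
    match = re.fullmatch(r"([\d,]+)\s*\(([\d.]+)%\)", cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))

# Values copied from the 'Homo sapiens known RefSeq (NM_/NR_)' row.
retrieved = int("81,350".replace(",", ""))
aligned, aligned_pct = parse_count_pct("67,294 (82.72%)")
passed, passed_pct = parse_count_pct("15,880 (19.52%)")

# Recompute both percentages with the retrieved count as the denominator.
recomputed_aligned = round(100 * aligned / retrieved, 2)
recomputed_passed = round(100 * passed / retrieved, 2)
print(recomputed_aligned, aligned_pct)
print(recomputed_passed, passed_pct)
```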

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	3,002,928,140	93%	37%	394,100
SAMN02727543	NA	pedicle periosteum (Cervus nippon, 2, male, SAMN02727543)	54,381,996	93%	18%	168,607
SAMN02727547	NA	pedicle periosteum (Cervus nippon, 2, male, SAMN02727547)	51,843,164	93%	18%	167,823
SAMN07187858	NA	blood (Cervus elaphus, not collected, SAMN07187858)	23,831,096	91%	20%	128,409
SAMN07187859	NA	blood (Cervus nippon, not collected, SAMN07187859)	22,963,318	90%	19%	139,403
SAMN07673484	NA	cartilage bone (Cervus nippon hortulorum, 3, male, SAMN07673484)	659,886,034	94%	44%	249,394
SAMN10219010	31221828	Headgear (Cervus nippon, 1year, male, SAMN10219010)	59,132,574	94%	35%	115,963
SAMN10219011	31221828	Headgear (Cervus nippon, 1year, male, SAMN10219011)	44,428,358	94%	40%	169,768
SAMN10219012	31221828	Headgear (Cervus nippon, 1year, male, SAMN10219012)	54,386,770	93%	37%	187,421
SAMN10219013	31221828	Bone (Cervus nippon, 1year, male, SAMN10219013)	61,168,930	93%	41%	190,672
SAMN10219014	31221828	Skin (Cervus nippon, 1year, male, SAMN10219014)	54,149,842	93%	45%	185,140
SAMN10219015	31221828	Abomasum (Cervus nippon, 1year, male, SAMN10219015)	57,693,134	96%	36%	139,159
SAMN10219016	31221828	Intestines (Cervus nippon, 1year, male, SAMN10219016)	56,055,160	92%	40%	189,968
SAMN10219017	31221828	Esophagus (Cervus nippon, 1year, male, SAMN10219017)	59,563,016	93%	40%	190,538
SAMN10219018	31221828	Liver (Cervus nippon, 1year, male, SAMN10219018)	45,341,260	94%	47%	166,254
SAMN10219019	31221828	Forestomach (Cervus nippon, 1year, male, SAMN10219019)	64,362,238	93%	30%	182,182
SAMN10219020	31221828	Forestomach (Cervus nippon, 1year, male, SAMN10219020)	54,693,332	94%	37%	182,788
SAMN10219021	31221828	Adipose tissue (Cervus nippon, 1year, male, SAMN10219021)	49,073,902	93%	35%	191,486
SAMN10219022	31221828	Brain (Cervus nippon, 1year, male, SAMN10219022)	51,659,918	95%	20%	180,820
SAMN10219023	31221828	Brain (Cervus nippon, 1year, male, SAMN10219023)	51,832,938	95%	16%	173,599
SAMN10219024	31221828	Heart (Cervus nippon, 1year, male, SAMN10219024)	70,020,200	94%	34%	174,280
SAMN10219025	31221828	Kidney (Cervus nippon, 1year, male, SAMN10219025)	64,264,010	94%	31%	187,922
SAMN10219026	31221828	Lung (Cervus nippon, 1year, male, SAMN10219026)	50,555,896	94%	29%	157,217
SAMN10219027	31221828	Muscle (Cervus nippon, 1year, male, SAMN10219027)	62,520,930	94%	41%	176,069
SAMN10219028	31221828	Spleen (Cervus nippon, 1year, male, SAMN10219028)	63,002,096	92%	37%	194,185
SAMN10219029	31221828	Testis (Cervus nippon, 1year, male, SAMN10219029)	56,621,380	94%	40%	204,645
SAMN12179201	NA	antler (Cervus nippon, 5, male, SAMN12179201)	97,031,530	93%	35%	187,834
SAMN12179202	NA	antler (Cervus nippon, 5, male, SAMN12179202)	87,829,286	93%	34%	181,947
SAMN12179203	NA	antler (Cervus nippon, 5, male, SAMN12179203)	85,749,416	93%	34%	186,829
SAMN12179204	NA	antler (Cervus nippon, 5, male, SAMN12179204)	106,246,500	93%	35%	187,200
SAMN12179205	NA	antler (Cervus nippon, 5, male, SAMN12179205)	94,736,132	93%	39%	177,605
SAMN12179206	NA	antler (Cervus nippon, 5, male, SAMN12179206)	86,053,716	93%	37%	180,352
SAMN12179207	NA	antler (Cervus nippon, 5, male, SAMN12179207)	89,552,202	93%	38%	179,168
SAMN12179208	NA	antler (Cervus nippon, 5, male, SAMN12179208)	99,362,848	93%	37%	183,857
SAMN12179209	NA	antler (Cervus nippon, 5, male, SAMN12179209)	96,299,236	93%	37%	184,425
SAMN13819016	NA	Msc_4 (Cervus nippon, SAMN13819016)	53,124,512	92%	41%	163,911
SAMN13819017	NA	Msc_2 (Cervus nippon, SAMN13819017)	55,707,792	91%	38%	169,461
SAMN13819018	NA	Msc_1 (Cervus nippon, SAMN13819018)	53,912,470	90%	40%	164,464
SAMN13819019	NA	Msc_3 (Cervus nippon, SAMN13819019)	53,891,008	93%	41%	161,516

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR1238080	SRX516823	SRP041164	SAMN02727543	54,381,996	93%	18%
SRR1238083	SRX516824	SRP041164	SAMN02727547	51,843,164	93%	18%
SRR5642294	SRX2880481	SRP108483	SAMN07187858	23,831,096	91%	20%
SRR5642321	SRX2880508	SRP108483	SAMN07187859	22,963,318	90%	19%
SRR6202157	SRX3312069	SRP120928	SAMN07673484	29,826,686	93%	42%
SRR6202156	SRX3312070	SRP120928	SAMN07673484	28,749,662	94%	45%
SRR6202155	SRX3312071	SRP120928	SAMN07673484	54,112,270	93%	42%
SRR6202154	SRX3312072	SRP120928	SAMN07673484	62,058,040	93%	42%
SRR6202153	SRX3312073	SRP120928	SAMN07673484	51,173,254	94%	40%
SRR6202152	SRX3312074	SRP120928	SAMN07673484	45,362,014	94%	43%
SRR6202151	SRX3312075	SRP120928	SAMN07673484	60,520,400	94%	41%
SRR6202150	SRX3312076	SRP120928	SAMN07673484	54,501,768	94%	40%
SRR6202149	SRX3312077	SRP120928	SAMN07673484	38,648,870	94%	48%
SRR6202148	SRX3312078	SRP120928	SAMN07673484	33,299,322	94%	49%
SRR6202147	SRX3312079	SRP120928	SAMN07673484	46,163,234	93%	48%
SRR6202146	SRX3312080	SRP120928	SAMN07673484	51,215,288	93%	46%
SRR6202145	SRX3312081	SRP120928	SAMN07673484	34,110,582	94%	47%
SRR6202144	SRX3312082	SRP120928	SAMN07673484	38,793,220	94%	43%
SRR6202143	SRX3312083	SRP120928	SAMN07673484	31,351,424	93%	44%
SRR8002926	SRX4833803	SRP135837	SAMN10219010	59,132,574	94%	35%
SRR8002929	SRX4833800	SRP135837	SAMN10219011	44,428,358	94%	40%
SRR8002928	SRX4833801	SRP135837	SAMN10219012	54,386,770	93%	37%
SRR8002921	SRX4833808	SRP135837	SAMN10219013	61,168,930	93%	41%
SRR8002920	SRX4833809	SRP135837	SAMN10219014	54,149,842	93%	45%
SRR8002959	SRX4833770	SRP135837	SAMN10219015	57,693,134	96%	36%
SRR8002960	SRX4833769	SRP135837	SAMN10219016	56,055,160	92%	40%
SRR8002961	SRX4833768	SRP135837	SAMN10219017	59,563,016	93%	40%
SRR8002962	SRX4833767	SRP135837	SAMN10219018	45,341,260	94%	47%
SRR8002963	SRX4833766	SRP135837	SAMN10219019	64,362,238	93%	30%
SRR8002964	SRX4833765	SRP135837	SAMN10219020	54,693,332	94%	37%
SRR8002965	SRX4833764	SRP135837	SAMN10219021	49,073,902	93%	35%
SRR8002966	SRX4833763	SRP135837	SAMN10219022	51,659,918	95%	20%
SRR8002956	SRX4833773	SRP135837	SAMN10219023	51,832,938	95%	16%
SRR8002957	SRX4833772	SRP135837	SAMN10219024	70,020,200	94%	34%
SRR8002945	SRX4833784	SRP135837	SAMN10219025	64,264,010	94%	31%
SRR8002944	SRX4833785	SRP135837	SAMN10219026	50,555,896	94%	29%
SRR8002943	SRX4833786	SRP135837	SAMN10219027	62,520,930	94%	41%
SRR8002942	SRX4833787	SRP135837	SAMN10219028	63,002,096	92%	37%
SRR8002919	SRX4833810	SRP135837	SAMN10219029	56,621,380	94%	40%
SRR9618241	SRX6381282	SRP212520	SAMN12179201	97,031,530	93%	35%
SRR9618242	SRX6381281	SRP212520	SAMN12179202	87,829,286	93%	34%
SRR9618239	SRX6381284	SRP212520	SAMN12179203	85,749,416	93%	34%
SRR9618240	SRX6381283	SRP212520	SAMN12179204	106,246,500	93%	35%
SRR9618237	SRX6381286	SRP212520	SAMN12179205	94,736,132	93%	39%
SRR9618238	SRX6381285	SRP212520	SAMN12179206	86,053,716	93%	37%
SRR9618235	SRX6381288	SRP212520	SAMN12179207	89,552,202	93%	38%
SRR9618236	SRX6381287	SRP212520	SAMN12179208	99,362,848	93%	37%
SRR9618243	SRX6381280	SRP212520	SAMN12179209	96,299,236	93%	37%
SRR10867785	SRX7537808	SRP241149	SAMN13819016	53,124,512	92%	41%
SRR10867783	SRX7537806	SRP241149	SAMN13819017	55,707,792	91%	38%
SRR10867782	SRX7537805	SRP241149	SAMN13819018	53,912,470	90%	40%
SRR10867784	SRX7537807	SRP241149	SAMN13819019	53,891,008	93%	41%
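
The per-run and per-sample tables are linked through the Sample column: a sample sequenced in several runs appears once in the per-sample table with the summed read count. The sketch below illustrates this by grouping a subset of the runs listed above (all 15 runs of SAMN07673484 plus two single-run samples). The variable names are illustrative and not part of the report.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the per-run table.
runs = [
    ("SRR1238080", "SAMN02727543", 54_381_996),
    ("SRR5642294", "SAMN07187858", 23_831_096),
    ("SRR6202157", "SAMN07673484", 29_826_686),
    ("SRR6202156", "SAMN07673484", 28_749_662),
    ("SRR6202155", "SAMN07673484", 54_112_270),
    ("SRR6202154", "SAMN07673484", 62_058_040),
    ("SRR6202153", "SAMN07673484", 51_173_254),
    ("SRR6202152", "SAMN07673484", 45_362_014),
    ("SRR6202151", "SAMN07673484", 60_520_400),
    ("SRR6202150", "SAMN07673484", 54_501_768),
    ("SRR6202149", "SAMN07673484", 38_648_870),
    ("SRR6202148", "SAMN07673484", 33_299_322),
    ("SRR6202147", "SAMN07673484", 46_163_234),
    ("SRR6202146", "SAMN07673484", 51_215_288),
    ("SRR6202145", "SAMN07673484", 34_110_582),
    ("SRR6202144", "SAMN07673484", 38_793_220),
    ("SRR6202143", "SAMN07673484", 31_351_424),
]

# Sum the reads of every run that belongs to the same BioSample.
reads_per_sample = defaultdict(int)
for run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, total in sorted(reads_per_sample.items()):
    print(f"{sample}\t{total:,}")
```

The total obtained for SAMN07673484 equals the 659,886,034 reads listed for that sample in the per-sample table.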

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	7	7 (100.00%)	7 (100.00%)	71.48%	93.32%
Pecora GenBank	23,560	19,790 (84.00%)	19,790 (84.00%)	79.39%	88.29%
Homo sapiens known RefSeq (NP_)	62,730	45,751 (72.93%)	45,751 (72.93%)	78.45%	86.28%
Bovidae known RefSeq (NP_)	15,700	4,942 (31.48%)	4,942 (31.48%)	76.28%	93.06%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-100
