NCBI Equus przewalskii Annotation Release 100

The RefSeq genome records for Equus przewalskii were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Equus przewalskii Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jul 10 2014
Date of submission of annotation to the public databases: Jul 14 2014
Software version: 6.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Burgud	GCF_000696695.1	College of Animal Science, Inner Mongolian Agricultural University, China	06-05-2014	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Burgud
Genes and pseudogenes	28,150
protein-coding	21,552
non-coding	4,269
pseudogenes	2,329
genes with variants	7,443
mRNAs	38,403
fully-supported	33,389
with > 5% ab initio	1,776
partial	2,663
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	38,403
Other RNAs	7,032
fully-supported	6,601
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	6,603
CDSs	38,614
fully-supported	33,389
with > 5% ab initio	2,262
partial	2,693
with major correction(s)	1,805
known RefSeq (NP_)	0
model RefSeq (XP_)	38,403
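
As a quick consistency check on the table above, the three gene categories add up to the reported total for "Genes and pseudogenes". The following is a minimal Python sketch of that arithmetic; the dictionary keys and variable names are illustrative and are not part of the report.

```python
# Counts copied from the Feature counts table for assembly Burgud.
gene_counts = {
    "protein-coding": 21_552,
    "non-coding": 4_269,
    "pseudogenes": 2_329,
}
reported_total = 28_150  # "Genes and pseudogenes" row

# Sum of the three categories, to compare against the reported total.
computed_total = sum(gene_counts.values())

# Genes excluding pseudogenes.
genes_without_pseudogenes = (
    gene_counts["protein-coding"] + gene_counts["non-coding"]
)

print(computed_total == reported_total)
print(genes_without_pseudogenes)
```

The second value, 25,821, matches the gene count in the Feature lengths table, which suggests that table leaves out pseudogenes.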

Detailed reports

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	25,821	33,794	13,265	71	1,067,965
All transcripts	45,435	2,829	2,243	33	104,465
mRNA	38,403	3,037	2,444	78	104,465
misc_RNA	1,081	3,175	2,635	93	14,882
tRNA	429	74	73	71	97
lncRNA	5,522	1,528	916	33	19,992
Single-exon transcripts	1,874	1,308	948	99	11,900
coding transcripts (NM_/XM_ )	1,874	1,308	948	99	11,900
CDSs	38,403	1,731	1,281	73	103,431
Exons	233,568	304	135	1	21,443
in coding transcripts (NM_/XM_ )	216,638	293	134	1	21,443
in non-coding transcripts (NR_/XR_ )	23,086	377	135	2	12,790
Introns	206,034	4,617	1,350	30	674,893
in coding transcripts (NM_/XM_ )	194,274	4,583	1,337	30	674,893
in non-coding transcripts (NR_/XR_ )	17,716	4,718	1,520	30	419,715

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.76	1	1	35
Number of exons per transcript	10.03	7	1	317
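
The mean number of transcripts per gene can be reproduced from the Feature lengths table. The report does not state how it derives this figure, so the sketch below assumes it is the "All transcripts" count divided by the "Genes" count. The variable names are illustrative.

```python
# Counts copied from the Feature lengths table.
transcript_count = 45_435  # "All transcripts"
gene_count = 25_821        # "Genes"

# Assumed definition: total transcripts divided by total genes.
mean_transcripts_per_gene = transcript_count / gene_count

print(f"{mean_transcripts_per_gene:.2f}")  # prints 1.76
```

The same division does not apply to exons per transcript, because an exon shared by several transcripts is presumably counted once in the exon total.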

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Burgud	GCF_000696695.1	40.20%	27.86%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	2	2 (100.00%)	2 (100.00%)	96.00%	40.50%
Equus caballus known RefSeq (NM_/NR_)	1,186	1,146 (96.63%)	622 (52.45%)	99.40%	95.13%
Equus caballus GenBank	2,826	2,678 (94.76%)	1,676 (59.31%)	96.48%	84.90%
Equus caballus TSA	26,376	25,867 (98.07%)	23,553 (89.30%)	99.48%	99.16%
Equus caballus EST	37,755	33,711 (89.29%)	30,109 (79.75%)	99.12%	98.09%
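
The percentages in the two "Number (%)" columns appear to be computed relative to the number of sequences retrieved from Entrez. This is inferred from the table values and is not stated in the report. The Python sketch below parses a cell of the form `1,146 (96.63%)` and recomputes the percentage under that assumption. The helper names `parse_count_percent` and `percent_of` are hypothetical.

```python
import re
from typing import Tuple

# Matches cells such as "1,146 (96.63%)".
CELL = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_percent(cell: str) -> Tuple[int, float]:
    """Split a 'count (percent%)' cell into an integer and a float."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    percent = float(match.group(2))
    return count, percent


def percent_of(count: int, total: int) -> float:
    """Percentage of total, rounded to two decimals as in the table."""
    return round(100 * count / total, 2)


# Row: Equus caballus known RefSeq (NM_/NR_)
retrieved = 1_186
aligned, aligned_pct = parse_count_percent("1,146 (96.63%)")
passed, passed_pct = parse_count_percent("622 (52.45%)")

print(percent_of(aligned, retrieved), aligned_pct)
print(percent_of(passed, retrieved), passed_pct)
```

Under this assumption, the "passed to Gnomon" percentage uses the retrieved sequences as its denominator, not the aligned ones.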

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Sample Id	Track name	Number of reads	Number (%) of aligned reads	Number (%) spliced reads	Number of introns
All	Aggregate of all aligned samples	2,357,742,424	2,071,918,055 (87.88%)	343,687,063 (14.58%)	343,669
SAMEA2008636	LP/lp skin (Equus caballus, SAMEA2008636)	20,442,758	16,802,016 (82.19%)	3,300,198 (16.14%)	132,745
SAMEA2008637	LP/LP skin (Equus caballus, SAMEA2008637)	7,982,171	6,201,503 (77.69%)	1,283,702 (16.08%)	104,497
SAMEA2008638	lp/lp retina (Equus caballus, SAMEA2008638)	20,222,194	16,407,828 (81.14%)	2,365,211 (11.70%)	141,315
SAMEA2008639	LP/LP retina (Equus caballus, SAMEA2008639)	20,527,334	14,348,987 (69.90%)	2,452,306 (11.95%)	133,739
SAMEA2445337	RNA-seq (Equus caballus, SAMEA2445337)	152,265,046	145,620,635 (95.64%)	15,742,850 (10.34%)	170,989
SAMEA2452527	CU1 (Equus caballus, SAMEA2452527)	36,277,643	34,104,673 (94.01%)	7,919,320 (21.83%)	149,211
SAMEA2452528	CU18 (Equus caballus, SAMEA2452528)	43,422,463	41,171,812 (94.82%)	10,914,443 (25.14%)	152,182
SAMEA2452529	LSUJ (Equus caballus, SAMEA2452529)	33,278,897	31,456,177 (94.52%)	7,861,559 (23.62%)	145,721
SAMEA2452530	06-92 Pigmented Skin (Equus caballus, SAMEA2452530)	34,133,912	31,515,288 (92.33%)	7,002,485 (20.51%)	164,554
SAMEA2452531	06-92 Unpigmented Skin (Equus caballus, SAMEA2452531)	30,292,320	27,870,616 (92.01%)	5,877,500 (19.40%)	158,516
SAMEA2452532	D-052 Pigmented Skin (Equus caballus, SAMEA2452532)	41,083,310	38,293,453 (93.21%)	11,660,226 (28.38%)	150,462
SAMN00002024	Generic sample from Equus caballus (Equus caballus, SAMN00002024)	826,176	139,660 (16.90%)	31,859 (3.86%)	17,050
SAMN00198585	blood (Equus caballus, SAMN00198585)	490,408	352,620 (71.90%)	98,050 (19.99%)	28,523
SAMN00631156	chorionic girdle (Equus caballus, SAMN00631156)	82,501,499	63,166,214 (76.56%)	4,318,066 (5.23%)	120,801
SAMN00809361	invasive trophoblast tissue of the chorionic girdle (Equus caballus, SAMN00809361)	49,235,800	11,715,404 (23.79%)	2,428,320 (4.93%)	69,894
SAMN00809362	invasive trophoblast tissue of the chorionic girdle (Equus caballus, SAMN00809362)	33,360,169	24,320,708 (72.90%)	1,588,297 (4.76%)	88,258
SAMN00809363	invasive trophoblast tissue of the chorionic girdle (Equus caballus, SAMN00809363)	30,740,449	19,295,830 (62.77%)	1,314,951 (4.28%)	63,721
SAMN00809364	invasive trophoblast tissue of the chorionic girdle (Equus caballus, SAMN00809364)	31,613,568	23,741,897 (75.10%)	1,574,898 (4.98%)	80,388
SAMN00809365	invasive trophoblast tissue of the chorionic girdle (Equus caballus, SAMN00809365)	30,758,114	23,014,427 (74.82%)	1,591,090 (5.17%)	87,047
SAMN00991814	muscle (Equus caballus, male, SAMN00991814)	57,906,672	54,963,770 (94.92%)	9,815,142 (16.95%)	133,676
SAMN00991815	muscle (Equus caballus, male, SAMN00991815)	57,212,080	55,145,881 (96.39%)	11,046,952 (19.31%)	132,552
SAMN00991816	muscle (Equus caballus, male, SAMN00991816)	57,906,672	55,949,445 (96.62%)	9,596,671 (16.57%)	137,115
SAMN00991817	muscle (Equus caballus, male, SAMN00991817)	57,906,672	55,012,258 (95.00%)	9,460,314 (16.34%)	133,910
SAMN00991818	muscle (Equus caballus, female, SAMN00991818)	57,906,672	56,179,732 (97.02%)	10,502,316 (18.14%)	132,552
SAMN00991819	muscle (Equus caballus, female, SAMN00991819)	56,138,890	54,788,825 (97.60%)	7,576,000 (13.50%)	122,948
SAMN00991820	muscle (Equus caballus, male, SAMN00991820)	56,437,122	54,417,726 (96.42%)	9,528,486 (16.88%)	131,009
SAMN00991821	muscle (Equus caballus, male, SAMN00991821)	57,906,672	56,237,906 (97.12%)	6,783,571 (11.71%)	126,065
SAMN00991822	muscle (Equus caballus, male, SAMN00991822)	57,190,008	55,561,379 (97.15%)	6,306,541 (11.03%)	120,422
SAMN00991823	muscle (Equus caballus, male, SAMN00991823)	57,906,672	55,921,320 (96.57%)	8,854,285 (15.29%)	135,914
SAMN00991824	muscle (Equus caballus, female, SAMN00991824)	56,138,890	54,285,223 (96.70%)	9,132,149 (16.27%)	134,516
SAMN00991825	muscle (Equus caballus, female, SAMN00991825)	56,138,890	54,646,390 (97.34%)	6,466,108 (11.52%)	121,070
SAMN00991826	blood (Equus caballus, male, SAMN00991826)	53,155,556	48,807,912 (91.82%)	8,872,045 (16.69%)	147,262
SAMN00991827	blood (Equus caballus, male, SAMN00991827)	52,222,220	47,505,622 (90.97%)	8,625,535 (16.52%)	149,655
SAMN00991828	blood (Equus caballus, male, SAMN00991828)	52,222,220	48,003,397 (91.92%)	9,138,962 (17.50%)	150,792
SAMN00991829	blood (Equus caballus, male, SAMN00991829)	52,222,220	47,360,232 (90.69%)	8,424,155 (16.13%)	149,449
SAMN00991830	blood (Equus caballus, female, SAMN00991830)	52,022,222	47,668,536 (91.63%)	9,035,442 (17.37%)	153,372
SAMN00991831	blood (Equus caballus, female, SAMN00991831)	51,666,598	47,762,229 (92.44%)	9,025,019 (17.47%)	150,792
SAMN00991832	blood (Equus caballus, male, SAMN00991832)	52,222,220	47,514,943 (90.99%)	8,235,431 (15.77%)	143,881
SAMN00991833	blood (Equus caballus, male, SAMN00991833)	52,222,220	47,432,141 (90.83%)	8,295,634 (15.89%)	148,138
SAMN00991834	blood (Equus caballus, male, SAMN00991834)	53,155,556	48,458,967 (91.16%)	8,285,714 (15.59%)	145,263
SAMN00991835	blood (Equus caballus, male, SAMN00991835)	51,244,444	46,600,955 (90.94%)	7,970,227 (15.55%)	147,668
SAMN00991836	blood (Equus caballus, female, SAMN00991836)	50,015,638	42,919,413 (85.81%)	8,071,013 (16.14%)	156,242
SAMN00991837	blood (Equus caballus, female, SAMN00991837)	52,266,668	48,081,839 (91.99%)	9,218,749 (17.64%)	143,947
SAMN02142713	full thickness articular cartilage (Equus caballus, 3 years old, male, SAMN02142713)	18,111,101	12,233,078 (67.54%)	4,124,018 (22.77%)	119,828
SAMN02142714	full thickness articular cartilage (Equus caballus, 3 years old, male, SAMN02142714)	18,194,276	11,499,189 (63.20%)	3,629,922 (19.95%)	112,492
SAMN02142715	placental villous (Equus caballus, full term, female, SAMN02142715)	19,317,760	13,224,868 (68.46%)	3,110,374 (16.10%)	134,547
SAMN02142716	testes (Equus caballus, 4 years old, male, SAMN02142716)	19,839,518	13,858,144 (69.85%)	3,925,972 (19.79%)	160,685
SAMN02142717	cerebellum (Equus caballus, 2 years old, female, SAMN02142717)	19,293,725	12,420,554 (64.38%)	2,608,676 (13.52%)	138,115
SAMN02142718	synovial membrane (Equus caballus, 3 years old, male, SAMN02142718)	18,744,053	9,416,868 (50.24%)	2,684,896 (14.32%)	130,328
SAMN02142719	embryo (Equus caballus, 34 days gestation, male, SAMN02142719)	19,696,865	8,328,197 (42.28%)	2,159,220 (10.96%)	145,139
SAMN02142720	synovial membrane (Equus caballus, 3 years old, male, SAMN02142720)	19,233,722	7,680,795 (39.93%)	2,499,944 (13.00%)	117,760
SAMN02378296	inner cell mass (Equus caballus, SAMN02378296)	32,939,140	30,020,679 (91.14%)	2,787,521 (8.46%)	123,039
SAMN02378297	Trophectoderm (Equus caballus, SAMN02378297)	33,003,433	30,630,650 (92.81%)	3,805,949 (11.53%)	117,338
SAMN02378298	inner cell mass (Equus caballus, SAMN02378298)	31,241,965	29,159,723 (93.34%)	3,422,747 (10.96%)	108,744
SAMN02378299	Trophectoderm (Equus caballus, SAMN02378299)	24,713,427	23,473,531 (94.98%)	2,738,497 (11.08%)	111,851
SAMN02378300	inner cell mass (Equus caballus, SAMN02378300)	38,553,987	36,375,468 (94.35%)	4,554,051 (11.81%)	122,841
SAMN02378301	Trophectoderm (Equus caballus, SAMN02378301)	33,217,549	32,124,743 (96.71%)	3,740,164 (11.26%)	116,128
SAMN02378474	Tissue sample of Mongolian horse (Equus caballus, 8, female, SAMN02378474)	853,978	705,779 (82.65%)	303,320 (35.52%)	108,133
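
The track names in the table above follow a visible pattern: a free-text description followed by a parenthesized list that starts with the organism and ends with the BioSample accession, with optional attributes such as age or sex in between. The Python sketch below splits a track name on that pattern. The pattern is inferred from the rows shown here and is not a documented format. The function name `parse_track_name` and the dictionary keys are hypothetical.

```python
import re
from typing import Dict

# Description, then one trailing parenthesized, comma-separated list.
TRACK = re.compile(r"^(?P<description>.*) \((?P<inner>[^()]*)\)$")


def parse_track_name(track: str) -> Dict[str, object]:
    """Split a track name into description, organism, attributes, accession."""
    match = TRACK.match(track.strip())
    if match is None:
        raise ValueError(f"unexpected track name: {track!r}")
    parts = [part.strip() for part in match.group("inner").split(",")]
    if len(parts) < 2:
        raise ValueError(f"expected organism and accession: {track!r}")
    return {
        "description": match.group("description"),
        "organism": parts[0],
        "attributes": parts[1:-1],  # e.g. age or sex; may be empty
        "biosample": parts[-1],
    }


example = parse_track_name("muscle (Equus caballus, male, SAMN00991814)")
print(example)
```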

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Homo sapiens known RefSeq (NP_)	37,645	36,609 (97.25%)	36,609 (97.25%)	77.05%	77.75%
Equus caballus GenBank	1,229	1,091 (88.77%)	1,091 (88.77%)	77.59%	77.80%
Equus caballus known RefSeq (NP_)	836	818 (97.85%)	818 (97.85%)	78.16%	81.21%
Same-species GenBank	2	2 (100.00%)	2 (100.00%)	79.43%	81.86%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
