NCBI Cercocebus atys Annotation Release 100

The RefSeq genome records for Cercocebus atys were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Cercocebus atys Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Mar 23 2015
Date of submission of annotation to the public databases: Mar 30 2015
Software version: 6.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Caty_1.0	GCF_000955945.1	Human Genome Sequencing Center - BCM	03-19-2015	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Caty_1.0
Genes and pseudogenes	30,556
protein-coding	20,829
non-coding	4,464
pseudogenes	5,263
genes with variants	12,310
mRNAs	65,920
fully-supported	64,100
with > 5% ab initio	830
partial	256
with filled gap(s)	2
known RefSeq (NM_)	0
model RefSeq (XM_)	65,920
Other RNAs	10,000
fully-supported	9,633
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	9,634
CDSs	66,135
fully-supported	64,100
with > 5% ab initio	991
partial	261
with major correction(s)	1,508
known RefSeq (NP_)	0
model RefSeq (XP_)	65,920
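
The gene categories in the table above are additive, which offers a quick sanity check when this report is parsed programmatically. The sketch below is only an illustration: the values are copied from the Caty_1.0 column, while the dictionary keys and variable names are assumptions and do not correspond to any NCBI tool or file format.

```python
# Counts copied from the "Feature counts" table (Caty_1.0 column).
gene_counts = {
    "protein-coding": 20_829,
    "non-coding": 4_464,
    "pseudogenes": 5_263,
}
reported_total = 30_556  # "Genes and pseudogenes" row

# The three categories should add up to the reported total.
computed_total = sum(gene_counts.values())
assert computed_total == reported_total

# Protein-coding plus non-coding genes should match the "Genes" row
# (25,293) of the "Feature lengths" table, which excludes pseudogenes.
genes_without_pseudogenes = gene_counts["protein-coding"] + gene_counts["non-coding"]
assert genes_without_pseudogenes == 25_293

# Share of each category, as a percentage of all genes and pseudogenes.
shares = {name: round(100 * n / reported_total, 2) for name, n in gene_counts.items()}
print(computed_total, shares)
```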


Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	25,293	53,463	18,105	71	3,090,810
All transcripts	75,920	3,769	3,055	71	106,596
mRNA	65,920	3,947	3,231	117	106,596
misc_RNA	3,335	3,461	2,813	122	19,473
tRNA	366	74	73	71	86
lncRNA	6,299	2,281	1,419	90	24,952
Single-exon transcripts	1,575	1,892	1,259	159	13,582
coding transcripts (NM_/XM_)	1,564	1,895	1,261	159	13,582
non-coding transcripts (NR_/XR_)	11	1,504	1,241	253	3,103
CDSs	65,920	2,100	1,545	75	105,330
Exons	271,707	371	142	1	21,390
in coding transcripts (NM_/XM_)	250,660	352	141	1	21,390
in non-coding transcripts (NR_/XR_)	42,280	398	142	1	16,714
Introns	243,018	7,804	1,807	30	1,186,923
in coding transcripts (NM_/XM_)	228,044	7,617	1,777	30	1,186,923
in non-coding transcripts (NR_/XR_)	35,443	7,776	1,802	30	1,023,716

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	3.02	1	1	50
Number of exons per transcript	12.47	9	1	341

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 20,614 coding genes, 19,807 genes had a protein with an alignment covering 50% or more of the query and 17,295 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
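
The two coverage definitions can be illustrated with a small calculation. This is a minimal sketch that assumes a single contiguous alignment described by 1-based, inclusive coordinates; the report does not state how the pipeline treats gapped or multi-segment alignments, and the function name and example coordinates are hypothetical.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in one alignment.

    Coordinates are 1-based and inclusive (an assumption made for this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical example: a 500-residue annotated protein (query) aligned
# over residues 11-490 to residues 1-480 of a 640-residue curated protein (target).
query_coverage = coverage_percent(11, 490, 500)   # 480 of 500 residues
target_coverage = coverage_percent(1, 480, 640)   # 480 of 640 residues
print(f"query coverage: {query_coverage:.1f}%, target coverage: {target_coverage:.1f}%")
```

In this hypothetical case the annotated protein would count toward both the 50% and the 95% query-coverage thresholds, even though a quarter of the target protein is not included in the alignment.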

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Graph legend: query = annotated proteins; target = UniProtKB/Swiss-Prot curated proteins.

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with RepeatMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Caty_1.0	GCF_000955945.1	48.43%	34.67%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	74	74 (100.00%)	74 (100.00%)	99.52%	99.39%
Homo sapiens known RefSeq (NM_/NR_)	49,777	48,748 (97.93%)	31,312 (62.90%)	95.39%	98.28%
Homo sapiens Genbank	272,600	227,228 (83.36%)	140,961 (51.71%)	94.61%	93.21%
Homo sapiens EST	8,652,387	7,486,571 (86.53%)	6,331,547 (73.18%)	97.87%	97.31%
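
In the table above, both percentage columns appear to be computed relative to the number of sequences retrieved from Entrez, not relative to the number aligned. The sketch below parses cells of the form "count (percent%)" and recomputes the percentages for one row; the regular expression and helper names are assumptions made for illustration and are not part of the annotation pipeline.

```python
import re

# Matches table cells such as "48,748 (97.93%)".
CELL_PATTERN = re.compile(r"^([\d,]+) \((\d+\.\d+)%\)$")


def parse_count_cell(cell: str) -> tuple:
    """Split a 'count (percent%)' cell into (count, reported percent)."""
    match = CELL_PATTERN.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    return count, float(match.group(2))


def percent_of(count: int, total: int) -> float:
    """Percentage of total represented by count, rounded to two decimals."""
    return round(100 * count / total, 2)


# Values copied from the "Homo sapiens known RefSeq (NM_/NR_)" row.
retrieved = 49_777
aligned, reported_aligned_pct = parse_count_cell("48,748 (97.93%)")
passed, reported_passed_pct = parse_count_cell("31,312 (62.90%)")

# Both percentages use the number of retrieved sequences as the denominator.
recomputed_aligned_pct = percent_of(aligned, retrieved)
recomputed_passed_pct = percent_of(passed, retrieved)
print(recomputed_aligned_pct, recomputed_passed_pct)
```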

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Number (%) of aligned reads	Number (%) spliced reads	Number of introns
All	Aggregate of all aligned samples	4,249,659,038	3,204,684,527 (75.41%)	533,476,446 (12.55%)	389,805
SAMN02045729	pooled tissue sample from Sooty mangabey (Cercocebus atys) (Cercocebus atys, SAMN02045729)	444,385,538	160,292,690 (36.07%)	18,157,379 (4.09%)	152,665
SAMN02045730	pooled tissue sample from Sooty mangabey (Cercocebus atys) (Cercocebus atys, SAMN02045730)	1,779,729,044	1,417,088,072 (79.62%)	337,909,519 (18.99%)	360,198
SAMN03085078	Whole blood (Cercocebus atys, not applicable, male, SAMN03085078)	200,302,908	134,549,480 (67.17%)	5,479,610 (2.74%)	129,767
SAMN03282394	Bone Marrow (Cercocebus atys, not applicable, male, SAMN03282394)	110,975,952	92,065,686 (82.96%)	11,407,235 (10.28%)	177,636
SAMN03282395	Brain Cerebellum (Cercocebus atys, not applicable, male, SAMN03282395)	165,682,008	132,469,955 (79.95%)	7,844,443 (4.73%)	214,965
SAMN03282396	Brain Frontal Cortex (Cercocebus atys, not applicable, male, SAMN03282396)	142,497,232	116,146,265 (81.51%)	9,869,195 (6.93%)	216,503
SAMN03282397	Brain Pituitary (Cercocebus atys, not applicable, male, SAMN03282397)	102,398,136	80,150,752 (78.27%)	7,802,174 (7.62%)	205,465
SAMN03282398	Colon (Cercocebus atys, not applicable, male, SAMN03282398)	139,950,832	114,403,816 (81.75%)	12,818,126 (9.16%)	216,615
SAMN03282399	Heart (Cercocebus atys, not applicable, male, SAMN03282399)	199,645,024	164,303,193 (82.30%)	23,083,654 (11.56%)	205,741
SAMN03282400	Kidney (Cercocebus atys, not applicable, male, SAMN03282400)	166,530,858	137,075,442 (82.31%)	16,568,806 (9.95%)	224,840
SAMN03282401	Liver (Cercocebus atys, not applicable, male, SAMN03282401)	151,988,484	126,731,949 (83.38%)	22,968,229 (15.11%)	189,385
SAMN03282402	Lung (Cercocebus atys, not applicable, male, SAMN03282402)	115,946,080	93,010,658 (80.22%)	12,212,783 (10.53%)	204,288
SAMN03282403	Lymph Node (Cercocebus atys, not applicable, male, SAMN03282403)	156,219,992	127,529,162 (81.63%)	13,386,165 (8.57%)	217,021
SAMN03282404	Skeletal Muscle (Cercocebus atys, not applicable, male, SAMN03282404)	114,830,616	96,151,061 (83.73%)	14,142,935 (12.32%)	184,602
SAMN03282405	Spleen (Cercocebus atys, not applicable, male, SAMN03282405)	109,771,286	91,661,689 (83.50%)	9,771,424 (8.90%)	201,266
SAMN03282406	Thymus (Cercocebus atys, not applicable, male, SAMN03282406)	148,805,048	121,054,657 (81.35%)	10,054,769 (6.76%)	224,197

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Number (%) of aligned reads	Number (%) spliced reads
SRR832957	SRX270667	SRP021223	SAMN02045729	444,385,538	160,292,690 (36.07%)	18,157,379 (4.09%)
SRR832956	SRX270666	SRP021223	SAMN02045730	1,779,729,044	1,417,088,072 (79.62%)	337,909,519 (18.99%)
SRR1602585	SRX724900	SRP048678	SAMN03085078	200,302,908	134,549,480 (67.17%)	5,479,610 (2.74%)
SRR1759017	SRX843248	SRP051959	SAMN03282394	110,975,952	92,065,686 (82.96%)	11,407,235 (10.28%)
SRR1759018	SRX843249	SRP051959	SAMN03282395	80,213,046	64,266,554 (80.12%)	3,874,307 (4.83%)
SRR1759019	SRX843250	SRP051959	SAMN03282395	85,468,962	68,203,401 (79.80%)	3,970,136 (4.65%)
SRR1759020	SRX843251	SRP051959	SAMN03282396	58,360,222	47,195,926 (80.87%)	4,012,286 (6.88%)
SRR1759021	SRX843252	SRP051959	SAMN03282396	84,137,010	68,950,339 (81.95%)	5,856,909 (6.96%)
SRR1759022	SRX843253	SRP051959	SAMN03282397	102,398,136	80,150,752 (78.27%)	7,802,174 (7.62%)
SRR1759023	SRX843254	SRP051959	SAMN03282398	139,950,832	114,403,816 (81.75%)	12,818,126 (9.16%)
SRR1759024	SRX843255	SRP051959	SAMN03282399	199,645,024	164,303,193 (82.30%)	23,083,654 (11.56%)
SRR1759025	SRX843256	SRP051959	SAMN03282400	166,530,858	137,075,442 (82.31%)	16,568,806 (9.95%)
SRR1759026	SRX843257	SRP051959	SAMN03282401	151,988,484	126,731,949 (83.38%)	22,968,229 (15.11%)
SRR1759027	SRX843258	SRP051959	SAMN03282402	115,946,080	93,010,658 (80.22%)	12,212,783 (10.53%)
SRR1759028	SRX843259	SRP051959	SAMN03282403	62,404,962	50,536,309 (80.98%)	5,320,918 (8.53%)
SRR1759029	SRX843260	SRP051959	SAMN03282403	93,815,030	76,992,853 (82.07%)	8,065,247 (8.60%)
SRR1759030	SRX843261	SRP051959	SAMN03282404	114,830,616	96,151,061 (83.73%)	14,142,935 (12.32%)
SRR1759031	SRX843262	SRP051959	SAMN03282405	109,771,286	91,661,689 (83.50%)	9,771,424 (8.90%)
SRR1759032	SRX843263	SRP051959	SAMN03282406	148,805,048	121,054,657 (81.35%)	10,054,769 (6.76%)
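
Some samples were sequenced in more than one run, so the by-sample figures are the sums of the corresponding by-run rows. The sketch below shows this aggregation for two such samples, using read counts copied from the by-run table; the data layout and variable names are illustrative assumptions, not a description of how the pipeline stores these statistics.

```python
from collections import defaultdict

# (run, sample, number of reads, number of aligned reads), copied from the
# by-run table for two samples that each have two runs.
runs = [
    ("SRR1759018", "SAMN03282395", 80_213_046, 64_266_554),
    ("SRR1759019", "SAMN03282395", 85_468_962, 68_203_401),
    ("SRR1759028", "SAMN03282403", 62_404_962, 50_536_309),
    ("SRR1759029", "SAMN03282403", 93_815_030, 76_992_853),
]

# Sum reads and aligned reads per sample.
per_sample = defaultdict(lambda: [0, 0])
for _run, sample, reads, aligned in runs:
    per_sample[sample][0] += reads
    per_sample[sample][1] += aligned

# The totals can be compared with the by-sample table:
# SAMN03282395 lists 165,682,008 reads and 132,469,955 aligned reads;
# SAMN03282403 lists 156,219,992 reads and 127,529,162 aligned reads.
for sample, (reads, aligned) in per_sample.items():
    print(sample, reads, aligned, f"{100 * aligned / reads:.2f}%")
```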

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Primates GenBank	22,509	20,841 (92.59%)	20,841 (92.59%)	82.23%	92.99%
Primates known RefSeq (NP_)	14,208	13,883 (97.71%)	13,883 (97.71%)	86.41%	90.76%
Same-species GenBank	71	71 (100.00%)	71 (100.00%)	93.75%	93.81%
Homo sapiens GenBank	125,296	115,118 (91.88%)	115,118 (91.88%)	84.74%	88.31%
Homo sapiens known RefSeq (NP_)	38,764	37,999 (98.03%)	37,999 (98.03%)	86.28%	88.73%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
