NCBI Capra hircus Annotation Release 102

The RefSeq genome records for Capra hircus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Similarity of current and previous assembly: The similarity of the current and previous assembly
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Capra hircus Annotation Release 102.

Annotation release ID: 102
Date of Entrez queries for transcripts and proteins: Aug 29 2016
Date of submission of annotation to the public databases: Sep 8 2016
Software version: 7.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ARS1	GCF_001704415.1	USDA ARS	08-24-2016	Reference	30 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ARS1
Genes and pseudogenes	28,908
protein-coding	20,755
non-coding	4,011
pseudogenes	4,142
genes with variants	9,086
mRNAs	42,674
fully-supported	40,463
with > 5% ab initio	1,072
partial	483
with filled gap(s)	220
known RefSeq (NM_)	377
model RefSeq (XM_)	42,297
Other RNAs	5,998
fully-supported	4,052
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	264
model RefSeq (XR_)	3,798
CDSs	42,836
fully-supported	40,463
with > 5% ab initio	1,207
partial	457
with major correction(s)	980
known RefSeq (NP_)	377
model RefSeq (XP_)	42,297
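
As a quick sanity check on the table above, the following sketch re-adds some of the sub-rows. It assumes that the protein-coding, non-coding and pseudogene rows partition the "Genes and pseudogenes" total, and that the NM_ and XM_ rows partition the mRNA total. The dictionary keys and the helper name are illustrative only and are not part of the annotation pipeline.

```python
# Counts copied from the "Feature counts" table for ARS1.
# The key names are illustrative labels, not pipeline identifiers.
feature_counts = {
    "genes_and_pseudogenes": 28_908,
    "protein_coding": 20_755,
    "non_coding": 4_011,
    "pseudogenes": 4_142,
    "mrnas": 42_674,
    "mrna_known_nm": 377,
    "mrna_model_xm": 42_297,
}


def subtotal_matches(total_key, part_keys, counts):
    """Return True if the listed parts add up to the reported total."""
    return counts[total_key] == sum(counts[key] for key in part_keys)


# Assumption: these sub-rows are mutually exclusive and exhaustive.
genes_ok = subtotal_matches(
    "genes_and_pseudogenes",
    ["protein_coding", "non_coding", "pseudogenes"],
    feature_counts,
)
mrnas_ok = subtotal_matches(
    "mrnas", ["mrna_known_nm", "mrna_model_xm"], feature_counts
)

print("gene subtotals consistent:", genes_ok)
print("mRNA subtotals consistent:", mrnas_ok)
```

The same helper can be pointed at other rows, but not every block in the table is expected to partition cleanly, so a mismatch is a prompt to re-read the row definitions rather than proof of an error.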


Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	24,766	44,080	12,491	61	2,364,981
All transcripts	48,672	3,212	2,576	17	106,760
mRNA	42,674	3,488	2,831	153	106,760
misc_RNA	1,122	3,136	2,595	114	13,846
miRNA	430	22	22	17	30
tRNA	1,770	73	72	69	84
lncRNA	2,676	1,434	994	64	14,605
Single-exon transcripts	2,162	1,363	960	153	9,824
coding transcripts (NM_/XM_)	2,159	1,363	960	153	9,824
non-coding transcripts (NR_/XR_)	3	1,441	1,714	617	1,993
CDSs	42,674	2,015	1,455	96	105,519
Exons	236,566	314	138	1	21,251
in coding transcripts (NM_/XM_)	227,632	310	138	1	21,251
in non-coding transcripts (NR_/XR_)	16,532	323	138	2	13,026
Introns	209,898	5,868	1,390	30	1,172,700
in coding transcripts (NM_/XM_)	203,638	5,848	1,385	30	1,172,700
in non-coding transcripts (NR_/XR_)	13,630	5,093	1,280	30	556,177

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.05	1	1	46
Number of exons per transcript	12.07	9	1	343

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 20,593 coding genes, 20,256 had a protein with an alignment covering 50% or more of the query, and 17,484 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
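
The two coverage definitions reduce to the same ratio applied to different lengths. The sketch below illustrates this with a made-up alignment (the 380, 400 and 500 residue figures are hypothetical) and then expresses the gene counts quoted above as percentages of the coding genes. The function name is an assumption for illustration and is not taken from BLASTP or the pipeline.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Hypothetical alignment: 380 residues of a 400-residue annotated protein
# (the query) align to 380 residues of a 500-residue curated protein
# (the target).
query_coverage = percent(380, 400)
target_coverage = percent(380, 500)

# Gene counts quoted in the text above.
coding_genes = 20_593
share_cov50 = percent(20_256, coding_genes)
share_cov95 = percent(17_484, coding_genes)

print(f"query coverage: {query_coverage:.1f}%")
print(f"target coverage: {target_coverage:.1f}%")
print(f"genes with >= 50% query coverage: {share_cov50:.1f}%")
print(f"genes with >= 95% query coverage: {share_cov95:.1f}%")
```

In this toy case the query is almost fully covered while the target is not, which is the pattern expected when an annotated protein is truncated relative to its curated counterpart.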

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
ARS1	GCF_001704415.1	50.58%	44.31%
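
For readers who want to reproduce a masked percentage on their own sequence, the sketch below counts masked bases in a string. It assumes the masked sequence is soft-masked, that is, masked bases are written in lowercase; the report does not say how the pipeline stores its masks, so this convention, the function name and the toy sequence are all assumptions.

```python
def percent_masked(sequence):
    """Percent of characters that are lowercase (assumed soft-masked)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


# Toy sequence: 4 of 10 bases are lowercase.
toy_sequence = "ACGTacgtAC"
toy_percent = percent_masked(toy_sequence)
print(f"{toy_percent:.2f}% masked")
```

A genome-wide figure would be obtained by summing masked and total bases over all sequences before dividing, rather than averaging per-sequence percentages.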

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	657	657 (100.00%)	641 (97.56%)	99.40%	99.78%
Same-species Genbank	1,336	1,254 (93.86%)	1,202 (89.97%)	99.37%	98.39%
Same-species EST	14,185	12,900 (90.94%)	12,422 (87.57%)	99.51%	99.68%
Homo sapiens known RefSeq (NM_/NR_)	57,159	45,824 (80.17%)	11,731 (20.52%)	89.08%	80.50%
Homo sapiens Genbank	284,895	145,097 (50.93%)	50,426 (17.70%)	89.76%	89.09%
Bos taurus known RefSeq (NM_/NR_)	14,246	14,081 (98.84%)	11,794 (82.79%)	96.33%	99.15%
Bos taurus Genbank	18,978	18,223 (96.02%)	13,443 (70.83%)	95.92%	98.62%
Bos taurus EST	1,583,270	1,401,716 (88.53%)	1,189,918 (75.16%)	94.70%	97.05%
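
The percentages in the table above appear to be taken relative to the number of sequences retrieved from Entrez. The sketch below recomputes them for the three same-species rows under that assumption and compares them with the reported values; the tuple layout and the names are illustrative.

```python
# (source, retrieved, aligned, passed, reported % aligned, reported % passed)
# Values copied from the transcript alignments table.
rows = [
    ("Same-species known RefSeq (NM_/NR_)", 657, 657, 641, 100.00, 97.56),
    ("Same-species Genbank", 1_336, 1_254, 1_202, 93.86, 89.97),
    ("Same-species EST", 14_185, 12_900, 12_422, 90.94, 87.57),
]


def pct_of_retrieved(count, retrieved):
    """Express count as a percentage of the retrieved sequences."""
    return 100.0 * count / retrieved


def agrees(recomputed, reported, tolerance=0.005):
    """True if the recomputed value rounds to the reported two decimals."""
    return abs(recomputed - reported) < tolerance


all_match = True
for source, retrieved, aligned, passed, rep_aligned, rep_passed in rows:
    aligned_pct = pct_of_retrieved(aligned, retrieved)
    passed_pct = pct_of_retrieved(passed, retrieved)
    row_ok = agrees(aligned_pct, rep_aligned) and agrees(passed_pct, rep_passed)
    all_match = all_match and row_ok
    print(f"{source}: aligned {aligned_pct:.2f}%, passed {passed_pct:.2f}%, ok={row_ok}")
```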

RefSeq transcript alignment quality report

The known RefSeq transcripts (NM_ and NR_ accessions) are a set of high-quality transcripts maintained by the RefSeq group at NCBI. Alignment statistics for this group of transcripts, such as the number and percent of sequences not aligning at all, the percent of best alignments split between multiple scaffolds, and the percent of alignments not covering the full CDS, are indicative of the genome quality and are provided below.

	ARS1 Primary Assembly
Number of sequences retrieved from Entrez	657
Number (%) of sequences not aligning	0 (0.00%)
Number (%) of sequences with multiple best alignments (split genes)	0 (0.00%)
Number (%) of sequences with CDS coverage < 95%	3 (0.76%)

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	3,268,359,340	88%	15%	226,700
SAMN01086873	skin (Capra hircus, SAMN01086873)	36,232,550	87%	23%	148,508
SAMN01086875	skin (Capra hircus, SAMN01086875)	41,232,610	87%	24%	154,986
SAMN01086876	skin (Capra hircus, SAMN01086876)	39,363,238	88%	26%	137,954
SAMN01086877	skin (Capra hircus, SAMN01086877)	49,962,602	86%	25%	150,219
SAMN01726798	skin (Capra hircus, SAMN01726798)	158,415,840	82%	14%	141,904
SAMN01726799	skin (Capra hircus, SAMN01726799)	59,405,940	69%	7%	94,760
SAMN01831861	normal liver (Capra hircus, adult, SAMN01831861)	31,747,734	89%	25%	139,501
SAMN01831862	normal liver (Capra hircus, adult, SAMN01831862)	32,909,212	79%	23%	141,023
SAMN01831911	normal kidney (Capra hircus, adult, SAMN01831911)	50,998,336	82%	6%	143,584
SAMN01831912	normal kidney (Capra hircus, adult, SAMN01831912)	73,417,988	81%	7%	153,399
SAMN01831956	normal brain (Capra hircus, adult, SAMN01831956)	69,928,334	83%	5%	158,329
SAMN01831957	normal brain (Capra hircus, adult, SAMN01831957)	51,595,962	87%	16%	173,443
SAMN01888076	pooled tissue (Capra hircus, female, SAMN01888076)	804,601	82%	35%	51,420
SAMN02264637	skin (Capra hircus, two year old, female, SAMN02264637)	46,659,562	94%	31%	128,154
SAMN02264643	skin (Capra hircus, two year old, female, SAMN02264643)	97,455,470	94%	28%	134,251
SAMN02265254	longissimus thoracis (Capra hircus, On day 90 goat, embryos, SAMN02265254)	27,512,850	91%	24%	144,466
SAMN02265255	longissimus thoracis (Capra hircus, 6 months old goat, SAMN02265255)	27,582,908	91%	17%	113,144
SAMN02712004	mammary gland (Capra hircus, 2-4 years old, female, SAMN02712004)	125,588,073	88%	28%	182,039
SAMN02769471	Ovary (Capra hircus, 4 years old, female, SAMN02769471)	6,027,714	87%	6%	62,495
SAMN02769472	ovary (Capra hircus, 4 years old, female, SAMN02769472)	5,884,062	85%	5%	70,890
SAMN03003643	ovary (Capra hircus, 2 years, female, SAMN03003643)	40,106,478	88%	15%	156,553
SAMN04447792	follicle (Capra hircus, 24-28 months, female, SAMN04447792)	116,805,880	85%	11%	134,205
SAMN04447793	follicle (Capra hircus, 24-28 months, female, SAMN04447793)	91,209,622	91%	18%	129,090
SAMN04455912	abomasum (Capra hircus, 4 months old, male, SAMN04455912)	1,385,845,948	89%	8%	186,220
SAMN05231885	mammary gland (Capra hircus, 4, female, SAMN05231885)	79,828,428	85%	11%	147,969
SAMN05363357	Longissimus thoracis, normal (Capra hircus, female, SAMN05363357)	55,621,438	93%	30%	133,397
SAMN05363358	Longissimus thoracis, ovariectomized (Capra hircus, female, SAMN05363358)	53,527,082	93%	25%	134,297
SAMN05363359	Longissimus thoracis, normal (Capra hircus, female, SAMN05363359)	51,690,018	94%	28%	111,335
SAMN05363360	Longissimus thoracis, ovariectomized (Capra hircus, female, SAMN05363360)	55,036,314	93%	32%	132,287
SAMN05363361	Longissimus thoracis, ovariectomized (Capra hircus, female, SAMN05363361)	49,947,344	94%	31%	91,197
SAMN05462051	Rumen Epithelium (Capra hircus, 6 month, male, SAMN05462051)	51,288,024	93%	28%	136,023
SAMN05462052	Rumen Epithelium (Capra hircus, 6 month, male, SAMN05462052)	39,742,206	92%	27%	113,254
SAMN05462053	Rumen Epithelium (Capra hircus, 6 month, male, SAMN05462053)	51,808,932	92%	27%	138,311
SAMN05462054	Rumen Epithelium (Capra hircus, 6 month, male, SAMN05462054)	37,523,380	93%	29%	141,627
SAMN05462055	Rumen Epithelium (Capra hircus, 6 month, male, SAMN05462055)	39,252,548	94%	29%	145,818
SAMN05462056	Rumen Epithelium (Capra hircus, 6 month, male, SAMN05462056)	36,400,112	93%	32%	146,032

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR520146	SRX159006	SRP014175	SAMN01086873	36,232,550	87%	23%
SRR520147	SRX159002	SRP014175	SAMN01086875	41,232,610	87%	24%
SRR520145	SRX159004	SRP014175	SAMN01086876	39,363,238	88%	26%
SRR520148	SRX159005	SRP014175	SAMN01086877	49,962,602	86%	25%
SRR609420	SRX189187	SRP015859	SAMN01726798	49,504,950	90%	18%
SRR609421	SRX189187	SRP015859	SAMN01726798	59,405,940	69%	7%
SRR609447	SRX201720	SRP015859	SAMN01726798	49,504,950	90%	18%
SRR609448	SRX201721	SRP015859	SAMN01726799	59,405,940	69%	7%
SRR636844	SRX211584	SRP017611	SAMN01831861	31,747,734	89%	25%
SRR636845	SRX211585	SRP017611	SAMN01831862	32,909,212	79%	23%
SRR636894	SRX211634	SRP017611	SAMN01831911	50,998,336	82%	6%
SRR636895	SRX211635	SRP017611	SAMN01831912	73,417,988	81%	7%
SRR636939	SRX211679	SRP017611	SAMN01831956	69,928,334	83%	5%
SRR636940	SRX211680	SRP017611	SAMN01831957	51,595,962	87%	16%
SRR649468	SRX217723	SRP017964	SAMN01888076	804,601	82%	35%
SRR943138	SRX327893	SRP028253	SAMN02264637	46,659,562	94%	31%
SRR943136	SRX327891	SRP028253	SAMN02264643	51,818,210	95%	28%
SRR943137	SRX327892	SRP028253	SAMN02264643	45,637,260	94%	28%
SRR943302	SRX328069	SRP028266	SAMN02265254	27,512,850	91%	24%
SRR943303	SRX328070	SRP028266	SAMN02265255	27,582,908	91%	17%
SRR1216868	SRX503584	SRP040710	SAMN02712004	125,588,073	88%	28%
SRR1283211	SRX540103	SRP041866	SAMN02769471	6,027,714	87%	6%
SRR1283212	SRX540104	SRP041866	SAMN02769472	5,884,062	85%	5%
SRR1556738	SRX685596	SRP045739	SAMN03003643	40,106,478	88%	15%
SRR3133401	SRX1552769	SRP069042	SAMN04447792	116,805,880	85%	11%
SRR3133414	SRX1552777	SRP069042	SAMN04447793	91,209,622	91%	18%
SRR3144626	SRX1560777	SRP069288	SAMN04455912	68,274,328	90%	8%
SRR3144627	SRX1560777	SRP069288	SAMN04455912	69,235,996	89%	7%
SRR3144628	SRX1560777	SRP069288	SAMN04455912	91,800,202	90%	7%
SRR3144629	SRX1560777	SRP069288	SAMN04455912	70,062,580	91%	9%
SRR3144632	SRX1560777	SRP069288	SAMN04455912	74,518,522	90%	9%
SRR3144634	SRX1560777	SRP069288	SAMN04455912	75,949,200	89%	9%
SRR3144635	SRX1560777	SRP069288	SAMN04455912	70,617,348	90%	7%
SRR3144636	SRX1560777	SRP069288	SAMN04455912	63,763,506	86%	8%
SRR3144637	SRX1560777	SRP069288	SAMN04455912	64,478,618	90%	7%
SRR3144638	SRX1560777	SRP069288	SAMN04455912	61,878,410	90%	8%
SRR3144639	SRX1560777	SRP069288	SAMN04455912	77,544,710	91%	8%
SRR3144640	SRX1560777	SRP069288	SAMN04455912	74,934,390	90%	8%
SRR3144641	SRX1560777	SRP069288	SAMN04455912	67,339,602	90%	8%
SRR3144642	SRX1560777	SRP069288	SAMN04455912	74,733,198	91%	8%
SRR3144660	SRX1560777	SRP069288	SAMN04455912	76,240,130	90%	8%
SRR3144664	SRX1560777	SRP069288	SAMN04455912	72,129,474	83%	8%
SRR3144671	SRX1560777	SRP069288	SAMN04455912	79,232,660	89%	8%
SRR3144672	SRX1560777	SRP069288	SAMN04455912	80,067,730	90%	8%
SRR3144673	SRX1560777	SRP069288	SAMN04455912	73,045,344	89%	8%
SRR3659132	SRX1838080	SRP076449	SAMN05231885	41,635,680	85%	15%
SRR3659145	SRX1838080	SRP076449	SAMN05231885	38,192,748	85%	7%
SRR3746184	SRX1900065	SRP078004	SAMN05363357	27,810,719	94%	30%
SRR3746185	SRX1900065	SRP078004	SAMN05363357	27,810,719	92%	30%
SRR3746186	SRX1900066	SRP078004	SAMN05363358	26,763,541	94%	25%
SRR3746187	SRX1900066	SRP078004	SAMN05363358	26,763,541	92%	25%
SRR3746188	SRX1900067	SRP078004	SAMN05363359	25,845,009	95%	28%
SRR3746189	SRX1900067	SRP078004	SAMN05363359	25,845,009	93%	28%
SRR3746190	SRX1900068	SRP078004	SAMN05363360	27,518,157	94%	33%
SRR3746191	SRX1900068	SRP078004	SAMN05363360	27,518,157	92%	32%
SRR3746192	SRX1900069	SRP078004	SAMN05363361	24,973,672	95%	31%
SRR3746193	SRX1900069	SRP078004	SAMN05363361	24,973,672	93%	31%
SRR3989693	SRX1991436	SRP080217	SAMN05462051	51,288,024	93%	28%
SRR3989694	SRX1991437	SRP080217	SAMN05462052	39,742,206	92%	27%
SRR3989695	SRX1991438	SRP080217	SAMN05462053	51,808,932	92%	27%
SRR3989696	SRX1991439	SRP080217	SAMN05462054	37,523,380	93%	29%
SRR3989697	SRX1991440	SRP080217	SAMN05462055	39,252,548	94%	29%
SRR3989698	SRX1991441	SRP080217	SAMN05462056	36,400,112	93%	32%
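
The per-sample read counts can be recovered from the per-run table by summing the runs that share a sample accession. The sketch below does this for three samples that have more than one run; the rows are copied from the table above, and the function and variable names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the per-run table.
runs = [
    ("SRR609420", "SAMN01726798", 49_504_950),
    ("SRR609421", "SAMN01726798", 59_405_940),
    ("SRR609447", "SAMN01726798", 49_504_950),
    ("SRR943136", "SAMN02264643", 51_818_210),
    ("SRR943137", "SAMN02264643", 45_637_260),
    ("SRR3659132", "SAMN05231885", 41_635_680),
    ("SRR3659145", "SAMN05231885", 38_192_748),
]


def reads_per_sample(run_rows):
    """Sum the read counts of all runs belonging to each sample."""
    totals = defaultdict(int)
    for _run, sample, reads in run_rows:
        totals[sample] += reads
    return dict(totals)


sample_totals = reads_per_sample(runs)
for sample, total in sorted(sample_totals.items()):
    print(f"{sample}: {total:,} reads")
```

The totals can then be compared with the "Number of reads" column of the per-sample table.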

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Homo sapiens known RefSeq (NP_)	44,062	43,183 (98.01%)	43,183 (98.01%)	75.51%	82.94%
Bos taurus known RefSeq (NP_)	13,361	13,287 (99.45%)	13,287 (99.45%)	81.54%	87.97%
Same-species GenBank	1,124	1,096 (97.51%)	1,096 (97.51%)	79.98%	85.49%
Same-species known RefSeq (NP_)	393	389 (98.98%)	389 (98.98%)	79.16%	89.75%

Assembly-assembly alignments of current to previous assembly

When the assembly changes between two rounds of annotation, genes in the current and the previous annotation are mapped to each other using the genomic alignments of the current assembly to the previous assembly so that gene identifiers can be preserved. The success of the remapping depends largely on how well the two assembly versions align to each other.

Below are the percent coverage of one assembly by the other and the average percent identity of the alignments. The 'First pass' alignments are reciprocal best hits, while the 'Total' alignments also include 'Second pass' or non-reciprocal best alignments. For more information about the assembly-assembly alignment process, please visit the NCBI Genome Remapping Service page.

	First Pass	Total
ARS1 (Current) coverage	82.02%	83.25%
CHIR_1.0 (Previous) coverage	96.04%	96.48%
Percent identity	99.43%	99.39%
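
The idea of a reciprocal best hit, used for the 'First pass' alignments described above, can be illustrated with a small sketch. This is not the NCBI implementation: the region names, the scores and the rule that the highest score wins are assumptions chosen only to show how reciprocal and non-reciprocal best hits differ.

```python
def best_hits(scores):
    """Map each first-position item to its highest-scoring partner.

    On a tie, the pair encountered first is kept.
    """
    best = {}
    for (left, right), score in scores.items():
        if left not in best or score > best[left][1]:
            best[left] = (right, score)
    return {left: right for left, (right, _score) in best.items()}


def reciprocal_best_hits(scores):
    """Return sorted pairs that are each other's best hit in both directions."""
    forward = best_hits(scores)
    reverse = best_hits({(b, a): s for (a, b), s in scores.items()})
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical alignment scores between regions of the current and
# previous assemblies.
scores = {
    ("cur1", "prev1"): 980,
    ("cur1", "prev2"): 400,
    ("cur2", "prev2"): 870,
    ("cur3", "prev2"): 600,
}

first_pass = reciprocal_best_hits(scores)
print(first_pass)
```

Here `cur3` has `prev2` as its best hit, but `prev2` prefers `cur2`, so that pair is a non-reciprocal best hit and would only be counted in a total that includes a second pass.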

Comparison of the current and previous annotations

The annotation produced for this release (102) was compared to the annotation in the previous release (101) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	ARS1 (Current) to CHIR_1.0 (Previous)
Identical	11%
Minor changes	51%
Major changes	16%
New	14%
Deprecated	8%
Other	7%
Download the report	tabular, Genome Workbench

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
