NCBI Camelus bactrianus Annotation Release 102

The RefSeq genome records for Camelus bactrianus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Camelus bactrianus Annotation Release 102

Annotation release ID: 102
Date of Entrez queries for transcripts and proteins: Dec 6 2021
Date of submission of annotation to the public databases: Dec 16 2021
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Ca_bactrianus_MBC_1.0	GCF_000767855.1	Inner Mongolia Agricultural University, P.R.China	10-23-2014	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Ca_bactrianus_MBC_1.0
Genes and pseudogenes	29,764
protein-coding	19,712
non-coding	7,720
Transcribed pseudogenes	35
Non-transcribed pseudogenes	2,177
genes with variants	8,537
Immunoglobulin/T-cell receptor gene segments	93
other	27
mRNAs	41,421
fully-supported	38,877
with > 5% ab initio	910
partial	4,207
with filled gap(s)	3,674
known RefSeq (NM_)	8
model RefSeq (XM_)	41,413
non-coding RNAs	12,657
fully-supported	9,330
with > 5% ab initio	0
partial	77
with filled gap(s)	74
known RefSeq (NR_)	0
model RefSeq (XR_)	10,722
pseudo transcripts	35
fully-supported	25
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	35
CDSs	41,527
fully-supported	38,877
with > 5% ab initio	1,157
partial	3,620
with major correction(s)	2,477
known RefSeq (NP_)	8
model RefSeq (XP_)	41,426
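
As a consistency check on the table above, the gene categories can be summed and compared with the reported total. The sketch below assumes that "genes with variants" is an overlapping subset rather than a separate category; that is an interpretation, not something the report states. The variable names are chosen only for illustration.

```python
# Gene counts copied from the feature-count table for Ca_bactrianus_MBC_1.0.
# Assumption: these six categories are mutually exclusive; "genes with
# variants" (8,537) is left out because it appears to overlap with them.
gene_categories = {
    "protein-coding": 19_712,
    "non-coding": 7_720,
    "Transcribed pseudogenes": 35,
    "Non-transcribed pseudogenes": 2_177,
    "Immunoglobulin/T-cell receptor gene segments": 93,
    "other": 27,
}
reported_total = 29_764  # "Genes and pseudogenes"

computed_total = sum(gene_categories.values())
print(computed_total == reported_total)

# mRNAs split into known (NM_) and model (XM_) RefSeq records.
mrna_known, mrna_model, mrna_reported = 8, 41_413, 41_421
print(mrna_known + mrna_model == mrna_reported)
```

Both comparisons hold for the numbers in this table.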

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	27,459	35,387	10,578	55	1,872,531
All transcripts	54,078	3,060	2,422	55	106,870
mRNA	41,421	3,351	2,706	114	106,870
misc_RNA	2,102	3,546	2,897	182	36,108
tRNA	1,933	73	73	59	87
lncRNA	7,231	2,617	1,694	77	24,510
snoRNA	563	111	125	55	326
snRNA	798	117	107	60	198
rRNA	3	884	968	119	1,565
Single-exon transcripts	1,877	1,448	957	114	21,479
coding transcripts (NM_/XM_ )	1,876	1,448	957	114	21,479
non-coding transcripts (NR_/XR_ )	1	1,494	1,494	1,494	1,494
CDSs	41,434	1,848	1,374	102	105,612
Exons	242,584	332	138	1	24,048
in coding transcripts (NM_/XM_ )	220,043	303	135	1	24,048
in non-coding transcripts (NR_/XR_ )	33,955	471	150	2	22,689
Introns	214,607	5,277	1,323	30	929,027
in coding transcripts (NM_/XM_ )	197,870	4,882	1,286	30	929,027
in non-coding transcripts (NR_/XR_ )	27,821	7,699	1,705	30	405,334

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.04	1	1	50
Number of exons per transcript	10.84	8	1	345

BUSCO analysis of gene annotation

BUSCO v4.1.4 (Simão et al 2015, PMID: 26059717) was run in "protein" mode with the cetartiodactyla_odb10 lineage dataset on the annotated gene set, using the single longest protein per gene. Results are reported for the gene set from the primary assembly unit and presented in BUSCO notation (C:complete [S:single-copy, D:duplicated], F:fragmented, M:missing, n:number of genes used).
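
For readers who want to process a BUSCO notation string programmatically, a minimal parsing sketch follows. The function name `parse_busco_summary` is hypothetical, and the example string holds invented placeholder values; it is not the result for this annotation.

```python
import re

# Matches the one-line BUSCO notation described above, once spaces are removed.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(summary: str) -> dict:
    """Return the percentages (floats) and n (int) from a BUSCO notation string."""
    match = BUSCO_PATTERN.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a BUSCO notation string: {summary!r}")
    fields = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    fields["n"] = int(match.group("n"))
    return fields


# Placeholder values for illustration only, NOT the results of this annotation.
example = "C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:1000"
parsed = parse_busco_summary(example)
print(parsed)
```

In this notation, the complete fraction C is the sum of the single-copy (S) and duplicated (D) fractions.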

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 19,628 coding genes, 19,320 genes had a protein with an alignment covering 50% or more of the query and 15,943 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
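
The two definitions can be written as a single calculation applied to either sequence. The sketch below is a simplification that assumes one contiguous alignment with 1-based inclusive coordinates; the function name and the example coordinates are hypothetical, and the pipeline's handling of multi-segment alignments may differ.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in an alignment (1-based, inclusive)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-460 of a 500-residue annotated protein
# (query) aligned to residues 1-450 of a 600-residue curated protein (target).
query_coverage = coverage_percent(11, 460, 500)
target_coverage = coverage_percent(1, 450, 600)
print(query_coverage, target_coverage)
```

In this made-up case the query would count toward the 50% threshold but not the 95% threshold.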

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
Ca_bactrianus_MBC_1.0	GCF_000767855.1	24.97%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	8	8 (100.00%)	8 (100.00%)	99.38%	99.88%
Same-species Genbank	155	144 (92.90%)	32 (20.65%)	98.77%	90.63%
Homo sapiens known RefSeq (NM_/NR_)	81,570	69,068 (84.67%)	13,319 (16.33%)	89.40%	81.58%
Homo sapiens Genbank	342,314	173,111 (50.57%)	55,956 (16.35%)	89.75%	86.87%
Artiodactyla known RefSeq (NM_/NR_)	21,932	20,433 (93.17%)	9,950 (45.37%)	91.09%	92.98%
Artiodactyla Genbank	75,874	63,177 (83.27%)	25,351 (33.41%)	90.76%	91.18%
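
The percentages in the table appear to be expressed relative to the number of sequences retrieved from Entrez, for both the aligned and the passed-to-Gnomon columns. The sketch below recomputes two rows under that assumption; the helper name `percent_of_retrieved` is illustrative.

```python
def percent_of_retrieved(count: int, retrieved: int) -> str:
    """Format count as a percentage of retrieved sequences, two decimals."""
    return f"{100 * count / retrieved:.2f}%"


# Same-species Genbank row: 155 retrieved, 144 aligned, 32 passed to Gnomon.
genbank_aligned = percent_of_retrieved(144, 155)
genbank_passed = percent_of_retrieved(32, 155)

# Artiodactyla known RefSeq row: 21,932 retrieved, 20,433 aligned, 9,950 passed.
refseq_aligned = percent_of_retrieved(20_433, 21_932)
refseq_passed = percent_of_retrieved(9_950, 21_932)

print(genbank_aligned, genbank_passed, refseq_aligned, refseq_passed)
```

The recomputed values agree with the table, which supports reading the "passed to Gnomon" percentage as a share of retrieved sequences rather than of aligned sequences.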

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	3,189,245,081	68%	36%	270,952
SAMN01093898	camelus bactrianus transcriptome (Camelus bactrianus, SAMN01093898)	298,123,581	40%	19%	194,508
SAMN07840747	renal cortex (Camelus bactrianus, 7year, pooled male and female, SAMN07840747)	77,435,302	80%	33%	174,126
SAMN07840748	renal medulla (Camelus bactrianus, 7year, pooled male and female, SAMN07840748)	70,604,592	82%	34%	198,598
SAMN07840749	pancreas (Camelus bactrianus, 7year, pooled male and female, SAMN07840749)	58,813,474	89%	51%	145,881
SAMN07840750	muscle (Camelus bactrianus, 7year, pooled male and female, SAMN07840750)	61,802,884	80%	32%	168,061
SAMN07840751	liver (Camelus bactrianus, 7year, pooled male and female, SAMN07840751)	59,909,960	90%	40%	160,423
SAMN09209274	Adipose tissues (Camelus bactrianus, pooled male and female, SAMN09209274)	807,127,428	61%	38%	237,762
SAMN09812175	tonsil (Camelus bactrianus, 5 years, SAMN09812175)	52,508,502	77%	36%	183,702
SAMN09812176	spleen (Camelus bactrianus, 5 years, SAMN09812176)	53,039,624	77%	28%	184,239
SAMN09812177	kidney (Camelus bactrianus, 5 years, SAMN09812177)	42,327,812	73%	31%	174,352
SAMN09812178	the second chamber stomach (Camelus bactrianus, 3 years, female, SAMN09812178)	51,043,018	70%	34%	178,894
SAMN09812179	the second chamber stomach (Camelus bactrianus, 3 years, male, SAMN09812179)	48,782,294	66%	32%	178,764
SAMN09812180	pseudo rumen (Camelus bactrianus, 3 years, male, SAMN09812180)	52,718,274	68%	32%	181,348
SAMN09812181	skin (Camelus bactrianus, 3 years, female, SAMN09812181)	63,918,094	81%	36%	199,628
SAMN09812182	esophagus (Camelus bactrianus, 3 years, female, SAMN09812182)	48,968,280	76%	44%	185,179
SAMN09812183	liver (Camelus bactrianus, 5 years, SAMN09812183)	49,278,454	84%	50%	158,478
SAMN09812184	abomasum (Camelus bactrianus, 3 years, female, SAMN09812184)	35,916,996	62%	34%	172,350
SAMN09812185	testes (Camelus bactrianus, 3 years, male, SAMN09812185)	55,900,352	73%	29%	181,288
SAMN09812186	pseudo rumen (Camelus bactrianus, 3 years, female, SAMN09812186)	52,746,038	73%	39%	149,364
SAMN09812187	subcutaneous fat (Camelus bactrianus, 5 years, SAMN09812187)	49,383,382	79%	36%	184,919
SAMN09812188	skin (Camelus bactrianus, 3 years, male, SAMN09812188)	46,966,608	77%	39%	190,447
SAMN09812189	muscle (Camelus bactrianus, 5 years, SAMN09812189)	49,221,660	77%	44%	148,563
SAMN09812190	cecum (Camelus bactrianus, 3 years, male, SAMN09812190)	52,184,182	81%	37%	178,383
SAMN09812191	esophagus (Camelus bactrianus, 3 years, male, SAMN09812191)	51,667,772	77%	39%	174,203
SAMN09812192	duodenum (Camelus bactrianus, 3 years, male, SAMN09812192)	42,807,358	75%	36%	173,395
SAMN09812193	cecum (Camelus bactrianus, 3 years, female, SAMN09812193)	49,123,086	70%	35%	167,325
SAMN09812194	pseudo rumen (Camelus bactrianus, SAMN09812194)	47,974,774	64%	37%	155,901
SAMN09812195	esophagus (Camelus bactrianus, SAMN09812195)	40,499,904	78%	44%	156,404
SAMN09812196	pseudo rumen (Camelus bactrianus, SAMN09812196)	47,728,894	61%	40%	157,568
SAMN09812197	the second chamber stomach (Camelus bactrianus, SAMN09812197)	46,513,632	67%	41%	158,727
SAMN09812198	pseudo rumen (Camelus bactrianus, SAMN09812198)	48,004,758	56%	38%	157,253
SAMN09812199	pseudo rumen (Camelus bactrianus, SAMN09812199)	45,896,234	74%	39%	163,877
SAMN09812200	esophagus (Camelus bactrianus, SAMN09812200)	45,288,930	61%	43%	150,747
SAMN09812201	the second chamber stomach (Camelus bactrianus, SAMN09812201)	46,346,632	73%	34%	159,923
SAMN09812202	abomasum (Camelus bactrianus, SAMN09812202)	45,720,120	56%	39%	160,628
SAMN09812203	esophagus (Camelus bactrianus, SAMN09812203)	46,545,990	69%	45%	165,051
SAMN09812204	esophagus (Camelus bactrianus, SAMN09812204)	49,692,184	73%	43%	170,141
SAMN09812205	the second chamber stomach (Camelus bactrianus, SAMN09812205)	41,322,358	75%	42%	148,849
SAMN09812206	duodenum (Camelus bactrianus, SAMN09812206)	47,178,968	63%	39%	169,870
SAMN14047989	renal medulla (Camelus bactrianus, female, SAMN14047989)	110,130,768	83%	21%	200,149
SAMN14047990	renal medulla (Camelus bactrianus, female, SAMN14047990)	98,081,928	78%	21%	199,798

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR527309	SRX170822	SRP014573	SAMN01093898	33,869,452	55%	16%
SRR527310	SRX170822	SRP014573	SAMN01093898	34,152,514	44%	22%
SRR527311	SRX170822	SRP014573	SAMN01093898	33,210,264	52%	16%
SRR527312	SRX170822	SRP014573	SAMN01093898	35,131,338	55%	23%
SRR527313	SRX170822	SRP014573	SAMN01093898	17,246,995	17%	19%
SRR527314	SRX170822	SRP014573	SAMN01093898	31,548,154	51%	21%
SRR527315	SRX170822	SRP014573	SAMN01093898	41,016,808	20%	22%
SRR527316	SRX170822	SRP014573	SAMN01093898	35,974,028	31%	17%
SRR527317	SRX170822	SRP014573	SAMN01093898	35,974,028	31%	17%
SRR6228806	SRX3337402	SRP122491	SAMN07840747	77,435,302	80%	33%
SRR6228807	SRX3337403	SRP122491	SAMN07840748	70,604,592	82%	34%
SRR6228808	SRX3337404	SRP122491	SAMN07840749	58,813,474	89%	51%
SRR6228809	SRX3337405	SRP122491	SAMN07840750	61,802,884	80%	32%
SRR6228810	SRX3337406	SRP122491	SAMN07840751	59,909,960	90%	40%
SRR7188319	SRX4105024	SRP148535	SAMN09209274	49,574,934	66%	38%
SRR7188318	SRX4105025	SRP148535	SAMN09209274	42,163,054	62%	38%
SRR7188317	SRX4105026	SRP148535	SAMN09209274	29,221,074	62%	37%
SRR7188316	SRX4105027	SRP148535	SAMN09209274	17,731,176	66%	38%
SRR7188315	SRX4105028	SRP148535	SAMN09209274	44,144,336	61%	38%
SRR7188314	SRX4105029	SRP148535	SAMN09209274	54,756,398	54%	36%
SRR7188313	SRX4105030	SRP148535	SAMN09209274	20,750,788	64%	37%
SRR7188312	SRX4105031	SRP148535	SAMN09209274	42,535,766	66%	39%
SRR7188311	SRX4105032	SRP148535	SAMN09209274	24,843,644	66%	38%
SRR7188310	SRX4105033	SRP148535	SAMN09209274	43,667,128	67%	44%
SRR7188309	SRX4105034	SRP148535	SAMN09209274	30,390,168	61%	39%
SRR7188308	SRX4105035	SRP148535	SAMN09209274	23,700,096	61%	37%
SRR7188307	SRX4105036	SRP148535	SAMN09209274	57,033,908	62%	41%
SRR7188306	SRX4105037	SRP148535	SAMN09209274	44,944,254	63%	38%
SRR7188305	SRX4105038	SRP148535	SAMN09209274	26,581,470	58%	37%
SRR7188304	SRX4105039	SRP148535	SAMN09209274	18,869,654	64%	37%
SRR7188303	SRX4105040	SRP148535	SAMN09209274	25,449,480	61%	39%
SRR7188302	SRX4105041	SRP148535	SAMN09209274	85,335,162	55%	39%
SRR7188301	SRX4105042	SRP148535	SAMN09209274	45,058,260	65%	39%
SRR7188300	SRX4105043	SRP148535	SAMN09209274	26,818,616	59%	38%
SRR7188299	SRX4105044	SRP148535	SAMN09209274	53,558,062	56%	36%
SRR7755223	SRX4611161	SRP158841	SAMN09812175	52,508,502	77%	36%
SRR7755220	SRX4611164	SRP158841	SAMN09812176	53,039,624	77%	28%
SRR7755221	SRX4611163	SRP158841	SAMN09812177	42,327,812	73%	31%
SRR7755218	SRX4611166	SRP158841	SAMN09812178	51,043,018	70%	34%
SRR7755219	SRX4611165	SRP158841	SAMN09812179	48,782,294	66%	32%
SRR7755216	SRX4611168	SRP158841	SAMN09812180	52,718,274	68%	32%
SRR7755217	SRX4611167	SRP158841	SAMN09812181	63,918,094	81%	36%
SRR7755214	SRX4611170	SRP158841	SAMN09812182	48,968,280	76%	44%
SRR7755215	SRX4611169	SRP158841	SAMN09812183	49,278,454	84%	50%
SRR7755394	SRX4610990	SRP158841	SAMN09812184	35,916,996	62%	34%
SRR7755393	SRX4610991	SRP158841	SAMN09812185	55,900,352	73%	29%
SRR7755090	SRX4611294	SRP158841	SAMN09812186	52,746,038	73%	39%
SRR7755089	SRX4611295	SRP158841	SAMN09812187	49,383,382	79%	36%
SRR7755096	SRX4611288	SRP158841	SAMN09812188	46,966,608	77%	39%
SRR7755095	SRX4611289	SRP158841	SAMN09812189	49,221,660	77%	44%
SRR7755094	SRX4611290	SRP158841	SAMN09812190	52,184,182	81%	37%
SRR7755093	SRX4611291	SRP158841	SAMN09812191	51,667,772	77%	39%
SRR7755406	SRX4610978	SRP158841	SAMN09812192	42,807,358	75%	36%
SRR7755405	SRX4610979	SRP158841	SAMN09812193	49,123,086	70%	35%
SRR7755436	SRX4611409	SRP158841	SAMN09812194	47,974,774	64%	37%
SRR7755437	SRX4611408	SRP158841	SAMN09812195	40,499,904	78%	44%
SRR7755438	SRX4611407	SRP158841	SAMN09812196	47,728,894	61%	40%
SRR7755439	SRX4611406	SRP158841	SAMN09812197	46,513,632	67%	41%
SRR7755432	SRX4611413	SRP158841	SAMN09812198	48,004,758	56%	38%
SRR7755433	SRX4611412	SRP158841	SAMN09812199	45,896,234	74%	39%
SRR7755434	SRX4611411	SRP158841	SAMN09812200	45,288,930	61%	43%
SRR7755435	SRX4611410	SRP158841	SAMN09812201	46,346,632	73%	34%
SRR7755440	SRX4611405	SRP158841	SAMN09812202	45,720,120	56%	39%
SRR7755441	SRX4611404	SRP158841	SAMN09812203	46,545,990	69%	45%
SRR7755372	SRX4611012	SRP158841	SAMN09812204	49,692,184	73%	43%
SRR7755416	SRX4611429	SRP158841	SAMN09812205	41,322,358	75%	42%
SRR7755415	SRX4611430	SRP158841	SAMN09812206	47,178,968	63%	39%
SRR11031496	SRX7683760	SRP247453	SAMN14047989	110,130,768	83%	21%
SRR11031495	SRX7683761	SRP247453	SAMN14047990	98,081,928	78%	21%
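
The per-sample read counts can be reproduced from the per-run rows by summing the reads of all runs that share a sample. The sketch below does this for one multi-run sample and one single-run sample; the tuple layout and variable names are illustrative, and only a subset of the runs is copied in.

```python
from collections import defaultdict

# (run, sample, number of reads) copied from the per-run table.
runs = [
    ("SRR527309", "SAMN01093898", 33_869_452),
    ("SRR527310", "SAMN01093898", 34_152_514),
    ("SRR527311", "SAMN01093898", 33_210_264),
    ("SRR527312", "SAMN01093898", 35_131_338),
    ("SRR527313", "SAMN01093898", 17_246_995),
    ("SRR527314", "SAMN01093898", 31_548_154),
    ("SRR527315", "SAMN01093898", 41_016_808),
    ("SRR527316", "SAMN01093898", 35_974_028),
    ("SRR527317", "SAMN01093898", 35_974_028),
    ("SRR6228806", "SAMN07840747", 77_435_302),
]

# Sum reads over all runs belonging to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, reads in reads_per_sample.items():
    print(f"{sample}\t{reads:,}")
```

For SAMN01093898 the nine runs add up to the 298,123,581 reads listed in the per-sample table.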

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Artiodactyla GenBank	32,288	14,685 (45.48%)	14,685 (45.48%)	75.81%	82.96%
Artiodactyla known RefSeq (NP_)	20,030	18,714 (93.43%)	18,714 (93.43%)	77.40%	87.41%
Homo sapiens GenBank	149,247	85,536 (57.31%)	85,536 (57.31%)	70.26%	77.48%
Homo sapiens known RefSeq (NP_)	62,869	45,703 (72.70%)	45,703 (72.70%)	78.78%	83.56%

Comparison of the current and previous annotations

The annotation produced for this release (102) was compared to the annotation in the previous release (100) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	Ca_bactrianus_MBC_1.0 (Current) to Ca_bactrianus_MBC_1.0 (Previous)
Identical	6%
Minor changes	51%
Major changes	14%
New	23%
Deprecated	4%
Other	6%
Download the report	tabular, Genome Workbench
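
The six percentages in the table add up to 104%, not 100%. One reading, which is an assumption rather than something the report states, is that Deprecated genes exist only in the previous release and therefore fall outside the 100% of current-release genes. The sketch below checks the arithmetic under that reading.

```python
# Percentages copied from the comparison table (percent of current-release genes).
changes = {
    "Identical": 6,
    "Minor changes": 51,
    "Major changes": 14,
    "New": 23,
    "Deprecated": 4,
    "Other": 6,
}

# Assumption: every category except "Deprecated" describes a gene that is
# present in the current release.
current_release_share = sum(
    percent for category, percent in changes.items() if category != "Deprecated"
)
all_categories = sum(changes.values())

print(current_release_share, all_categories)
```

Under this reading the current-release categories account for exactly 100% of the genes, and the Deprecated figure is reported on the same scale for comparison.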

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018 Sep 15;34(18):3094-3100
