NCBI Candoia aspera Annotation Release GCF_035149785.1-RS_2024_02

The genome sequence records for Candoia aspera RefSeq assembly GCF_035149785.1 (rCanAsp1.hap2) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_035149785.1-RS_2024_02".

Date of Entrez queries for transcripts and proteins: Feb 15 2024
Date of submission of annotation to the public databases: Feb 22 2024
Software version: 10.2
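
The release name above appears to combine the assembly accession with a RefSeq release tag. As a minimal sketch, assuming the pattern is `<assembly accession>-RS_<year>_<month>` (an assumption inferred from this one name and the dates listed above, not a documented grammar), the name can be split into its parts. The function name `parse_release_name` and the regular expression are illustrative only.

```python
import re

# Assumed pattern: <assembly accession>-RS_<year>_<month>.
# This is inferred from the single name in this report, not an official grammar.
RELEASE_NAME = re.compile(
    r"^(?P<accession>GC[AF]_\d{9}\.\d+)-RS_(?P<year>\d{4})_(?P<month>\d{2})$"
)


def parse_release_name(name: str) -> dict:
    """Split an annotation release name into accession, year and month."""
    match = RELEASE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized release name: {name!r}")
    return {
        "accession": match["accession"],
        "year": int(match["year"]),
        "month": int(match["month"]),
    }


release = parse_release_name("GCF_035149785.1-RS_2024_02")
print(release)
```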

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
rCanAsp1.hap2	GCF_035149785.1	Vertebrate Genomes Project	01-18-2024	Reference	18 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	rCanAsp1.hap2
Genes and pseudogenes	20,506
protein-coding	18,397
non-coding	1,482
Transcribed pseudogenes	0
Non-transcribed pseudogenes	501
genes with variants	5,994
Immunoglobulin/T-cell receptor gene segments	109
other	17
mRNAs	29,719
fully-supported	26,811
with > 5% ab initio	1,407
partial	98
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	29,719
non-coding RNAs	2,307
fully-supported	1,784
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	2,107
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	29,828
fully-supported	26,811
with > 5% ab initio	1,572
partial	128
with major correction(s)	265
known RefSeq (NP_)	0
model RefSeq (XP_)	29,719
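
As a quick consistency check, the gene categories in the table above can be summed and compared with the reported "Genes and pseudogenes" total. The sketch below hard-codes the counts from the rCanAsp1.hap2 column and assumes that these six rows partition the total; the "genes with variants" row is left out on the assumption that it overlaps with the other categories. The variable names are illustrative only.

```python
# Gene-category counts copied from the feature counts table (rCanAsp1.hap2).
# Assumption: these rows are mutually exclusive and together make up the total.
gene_categories = {
    "protein-coding": 18_397,
    "non-coding": 1_482,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 501,
    "Ig/TCR gene segments": 109,
    "other": 17,
}
reported_total = 20_506  # "Genes and pseudogenes" row

category_sum = sum(gene_categories.values())
print(f"sum of categories: {category_sum:,}")
print(f"matches reported total: {category_sum == reported_total}")
```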

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	19,896	38,390	16,413	50	1,282,126
All transcripts	32,026	2,973	2,398	50	116,773
mRNA	29,719	3,104	2,503	165	116,773
misc_RNA	669	2,586	2,115	107	10,118
tRNA	200	74	73	71	87
lncRNA	1,116	1,046	669	121	9,970
snoRNA	175	112	101	50	319
snRNA	89	136	116	61	197
rRNA	41	843	119	118	3,996
Single-exon transcripts	1,328	1,375	966	165	10,073
coding transcripts (NM_/XM_ )	1,328	1,375	966	165	10,073
CDSs	29,719	1,915	1,443	96	115,566
Exons	208,940	274	135	1	17,100
in coding transcripts (NM_/XM_ )	204,847	272	135	1	17,100
in non-coding transcripts (NR_/XR_ )	9,821	247	129	8	9,857
Introns	187,790	4,245	1,352	30	1,056,664
in coding transcripts (NM_/XM_ )	185,144	4,069	1,349	30	670,660
in non-coding transcripts (NR_/XR_ )	8,244	7,556	1,367	30	1,056,664

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.62	1	1	29
Number of exons per transcript	11.53	9	1	327

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode with the sauropsida_odb10 lineage dataset on the annotated gene set, using the longest protein of each gene. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 18,397 coding genes, 17,919 genes had a protein with an alignment covering 50% or more of the query and 11,503 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
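
The two coverage definitions above can be illustrated with a small calculation. This is a sketch under simplifying assumptions: the alignment is treated as a single contiguous span with 1-based inclusive coordinates, and the sequence lengths and coordinates in the example are made up for illustration, not taken from the annotation. The function name `coverage` is hypothetical.

```python
def coverage(aligned_start: int, aligned_end: int, seq_length: int) -> float:
    """Percent of a sequence spanned by an alignment.

    Coordinates are 1-based and inclusive. Assumes one contiguous aligned
    span; real BLASTP output may need several spans merged first.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= seq_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    return 100.0 * (aligned_end - aligned_start + 1) / seq_length


# Made-up example: a 400-residue annotated protein (query) aligned over
# residues 21-400 to residues 101-480 of a 500-residue target protein.
query_cov = coverage(21, 400, 400)
target_cov = coverage(101, 480, 500)
print(f"query coverage:  {query_cov:.1f}%")
print(f"target coverage: {target_cov:.1f}%")
```

In this made-up case the query would count toward the "95% or more" threshold, while the same alignment covers a smaller share of the longer target.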

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
rCanAsp1.hap2	GCF_035149785.1	33.05%
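
A masked percentage of this kind can be approximated from a soft-masked sequence. The sketch below assumes that masked bases are written in lowercase and that every base counts toward the denominator; whether the pipeline computes its figure this way (for example, how gaps or N runs are handled) is not stated in this report. The function name `percent_masked` and the toy sequence are illustrative only.

```python
def percent_masked(sequence: str) -> float:
    """Percent of bases written in lowercase (soft-masked)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


# Toy sequence: 20 bases, 6 of them lowercase.
toy = "ACGTacgtACGTACGTacGT"
toy_masked = percent_masked(toy)
print(f"{toy_masked:.2f}%")
```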

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

No transcript evidence was used in this annotation

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	3,201,227,576	53%	23%	196,352
SAMN00009871	20228791	trigeminal ganglion (Corallus hortulanus, SAMN00009871)	25,812,175	62%	19%	116,881
SAMN00009872	20228791	dorsal root ganglion (Corallus hortulanus, SAMN00009872)	12,218,945	59%	14%	98,064
SAMN01823442	NA	liver (Candoia aspera, Adult, SAMN01823442)	70,096,418	85%	33%	133,138
SAMN02223084	NA	Generic sample from Boa constrictor (Boa constrictor, female, SAMN02223084)	25,971,656	64%	16%	106,761
SAMN02223085	NA	Generic sample from Boa constrictor (Boa constrictor, male, SAMN02223085)	25,971,656	66%	16%	103,653
SAMN03246838	NA	adult, blood (Boa constrictor, SAMN03246838)	32,080,658	10%	26%	38,671
SAMN09242002	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242002)	24,368,156	83%	42%	87,140
SAMN09242003	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242003)	16,811,446	66%	33%	129,251
SAMN09242004	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242004)	31,088,276	52%	22%	102,734
SAMN09242005	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242005)	15,348,542	41%	24%	112,298
SAMN09242006	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242006)	16,473,462	80%	41%	93,504
SAMN09242007	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242007)	16,446,780	67%	33%	121,901
SAMN09242008	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242008)	13,805,634	38%	23%	113,221
SAMN09242009	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242009)	13,239,300	57%	33%	138,676
SAMN09242010	NA	MIGS Eukaryotic sample from Boa constrictor (Boa constrictor, SAMN09242010)	16,257,786	56%	30%	125,293
SAMN18039723	NA	small intestine (Boa constrictor, female, SAMN18039723)	117,888,110	46%	20%	113,170
SAMN18039724	NA	small intestine (Boa constrictor, male, SAMN18039724)	117,927,904	50%	20%	121,352
SAMN18039725	NA	small intestine (Boa constrictor, male, SAMN18039725)	128,921,656	57%	22%	148,035
SAMN18039726	NA	small intestine (Boa constrictor, male, SAMN18039726)	115,807,810	51%	20%	123,027
SAMN18039727	NA	small intestine (Boa constrictor, male, SAMN18039727)	113,025,652	54%	22%	116,284
SAMN18039728	NA	small intestine (Boa constrictor, male, SAMN18039728)	137,139,930	66%	25%	144,827
SAMN18039729	NA	small intestine (Boa constrictor, female, SAMN18039729)	130,599,642	57%	22%	143,300
SAMN18039730	NA	small intestine (Boa constrictor, male, SAMN18039730)	120,601,360	61%	22%	139,095
SAMN18039731	NA	small intestine (Boa constrictor, male, SAMN18039731)	117,099,928	67%	24%	139,418
SAMN18039732	NA	small intestine (Boa constrictor, male, SAMN18039732)	140,029,816	66%	26%	150,952
SAMN18039733	NA	small intestine (Boa constrictor, male, SAMN18039733)	124,197,008	53%	24%	146,945
SAMN18039734	NA	small intestine (Boa constrictor, male, SAMN18039734)	166,779,352	65%	26%	153,427
SAMN18039735	NA	small intestine (Boa constrictor, male, SAMN18039735)	138,139,834	53%	24%	144,162
SAMN18039736	NA	small intestine (Boa constrictor, male, SAMN18039736)	117,984,992	63%	24%	142,337
SAMN18039737	NA	small intestine (Boa constrictor, female, SAMN18039737)	147,768,202	63%	22%	149,482
SAMN18039738	NA	small intestine (Boa constrictor, male, SAMN18039738)	179,667,924	55%	19%	154,646
SAMN18039739	NA	small intestine (Boa constrictor, female, SAMN18039739)	146,588,116	38%	23%	131,876
SAMN18039740	NA	small intestine (Boa constrictor, female, SAMN18039740)	140,094,360	38%	18%	149,779
SAMN18039741	NA	small intestine (Boa constrictor, female, SAMN18039741)	105,479,760	46%	16%	134,494
SAMN18039742	NA	small intestine (Boa constrictor, male, SAMN18039742)	97,917,730	37%	21%	96,322
SAMN18039743	NA	small intestine (Boa constrictor, female, SAMN18039743)	117,672,032	43%	18%	143,578
SAMN18039744	NA	small intestine (Boa constrictor, female, SAMN18039744)	123,905,568	1%	12%	30,482

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR037817	SRX017694	SRP002110	SAMN00009871	12,989,432	62%	19%
SRR037818	SRX017694	SRP002110	SAMN00009871	12,822,743	62%	19%
SRR037819	SRX017695	SRP002110	SAMN00009872	12,218,945	59%	14%
SRR629646	SRX209124	SRP017457	SAMN01823442	70,096,418	85%	33%
SRR941236	SRX326514	SRP026494	SAMN02223084	25,971,656	64%	16%
SRR941243	SRX326520	SRP026494	SAMN02223085	25,971,656	66%	16%
SRR1693194	SRX793980	SRP050457	SAMN03246838	32,080,658	10%	26%
SRR7206975	SRX4115958	SRP148755	SAMN09242002	12,208,052	83%	42%
SRR7206974	SRX4115959	SRP148755	SAMN09242002	12,160,104	83%	42%
SRR7206973	SRX4115960	SRP148755	SAMN09242003	8,418,274	66%	33%
SRR7206972	SRX4115961	SRP148755	SAMN09242003	8,393,172	66%	33%
SRR7206971	SRX4115962	SRP148755	SAMN09242004	15,568,642	52%	22%
SRR7206970	SRX4115963	SRP148755	SAMN09242004	15,519,634	52%	22%
SRR7206969	SRX4115964	SRP148755	SAMN09242005	7,688,474	41%	24%
SRR7206968	SRX4115965	SRP148755	SAMN09242005	7,660,068	41%	24%
SRR7206977	SRX4115956	SRP148755	SAMN09242006	8,252,600	80%	41%
SRR7206976	SRX4115957	SRP148755	SAMN09242006	8,220,862	80%	41%
SRR7206965	SRX4115968	SRP148755	SAMN09242007	8,236,812	67%	33%
SRR7206964	SRX4115969	SRP148755	SAMN09242007	8,209,968	67%	33%
SRR7206967	SRX4115966	SRP148755	SAMN09242008	13,805,634	38%	23%
SRR7206966	SRX4115967	SRP148755	SAMN09242009	13,239,300	57%	33%
SRR7206963	SRX4115970	SRP148755	SAMN09242010	16,257,786	56%	30%
SRR13772102	SRX10157758	SRP307767	SAMN18039723	117,888,110	46%	20%
SRR13772101	SRX10157759	SRP307767	SAMN18039724	117,927,904	50%	20%
SRR13772090	SRX10157770	SRP307767	SAMN18039725	128,921,656	57%	22%
SRR13772087	SRX10157773	SRP307767	SAMN18039726	115,807,810	51%	20%
SRR13772086	SRX10157774	SRP307767	SAMN18039727	113,025,652	54%	22%
SRR13772085	SRX10157775	SRP307767	SAMN18039728	137,139,930	66%	25%
SRR13772084	SRX10157776	SRP307767	SAMN18039729	130,599,642	57%	22%
SRR13772083	SRX10157777	SRP307767	SAMN18039730	120,601,360	61%	22%
SRR13772082	SRX10157778	SRP307767	SAMN18039731	117,099,928	67%	24%
SRR13772081	SRX10157779	SRP307767	SAMN18039732	140,029,816	66%	26%
SRR13772100	SRX10157760	SRP307767	SAMN18039733	124,197,008	53%	24%
SRR13772099	SRX10157761	SRP307767	SAMN18039734	166,779,352	65%	26%
SRR13772098	SRX10157762	SRP307767	SAMN18039735	138,139,834	53%	24%
SRR13772097	SRX10157763	SRP307767	SAMN18039736	117,984,992	63%	24%
SRR13772096	SRX10157764	SRP307767	SAMN18039737	147,768,202	63%	22%
SRR13772095	SRX10157765	SRP307767	SAMN18039738	179,667,924	55%	19%
SRR13772094	SRX10157766	SRP307767	SAMN18039739	146,588,116	38%	23%
SRR13772093	SRX10157767	SRP307767	SAMN18039740	140,094,360	38%	18%
SRR13772092	SRX10157768	SRP307767	SAMN18039741	105,479,760	46%	16%
SRR13772091	SRX10157769	SRP307767	SAMN18039742	97,917,730	37%	21%
SRR13772089	SRX10157771	SRP307767	SAMN18039743	117,672,032	43%	18%
SRR13772088	SRX10157772	SRP307767	SAMN18039744	123,905,568	1%	12%
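
The by-sample read counts correspond to the sums of the by-run read counts for each sample. The sketch below shows that grouping for four rows copied from the by-run table; the variable names are illustrative, and only two samples are included to keep the example short.

```python
from collections import defaultdict

# (run, sample, number of reads) copied from the by-run table.
runs = [
    ("SRR037817", "SAMN00009871", 12_989_432),
    ("SRR037818", "SAMN00009871", 12_822_743),
    ("SRR7206975", "SAMN09242002", 12_208_052),
    ("SRR7206974", "SAMN09242002", 12_160_104),
]

# Sum the reads of all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, n_reads in runs:
    reads_per_sample[sample] += n_reads

for sample, total in sorted(reads_per_sample.items()):
    print(sample, f"{total:,}")
```

The two totals agree with the "Number of reads" values listed for SAMN00009871 and SAMN09242002 in the by-sample table.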

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pogona vitticeps high-quality model RefSeq (XP_)	12,377	12,020 (97.12%)	12,020 (97.12%)	72.84%	83.16%
Protobothrops mucrosquamatus high-quality model RefSeq (XP_)	6,159	6,112 (99.24%)	6,112 (99.24%)	76.89%	88.97%
Anolis carolinensis high-quality model RefSeq (XP_)	16,461	15,647 (95.05%)	15,647 (95.05%)	71.11%	81.44%
Xenopus known RefSeq (NP_)	19,243	18,270 (94.94%)	18,270 (94.94%)	69.11%	78.98%
Sauropsida GenBank	31,451	27,937 (88.83%)	27,937 (88.83%)	69.52%	77.72%
Sauropsida known RefSeq (NP_)	10,239	9,745 (95.18%)	9,745 (95.18%)	71.87%	81.46%
Homo sapiens GenBank	151,087	126,918 (84.00%)	126,918 (84.00%)	62.94%	72.72%
Homo sapiens known RefSeq (NP_)	67,643	59,530 (88.01%)	59,530 (88.01%)	69.85%	76.89%
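
The percentages in the "aligned by ProSplign" column follow from the retrieved and aligned counts. A minimal sketch, using three rows copied from the table above (the dictionary layout and names are illustrative only):

```python
# source -> (sequences retrieved from Entrez, sequences aligned by ProSplign),
# copied from the protein alignments table.
protein_sources = {
    "Pogona vitticeps high-quality model RefSeq (XP_)": (12_377, 12_020),
    "Protobothrops mucrosquamatus high-quality model RefSeq (XP_)": (6_159, 6_112),
    "Homo sapiens GenBank": (151_087, 126_918),
}

# Percent aligned, formatted to two decimals as in the table.
percent_aligned = {
    source: f"{100.0 * aligned / retrieved:.2f}%"
    for source, (retrieved, aligned) in protein_sources.items()
}

for source, pct in percent_aligned.items():
    print(f"{source}: {pct}")
```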

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
