NCBI Pantherophis guttatus Annotation Release 100

The RefSeq genome records for Pantherophis guttatus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Pantherophis guttatus Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: May 13 2020
Date of submission of annotation to the public databases: May 15 2020
Software version: 8.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
UNIGE_PanGut_3.0	GCF_001185365.1	University of Geneva	04-13-2020	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	UNIGE_PanGut_3.0
Genes and pseudogenes	25,732
protein-coding	19,907
non-coding	5,246
transcribed pseudogenes	0
non-transcribed pseudogenes	432
genes with variants	9,264
immunoglobulin/T-cell receptor gene segments	147
other	0
mRNAs	41,701
fully-supported	39,165
with > 5% ab initio	938
partial	1,262
with filled gap(s)	6
known RefSeq (NM_)	0
model RefSeq (XM_)	41,701
non-coding RNAs	7,530
fully-supported	6,956
with > 5% ab initio	0
partial	2
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	7,294
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	41,848
fully-supported	39,165
with > 5% ab initio	1,121
partial	1,294
with major correction(s)	377
known RefSeq (NP_)	0
model RefSeq (XP_)	41,701
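
As a quick consistency check on the table above, the gene categories can be summed and compared with the reported total. The sketch below hard-codes the counts from the UNIGE_PanGut_3.0 column. The dictionary and variable names are illustrative only, and leaving out "genes with variants" as an overlapping property rather than a separate category is an assumption that happens to agree with the totals.

```python
# Counts copied from the "Feature counts" table (UNIGE_PanGut_3.0 column).
gene_categories = {
    "protein-coding": 19_907,
    "non-coding": 5_246,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 432,
    "immunoglobulin/T-cell receptor gene segments": 147,
    "other": 0,
}
reported_total = 25_732  # "Genes and pseudogenes"

# "genes with variants" (9,264) is left out: it is assumed to describe a
# subset of the genes above, not an additional category.
category_sum = sum(gene_categories.values())
print(category_sum, category_sum == reported_total)

# mRNAs plus non-coding RNAs should match the "All transcripts" count
# given in the feature-length table (49,231).
total_transcripts = 41_701 + 7_530
print(total_transcripts)
```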

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	25,153	34,003	13,536	61	2,119,615
All transcripts	49,231	3,095	2,440	61	103,315
mRNA	41,701	3,458	2,759	126	103,315
misc_RNA	1,001	3,074	2,426	101	14,116
tRNA	236	74	73	71	84
lncRNA	5,955	849	581	96	8,401
snoRNA	184	111	97	62	319
snRNA	127	121	107	61	200
guide_RNA	15	192	141	88	345
rRNA	12	327	119	119	2,515
Single-exon transcripts	1,216	1,536	1,035	126	11,221
coding transcripts (NM_/XM_)	1,216	1,536	1,035	126	11,221
CDSs	41,701	2,021	1,464	96	102,117
Exons	236,055	293	137	1	19,510
in coding transcripts (NM_/XM_)	219,730	292	137	1	19,510
in non-coding transcripts (NR_/XR_)	23,258	270	138	2	10,940
Introns	209,626	4,856	1,527	30	1,124,168
in coding transcripts (NM_/XM_)	198,689	4,776	1,505	30	1,124,168
in non-coding transcripts (NR_/XR_)	17,657	5,837	1,842	30	581,949

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.97	1	1	50
Number of exons per transcript	11.39	8	1	318

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 19,907 coding genes, 19,166 genes had a protein with an alignment covering 50% or more of the query and 11,616 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
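
The two coverage definitions, and the gene counts quoted above, can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the pipeline's own code: the function name `coverage_percent` and the example protein lengths are hypothetical, and it assumes coverage is simply the number of aligned residues divided by the sequence length.

```python
def coverage_percent(aligned_residues: int, sequence_length: int) -> float:
    """Percentage of a sequence that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 0 <= aligned_residues <= sequence_length:
        raise ValueError("aligned_residues must lie within the sequence")
    return 100.0 * aligned_residues / sequence_length


# Hypothetical alignment: a 400-residue annotated protein (query) against
# a 500-residue Swiss-Prot protein (target).
query_cov = coverage_percent(aligned_residues=380, sequence_length=400)
target_cov = coverage_percent(aligned_residues=390, sequence_length=500)
print(query_cov, target_cov)

# Share of coding genes reaching each query-coverage threshold,
# using the counts stated in this report.
coding_genes = 19_907
share_50 = round(100 * 19_166 / coding_genes, 2)
share_95 = round(100 * 11_616 / coding_genes, 2)
print(share_50, share_95)
```

Under these counts, roughly 96% of coding genes have a protein with at least 50% query coverage and roughly 58% have one with at least 95% query coverage.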

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
UNIGE_PanGut_3.0	GCF_001185365.1	6.22%	39.71%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	60	59 (98.33%)	57 (95.00%)	99.49%	99.03%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,395,855,292	70%	23%	279,854
SAMEA1689204	25449103	Pgu1SCG-sc-2012-02-15T14:50:44Z-1361323 (Pantherophis guttatus, SAMEA1689204)	23,806,510	63%	37%	118,001
SAMEA1689215	25449103	Pgu2SCG-sc-2012-02-15T14:50:44Z-1361324 (Pantherophis guttatus, SAMEA1689215)	27,504,812	61%	36%	122,818
SAMEA1689216	25449103	Pgu4SCG-sc-2012-03-22T09:30:40Z-1385640 (Pantherophis guttatus, SAMEA1689216)	26,282,776	79%	27%	121,892
SAMEA1689221	25449103	Pgu1SCG-sc-2012-02-15T14:50:36Z-1361313 (Pantherophis guttatus, SAMEA1689221)	25,681,642	78%	27%	146,087
SAMEA1689229	25449103	Pgu1SK-sc-2012-02-15T14:50:25Z-1361303 (Pantherophis guttatus, SAMEA1689229)	15,724,742	85%	30%	106,868
SAMN08449336	NA	Adult, Eye (Pantherophis guttatus, SAMN08449336)	33,020,266	74%	35%	153,707
SAMN08742068	NA	2.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742068)	34,707,234	85%	28%	158,976
SAMN08742069	NA	2.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742069)	36,861,453	85%	28%	157,512
SAMN08742070	NA	3.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742070)	39,219,549	84%	27%	161,105
SAMN08742071	NA	3.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742071)	38,296,466	84%	27%	159,353
SAMN08742072	NA	4.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742072)	40,688,659	82%	27%	163,739
SAMN08742073	NA	4.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742073)	41,633,003	83%	27%	166,284
SAMN08742074	NA	5.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742074)	39,270,723	83%	28%	164,460
SAMN08742075	NA	5.5 dpo, posterior trunk (post-cloacal tissue) (Pantherophis guttatus, SAMN08742075)	42,065,528	83%	28%	164,419
SAMN12141872	NA	brain (Pantherophis guttatus, adult, male, SAMN12141872)	43,609,769	76%	17%	173,915
SAMN12141873	NA	brain (Pantherophis guttatus, adult, female, SAMN12141873)	34,281,246	76%	18%	168,144
SAMN12141874	NA	testes (Pantherophis guttatus, newborn, male, SAMN12141874)	28,045,260	78%	21%	174,710
SAMN12141875	NA	ovary (Pantherophis guttatus, newborn, female, SAMN12141875)	31,748,499	78%	21%	161,667
SAMN12141876	NA	cerebellum (Pantherophis guttatus, adult, female, SAMN12141876)	22,608,408	78%	19%	164,942
SAMN12141877	NA	kidney (Pantherophis guttatus, adult, female, SAMN12141877)	23,396,418	77%	27%	146,355
SAMN12141878	NA	liver (Pantherophis guttatus, adult, female, SAMN12141878)	51,573,530	77%	31%	144,181
SAMN12141879	NA	ovary (Pantherophis guttatus, adult, female, SAMN12141879)	28,480,444	79%	24%	158,432
SAMN12141880	NA	testes (Pantherophis guttatus, adult, male, SAMN12141880)	31,078,841	76%	24%	188,421
SAMN12141881	NA	cerebellum (Pantherophis guttatus, adult, male, SAMN12141881)	28,084,557	79%	22%	163,612
SAMN12141882	NA	liver (Pantherophis guttatus, adult, male, SAMN12141882)	25,407,734	80%	34%	133,462
SAMN12141883	NA	heart (Pantherophis guttatus, adult, female, SAMN12141883)	62,410,596	67%	24%	163,848
SAMN12141884	NA	ventral skin (Pantherophis guttatus, adult, male, SAMN12141884)	43,827,058	79%	29%	139,323
SAMN12141885	NA	dorsal skin (Pantherophis guttatus, adult, male, SAMN12141885)	40,346,524	80%	27%	134,874
SAMN12141886	NA	brain, testes, kidney (Pantherophis guttatus, adult, male, SAMN12141886)	145,009,520	51%	16%	151,820
SAMN12141887	NA	embryonic tissues (Pantherophis guttatus, not determined, SAMN12141887)	129,750,834	48%	13%	92,882
SAMN12141888	NA	Vomeronasal Organ (Pantherophis guttatus, adult, male, SAMN12141888)	32,910,339	54%	19%	149,880
SAMN12141889	NA	Vomeronasal Organ (Pantherophis guttatus, adult, female, SAMN12141889)	33,472,734	56%	13%	76,001
SAMN13153857	31804475	Cerebellum (Pantherophis guttatus, Adult, SAMN13153857)	95,049,618	57%	7%	154,074

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
ERR216307	ERX190972	ERP001222	SAMEA1689204	23,806,510	63%	37%
ERR216308	ERX190973	ERP001222	SAMEA1689215	27,504,812	61%	36%
ERR216323	ERX190988	ERP001222	SAMEA1689216	26,282,776	79%	27%
ERR216317	ERX190982	ERP001222	SAMEA1689221	25,681,642	78%	27%
ERR216298	ERX190963	ERP001222	SAMEA1689229	15,724,742	85%	30%
SRR6668168	SRX3644877	SRP132105	SAMN08449336	33,020,266	74%	35%
SRR6868475	SRX3822954	SRP136054	SAMN08742068	34,707,234	85%	28%
SRR6868482	SRX3822947	SRP136054	SAMN08742069	36,861,453	85%	28%
SRR6868479	SRX3822950	SRP136054	SAMN08742070	39,219,549	84%	27%
SRR6868480	SRX3822949	SRP136054	SAMN08742071	38,296,466	84%	27%
SRR6868485	SRX3822944	SRP136054	SAMN08742072	40,688,659	82%	27%
SRR6868486	SRX3822943	SRP136054	SAMN08742073	41,633,003	83%	27%
SRR6868483	SRX3822946	SRP136054	SAMN08742074	39,270,723	83%	28%
SRR6868484	SRX3822945	SRP136054	SAMN08742075	42,065,528	83%	28%
SRR9596701	SRX6362424	SRP211966	SAMN12141872	43,609,769	76%	17%
SRR9596700	SRX6362425	SRP211966	SAMN12141873	34,281,246	76%	18%
SRR9596699	SRX6362426	SRP211966	SAMN12141874	28,045,260	78%	21%
SRR9596698	SRX6362427	SRP211966	SAMN12141875	31,748,499	78%	21%
SRR9596717	SRX6362408	SRP211966	SAMN12141876	22,608,408	78%	19%
SRR9596716	SRX6362409	SRP211966	SAMN12141877	23,396,418	77%	27%
SRR9596715	SRX6362410	SRP211966	SAMN12141878	51,573,530	77%	31%
SRR9596714	SRX6362411	SRP211966	SAMN12141879	28,480,444	79%	24%
SRR9596713	SRX6362412	SRP211966	SAMN12141880	31,078,841	76%	24%
SRR9596712	SRX6362413	SRP211966	SAMN12141881	28,084,557	79%	22%
SRR9596707	SRX6362418	SRP211966	SAMN12141882	25,407,734	80%	34%
SRR9596706	SRX6362419	SRP211966	SAMN12141883	62,410,596	67%	24%
SRR9596709	SRX6362416	SRP211966	SAMN12141884	43,827,058	79%	29%
SRR9596708	SRX6362417	SRP211966	SAMN12141885	40,346,524	80%	27%
SRR9596703	SRX6362422	SRP211966	SAMN12141886	76,050,060	52%	16%
SRR9596702	SRX6362423	SRP211966	SAMN12141886	68,959,460	51%	16%
SRR9596705	SRX6362420	SRP211966	SAMN12141887	67,853,382	48%	13%
SRR9596704	SRX6362421	SRP211966	SAMN12141887	61,897,452	47%	13%
SRR9596711	SRX6362414	SRP211966	SAMN12141888	32,910,339	54%	19%
SRR9596710	SRX6362415	SRP211966	SAMN12141889	33,472,734	56%	13%
SRR10360869	SRX7070003	SRP227378	SAMN13153857	95,049,618	57%	7%
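
Most samples have a single run, but SAMN12141886 and SAMN12141887 each have two, so their per-sample read counts are the sums of their runs. The sketch below reproduces that aggregation for those two samples. The variable names are illustrative, and the values are copied from the by-run and by-sample tables in this report.

```python
from collections import defaultdict

# (run, sample, number of reads) copied from the by-run table.
runs = [
    ("SRR9596703", "SAMN12141886", 76_050_060),
    ("SRR9596702", "SAMN12141886", 68_959_460),
    ("SRR9596705", "SAMN12141887", 67_853_382),
    ("SRR9596704", "SAMN12141887", 61_897_452),
]

# Read counts reported for the same samples in the by-sample table.
reported = {"SAMN12141886": 145_009_520, "SAMN12141887": 129_750_834}

# Sum the reads of all runs belonging to each sample.
reads_per_sample = defaultdict(int)
for _run, sample, n_reads in runs:
    reads_per_sample[sample] += n_reads

for sample, total in reads_per_sample.items():
    print(sample, f"{total:,}", total == reported[sample])
```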

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pogona vitticeps high-quality model RefSeq (XP_)	12,377	11,836 (95.63%)	11,836 (95.63%)	71.97%	80.51%
Protobothrops mucrosquamatus high-quality model RefSeq (XP_)	6,159	5,923 (96.17%)	5,923 (96.17%)	77.96%	87.77%
Anolis carolinensis high-quality model RefSeq (XP_)	11,381	10,740 (94.37%)	10,740 (94.37%)	70.18%	79.48%
Xenopus GenBank	31,816	8,634 (27.14%)	8,634 (27.14%)	67.65%	72.73%
Xenopus known RefSeq (NP_)	19,656	18,367 (93.44%)	18,367 (93.44%)	68.74%	76.30%
Sauropsida GenBank	29,322	16,772 (57.20%)	16,772 (57.20%)	68.33%	73.50%
Sauropsida known RefSeq (NP_)	8,133	7,450 (91.60%)	7,450 (91.60%)	70.89%	78.74%
Same-species GenBank	60	58 (96.67%)	58 (96.67%)	74.08%	80.92%
Homo sapiens GenBank	144,552	72,305 (50.02%)	72,305 (50.02%)	63.09%	70.88%
Homo sapiens known RefSeq (NP_)	57,234	39,101 (68.32%)	39,101 (68.32%)	68.91%	73.93%
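
The "Number (%)" cells in the alignment tables combine a count with a percentage of the sequences retrieved from Entrez. A minimal sketch for parsing such cells and recomputing the percentage is shown below. The helper names `parse_count_percent` and `is_consistent` are hypothetical, and the tolerance assumes the reported percentages are rounded to two decimals.

```python
import re
from typing import Tuple

# Matches cells such as "11,836 (95.63%)".
CELL = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_percent(cell: str) -> Tuple[int, float]:
    """Split a 'count (percent%)' cell into an int and a float."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


def is_consistent(retrieved: int, cell: str, tol: float = 0.005) -> bool:
    """Check that the reported percentage equals count / retrieved."""
    count, reported = parse_count_percent(cell)
    return abs(100 * count / retrieved - reported) <= tol + 1e-9


# (source, sequences retrieved, "aligned by ProSplign" cell) from the table.
rows = [
    ("Pogona vitticeps high-quality model RefSeq (XP_)", 12_377, "11,836 (95.63%)"),
    ("Xenopus GenBank", 31_816, "8,634 (27.14%)"),
    ("Same-species GenBank", 60, "58 (96.67%)"),
]

for source, retrieved, cell in rows:
    print(source, is_consistent(retrieved, cell))
```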

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
