NCBI Solanum dulcamara Annotation Release GCF_947179165.1-RS_2023_05

The genome sequence records for Solanum dulcamara RefSeq assembly GCF_947179165.1 (daSolDulc1.2) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_947179165.1-RS_2023_05".

Date of Entrez queries for transcripts and proteins: May 5 2023
Date of submission of annotation to the public databases: May 8 2023
Software version: 10.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
daSolDulc1.2	GCF_947179165.1	WELLCOME SANGER INSTITUTE	12-02-2022	Reference	12 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	daSolDulc1.2
Genes and pseudogenes	35,381
  protein-coding	25,374
  non-coding	6,215
  Transcribed pseudogenes	93
  Non-transcribed pseudogenes	3,699
  genes with variants	6,681
  Immunoglobulin/T-cell receptor gene segments	0
  other	0
mRNAs	36,107
  fully-supported	32,890
  with > 5% ab initio	2,671
  partial	122
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	36,107
non-coding RNAs	9,020
  fully-supported	4,986
  with > 5% ab initio	0
  partial	2
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	8,446
pseudo transcripts	93
  fully-supported	84
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	93
CDSs	36,107
  fully-supported	32,890
  with > 5% ab initio	2,723
  partial	122
  with major correction(s)	137
  known RefSeq (NP_)	0
  model RefSeq (XP_)	36,107
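
As a quick consistency check on the counts above, the protein-coding, non-coding and pseudogene categories should add up to the "Genes and pseudogenes" total. The following is a minimal Python sketch with the counts copied by hand from this report; the variable names are illustrative and not part of any NCBI tool.

```python
# Counts copied from the Feature counts table (daSolDulc1.2).
gene_counts = {
    "protein-coding": 25_374,
    "non-coding": 6_215,
    "transcribed pseudogenes": 93,
    "non-transcribed pseudogenes": 3_699,
}
reported_total = 35_381  # "Genes and pseudogenes" row

# Sum of the four categories, to compare with the reported total.
computed_total = sum(gene_counts.values())

# Genes excluding pseudogenes, to compare with the "Genes" row
# of the Feature lengths table.
non_pseudo_genes = gene_counts["protein-coding"] + gene_counts["non-coding"]

print(computed_total == reported_total)
print(non_pseudo_genes)
```

The second value corresponds to the 31,589 genes reported in the Feature lengths table, which excludes pseudogenes.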

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	31,589	4,929	3,147	61	138,611
All transcripts	45,127	1,803	1,566	61	27,832
  mRNA	36,107	1,909	1,651	181	16,863
  misc_RNA	1,828	2,370	2,018	167	10,423
  tRNA	574	74	73	71	88
  lncRNA	3,162	1,603	801	96	27,832
  snoRNA	362	106	103	61	221
  snRNA	253	136	118	98	200
  rRNA	2,841	1,030	156	117	3,424
Single-exon transcripts	4,000	1,271	1,086	198	5,871
  coding transcripts (NM_/XM_)	4,000	1,271	1,086	198	5,871
CDSs	36,107	1,410	1,173	114	16,386
Exons	168,176	317	165	3	16,100
  in coding transcripts (NM_/XM_)	157,083	312	162	3	8,719
  in non-coding transcripts (NR_/XR_)	16,262	339	159	10	16,100
Introns	136,404	921	314	30	137,420
  in coding transcripts (NM_/XM_)	128,629	872	299	30	137,420
  in non-coding transcripts (NR_/XR_)	12,663	1,428	516	30	60,345

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.44	1	1	33
Number of exons per transcript	6.04	4	1	79

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode with the solanales_odb10 lineage dataset on the annotated gene set, using the longest protein of each gene. Results are reported for the gene set from the primary assembly unit, and presented in BUSCO notation.

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 25,374 coding genes, 22,808 genes had a protein with an alignment covering 50% or more of the query and 10,851 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
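
The two coverage definitions can be illustrated with a short Python sketch. It assumes 1-based inclusive alignment coordinates and treats the alignment as one contiguous span; how the pipeline handles gaps or multiple alignment segments is not stated in this report. The function name and the example alignment are hypothetical. The gene counts are taken from the paragraph above.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percent of a sequence spanned by an alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    aligned_length = aln_end - aln_start + 1
    return 100.0 * aligned_length / seq_length


# Hypothetical alignment: residues 11-200 of a 200-aa annotated protein (query)
# aligned to residues 1-190 of a 380-aa Arabidopsis protein (target).
query_cov = coverage_percent(11, 200, 200)
target_cov = coverage_percent(1, 190, 380)
print(query_cov, target_cov)

# Share of coding genes reaching each query-coverage threshold,
# using the counts given in this report.
coding_genes = 25_374
share_50 = round(100.0 * 22_808 / coding_genes, 1)
share_95 = round(100.0 * 10_851 / coding_genes, 1)
print(share_50, share_95)
```

In this example the same alignment gives a high query coverage and a much lower target coverage, because the target protein is almost twice as long as the query.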

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
daSolDulc1.2	GCF_947179165.1	42.62%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	13	13 (100.00%)	12 (92.31%)	99.42%	96.32%
Solanum lycopersicum known RefSeq (NM_/NR_)	2,957	2,785 (94.18%)	1,599 (54.08%)	91.07%	94.33%
Solanum lycopersicum Genbank	17,990	16,648 (92.54%)	9,297 (51.68%)	90.99%	93.46%
Solanum lycopersicum EST	300,934	194,803 (64.73%)	160,565 (53.36%)	91.94%	97.96%
Solanum known RefSeq (NM_/NR_)	56	45 (80.36%)	35 (62.50%)	91.82%	98.06%
Solanum Genbank	1,587	1,181 (74.42%)	652 (41.08%)	92.56%	97.29%
Solanum EST	177,787	102,951 (57.91%)	81,335 (45.75%)	91.24%	97.36%
Solanum tuberosum known RefSeq (NM_/NR_)	1,122	989 (88.15%)	737 (65.69%)	93.04%	97.15%
Solanum tuberosum Genbank	3,042	2,496 (82.05%)	1,596 (52.47%)	92.97%	97.14%
Solanum tuberosum EST	250,110	153,968 (61.56%)	127,355 (50.92%)	92.24%	97.49%
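
The percentages in the table above appear to be computed relative to the number of sequences retrieved from Entrez, for both the "aligned by Splign" and the "passed to Gnomon" columns. The Python sketch below recomputes them for three rows copied from the table; the helper name `percent` is illustrative only.

```python
# (source, retrieved from Entrez, aligned by Splign, passed to Gnomon)
rows = [
    ("Same-species Genbank", 13, 13, 12),
    ("Solanum lycopersicum known RefSeq (NM_/NR_)", 2_957, 2_785, 1_599),
    ("Solanum tuberosum EST", 250_110, 153_968, 127_355),
]


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


for source, retrieved, aligned, passed in rows:
    print(
        f"{source}: {percent(aligned, retrieved):.2f}% aligned, "
        f"{percent(passed, retrieved):.2f}% passed to Gnomon"
    )
```

The recomputed values match the percentages printed in the table for these rows.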

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,113,738,628	85%	20%	162,393
SAMEA104170671	NA	shoot (Solanum dulcamara, SAMEA104170671)	26,319,096	90%	18%	111,198
SAMEA2782814	NA	detached leaves (Solanum dulcamara, SAMEA2782814)	53,506,560	91%	21%	121,151
SAMEA7522079	NA	LEAF (Solanum dulcamara, hermaphrodite and monoecious, SAMEA7522079)	38,879,132	79%	32%	119,489
SAMN01990740	NA	leave (Solanum dulcamara, SAMN01990740)	9,944,477	86%	21%	73,982
SAMN01990741	NA	leave (Solanum dulcamara, SAMN01990741)	8,375,559	63%	20%	65,936
SAMN01990742	NA	leave (Solanum dulcamara, SAMN01990742)	12,588,908	88%	18%	86,187
SAMN01990743	NA	leave (Solanum dulcamara, SAMN01990743)	17,023,315	69%	17%	82,119
SAMN01990744	NA	leave (Solanum dulcamara, SAMN01990744)	41,251,639	88%	24%	127,865
SAMN01990745	NA	leave (Solanum dulcamara, SAMN01990745)	38,244,888	88%	24%	127,160
SAMN01990746	NA	adventitious root primordia (Solanum dulcamara, SAMN01990746)	42,400,641	89%	9%	114,896
SAMN01990747	NA	adventitious root primordia (Solanum dulcamara, SAMN01990747)	45,733,130	89%	9%	116,796
SAMN01990748	NA	adventitious root primordia (Solanum dulcamara, SAMN01990748)	56,536,205	92%	9%	119,509
SAMN01990749	NA	adventitious root primordia (Solanum dulcamara, SAMN01990749)	37,684,429	92%	10%	113,144
SAMN01990750	NA	adventitious root primordia (Solanum dulcamara, SAMN01990750)	40,316,070	91%	10%	113,349
SAMN01990751	NA	adventitious root primordia (Solanum dulcamara, SAMN01990751)	43,327,773	90%	10%	112,389
SAMN01990752	NA	stem (Solanum dulcamara, SAMN01990752)	49,301,453	90%	9%	114,015
SAMN01990753	NA	stem (Solanum dulcamara, SAMN01990753)	45,914,734	92%	9%	114,304
SAMN01990754	NA	stem (Solanum dulcamara, SAMN01990754)	25,843,700	93%	9%	107,526
SAMN01990755	NA	stem (Solanum dulcamara, SAMN01990755)	47,390,168	74%	10%	105,108
SAMN01990756	NA	stem (Solanum dulcamara, SAMN01990756)	52,560,116	91%	9%	113,754
SAMN01990757	NA	stem (Solanum dulcamara, SAMN01990757)	29,284,322	92%	9%	105,615
SAMN03764702	26759219	leaf (Solanum dulcamara, SAMN03764702)	18,140,184	80%	25%	114,955
SAMN03764703	26759219	leaf (Solanum dulcamara, SAMN03764703)	12,817,842	80%	25%	107,060
SAMN03764704	26759219	leaf (Solanum dulcamara, SAMN03764704)	26,980,305	78%	25%	118,256
SAMN03764705	26759219	leaf (Solanum dulcamara, SAMN03764705)	25,606,009	80%	24%	117,244
SAMN03764706	26759219	leaf (Solanum dulcamara, SAMN03764706)	17,894,569	82%	25%	113,973
SAMN03764707	26759219	leaf (Solanum dulcamara, SAMN03764707)	20,682,798	78%	25%	113,961
SAMN03764708	26759219	leaf (Solanum dulcamara, SAMN03764708)	20,276,096	79%	25%	114,310
SAMN03764727	26759219	leaf (Solanum dulcamara, SAMN03764727)	18,359,821	79%	24%	113,492
SAMN03764728	26759219	leaf (Solanum dulcamara, SAMN03764728)	16,213,122	72%	24%	110,226
SAMN03764729	26759219	leaf (Solanum dulcamara, SAMN03764729)	20,396,077	69%	24%	112,140
SAMN03764730	26759219	leaf (Solanum dulcamara, SAMN03764730)	14,229,264	82%	23%	108,251
SAMN03764731	26759219	leaf (Solanum dulcamara, SAMN03764731)	19,034,110	70%	24%	111,239
SAMN03764732	26759219	leaf (Solanum dulcamara, SAMN03764732)	17,604,864	77%	24%	109,282
SAMN03764733	26759219	leaf (Solanum dulcamara, SAMN03764733)	11,420,777	80%	24%	106,824
SAMN03764734	26759219	leaf (Solanum dulcamara, SAMN03764734)	16,610,933	72%	24%	110,440
SAMN03764735	26759219	leaf (Solanum dulcamara, SAMN03764735)	24,625,542	78%	24%	117,346
SAMN03764736	26759219	leaf (Solanum dulcamara, SAMN03764736)	22,248,904	79%	24%	116,940
SAMN03764737	26759219	leaf (Solanum dulcamara, SAMN03764737)	22,020,787	76%	25%	116,962
SAMN28540075	NA	Young leaves, mature leaves, young fruits, mature fruits, flowers, flower buds, roots, sepals, leaf buds and stems (Solanum dulcamara, Mixed age, SAMN28540075)	1,004,782,672	86%	23%	151,292

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
ERR634926	ERX591478	ERP007073	SAMEA2782814	53,506,560	91%	21%
ERR2040627	ERX2099684	ERP023948	SAMEA104170671	26,319,096	90%	18%
ERR6688685	ERX6313454	ERP131576	SAMEA7522079	38,879,132	79%	32%
SRR822337	SRX256697	SRP020226	SAMN01990740	9,944,477	86%	21%
SRR808085	SRX256712	SRP020226	SAMN01990741	8,375,559	63%	20%
SRR808064	SRX256713	SRP020226	SAMN01990742	12,588,908	88%	18%
SRR808043	SRX256714	SRP020226	SAMN01990743	17,023,315	69%	17%
SRR799447	SRX256717	SRP020226	SAMN01990744	41,251,639	88%	24%
SRR799443	SRX256720	SRP020226	SAMN01990745	38,244,888	88%	24%
SRR799323	SRX256721	SRP020226	SAMN01990746	42,400,641	89%	9%
SRR799322	SRX256722	SRP020226	SAMN01990747	45,733,130	89%	9%
SRR799321	SRX256723	SRP020226	SAMN01990748	56,536,205	92%	9%
SRR799320	SRX256724	SRP020226	SAMN01990749	37,684,429	92%	10%
SRR799319	SRX256725	SRP020226	SAMN01990750	40,316,070	91%	10%
SRR799318	SRX256730	SRP020226	SAMN01990751	43,327,773	90%	10%
SRR799317	SRX256733	SRP020226	SAMN01990752	49,301,453	90%	9%
SRR799316	SRX256734	SRP020226	SAMN01990753	45,914,734	92%	9%
SRR799315	SRX256735	SRP020226	SAMN01990754	25,843,700	93%	9%
SRR799314	SRX256736	SRP020226	SAMN01990755	47,390,168	74%	10%
SRR799311	SRX256737	SRP020226	SAMN01990756	52,560,116	91%	9%
SRR799310	SRX256738	SRP020226	SAMN01990757	29,284,322	92%	9%
SRR2056036	SRX1053391	SRP059232	SAMN03764702	18,140,184	80%	25%
SRR2056037	SRX1053392	SRP059232	SAMN03764703	12,817,842	80%	25%
SRR2056038	SRX1053393	SRP059232	SAMN03764704	26,980,305	78%	25%
SRR2056039	SRX1053394	SRP059232	SAMN03764705	25,606,009	80%	24%
SRR2056040	SRX1053395	SRP059232	SAMN03764706	17,894,569	82%	25%
SRR2056041	SRX1053396	SRP059232	SAMN03764707	20,682,798	78%	25%
SRR2056042	SRX1053397	SRP059232	SAMN03764708	20,276,096	79%	25%
SRR2056025	SRX1053380	SRP059232	SAMN03764727	18,359,821	79%	24%
SRR2056026	SRX1053381	SRP059232	SAMN03764728	16,213,122	72%	24%
SRR2056027	SRX1053382	SRP059232	SAMN03764729	20,396,077	69%	24%
SRR2056028	SRX1053383	SRP059232	SAMN03764730	14,229,264	82%	23%
SRR2056029	SRX1053384	SRP059232	SAMN03764731	19,034,110	70%	24%
SRR2056030	SRX1053385	SRP059232	SAMN03764732	17,604,864	77%	24%
SRR2056031	SRX1053386	SRP059232	SAMN03764733	11,420,777	80%	24%
SRR2056032	SRX1053387	SRP059232	SAMN03764734	16,610,933	72%	24%
SRR2056033	SRX1053388	SRP059232	SAMN03764735	24,625,542	78%	24%
SRR2056034	SRX1053389	SRP059232	SAMN03764736	22,248,904	79%	24%
SRR2056035	SRX1053390	SRP059232	SAMN03764737	22,020,787	76%	25%
SRR20722277	SRX16742740	SRP388917	SAMN28540075	160,644,492	81%	21%
SRR20722276	SRX16742741	SRP388917	SAMN28540075	136,840,572	91%	24%
SRR20722275	SRX16742742	SRP388917	SAMN28540075	342,005,750	83%	24%
SRR20722274	SRX16742743	SRP388917	SAMN28540075	124,092,678	91%	20%
SRR20722273	SRX16742744	SRP388917	SAMN28540075	118,319,992	91%	24%
SRR20722272	SRX16742745	SRP388917	SAMN28540075	122,879,188	89%	24%
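
The per-run and per-sample tables are related by simple aggregation: the read count of a sample is the sum of the read counts of its runs. The Python sketch below illustrates this for two samples, with the values copied by hand from the per-run table; the data structure is illustrative and does not reflect how the pipeline stores these statistics.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the per-run table.
runs = [
    ("ERR634926", "SAMEA2782814", 53_506_560),
    ("SRR20722277", "SAMN28540075", 160_644_492),
    ("SRR20722276", "SAMN28540075", 136_840_572),
    ("SRR20722275", "SAMN28540075", 342_005_750),
    ("SRR20722274", "SAMN28540075", 124_092_678),
    ("SRR20722273", "SAMN28540075", 118_319_992),
    ("SRR20722272", "SAMN28540075", 122_879_188),
]

# Sum the reads of all runs belonging to the same sample.
reads_per_sample = defaultdict(int)
for run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, reads in sorted(reads_per_sample.items()):
    print(f"{sample}\t{reads:,}")
```

For SAMN28540075, the six runs add up to the 1,004,782,672 reads listed in the per-sample table.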

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Arabidopsis thaliana known RefSeq (NP_)	48,147	41,659 (86.52%)	41,659 (86.52%)	67.09%	72.49%
Solanaceae GenBank	7,363	7,020 (95.34%)	7,020 (95.34%)	74.72%	86.37%
Solanaceae known RefSeq (NP_)	5,564	5,468 (98.27%)	5,468 (98.27%)	76.49%	87.66%
Solanum lycopersicum high-quality model RefSeq (XP_)	19,675	19,032 (96.73%)	19,032 (96.73%)	75.43%	86.36%
Same-species GenBank	10	10 (100.00%)	10 (100.00%)	79.00%	91.08%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
