NCBI Trifolium pratense Annotation Release 100

The RefSeq genome records for Trifolium pratense were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Trifolium pratense Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Jan 12 2022
Date of submission of annotation to the public databases: Jan 15 2022
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ARS_RC_1.1	GCF_020283565.1	USDA ARS	10-13-2021	Reference	9 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ARS_RC_1.1
Genes and pseudogenes	43,682
protein-coding	33,610
non-coding	7,722
Transcribed pseudogenes	3
Non-transcribed pseudogenes	2,347
genes with variants	8,817
Immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	47,824
fully-supported	42,171
with > 5% ab initio	4,484
partial	188
with filled gap(s)	5
known RefSeq (NM_)	0
model RefSeq (XM_)	47,824
non-coding RNAs	16,255
fully-supported	12,154
with > 5% ab initio	0
partial	4
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	15,411
pseudo transcripts	3
fully-supported	3
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	3
CDSs	47,931
fully-supported	42,171
with > 5% ab initio	4,610
partial	188
with major correction(s)	145
known RefSeq (NP_)	0
model RefSeq (XP_)	47,931
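
As a quick consistency check on the gene rows of the table above, the four gene categories can be summed and compared with the reported "Genes and pseudogenes" total. This is a minimal sketch: the variable names are illustrative, and it assumes the four rows are non-overlapping subcategories of the total, which the table layout suggests but does not state.

```python
# Counts copied from the ARS_RC_1.1 column of the feature-count table.
# Assumption: these four rows are disjoint subcategories of the total.
gene_counts = {
    "protein-coding": 33_610,
    "non-coding": 7_722,
    "transcribed pseudogenes": 3,
    "non-transcribed pseudogenes": 2_347,
}
reported_total = 43_682  # "Genes and pseudogenes"

computed_total = sum(gene_counts.values())
genes_without_pseudogenes = gene_counts["protein-coding"] + gene_counts["non-coding"]

print("categories sum to reported total:", computed_total == reported_total)
print("genes excluding pseudogenes:", genes_without_pseudogenes)
```

The second figure, 41,332, is the same as the "Genes" count in the feature-length table, which excludes pseudogenes.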

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	41,332	3,810	2,882	58	232,355
All transcripts	64,079	1,946	1,666	58	28,170
mRNA	47,824	2,003	1,719	171	17,138
misc_RNA	4,089	2,420	2,038	190	13,619
tRNA	837	74	73	58	89
lncRNA	8,067	1,995	1,406	98	28,170
snoRNA	549	106	106	59	232
snRNA	311	150	151	98	201
rRNA	2,402	1,144	156	105	3,442
Single-exon transcripts	5,710	1,368	1,135	171	8,917
coding transcripts (NM_/XM_)	5,710	1,368	1,135	171	8,917
CDSs	47,931	1,446	1,185	93	16,638
Exons	229,399	348	177	1	18,090
in coding transcripts (NM_/XM_)	203,770	342	173	1	9,225
in non-coding transcripts (NR_/XR_)	32,142	359	179	2	18,090
Introns	178,659	573	222	30	94,842
in coding transcripts (NM_/XM_)	161,528	560	217	30	94,842
in non-coding transcripts (NR_/XR_)	23,571	641	253	30	46,558
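
The table above is tab-separated with thousands separators, so it can be read with a few lines of code. The sketch below parses only the transcript-class rows (name and count columns, copied from the table) and checks that they add up to the "All transcripts" row. The helper name `parse_counts` is hypothetical and not part of any NCBI tool.

```python
# Name and count columns of the transcript-class rows, copied from the table.
TRANSCRIPT_ROWS = (
    "mRNA\t47,824\n"
    "misc_RNA\t4,089\n"
    "tRNA\t837\n"
    "lncRNA\t8,067\n"
    "snoRNA\t549\n"
    "snRNA\t311\n"
    "rRNA\t2,402\n"
)
ALL_TRANSCRIPTS = 64_079  # "All transcripts" row


def parse_counts(text):
    """Map feature name -> integer count, stripping thousands separators."""
    counts = {}
    for line in text.splitlines():
        name, count = line.split("\t")
        counts[name] = int(count.replace(",", ""))
    return counts


counts = parse_counts(TRANSCRIPT_ROWS)
total = sum(counts.values())
non_coding = total - counts["mRNA"]

print("classes sum to all transcripts:", total == ALL_TRANSCRIPTS)
print("non-mRNA transcripts:", non_coding)
```

The non-mRNA figure equals the 16,255 non-coding RNAs listed in the feature counts. The exon and intron sub-rows are not additive in the same way: the coding and non-coding sub-counts add up to more than the overall row.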

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.57	1	1	50
Number of exons per transcript	6.34	5	1	78

BUSCO analysis of gene annotation

BUSCO v4.1.4 (Simão et al. 2015, PMID: 26059717) was run in "protein" mode on the annotated gene set, using the longest protein per gene and the fabales_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and presented in BUSCO notation (C: complete [S: single-copy, D: duplicated], F: fragmented, M: missing, n: number of genes used).
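
To illustrate how the BUSCO notation is put together, the sketch below formats four category counts into a summary string. The function name and the counts are hypothetical and chosen only for illustration: they are not the BUSCO results for this annotation, and the function is not part of BUSCO itself.

```python
def busco_notation(single, duplicated, fragmented, missing):
    """Format BUSCO category counts as C:..[S:..,D:..],F:..,M:..,n:.."""
    n = single + duplicated + fragmented + missing
    complete = single + duplicated  # C is the sum of S and D

    def pct(count):
        return f"{100 * count / n:.1f}%"

    return (
        f"C:{pct(complete)}[S:{pct(single)},D:{pct(duplicated)}],"
        f"F:{pct(fragmented)},M:{pct(missing)},n:{n}"
    )


# Hypothetical counts, NOT taken from this report.
example = busco_notation(single=900, duplicated=50, fragmented=20, missing=30)
print(example)  # C:95.0%[S:90.0%,D:5.0%],F:2.0%,M:3.0%,n:1000
```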

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 33,503 coding genes, 28,964 genes had a protein with an alignment covering 50% or more of the query and 12,800 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
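
The two definitions above reduce to the same ratio applied to different sequences. The following sketch assumes 1-based inclusive alignment coordinates and treats the aligned span as a single interval. The report does not say how the pipeline handles gaps or multiple alignment segments, so this is an approximation, and the function name and example values are hypothetical.

```python
def coverage_percent(aln_start, aln_end, seq_length):
    """Percentage of a sequence spanned by an alignment.

    Assumes 1-based inclusive coordinates and one contiguous aligned span.
    """
    aligned_length = aln_end - aln_start + 1
    return 100.0 * aligned_length / seq_length


# Hypothetical alignment: a 400-residue annotated protein (query) aligned
# over residues 21-380 to residues 1-360 of a 450-residue target protein.
query_coverage = coverage_percent(21, 380, 400)
target_coverage = coverage_percent(1, 360, 450)

print(query_coverage)   # 90.0
print(target_coverage)  # 80.0
```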

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins
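
A cumulative curve of this kind counts, for each threshold, the genes whose best alignment reaches at least that coverage. The sketch below shows the counting step on a small list of hypothetical coverage values, then expresses the two data points quoted in this section as shares of the 33,503 coding genes. The function name and the coverage list are assumptions made for illustration.

```python
def genes_at_or_above(coverages, threshold):
    """Number of genes whose coverage is at least `threshold` percent."""
    return sum(1 for coverage in coverages if coverage >= threshold)


# Hypothetical per-gene query coverages, NOT taken from this report.
hypothetical_coverages = [12.0, 48.5, 50.0, 73.2, 95.0, 99.1, 100.0]
curve = {t: genes_at_or_above(hypothetical_coverages, t) for t in (0, 50, 95)}
print(curve)  # {0: 7, 50: 5, 95: 3}

# The two points reported above, as percentages of all coding genes.
coding_genes = 33_503
share_50 = 100 * 28_964 / coding_genes
share_95 = 100 * 12_800 / coding_genes
print(f"{share_50:.2f}% of genes at >= 50% query coverage")
print(f"{share_95:.2f}% of genes at >= 95% query coverage")
```

With the reported counts, roughly 86% of coding genes reach the 50% threshold and roughly 38% reach the 95% threshold.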

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
ARS_RC_1.1	GCF_020283565.1	35.71%
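
A masked percentage of this kind can be estimated from a soft-masked sequence, in which masked bases are written in lowercase. The sketch below assumes that convention and a sequence already loaded as a string. It is not how the pipeline itself computes the figure, and the function name and toy sequence are hypothetical.

```python
def percent_soft_masked(sequence):
    """Percentage of bases written in lowercase (soft-masked)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


# Toy sequence for illustration: 4 of 10 bases are lowercase.
toy_percent = percent_soft_masked("ACGTacgtAC")
print(toy_percent)  # 40.0
```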

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	191	182 (95.29%)	170 (89.01%)	99.18%	97.82%
Same-species EST	38,109	32,298 (84.75%)	30,548 (80.16%)	99.14%	99.16%
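
The percentages in the table above are each count divided by the number of sequences retrieved from Entrez. As a minimal sketch, the code below parses the "count (percent)" cells and recomputes each percentage. The regular expression and the helper name `parse_cell` are assumptions made for illustration.

```python
import re

# Matches cells such as "32,298 (84.75%)".
CELL = re.compile(r"([\d,]+) \(([\d.]+)%\)")


def parse_cell(cell):
    """Return (count, reported_percent) from a 'count (percent%)' cell."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


# (source, retrieved, aligned by Splign, passed to Gnomon), from the table.
rows = [
    ("Same-species GenBank", 191, "182 (95.29%)", "170 (89.01%)"),
    ("Same-species EST", 38_109, "32,298 (84.75%)", "30,548 (80.16%)"),
]

checks = []
for source, retrieved, aligned_cell, passed_cell in rows:
    for stage, cell in (("aligned", aligned_cell), ("passed", passed_cell)):
        count, reported = parse_cell(cell)
        recomputed = round(100 * count / retrieved, 2)
        checks.append((source, stage, reported, recomputed))
        print(f"{source} {stage}: reported {reported}%, recomputed {recomputed}%")
```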

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,332,507,265	83%	29%	221,710
SAMEA2241758	NA	trifolium transcriptome (Trifolium pratense, SAMEA2241758)	75,161,408	63%	25%	152,178
SAMN02356536	24912738	Leaf tissue (Trifolium pratense, SAMN02356536)	45,081,744	83%	24%	132,937
SAMN02356537	24912738	Leaf tissue (Trifolium pratense, SAMN02356537)	53,464,948	89%	24%	147,114
SAMN02356538	24912738	Leaf tissue (Trifolium pratense, SAMN02356538)	47,085,342	89%	25%	156,007
SAMN02356539	24912738	Leaf tissue (Trifolium pratense, SAMN02356539)	54,738,278	88%	25%	156,727
SAMN04244347	NA	Leaf (Trifolium pratense, Two months, SAMN04244347)	7,535,743	60%	34%	94,182
SAMN04244349	NA	Two weeks after seven to eight leaf stage for root and at flowering stage for the flower tissue, Root and flower (Trifolium pratense, Two months, SAMN04244349)	2,009,490	57%	34%	79,441
SAMN04244350	NA	Root (Trifolium pratense, Two months, SAMN04244350)	3,175,828	60%	35%	93,368
SAMN06477793	NA	leaf (Trifolium pratense, SAMN06477793)	51,470,150	84%	37%	159,823
SAMN06477794	NA	leaf (Trifolium pratense, SAMN06477794)	52,074,244	86%	34%	155,652
SAMN06477795	NA	leaf (Trifolium pratense, SAMN06477795)	51,390,604	87%	35%	161,706
SAMN06477796	NA	leaf (Trifolium pratense, SAMN06477796)	52,909,440	84%	37%	152,249
SAMN06477797	NA	leaf (Trifolium pratense, SAMN06477797)	51,135,388	87%	34%	163,736
SAMN06477798	NA	leaf (Trifolium pratense, SAMN06477798)	55,063,838	87%	34%	165,086
SAMN07969538	NA	Nodules (Trifolium pratense, 30 days post inoculation, SAMN07969538)	157,770,118	89%	30%	182,491
SAMN12612961	NA	shoot and leaf (Trifolium pratense, SAMN12612961)	96,781,244	59%	25%	155,022
SAMN12612962	NA	shoot and leaf (Trifolium pratense, SAMN12612962)	108,962,816	78%	24%	147,759
SAMN12612963	NA	shoot and leaf (Trifolium pratense, SAMN12612963)	93,803,926	83%	25%	153,733
SAMN12612964	NA	shoot and leaf (Trifolium pratense, SAMN12612964)	117,096,928	83%	25%	142,103
SAMN12612965	NA	shoot and leaf (Trifolium pratense, SAMN12612965)	142,358,660	85%	25%	159,462
SAMN12612966	NA	shoot and leaf (Trifolium pratense, SAMN12612966)	116,290,398	84%	24%	153,038
SAMN12612967	NA	shoot and leaf (Trifolium pratense, SAMN12612967)	98,922,720	87%	25%	155,018
SAMN12612968	NA	shoot and leaf (Trifolium pratense, SAMN12612968)	34,786,246	85%	25%	152,834
SAMN12612969	NA	shoot and leaf (Trifolium pratense, 124-130das, SAMN12612969)	114,926,152	82%	24%	166,411
SAMN12612970	NA	shoot and leaf (Trifolium pratense, 124-130das, SAMN12612970)	97,454,512	83%	24%	160,664
SAMN12612971	NA	shoot and leaf (Trifolium pratense, 124-130das, SAMN12612971)	89,359,764	78%	24%	162,882
SAMN12612972	NA	shoot and leaf (Trifolium pratense, 124-130das, SAMN12612972)	105,338,658	81%	24%	163,690
SAMN15856878	NA	Root (Trifolium pratense, 2 mo, SAMN15856878)	41,207,976	87%	38%	152,071
SAMN15856879	NA	Root (Trifolium pratense, 2 mo, SAMN15856879)	47,760,394	88%	37%	156,780
SAMN15856880	NA	Root (Trifolium pratense, 2 mo, SAMN15856880)	52,729,766	89%	36%	152,979
SAMN15856881	NA	Root (Trifolium pratense, 2 mo, SAMN15856881)	45,187,696	87%	35%	149,896
SAMN15856882	NA	Root (Trifolium pratense, 2 mo, SAMN15856882)	42,148,554	88%	37%	155,094
SAMN15856883	NA	Root (Trifolium pratense, 2 mo, SAMN15856883)	42,011,462	88%	37%	155,099
SAMN15856884	NA	Root (Trifolium pratense, 2 mo, SAMN15856884)	45,708,206	87%	38%	161,454
SAMN15856885	NA	Root (Trifolium pratense, 2 mo, SAMN15856885)	39,604,624	87%	42%	154,338

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
ERR351508	ERX324291	ERP004049	SAMEA2241758	75,161,408	63%	25%
SRR987949	SRX351791	SRP029945	SAMN02356536	45,081,744	83%	24%
SRR987951	SRX351917	SRP029945	SAMN02356537	53,464,948	89%	24%
SRR987952	SRX351918	SRP029945	SAMN02356538	47,085,342	89%	25%
SRR987953	SRX351919	SRP029945	SAMN02356539	54,738,278	88%	25%
SRR2899578	SRX1418468	SRP065808	SAMN04244347	3,898,545	55%	34%
SRR2899580	SRX1418475	SRP065808	SAMN04244347	1,254,053	58%	34%
SRR2899588	SRX1418477	SRP065808	SAMN04244347	1,356,687	76%	35%
SRR2899654	SRX1418498	SRP065808	SAMN04244347	1,026,458	62%	35%
SRR2899624	SRX1418479	SRP065808	SAMN04244349	2,009,490	57%	34%
SRR2899628	SRX1418483	SRP065808	SAMN04244350	1,247,193	72%	38%
SRR2899631	SRX1418486	SRP065808	SAMN04244350	1,928,635	52%	33%
SRR5312545	SRX2612371	SRP101379	SAMN06477793	51,470,150	84%	37%
SRR5312544	SRX2612370	SRP101379	SAMN06477794	52,074,244	86%	34%
SRR5312543	SRX2612369	SRP101379	SAMN06477795	51,390,604	87%	35%
SRR5312542	SRX2612368	SRP101379	SAMN06477796	52,909,440	84%	37%
SRR5312541	SRX2612367	SRP101379	SAMN06477797	51,135,388	87%	34%
SRR5312540	SRX2612366	SRP101379	SAMN06477798	55,063,838	87%	34%
SRR6251246	SRX3358188	SRP123540	SAMN07969538	157,770,118	89%	30%
SRR10008577	SRX6746842	SRP218961	SAMN12612961	96,781,244	59%	25%
SRR10008576	SRX6746843	SRP218961	SAMN12612962	108,962,816	78%	24%
SRR10008575	SRX6746844	SRP218961	SAMN12612963	93,803,926	83%	25%
SRR10008574	SRX6746845	SRP218961	SAMN12612964	117,096,928	83%	25%
SRR10008573	SRX6746846	SRP218961	SAMN12612965	142,358,660	85%	25%
SRR10008572	SRX6746847	SRP218961	SAMN12612966	116,290,398	84%	24%
SRR10008571	SRX6746848	SRP218961	SAMN12612967	98,922,720	87%	25%
SRR10008570	SRX6746849	SRP218961	SAMN12612968	34,786,246	85%	25%
SRR10008569	SRX6746850	SRP218961	SAMN12612969	114,926,152	82%	24%
SRR10008568	SRX6746851	SRP218961	SAMN12612970	97,454,512	83%	24%
SRR10008567	SRX6746852	SRP218961	SAMN12612971	89,359,764	78%	24%
SRR10008578	SRX6746841	SRP218961	SAMN12612972	105,338,658	81%	24%
SRR12476494	SRX8970456	SRP278075	SAMN15856878	41,207,976	87%	38%
SRR12476493	SRX8970457	SRP278075	SAMN15856879	47,760,394	88%	37%
SRR12476492	SRX8970458	SRP278075	SAMN15856880	52,729,766	89%	36%
SRR12476499	SRX8970451	SRP278075	SAMN15856881	45,187,696	87%	35%
SRR12476498	SRX8970452	SRP278075	SAMN15856882	42,148,554	88%	37%
SRR12476497	SRX8970453	SRP278075	SAMN15856883	42,011,462	88%	37%
SRR12476496	SRX8970454	SRP278075	SAMN15856884	45,708,206	87%	38%
SRR12476495	SRX8970455	SRP278075	SAMN15856885	39,604,624	87%	42%
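
Most samples have a single run, but a few have several, and their read counts in the by-sample table are the sums over their runs. The sketch below groups the SRP065808 runs from the table above by sample. The variable names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads) for the SRP065808 runs, from the by-run table.
runs = [
    ("SRR2899578", "SAMN04244347", 3_898_545),
    ("SRR2899580", "SAMN04244347", 1_254_053),
    ("SRR2899588", "SAMN04244347", 1_356_687),
    ("SRR2899654", "SAMN04244347", 1_026_458),
    ("SRR2899624", "SAMN04244349", 2_009_490),
    ("SRR2899628", "SAMN04244350", 1_247_193),
    ("SRR2899631", "SAMN04244350", 1_928_635),
]

# Sum read counts over all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, reads in sorted(reads_per_sample.items()):
    print(f"{sample}\t{reads:,}")
```

The totals are 7,535,743, 2,009,490 and 3,175,828 reads, the same as the corresponding rows of the by-sample table.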

SRA Long Read Alignment Statistics

The following long-read RNA-Seq data (PacBio, Oxford Nanopore, 454, or other long-read sequencing technologies) from the Sequence Read Archive were also used for gene prediction:

Run	Sample	Number of reads	Number (%) of sequences aligned by Minimap2	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
All	NA	8,329,742	6,832,684 (82.02%)	6,329,694 (75.98%)	99.51%	96.07%
SRR15433788	SAMN20750280	8,329,742	6,832,684 (82.02%)	6,329,694 (75.98%)	99.51%	96.07%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Arabidopsis thaliana known RefSeq (NP_)	48,148	41,823 (86.86%)	41,823 (86.86%)	66.96%	72.90%
Fabaceae GenBank	43,244	40,635 (93.97%)	40,635 (93.97%)	73.71%	85.88%
Fabaceae known RefSeq (NP_)	8,350	8,159 (97.71%)	8,159 (97.71%)	72.71%	84.51%
Same-species GenBank	164	161 (98.17%)	161 (98.17%)	80.35%	87.85%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-100
