NCBI Equus quagga Updated Annotation Release 100.20220407

The RefSeq genome records for Equus quagga were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

Updated Annotation Release 100.20220407 is an update of NCBI Equus quagga Annotation Release 100. The known RefSeq transcripts (with NM_ and NR_ prefixes) that were current on Apr 7 2022 were placed on the genome and used to update the annotated features. In addition, model RefSeq sequences (with XM_ and XR_ prefixes) that were predicted in the last full annotation (Annotation Release 100) and were still current on Apr 7 2022 were included in the updated annotation. These models were not recalculated for this update. For more information on the evidence used to generate the model RefSeq sequences, please consult the report for NCBI Equus quagga Annotation Release 100.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Equus quagga Updated Annotation Release 100.20220407

Annotation release ID: 100.20220407
Date of Entrez queries for transcripts and proteins: Apr 7 2022
Date of submission of annotation to the public databases: Apr 11 2022
Software version: 9.0
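
The release ID above appears to combine the base release number (100) with the date of the Entrez queries (Apr 7 2022). The following is a minimal sketch of splitting such an ID, assuming the layout is <base release>.<YYYYMMDD>; that layout is inferred from this report only, and the function name parse_release_id is hypothetical.

```python
from datetime import date, datetime


def parse_release_id(release_id: str) -> tuple[int, date]:
    """Split an ID such as '100.20220407' into (base release, query date).

    Assumes the '<base>.<YYYYMMDD>' layout seen in this report.
    """
    base, _, stamp = release_id.partition(".")
    return int(base), datetime.strptime(stamp, "%Y%m%d").date()


base_release, query_date = parse_release_id("100.20220407")
print(base_release, query_date.isoformat())  # 100 2022-04-07
```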

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
UCLA_HA_Equagga_1.0	GCF_021613505.1	University of California Los Angeles	01-31-2022	Reference	23 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	UCLA_HA_Equagga_1.0
Genes and pseudogenes	27,599
protein-coding	21,068
non-coding	4,094
Transcribed pseudogenes	0
Non-transcribed pseudogenes	2,227
genes with variants	9,816
Immunoglobulin/T-cell receptor gene segments	191
other	19
mRNAs	48,850
fully-supported	46,749
with > 5% ab initio	1,012
partial	559
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	48,850
non-coding RNAs	7,486
fully-supported	6,089
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	7,009
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	49,054
fully-supported	46,749
with > 5% ab initio	1,153
partial	576
with major correction(s)	725
known RefSeq (NP_)	0
model RefSeq (XP_)	48,863
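
As a quick consistency check on the gene rows of the table above, the sketch below sums the gene categories and compares the result with the reported "Genes and pseudogenes" total. It assumes that "genes with variants" is an overlapping subset rather than a separate category, so that row is left out of the sum; the variable names are illustrative.

```python
# Gene categories copied from the Feature counts table.
gene_counts = {
    "protein-coding": 21068,
    "non-coding": 4094,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 2227,
    "Ig/TCR gene segments": 191,
    "other": 19,
}
reported_total = 27599  # "Genes and pseudogenes" row

# "genes with variants" (9,816) is assumed to overlap the categories
# above, so it is not part of the sum.
category_sum = sum(gene_counts.values())
print(category_sum, category_sum == reported_total)  # 27599 True
```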

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	25,181	42,276	13,205	49	2,273,580
All transcripts	56,336	3,442	2,766	49	105,289
mRNA	48,850	3,657	2,966	204	105,289
misc_RNA	2,131	3,397	2,715	204	16,017
tRNA	475	74	73	60	87
lncRNA	3,958	1,987	1,226	120	15,917
snoRNA	446	112	95	49	730
snRNA	438	113	107	61	199
rRNA	19	684	153	119	4,693
Single-exon transcripts	2,221	1,249	948	270	12,900
coding transcripts (NM_/XM_)	2,221	1,249	948	270	12,900
CDSs	48,863	2,089	1,518	183	104,052
Exons	252,170	316	137	2	24,665
in coding transcripts (NM_/XM_)	238,294	304	136	2	24,665
in non-coding transcripts (NR_/XR_)	26,789	363	139	4	12,334
Introns	226,542	5,691	1,421	30	1,019,475
in coding transcripts (NM_/XM_)	216,541	5,696	1,417	30	1,019,475
in non-coding transcripts (NR_/XR_)	22,527	4,924	1,421	31	358,246
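
The counts in the Feature lengths table can be cross-checked against the Feature counts table with simple arithmetic. The sketch below is illustrative only: it uses the figures printed in this report, and the interpretation in the comments is an assumption rather than a statement from the pipeline documentation.

```python
all_transcripts = 56336  # "All transcripts" row
by_type = {
    "mRNA": 48850,
    "misc_RNA": 2131,
    "tRNA": 475,
    "lncRNA": 3958,
    "snoRNA": 446,
    "snRNA": 438,
    "rRNA": 19,
}

# Transcripts that are not mRNAs; matches the "non-coding RNAs" count.
non_coding = all_transcripts - by_type["mRNA"]

# Transcripts covered by the RNA types listed in the table, and the
# remainder that the table does not break out by type.
listed = sum(by_type.values())
unlisted = all_transcripts - listed

# Genes excluding pseudogenes; the "Genes" row (25,181) equals
# protein-coding + non-coding + other, so Ig/TCR gene segments
# appear to be excluded as well (an inference from the numbers).
genes = 21068 + 4094 + 19

print(non_coding, listed, unlisted, genes)  # 7486 56317 19 25181
```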

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.26	1	1	50
Number of exons per transcript	12.27	9	1	323

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode with the laurasiatheria_odb10 lineage dataset on the annotated gene set, using the single longest protein for each gene. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
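
BUSCO notation is a one-line summary in which C is the percentage of complete BUSCOs (split into S, single-copy, and D, duplicated), F is fragmented, M is missing, and n is the number of BUSCO groups searched. The sketch below parses such a line. The example string is hypothetical and is not the result for this annotation; the names BUSCO_PATTERN and parse_busco are illustrative.

```python
import re

# Matches a summary such as "C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:1000".
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> dict:
    """Return the percentages as floats and n as an int."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"not BUSCO notation: {summary!r}")
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match.group("n"))
    return fields


# Hypothetical values chosen for illustration only.
example = "C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:1000"
result = parse_busco(example)
print(result)

# Complete BUSCOs are the sum of single-copy and duplicated ones.
print(abs(result["C"] - (result["S"] + result["D"])) < 0.1)
```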

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
