NCBI Phacochoerus africanus Annotation Release 100

The RefSeq genome records for Phacochoerus africanus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Phacochoerus africanus Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Apr 13 2022
Date of submission of annotation to the public databases: Apr 17 2022
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ROS_Pafr_v1	GCF_016906955.1	The Roslin Institute	02-19-2021	Reference	17 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ROS_Pafr_v1
Genes and pseudogenes	27,930
protein-coding	20,199
non-coding	4,638
Transcribed pseudogenes	43
Non-transcribed pseudogenes	2,921
genes with variants	10,162
Immunoglobulin/T-cell receptor gene segments	104
other	25
mRNAs	48,052
fully-supported	46,250
with > 5% ab initio	912
partial	68
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	48,052
non-coding RNAs	7,473
fully-supported	5,526
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	6,993
pseudo transcripts	43
fully-supported	34
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	43
CDSs	48,156
fully-supported	46,250
with > 5% ab initio	1,011
partial	69
with major correction(s)	1,086
known RefSeq (NP_)	0
model RefSeq (XP_)	48,052
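
The gene-level rows of the table above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the rows listed under "Genes and pseudogenes" are mutually exclusive sub-categories, and that "genes with variants" is an overlapping subset that is left out of the sum. The dictionary keys and the function name are illustrative, not part of the report.

```python
# Gene category counts copied from the Feature counts table (ROS_Pafr_v1).
# Assumption: these categories do not overlap; "genes with variants" is
# treated as a subset of the other rows and is therefore excluded.
gene_categories = {
    "protein-coding": 20_199,
    "non-coding": 4_638,
    "transcribed pseudogenes": 43,
    "non-transcribed pseudogenes": 2_921,
    "Ig/TCR gene segments": 104,
    "other": 25,
}
reported_total = 27_930  # "Genes and pseudogenes" row


def category_total(categories):
    """Return the sum of all category counts."""
    return sum(categories.values())


if __name__ == "__main__":
    computed = category_total(gene_categories)
    print(f"computed={computed:,} reported={reported_total:,} "
          f"match={computed == reported_total}")
```

Under these assumptions the six categories add up to the reported total for the "Genes and pseudogenes" row.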

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	24,862	43,976	13,458	41	2,144,556
All transcripts	55,525	3,027	2,442	41	107,172
mRNA	48,052	3,274	2,649	129	107,172
misc_RNA	1,611	2,883	2,461	112	16,074
tRNA	480	74	73	70	87
lncRNA	3,927	1,507	1,128	81	15,729
snoRNA	547	110	111	41	329
snRNA	880	113	107	60	198
rRNA	3	119	119	119	119
Single-exon transcripts	2,261	1,247	951	132	14,072
coding transcripts (NM_/XM_)	2,259	1,248	951	132	14,072
non-coding transcripts (NR_/XR_)	2	940	1,202	677	1,202
CDSs	48,052	2,020	1,479	96	105,918
Exons	243,570	293	139	1	17,106
in coding transcripts (NM_/XM_)	230,398	285	138	1	17,106
in non-coding transcripts (NR_/XR_)	24,155	314	143	3	12,976
Introns	216,901	5,921	1,552	30	902,955
in coding transcripts (NM_/XM_)	207,562	5,802	1,537	30	902,955
in non-coding transcripts (NR_/XR_)	20,001	6,490	1,712	40	541,679

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.26	1	1	50
Number of exons per transcript	11.95	9	1	347

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein for each gene and the cetartiodactyla_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
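
Selecting the longest protein for each gene, the step that prepares the input for BUSCO, can be sketched as follows. This is an illustration only, not the pipeline's code: the function name, the tuple layout, and the identifiers in the usage example are hypothetical, and ties are resolved by keeping the first protein seen, which is an assumption.

```python
def longest_protein_per_gene(proteins):
    """Pick one longest protein per gene.

    proteins: iterable of (gene_id, protein_id, sequence) tuples.
    Returns a dict mapping gene_id -> (protein_id, sequence).
    Ties keep the first protein encountered (an assumption).
    """
    best = {}
    for gene_id, protein_id, sequence in proteins:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, for illustration only.
    example = [
        ("geneA", "protA1", "MKT"),
        ("geneA", "protA2", "MKTLLV"),
        ("geneB", "protB1", "MAG"),
    ]
    for gene, (protein, seq) in longest_protein_per_gene(example).items():
        print(gene, protein, len(seq))
```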

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 20,199 coding genes, 19,860 had a protein with an alignment covering 50% or more of the query, and 17,368 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
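
The two coverage definitions above reduce to the same calculation applied to different sequences. The sketch below assumes a single contiguous aligned span given in 1-based inclusive coordinates; a real BLASTP hit may consist of several segments, which this simplification ignores. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence included in an alignment.

    Coordinates are 1-based and inclusive. Assumes one contiguous
    aligned span (a simplification of real BLASTP output).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if aligned_end < aligned_start:
        raise ValueError("aligned_end must not precede aligned_start")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


if __name__ == "__main__":
    # Hypothetical hit: a 400-residue annotated protein (query) aligned
    # over residues 11-390 to residues 1-380 of a 500-residue target.
    query_cov = coverage_percent(11, 390, 400)
    target_cov = coverage_percent(1, 380, 500)
    print(f"query coverage:  {query_cov:.1f}%")
    print(f"target coverage: {target_cov:.1f}%")
```

In this made-up example the query coverage reaches the 95% threshold while the target coverage does not, which is why the two measures are reported separately.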

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
ROS_Pafr_v1	GCF_016906955.1	35.90%
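
As a rough illustration of what a masked percentage measures, the sketch below computes the share of lowercase bases in a soft-masked sequence. It assumes soft masking (masked bases written in lowercase) and counts every character, including N, in the denominator. This report does not state how the pipeline derives the figure above, so this is an assumption rather than a description of the pipeline's method.

```python
def masked_percent(sequence):
    """Percentage of bases that are soft-masked (lowercase).

    Assumption: every character counts toward the total, including N.
    """
    total = len(sequence)
    if total == 0:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / total


if __name__ == "__main__":
    # Toy sequence, not taken from the assembly.
    print(f"{masked_percent('ACGTacgtAC'):.2f}%")
```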

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	21	21 (100.00%)	19 (90.48%)	99.13%	96.39%
Homo sapiens known RefSeq (NM_/NR_)	82,878	69,446 (83.79%)	16,760 (20.22%)	89.35%	81.86%
Homo sapiens GenBank	347,439	155,552 (44.77%)	54,279 (15.62%)	89.93%	89.77%
Artiodactyla known RefSeq (NM_/NR_)	21,940	20,713 (94.41%)	13,327 (60.74%)	93.25%	96.12%
Artiodactyla GenBank	77,167	67,912 (88.01%)	47,159 (61.11%)	96.11%	97.81%
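
The percentages in parentheses appear to be computed relative to the number of sequences retrieved from Entrez, for both the "aligned" and the "passed to Gnomon" columns. The sketch below checks this reading against two rows of the table; the helper name is illustrative, and rounding to two decimals is an assumption about how the report formats its values.

```python
def percent_of_retrieved(count, retrieved):
    """Return count as a percentage of retrieved, rounded to 2 decimals."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    return round(100.0 * count / retrieved, 2)


if __name__ == "__main__":
    # Values copied from the transcript alignment table above:
    # (source, retrieved, aligned by Splign, passed to Gnomon)
    rows = [
        ("Same-species", 21, 21, 19),
        ("Homo sapiens known RefSeq", 82_878, 69_446, 16_760),
    ]
    for source, retrieved, aligned, passed in rows:
        print(source,
              percent_of_retrieved(aligned, retrieved),
              percent_of_retrieved(passed, retrieved))
```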

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	4,205,461,454	85%	19%	244,443
SAMN07187855	cell culture (Phacochoerus africanus, SAMN07187855)	21,120,432	90%	27%	138,203
SAMN18321886	Caecum (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321886)	239,526,912	92%	19%	175,241
SAMN18321887	Diaphragm (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321887)	255,201,552	93%	20%	167,199
SAMN18321888	Heart atrium (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321888)	247,331,828	92%	17%	187,891
SAMN18321889	Kidney medulla (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321889)	246,038,250	91%	16%	199,727
SAMN18321890	Lymph node mesenteric (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321890)	245,875,806	88%	17%	199,507
SAMN18321891	Ovarian follicles (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321891)	242,493,098	91%	19%	196,589
SAMN18321892	Retina (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321892)	240,086,002	89%	17%	202,645
SAMN18321893	Salivary gland parotid (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321893)	270,091,686	91%	29%	186,675
SAMN18321894	Skel musc long dorsal (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321894)	260,415,656	93%	24%	165,766
SAMN18321895	Skin snout (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321895)	233,281,032	93%	22%	160,138
SAMN18321896	Spleen (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321896)	253,009,350	91%	18%	196,521
SAMN18321897	Stomach fundus mucosa (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321897)	272,262,202	92%	19%	188,016
SAMN18321898	Tonsils (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321898)	267,662,212	90%	19%	198,875
SAMN18321899	Uterus (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321899)	251,322,748	91%	19%	202,839
SAMN18321900	PBMC (Phacochoerus africanus, 8 years 2 month 25 day, female, SAMN18321900)	233,700,030	87%	14%	179,327
SAMN19245999	W10-Heart (Phacochoerus africanus, not determined, SAMN19245999)	139,806,630	30%	15%	156,974
SAMN19246000	W10-Liver (Phacochoerus africanus, not determined, SAMN19246000)	105,337,632	3%	1%	14,202
SAMN19246001	W10-Lung (Phacochoerus africanus, not determined, SAMN19246001)	98,657,338	46%	16%	144,364
SAMN19246002	W10-Spleen (Phacochoerus africanus, not determined, SAMN19246002)	82,241,058	49%	16%	162,677

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR5647657	SRX2880702	SRP108483	SAMN07187855	21,120,432	90%	27%
SRR13984010	SRX10361432	SRP310982	SAMN18321886	48,085,606	93%	19%
SRR13984003	SRX10361439	SRP310982	SAMN18321886	48,279,764	92%	19%
SRR13983986	SRX10361456	SRP310982	SAMN18321886	47,903,438	92%	19%
SRR13983970	SRX10361472	SRP310982	SAMN18321886	45,804,990	92%	19%
SRR13983953	SRX10361489	SRP310982	SAMN18321886	49,453,114	92%	19%
SRR13984009	SRX10361433	SRP310982	SAMN18321887	51,243,060	93%	20%
SRR13984002	SRX10361440	SRP310982	SAMN18321887	51,509,166	93%	20%
SRR13983985	SRX10361457	SRP310982	SAMN18321887	51,124,826	92%	20%
SRR13983969	SRX10361473	SRP310982	SAMN18321887	48,890,768	92%	20%
SRR13983952	SRX10361490	SRP310982	SAMN18321887	52,433,732	93%	20%
SRR13984001	SRX10361441	SRP310982	SAMN18321888	49,927,546	92%	17%
SRR13983998	SRX10361444	SRP310982	SAMN18321888	49,782,222	92%	17%
SRR13983984	SRX10361458	SRP310982	SAMN18321888	49,658,016	92%	17%
SRR13983968	SRX10361474	SRP310982	SAMN18321888	47,157,238	91%	17%
SRR13983951	SRX10361491	SRP310982	SAMN18321888	50,806,806	92%	17%
SRR13984000	SRX10361442	SRP310982	SAMN18321889	49,928,266	91%	16%
SRR13983987	SRX10361455	SRP310982	SAMN18321889	49,686,718	91%	16%
SRR13983983	SRX10361459	SRP310982	SAMN18321889	49,674,598	91%	16%
SRR13983967	SRX10361475	SRP310982	SAMN18321889	46,978,820	90%	16%
SRR13983950	SRX10361492	SRP310982	SAMN18321889	49,769,848	91%	16%
SRR13983999	SRX10361443	SRP310982	SAMN18321890	50,158,718	88%	17%
SRR13983982	SRX10361460	SRP310982	SAMN18321890	49,872,352	88%	17%
SRR13983976	SRX10361466	SRP310982	SAMN18321890	49,853,678	88%	17%
SRR13983966	SRX10361476	SRP310982	SAMN18321890	46,785,664	88%	17%
SRR13983949	SRX10361493	SRP310982	SAMN18321890	49,205,394	88%	17%
SRR13983997	SRX10361445	SRP310982	SAMN18321891	48,718,338	91%	19%
SRR13983981	SRX10361461	SRP310982	SAMN18321891	48,411,588	91%	19%
SRR13983965	SRX10361477	SRP310982	SAMN18321891	48,510,810	91%	19%
SRR13983964	SRX10361478	SRP310982	SAMN18321891	46,652,490	91%	19%
SRR13983948	SRX10361494	SRP310982	SAMN18321891	50,199,872	91%	19%
SRR13983996	SRX10361446	SRP310982	SAMN18321892	48,661,826	89%	17%
SRR13983980	SRX10361462	SRP310982	SAMN18321892	48,336,204	89%	17%
SRR13983963	SRX10361479	SRP310982	SAMN18321892	45,795,126	89%	17%
SRR13983954	SRX10361488	SRP310982	SAMN18321892	48,433,402	89%	17%
SRR13983947	SRX10361495	SRP310982	SAMN18321892	48,859,444	89%	17%
SRR13983995	SRX10361447	SRP310982	SAMN18321893	54,281,494	91%	29%
SRR13983979	SRX10361463	SRP310982	SAMN18321893	53,721,212	91%	29%
SRR13983962	SRX10361480	SRP310982	SAMN18321893	51,900,050	90%	29%
SRR13983946	SRX10361496	SRP310982	SAMN18321893	56,294,616	91%	29%
SRR13983943	SRX10361499	SRP310982	SAMN18321893	53,894,314	91%	29%
SRR13983994	SRX10361448	SRP310982	SAMN18321894	52,459,310	93%	24%
SRR13983978	SRX10361464	SRP310982	SAMN18321894	52,173,302	93%	24%
SRR13983961	SRX10361481	SRP310982	SAMN18321894	51,121,524	92%	24%
SRR13983945	SRX10361497	SRP310982	SAMN18321894	52,116,066	93%	24%
SRR13983937	SRX10361505	SRP310982	SAMN18321894	52,545,454	93%	24%
SRR13983993	SRX10361449	SRP310982	SAMN18321895	47,127,472	93%	22%
SRR13983977	SRX10361465	SRP310982	SAMN18321895	46,721,240	92%	22%
SRR13983960	SRX10361482	SRP310982	SAMN18321895	45,761,808	92%	22%
SRR13983944	SRX10361498	SRP310982	SAMN18321895	46,717,826	93%	21%
SRR13983936	SRX10361506	SRP310982	SAMN18321895	46,952,686	93%	22%
SRR13984008	SRX10361434	SRP310982	SAMN18321896	50,379,684	92%	18%
SRR13983992	SRX10361450	SRP310982	SAMN18321896	50,251,554	91%	18%
SRR13983975	SRX10361467	SRP310982	SAMN18321896	49,939,584	91%	18%
SRR13983959	SRX10361483	SRP310982	SAMN18321896	49,689,122	91%	18%
SRR13983942	SRX10361500	SRP310982	SAMN18321896	52,749,406	91%	18%
SRR13984007	SRX10361435	SRP310982	SAMN18321897	54,076,536	92%	19%
SRR13983991	SRX10361451	SRP310982	SAMN18321897	54,143,724	92%	19%
SRR13983974	SRX10361468	SRP310982	SAMN18321897	53,687,612	91%	19%
SRR13983958	SRX10361484	SRP310982	SAMN18321897	53,796,152	91%	19%
SRR13983941	SRX10361501	SRP310982	SAMN18321897	56,558,178	92%	19%
SRR13984006	SRX10361436	SRP310982	SAMN18321898	53,294,896	90%	19%
SRR13983990	SRX10361452	SRP310982	SAMN18321898	53,332,148	90%	19%
SRR13983973	SRX10361469	SRP310982	SAMN18321898	52,917,532	89%	19%
SRR13983957	SRX10361485	SRP310982	SAMN18321898	52,831,046	89%	19%
SRR13983940	SRX10361502	SRP310982	SAMN18321898	55,286,590	90%	18%
SRR13984005	SRX10361437	SRP310982	SAMN18321899	50,529,588	91%	19%
SRR13983989	SRX10361453	SRP310982	SAMN18321899	50,508,010	91%	19%
SRR13983972	SRX10361470	SRP310982	SAMN18321899	50,301,024	91%	19%
SRR13983956	SRX10361486	SRP310982	SAMN18321899	49,606,598	90%	19%
SRR13983939	SRX10361503	SRP310982	SAMN18321899	50,377,528	91%	19%
SRR13984004	SRX10361438	SRP310982	SAMN18321900	46,489,092	87%	14%
SRR13983988	SRX10361454	SRP310982	SAMN18321900	46,315,488	87%	14%
SRR13983971	SRX10361471	SRP310982	SAMN18321900	46,098,174	87%	14%
SRR13983955	SRX10361487	SRP310982	SAMN18321900	45,942,270	86%	14%
SRR13983938	SRX10361504	SRP310982	SAMN18321900	48,855,006	87%	14%
SRR14583716	SRX10931103	SRP320424	SAMN19245999	139,806,630	30%	15%
SRR14583715	SRX10931104	SRP320424	SAMN19246000	105,337,632	3%	1%
SRR14583714	SRX10931105	SRP320424	SAMN19246001	98,657,338	46%	16%
SRR14583713	SRX10931106	SRP320424	SAMN19246002	82,241,058	49%	16%
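
The per-run table relates to the per-sample table by summing read counts over the runs that share a sample accession. The sketch below shows that aggregation for two samples, using read counts copied from the rows above; the function name and the tuple layout are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the per-run table above.
runs = [
    ("SRR5647657", "SAMN07187855", 21_120_432),
    ("SRR13984010", "SAMN18321886", 48_085_606),
    ("SRR13984003", "SAMN18321886", 48_279_764),
    ("SRR13983986", "SAMN18321886", 47_903_438),
    ("SRR13983970", "SAMN18321886", 45_804_990),
    ("SRR13983953", "SAMN18321886", 49_453_114),
]


def reads_per_sample(run_rows):
    """Sum read counts over all runs belonging to each sample."""
    totals = defaultdict(int)
    for _run, sample, reads in run_rows:
        totals[sample] += reads
    return dict(totals)


if __name__ == "__main__":
    for sample, reads in reads_per_sample(runs).items():
        print(f"{sample}\t{reads:,}")
```

For SAMN18321886 (Caecum), the five runs sum to the 239,526,912 reads listed in the per-sample table.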

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Artiodactyla GenBank	32,381	31,258 (96.53%)	31,258 (96.53%)	78.22%	87.14%
Artiodactyla known RefSeq (NP_)	20,030	19,684 (98.27%)	19,684 (98.27%)	78.23%	90.49%
Homo sapiens GenBank	149,454	137,825 (92.22%)	137,825 (92.22%)	71.81%	81.65%
Homo sapiens known RefSeq (NP_)	63,852	60,906 (95.39%)	60,906 (95.39%)	78.94%	86.78%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
