NCBI Cygnus olor Annotation Release 100

The RefSeq genome records for Cygnus olor were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Cygnus olor Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Apr 2 2021
Date of submission of annotation to the public databases: Apr 8 2021
Software version: 8.6

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
bCygOlo1.pri.v2	GCF_009769625.2	Vertebrate Genomes Project	03-31-2021	Reference	37 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	bCygOlo1.pri.v2
Genes and pseudogenes	23,209
protein-coding	16,150
non-coding	6,899
Transcribed pseudogenes	0
Non-transcribed pseudogenes	103
genes with variants	9,999
Immunoglobulin/T-cell receptor gene segments	57
other	0
mRNAs	47,781
fully-supported	46,809
with > 5% ab initio	370
partial	158
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	47,781
non-coding RNAs	12,133
fully-supported	11,412
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	11,730
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	47,851
fully-supported	46,809
with > 5% ab initio	471
partial	158
with major correction(s)	1,138
known RefSeq (NP_)	0
model RefSeq (XP_)	47,794
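
As a quick consistency check on the feature counts above, the gene categories can be added up and compared with the "Genes and pseudogenes" total. The following is a minimal Python sketch using the counts from the bCygOlo1.pri.v2 column; the dictionary and function names are illustrative and are not part of any NCBI tool.

```python
# Gene category counts copied from the bCygOlo1.pri.v2 column of the report.
GENE_COUNTS = {
    "protein-coding": 16_150,
    "non-coding": 6_899,
    "Transcribed pseudogenes": 0,
    "Non-transcribed pseudogenes": 103,
    "Immunoglobulin/T-cell receptor gene segments": 57,
    "other": 0,
}
REPORTED_TOTAL = 23_209  # "Genes and pseudogenes" row


def categories_match_total(counts: dict, total: int) -> bool:
    """Return True when the category counts add up to the reported total."""
    return sum(counts.values()) == total


if __name__ == "__main__":
    print(categories_match_total(GENE_COUNTS, REPORTED_TOTAL))
```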

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	23,049	30,754	11,138	59	1,590,996
All transcripts	59,914	3,879	3,116	59	95,572
mRNA	47,781	4,222	3,468	144	95,572
misc_RNA	1,996	3,878	2,780	101	23,132
tRNA	401	74	73	65	87
lncRNA	9,416	2,427	1,519	86	29,115
snoRNA	194	109	93	62	322
snRNA	63	142	149	59	194
guide_RNA	14	179	140	130	312
rRNA	49	286	119	118	4,116
Single-exon transcripts	586	1,962	1,334	192	22,448
coding transcripts (NM_/XM_ )	586	1,962	1,334	192	22,448
CDSs	47,794	2,220	1,638	96	94,326
Exons	243,023	362	140	1	27,100
in coding transcripts (NM_/XM_ )	214,670	326	138	1	26,333
in non-coding transcripts (NR_/XR_ )	38,015	527	155	2	27,100
Introns	214,363	4,188	1,016	30	709,779
in coding transcripts (NM_/XM_ )	193,746	4,001	975	30	709,779
in non-coding transcripts (NR_/XR_ )	30,094	5,166	1,340	30	469,967

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.63	1	1	50
Number of exons per transcript	12.62	9	1	256

BUSCO analysis of gene annotation

BUSCO v4.0.2 (Simão et al. 2015, PMID: 26059717) was run in "protein" mode with the aves_odb10 lineage dataset on the annotated gene set, using the single longest protein per gene. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation (C:complete [S:single-copy, D:duplicated], F:fragmented, M:missing, n:number of genes used).
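
For readers who want to handle BUSCO notation programmatically, the sketch below parses a one-line summary into named fields. The example string is a made-up placeholder and is not the result of this annotation run; the class name, function name, and regular expression are assumptions chosen for illustration.

```python
import re
from typing import NamedTuple


class BuscoSummary(NamedTuple):
    complete: float      # C
    single_copy: float   # S
    duplicated: float    # D
    fragmented: float    # F
    missing: float       # M
    n_genes: int         # n


# Matches strings such as "C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:1000".
_BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(notation: str) -> BuscoSummary:
    """Parse a BUSCO one-line summary; spaces in the input are ignored."""
    match = _BUSCO_RE.search(notation.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a BUSCO summary: {notation!r}")
    return BuscoSummary(
        complete=float(match["C"]),
        single_copy=float(match["S"]),
        duplicated=float(match["D"]),
        fragmented=float(match["F"]),
        missing=float(match["M"]),
        n_genes=int(match["n"]),
    )


# Placeholder values for illustration only, not taken from this report.
EXAMPLE = "C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:1000"

if __name__ == "__main__":
    print(parse_busco(EXAMPLE))
```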

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 16,137 coding genes, 15,578 had a protein with an alignment covering 50% or more of the query, and 10,857 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
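
The two coverage definitions above can be illustrated with a short Python sketch. It assumes a single contiguous alignment described by 1-based inclusive start and end coordinates on each sequence, which is a simplification (a real BLASTP hit may contain gaps or several segments). The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percentage of a sequence included in an alignment.

    Coordinates are 1-based and inclusive; the alignment is treated as one
    contiguous block on the sequence (an assumption of this sketch).
    """
    if seq_length <= 0:
        raise ValueError("sequence length must be positive")
    if not 1 <= aln_start <= aln_end <= seq_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    return 100.0 * (aln_end - aln_start + 1) / seq_length


if __name__ == "__main__":
    # Hypothetical hit: a 400-residue annotated protein (query) aligned over
    # residues 1-380 to residues 11-390 of a 500-residue target protein.
    query_cov = coverage_percent(1, 380, 400)
    target_cov = coverage_percent(11, 390, 500)
    print(f"query coverage:  {query_cov:.1f}%")
    print(f"target coverage: {target_cov:.1f}%")
```

In this hypothetical case the same alignment gives a high query coverage and a lower target coverage, because the target protein is longer than the annotated protein.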

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
bCygOlo1.pri.v2	GCF_009769625.2		20.64%
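
A percentage such as the one in the table can be approximated from sequence data when masking is represented as lowercase bases (soft-masking). That representation is an assumption of this sketch; the report does not state how the masked sequence is stored. The function name and the toy sequences are hypothetical.

```python
from typing import Iterable


def percent_masked(sequences: Iterable[str]) -> float:
    """Percentage of bases that are lowercase (soft-masked) across sequences."""
    masked = 0
    total = 0
    for seq in sequences:
        total += len(seq)
        # Every lowercase character is counted as masked in this sketch.
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence data supplied")
    return 100.0 * masked / total


if __name__ == "__main__":
    # Toy sequences for illustration only: 6 of 20 bases are lowercase.
    toy = ["ACGTacgtAC", "nnNNACGTAC"]
    print(f"{percent_masked(toy):.2f}% masked")
```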

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

No transcript evidence was used in this annotation.

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,368,651,928	82%	30%	467,648
SAMEA6952882	NA	Endothelial phenotype, (Cygnus atratus, male, SAMEA6952882)	93,299,964	59%	44%	158,498
SAMEA6952883	NA	Endothelial phenotype, (Cygnus atratus, male, SAMEA6952883)	105,510,706	90%	45%	170,482
SAMEA6952884	NA	Endothelial phenotype, (Cygnus atratus, male, SAMEA6952884)	91,956,280	59%	44%	159,069
SAMEA6952885	NA	Endothelial phenotype, (Cygnus atratus, male, SAMEA6952885)	79,839,948	60%	43%	158,159
SAMEA6952886	NA	Endothelial phenotype, (Cygnus atratus, male, SAMEA6952886)	110,261,514	90%	46%	165,733
SAMEA6952887	NA	Endothelial phenotype, (Cygnus atratus, male, SAMEA6952887)	87,062,826	91%	45%	165,248
SAMN02898389	27608918	hypophysis; hypothalamus; ovary (Anser cygnoides, mature, SAMN02898389)	331,616,546	65%	28%	376,442
SAMN03322565	NA	Gonad (Anser cygnoides, female, First reproductive year, SAMN03322565)	262,690,542	89%	21%	374,028
SAMN03322566	NA	Spleen (Anser cygnoides, female, First reproductive year, SAMN03322566)	239,392,204	86%	19%	325,347
SAMN03322567	NA	Gonad (Anser cygnoides, male, First reproductive year, SAMN03322567)	242,049,806	87%	17%	382,076
SAMN03322568	NA	Spleen (Anser cygnoides, male, First reproductive year, SAMN03322568)	239,945,418	87%	20%	321,815
SAMN09787202	NA	dorsal skin (Anser anser, embryonic day (E13, E18,E28), SAMN09787202)	485,026,174	89%	39%	224,757

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
ERR4242866	ERX4201660	ERP122327	SAMEA6952882	93,299,964	59%	44%
ERR4236132	ERX4194926	ERP122327	SAMEA6952883	105,510,706	90%	45%
ERR4239012	ERX4197806	ERP122327	SAMEA6952884	91,956,280	59%	44%
ERR4238303	ERX4197097	ERP122327	SAMEA6952885	79,839,948	60%	43%
ERR4238025	ERX4196819	ERP122327	SAMEA6952886	110,261,514	90%	46%
ERR4237982	ERX4196776	ERP122327	SAMEA6952887	87,062,826	91%	45%
SRR1502193	SRX642065	SRP043697	SAMN02898389	66,704,750	73%	28%
SRR1502195	SRX642065	SRP043697	SAMN02898389	88,303,932	63%	28%
SRR1502197	SRX642082	SRP043697	SAMN02898389	88,303,932	63%	28%
SRR1502198	SRX647227	SRP043697	SAMN02898389	88,303,932	63%	28%
SRR1795999	SRX866322	SRP053228	SAMN03322565	55,824,994	89%	21%
SRR1796000	SRX866323	SRP053228	SAMN03322565	48,747,188	89%	21%
SRR1796002	SRX866324	SRP053228	SAMN03322565	51,762,792	90%	21%
SRR1796003	SRX866325	SRP053228	SAMN03322565	54,418,038	89%	21%
SRR1796004	SRX866326	SRP053228	SAMN03322565	51,937,530	90%	23%
SRR1796005	SRX866327	SRP053228	SAMN03322566	50,402,122	87%	20%
SRR1796006	SRX866328	SRP053228	SAMN03322566	53,037,682	87%	20%
SRR1796008	SRX866330	SRP053228	SAMN03322566	44,832,392	86%	20%
SRR1796009	SRX866331	SRP053228	SAMN03322566	40,202,222	86%	18%
SRR1796010	SRX866332	SRP053228	SAMN03322566	50,917,786	87%	19%
SRR1796012	SRX866333	SRP053228	SAMN03322567	53,749,894	87%	17%
SRR1796011	SRX866334	SRP053228	SAMN03322567	46,815,172	87%	18%
SRR1796013	SRX866335	SRP053228	SAMN03322567	49,776,600	88%	17%
SRR1796014	SRX866336	SRP053228	SAMN03322567	45,205,986	86%	16%
SRR1796015	SRX866338	SRP053228	SAMN03322567	46,502,154	87%	16%
SRR1796016	SRX866339	SRP053228	SAMN03322568	44,218,884	89%	23%
SRR1796017	SRX866351	SRP053228	SAMN03322568	49,808,694	86%	19%
SRR1796019	SRX866354	SRP053228	SAMN03322568	43,863,442	87%	19%
SRR1796020	SRX866355	SRP053228	SAMN03322568	50,908,104	87%	20%
SRR1796021	SRX866356	SRP053228	SAMN03322568	51,146,294	88%	18%
SRR7663627	SRX4524234	SRP156879	SAMN09787202	72,214,054	89%	40%
SRR7663626	SRX4524235	SRP156879	SAMN09787202	52,416,882	88%	40%
SRR7663625	SRX4524236	SRP156879	SAMN09787202	51,066,402	89%	40%
SRR7663624	SRX4524237	SRP156879	SAMN09787202	51,704,342	89%	37%
SRR7663623	SRX4524238	SRP156879	SAMN09787202	53,263,218	89%	38%
SRR7663622	SRX4524239	SRP156879	SAMN09787202	49,794,350	89%	39%
SRR7663621	SRX4524240	SRP156879	SAMN09787202	48,238,510	88%	41%
SRR7663620	SRX4524241	SRP156879	SAMN09787202	55,460,086	89%	38%
SRR7663619	SRX4524242	SRP156879	SAMN09787202	50,868,330	89%	37%
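
The by-run and by-sample tables can be cross-checked by summing read counts per sample. The sketch below does this for two samples, using run rows copied from the table above; it assumes that a sample's read count is the sum of its runs, and the variable and function names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads) rows copied from the by-run table.
RUNS = [
    ("ERR4242866", "SAMEA6952882", 93_299_964),
    ("SRR1502193", "SAMN02898389", 66_704_750),
    ("SRR1502195", "SAMN02898389", 88_303_932),
    ("SRR1502197", "SAMN02898389", 88_303_932),
    ("SRR1502198", "SAMN02898389", 88_303_932),
]


def reads_per_sample(runs):
    """Sum the read counts of all runs belonging to each sample."""
    totals = defaultdict(int)
    for _run, sample, n_reads in runs:
        totals[sample] += n_reads
    return dict(totals)


if __name__ == "__main__":
    for sample, n_reads in reads_per_sample(RUNS).items():
        print(f"{sample}\t{n_reads:,}")
```

For these two samples the sums reproduce the read counts listed in the by-sample table (93,299,964 and 331,616,546).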

SRA Long Read Alignment Statistics

The following long read RNA-Seq reads (PacBio, Oxford Nanopore, 454, or other long-read sequencing technologies) from the Sequence Read Archive were also used for gene prediction:

Run	Sample	Number of reads	Number (%) of sequences aligned by Minimap2	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
All	NA	2540407	2524257 (99.36%)	2340694 (92.13%)	98.53	99.19
ERR4235910	SAMEA6952754	819815	817070 (99.66%)	764364 (93.23%)	98.63	99.23
ERR4235911	SAMEA6952755	1344013	1338500 (99.58%)	1223656 (91.04%)	98.45	99.12
ERR4235914	SAMEA6952757	39053	38044 (97.41%)	34992 (89.60%)	98.33	98.95
ERR4235915	SAMEA6952757	337526	330643 (97.96%)	317682 (94.12%)	98.64	99.36
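
The "All" row of the long-read table can be recomputed from the per-run rows. The sketch below sums the three count columns and derives the aligned and passed-to-Gnomon percentages as fractions of the number of reads; the names are illustrative, and the recomputed aggregate percentages agree with the reported ones to within 0.01 percentage point.

```python
# run -> (number of reads, aligned by Minimap2, passed to Gnomon),
# copied from the long-read table.
LONG_READ_RUNS = {
    "ERR4235910": (819_815, 817_070, 764_364),
    "ERR4235911": (1_344_013, 1_338_500, 1_223_656),
    "ERR4235914": (39_053, 38_044, 34_992),
    "ERR4235915": (337_526, 330_643, 317_682),
}


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def aggregate(runs: dict) -> tuple:
    """Column-wise sums over all runs: (reads, aligned, passed)."""
    return tuple(sum(column) for column in zip(*runs.values()))


if __name__ == "__main__":
    reads, aligned, passed = aggregate(LONG_READ_RUNS)
    print(f"reads:   {reads}")
    print(f"aligned: {aligned} ({percent(aligned, reads):.2f}%)")
    print(f"passed:  {passed} ({percent(passed, reads):.2f}%)")
```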

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Aves GenBank	15,116	8,684 (57.45%)	8,684 (57.45%)	74.20%	84.17%
Aves known RefSeq (NP_)	8,231	7,826 (95.08%)	7,826 (95.08%)	78.06%	86.06%
Gallus gallus high-quality model RefSeq (XP_)	9,371	9,013 (96.18%)	9,013 (96.18%)	77.74%	84.55%
Homo sapiens known RefSeq (NP_)	61,693	40,926 (66.34%)	40,926 (66.34%)	70.68%	76.13%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22:134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018 Sep 15;34(18):3094-3100
