NCBI Chiroxiphia lanceolata Annotation Release 100

The RefSeq genome records for Chiroxiphia lanceolata were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Chiroxiphia lanceolata Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Feb 3 2020
Date of submission of annotation to the public databases: Feb 29 2020
Software version: 8.3

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
bChiLan1.pri	GCF_009829145.1	Vertebrate Genomes Project	01-03-2020	Reference	35 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	bChiLan1.pri
Genes and pseudogenes	20,299
protein-coding	15,873
non-coding	4,307
transcribed pseudogenes	3
non-transcribed pseudogenes	107
genes with variants	8,470
immunoglobulin/T-cell receptor gene segments	9
other	0
mRNAs	39,412
fully-supported	38,083
with > 5% ab initio	658
partial	127
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	39,412
non-coding RNAs	7,216
fully-supported	6,673
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	6,951
pseudo transcripts	3
fully-supported	3
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	3
CDSs	39,412
fully-supported	38,083
with > 5% ab initio	759
partial	127
with major correction(s)	1,236
known RefSeq (NP_)	0
model RefSeq (XP_)	39,412

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	20,180	31,030	11,714	61	1,309,047
All transcripts	46,628	3,608	2,888	61	99,528
mRNA	39,412	3,950	3,205	174	99,528
misc_RNA	1,506	3,313	2,664	99	16,154
tRNA	265	74	72	71	84
lncRNA	5,167	1,454	750	79	26,404
snoRNA	181	110	97	61	324
snRNA	44	141	141	61	197
guide_RNA	14	193	165	129	318
rRNA	39	227	119	119	4,282
Single-exon transcripts	547	1,771	1,186	174	12,651
coding transcripts (NM_/XM_)	547	1,771	1,186	174	12,651
CDSs	39,412	2,180	1,572	96	98,322
Exons	221,672	306	133	1	21,648
in coding transcripts (NM_/XM_)	205,065	297	132	1	17,106
in non-coding transcripts (NR_/XR_)	26,750	330	134	2	21,648
Introns	198,493	3,850	967	30	1,171,552
in coding transcripts (NM_/XM_)	186,674	3,792	944	30	1,171,552
in non-coding transcripts (NR_/XR_)	21,730	4,183	1,257	30	827,577

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.33	1	1	50
Number of exons per transcript	12.77	10	1	301

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 15,873 coding genes, 15,391 genes had a protein with an alignment covering 50% or more of the query and 10,687 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
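The two coverage definitions above can be illustrated with a minimal sketch. The function name `coverage_percent` and the alignment lengths used here are hypothetical and chosen only for illustration; they are not taken from the pipeline or from this annotation run.

```python
def coverage_percent(aligned_length: int, sequence_length: int) -> float:
    """Percentage of a sequence's length that is included in the alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical example: 450 residues of a 500-residue annotated protein (query)
# align to 450 residues of a 600-residue Swiss-Prot protein (target).
query_cov = coverage_percent(450, 500)   # share of the annotated protein aligned
target_cov = coverage_percent(450, 600)  # share of the Swiss-Prot protein aligned

print(f"query coverage: {query_cov:.1f}%, target coverage: {target_cov:.1f}%")
```

In this made-up case the same alignment passes a 50% query-coverage threshold but not a 95% one, and its target coverage is lower than its query coverage because the target protein is longer.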

[Figure: cumulative graph of the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline are included in the graph. Query: annotated proteins. Target: UniProtKB/Swiss-Prot curated proteins.]

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
bChiLan1.pri	GCF_009829145.1	7.11%	19.21%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species long SRA	53,879	52,384 (97.23%)	48,090 (89.26%)	99.71%	99.45%
Aves known RefSeq (NM_/NR_)	1,709	1,549 (90.64%)	988 (57.81%)	92.53%	94.67%
Aves GenBank	11,737	9,050 (77.11%)	5,102 (43.47%)	91.32%	95.62%
Aves EST	156,317	81,529 (52.16%)	52,754 (33.75%)	91.72%	97.24%
Gallus gallus known RefSeq (NM_/NR_)	7,374	6,008 (81.48%)	2,652 (35.96%)	90.40%	86.52%
Gallus gallus GenBank	30,564	19,592 (64.10%)	7,958 (26.04%)	90.14%	87.54%
Gallus gallus EST	600,147	167,124 (27.85%)	102,591 (17.09%)	90.54%	96.91%
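
The percentages in parentheses in the table above appear to be expressed relative to the number of sequences retrieved from Entrez. As a rough check under that assumption, the sketch below recomputes them for the "Same-species long SRA" row; the helper name `percent_of_retrieved` is hypothetical.

```python
def percent_of_retrieved(count: int, retrieved: int) -> float:
    """Express a count as a percentage of the retrieved sequences, to 2 decimals."""
    return round(100.0 * count / retrieved, 2)


# Values copied from the "Same-species long SRA" row of the table.
retrieved, aligned, passed = 53_879, 52_384, 48_090

aligned_pct = percent_of_retrieved(aligned, retrieved)  # aligned by Splign
passed_pct = percent_of_retrieved(passed, retrieved)    # passed to Gnomon

print(f"aligned: {aligned_pct}%, passed to Gnomon: {passed_pct}%")
```

The results agree with the 97.23% and 89.26% reported for that row, which supports reading both percentages against the retrieved count rather than reading the Gnomon percentage against the aligned count.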

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,363,121,047	78%	27%	333,768
SAMN00004739	NA	whole brain (Manacus vitellinus, SAMN00004739)	325,390	77%	14%	14,116
SAMN04125355	26745669	pectoralis, testosterone (Manacus vitellinus, SAMN04125355)	55,782,566	72%	25%	79,839
SAMN04125356	26745669	pectoralis, testosterone (Manacus vitellinus, SAMN04125356)	59,864,278	67%	22%	71,462
SAMN04125357	26745669	pectoralis, testosterone (Manacus vitellinus, SAMN04125357)	59,581,898	65%	25%	91,648
SAMN04125358	26745669	scapulohumeralis caudalis, testosterone (Manacus vitellinus, SAMN04125358)	37,899,244	76%	28%	101,637
SAMN04125359	26745669	scapulohumeralis caudalis, testosterone (Manacus vitellinus, SAMN04125359)	54,100,212	79%	26%	116,783
SAMN04125360	26745669	scapulohumeralis caudalis, testosterone (Manacus vitellinus, SAMN04125360)	49,924,286	75%	27%	93,408
SAMN04125361	26745669	pectoralis, no testosterone (Manacus vitellinus, SAMN04125361)	56,104,046	67%	24%	76,663
SAMN04125362	26745669	pectoralis, no testosterone (Manacus vitellinus, SAMN04125362)	58,380,226	72%	26%	93,312
SAMN04125363	26745669	pectoralis, no testosterone (Manacus vitellinus, SAMN04125363)	77,429,742	65%	24%	93,810
SAMN04125364	26745669	scapulohumeralis caudalis, no testosterone (Manacus vitellinus, SAMN04125364)	65,806,788	72%	25%	88,046
SAMN04125365	26745669	scapulohumeralis caudalis, no testosterone (Manacus vitellinus, SAMN04125365)	55,886,088	76%	27%	93,976
SAMN04125366	26745669	scapulohumeralis caudalis, no testosterone (Manacus vitellinus, SAMN04125366)	57,200,148	76%	27%	91,726
SAMN04943274	NA	kidney, brain, liver (Manacus vitellinus, male, SAMN04943274)	251,755,194	81%	31%	265,569
SAMN04967793	NA	Liver, gonad, muscle (Lepidothrix coronata, male, SAMN04967793)	285,374,278	85%	32%	208,927
SAMN08640351	NA	testes (Pipra filicauda, adult, male, SAMN08640351)	16,905,547	88%	26%	158,843
SAMN08640352	NA	testes (Pipra filicauda, adult, male, SAMN08640352)	22,706,132	87%	26%	171,760
SAMN08640353	NA	brain (Pipra filicauda, adult, male, SAMN08640353)	32,806,460	87%	17%	159,663
SAMN08640354	NA	brain (Pipra filicauda, adult, male, SAMN08640354)	25,597,883	88%	17%	152,423
SAMN08640355	NA	brain (Pipra filicauda, adult, male, SAMN08640355)	19,471,827	88%	17%	137,027
SAMN08640356	NA	brain (Pipra filicauda, adult, male, SAMN08640356)	20,218,814	89%	16%	131,221

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR029477	SRX012420	SRP001362	SAMN00004739	165,165	77%	14%
SRR029478	SRX012420	SRP001362	SAMN00004739	160,225	77%	14%
SRR2545929	SRX1299445	SRP064385	SAMN04125355	55,782,566	72%	25%
SRR2545930	SRX1299446	SRP064385	SAMN04125356	59,864,278	67%	22%
SRR2545931	SRX1299447	SRP064385	SAMN04125357	59,581,898	65%	25%
SRR2545932	SRX1299448	SRP064385	SAMN04125358	37,899,244	76%	28%
SRR2545933	SRX1299449	SRP064385	SAMN04125359	54,100,212	79%	26%
SRR2545934	SRX1299450	SRP064385	SAMN04125360	49,924,286	75%	27%
SRR2545935	SRX1299451	SRP064385	SAMN04125361	56,104,046	67%	24%
SRR2545936	SRX1299452	SRP064385	SAMN04125362	58,380,226	72%	26%
SRR2545937	SRX1299453	SRP064385	SAMN04125363	77,429,742	65%	24%
SRR2545938	SRX1299454	SRP064385	SAMN04125364	65,806,788	72%	25%
SRR2545939	SRX1299455	SRP064385	SAMN04125365	55,886,088	76%	27%
SRR2545940	SRX1299456	SRP064385	SAMN04125366	57,200,148	76%	27%
SRR3476292	SRX1742932	SRP074374	SAMN04943274	251,755,194	81%	31%
SRR3493972	SRX1753859	SRP074756	SAMN04967793	285,374,278	85%	32%
SRR6811834	SRX3768877	SRP134047	SAMN08640351	16,905,547	88%	26%
SRR6811833	SRX3768878	SRP134047	SAMN08640352	22,706,132	87%	26%
SRR6811836	SRX3768875	SRP134047	SAMN08640353	32,806,460	87%	17%
SRR6811835	SRX3768876	SRP134047	SAMN08640354	25,597,883	88%	17%
SRR6811838	SRX3768873	SRP134047	SAMN08640355	19,471,827	88%	17%
SRR6811837	SRX3768874	SRP134047	SAMN08640356	20,218,814	89%	16%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pseudopodoces humilis high-quality model RefSeq (XP_)	10,444	9,935 (95.13%)	9,935 (95.13%)	79.99%	86.77%
Xenopus known RefSeq (NP_)	19,657	17,924 (91.18%)	17,924 (91.18%)	69.73%	78.83%
Aves GenBank	15,021	8,030 (53.46%)	8,030 (53.46%)	70.76%	83.70%
Aves known RefSeq (NP_)	7,917	7,504 (94.78%)	7,504 (94.78%)	76.90%	85.15%
Columba livia high-quality model RefSeq (XP_)	8,292	7,965 (96.06%)	7,965 (96.06%)	77.66%	85.32%
Gallus gallus high-quality model RefSeq (XP_)	9,466	9,019 (95.28%)	9,019 (95.28%)	75.87%	82.56%
Parus major high-quality model RefSeq (XP_)	12,103	9,995 (82.58%)	9,995 (82.58%)	78.42%	85.24%
Homo sapiens known RefSeq (NP_)	56,488	38,260 (67.73%)	38,260 (67.73%)	70.72%	76.13%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
