NCBI Falco rusticolus Annotation Release 100

The RefSeq genome records for Falco rusticolus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Falco rusticolus Annotation Release 100

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Nov 6 2020
Date of submission of annotation to the public databases: Nov 9 2020
Software version: 8.5

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
bFalRus1.pri	GCF_015220075.1	Vertebrate Genomes Project	11-03-2020	Reference	25 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	bFalRus1.pri
Genes and pseudogenes	19,484
  protein-coding	15,894
  non-coding	3,407
  transcribed pseudogenes	3
  non-transcribed pseudogenes	155
  genes with variants	8,791
  immunoglobulin/T-cell receptor gene segments	25
  other	0
mRNAs	41,376
  fully-supported	40,291
  with > 5% ab initio	481
  partial	102
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	41,376
non-coding RNAs	6,837
  fully-supported	6,283
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	6,569
pseudo transcripts	3
  fully-supported	2
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	3
CDSs	41,414
  fully-supported	40,291
  with > 5% ab initio	597
  partial	105
  with major correction(s)	890
  known RefSeq (NP_)	0
  model RefSeq (XP_)	41,389
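
As a quick consistency check on the table above, the following Python sketch adds up the gene categories and compares the sum with the reported total. It assumes that every row under "Genes and pseudogenes" except "genes with variants" is a mutually exclusive category; that reading is inferred from the numbers, not stated in the report, and the variable names are illustrative.

```python
# Counts copied from the "Genes and pseudogenes" rows for bFalRus1.pri.
# Assumption: these categories do not overlap; "genes with variants" is
# left out because it appears to describe a property, not a category.
gene_categories = {
    "protein-coding": 15_894,
    "non-coding": 3_407,
    "transcribed pseudogenes": 3,
    "non-transcribed pseudogenes": 155,
    "immunoglobulin/T-cell receptor gene segments": 25,
    "other": 0,
}
reported_total = 19_484  # "Genes and pseudogenes" row

category_sum = sum(gene_categories.values())
print(f"sum of categories: {category_sum:,}")
print("matches reported total:", category_sum == reported_total)
```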

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	19,301	33,696	13,369	60	1,210,601
All transcripts	48,213	4,080	3,289	39	95,516
  mRNA	41,376	4,194	3,441	189	95,516
  misc_RNA	1,895	3,917	3,050	232	23,437
  tRNA	266	74	73	66	89
  lncRNA	4,388	3,576	2,032	39	31,224
  snoRNA	205	107	92	62	318
  snRNA	46	148	161	60	190
  guide_RNA	18	176	139	129	292
  rRNA	19	242	119	119	1,599
Single-exon transcripts	551	2,132	1,365	267	31,617
  coding transcripts (NM_/XM_)	551	2,132	1,365	267	31,617
CDSs	41,389	2,195	1,593	96	94,530
Exons	221,254	367	137	1	31,617
  in coding transcripts (NM_/XM_)	206,769	332	135	1	31,617
  in non-coding transcripts (NR_/XR_)	26,647	564	146	2	25,216
Introns	199,369	3,843	1,038	30	700,883
  in coding transcripts (NM_/XM_)	189,157	3,767	1,020	30	700,883
  in non-coding transcripts (NR_/XR_)	22,132	4,180	1,213	30	431,999

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.52	1	1	50
Number of exons per transcript	13.25	10	1	258

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 15881 coding genes, 15424 genes had a protein with an alignment covering 50% or more of the query and 10955 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
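
As a minimal sketch of the two definitions above, the following Python function computes the percentage of a sequence that is covered by a set of alignment intervals. The function name, the interval representation (1-based, inclusive) and the example coordinates are hypothetical and chosen only for illustration; this is not the pipeline's own implementation.

```python
def coverage_percent(aligned_intervals, sequence_length):
    """Return the percentage of a sequence covered by aligned intervals.

    Intervals are (start, end) pairs, 1-based and inclusive (an assumption
    made for this sketch). Overlapping intervals are counted only once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(aligned_intervals):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / sequence_length


# Hypothetical alignment: residues 11-190 of a 200-residue annotated protein
# (query) aligned to residues 5-184 of a 400-residue curated protein (target).
query_coverage = coverage_percent([(11, 190)], 200)
target_coverage = coverage_percent([(5, 184)], 400)
print(f"query coverage:  {query_coverage:.1f}%")
print(f"target coverage: {target_coverage:.1f}%")
```

In this made-up case the alignment would count toward the 50% query-coverage threshold but not toward the 95% one, while covering less than half of the target.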

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
bFalRus1.pri	GCF_015220075.1	5.57%	18.15%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Aves known RefSeq (NM_/NR_)	9,092	7,780 (85.57%)	4,011 (44.12%)	91.10%	89.51%
Aves Genbank	42,410	29,675 (69.97%)	13,731 (32.38%)	90.76%	90.74%
Aves TSA	388,775	227,279 (58.46%)	10,646 (2.74%)	96.33%	97.27%
Aves EST	756,928	270,983 (35.80%)	169,280 (22.36%)	91.07%	97.03%
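
The percentages in this table appear to be expressed relative to the number of sequences retrieved from Entrez, for both the aligned and the passed-to-Gnomon columns. The Python sketch below recomputes them from the counts in the table under that assumption; the variable and function names are illustrative only.

```python
# Per source: (retrieved from Entrez, aligned by Splign, passed to Gnomon,
#              reported % aligned, reported % passed), copied from the table.
transcript_sources = {
    "Aves known RefSeq (NM_/NR_)": (9_092, 7_780, 4_011, 85.57, 44.12),
    "Aves Genbank": (42_410, 29_675, 13_731, 69.97, 32.38),
    "Aves TSA": (388_775, 227_279, 10_646, 58.46, 2.74),
    "Aves EST": (756_928, 270_983, 169_280, 35.80, 22.36),
}


def percent_of(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100.0 * part / whole, 2)


for source, row in transcript_sources.items():
    retrieved, aligned, passed, reported_aligned, reported_passed = row
    # Assumption: both percentages use the retrieved count as denominator.
    pct_aligned = percent_of(aligned, retrieved)
    pct_passed = percent_of(passed, retrieved)
    print(f"{source}: aligned {pct_aligned:.2f}% (reported {reported_aligned:.2f}%), "
          f"passed to Gnomon {pct_passed:.2f}% (reported {reported_passed:.2f}%)")
```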

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,645,000,383	87%	21%	240,999
SAMN01055120	9724766,12078637,23525076	Falco peregrinus reads sequenced by BGI (Falco, SAMN01055120)	79,571,270	84%	28%	117,520
SAMN01055121	23525076	Falco cherrug reads sequenced by BGI (Falco, SAMN01055121)	76,863,908	89%	27%	120,339
SAMN04525554	NA	liver (Falco sparverius, male, SAMN04525554)	17,456,911	85%	11%	67,941
SAMN04525555	NA	liver (Falco sparverius, female, SAMN04525555)	278,044,640	83%	10%	128,809
SAMN04525556	NA	liver (Falco sparverius, female, SAMN04525556)	39,385,559	87%	11%	96,464
SAMN04525557	NA	liver (Falco sparverius, male, SAMN04525557)	34,611,542	83%	10%	86,710
SAMN04525558	NA	liver (Falco sparverius, female, SAMN04525558)	38,265,091	85%	10%	84,456
SAMN04525559	NA	liver (Falco sparverius, female, SAMN04525559)	31,591,785	86%	10%	97,809
SAMN04525560	NA	liver (Falco sparverius, female, SAMN04525560)	39,434,485	87%	11%	95,073
SAMN04525561	NA	liver (Falco sparverius, female, SAMN04525561)	28,290,295	81%	10%	88,275
SAMN04525562	NA	liver (Falco sparverius, female, SAMN04525562)	34,294,777	83%	10%	90,148
SAMN04531380	NA	retina and cochlea (Falco tinnunculus, SAMN04531380)	100,655,654	85%	25%	191,560
SAMN04531384	NA	retina and cochlea (Falco subbuteo, SAMN04531384)	100,188,960	85%	27%	194,395
SAMN05831928	NA	Liver (Falco sparverius, 24 Hours, male, SAMN05831928)	78,042,186	89%	23%	162,306
SAMN06101813	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101813)	80,936,692	89%	25%	161,076
SAMN06101814	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101814)	87,855,668	89%	24%	167,948
SAMN06101815	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101815)	115,745,142	91%	24%	121,864
SAMN06101816	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101816)	84,219,458	90%	24%	164,984
SAMN06101817	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101817)	84,760,702	88%	22%	164,216
SAMN06101818	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101818)	84,100,882	89%	25%	167,278
SAMN06101819	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101819)	86,763,472	89%	24%	167,779
SAMN06101820	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101820)	84,265,484	89%	23%	163,876
SAMN06101821	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101821)	87,216,286	88%	23%	172,214
SAMN06101822	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101822)	100,556,822	88%	24%	173,231
SAMN06101823	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101823)	86,700,708	88%	24%	168,195
SAMN06101824	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101824)	95,780,238	87%	21%	171,275
SAMN06101825	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101825)	83,002,084	88%	24%	167,879
SAMN06101826	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101826)	89,253,326	89%	22%	168,199
SAMN06101827	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101827)	87,793,098	89%	24%	172,726
SAMN06101828	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101828)	86,581,814	89%	23%	170,169
SAMN06101829	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101829)	74,998,584	89%	25%	168,686
SAMN08398480	31464627	Blood (Falco tinnunculus, SAMN08398480)	64,250,348	91%	22%	143,077
SAMN13755181	NA	blood (Falco sparverius, SAMN13755181)	103,522,512	89%	8%	94,718
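
The "All" row can be cross-checked against the per-sample rows. The Python sketch below sums the "Number of reads" column over the individual samples and compares the result with the aggregate; the read counts are copied from the table above and the variable names are illustrative.

```python
# Read counts per sample, copied from the "Number of reads" column.
reads_per_sample = {
    "SAMN01055120": 79_571_270,
    "SAMN01055121": 76_863_908,
    "SAMN04525554": 17_456_911,
    "SAMN04525555": 278_044_640,
    "SAMN04525556": 39_385_559,
    "SAMN04525557": 34_611_542,
    "SAMN04525558": 38_265_091,
    "SAMN04525559": 31_591_785,
    "SAMN04525560": 39_434_485,
    "SAMN04525561": 28_290_295,
    "SAMN04525562": 34_294_777,
    "SAMN04531380": 100_655_654,
    "SAMN04531384": 100_188_960,
    "SAMN05831928": 78_042_186,
    "SAMN06101813": 80_936_692,
    "SAMN06101814": 87_855_668,
    "SAMN06101815": 115_745_142,
    "SAMN06101816": 84_219_458,
    "SAMN06101817": 84_760_702,
    "SAMN06101818": 84_100_882,
    "SAMN06101819": 86_763_472,
    "SAMN06101820": 84_265_484,
    "SAMN06101821": 87_216_286,
    "SAMN06101822": 100_556_822,
    "SAMN06101823": 86_700_708,
    "SAMN06101824": 95_780_238,
    "SAMN06101825": 83_002_084,
    "SAMN06101826": 89_253_326,
    "SAMN06101827": 87_793_098,
    "SAMN06101828": 86_581_814,
    "SAMN06101829": 74_998_584,
    "SAMN08398480": 64_250_348,
    "SAMN13755181": 103_522_512,
}
reported_total = 2_645_000_383  # "All" row: aggregate of all aligned samples

total_reads = sum(reads_per_sample.values())
print(f"{len(reads_per_sample)} samples, {total_reads:,} reads in total")
print("matches the aggregate row:", total_reads == reported_total)
```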

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR522906	SRX160529	SRP013939	SAMN01055120	79,571,270	84%	28%
SRR522907	SRX160530	SRP018394	SAMN01055121	76,863,908	89%	27%
SRR3203231	SRX1612734	SRP071126	SAMN04531380	100,655,654	85%	25%
SRR3203238	SRX1612742	SRP071126	SAMN04531384	100,188,960	85%	27%
SRR3217264	SRX1624460	SRP071583	SAMN04525554	17,456,911	85%	11%
SRR3217266	SRX1624462	SRP071583	SAMN04525555	278,044,640	83%	10%
SRR3217265	SRX1624461	SRP071583	SAMN04525556	39,385,559	87%	11%
SRR3217261	SRX1624457	SRP071583	SAMN04525557	34,611,542	83%	10%
SRR3217262	SRX1624458	SRP071583	SAMN04525558	38,265,091	85%	10%
SRR3217263	SRX1624459	SRP071583	SAMN04525559	31,591,785	86%	10%
SRR3217258	SRX1624454	SRP071583	SAMN04525560	39,434,485	87%	11%
SRR3217259	SRX1624455	SRP071583	SAMN04525561	28,290,295	81%	10%
SRR3217260	SRX1624456	SRP071583	SAMN04525562	34,294,777	83%	10%
SRR5070564	SRX2390435	SRP094478	SAMN05831928	78,042,186	89%	23%
SRR5270429	SRX2574481	SRP094478	SAMN06101813	80,936,692	89%	25%
SRR5270428	SRX2574480	SRP094478	SAMN06101814	87,855,668	89%	24%
SRR5270427	SRX2574479	SRP094478	SAMN06101815	115,745,142	91%	24%
SRR5270426	SRX2574478	SRP094478	SAMN06101816	84,219,458	90%	24%
SRR5270425	SRX2574477	SRP094478	SAMN06101817	84,760,702	88%	22%
SRR5270424	SRX2574476	SRP094478	SAMN06101818	84,100,882	89%	25%
SRR5270423	SRX2574475	SRP094478	SAMN06101819	86,763,472	89%	24%
SRR5270422	SRX2574474	SRP094478	SAMN06101820	84,265,484	89%	23%
SRR5270421	SRX2574473	SRP094478	SAMN06101821	87,216,286	88%	23%
SRR5270420	SRX2574472	SRP094478	SAMN06101822	100,556,822	88%	24%
SRR5270419	SRX2574471	SRP094478	SAMN06101823	86,700,708	88%	24%
SRR5270418	SRX2574470	SRP094478	SAMN06101824	95,780,238	87%	21%
SRR5270417	SRX2574469	SRP094478	SAMN06101825	83,002,084	88%	24%
SRR5270416	SRX2574468	SRP094478	SAMN06101826	89,253,326	89%	22%
SRR5270415	SRX2574467	SRP094478	SAMN06101827	87,793,098	89%	24%
SRR5270414	SRX2574466	SRP094478	SAMN06101828	86,581,814	89%	23%
SRR5270413	SRX2574465	SRP094478	SAMN06101829	74,998,584	89%	25%
SRR6650831	SRX3628421	SRP131743	SAMN08398480	64,250,348	91%	22%
SRR10853095	SRX7523253	SRP240625	SAMN13755181	103,522,512	89%	8%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Xenopus known RefSeq (NP_)	19,655	17,971 (91.43%)	17,971 (91.43%)	69.96%	79.09%
Aves GenBank	15,113	7,861 (52.01%)	7,861 (52.01%)	72.72%	83.79%
Aves known RefSeq (NP_)	7,926	7,525 (94.94%)	7,525 (94.94%)	77.67%	85.71%
Columba livia high-quality model RefSeq (XP_)	8,292	8,000 (96.48%)	8,000 (96.48%)	78.52%	86.18%
Gallus gallus high-quality model RefSeq (XP_)	9,464	9,050 (95.63%)	9,050 (95.63%)	76.85%	83.51%
Parus major high-quality model RefSeq (XP_)	11,979	11,557 (96.48%)	11,557 (96.48%)	77.90%	85.07%
Homo sapiens known RefSeq (NP_)	60,910	40,439 (66.39%)	40,439 (66.39%)	71.01%	76.28%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
