NCBI Drosophila guanche Annotation Release 100

The RefSeq genome records for Drosophila guanche were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Drosophila guanche Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: May 7 2020
Date of submission of annotation to the public databases: May 9 2020
Software version: 8.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
DGUA_6	GCF_900245975.1	CNAG	09-20-2018	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	DGUA_6
Genes and pseudogenes	14,306
protein-coding	13,307
non-coding	927
transcribed pseudogenes	0
non-transcribed pseudogenes	72
genes with variants	3,810
immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	22,824
fully-supported	22,151
with > 5% ab initio	336
partial	91
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	22,824
non-coding RNAs	1,449
fully-supported	1,054
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1,186
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	22,824
fully-supported	22,151
with > 5% ab initio	354
partial	91
with major correction(s)	155
known RefSeq (NP_)	0
model RefSeq (XP_)	22,824
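
The gene counts in the table above can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the annotation pipeline: the dictionary layout and variable names are illustrative, and it assumes that "genes with variants" is an overlapping subset rather than a separate category, so it is left out of the sum.

```python
# Gene category counts copied from the "Feature counts" table (DGUA_6).
# Assumption: these categories are disjoint; "genes with variants" is
# treated as an overlapping subset and is therefore not included.
gene_counts = {
    "protein-coding": 13_307,
    "non-coding": 927,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 72,
    "immunoglobulin/T-cell receptor gene segments": 0,
    "other": 0,
}

reported_total = 14_306  # "Genes and pseudogenes" row

# Sum of all categories, to compare against the reported total.
computed_total = sum(gene_counts.values())

# Genes left after excluding both pseudogene categories.
pseudogenes = (
    gene_counts["transcribed pseudogenes"]
    + gene_counts["non-transcribed pseudogenes"]
)
non_pseudogene_genes = computed_total - pseudogenes

print(computed_total, non_pseudogene_genes)
```

Under these assumptions the categories add up to the reported total, and the non-pseudogene count agrees with the "Genes" row of the Feature lengths table, which excludes pseudogenes.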

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	14,234	6,113	1,973	47	259,483
All transcripts	24,273	2,804	1,987	47	49,111
mRNA	22,824	2,907	2,074	168	49,111
misc_RNA	350	3,108	2,514	190	27,220
tRNA	263	74	73	71	84
lncRNA	704	845	630	90	7,309
snoRNA	75	107	89	47	304
snRNA	30	147	140	68	252
guide_RNA	3	151	143	137	173
rRNA	24	134	119	119	179
Single-exon transcripts	1,829	1,177	988	168	7,133
coding transcripts (NM_/XM_)	1,829	1,177	988	168	7,133
CDSs	22,824	2,290	1,554	96	48,183
Exons	72,522	475	262	1	13,161
in coding transcripts (NM_/XM_)	70,172	481	265	1	13,161
in non-coding transcripts (NR_/XR_)	3,794	335	189	2	10,345
Introns	54,749	1,483	73	30	188,030
in coding transcripts (NM_/XM_)	53,239	1,454	72	30	188,030
in non-coding transcripts (NR_/XR_)	2,922	2,107	91	30	119,705
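
Rows of the Feature lengths table are tab-separated, with thousands separators in the numbers. The sketch below shows one way to parse such a row and to check that the per-type transcript counts add up to the "All transcripts" row. The helper name `parse_row` is hypothetical and the sample rows are copied from the table above.

```python
def parse_row(line):
    """Split a tab-separated table row into (feature name, list of ints)."""
    name, *values = line.rstrip("\n").split("\t")
    # Drop thousands separators before converting to int.
    return name, [int(v.replace(",", "")) for v in values]

# Rows copied from the table; columns are
# Count, Mean, Median, Min, Max (lengths in bp).
rows = [
    "mRNA\t22,824\t2,907\t2,074\t168\t49,111",
    "misc_RNA\t350\t3,108\t2,514\t190\t27,220",
    "tRNA\t263\t74\t73\t71\t84",
    "lncRNA\t704\t845\t630\t90\t7,309",
    "snoRNA\t75\t107\t89\t47\t304",
    "snRNA\t30\t147\t140\t68\t252",
    "guide_RNA\t3\t151\t143\t137\t173",
    "rRNA\t24\t134\t119\t119\t179",
]

# Map each transcript type to its count (first numeric column).
counts = {name: values[0] for name, values in map(parse_row, rows)}

total_transcripts = sum(counts.values())
noncoding_transcripts = total_transcripts - counts["mRNA"]

print(total_transcripts, noncoding_transcripts)
```

The two sums can then be compared with the "All transcripts" row here and the "non-coding RNAs" row of the Feature counts table.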

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.72	1	1	50
Number of exons per transcript	6.47	5	1	61

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Drosophila melanogaster known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 13,307 coding genes, 12,874 genes had a protein with an alignment covering 50% or more of the query and 8,916 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
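
The coverage definitions above amount to a simple ratio. The following is a minimal sketch under stated assumptions: the function name `coverage` and the example alignment lengths are hypothetical, and it is not the pipeline's own code. The gene counts in the second half are those reported in this section.

```python
def coverage(aligned_length, sequence_length):
    """Percentage of a sequence's length that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length

# Hypothetical example: an alignment spanning 475 residues of a
# 500-residue annotated protein (query) and 475 residues of a
# 600-residue D. melanogaster protein (target).
query_cov = coverage(475, 500)
target_cov = coverage(475, 600)

# Share of coding genes reaching each query-coverage threshold,
# using the counts reported in this section.
coding_genes = 13_307
share_cov50 = round(100 * 12_874 / coding_genes, 1)
share_cov95 = round(100 * 8_916 / coding_genes, 1)

print(query_cov, round(target_cov, 2), share_cov50, share_cov95)
```

In this hypothetical case the alignment would pass a 95% query-coverage threshold while falling well short of it on the target side, which is why the two measures are reported separately.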

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Drosophila melanogaster known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker and RepeatMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
DGUA_6	GCF_900245975.1	10.84%	30.27%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	1	1 (100.00%)	1 (100.00%)	99.38%	100.00%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,282,067,778	68%	13%	73,907
SAMD00047170	27897175	abdomen (Drosophila obscura, female, SAMD00047170)	106,852,574	73%	12%	40,658
SAMD00047171	27897175	abdomen (Drosophila obscura, male, SAMD00047171)	103,750,828	71%	12%	48,828
SAMD00047172	27897175	accessory gland (Drosophila obscura, male, SAMD00047172)	106,800,740	30%	15%	36,317
SAMD00047173	27897175	abdomen without gonad (Drosophila obscura, female, SAMD00047173)	101,070,604	62%	14%	42,788
SAMD00047174	27897175	abdomen without gonad (Drosophila obscura, male, SAMD00047174)	127,490,066	69%	13%	48,610
SAMD00047175	27897175	head (Drosophila obscura, female, SAMD00047175)	92,117,650	66%	14%	48,175
SAMD00047176	27897175	head (Drosophila obscura, male, SAMD00047176)	112,394,992	61%	15%	48,965
SAMD00047177	27897175	whole body (Drosophila obscura, female, SAMD00047177)	105,832,478	74%	12%	44,303
SAMD00047178	27897175	whole body (Drosophila obscura, male, SAMD00047178)	104,489,530	77%	12%	43,863
SAMD00047179	27897175	ovary (Drosophila obscura, female, SAMD00047179)	95,609,612	66%	16%	36,917
SAMD00047180	27897175	whole body (Drosophila obscura, female, SAMD00047180)	98,367,256	68%	13%	46,434
SAMD00047181	27897175	whole body (Drosophila obscura, male, SAMD00047181)	112,070,676	69%	14%	49,019
SAMD00047182	27897175	thorax (Drosophila obscura, female, SAMD00047182)	108,573,652	67%	16%	43,246
SAMD00047183	27897175	thorax (Drosophila obscura, male, SAMD00047183)	101,589,274	68%	17%	43,138
SAMD00047184	27897175	testis (Drosophila obscura, male, SAMD00047184)	100,160,586	43%	12%	40,209
SAMD00047185	27897175	whole body (Drosophila obscura, female, SAMD00047185)	113,701,834	67%	15%	50,893
SAMD00047186	27897175	whole body (Drosophila obscura, male, SAMD00047186)	115,390,622	62%	13%	53,620
SAMEA104164777	29947749	DSBR1 (Drosophila guanche, SAMEA104164777)	373,081,792	87%	10%	59,189
SAMN12123134	NA	whole body (Drosophila subobscura, SAMN12123134)	102,723,012	73%	22%	56,268

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
DRR055200	DRX049884	DRP003328	SAMD00047170	49,974,638	70%	12%
DRR055201	DRX049885	DRP003328	SAMD00047170	56,877,936	75%	13%
DRR055202	DRX049886	DRP003328	SAMD00047171	47,421,428	71%	12%
DRR055203	DRX049887	DRP003328	SAMD00047171	56,329,400	71%	12%
DRR055204	DRX049888	DRP003328	SAMD00047172	50,809,040	28%	15%
DRR055205	DRX049889	DRP003328	SAMD00047172	55,991,700	33%	14%
DRR055206	DRX049890	DRP003328	SAMD00047173	50,241,406	57%	14%
DRR055207	DRX049891	DRP003328	SAMD00047173	50,829,198	67%	14%
DRR055208	DRX049892	DRP003328	SAMD00047174	63,929,942	71%	13%
DRR055209	DRX049893	DRP003328	SAMD00047174	63,560,124	68%	13%
DRR055210	DRX049894	DRP003328	SAMD00047175	41,307,106	66%	14%
DRR055211	DRX049895	DRP003328	SAMD00047175	50,810,544	66%	14%
DRR055212	DRX049896	DRP003328	SAMD00047176	62,612,276	60%	15%
DRR055213	DRX049897	DRP003328	SAMD00047176	49,782,716	63%	15%
DRR055214	DRX049898	DRP003328	SAMD00047177	49,572,118	70%	14%
DRR055215	DRX049899	DRP003328	SAMD00047177	56,260,360	77%	11%
DRR055216	DRX049900	DRP003328	SAMD00047178	54,137,012	75%	12%
DRR055217	DRX049901	DRP003328	SAMD00047178	50,352,518	78%	13%
DRR055218	DRX049902	DRP003328	SAMD00047179	49,252,676	64%	16%
DRR055219	DRX049903	DRP003328	SAMD00047179	46,356,936	67%	16%
DRR055220	DRX049904	DRP003328	SAMD00047180	41,354,916	67%	13%
DRR055221	DRX049905	DRP003328	SAMD00047180	57,012,340	69%	13%
DRR055222	DRX049906	DRP003328	SAMD00047181	61,485,688	69%	14%
DRR055223	DRX049907	DRP003328	SAMD00047181	50,584,988	68%	13%
DRR055224	DRX049908	DRP003328	SAMD00047182	47,308,810	68%	16%
DRR055225	DRX049909	DRP003328	SAMD00047182	61,264,842	66%	17%
DRR055226	DRX049910	DRP003328	SAMD00047183	49,806,292	67%	17%
DRR055227	DRX049911	DRP003328	SAMD00047183	51,782,982	68%	18%
DRR055228	DRX049912	DRP003328	SAMD00047184	55,124,844	44%	12%
DRR055229	DRX049913	DRP003328	SAMD00047184	45,035,742	42%	12%
DRR055230	DRX049914	DRP003328	SAMD00047185	54,738,774	70%	15%
DRR055231	DRX049915	DRP003328	SAMD00047185	58,963,060	64%	15%
DRR055232	DRX049916	DRP003328	SAMD00047186	61,377,542	61%	13%
DRR055233	DRX049917	DRP003328	SAMD00047186	54,013,080	62%	13%
ERR2037048	ERX2096105	ERP024082	SAMEA104164777	38,578,818	88%	10%
ERR2037049	ERX2096106	ERP024082	SAMEA104164777	35,699,480	85%	10%
ERR2037050	ERX2096107	ERP024082	SAMEA104164777	24,089,944	91%	10%
ERR2037051	ERX2096108	ERP024082	SAMEA104164777	33,365,786	90%	8%
ERR2037052	ERX2096109	ERP024082	SAMEA104164777	20,692,070	91%	9%
ERR2037053	ERX2096110	ERP024082	SAMEA104164777	20,871,792	90%	9%
ERR2037054	ERX2096111	ERP024082	SAMEA104164777	50,593,336	84%	11%
ERR2037055	ERX2096112	ERP024082	SAMEA104164777	26,977,626	90%	11%
ERR2037056	ERX2096113	ERP024082	SAMEA104164777	22,612,040	88%	11%
ERR2037057	ERX2096114	ERP024082	SAMEA104164777	46,598,028	86%	11%
ERR2037058	ERX2096115	ERP024082	SAMEA104164777	23,293,196	88%	11%
ERR2037059	ERX2096116	ERP024082	SAMEA104164777	29,709,676	85%	11%
SRR9586630	SRX6352473	SRP211757	SAMN12123134	102,723,012	73%	22%
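
The per-sample figures can be approximately recovered from the per-run table by summing reads and taking a read-weighted average of the aligned percentages. This sketch assumes that is how the two tables relate; the function name `aggregate_by_sample` is hypothetical, and because the report gives rounded percentages, the recomputed values may differ from the published ones by about a point.

```python
# A few per-run rows copied from the table above:
# (run, sample, number of reads, percent aligned reads)
runs = [
    ("DRR055200", "SAMD00047170", 49_974_638, 70),
    ("DRR055201", "SAMD00047170", 56_877_936, 75),
    ("SRR9586630", "SAMN12123134", 102_723_012, 73),
]

def aggregate_by_sample(rows):
    """Return {sample: (total reads, read-weighted percent aligned)}."""
    totals = {}
    for _run, sample, reads, pct_aligned in rows:
        entry = totals.setdefault(sample, {"reads": 0, "aligned": 0.0})
        entry["reads"] += reads
        # Estimated number of aligned reads in this run.
        entry["aligned"] += reads * pct_aligned / 100
    return {
        sample: (e["reads"], round(100 * e["aligned"] / e["reads"]))
        for sample, e in totals.items()
    }

by_sample = aggregate_by_sample(runs)
for sample, (reads, pct) in by_sample.items():
    print(sample, f"{reads:,}", f"{pct}%")
```

For the rows used here, the recomputed totals and percentages agree with the corresponding rows of the per-sample table.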

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Insecta GenBank	77,595	56,868 (73.29%)	56,868 (73.29%)	64.32%	61.67%
Drosophila melanogaster GenBank	28,019	12,047 (43.00%)	12,047 (43.00%)	74.42%	80.98%
Drosophila melanogaster known RefSeq (NP_)	30,157	20,674 (68.55%)	20,674 (68.55%)	75.67%	83.99%
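
The percentages in the table above are the aligned counts divided by the number of sequences retrieved from Entrez. A minimal sketch of that arithmetic follows; the variable names are illustrative and the figures are copied from the table.

```python
# (source, sequences retrieved from Entrez, sequences aligned by ProSplign)
protein_sets = [
    ("Insecta GenBank", 77_595, 56_868),
    ("Drosophila melanogaster GenBank", 28_019, 12_047),
    ("Drosophila melanogaster known RefSeq (NP_)", 30_157, 20_674),
]

# Percent of retrieved sequences that were aligned, to two decimals.
percent_aligned = {
    source: round(100 * aligned / retrieved, 2)
    for source, retrieved, aligned in protein_sets
}

for source, pct in percent_aligned.items():
    print(f"{source}: {pct:.2f}%")
```

In this run every aligned protein was also passed to Gnomon, so the same ratios apply to the "passed to Gnomon" column.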

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
