NCBI Acanthaster planci Annotation Release 100

The RefSeq genome records for Acanthaster planci were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Acanthaster planci Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Aug 7 2017
Date of submission of annotation to the public databases: Aug 8 2017
Software version: 7.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
OKI-Apl_1.0	GCF_001949145.1	Okinawa Institute of Science and Technology	12-20-2016	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	OKI-Apl_1.0
Genes and pseudogenes	18,244
protein-coding	16,468
non-coding	1,732
pseudogenes	44
genes with variants	7,266
mRNAs	33,201
fully-supported	31,433
with > 5% ab initio	943
partial	215
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	33,201
Other RNAs	3,020
fully-supported	2,703
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	2,703
CDSs	33,201
fully-supported	31,433
with > 5% ab initio	1,059
partial	215
with major correction(s)	221
known RefSeq (NP_)	0
model RefSeq (XP_)	33,201
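
The top-level counts in the table above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, and the variable names are hypothetical and not part of the NCBI report.

```python
# Counts copied from the feature table above (assembly OKI-Apl_1.0).
gene_counts = {
    "protein-coding": 16_468,
    "non-coding": 1_732,
    "pseudogenes": 44,
}
genes_and_pseudogenes = 18_244
mrnas = 33_201
cdss = 33_201

# The three gene categories should add up to the reported total.
category_total = sum(gene_counts.values())
print(category_total == genes_and_pseudogenes)

# In this release the mRNA and CDS counts are reported as equal.
print(mrnas == cdss)

# Average number of mRNAs per protein-coding gene.
mrnas_per_coding_gene = mrnas / gene_counts["protein-coding"]
print(f"{mrnas_per_coding_gene:.2f}")
```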


Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	18,200	16,857	9,705	71	355,418
All transcripts	36,221	4,003	3,339	71	54,929
mRNA	33,201	4,185	3,492	219	54,929
misc_RNA	670	3,141	2,352	126	20,776
tRNA	317	74	73	71	84
lncRNA	2,033	1,919	1,496	119	13,429
Single-exon transcripts	810	2,535	2,047	327	17,868
coding transcripts (NM_/XM_)	810	2,535	2,047	327	17,868
CDSs	33,201	2,094	1,491	195	52,875
Exons	190,404	374	150	1	20,103
in coding transcripts (NM_/XM_)	183,499	368	150	1	20,103
in non-coding transcripts (NR_/XR_)	10,037	440	149	3	13,210
Introns	172,461	1,844	676	30	172,144
in coding transcripts (NM_/XM_)	167,405	1,807	672	30	172,144
in non-coding transcripts (NR_/XR_)	8,049	2,711	829	30	99,216

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.01	1	1	46
Number of exons per transcript	11.37	8	1	181

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 16,468 coding genes, 12,398 genes had a protein with an alignment covering 50% or more of the query and 2,783 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
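
The two coverage definitions above can be sketched in Python as follows. This is a minimal illustration and not the pipeline's own code: the function name and the example coordinates are hypothetical, and alignment coordinates are assumed to be 1-based and inclusive.

```python
def coverage_percent(aln_start: int, aln_end: int, sequence_length: int) -> float:
    """Percentage of a sequence that lies inside an alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aln_start <= aln_end <= sequence_length:
        raise ValueError("alignment coordinates must lie within the sequence")
    aligned_length = aln_end - aln_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-390 of a 400-residue annotated protein
# (the query) align to residues 1-380 of a 500-residue Swiss-Prot protein
# (the target).
query_coverage = coverage_percent(11, 390, 400)
target_coverage = coverage_percent(1, 380, 500)
print(query_coverage, target_coverage)  # 95.0 76.0
```

In this made-up case the gene would count toward both the "50% or more" and the "95% or more" query coverage totals, while its target coverage would be lower because the target protein is longer than the aligned region.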

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
OKI-Apl_1.0	GCF_001949145.1	0.96%	20.40%
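
As a rough illustration of what a "% masked" figure measures, the sketch below computes the masked percentage of a sequence. It assumes masked bases are stored as lowercase letters (soft-masking), which the report does not state; the function name and the toy sequence are hypothetical.

```python
def masked_percent(sequence: str) -> float:
    """Percentage of bases written in lowercase (assumed soft-masked)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


# Toy sequence: 20 bases, 6 of them lowercase.
toy_sequence = "ACGTacgtacACGTACGTAC"
print(f"{masked_percent(toy_sequence):.2f}%")  # 30.00%
```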

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	7	6 (85.71%)	6 (85.71%)	98.33%	98.16%
Eleutherozoa known RefSeq (NM_/NR_)	492	84 (17.07%)	13 (2.64%)	87.76%	97.15%
Eleutherozoa GenBank	3,852	479 (12.44%)	74 (1.92%)	88.51%	86.71%
Eleutherozoa EST	359,335	10,048 (2.80%)	4,658 (1.30%)	89.02%	95.87%
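
The percentages in parentheses appear to be taken relative to the number of sequences retrieved from Entrez. The sketch below reproduces them from the counts in the table under that assumption; the helper name `percent_of` is hypothetical.

```python
def percent_of(part: int, whole: int) -> str:
    """Format part/whole as a percentage with two decimals."""
    return f"{100 * part / whole:.2f}%"


# (retrieved from Entrez, aligned by Splign, passed to Gnomon),
# copied from the transcript alignments table above.
transcript_rows = {
    "Same-species GenBank": (7, 6, 6),
    "Eleutherozoa known RefSeq (NM_/NR_)": (492, 84, 13),
    "Eleutherozoa GenBank": (3_852, 479, 74),
    "Eleutherozoa EST": (359_335, 10_048, 4_658),
}

for source, (retrieved, aligned, passed) in transcript_rows.items():
    print(source, percent_of(aligned, retrieved), percent_of(passed, retrieved))
```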

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	941,368,730	88%	13%	209,821
SAMD00063048	25394327	testis, (Acanthaster planci, SAMD00063048)	87,410,010	89%	15%	171,034
SAMD00063049	25394327	podia, (Acanthaster planci, SAMD00063049)	95,894,110	90%	14%	154,239
SAMD00063050	25394327	spine, (Acanthaster planci, SAMD00063050)	92,054,776	90%	7%	107,994
SAMD00063051	25394327	stomach, (Acanthaster planci, SAMD00063051)	81,835,298	90%	14%	154,847
SAMD00063052	25394327	body-wall, (Acanthaster planci, SAMD00063052)	81,190,108	90%	5%	104,757
SAMD00063053	25394327	testis, (Acanthaster planci, SAMD00063053)	15,414,606	89%	15%	119,379
SAMD00063054	25394327	podia, (Acanthaster planci, SAMD00063054)	41,713,806	90%	12%	145,397
SAMD00063055	25394327	spine, (Acanthaster planci, SAMD00063055)	77,984,926	89%	11%	156,832
SAMD00063056	25394327	mouth, (Acanthaster planci, female, SAMD00063056)	5,378,002	90%	5%	43,439
SAMD00063057	25394327	nerve, (Acanthaster planci, female, SAMD00063057)	43,021,166	88%	13%	137,527
SAMD00063058	25394327	nerve, (Acanthaster planci, male, SAMD00063058)	29,208,712	87%	13%	132,344
SAMD00063059	25394327	nerve, (Acanthaster planci, female, SAMD00063059)	31,277,784	87%	12%	141,421
SAMD00063060	25394327	oocyte, (Acanthaster planci, SAMD00063060)	77,274,416	84%	19%	154,474
SAMD00063061	25394327	oki-early-gastrula mRNA (Acanthaster planci, SAMD00063061)	30,300,796	87%	15%	135,563
SAMD00063062	25394327	oki-middle-gastrula mRNA (Acanthaster planci, SAMD00063062)	98,444,216	86%	14%	161,397
SAMN02689703	NA	testes (Acanthaster planci, adult, male, SAMN02689703)	52,965,998	89%	15%	143,827
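
The "All" row can be checked against the per-sample rows by summing the read counts. The sketch below does this with the numbers copied from the table; the variable names are hypothetical.

```python
# Number of reads per sample, copied from the table above.
reads_per_sample = {
    "SAMD00063048": 87_410_010,
    "SAMD00063049": 95_894_110,
    "SAMD00063050": 92_054_776,
    "SAMD00063051": 81_835_298,
    "SAMD00063052": 81_190_108,
    "SAMD00063053": 15_414_606,
    "SAMD00063054": 41_713_806,
    "SAMD00063055": 77_984_926,
    "SAMD00063056": 5_378_002,
    "SAMD00063057": 43_021_166,
    "SAMD00063058": 29_208_712,
    "SAMD00063059": 31_277_784,
    "SAMD00063060": 77_274_416,
    "SAMD00063061": 30_300_796,
    "SAMD00063062": 98_444_216,
    "SAMN02689703": 52_965_998,
}

total_reads = sum(reads_per_sample.values())
print(f"{total_reads:,}")  # 941,368,730, matching the "All" row
```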

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
DRR072325	DRX066263	DRP003378	SAMD00063048	87,410,010	89%	15%
DRR072326	DRX066264	DRP003378	SAMD00063049	95,894,110	90%	14%
DRR072327	DRX066265	DRP003378	SAMD00063050	92,054,776	90%	7%
DRR072328	DRX066266	DRP003378	SAMD00063051	81,835,298	90%	14%
DRR072329	DRX066267	DRP003378	SAMD00063052	81,190,108	90%	5%
DRR072330	DRX066268	DRP003378	SAMD00063053	15,414,606	89%	15%
DRR072331	DRX066269	DRP003378	SAMD00063054	41,713,806	90%	12%
DRR072332	DRX066270	DRP003378	SAMD00063055	77,984,926	89%	11%
DRR072333	DRX066271	DRP003378	SAMD00063056	5,378,002	90%	5%
DRR072334	DRX066272	DRP003378	SAMD00063057	43,021,166	88%	13%
DRR072335	DRX066273	DRP003378	SAMD00063058	29,208,712	87%	13%
DRR072336	DRX066274	DRP003378	SAMD00063059	31,277,784	87%	12%
DRR072337	DRX066275	DRP003378	SAMD00063060	77,274,416	84%	19%
DRR072338	DRX066276	DRP003378	SAMD00063061	30,300,796	87%	15%
DRR072339	DRX066277	DRP003378	SAMD00063062	98,444,216	86%	14%
SRR1197243	SRX493873	SRP051695	SAMN02689703	52,965,998	89%	15%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Saccoglossus kowalevskii GenBank	267	224 (83.90%)	224 (83.90%)	63.60%	44.02%
Saccoglossus kowalevskii high-quality model RefSeq (XP_)	6,124	4,589 (74.93%)	4,589 (74.93%)	64.66%	59.42%
Saccoglossus kowalevskii known RefSeq (NP_)	474	432 (91.14%)	432 (91.14%)	67.75%	56.16%
Trichoplax adhaerens GenBank	83	80 (96.39%)	80 (96.39%)	66.00%	80.39%
Same-species GenBank	7	7 (100.00%)	7 (100.00%)	74.46%	89.14%
Eleutherozoa GenBank	3,139	1,981 (63.11%)	1,981 (63.11%)	69.11%	67.25%
Eleutherozoa known RefSeq (NP_)	427	379 (88.76%)	379 (88.76%)	67.88%	62.87%
Crassostrea gigas GenBank	722	399 (55.26%)	399 (55.26%)	68.10%	68.96%
Crassostrea gigas high-quality model RefSeq (XP_)	22,081	11,672 (52.86%)	11,672 (52.86%)	56.25%	35.93%
Crassostrea gigas known RefSeq (NP_)	141	112 (79.43%)	112 (79.43%)	66.48%	61.31%
Nematostella vectensis GenBank	420	351 (83.57%)	351 (83.57%)	63.93%	41.94%
Saccharomyces cerevisiae S288C known RefSeq (NP_)	5,983	1,613 (26.96%)	1,613 (26.96%)	57.85%	45.64%
Hydra vulgaris GenBank	543	294 (54.14%)	294 (54.14%)	61.14%	46.04%
Hydra vulgaris known RefSeq (NP_)	198	120 (60.61%)	120 (60.61%)	58.12%	35.10%
Schistosoma mansoni GenBank	1,379	385 (27.92%)	385 (27.92%)	63.01%	57.59%
Caenorhabditis elegans GenBank	2,393	1,339 (55.95%)	1,339 (55.95%)	59.61%	43.38%
Caenorhabditis elegans known RefSeq (NP_)	28,225	8,731 (30.93%)	8,731 (30.93%)	57.70%	37.27%
Drosophila melanogaster GenBank	27,999	13,307 (47.53%)	13,307 (47.53%)	58.98%	42.01%
Drosophila melanogaster known RefSeq (NP_)	30,469	15,071 (49.46%)	15,071 (49.46%)	59.30%	41.67%
Strongylocentrotus purpuratus high-quality model RefSeq (XP_)	13,741	10,688 (77.78%)	10,688 (77.78%)	60.60%	47.95%
Ciona intestinalis GenBank	1,249	753 (60.29%)	753 (60.29%)	60.19%	37.11%
Ciona intestinalis high-quality model RefSeq (XP_)	10,476	6,218 (59.35%)	6,218 (59.35%)	57.23%	40.99%
Ciona intestinalis known RefSeq (NP_)	950	630 (66.32%)	630 (66.32%)	59.47%	37.33%
Branchiostoma floridae GenBank	430	320 (74.42%)	320 (74.42%)	63.88%	43.99%
Branchiostoma belcheri high-quality model RefSeq (XP_)	7,530	6,557 (87.08%)	6,557 (87.08%)	60.79%	49.91%
Homo sapiens GenBank	128,833	76,626 (59.48%)	76,626 (59.48%)	59.80%	44.75%
Homo sapiens known RefSeq (NP_)	49,232	33,323 (67.69%)	33,323 (67.69%)	58.64%	40.79%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
