NCBI Crassostrea virginica Annotation Release 100

The RefSeq genome records for Crassostrea virginica were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Crassostrea virginica Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Sep 11 2017
Date of submission of annotation to the public databases: Sep 14 2017
Software version: 7.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
C_virginica-3.0	GCF_002022765.2	McDonnell Genome Institute - Washington University School of Medicine	09-01-2017	Reference	11 assembled chromosomes
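
As a small illustration, the assembly accession in the table above can be split into its parts. The sketch below assumes the usual GCA_/GCF_ layout (prefix, nine digits, a dot, then a version number). The function name `parse_assembly_accession` is hypothetical and not part of any NCBI tool.

```python
import re

# Assumed layout: GCF_ (RefSeq) or GCA_ (GenBank), nine digits, '.', version.
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Return (prefix, number, version); raise ValueError if malformed."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version)


# The accession listed for C_virginica-3.0.
parsed = parse_assembly_accession("GCF_002022765.2")
print(parsed)
```

Keeping the numeric part as a string preserves its leading zeros, while the version is converted to an integer so that releases can be compared.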

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	C_virginica-3.0
Genes and pseudogenes	39,493
  protein-coding	34,596
  non-coding	4,230
  pseudogenes	667
  genes with variants	11,221
mRNAs	60,201
  fully-supported	55,367
  with > 5% ab initio	3,014
  partial	148
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	60,201
Other RNAs	6,986
  fully-supported	6,422
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	6,422
CDSs	60,201
  fully-supported	55,367
  with > 5% ab initio	3,233
  partial	148
  with major correction(s)	2,176
  known RefSeq (NP_)	0
  model RefSeq (XP_)	60,201
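
The table distinguishes known and model RefSeq records by accession prefix. A minimal sketch of that mapping follows, using only the prefixes named in the table. The helper `classify_refseq` and the example accession are hypothetical, made up for illustration.

```python
# Prefix meanings as labelled in the feature counts table above.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "known"),
    "XM_": ("mRNA", "model"),
    "NR_": ("other RNA", "known"),
    "XR_": ("other RNA", "model"),
    "NP_": ("protein", "known"),
    "XP_": ("protein", "model"),
}


def classify_refseq(accession):
    """Return (molecule, status), or None if the prefix is not in the table."""
    return REFSEQ_PREFIXES.get(accession[:3])


# Placeholder accession, not a real record.
print(classify_refseq("XM_000000001.1"))
```

In this release every mRNA, other RNA and CDS with a RefSeq accession is a model record, which is why all of the known RefSeq rows are 0.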

Detailed reports

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	38,826	10,828	5,829	70	529,353
All transcripts	67,187	2,669	2,072	70	64,880
  mRNA	60,201	2,829	2,206	183	64,880
  misc_RNA	1,672	2,526	2,072	99	15,610
  tRNA	564	74	73	70	84
  lncRNA	4,750	998	705	81	7,692
Single-exon transcripts	1,833	1,538	1,317	303	8,149
  coding transcripts (NM_/XM_)	1,833	1,538	1,317	303	8,149
CDSs	60,201	1,960	1,398	183	63,900
Exons	346,293	273	138	1	11,272
  in coding transcripts (NM_/XM_)	328,467	273	138	1	11,272
  in non-coding transcripts (NR_/XR_)	24,866	245	129	2	10,888
Introns	305,507	1,357	419	30	484,530
  in coding transcripts (NM_/XM_)	292,259	1,352	417	30	484,530
  in non-coding transcripts (NR_/XR_)	19,903	1,430	445	30	415,446
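
The feature counts and feature lengths tables can be cross-checked with simple arithmetic. The sketch below copies the values from both tables. Reading the Genes row as excluding pseudogenes is an inference from the numbers, not something this report states.

```python
# Values copied from the feature counts and feature lengths tables.
protein_coding = 34_596
non_coding = 4_230
pseudogenes = 667
genes_and_pseudogenes = 39_493
genes_in_length_table = 38_826

mrnas = 60_201
other_rnas = 6_986
misc_rna, trna, lncrna = 1_672, 564, 4_750
all_transcripts = 67_187

# Gene categories add up to the "Genes and pseudogenes" total.
assert protein_coding + non_coding + pseudogenes == genes_and_pseudogenes
# Inferred: the "Genes" row of the length table leaves out pseudogenes.
assert genes_and_pseudogenes - pseudogenes == genes_in_length_table
# Transcript classes add up to "All transcripts".
assert mrnas + other_rnas == all_transcripts
assert misc_rna + trna + lncrna == other_rnas
print("counts are consistent")
```

The exon and intron sub-rows are not expected to sum to their parent rows; one plausible reason is that a feature shared by coding and non-coding transcripts is counted in both sub-rows.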

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.74	1	1	50
Number of exons per transcript	10.81	7	1	250

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 34,596 coding genes, 19,030 genes had a protein with an alignment covering 50% or more of the query and 3,974 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
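
The two definitions can be written as one small function. This is a simplified sketch that assumes a single contiguous alignment given in 1-based inclusive coordinates and ignores gaps. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence spanned by one alignment (1-based, inclusive)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned = aligned_end - aligned_start + 1
    return 100.0 * aligned / sequence_length


# Hypothetical hit: residues 11-210 of a 400-aa annotated protein (query)
# aligned to residues 1-200 of a 250-aa Swiss-Prot protein (target).
query_cov = coverage_percent(11, 210, 400)   # 50.0
target_cov = coverage_percent(1, 200, 250)   # 80.0
print(query_cov, target_cov)
```

The same alignment therefore yields different query and target coverage whenever the two proteins differ in length.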

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

[Graph not reproduced here. Query: annotated proteins. Target: UniProtKB/Swiss-Prot curated proteins.]
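
A cumulative count of this kind can be sketched as follows. The per-gene coverage values are hypothetical; only the totals in the last two lines come from this report.

```python
def genes_at_or_above(coverages, thresholds):
    """For each threshold, count genes whose best coverage is >= the threshold."""
    return {t: sum(1 for c in coverages if c >= t) for t in thresholds}


# Hypothetical best query coverage (%) for six genes.
best_coverage = [12.0, 48.5, 50.0, 77.3, 95.0, 99.1]
counts = genes_at_or_above(best_coverage, thresholds=[50, 95])
print(counts)  # {50: 4, 95: 2}

# Shares of coding genes implied by the totals reported in this section.
share_50 = round(100 * 19_030 / 34_596, 1)  # 55.0
share_95 = round(100 * 3_974 / 34_596, 1)   # 11.5
```

By these figures, roughly 55% of coding genes have a protein with at least 50% query coverage and about 11.5% have one with at least 95%.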

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
C_virginica-3.0	GCF_002022765.2	5.52%	36.19%
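
For illustration, a masked percentage like those above could be computed from a soft-masked sequence. The sketch assumes that masked bases are written in lowercase, a common convention that this report does not itself specify. The function name and the toy sequence are hypothetical.

```python
def percent_masked(sequence):
    """Percentage of bases that are lowercase (soft-masked) in a sequence."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


toy = "ACGTacgtacACGTACGTAC"  # 6 of 20 bases are lowercase
print(percent_masked(toy))  # 30.0
```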

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	76	74 (97.37%)	63 (82.89%)	98.38%	98.76%
Same-species EST	14,559	10,471 (71.92%)	8,887 (61.04%)	98.49%	99.16%
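
The "Number (%)" cells combine a count and a percentage. The sketch below parses such a cell and recomputes the percentage against the number of sequences retrieved from Entrez. That denominator is inferred from the arithmetic rather than stated in the report, and the helper names are hypothetical.

```python
import re

# Matches cells such as "74 (97.37%)" or "10,471 (71.92%)".
_CELL_RE = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_cell(cell):
    """Split a 'count (percent%)' cell into (count, percent)."""
    match = _CELL_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


def percent_of(count, total):
    """Percentage of total, rounded to two decimals as in the table."""
    return round(100.0 * count / total, 2)


# Same-species EST row: 14,559 retrieved, 10,471 aligned by Splign.
aligned, reported = parse_count_cell("10,471 (71.92%)")
assert percent_of(aligned, 14_559) == reported
```

The same check holds for the "passed to Gnomon" column, which suggests both percentages use the retrieved count as their denominator.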

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,915,658,231	40%	37%	389,201
SAMN00780095	The eastern oyster Crassostrea virginica (Crassostrea virginica, SAMN00780095)	52,857,842	66%	29%	245,864
SAMN02402291	adductor muscle, gill, mantle (Crassostrea virginica, SAMN02402291)	802,264	66%	45%	97,043
SAMN02402292	adductor muscle, gill, mantle (Crassostrea virginica, SAMN02402292)	578,687	65%	47%	90,046
SAMN02786835	Whole oyster, no signs of ROD, challenge-resistant, 1 day (Crassostrea virginica, 1 year old, SAMN02786835)	33,468,655	53%	27%	230,873
SAMN02797576	Whole oyster, no signs of ROD, challenge-resistant, control, 5d (Crassostrea virginica, 1 year old, SAMN02797576)	31,965,765	73%	29%	241,065
SAMN02797577	Whole oyster, no signs of ROD, challenge-resistant, control, 15d (Crassostrea virginica, 1 year old, SAMN02797577)	31,170,987	67%	25%	234,137
SAMN02797578	Whole oyster, no signs of ROD, challenge-resistant, control, 30d (Crassostrea virginica, 1 year old, SAMN02797578)	30,759,369	70%	24%	228,317
SAMN02797579	Whole oyster, no signs of ROD, challenge-resistant, 1d (Crassostrea virginica, 1 year old, SAMN02797579)	32,354,340	67%	30%	241,783
SAMN02797580	Whole oyster, no signs of ROD, challenge-resistant, 5d (Crassostrea virginica, 1 year old, SAMN02797580)	31,965,765	73%	29%	241,065
SAMN02797581	Whole oyster, no signs of ROD, challenge-resistant, 15d (Crassostrea virginica, 1 year old, SAMN02797581)	31,374,190	63%	24%	225,512
SAMN02797582	Whole oyster, no signs of ROD, challenge-resistant, 30d (Crassostrea virginica, 1 year old, SAMN02797582)	32,215,032	60%	27%	231,666
SAMN02797583	Whole oyster, no signs of ROD, challenge-susceptible, 1d (Crassostrea virginica, 1 year old, SAMN02797583)	32,820,079	70%	28%	244,373
SAMN02797584	Whole oyster, 5% mortality, challenge-susceptible, 5d (Crassostrea virginica, 1 year old, SAMN02797584)	32,761,787	68%	31%	234,756
SAMN02797585	Whole oyster, 15% mortality, challenge-susceptible, 15d (Crassostrea virginica, 1 year old, Not determined, SAMN02797585)	30,378,384	63%	23%	212,884
SAMN02797586	Whole oyster, ROD, 30% mortality, challenge-susceptible, 30d (Crassostrea virginica, 1 year old, SAMN02797586)	31,859,180	67%	25%	229,734
SAMN05216814	gill (Crassostrea virginica, unknown, not determined, SAMN05216814)	96,422,612	65%	31%	290,522
SAMN05216824	Digestive Gland (Crassostrea virginica, unknown, not determined, SAMN05216824)	783,963	54%	55%	103,547
SAMN06617317	Sample from Crassostrea virginica (Crassostrea virginica, 16 d, SAMN06617317)	134,810,996	33%	44%	260,682
SAMN06617318	mantle (Crassostrea virginica, SAMN06617318)	117,159,696	36%	43%	266,714
SAMN06617319	Sample from Crassostrea virginica (Crassostrea virginica, 12 d, SAMN06617319)	152,029,268	19%	41%	242,039
SAMN06617320	Sample from Crassostrea virginica (Crassostrea virginica, 16 d, SAMN06617320)	144,400,862	34%	45%	263,726
SAMN06617321	digestive (Crassostrea virginica, SAMN06617321)	138,783,036	38%	46%	284,624
SAMN06617322	Sample from Crassostrea virginica (Crassostrea virginica, 5 d, SAMN06617322)	150,273,516	23%	41%	247,852
SAMN06617323	Sample from Crassostrea virginica (Crassostrea virginica, 12 d, SAMN06617323)	122,181,730	23%	49%	243,336
SAMN06617324	gill (Crassostrea virginica, SAMN06617324)	124,711,320	37%	44%	276,920
SAMN06617325	Sample from Crassostrea virginica (Crassostrea virginica, 5 d, SAMN06617325)	158,969,060	27%	45%	253,454
SAMN06617326	adductor muscle (Crassostrea virginica, SAMN06617326)	137,799,846	39%	42%	213,764

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR404226	SRX118365	SRP010711	SAMN00780095	52,857,842	66%	29%
SRR1029667	SRX377511	SRP032997	SAMN02402291	802,264	66%	45%
SRR1029668	SRX377513	SRP032997	SAMN02402292	578,687	65%	47%
SRR1293904	SRX547011	SRP042090	SAMN02786835	33,468,655	53%	27%
SRR1298417	SRX550313	SRP042090	SAMN02797576	31,965,765	73%	29%
SRR1298710	SRX551516	SRP042090	SAMN02797577	31,170,987	67%	25%
SRR1298421	SRX551498	SRP042090	SAMN02797578	30,759,369	70%	24%
SRR1298693	SRX551503	SRP042090	SAMN02797579	32,354,340	67%	30%
SRR1298711	SRX551517	SRP042090	SAMN02797580	31,965,765	73%	29%
SRR1298698	SRX551505	SRP042090	SAMN02797581	31,374,190	63%	24%
SRR1298701	SRX551506	SRP042090	SAMN02797582	32,215,032	60%	27%
SRR1298703	SRX551509	SRP042090	SAMN02797583	32,820,079	70%	28%
SRR1298704	SRX551510	SRP042090	SAMN02797584	32,761,787	68%	31%
SRR1298708	SRX551512	SRP042090	SAMN02797585	30,378,384	63%	23%
SRR1298387	SRX551514	SRP042090	SAMN02797586	31,859,180	67%	25%
SRR3649424	SRX1831878	SRP076299	SAMN05216814	678,822	61%	56%
SRR3649587	SRX1831891	SRP076299	SAMN05216814	95,743,790	65%	31%
SRR3649790	SRX1832039	SRP076299	SAMN05216824	783,963	54%	55%
SRR5357618	SRX2652899	SRP101653	SAMN06617317	134,810,996	33%	44%
SRR5357621	SRX2652893	SRP101653	SAMN06617318	117,159,696	36%	43%
SRR5357622	SRX2652895	SRP101653	SAMN06617319	152,029,268	19%	41%
SRR5357617	SRX2652894	SRP101653	SAMN06617320	144,400,862	34%	45%
SRR5357624	SRX2652897	SRP101653	SAMN06617321	138,783,036	38%	46%
SRR5357619	SRX2652896	SRP101653	SAMN06617322	150,273,516	23%	41%
SRR5357626	SRX2652898	SRP101653	SAMN06617323	122,181,730	23%	49%
SRR5357620	SRX2652892	SRP101653	SAMN06617324	124,711,320	37%	44%
SRR5357623	SRX2652900	SRP101653	SAMN06617325	158,969,060	27%	45%
SRR5357625	SRX2652901	SRP101653	SAMN06617326	137,799,846	39%	42%
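
The by-run and by-sample tables are related by summing the runs that share a sample. A minimal sketch using three rows from the by-run table follows; the function name is hypothetical.

```python
from collections import defaultdict

# (run, sample, number of reads) copied from the by-run table.
runs = [
    ("SRR3649424", "SAMN05216814", 678_822),
    ("SRR3649587", "SAMN05216814", 95_743_790),
    ("SRR3649790", "SAMN05216824", 783_963),
]


def reads_per_sample(rows):
    """Sum read counts over all runs belonging to the same sample."""
    totals = defaultdict(int)
    for _run, sample, n_reads in rows:
        totals[sample] += n_reads
    return dict(totals)


totals = reads_per_sample(runs)
# Matches the 96,422,612 reads listed for SAMN05216814 in the by-sample table.
assert totals["SAMN05216814"] == 96_422_612
```

SAMN05216814 is the only sample in this release with more than one run; every other sample maps to a single run, so its read count is identical in both tables.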

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Crassostrea gigas high-quality model RefSeq (XP_)	9,146	9,027 (98.70%)	9,027 (98.70%)	77.37%	89.14%
Mollusca GenBank	8,266	4,367 (52.83%)	4,367 (52.83%)	71.89%	74.75%
Mollusca known RefSeq (NP_)	473	411 (86.89%)	411 (86.89%)	74.54%	78.76%
Aplysia californica high-quality model RefSeq (XP_)	9,875	6,904 (69.91%)	6,904 (69.91%)	60.51%	52.84%
Same-species GenBank	55	48 (87.27%)	48 (87.27%)	86.47%	93.00%
Drosophila melanogaster known RefSeq (NP_)	30,469	15,674 (51.44%)	15,674 (51.44%)	62.74%	50.16%
Homo sapiens known RefSeq (NP_)	49,883	32,403 (64.96%)	32,403 (64.96%)	59.91%	45.33%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
