NCBI Brassica oleracea Annotation Release 100

The RefSeq genome records for Brassica oleracea were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Brassica oleracea Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Aug 20 2015
Date of submission of annotation to the public databases: Aug 25 2015
Software version: 6.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
BOL	GCF_000695525.1	CanSeq	05-27-2014	Reference	10 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	BOL
Genes and pseudogenes	53,027
protein-coding	44,305
non-coding	4,227
pseudogenes	4,495
genes with variants	9,542
mRNAs	56,610
fully-supported	48,013
with > 5% ab initio	7,098
partial	1,009
with filled gap(s)	405
known RefSeq (NM_)	0
model RefSeq (XM_)	56,610
Other RNAs	10,981
fully-supported	10,004
with > 5% ab initio	0
partial	2
with filled gap(s)	2
known RefSeq (NR_)	0
model RefSeq (XR_)	10,004
CDSs	56,610
fully-supported	48,013
with > 5% ab initio	7,238
partial	863
with major correction(s)	990
known RefSeq (NP_)	0
model RefSeq (XP_)	56,610
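
The rows of the feature count table can be cross-checked against each other. The sketch below is only an illustration using the numbers reported above: it confirms that the three gene categories add up to the "Genes and pseudogenes" total, and that the mRNA, CDS and model protein counts agree. The variable names are hypothetical and are not part of the annotation pipeline.

```python
# Counts copied from the "Feature counts" table for assembly BOL.
gene_counts = {
    "protein-coding": 44_305,
    "non-coding": 4_227,
    "pseudogenes": 4_495,
}
reported_total = 53_027  # "Genes and pseudogenes" row

# The three categories should sum to the reported total.
computed_total = sum(gene_counts.values())
print(computed_total, computed_total == reported_total)

# In this table the mRNA, CDS and model protein (XP_) rows report the same count.
mrnas, cdss, model_proteins = 56_610, 56_610, 56_610
counts_agree = mrnas == cdss == model_proteins
print(counts_agree)
```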

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	48,532	2,379	1,835	70	93,774
All transcripts	67,591	1,491	1,301	70	16,401
mRNA	56,610	1,559	1,355	114	16,401
misc_RNA	2,062	1,616	1,408	117	12,896
tRNA	977	74	73	70	87
lncRNA	7,942	1,142	1,036	81	7,021
Single-exon transcripts	8,538	1,066	897	114	6,659
coding transcripts (NM_/XM_)	8,538	1,066	897	114	6,659
CDSs	56,610	1,247	1,041	105	16,095
Exons	266,772	288	166	1	11,054
in coding transcripts (NM_/XM_)	243,310	290	165	1	11,054
in non-coding transcripts (NR_/XR_)	28,245	255	167	2	6,695
Introns	208,723	250	97	30	88,685
in coding transcripts (NM_/XM_)	192,601	243	97	30	70,030
in non-coding transcripts (NR_/XR_)	20,649	305	107	30	88,685

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.37	1	1	49
Number of exons per transcript	5.63	4	1	63

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 44,305 coding genes, 40,739 genes had a protein with an alignment covering 50% or more of the query and 30,714 had an alignment covering 95% or more of the query.
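
Expressed as shares of the coding gene set, the counts above work out as follows. This is a small arithmetic sketch on the reported numbers only; the variable names are illustrative.

```python
# Counts taken from the paragraph above.
coding_genes = 44_305
genes_cov50 = 40_739  # genes with an alignment covering >= 50% of the query
genes_cov95 = 30_714  # genes with an alignment covering >= 95% of the query

share_cov50 = round(100 * genes_cov50 / coding_genes, 2)
share_cov95 = round(100 * genes_cov95 / coding_genes, 2)

print(f"{share_cov50}% of coding genes reach 50% query coverage")  # 91.95%
print(f"{share_cov95}% of coding genes reach 95% query coverage")  # 69.32%
```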

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
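
The two coverage definitions can be sketched as below. This is a minimal illustration, assuming a single contiguous alignment described by 1-based inclusive coordinates on each sequence. The function name and the example coordinates are hypothetical, and the exact computation used by the pipeline is not described in this report.

```python
def coverage_percent(start: int, end: int, sequence_length: int) -> float:
    """Percentage of a sequence spanned by an alignment.

    Assumes 1-based inclusive coordinates and one contiguous aligned span.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= start <= end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_length = end - start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: a 400-residue annotated protein (query) aligned
# to a 500-residue high-quality protein (target).
query_coverage = coverage_percent(1, 380, 400)
target_coverage = coverage_percent(21, 400, 500)

print(query_coverage)   # 95.0
print(target_coverage)  # 76.0
```

In this made-up case the annotated protein would count toward the 95% query coverage threshold even though only 76% of the target is covered, which is why the two measures are reported separately.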

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
BOL	GCF_000695525.1	2.84%	30.37%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	638	601 (94.20%)	480 (75.24%)	98.90%	98.61%
Same-species EST	179,230	162,235 (90.52%)	146,407 (81.69%)	99.03%	99.32%
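
The percentages in parentheses are relative to the number of sequences retrieved from Entrez. The sketch below recomputes them from the counts in the table; the helper name `percent_of` is hypothetical and is used here only for illustration.

```python
def percent_of(part: int, total: int) -> float:
    """Return part as a percentage of total, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# (retrieved from Entrez, aligned by Splign, passed to Gnomon), from the table above.
transcript_sources = {
    "Same-species GenBank": (638, 601, 480),
    "Same-species EST": (179_230, 162_235, 146_407),
}

for source, (retrieved, aligned, passed) in transcript_sources.items():
    print(source, percent_of(aligned, retrieved), percent_of(passed, retrieved))
```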

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	1,816,466,884	84%	23%	253,126
SAMEA1094410	Brassica oleracea var. alboglabra (Brassica oleracea var. alboglabra, SAMEA1094410)	49,268,765	85%	9%	145,365
SAMEA1468342	Brassica oleracea var. alboglabra (Brassica oleracea var. alboglabra, SAMEA1468342)	226,700,688	90%	27%	178,853
SAMN01831063	callus (Brassica oleracea, SAMN01831063)	32,646,954	60%	15%	157,527
SAMN01831064	flower (Brassica oleracea, blooming in the same day, SAMN01831064)	40,411,902	88%	20%	157,221
SAMN01831065	leaf (Brassica oleracea, 7-week old, SAMN01831065)	42,603,002	87%	19%	155,828
SAMN01831066	root (Brassica oleracea, 7-week old, SAMN01831066)	35,995,912	87%	16%	142,985
SAMN01831067	silique (Brassica oleracea, 15-day after pollination, SAMN01831067)	56,537,648	88%	20%	166,182
SAMN01831068	stem (Brassica oleracea, 7-week old, SAMN01831068)	47,518,902	86%	18%	151,306
SAMN01831069	bud (Brassica oleracea, SAMN01831069)	42,221,896	87%	20%	161,281
SAMN02324429	clubroot root disease resistant strain (Brassica oleracea, SAMN02324429)	254,286,104	87%	25%	191,290
SAMN02324677	blackrot disease resistant and nonresistant (Brassica oleracea, SAMN02324677)	219,777	82%	46%	70,900
SAMN02371508	leaf, 33 days, cold tolerant/susceptible lines (Brassica oleracea var. capitata, SAMN02371508)	166,634,202	87%	23%	186,816
SAMN02399969	cabbage (Brassica oleracea var. capitata, SAMN02399969)	68,996,928	80%	21%	180,724
SAMN02404640	cabbage leaf (Brassica oleracea var. capitata, SAMN02404640)	68,502,320	80%	21%	170,575
SAMN02404641	cabbage seedling (Brassica oleracea var. capitata, SAMN02404641)	69,249,414	81%	22%	183,293
SAMN02404642	cabbage flower (Brassica oleracea var. capitata, SAMN02404642)	69,724,420	80%	22%	179,020
SAMN02404643	9-day cabbage seedling (Brassica oleracea var. capitata, SAMN02404643)	67,601,934	81%	22%	184,145
SAMN02404644	calcium-limited cabbage seedling (Brassica oleracea var. capitata, SAMN02404644)	57,379,970	80%	20%	179,990
SAMN02443788	Cotyledon (Brassica oleracea var. italica, post germination day 7, SAMN02443788)	18,410,080	50%	28%	145,786
SAMN02443789	Cotyledon (Brassica oleracea var. italica, post germination day 11, SAMN02443789)	19,783,640	62%	34%	150,878
SAMN02443790	Germinated Seed (Brassica oleracea var. italica, post germination day 0, SAMN02443790)	19,508,978	61%	34%	155,501
SAMN02443791	Cotyledon (Brassica oleracea var. italica, post germination day 3, SAMN02443791)	14,905,728	52%	31%	143,785
SAMN02443794	Euphylla (Brassica oleracea var. italica, post germination day 11, SAMN02443794)	12,497,526	48%	26%	141,414
SAMN02570439	cotyledons (Brassica oleracea var. oleracea, SAMN02570439)	54,555,054	87%	24%	167,502
SAMN02570440	true leaves (Brassica oleracea var. oleracea, SAMN02570440)	51,073,630	87%	25%	171,726
SAMN02715804	Leaf, Root, Flower, Pods (Brassica oleracea var. oleracea, missing, SAMN02715804)	124,426,036	87%	24%	198,809
SAMN03733606	Floral organ (Brassica oleracea, SAMN03733606)	52,950,290	91%	23%	181,959
SAMN03733607	Floral organ (Brassica oleracea, SAMN03733607)	51,855,184	91%	22%	183,583

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
ERR049710	ERX026776	ERP000954	SAMEA1094410	20,148,700	84%	9%
ERR049711	ERX026778	ERP000954	SAMEA1094410	29,120,065	85%	9%
ERR117405	ERX093679	ERP001385	SAMEA1468342	226,700,688	90%	27%
SRR630922	SRX209691	SRP017530	SAMN01831063	32,646,954	60%	15%
SRR630923	SRX209692	SRP017530	SAMN01831064	40,411,902	88%	20%
SRR630924	SRX209693	SRP017530	SAMN01831065	42,603,002	87%	19%
SRR630925	SRX209694	SRP017530	SAMN01831066	35,995,912	87%	16%
SRR630926	SRX209695	SRP017530	SAMN01831067	56,537,648	88%	20%
SRR630927	SRX209696	SRP017530	SAMN01831068	47,518,902	86%	18%
SRR630928	SRX209697	SRP017530	SAMN01831069	42,221,896	87%	20%
SRR955313	SRX337860	SRP029141	SAMN02324429	132,126,780	87%	25%
SRR955314	SRX337860	SRP029141	SAMN02324429	122,159,324	86%	25%
SRR955709	SRX338064	SRP029176	SAMN02324677	127,522	81%	43%
SRR955710	SRX338064	SRP029176	SAMN02324677	92,255	82%	51%
SRR1010133	SRX363386	SRP030771	SAMN02371508	41,857,488	87%	21%
SRR1557702	SRX686509	SRP030771	SAMN02371508	43,000,146	87%	25%
SRR1557703	SRX686510	SRP030771	SAMN02371508	40,024,608	87%	23%
SRR1557704	SRX686511	SRP030771	SAMN02371508	41,751,960	87%	25%
SRR1030667	SRX375471	SRP032830	SAMN02399969	68,996,928	80%	21%
SRR1032050	SRX378869	SRP032830	SAMN02404640	68,502,320	80%	21%
SRR1032097	SRX378870	SRP032830	SAMN02404641	69,249,414	81%	22%
SRR1032052	SRX378872	SRP032830	SAMN02404642	69,724,420	80%	22%
SRR1032051	SRX378871	SRP032830	SAMN02404643	67,601,934	81%	22%
SRR1032049	SRX378868	SRP032830	SAMN02404644	57,379,970	80%	20%
SRR1049402	SRX391680	SRP034015	SAMN02443788	18,410,080	50%	28%
SRR1049403	SRX391681	SRP034015	SAMN02443789	19,783,640	62%	34%
SRR1049400	SRX391678	SRP034015	SAMN02443790	19,508,978	61%	34%
SRR1049401	SRX391679	SRP034015	SAMN02443791	14,905,728	52%	31%
SRR1049404	SRX391682	SRP034015	SAMN02443794	12,497,526	48%	26%
SRR1104809	SRX423920	SRP035213	SAMN02570439	19,857,084	87%	24%
SRR1104810	SRX423921	SRP035213	SAMN02570439	17,873,288	87%	24%
SRR1104811	SRX423922	SRP035213	SAMN02570439	16,824,682	87%	24%
SRR1104812	SRX423923	SRP035213	SAMN02570440	11,540,434	86%	24%
SRR1104813	SRX423924	SRP035213	SAMN02570440	22,653,664	88%	25%
SRR1104814	SRX423925	SRP035213	SAMN02570440	16,879,532	87%	25%
SRR1213099	SRX507317	SRP040796	SAMN02715804	74,612,842	84%	23%
SRR1213100	SRX507318	SRP040796	SAMN02715804	16,629,870	93%	27%
SRR1213101	SRX507319	SRP040796	SAMN02715804	16,346,484	91%	24%
SRR1213102	SRX507320	SRP040796	SAMN02715804	16,836,840	93%	26%
SRR2039557	SRX1037948	SRP058688	SAMN03733606	52,950,290	91%	23%
SRR2039564	SRX1037957	SRP058688	SAMN03733607	51,855,184	91%	22%
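
The per-sample read counts are the sums of the per-run read counts for that sample. The sketch below illustrates this for two samples, using run rows copied from the table above; the data structure is only an assumption made for the example.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the per-run table.
runs = [
    ("ERR049710", "SAMEA1094410", 20_148_700),
    ("ERR049711", "SAMEA1094410", 29_120_065),
    ("SRR955709", "SAMN02324677", 127_522),
    ("SRR955710", "SAMN02324677", 92_255),
]

# Sum the reads of all runs belonging to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, total in reads_per_sample.items():
    print(sample, f"{total:,}")
```

The totals match the per-sample table: 49,268,765 reads for SAMEA1094410 and 219,777 reads for SAMN02324677.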

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Arabidopsis thaliana known RefSeq (NP_)	35,173	32,798 (93.25%)	32,798 (93.25%)	72.55%	81.60%
Brassica GenBank	3,067	2,851 (92.96%)	2,851 (92.96%)	76.21%	85.64%
Brassica known RefSeq (NP_)	203	203 (100.00%)	203 (100.00%)	77.56%	85.94%
Brassica rapa high-quality model RefSeq (XP_)	22,479	22,383 (99.57%)	22,383 (99.57%)	76.10%	85.07%
Same-species GenBank	601	544 (90.52%)	544 (90.52%)	74.17%	86.81%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
