NCBI Eutrema salsugineum Annotation Release 100

The RefSeq genome records for Eutrema salsugineum were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Eutrema salsugineum Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Feb 22 2018
Date of submission of annotation to the public databases: Feb 26 2018
Software version: 8.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Eutsalg1_0	GCF_000478725.1	Joint Genome Institute	11-05-2013	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Eutsalg1_0
Genes and pseudogenes	33,009
protein-coding	26,943
non-coding	4,453
transcribed pseudogenes	1
non-transcribed pseudogenes	1,612
genes with variants	4,755
immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	33,637
fully-supported	28,399
with > 5% ab initio	4,106
partial	117
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	33,637
non-coding RNAs	5,535
fully-supported	1,701
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	5,067
pseudo transcripts	1
fully-supported	1
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1
CDSs	33,637
fully-supported	28,399
with > 5% ab initio	4,180
partial	117
with major correction(s)	1,119
known RefSeq (NP_)	0
model RefSeq (XP_)	33,637

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	31,396	2,168	1,835	65	134,264
All transcripts	39,172	1,493	1,317	65	16,425
mRNA	33,637	1,649	1,431	87	16,425
misc_RNA	716	1,905	1,626	192	12,952
tRNA	468	74	73	71	87
lncRNA	985	1,225	1,012	150	5,343
snoRNA	3,087	107	107	65	256
snRNA	64	145	145	98	204
rRNA	215	227	119	117	3,386
Single-exon transcripts	5,100	1,120	959	114	6,863
coding transcripts (NM_/XM_)	5,100	1,120	959	114	6,863
CDSs	33,637	1,347	1,137	87	16,152
Exons	158,848	289	161	1	8,092
in coding transcripts (NM_/XM_)	154,938	289	160	1	8,092
in non-coding transcripts (NR_/XR_)	6,722	255	149	6	5,966
Introns	126,998	216	106	30	83,305
in coding transcripts (NM_/XM_)	124,422	212	106	30	83,305
in non-coding transcripts (NR_/XR_)	5,334	293	118	30	20,570

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.25	1	1	23
Number of exons per transcript	5.62	4	1	79
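
The counts in the tables above can be cross-checked against each other. The sketch below is a minimal consistency check in Python; the variable names are illustrative, and the relationships it tests (for example, that "All transcripts" is the sum of mRNAs and non-coding RNAs) are inferred from the published numbers rather than stated by the report.

```python
# Counts copied from the "Feature counts" and "Feature lengths" tables.
protein_coding = 26_943
non_coding = 4_453
transcribed_pseudogenes = 1
non_transcribed_pseudogenes = 1_612
mrnas = 33_637
non_coding_rnas = 5_535

# Transcript counts by type from the "Feature lengths" table.
transcripts_by_type = {
    "mRNA": 33_637,
    "misc_RNA": 716,
    "tRNA": 468,
    "lncRNA": 985,
    "snoRNA": 3_087,
    "snRNA": 64,
    "rRNA": 215,
}

# Assumed relationships, inferred from the numbers themselves.
genes_excluding_pseudogenes = protein_coding + non_coding
genes_and_pseudogenes = (
    genes_excluding_pseudogenes
    + transcribed_pseudogenes
    + non_transcribed_pseudogenes
)
all_transcripts = mrnas + non_coding_rnas
mean_transcripts_per_gene = all_transcripts / genes_excluding_pseudogenes

print(genes_excluding_pseudogenes)           # 31396
print(genes_and_pseudogenes)                 # 33009
print(all_transcripts)                       # 39172
print(sum(transcripts_by_type.values()))     # 39172
print(round(mean_transcripts_per_gene, 2))   # 1.25
```

Each printed value matches the corresponding table entry: 31,396 genes (pseudogenes excluded), 33,009 genes and pseudogenes, 39,172 transcripts, and a mean of 1.25 transcripts per gene.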

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 26,943 coding genes, 25,256 genes (93.7%) had a protein with an alignment covering 50% or more of the query and 19,893 (73.8%) had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
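
The two coverage definitions can be illustrated with a short Python sketch. It assumes a single contiguous aligned span given in 1-based inclusive coordinates; the function name and the example coordinates are hypothetical, and the pipeline's own calculation (for instance, how gapped or multi-segment alignments are handled) may differ.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in one alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("aligned span must lie within the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-390 of a 400-residue annotated protein
# (query) aligned to residues 1-380 of a 500-residue target protein.
query_coverage = coverage_percent(11, 390, 400)
target_coverage = coverage_percent(1, 380, 500)

print(query_coverage)   # 95.0
print(target_coverage)  # 76.0
```

In this made-up case the gene would count toward both the 50% and the 95% query-coverage thresholds, even though only 76% of the target is covered.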

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Eutsalg1_0	GCF_000478725.1	1.89%	30.91%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	189	189 (100.00%)	139 (73.54%)	99.35%	99.78%
Same-species EST	6,537	6,283 (96.11%)	6,063 (92.75%)	99.23%	99.61%
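
The percentages in the table above are consistent with both columns being expressed relative to the number of sequences retrieved from Entrez, not relative to the number aligned. The Python sketch below reproduces them under that assumption; the helper name is illustrative.

```python
# (source, retrieved, aligned by Splign, passed to Gnomon), copied from the table.
rows = [
    ("Same-species GenBank", 189, 189, 139),
    ("Same-species EST", 6_537, 6_283, 6_063),
]


def percent_of_retrieved(count: int, retrieved: int) -> float:
    """Share of the retrieved sequences, in percent, rounded to two decimals."""
    return round(100 * count / retrieved, 2)


for source, retrieved, aligned, passed in rows:
    print(
        f"{source}: aligned {percent_of_retrieved(aligned, retrieved):.2f}%, "
        f"passed to Gnomon {percent_of_retrieved(passed, retrieved):.2f}%"
    )
```

For the EST row, dividing the sequences passed to Gnomon by the sequences aligned would give about 96.5% instead of the reported 92.75%, which is why the sketch uses the retrieved count as the denominator.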

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	270,208,863	73%	23%	127,902
SAMN01766537	NA	Thellungiella transcriptome normalized (Eutrema salsugineum, SAMN01766537)	400,631	76%	48%	31,000
SAMN01766538	NA	Thellungiella transcriptome non-normalized (Eutrema salsugineum, SAMN01766538)	813,901	84%	62%	45,007
SAMN02297985	23984645	rosette leaf tissue (Eutrema salsugineum, SAMN02297985)	662,903	84%	62%	76,993
SAMN02297986	23984645	rosette leaf tissue (Eutrema salsugineum, SAMN02297986)	747,765	76%	58%	76,944
SAMN02297987	23984645	rosette leaf tissue (Eutrema salsugineum, SAMN02297987)	727,985	86%	65%	78,867
SAMN02297988	23984645	cauline leaf tissue (Eutrema salsugineum, SAMN02297988)	1,434,097	57%	71%	79,592
SAMN02297989	23984645	rosette leaf tissue (Eutrema salsugineum, SAMN02297989)	1,343,049	73%	66%	82,897
SAMN02297990	23984645	cauline leaf tissue (Eutrema salsugineum, SAMN02297990)	1,531,963	26%	73%	63,698
SAMN02297991	23984645	rosette leaf tissue (Eutrema salsugineum, SAMN02297991)	1,508,989	79%	67%	91,758
SAMN02297992	23984645	cauline leaf tissue (Eutrema salsugineum, SAMN02297992)	1,355,073	68%	63%	77,029
SAMN02297993	23984645	rosette leaf tissue (Eutrema salsugineum, SAMN02297993)	1,533,823	83%	56%	81,108
SAMN02297994	23984645	cauline leaf tissue (Eutrema salsugineum, SAMN02297994)	265,070	83%	54%	49,335
SAMN03097968	NA	rosette leaves (Eutrema salsugineum, 10 weeks, SAMN03097968)	33,503,621	79%	10%	94,479
SAMN03097969	NA	rosette leaves (Eutrema salsugineum, 10 weeks, SAMN03097969)	45,435,078	46%	8%	89,188
SAMN03097970	NA	rosette leaves (Eutrema salsugineum, 10 weeks, SAMN03097970)	33,356,795	81%	9%	91,927
SAMN03097971	NA	rosette leaves (Eutrema salsugineum, 10 weeks, SAMN03097971)	35,979,782	69%	10%	96,124
SAMN04276568	27457936	leaf (Eutrema salsugineum, SAMN04276568)	17,234,338	62%	38%	104,914
SAMN04276616	27457936	leaf (Eutrema salsugineum, SAMN04276616)	18,961,606	61%	35%	101,045
SAMN06309417	NA	stem leaves (Eutrema salsugineum, SAMN06309417)	73,412,394	90%	35%	120,926

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR597771	SRX197603	SRP016562	SAMN01766537	400,631	76%	48%
SRR597840	SRX197604	SRP016562	SAMN01766538	813,901	84%	62%
SRR945443	SRX329494	SRP028337	SAMN02297985	662,903	84%	62%
SRR945444	SRX329495	SRP028337	SAMN02297986	747,765	76%	58%
SRR945446	SRX329497	SRP028337	SAMN02297987	727,985	86%	65%
SRR1004972	SRX329500	SRP028337	SAMN02297988	706,797	37%	76%
SRR1004973	SRX329500	SRP028337	SAMN02297988	727,300	76%	69%
SRR945445	SRX329496	SRP028337	SAMN02297989	1,343,049	73%	66%
SRR1004970	SRX329499	SRP028337	SAMN02297990	760,012	25%	73%
SRR1004971	SRX329499	SRP028337	SAMN02297990	771,951	26%	73%
SRR1004966	SRX329493	SRP028337	SAMN02297991	743,412	68%	71%
SRR1004967	SRX329493	SRP028337	SAMN02297991	765,577	90%	64%
SRR1004974	SRX329501	SRP028337	SAMN02297992	668,625	67%	63%
SRR1004975	SRX329501	SRP028337	SAMN02297992	686,448	68%	62%
SRR1004968	SRX329498	SRP028337	SAMN02297993	772,858	83%	56%
SRR1004969	SRX329498	SRP028337	SAMN02297993	760,965	83%	56%
SRR945451	SRX329502	SRP028337	SAMN02297994	265,070	83%	54%
SRR1617535	SRX726532	SRP048695	SAMN03097968	33,503,621	79%	10%
SRR1617534	SRX737136	SRP048695	SAMN03097969	45,435,078	46%	8%
SRR1617532	SRX737135	SRP048695	SAMN03097970	33,356,795	81%	9%
SRR1617533	SRX737134	SRP048695	SAMN03097971	35,979,782	69%	10%
SRR2922650	SRX1436242	SRP066358	SAMN04276568	4,489,964	63%	38%
SRR2922651	SRX1436242	SRP066358	SAMN04276568	4,386,902	63%	38%
SRR2922652	SRX1436242	SRP066358	SAMN04276568	4,293,912	61%	38%
SRR2922653	SRX1436242	SRP066358	SAMN04276568	4,063,560	61%	38%
SRR2922646	SRX1436241	SRP066358	SAMN04276616	4,905,066	62%	35%
SRR2922647	SRX1436241	SRP066358	SAMN04276616	4,873,871	62%	35%
SRR2922648	SRX1436241	SRP066358	SAMN04276616	4,673,386	60%	35%
SRR2922649	SRX1436241	SRP066358	SAMN04276616	4,509,283	60%	35%
SRR5236359	SRX2543391	SRP099021	SAMN06309417	36,373,940	91%	38%
SRR5236972	SRX2543391	SRP099021	SAMN06309417	37,038,454	89%	32%
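
The per-sample statistics appear to be aggregates of the per-run statistics: read counts add up, and the percentage of aligned reads matches a read-weighted average of the run values. The Python sketch below illustrates this for two samples. The aggregation rule is an assumption inferred from the numbers, and because the run percentages are already rounded, the result is only approximate.

```python
from collections import defaultdict

# (run, sample, number of reads, percent aligned reads), copied from the by-run table.
runs = [
    ("SRR1004972", "SAMN02297988", 706_797, 37),
    ("SRR1004973", "SAMN02297988", 727_300, 76),
    ("SRR1004966", "SAMN02297991", 743_412, 68),
    ("SRR1004967", "SAMN02297991", 765_577, 90),
]


def aggregate_by_sample(run_rows):
    """Sum reads per sample and compute a read-weighted percent aligned."""
    totals = defaultdict(lambda: [0, 0.0])  # sample -> [reads, estimated aligned reads]
    for _run, sample, reads, percent_aligned in run_rows:
        totals[sample][0] += reads
        totals[sample][1] += reads * percent_aligned / 100
    return {
        sample: (reads, round(100 * aligned / reads))
        for sample, (reads, aligned) in totals.items()
    }


for sample, (reads, percent_aligned) in aggregate_by_sample(runs).items():
    print(f"{sample}: {reads:,} reads, {percent_aligned}% aligned")
```

The results, 1,434,097 reads at 57% for SAMN02297988 and 1,508,989 reads at 79% for SAMN02297991, agree with the by-sample table.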

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Brassicaceae GenBank	7,096	6,898 (97.21%)	6,898 (97.21%)	74.89%	86.27%
Brassicaceae known RefSeq (NP_)	857	850 (99.18%)	850 (99.18%)	73.26%	83.96%
Arabidopsis thaliana GenBank	53,518	51,958 (97.09%)	51,958 (97.09%)	75.33%	86.48%
Arabidopsis thaliana known RefSeq (NP_)	48,148	45,858 (95.24%)	45,858 (95.24%)	72.96%	83.63%
Brassica rapa high-quality model RefSeq (XP_)	24,242	23,920 (98.67%)	23,920 (98.67%)	73.66%	85.01%
Same-species GenBank	27	27 (100.00%)	27 (100.00%)	77.92%	88.92%
Arabidopsis lyrata subsp. lyrata high-quality model RefSeq (XP_)	17,385	17,118 (98.46%)	17,118 (98.46%)	73.36%	84.51%

Comparison of the current and previous annotations

The annotation produced for this release (100) was compared to the annotation in the previous release for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	Eutsalg1_0 (Current) to Eutsalg1_0 (Previous)
Identical	3%
Minor changes	55%
Major changes	15%
New	27%
Deprecated	7%
Other	1%
Download the report	tabular, Genome Workbench

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
