NCBI-generated RNA-seq count data (BETA)

Why NCBI generates RNA-seq count data
Taxonomic Scope of NCBI-generated RNA-seq count data
What NCBI-generated RNA-seq count files are available
How NCBI generates RNA-seq count data
How to locate and access NCBI-generated RNA-seq count data
Limitations and caveats

Why NCBI generates RNA-seq count data

A major barrier to fully exploiting and reanalyzing the massive volumes of public RNA-seq data archived by SRA is the cost and effort required to consistently process raw RNA-seq reads into concise formats that summarize the expression results. To help address this need, the NCBI SRA and GEO teams have built a pipeline that precomputes RNA-seq gene expression counts and delivers them as count matrices that may be incorporated into commonly used differential expression analysis and visualization software.

Taxonomic Scope of NCBI-generated RNA-seq count data

NCBI generates RNA-seq count data only for human and mouse RNA-seq runs submitted to GEO. The human RNA-seq count data are currently available, and the mouse RNA-seq count data are expected to become available in summer 2024.

What NCBI-generated RNA-seq count files are available

Counts have been generated for all historical human RNA-seq runs submitted to GEO. For each GEO Series, the following files are generated:

Series RNA-seq raw counts matrix

Series RNA-seq raw counts matrices are tab-delimited text files that may be suitable as input for differential expression analysis tools such as DESeq2, edgeR or limma-voom. The first column in the matrix contains unique Gene IDs that match the Gene ID column in the accompanying Human gene annotation table (see below). The remaining columns contain raw counts for each GEO Sample in the Series.
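
The layout described above can be read with a few lines of Python. This is a minimal sketch using only the standard library. It assumes the file has a header row whose first cell labels the Gene ID column and whose remaining cells are GEO Sample accessions. The header label and the sample accessions in the example are made up for illustration.

```python
import csv
import io


def read_counts_matrix(handle):
    """Parse a tab-delimited counts matrix.

    Returns (sample_ids, counts), where counts maps each Gene ID (as a
    string) to a list of integer counts, one per sample column.
    Assumes a header row is present (an assumption, not documented above).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # first cell labels the Gene ID column
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [int(float(value)) for value in row[1:]]
    return sample_ids, counts


# Tiny made-up matrix standing in for a real Series file.
example = "GeneID\tGSM0000001\tGSM0000002\n1\t10\t12\n2\t0\t3\n"
sample_ids, counts = read_counts_matrix(io.StringIO(example))
print(sample_ids, counts)
```

For a downloaded .tsv.gz file, the same function can be given the handle returned by gzip.open(path, "rt").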

Series RNA-seq normalized counts matrix

Series RNA-seq normalized counts matrices are tab-delimited text files that may be suitable as input for qualitative analysis and visualization of gene expression abundance. The first column in the matrix contains unique Gene IDs that match the Gene ID column in the accompanying Human gene annotation table (see below). The remaining columns contain normalized counts for each GEO Sample in the Series.

These counts are normalized according to sequencing depth and gene length.

The FPKM counts represent Fragments Per Kilobase of transcript per Million mapped fragments (for paired-end sequencing data) or Reads Per Kilobase of transcript per Million mapped reads (RPKM, for single-end data). Note that the file is named FPKM in both cases. The TPM counts represent Transcripts Per Million.

For more information about normalized counts, see FPKM, RPKM and TPM, and be aware of misuses.
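
To make the two normalizations concrete, the sketch below applies the textbook FPKM and TPM definitions to a toy sample. The gene lengths and counts are made up, and the exact gene-length definition and library-size total used by the NCBI pipeline are not described here, so the functions illustrate the general formulas rather than the pipeline's own code.

```python
def fpkm(counts, lengths_bp):
    """FPKM/RPKM: count * 1e9 / (gene length in bp * total counts in sample)."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM: length-normalized rates rescaled so that they sum to one million."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate / scale * 1e6 for rate in rates]


# Toy sample: three genes with made-up raw counts and lengths.
raw_counts = [10, 20, 70]
gene_lengths = [1000, 2000, 1000]

fpkm_values = fpkm(raw_counts, gene_lengths)
tpm_values = tpm(raw_counts, gene_lengths)
print(fpkm_values)
print(tpm_values)
```

In this toy sample the first two genes receive the same FPKM because the second gene has twice the count but also twice the length. TPM values always sum to one million within a sample, whereas FPKM values do not have a fixed sum.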

Human gene annotation table

The Human gene annotation table is a tab-delimited text file intended to be used in conjunction with the raw and normalized count matrices. The first column contains unique Gene IDs that match the Gene ID column in the count files. The rest of the columns contain gene-level annotation, including gene name, gene symbols, description, gene type, chromosome location and gene ontology terms. This gene annotation is necessary for biological interpretation and analyses of RNA-seq count expression data.
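
Because both file types share the Gene ID column, attaching annotation to counts is a simple lookup. The sketch below is illustrative only: the column names GeneID and Symbol are assumptions about the annotation table header, and the two-row tables are made up (Gene IDs 7157 and 672 are the NCBI Gene IDs for TP53 and BRCA1).

```python
import csv
import io


def load_symbols(handle, id_col="GeneID", symbol_col="Symbol"):
    """Map Gene ID -> gene symbol from a tab-delimited annotation table.

    The column names are assumptions; adjust them to the actual header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[symbol_col] for row in reader}


# Made-up annotation table and counts keyed by Gene ID.
annotation_text = "GeneID\tSymbol\n7157\tTP53\n672\tBRCA1\n"
counts_by_gene_id = {"7157": [120, 98], "672": [15, 22]}

symbols = load_symbols(io.StringIO(annotation_text))
# Fall back to the Gene ID itself when no symbol is found.
counts_by_symbol = {
    symbols.get(gene_id, gene_id): values
    for gene_id, values in counts_by_gene_id.items()
}
print(counts_by_symbol)
```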

How NCBI generates RNA-seq count data

Briefly, SRA runs where the organism is Homo sapiens and type is Transcriptomic are aligned to genome assembly GCA_000001405.15 using HISAT2. Runs with an alignment rate above 50% are further processed with Subread featureCounts, which outputs a raw count file for each run. For Human data, the Homo sapiens Annotation Release 109.20190905 was used for gene annotation. GEO further processes these SRR raw count files into GEO Series raw counts matrices. Data derived from single-cell samples are skipped. In cases where there is more than one SRA run per GEO Sample, the raw counts are summed. Values in the raw count matrices are rounded so that they are compatible input for common differential expression analysis software. Using the raw counts as input, GEO then computes FPKM (RPKM) and TPM normalized values.
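
The post-alignment steps described above (filter runs by alignment rate, sum runs belonging to the same GEO Sample, round the result) can be sketched as follows. This is not the NCBI implementation: the record layout, the accessions and the use of Python's built-in round() are assumptions made for illustration.

```python
from collections import defaultdict

MIN_ALIGNMENT_RATE = 0.5  # runs must align at a rate above 50%


def sample_counts(runs):
    """Combine per-run counts into rounded per-sample counts.

    Each run is a dict with keys 'sample', 'alignment_rate' and 'counts'
    (a mapping of Gene ID -> count). This layout is hypothetical.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for run in runs:
        if run["alignment_rate"] <= MIN_ALIGNMENT_RATE:
            continue  # run fails the alignment-rate check
        for gene_id, count in run["counts"].items():
            totals[run["sample"]][gene_id] += count
    # Rounding mode is an assumption; the text only says values are rounded.
    return {
        sample: {gene_id: round(value) for gene_id, value in genes.items()}
        for sample, genes in totals.items()
    }


# Made-up runs: two for sample GSM_A, one failing run for sample GSM_B.
runs = [
    {"sample": "GSM_A", "alignment_rate": 0.9, "counts": {"1": 10.0, "2": 3.4}},
    {"sample": "GSM_A", "alignment_rate": 0.8, "counts": {"1": 5.0, "2": 1.2}},
    {"sample": "GSM_B", "alignment_rate": 0.3, "counts": {"1": 7.0, "2": 2.0}},
]
combined = sample_counts(runs)
print(combined)
```

In this example GSM_B has no entry in the result because its only run fails the alignment-rate check, which mirrors how a sample can be absent from a Series matrix.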

The pipeline has been applied to all historical human RNA-seq runs and continues to process new RNA-seq data as it is submitted to SRA, with a turnaround time of approximately one week. If the submitter makes edits to their submitted data (like replacing a fastq file), the pipeline will be re-run on the new data.

How to locate and access NCBI-generated RNA-seq count data

All GEO studies with NCBI-generated RNA-seq counts can be identified by searching GEO DataSets with "rnaseq counts"[Filter] and following the 'Download Data' link. See an example Download page.
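
The same search can in principle be run programmatically. The sketch below only builds a query URL for the Entrez E-utilities ESearch endpoint against the GEO DataSets database (db=gds). That the filter term behaves identically through E-utilities as it does on the website is an assumption, and the function name and parameters are hypothetical.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def counts_search_url(extra_term="", retmax=20):
    """Build an ESearch URL for GEO DataSets records with NCBI-generated counts."""
    term = '"rnaseq counts"[Filter]'
    if extra_term:
        term = f"{term} AND {extra_term}"  # narrow the search, e.g. by keyword
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


search_url = counts_search_url()
print(search_url)
```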

Alternatively, using the Series accession numbers retrieved with the above search, it is possible to construct links for direct file download, e.g.:
https://www.ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE164073&format=file&file=GSE164073_raw_counts_GRCh38.p13_NCBI.tsv.gz
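
A direct-download link of this form can be assembled from a Series accession. The sketch below infers the file name pattern from the single example above, so the pattern may not hold for every Series or for the normalized count files, and the function name is hypothetical.

```python
from urllib.parse import urlencode

GEO_DOWNLOAD = "https://www.ncbi.nlm.nih.gov/geo/download/"


def raw_counts_url(series_accession):
    """Build a raw-counts download URL for a GEO Series accession.

    The file name pattern is inferred from one documented example.
    """
    file_name = f"{series_accession}_raw_counts_GRCh38.p13_NCBI.tsv.gz"
    params = {
        "type": "rnaseq_counts",
        "acc": series_accession,
        "format": "file",
        "file": file_name,
    }
    return f"{GEO_DOWNLOAD}?{urlencode(params)}"


download_url = raw_counts_url("GSE164073")
print(download_url)
```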

Limitations and caveats

The SRA and GEO databases archive thousands of original RNA-seq studies submitted by the scientific community. These studies represent a large diversity of experimental types and designs, and contain data that are generated using a wide variety of library preparation methods and processing software. The NCBI pipeline runs on almost any RNA-seq dataset, with minimal checks on quality and suitability. Therefore, the user must be aware of the following limitations and caveats.

Counts may not match publication: The counts generated by the NCBI pipeline may not match results in the accompanying publication. RNA-seq data can be processed using many different software packages, parameter settings and filters. The NCBI pipeline represents just one of many possible processing approaches. It is likely the original submitter used different procedures to process their data, which can lead to somewhat different expression results from those generated by the NCBI pipeline. (Note that in the case of data submitted via GEO, processed data files matching the results in the corresponding publication are usually supplied by the submitter and are available on the GEO records.)
Minimal quality checks: The only requirements for a run to be included in the NCBI pipeline are that it is of type 'transcriptomic' and that it has a genome alignment rate over 50%.
Check that samples are comparable: Submitters often deposit more than one type of data (e.g., RNA-seq and RIP-seq) in the same study, meaning that the RNA counts, even within a matrix, are not directly comparable. Other times, although samples are of the same type, they still may not be intended for comparison. Review the original records to determine if all the samples within a study are intended to be compared directly.
Caution with cross-study comparisons: Despite all NCBI count data being generated by the same pipeline, particular caution should be taken if attempting to compare counts from different studies. The NCBI pipeline makes no attempt to correct for laboratory biases or other confounding factors.
Normalized matrix files may not be sufficiently normalized: FPKM (RPKM) and TPM counts should not be used for quantitative comparisons across samples when the total RNA content and its distribution differ substantially between them. As discussed by Zhao et al., "A common misconception is that RPKM and TPM values are already normalized, and thus should be comparable across samples or RNA-seq projects. However, RPKM and TPM represent the relative abundance of a transcript among a population of sequenced transcripts, and therefore depend on the composition of the RNA population in a sample. Quite often, it is reasonable to assume that total RNA concentration and distributions are very close across compared samples. Nevertheless, the sequenced RNA repertoires may differ significantly under different experimental conditions and/or across sequencing protocols; thus, the proportion of gene expression is not directly comparable in such cases".
Missing samples: Count data for a sample may be missing because its runs did not pass the 50% alignment rate, because the sample was derived from single cells (which the pipeline skips), or because processing failed for a technical reason.
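
The composition effect behind the normalization caveat above can be shown with a toy calculation. In the sketch below, all counts and gene lengths are made up, and the TPM function is the textbook definition rather than the pipeline's code. The first gene has the same raw count in both samples, yet its TPM drops sharply in the second sample because a different gene dominates the RNA population.

```python
def tpm(counts, lengths_bp):
    """Textbook TPM: length-normalized rates rescaled to sum to one million."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate / scale * 1e6 for rate in rates]


gene_lengths = [1000, 1000, 1000]

# Sample A: three genes with equal counts.
tpm_a = tpm([100, 100, 100], gene_lengths)
# Sample B: the third gene is strongly induced; the others are unchanged.
tpm_b = tpm([100, 100, 800], gene_lengths)

print(tpm_a[0], tpm_b[0])
```

Here the first gene's TPM falls from roughly 333,333 to roughly 100,000 even though its raw count is identical in both samples, which is why TPM and FPKM values reflect relative rather than absolute abundance.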

If you have any questions, feedback, or requests for future work in this area, please e-mail GEO.