Protected Data Usage Guide
Contact: sra-tools@ncbi.nlm.nih.gov
This guide outlines how to use the SRA Toolkit to access protected data from dbGaP. SRA Toolkit versions newer than v2.10.2 no longer require the toolkit to be configured for use with protected dbGaP data.
Steps for using the SRA Toolkit to access dbGaP data
- Users must have an up-to-date version of the SRA Toolkit installed.
- Users who wish to access controlled-access data must first apply for approval. Please review the process at the dbGaP Authorized Access page.
- Once granted access to a project, the PI may log in and click the "get dbGaP repository key" link next to the project to download the repository key. This file should be closely guarded.
- Specify the repository key in your SRA Toolkit commands using the --ngc flag. For users who do not yet have an approved project, the test key prj_phs710EA_test.ngc is available for accessing a copy of 1000 Genomes data from NCBI. Downloading this key allows users to test their toolkit setup on encrypted data that is consented for public access.
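The steps above can be sketched as a small pre-flight check. This is a minimal sketch and not part of the toolkit: the function name check_ngc_setup is hypothetical, and restricting the key to owner-only permissions is just one assumed way to keep the file closely guarded.

```shell
# Hypothetical helper: confirm that the toolkit and the repository key are usable.
check_ngc_setup() {
    ngc_key=$1

    # prefetch ships with the SRA Toolkit; if it is missing, the toolkit is
    # probably not installed or not on PATH.
    if ! command -v prefetch >/dev/null 2>&1; then
        echo "SRA Toolkit not found on PATH" >&2
        return 1
    fi

    # The key must exist and be readable before any --ngc command can use it.
    if [ ! -r "$ngc_key" ]; then
        echo "cannot read repository key: $ngc_key" >&2
        return 1
    fi

    # Assumption: owner-only permissions are an acceptable way to guard the key.
    chmod 600 "$ngc_key"
}
```

It could be called as `check_ngc_setup prj_phs710EA_test.ngc` before running any of the toolkit commands.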
Accessing Encrypted dbGaP Data
The latest versions of the SRA Toolkit no longer require or allow setting a project’s workspace. However, you will need your project’s key to decrypt the dbGaP data. To prefetch dbGaP data, run prefetch with the --ngc flag pointing to your project’s key file.
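A minimal sketch of the prefetch step, assuming the key file is passed with --ngc as described in the steps above. The wrapper name prefetch_protected_run is hypothetical, and both arguments are placeholders for your own key file and Run accession.

```shell
# Hypothetical wrapper around prefetch; --ngc supplies the project's repository key.
prefetch_protected_run() {
    ngc_key=$1      # path to the repository key (.ngc file)
    accession=$2    # Run accession from your approved project (placeholder)
    prefetch --ngc "$ngc_key" "$accession"
}
```

For example, `prefetch_protected_run /path/to/project.ngc SRR0000001`, where the path and the accession stand in for real values.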
This creates a Run file with a name like SRR0000001_dbgap_#####.sra. To decrypt this local copy of the Run, rename the file by removing the _dbgap_##### portion, then pass the NGC key file to your SRA Toolkit command with the --ngc flag.
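The rename-then-decrypt step might look like the sketch below. It rests on several assumptions: the ##### portion is taken to be a run of digits, fasterq-dump is used as the example toolkit command (the guide does not say which command to run here), and both function names are hypothetical.

```shell
# Hypothetical helper: drop the _dbgap_##### portion from a Run file name,
# e.g. SRR0000001_dbgap_12345.sra becomes SRR0000001.sra.
# Assumption: ##### is a run of digits.
strip_dbgap_suffix() {
    printf '%s\n' "$1" | sed -E 's/_dbgap_[0-9]+(\.[^./]*)$/\1/'
}

# Hypothetical helper: rename the local Run, then read it with the project key.
decrypt_local_run() {
    ngc_key=$1
    run_file=$2
    plain_name=$(strip_dbgap_suffix "$run_file")

    # Rename only when the suffix was actually present.
    if [ "$plain_name" != "$run_file" ]; then
        mv -- "$run_file" "$plain_name" || return 1
    fi

    # Assumption: fasterq-dump stands in for "your SRA Toolkit command".
    fasterq-dump --ngc "$ngc_key" "$plain_name"
}
```

Keeping the rename in its own function makes it easy to check the resulting file name before any data is touched.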
The SRA Toolkit vdb-decrypt program supports the decryption process for dbGaP phenotype and genotype files.
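As a hedged sketch, decrypting a downloaded phenotype or genotype file could be wrapped as follows. It assumes vdb-decrypt accepts the same --ngc flag as the other toolkit commands; check `vdb-decrypt --help` for your installed version. The function name and both arguments are hypothetical placeholders.

```shell
# Hypothetical wrapper around vdb-decrypt.
# Assumption: vdb-decrypt takes the repository key via --ngc.
decrypt_dbgap_file() {
    ngc_key=$1   # path to the repository key (.ngc file)
    target=$2    # encrypted phenotype/genotype file or directory (placeholder)
    vdb-decrypt --ngc "$ngc_key" "$target"
}
```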
Users do not need to download the SRA Run (SRR) files first to access dbGaP protected sequences. Tools such as fasterq-dump or sam-dump can download all or part of an SRA Run file and its reference files automatically. Please see the usage instructions for the individual Toolkit utilities (e.g., fasterq-dump) for details. The test key prj_phs710EA_test.ngc can be used to try downloading and converting Runs directly to fastq or sam/bam format.
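A minimal sketch of direct conversion to fastq, assuming fasterq-dump is given the key with --ngc and an output directory with --outdir. The function name dump_protected_fastq is hypothetical, and the accession must be a Run covered by the key you supply.

```shell
# Hypothetical wrapper: download and convert a protected Run to fastq in one step.
dump_protected_fastq() {
    ngc_key=$1        # e.g. the test key prj_phs710EA_test.ngc
    accession=$2      # Run accession covered by that key (placeholder)
    out_dir=${3:-.}   # output directory; defaults to the current directory
    fasterq-dump --ngc "$ngc_key" --outdir "$out_dir" "$accession"
}
```

For example, `dump_protected_fastq prj_phs710EA_test.ngc SRR0000001 fastq_out`, with SRR0000001 standing in for a real accession.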
Information about the fasterq-dump program and its options can be found on the tool's documentation page.
The sam-dump utility can also be used to directly access aligned data in a dbGaP study without the need to download the data files first.
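The same pattern can be sketched for aligned data. This assumes sam-dump writes SAM records to standard output and accepts the key via --ngc; the function name dump_protected_sam and all arguments are hypothetical placeholders.

```shell
# Hypothetical wrapper: stream a protected Run's alignments into a SAM file.
dump_protected_sam() {
    ngc_key=$1     # path to the repository key (.ngc file)
    accession=$2   # Run accession covered by that key (placeholder)
    out_file=$3    # destination SAM file
    sam-dump --ngc "$ngc_key" "$accession" > "$out_file"
}
```

Redirecting standard output keeps the sketch independent of any particular output option; the stream could equally be piped into another program for conversion to bam.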