
Protected Data Usage Guide

Contact: sra-tools@ncbi.nlm.nih.gov

This guide outlines how to use the SRA Toolkit to access protected data from dbGaP. Versions of the SRA Toolkit newer than v2.10.2 no longer require the toolkit to be configured for use with protected data from dbGaP.
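As a rough aid for the version requirement above, the following sketch compares a dotted version string against v2.10.2. The function name version_newer_than is hypothetical and not part of the SRA Toolkit. The version string is passed in as an argument, because how a given tool reports its version is not covered by this guide.

```shell
# Hypothetical helper (not part of the SRA Toolkit).
# usage: version_newer_than CANDIDATE BASELINE
# Succeeds (exit 0) only if CANDIDATE is strictly newer than BASELINE.
version_newer_than() {
    # Drop an optional leading "v", so "v2.10.2" and "2.10.2" compare equal.
    awk -v a="${1#v}" -v b="${2#v}" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        # Compare field by field as numbers, treating missing fields as 0.
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 1   # equal versions are not "newer"
    }'
}
```

For example, `version_newer_than 2.11.0 v2.10.2 && echo "no configuration step needed"` prints the message, while 2.9.6 or 2.10.2 itself would not pass the check.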

Steps for using the SRA Toolkit to access dbGaP data

  1. Users must have an up-to-date version of the SRA Toolkit installed.
  2. Users who wish to access controlled-access data must first apply for approval. Please review the process on the dbGaP Authorized Access page.
  3. Once access to a project has been granted, the PI may log in and click the "get dbGaP repository key" link next to the project to download the repository key. This file should be closely guarded.
  4. Specify the repository key in your SRA Toolkit commands using the --ngc flag. For users who do not yet have an approved project, the test key prj_phs710EA_test.ngc is available for accessing a copy of 1000 Genomes data from NCBI. Downloading this key allows users to test their toolkit configuration on encrypted data that is consented for public access.
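One possible way to guard the repository key on a shared Unix system is to make the file readable and writable by its owner only. This is a sketch of a local precaution, not a dbGaP or SRA Toolkit requirement, and the function name secure_ngc_key is hypothetical.

```shell
# Hypothetical helper (not part of the SRA Toolkit).
# usage: secure_ngc_key KEY_FILE
# Restricts the key file to owner read/write; fails if the file is missing.
secure_ngc_key() {
    ngc_key_path=$1
    if [ ! -f "$ngc_key_path" ]; then
        echo "key file not found: $ngc_key_path" >&2
        return 1
    fi
    chmod 600 "$ngc_key_path"
}
```

A call such as `secure_ngc_key prj_phsEXAMPLE.ngc` could be run once after the key is downloaded.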

Accessing Encrypted dbGaP Data

The latest versions of the SRA Toolkit no longer require or allow setting a project’s workspace. However, you will still need your project’s key to decrypt the dbGaP data. To prefetch dbGaP data, run the following command:

$ prefetch --ngc prj_phsEXAMPLE.ngc SRR0000000

This will create a Run file with a name like SRR0000000_dbgap_#####.sra. To decrypt this local copy of the Run, rename the Run file by removing the _dbgap_##### portion of its name

$ mv SRR0000000_dbgap_#####.sra SRR0000000.sra

and provide the NGC key file in your SRA Toolkit command:

$ fasterq-dump --ngc prj_phsEXAMPLE.ngc SRR0000000.sra
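When several prefetched Runs need renaming, the new file name can be computed rather than typed by hand. The sketch below assumes that the ##### placeholder stands for a run of digits; the function name strip_dbgap_suffix is hypothetical and not part of the SRA Toolkit.

```shell
# Hypothetical helper (not part of the SRA Toolkit).
# usage: strip_dbgap_suffix FILE_NAME
# Prints FILE_NAME with a trailing "_dbgap_<digits>.sra" reduced to ".sra".
# Assumption: the ##### part of the name consists only of digits.
# Names that do not match the pattern are printed unchanged.
strip_dbgap_suffix() {
    printf '%s\n' "$1" | sed 's/_dbgap_[0-9][0-9]*\.sra$/.sra/'
}
```

It could be combined with mv, for example `mv "$f" "$(strip_dbgap_suffix "$f")"` for each prefetched file `$f`.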

The SRA Toolkit vdb-decrypt program supports the decryption process for dbGaP phenotype and genotype files.

$ vdb-decrypt --ngc prj_phsEXAMPLE.ngc <encrypted_file>

Users do not need to download the SRA Run (SRR) files first in order to access dbGaP protected sequences. Tools such as fasterq-dump or sam-dump can download all or part of an SRA Run file and its reference files automatically. Please see the usage instructions for the individual Toolkit utilities (e.g., fasterq-dump) for details. Below are some examples that use the test key prj_phs710EA_test.ngc to directly download and convert Runs to fastq or sam/bam format.

Information about the fasterq-dump program and options used in the example can be found on the tool's documentation page.

$ fasterq-dump --ngc prj_phs710EA_test.ngc SRR1219902
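To convert more than one Run with the same key, the command can be wrapped in a loop over a list of accessions. The following is a minimal sketch: the function name dump_runs, the DRY_RUN variable, and the one-accession-per-line list file are all hypothetical conventions, not features of the SRA Toolkit. Only the fasterq-dump --ngc invocation itself comes from this guide.

```shell
# Hypothetical helper (not part of the SRA Toolkit).
# usage: dump_runs KEY.ngc ACCESSION_LIST
# ACCESSION_LIST holds one Run accession per line; blank lines and
# lines starting with "#" are skipped.
# With DRY_RUN=1 the commands are printed instead of executed.
dump_runs() {
    dump_ngc=$1
    dump_list=$2
    while IFS= read -r dump_acc || [ -n "$dump_acc" ]; do
        case $dump_acc in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "fasterq-dump --ngc $dump_ngc $dump_acc"
        else
            # Keep the tool from consuming the list on stdin; stop on first failure.
            fasterq-dump --ngc "$dump_ngc" "$dump_acc" < /dev/null || return 1
        fi
    done < "$dump_list"
}
```

Setting `DRY_RUN=1` before calling `dump_runs prj_phs710EA_test.ngc accessions.txt` shows which commands would be run without downloading anything.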

The sam-dump utility can also be used to directly access aligned data in a dbGaP study without the need to download the data files first.

$ sam-dump --ngc prj_phs710EA_test.ngc --aligned-region 15:28196787-28197287 SRR1219902
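A mistyped region string is easy to miss in a long command line, so it may help to check it before calling sam-dump. The sketch below only covers the name:from-to form used in the example above, and it assumes that the start coordinate should not exceed the end coordinate; that ordering rule is a sanity check chosen for this sketch, not documented sam-dump behavior. The function name check_region is hypothetical.

```shell
# Hypothetical helper (not part of the SRA Toolkit).
# usage: check_region REGION
# Succeeds if REGION looks like name:from-to with numeric from <= to.
check_region() {
    region_spec=$1
    case $region_spec in
        *:*-*) ;;
        *) return 1 ;;
    esac
    region_name=${region_spec%%:*}     # text before the first ":"
    region_range=${region_spec#*:}     # text after the first ":"
    region_from=${region_range%%-*}    # text before the first "-"
    region_to=${region_range#*-}       # text after the first "-"
    [ -n "$region_name" ] || return 1
    [ -n "$region_from" ] || return 1
    [ -n "$region_to" ] || return 1
    # Reject anything other than plain digits in the coordinates.
    case $region_from$region_to in
        *[!0-9]*) return 1 ;;
    esac
    [ "$region_from" -le "$region_to" ]
}
```

For instance, `check_region 15:28196787-28197287 || echo "bad region" >&2` stays silent for the region used in the example above.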

Links and help documents