MicroBIGG-E data at Google Cloud Platform

ALPHA RELEASE: This resource is under active development. While we strive to maintain correctness, results may at times be unstable, unavailable, or incorrect. Please contact us by email at pd-help@ncbi.nlm.nih.gov before relying on this data for production analyses.

What data is available on the Google Cloud?
- Pathogen Detection Resources available on the Google Cloud
- Update Frequency
Getting started with BigQuery
Linking to Isolates Browser data in BigQuery
Example searches
Contig sequences
- Example:
  - Get the contig sequence for a contig with a point mutation in a specific assembly
    - First find the contig_url using BigQuery
    - Copy the gzipped contig file using the gsutil utility
Protein sequences
- Example:
  - Get the sequence of a single protein from MicroBIGG-E
    - Find the protein URL using BigQuery
    - Copy the gzipped protein FASTA file using the gsutil utility
  - Download all QUINOLONE resistance genes

What data is available on the Google Cloud?

For a list of all resources see Pathogen Detection Resources at Google Cloud Platform

The Microbial Browser for Identification of Genetic and Genomic Elements (MicroBIGG-E) data is now publicly available in the ncbi-pathogen-detect.pdbrowser.microbigge table at Google BigQuery. The table includes all the fields available in the browser and is searched using Google Standard SQL instead of the SOLR Query Language, which permits programmatic access and more complex queries. MicroBIGG-E at BigQuery also lets you download tables exceeding the 100,000 row limit of the MicroBIGG-E web download. NCBI is piloting this in BigQuery to help users leverage the benefits of elastic scaling and parallel execution of queries. BigQuery has a large collection of client libraries that can be used within your workflow, and you can also interact with it in a web browser as described below.

We also store the contig sequences and protein sequences for MicroBIGG-E hits in Google Storage buckets. See Contig sequences and Protein sequences below for more information.

Pathogen Detection Resources available on the Google Cloud

Update Frequency

The microbigge table at Google Cloud BigQuery is updated daily, so its contents may not agree exactly with those shown in the MicroBIGG-E web browser. If you see unexpected discrepancies, please let us know by emailing pd-help@ncbi.nlm.nih.gov.

Getting started with BigQuery

Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.

Linking to Isolates Browser data in BigQuery

NCBI Pathogen Detection also maintains Isolates Browser data in the BigQuery table ncbi-pathogen-detect.pdbrowser.isolates. The two tables have several fields in common, but we generally recommend joining on the target_acc field. See Isolates Browser Data at Google Cloud Platform for examples of joining the two tables.
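
As a rough illustration of such a join, the sketch below builds a query that matches MicroBIGG-E hits to isolates on target_acc and runs it with the bq command-line tool. The helper names join_query and run_join are hypothetical, and the collection_date column of the isolates table is an assumption not documented on this page; substitute whichever isolate fields you need.

```shell
#!/bin/sh
# Sketch only: join MicroBIGG-E hits to Isolates Browser rows on target_acc.
# Assumption: the isolates table has a collection_date column.

# Print the join query for one element symbol (hypothetical helper).
join_query() {
    symbol="$1"
    cat <<EOF
SELECT mb.target_acc, mb.element_symbol, iso.collection_date
FROM \`ncbi-pathogen-detect.pdbrowser.microbigge\` mb
JOIN \`ncbi-pathogen-detect.pdbrowser.isolates\` iso
    ON iso.target_acc = mb.target_acc
WHERE mb.element_symbol = '${symbol}'
ORDER BY mb.target_acc
EOF
}

# Run the query and print CSV; needs an authenticated Google Cloud CLI.
run_join() {
    bq query --use_legacy_sql=false --format=csv --max_rows 1000 \
        "$(join_query "$1")"
}
```

For example, `join_query blaKPC-2` prints the SQL so it can be pasted into the BigQuery web interface, and `run_join blaKPC-2 > kpc2_isolates.csv` saves the result locally.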

Example searches

Find all carbapenem resistance genes or point mutations in the database

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc

Find all carbapenem resistance genes in the database

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
AND   subtype = 'AMR'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc

Find all AMRFinderPlus results from Salmonella genomes for further analysis

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE taxgroup_name = 'Salmonella enterica'

Find elements on contigs that have both blaKPC-2 and blaTEM-1 genes

SELECT
    mb.contig_acc,
    mb.element_symbol
FROM
    `ncbi-pathogen-detect.pdbrowser.microbigge` mb
    JOIN ( SELECT DISTINCT
            mb1.contig_acc
        FROM
            `ncbi-pathogen-detect.pdbrowser.microbigge` mb1
            JOIN `ncbi-pathogen-detect.pdbrowser.microbigge` mb2 
                ON mb1.element_symbol = 'blaTEM-1'
                    AND mb1.contig_acc = mb2.contig_acc
                    AND mb2.element_symbol = 'blaKPC-2') contigs 
        ON contigs.contig_acc = mb.contig_acc
ORDER BY
    mb.contig_acc,
    mb.start_on_contig

Find the five most common known parC resistance mutations in Pathogen Detection analyzed isolates

SELECT element_symbol, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE STARTS_WITH(element_symbol, 'parC_')
GROUP BY element_symbol
ORDER BY num_found DESC
LIMIT 5

Find the five most common AMR genes associated with quinolone resistance

SELECT element_symbol, subclass, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%QUINOLONE%'
AND   subtype = 'AMR'
GROUP BY element_symbol, subclass
ORDER BY num_found DESC
LIMIT 5

Contig sequences

Contig sequences in gzipped FASTA format are stored and accessible in the Google storage bucket ncbi-pathogen-assemblies and the paths to those contigs are listed in the ncbi-pathogen-detect.pdbrowser.microbigge field contig_url.

These files can be accessed using the gsutil command-line program included with the Google Cloud CLI (Installation instructions) or through the GCP BigQuery web interface. See Getting started with BigQuery for more information on how to use BigQuery.

Example:

Get the contig sequence for a contig with a point mutation in a specific assembly

First find the contig_url using BigQuery

SELECT contig_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';

The results should be:

contig_url
gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz

Copy the gzipped contig file using the `gsutil` utility

Enter the following at a Unix shell prompt to copy the gzipped contig FASTA file to your computer. See the Google docs for more information on the gsutil program.

gsutil cp gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz .
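
Once the file is on your computer, a quick sanity check can confirm that it decompresses and contains sequence data. The sketch below is one possible way to do that with standard Unix tools; the function name fasta_summary is hypothetical and not part of any NCBI or Google tooling.

```shell
#!/bin/sh
# Sketch only: summarize a gzipped FASTA file without unpacking it on disk.
# Prints the number of records and the total sequence length.
fasta_summary() {
    gunzip -c "$1" | awk '
        /^>/ { records++; next }              # header line starts a record
        { gsub(/[ \t\r]/, "")                 # drop stray whitespace
          residues += length($0) }            # count sequence characters
        END { printf "records=%d residues=%d\n", records, residues }'
}
```

Running `fasta_summary NZ_CP008827.1.fna.gz` on the file copied above prints one line of the form `records=N residues=M`.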

Protein sequences

Protein sequences in gzipped FASTA format are stored in the Google Storage bucket ncbi-pathogen-proteins, and the paths to those files are listed in the protein_url field of ncbi-pathogen-detect.pdbrowser.microbigge.

Example:

Get the sequence of a single protein from MicroBIGG-E

Find the protein URL using BigQuery

SELECT protein_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';

The results should be:

protein_url
gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz

Copy the gzipped protein FASTA file using the `gsutil` utility

Enter the following at a Unix shell prompt to copy the gzipped protein FASTA file to your computer. See the Google docs for more information on the gsutil program.

gsutil cp gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz .
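
The two steps above can also be chained in a script. The following is a minimal sketch, assuming an authenticated Google Cloud CLI; first_gs_url and fetch_protein are hypothetical helper names. It relies on the fact that bq prints a header row before the data when --format=csv is used, so the URL is picked out by its gs:// prefix rather than by line position.

```shell
#!/bin/sh
# Sketch only: look up a protein_url with bq, then copy the file with gsutil.

# Read CSV on stdin and print the first line that is a gs:// URL.
first_gs_url() {
    awk '/^gs:\/\// { sub(/\r$/, ""); print; exit }'
}

# Usage: fetch_protein ELEMENT_SYMBOL BIOSAMPLE_ACC
fetch_protein() {
    symbol="$1"
    biosample="$2"
    url="$(bq query --use_legacy_sql=false --format=csv "
        SELECT protein_url
        FROM \`ncbi-pathogen-detect.pdbrowser.microbigge\`
        WHERE element_symbol = '${symbol}'
        AND biosample_acc = '${biosample}'" | first_gs_url)"
    # Stop if the query returned no URL
    [ -n "$url" ] || { echo "no protein_url found" >&2; return 1; }
    gsutil cp "$url" .
}
```

With these definitions, `fetch_protein ompK36_D135DGD SAMN01057611` would reproduce the example above in one call.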

Download all QUINOLONE resistance genes

This example uses a Linux or macOS command line, the Google Cloud CLI, and the bash shell. See Google's Install the Google Cloud CLI documentation for installation instructions.

Authenticate the CLI to give it permissions on your Google Cloud project

See Initializing the gcloud CLI for more information.

gcloud auth login

Follow the instructions to authenticate to Google Cloud.

Download a list of URLs using `bq`

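bq query --use_legacy_sql=false --format=csv --max_rows 300000 '
select distinct protein_url
from `ncbi-pathogen-detect.pdbrowser.microbigge`
where class = "QUINOLONE"
' | grep '^gs://' > all_quinolone_urls.csv

The grep filter keeps only lines that start with gs://, which drops the CSV header row (protein_url) and any empty values.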

Split the list into smaller lists

We do this because directories that contain a very large number of files can be slow or awkward to work with on Unix file systems.

split -d -l 3500 all_quinolone_urls.csv batch.

Use a shell loop to download the protein files

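# Match only the files written by split, not the .asm directories
for file in batch.[0-9][0-9]
do
    mkdir -p "$file.asm"
    # Read the URLs for this batch from stdin and copy them into its directory
    gcloud storage cp --read-paths-from-stdin "$file.asm/" < "$file"
done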
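
Large batch downloads can be interrupted, so it may be worth checking that every listed file arrived. The sketch below is one possible check; missing_downloads is a hypothetical helper, and it assumes each object is saved in the batch directory under the last component of its URL.

```shell
#!/bin/sh
# Sketch only: print the URLs in a batch file that have no non-empty
# local copy in the given download directory.
# Usage: missing_downloads BATCH_FILE DOWNLOAD_DIR
missing_downloads() {
    batch="$1"
    dir="$2"
    while IFS= read -r url; do
        case "$url" in
            gs://*) ;;            # only consider gs:// URLs
            *) continue ;;        # skip header rows or blank lines
        esac
        name="${url##*/}"         # file name = last path component
        [ -s "$dir/$name" ] || printf '%s\n' "$url"
    done < "$batch"
}
```

For example, `missing_downloads batch.00 batch.00.asm > retry_urls.txt` writes the URLs that still need to be copied, and that file can be fed to the same copy command again.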
