Structural Variation Data Hub

Here are the datasets most commonly requested by our users. For a complete listing of all dbVar data please see our Study Browser or our Variant Summary page.

Clinical Structural Variants
Common Structural Variants
Long Read Technology
Genome-wide surveys of structural variation
Datasets most accessed by users

Last updated: Monday, September 18, 2023

Clinical Structural Variants

All structural variants with clinical interpretations curated at ClinVar are included in a single dbVar study: Clinical Structural Variants (nstd102). Many of these variants were previously accessioned in separate studies at dbVar (e.g., nstd37 and nstd101). The old accessions will be retired in 2021. A file linking old accessions to new ones is available here. The easiest way to browse all dbVar clinical variants is to visit the Clinical Structural Variants (nstd102) data track in NCBI's Variation Viewer or connect to the Public dbVar Hub at the UCSC Genome Browser.

Study	Download Regions; Calls	Search Variants	Description
Clinical Structural Variants (nstd102)	83,537; 88,133	nstd102 variants	Structural Variants with clinical assertions, submitted to ClinVar by external labs. dbVar now imports all placements from ClinVar as "submitted" and only remaps what is missing in order to place all variants on both GRCh37 and GRCh38. See Variant Summary counts for nstd102 in dbVar Variant Summary. See the latest statistics for nstd102 in Summary of nstd102 (Clinical Structural Variants).
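The placement rule described for nstd102 (keep every ClinVar placement as "submitted", and remap only to the assembly that is missing) can be sketched roughly as follows. This is an illustration only, not dbVar's pipeline: the function name, the input shape, and the "remapped" label are assumptions made for the example.

```python
# Minimal sketch of the "import as submitted, remap only what is missing" rule.
# Hypothetical names throughout; this does not call any dbVar or ClinVar API.

REQUIRED_ASSEMBLIES = ("GRCh37", "GRCh38")


def plan_placements(submitted):
    """Decide, per assembly, whether a placement is kept or must be remapped.

    `submitted` maps an assembly name to the placement received from ClinVar,
    e.g. {"GRCh38": "chr1:100-200"}. The return value maps every required
    assembly to "submitted" (placement kept as received) or "remapped"
    (placement missing, so it would have to be derived by remapping).
    """
    plan = {}
    for assembly in REQUIRED_ASSEMBLIES:
        plan[assembly] = "submitted" if assembly in submitted else "remapped"
    return plan


# A variant submitted on GRCh38 only needs a remapped GRCh37 placement.
print(plan_placements({"GRCh38": "chr1:100-200"}))
```

A variant that already has placements on both assemblies would need no remapping at all under this rule.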

Common Structural Variants

All common structural variants are included in a single dbVar study: NCBI Curated Common Structural Variants (nstd186). These variants are also accessioned in separate studies at dbVar (1000 Genomes Consortium Phase 3 Integrated SV (estd219), gnomAD Structural Variants (nstd166), DECIPHER Consensus CNVs (nstd183), Lee et al. 2020 (nstd194), Abel et al. 2020 (nstd200), Byrska-Bishop et al. 2022 (nstd206)). A file linking accessions between the studies is available here. The easiest way to browse all dbVar common variants is to visit the NCBI Curated Common Structural Variants (nstd186) data track in NCBI's Variation Viewer or connect to the Public dbVar Hub at the UCSC Genome Browser.

Study	Download Regions; Calls	Search Variants	Description
NCBI Curated Common Structural Variants (nstd186)	92,934; 111,219	nstd186 variants	A curated dataset of all structural variants in dbVar that meet the following criteria: were part of a study with at least 100 samples; included allele frequency data; had an allele frequency of >=0.01 in at least one population. Data content of this study is subject to change as new data become available. See Variant Summary counts for nstd186 in dbVar Variant Summary. See the latest statistics for nstd186 in Summary of nstd186 (NCBI Curated Common Structural Variants).
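The three inclusion criteria for nstd186 amount to a simple filter. The sketch below is a hedged illustration of that filter, not the curation code dbVar uses: the `Variant` record, its field names, and the example accessions are hypothetical, and only the thresholds (at least 100 samples, allele frequency >=0.01 in at least one population) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict

MIN_SAMPLES = 100        # study must include at least 100 samples
MIN_ALLELE_FREQ = 0.01   # allele frequency threshold in at least one population


@dataclass
class Variant:
    """Hypothetical record; field names are assumptions for this sketch."""
    accession: str
    study_sample_count: int
    # population label -> allele frequency; empty if the study reported none
    allele_freqs: Dict[str, float] = field(default_factory=dict)


def is_common(variant: Variant) -> bool:
    """Apply the three criteria listed for the curated common SV study."""
    if variant.study_sample_count < MIN_SAMPLES:
        return False
    if not variant.allele_freqs:  # no allele frequency data available
        return False
    return max(variant.allele_freqs.values()) >= MIN_ALLELE_FREQ


# Made-up example records, one per outcome of the filter.
examples = [
    Variant("example_sv_1", 2504, {"AFR": 0.002, "EUR": 0.03}),  # passes
    Variant("example_sv_2", 50, {"EAS": 0.2}),                   # too few samples
    Variant("example_sv_3", 500),                                # no frequency data
    Variant("example_sv_4", 500, {"AFR": 0.005}),                # too rare everywhere
]
common = [v.accession for v in examples if is_common(v)]
print(common)
```

Note that a variant qualifies as soon as any single population reaches the threshold, even if it is rare in every other population.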

Long Read Technology

Long-read sequencing is better than short-read technologies at capturing large structural variation events. The following studies used long-read sequencing methods to detect structural variants (SVs).

Study	Download Regions; Calls	Search Variants	Description
Genome in a Bottle Structural Variants - Tier I, v0.6 (nstd175)	12,745; 12,745	nstd175 variants	The v0.6 Genome in a Bottle Consortium [www.genomeinabottle.org] structural variant (SV) benchmark set includes ~10,000 sequence-resolved insertions and deletions >49bp from the broadly-consented GIAB/Personal Genome Project Ashkenazi son (HG002/GM24385). These SVs, along with an accompanying benchmark BED file, are discovered and evaluated by multiple short, linked, and long read sequencing technologies and are intended as a benchmark for identifying false positive and false negative SV calls in any method. Original VCF files and the benchmark BED file can be found here. See Variant Summary counts for nstd175 in dbVar Variant Summary. PubMed:Genome in a Bottle.
PacBio Circular Consensus Sequencing of human male (nstd167)	30,218; 30,634	nstd167 variants	PacBio Circular Consensus Sequencing (CCS) of the human male HG002/NA24385 to evaluate the ability of highly-accurate long-read sequencing to identify small and large variants, to phase variants into haplotypes, and to assemble a genome de novo. See Variant Summary counts for nstd167 in dbVar Variant Summary. PubMed:Wenger et al. 2019.
Intermediate-sized deletions examined with Nanopore long-read sequencing (nstd171)	4,378; 4,378	nstd171 variants	Intermediate-sized deletions (30bp-5kbp) were identified from whole-genome sequencing data of a Japanese population using a two-stage identification process. Detected intermediate-sized deletions underwent stringent filtering, and the accuracy of the deletion calls was checked using data from Oxford Nanopore long-read sequencers. See Variant Summary counts for nstd171 in dbVar Variant Summary. PubMed:Wong et al. 2019.
Multi-platform discovery of haplotype-resolved structural variation in human genomes (nstd152)	103,985; 214,917	nstd152 variants	This is an integrated callset from three individuals (HG00514, HG00733, and NA19240) sequenced using Illumina, Illumina 3.5 kbp jumping libraries, Illumina 6kbp jumping libraries, PacBio, BioNano Genomics, 10x, Hi-C, and Strand-seq. See Variant Summary counts for nstd152 in dbVar Variant Summary. PubMed:Chaisson et al. 2019.
Major Structural Variant Alleles of the Human Genome (nstd162)	99,810; 342,842	nstd162 variants	Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we generated long-read sequence data on thirteen genomes. Systematically merging SVs yielded 95,827 sequence-resolved insertions, deletions, and inversions. Among these, we identified more than 1 Mbp of SVs shared among all genomes and more than 6.5 Mbp of SVs in the majority of genomes indicating errors or extreme minor alleles captured in the reference. See Variant Summary counts for nstd162 in dbVar Variant Summary. PubMed:Audano et al. 2019.
Discovery and genotyping of structural variation from long-read haploid genome sequence data (nstd137)	32,954; 35,154	nstd137 variants	In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. Using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that 82% of these variants have been missed as part of analysis of the 1000 Genomes Project. We estimate that this theoretical human diploid differs by as much as ~16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp when compared to short-read sequence data. See Variant Summary counts for nstd137 in dbVar Variant Summary. PubMed:Huddleston et al. 2016.

Genome-wide surveys of structural variation

The following are high-quality datasets that contain the results of genome-wide discovery surveys of CNVs and other Structural Variants from a wide variety of global populations.

Study	Download Regions; Calls	Search Variants	Description
Structural variants in gnomAD (nstd166)	304,733; 313,581	nstd166 variants	The v2.1 release of gnomAD-SV represents a catalogue of structural variants (SVs) discovered from whole-genome sequencing of 14,891 individuals at 32X mean coverage with 2x150bp Illumina reads. From this dataset, site-level SV data were released for 10,847 unrelated individuals with appropriate consent for broad data sharing. For more information, please refer to Collins, Brand, et al., bioRxiv (2019), or the gnomAD-SV explainer. Original VCF files can be found here and, with dbVar accessions included, here. See Variant Summary counts for nstd166 in dbVar Variant Summary. PubMed:gnomAD_Structural_Variants.
Major Structural Variant Alleles of the Human Genome (nstd162)	99,810; 342,842	nstd162 variants	Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we generated long-read sequence data on thirteen genomes. Systematically merging SVs yielded 95,827 sequence-resolved insertions, deletions, and inversions. Among these, we identified more than 1 Mbp of SVs shared among all genomes and more than 6.5 Mbp of SVs in the majority of genomes indicating errors or extreme minor alleles captured in the reference. See Variant Summary counts for nstd162 in dbVar Variant Summary. PubMed:Audano et al. 2019.
Multi-platform discovery of haplotype-resolved structural variation in human genomes (nstd152)	103,985; 214,917	nstd152 variants	This is an integrated callset from three individuals (HG00514, HG00733, and NA19240) sequenced using Illumina, Illumina 3.5 kbp jumping libraries, Illumina 6kbp jumping libraries, PacBio, BioNano Genomics, 10x, Hi-C, and Strand-seq. See Variant Summary counts for nstd152 in dbVar Variant Summary. PubMed:Chaisson et al. 2019.
Genome in a Bottle Structural Variants - Tier I, v0.6 (nstd175)	12,745; 12,745	nstd175 variants	The v0.6 Genome in a Bottle Consortium [www.genomeinabottle.org] structural variant (SV) benchmark set includes ~10,000 sequence-resolved insertions and deletions >49bp from the broadly-consented GIAB/Personal Genome Project Ashkenazi son (HG002/GM24385). These SVs, along with an accompanying benchmark BED file, are discovered and evaluated by multiple short, linked, and long read sequencing technologies and are intended as a benchmark for identifying false positive and false negative SV calls in any method. Original VCF files and the benchmark BED file can be found here. See Variant Summary counts for nstd175 in dbVar Variant Summary. PubMed:Genome in a Bottle.
1000 Genomes Project (Phase 3 SV analysis) (estd219)	68,825; 8,812,557	estd219 variants	1000 Genomes Phase 3 structural variants as reported in a companion paper specifically dedicated to SV analysis. Many of these data are identical to those reported in the main paper as study estd214. See Variant Summary counts for estd219 in dbVar Variant Summary. PubMed:1000 Genomes Consortium Phase 3 Integrated SV.
Short Tandem Repeat (STR) Population Survey (nstd128)	1,328,521; 4,394,628	nstd128 variants	We report high quality genomes from 300 individuals from 142 diverse populations. As part of this study, we generated a comprehensive catalog of short tandem repeat (STR) genotypes. We used this call set to characterize allele frequency spectra, analyze sequence determinants of STR variation, and to identify common loss of function alleles. See Variant Summary counts for nstd128 in dbVar Variant Summary. PubMed:Mallick et al. 2016.
CNV Global Population Survey (nstd112)	15,012; 3,303,297	nstd112 variants	To explore the diversity and selective signatures of duplications and deletions in human copy number variation (CNV), we sequenced 236 individuals from 125 distinct human populations. We observed that duplications exhibit fundamentally different population genetic and selective signatures than deletions and are more likely to be stratified between human populations. We find that the proportion of CNV to SNV base pairs is greater among non-Africans than it is among African populations but we conclude that this difference is likely due to unique aspects of non-African population history as opposed to differences in CNV load. See Variant Summary counts for nstd112 in dbVar Variant Summary. PubMed:Sudmant et al. 2015.

Datasets most accessed by users

The following are the top 10 most accessed datasets in the last 12 months.

Study	Download Regions; Calls	Search Variants	Description