ClinVar Variations in VCF Format
This document describes the ClinVar set of human variations in the VCF format.
The files report on human variations with clinical assertions that have been mapped to assemblies GRCh37 and GRCh38. The files are provided at the ClinVar FTP repository ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/.
ClinVar Files
ClinVar VCF files currently represent all human variants with precise endpoints that have been reported to ClinVar.
These files exclude:
- Variants with imprecise endpoints, such as those identified by microarray. We plan to develop a second set of VCF files to represent these variants.
- Other variants at the same location that are registered in dbSNP but do not have an assertion in ClinVar.
- Variants that cannot be localized on the genome, such as variants reported in the literature with legacy nomenclature that cannot be confirmed.
ClinVar VCF files are allele-specific: each row represents a single allele at a given position, rather than one row per rs number as in the dbSNP VCF files.
ClinVar provides VCF files for both GRCh37 and GRCh38.
Note that we use VCF version 4.1.
Table 1 below summarizes the files generated by ClinVar, with a brief overview of their content.
Each file in the table is built per assembly as part of ClinVar’s monthly release and is located in the respective directory:
- ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/
- ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/
Files using the first version of the ClinVar VCF format (1.0) are archived in the following directories:
- ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_1.0/
- ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_1.0/
Files using the new version of the ClinVar VCF format (2.0) are archived in the following directories:
- ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/
- ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/
Notes on content
- The ID column (column 3) reports the ClinVar Variation ID.
- ClinVar accepts all IUPAC ambiguity codes for nucleotides. However, the VCF specification (https://samtools.github.io/hts-specs/VCFv4.2.pdf) only allows ambiguity code N. Thus ClinVar XML retains the actual ambiguous bases, but all ambiguous values are converted to N in the VCF files.
- Interpretations may be made on a single variant or a set of variants, such as a haplotype. Variants that have only been interpreted as part of a set of variants (i.e. no direct interpretation for the variant itself) are considered "included" variants. The VCF files include both variants with a direct interpretation and included variants. Included variants do not have an associated disease (CLNDN, CLNDISDB) or a clinical significance (CLNSIG). Instead, three tags are specific to included variants: CLNDNINCL, CLNDISDBINCL, and CLNSIGINCL (see below).
- Data reported in the INFO tags is aggregated by Variation ID. INFO tags that are retained from the old format are CLNDN, CLNDISDB, CLNSIG, GENEINFO, RS, SSR.
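The ambiguity-code note above can be illustrated with a minimal Python sketch. It is not ClinVar's conversion code; the function name `to_vcf_bases` and the constant `AMBIGUITY_CODES` are hypothetical, and the sketch only assumes the standard IUPAC nucleotide ambiguity letters.

```python
# IUPAC nucleotide ambiguity codes other than N (assumption: standard IUPAC set).
AMBIGUITY_CODES = set("RYSWKMBDHV")


def to_vcf_bases(seq: str) -> str:
    """Replace every IUPAC ambiguity code with N; A, C, G, T and N pass through."""
    return "".join("N" if base.upper() in AMBIGUITY_CODES else base for base in seq)


# Hypothetical allele strings, for illustration only.
print(to_vcf_bases("ACRT"))
print(to_vcf_bases("ACGTN"))
```

Because the conversion is lossy, the original ambiguous bases can only be recovered from the ClinVar XML, not from the VCF.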
INFO Tag | Comment |
---|---|
AF_ESP, AF_EXAC, AF_TGP | Allele frequency is reported in three tags, one for each source of data |
ALLELEID | The ClinVar Allele ID for the variant |
CLNDNINCL | Used only for “included” variants. ClinVar's preferred disease name for an interpretation for a haplotype or genotype that includes this variant |
CLNDISDBINCL | Used only for “included” variants. The database name and identifier for the disease name for an interpretation for a haplotype or genotype that includes this variant. Multiples are separated by a pipe |
CLNHGVS | The top-level genomic HGVS expression for the variant. This may be on an accession for the primary assembly or on an ALT LOCI |
CLNSIGINCL | Used only for “included” variants. The clinical significance of a haplotype or genotype that includes this variant. It is reported as pairs of Variation ID for the haplotype or genotype and the corresponding clinical significance |
CLNVI | Identifiers for the variant in other databases, e.g. OMIM Allelic variant IDs |
CLNVC | The type of variation, using terms from Sequence Ontology |
CLNVCSO | The Sequence Ontology identifier for the type of variation |
MC | The predicted molecular consequence of the variant. It is reported as pairs of the Sequence Ontology (SO) identifier and the molecular consequence term joined by a vertical bar. Multiple values are separated by a comma. This tag replaces ASS, DSS, INT, NSF, NSM, NSN, R3, R5, SYN, U3, and U5 in the old format |
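As a rough illustration of how the INFO tags above might be read, the following Python sketch splits an INFO column into tags, breaks the MC value into its pairs of SO identifier and term, and checks whether a record looks like an "included" variant. The helper names (`parse_info`, `parse_mc`, `is_included_variant`) are hypothetical, and the example INFO string is made up rather than taken from a ClinVar release. The sketch assumes the usual VCF convention of semicolon-separated `key=value` entries.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a tag -> value dict (flag tags map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_mc(value: str) -> list:
    """Split an MC value into (SO identifier, consequence term) pairs."""
    pairs = []
    for entry in value.split(","):      # multiple values are comma-separated
        so_id, _, term = entry.partition("|")  # each pair is joined by a vertical bar
        pairs.append((so_id, term))
    return pairs


def is_included_variant(fields: dict) -> bool:
    """Included variants carry CLNSIGINCL and have no direct CLNSIG."""
    return "CLNSIGINCL" in fields and "CLNSIG" not in fields


# Made-up INFO string for illustration only.
example = (
    "ALLELEID=12345;CLNVC=single_nucleotide_variant;"
    "MC=SO:0001583|missense_variant,SO:0001627|intron_variant"
)
info = parse_info(example)
print(parse_mc(info["MC"]))
print(is_included_variant(info))
```

For production use, an established VCF parser is likely a better choice than hand-splitting strings; the sketch is only meant to make the delimiter conventions in the table concrete.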