SOFT submission instructions for high-throughput sequencing data

Overview

Simple Omnibus Format in Text (SOFT) is designed for batch submission (and download) of data. SOFT is a simple line-based, plain text format, meaning that SOFT files may be generated from common spreadsheet and database applications. A single SOFT file for a high-throughput sequencing submission can hold descriptive information for multiple Samples and a Series record.

Raw data files and processed data files must also be submitted. Because these files are large, high-throughput sequencing submissions should be transferred by FTP, following the instructions on the main high-throughput sequence data submission page.

A SOFT submission template and an example are available as guidelines for SOFT file structure and preparation. The templates are a good starting point for understanding SOFT.

SOFT format structure

The following section explains the components and structure of a SOFT high-throughput sequencing submission.

  • Line-type characters: There are two different types of line that are recognized in SOFT. The presence of a caret (^) or bang (!) in the first character position indicates the line type. The two line-type characters are:
    Symbol | Description | Line type
    ^ | caret lines | entity indicator line
    ! | bang lines | entity attribute line
  • Label-value pairs: Label-value pairs are the generic way that lines are organized. Data lines are the only line types that are not organized in label-value pairs. Label-value pairs have the form:
    • [line-type character] [label] = [value]
  • Entity types (caret lines): Entity type and its unique identifier are indicated as a label-value pair on the caret lines. The entity's unique ID is any string of characters different from any other entity ID within the document (i.e., locally unique). High-throughput sequencing submitters should supply entity types SAMPLE and SERIES.
    Entity type | Example entity indicator line
    Sample | ^SAMPLE = my_sample_name
    Series | ^SERIES = my_series_name
  • Attributes (bang lines): Entity attributes are contained in bang lines and immediately follow caret lines or other bang lines.

    The second column in the tables below indicates the 'number of allowed values' per attribute:

    • '1' indicates required, only one value allowed
    • '1 or more' indicates required, one or more values allowed
    • '0 or 1' indicates not required, at most one value allowed
    • '0 or more' indicates not required, zero or more values allowed

    Human Genomic Data Submitted to Unrestricted-Access Repositories

    NIH-funded studies: If you plan to submit large-scale human genomic data, as defined by the NIH Genomic Data Sharing (GDS) Policy, to be maintained in an unrestricted-access NCBI database, NIH expects you to 1) submit an Institutional Certification to assure that the data submission and expectations defined in the NIH GDS Policy have been met, and 2) register the study in NCBI BioProject regardless of where the data will ultimately reside (e.g., GenBank, SRA, GEO). If you have any questions about whether your research is subject to the NIH GDS Policy, please contact the relevant NIH Program Official and/or the Genomic Program Administrator. If you plan to submit genomic data from human specimens that would not be considered large-scale, it is your responsibility to ensure that the submitted information does not compromise participant privacy and is in accord with the original consent in addition to all applicable laws, regulations, and institutional policies.

    Non-NIH-funded studies: If your study is not NIH-funded, you are not required to comply with the GDS Policy, but you must have the appropriate consent/permission to submit the data to a public database like GEO. GEO is not able to help interpret your consent forms; consult your IRB about them. It is your responsibility to ensure that the submitted information does not compromise participant privacy and is in accord with the original consent in addition to all applicable laws, regulations, and institutional policies. If you do not have consent to make the data fully public in a database like GEO, you can apply to the NIH Office of Science Policy to find an NIH Institute that will sponsor your study in NCBI's dbGaP database. dbGaP has controlled access mechanisms and is an appropriate resource for hosting sensitive patient data. The sponsor would create a Data Access Request and Use Certification and define use restrictions for use in approving data access requests.

Label | Number of allowed labels | Allowed values and constraints | Content guidelines
^SAMPLE | 1 | any, must be unique within local file | Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records.
!Sample_type | 1 | SRA | !Sample_type = SRA
!Sample_title | 1 | string of length less than 120 characters, must be unique within local file and over all previously submitted Samples for that submitter | Provide a unique title that describes this Sample.
!Sample_source_name | 1 | any | Briefly identify the biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
!Sample_organism | 1 or more | use standard NCBI Taxonomy nomenclature | Identify the organism(s) from which the biological material was derived.
!Sample_characteristics | 1 or more | 'Tag: Value' format | Describe all available characteristics of the biological source, including factors not necessarily under investigation. Provide in 'Tag: Value' format, where 'Tag' is a type of characteristic (e.g. "gender", "strain", "tissue", "developmental stage", "tumor stage", etc), and 'Value' is the value for each tag (e.g. "female", "129SV", "brain", "embryo", etc). Include as many characteristics fields as necessary to thoroughly describe your Samples.
!Sample_molecule | 1 | total RNA, polyA RNA, cytoplasmic RNA, nuclear RNA, genomic DNA, or other | Specify the type of molecule that was extracted from the biological material.
!Sample_biomaterial_provider | 0 or more | any | Specify the name of the company, laboratory or person that provided the biological material.
!Sample_treatment_protocol | 0 or more | any | Describe any treatments applied to the biological material prior to extract preparation. You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
!Sample_growth_protocol | 0 or more | any | Describe the conditions that were used to grow or maintain organisms or cells prior to extract preparation. You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
!Sample_extract_protocol | 1 or more | any | Describe the protocol used to isolate the extract material. You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
!Sample_library_construction_protocol | 1 or more | | Describe the library construction protocol.
!Sample_library_strategy | 1 or more | See list of library strategy values below | Sequencing technique for this library.
!Sample_instrument_model | 0 or 1 | See list of instrument models below | Select an instrument model from the list.
!Sample_data_processing | 1 or more | any | Provide details of how data were generated and calculated. For example, what software was used, how and to what were the reads aligned, what filtering parameters were applied, how were peaks calculated, etc. Include a separate 'data processing' attribute for each file type described.
!Sample_genome_build | 1 | | UCSC or NCBI genome build number (e.g., hg18, mm9, human NCBI genome build 36, etc...), or reference sequence used for read alignment.
!Sample_processed_data_files_format_and_content | | | Describe the supplementary file format and content.
!Sample_adapters | 0 or 1 | any | For multiplexed/barcoded experiments, provide the adapter sequences and barcodes.
!Sample_description | 0 or more | any | Include any additional information not provided in the other fields, or paste in broad descriptions that cannot be easily dissected into the other fields.
!Sample_supplementary_file | 1 or more | name of processed data file, or 'none' | See processed data guidelines for additional instructions.
!Sample_raw_file_run* | * | See Raw data guidelines section | See Raw data guidelines section for complete instructions.
!Sample_geo_accession | 0 or 1 | a valid Sample accession number (GSMxxx) | Only use for performing updates to existing GEO records.

Label | Number of allowed labels | Allowed values and constraints | Content guidelines
^SERIES | 1 | any, must be unique within local file | Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records.
!Series_title | 1 | string of length 1-255 characters, must be unique within local file and over all previously submitted Series for that submitter | Provide a unique title that describes the overall study.
!Series_summary | 1 or more | any | Summarize the goals and objectives of this study. The abstract from the associated publication may be suitable. You can include as much text as you need to thoroughly describe the study.
!Series_overall_design | 1 | any | Provide a description of the experimental design. Indicate how many Samples are analyzed, whether replicates are included, and whether there are control and/or reference Samples, dye-swaps, etc.
!Series_pubmed_id | 0 or more | an integer | Specify a valid PubMed identifier (PMID) that references a published article describing this study. Most commonly, this information is not available at the time of submission; it can be added later once the data are published.
!Series_web_link | 0 or more | valid URL | Specify a Web link that directs users to supplementary information about the study. Please restrict to Web sites that you know are stable.
!Series_contributor | 0 or more | each value in the form 'firstname,middleinitial,lastname' or 'firstname,lastname': firstname must be at least one character and cannot contain spaces; middleinitial, if present, is one character; lastname is at least two characters and can contain spaces. | List all people associated with this study.
!Series_supplementary_file | 0 or more | | Name of supplementary files containing data for more than 1 sample.
!Series_sample_id | 1 or more | valid Sample identifiers | Reference the Samples that make up this experiment. Reference the Sample accession numbers (GSMxxx) if the Samples already exist in GEO, or reference the ^SAMPLE identifiers if they are being submitted in the same file.
!Series_geo_accession | 0 or 1 | a valid Series accession number (GSExxx) | Only use for performing updates to existing GEO records.
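
The line types and attribute counts described in this section can be checked mechanically before submission. The following is a minimal Python sketch, not an official GEO tool. It splits SOFT text into entities using the caret and bang line types and reports attributes whose count falls outside the allowed number. The names parse_soft, check_counts, RULES and EXAMPLE are assumptions made for illustration, and RULES encodes only a few rows of the tables above.

```python
from collections import defaultdict

# Partial, illustrative subset of the tables above: label -> (min, max),
# where max=None stands for "or more".
RULES = {
    "SAMPLE": {
        "Sample_type": (1, 1),
        "Sample_title": (1, 1),
        "Sample_source_name": (1, 1),
        "Sample_organism": (1, None),
        "Sample_characteristics": (1, None),
        "Sample_molecule": (1, 1),
    },
    "SERIES": {
        "Series_title": (1, 1),
        "Series_summary": (1, None),
        "Series_overall_design": (1, 1),
        "Series_sample_id": (1, None),
    },
}


def parse_soft(text):
    """Split SOFT text into a list of (entity_type, entity_id, attributes)."""
    entities = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] not in "^!":
            continue  # skip blank lines and anything that is not a caret/bang line
        # Label-value pair: [line-type character] [label] = [value]
        label, _, value = line[1:].partition("=")
        label, value = label.strip(), value.strip()
        if line[0] == "^":
            entities.append((label.upper(), value, defaultdict(list)))
        elif entities:
            entities[-1][2][label].append(value)
    return entities


def check_counts(entities):
    """Return messages for attributes whose count is outside the allowed range."""
    problems = []
    for etype, eid, attrs in entities:
        for label, (lo, hi) in RULES.get(etype, {}).items():
            n = len(attrs.get(label, []))
            if n < lo or (hi is not None and n > hi):
                problems.append(f"{etype} {eid}: {label} appears {n} time(s)")
    return problems


# Hypothetical input: the Series deliberately lacks a summary and overall design.
EXAMPLE = """\
^SAMPLE = my_sample_name
!Sample_type = SRA
!Sample_title = muscle_exercised_rep1
!Sample_source_name = vastus lateralis muscle, exercised, 60 min
!Sample_organism = Homo sapiens
!Sample_characteristics = tissue: muscle
!Sample_molecule = total RNA
^SERIES = my_series_name
!Series_title = Exercise response in muscle
!Series_sample_id = my_sample_name
"""

if __name__ == "__main__":
    for problem in check_counts(parse_soft(EXAMPLE)):
        print(problem)
```

Run on the example text, the check should flag the two Series attributes that are missing. Controlled vocabularies, title length limits and uniqueness across previous submissions are not covered by this sketch.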

List of library strategies:

  • RNA-Seq
  • miRNA-Seq
  • ncRNA-Seq
  • RNA-Seq (size fractionation)
  • RNA-Seq (CAGE)
  • RNA-Seq (RACE)
  • ChIP-Seq
  • MNase-Seq
  • MBD-Seq
  • MRE-Seq
  • Bisulfite-Seq
  • Bisulfite-Seq (reduced representation)
  • MeDIP-Seq
  • DNase-Hypersensitivity
  • Tn-Seq
  • FAIRE-seq
  • SELEX
  • RIP-Seq
  • ChIA-PET
  • OTHER

Processed data guidelines

  • Processed data are required.
  • Requirements for processed data files are listed in the main high-throughput sequence data submission page.
  • Use the !Sample_supplementary_file and !Series_supplementary_file attributes to list the processed data files associated with the records.

Raw data guidelines

  • Raw data files are required.
  • Accepted file types and other important information are found on the main high-throughput sequence data submission page.
  • A GEO Sample can include multiple runs. Each run should be identified by a numbered suffix (_run1, _run2, etc.). A run can list multiple files.
  • Each raw file requires several attributes listing name, type, checksum, read length, instrument model, etc. The raw file attributes are listed below.

    Required raw file attributes

    • !Sample_raw_file_name_run1 =
    • !Sample_raw_file_type_run1 =
    • !Sample_raw_file_checksum_run1 =
    • !Sample_raw_file_read_length_run1 =
    • !Sample_raw_file_single_or_paired-end_run1 =
    • !Sample_raw_file_instrument_model_run1 =
    • !Sample_raw_file_name_run2 =
    • !Sample_raw_file_type_run2 =
    • !Sample_raw_file_checksum_run2 =
    • !Sample_raw_file_read_length_run2 =
    • !Sample_raw_file_single_or_paired-end_run2 =
    • !Sample_raw_file_instrument_model_run2 =
    Etc...

    Raw file attributes content

    • !Sample_raw_file_name_run1 = [list file name(s); comma-separated]
    • !Sample_raw_file_type_run1 = [list file type(s); controlled vocabulary; comma-separated]
    • !Sample_raw_file_checksum_run1 = [list md5 checksum(s); comma-separated]
    • !Sample_raw_file_read_length_run1 = [list read length(s); comma-separated]
    • !Sample_raw_file_single_or_paired-end_run1 = [enter "single" or "paired-end"]
    • !Sample_raw_file_instrument_model_run1 = [list instrument model; controlled vocabulary]
    • !Sample_raw_file_assembly_run1 = [Genome assembly used for BAM file; BAM files only]
    • !Sample_raw_file_insert_size_run1 = [for paired-end runs; optional]
    • !Sample_raw_file_standard_deviation_run1 = [for paired-end runs; optional]

    Examples:

    # Single-end fastq file

    • !Sample_raw_file_name_run1 = myfile.fastq
    • !Sample_raw_file_type_run1 = fastq
    • !Sample_raw_file_checksum_run1 = checksum
    • !Sample_raw_file_read_length_run1 = 36
    • !Sample_raw_file_single_or_paired-end_run1 = single
    • !Sample_raw_file_instrument_model_run1 = Illumina NextSeq 500

    # BAM file

    • !Sample_raw_file_name_run1 = myfile.bam
    • !Sample_raw_file_type_run1 = bam
    • !Sample_raw_file_checksum_run1 = checksum
    • !Sample_raw_file_read_length_run1 = 36
    • !Sample_raw_file_single_or_paired-end_run1 = single
    • !Sample_raw_file_instrument_model_run1 = Illumina HiSeq 2000
    • !Sample_raw_file_assembly_run1 = mm9

    # SOLiD single-end files

    • !Sample_raw_file_name_run1 = myfile.seq, myfile.qual
    • !Sample_raw_file_type_run1 = illumina_native_seq, illumina_native_qual
    • !Sample_raw_file_checksum_run1 = checksum1, checksum2
    • !Sample_raw_file_read_length_run1 = 36
    • !Sample_raw_file_single_or_paired-end_run1 = single
    • !Sample_raw_file_instrument_model_run1 = AB 5500xl Genetic Analyzer

    # Paired-end fastq files

    • !Sample_raw_file_name_run1 = myfile_r1.fastq, myfile_r2.fastq
    • !Sample_raw_file_type_run1 = fastq
    • !Sample_raw_file_checksum_run1 = checksum1, checksum2
    • !Sample_raw_file_read_length_run1 = 36, 36
    • !Sample_raw_file_single_or_paired-end_run1 = paired-end
    • !Sample_raw_file_insert_size_run1 = 200
    • !Sample_raw_file_standard_deviation_run1 = 15
    • !Sample_raw_file_instrument_model_run1 = Illumina HiSeq 2500

    # Paired-end SOLiD files

    • !Sample_raw_file_name_run1 = myfile1.seq, myfile1.qual, myfile2.seq, myfile2.qual
    • !Sample_raw_file_type_run1 = illumina_native_seq, illumina_native_qual
    • !Sample_raw_file_checksum_run1 = checksum1, checksum2, checksum3, checksum4
    • !Sample_raw_file_read_length_run1 = 36, 36
    • !Sample_raw_file_single_or_paired-end_run1 = paired-end
    • !Sample_raw_file_insert_size_run1 = 200
    • !Sample_raw_file_standard_deviation_run1 = 25
    • !Sample_raw_file_instrument_model_run1 = AB SOLiD 4 System
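
The raw file attribute lines for a run can also be generated by a script rather than typed by hand. The following Python sketch is one possible approach under stated assumptions: the helper names md5_of_file and raw_file_lines are hypothetical, and the attribute labels follow the "Required raw file attributes" list (including the _name_ label). MD5 checksums are computed with Python's standard hashlib module.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def raw_file_lines(run, names, types, checksums, read_lengths, layout,
                   instrument_model, extra=None):
    """Format the raw file attributes of one run as SOFT bang lines.

    extra is an optional dict for attributes such as assembly, insert_size
    or standard_deviation; its keys are appended after the required ones.
    """
    if layout not in ("single", "paired-end"):
        raise ValueError('layout must be "single" or "paired-end"')
    fields = [
        ("name", ", ".join(names)),
        ("type", ", ".join(types)),
        ("checksum", ", ".join(checksums)),
        ("read_length", ", ".join(str(n) for n in read_lengths)),
        ("single_or_paired-end", layout),
        ("instrument_model", instrument_model),
    ]
    fields.extend((extra or {}).items())
    return [f"!Sample_raw_file_{key}_run{run} = {value}" for key, value in fields]


if __name__ == "__main__":
    # Placeholder checksums; in practice use md5_of_file(path) for each file.
    lines = raw_file_lines(
        1,
        ["myfile_r1.fastq", "myfile_r2.fastq"],
        ["fastq"],
        ["checksum1", "checksum2"],
        [36, 36],
        "paired-end",
        "Illumina HiSeq 2500",
        extra={"insert_size": 200, "standard_deviation": 15},
    )
    print("\n".join(lines))
```

The sketch does not check file types or instrument models against the controlled vocabularies; those values are passed through unchanged.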

List of instrument models:

  • Illumina Genome Analyzer
  • Illumina Genome Analyzer II
  • Illumina Genome Analyzer IIx
  • Illumina HiSeq 2500
  • Illumina HiSeq 2000
  • Illumina HiSeq 1500
  • Illumina HiSeq 1000
  • Illumina MiSeq
  • Illumina HiScanSQ
  • Illumina NextSeq 500
  • NextSeq 500
  • HiSeq X Ten
  • HiSeq X Five
  • Illumina HiSeq 3000
  • Illumina HiSeq 4000
  • NextSeq 550
  • AB SOLiD System
  • AB SOLiD System 2.0
  • AB SOLiD System 3.0
  • AB SOLiD 3 Plus System
  • AB SOLiD 4 System
  • AB SOLiD 4hq System
  • AB SOLiD PI System
  • AB 5500 Genetic Analyzer
  • AB 5500xl Genetic Analyzer
  • AB 5500xl-W Genetic Analysis System
  • 454 GS
  • 454 GS 20
  • 454 GS FLX
  • 454 GS FLX+
  • 454 GS Junior
  • 454 GS FLX Titanium
  • Helicos HeliScope
  • PacBio RS
  • PacBio RS II
  • Complete Genomics
  • Ion Torrent PGM
  • Ion Torrent Proton

SOFT submission templates

The following template and example can be used to help prepare SOFT submissions. Multiple Samples and Series can be concatenated into a single SOFT file.

Batch updates in SOFT

Batch updates can be performed in SOFT format: include the attribute "!Sample_geo_accession = GSMxxx", where GSMxxx is the accession number of the record to be updated (similarly, use "!Series_geo_accession = GSExxx" for a Series). You can provide the entire SOFT record with the necessary revisions or only the revised attributes.

Note that it is possible to perform SOFT updates on data that were submitted via another submission route, such as GEOarchive. Likewise, it is possible to perform Web updates on individual records that were originally uploaded in SOFT format.

Submit your SOFT update file by selecting the 'SOFT' option on the Direct Deposit page. Make sure to check the 'Update' box. Successful updates will be reflected immediately on your GEO records.
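
As an illustration of an update that carries only revised attributes, the following Python sketch builds a small SOFT fragment. It is a sketch under assumptions, not a prescribed procedure: the function name update_record is hypothetical, the accession label uses the lowercase spelling from the attribute tables, and it assumes the fragment still begins with a caret line, as described in the format structure section.

```python
def update_record(entity, local_id, accession, revised):
    """Build a SOFT fragment that revises selected attributes of one record.

    entity is "Sample" or "Series"; revised maps attribute labels to a
    single value or a list of values.
    """
    lines = [
        f"^{entity.upper()} = {local_id}",
        f"!{entity}_geo_accession = {accession}",
    ]
    for label, values in revised.items():
        if isinstance(values, str):
            values = [values]
        lines.extend(f"!{label} = {value}" for value in values)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # GSMxxx is the placeholder used in the text; substitute the real accession.
    print(update_record(
        "Sample",
        "my_sample_name",
        "GSMxxx",
        {"Sample_characteristics": ["tissue: brain", "strain: 129SV"]},
    ))
```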

SOFT download

SOFT format is used not only for batch uploads and updates of data, but also for batch download. The only difference between SOFT input and output is a few additional attributes in the output, including:

_geo_accession
_status
_submission_date
_last_update_date
_row_count
_contact_name
_contact_email
_contact_institute
_contact_department
_contact_city
_contact_phone
_contact_fax
_contact_web_link
Series_type

All GEO data are available for download in SOFT format from our anonymous FTP site.
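
When working with a downloaded SOFT file, it can help to separate the attributes that GEO adds on output from those a submitter supplied. The following Python sketch does this by matching attribute labels against the list above. The name split_output_attributes and the sample values are assumptions for illustration only.

```python
# Suffixes and labels taken from the list of additional output attributes above.
OUTPUT_ONLY_SUFFIXES = (
    "_geo_accession", "_status", "_submission_date", "_last_update_date",
    "_row_count", "_contact_name", "_contact_email", "_contact_institute",
    "_contact_department", "_contact_city", "_contact_phone", "_contact_fax",
    "_contact_web_link",
)
OUTPUT_ONLY_LABELS = {"Series_type"}


def split_output_attributes(lines):
    """Split SOFT lines into (submitted lines, output-only attribute lines)."""
    kept, added = [], []
    for line in lines:
        label = ""
        if line.startswith("!"):
            label = line[1:].partition("=")[0].strip()
        if label in OUTPUT_ONLY_LABELS or (label and label.endswith(OUTPUT_ONLY_SUFFIXES)):
            added.append(line)
        else:
            kept.append(line)
    return kept, added


if __name__ == "__main__":
    # Hypothetical downloaded lines with placeholder values.
    downloaded = [
        "^SAMPLE = GSMxxx",
        "!Sample_title = muscle_exercised_rep1",
        "!Sample_geo_accession = GSMxxx",
        "!Sample_status = [status]",
        "!Sample_type = SRA",
    ]
    kept, added = split_output_attributes(downloaded)
    print("\n".join(kept))
```

Note that _geo_accession appears in the output list but is also a valid input attribute when performing updates, so it may need to be kept depending on the use.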

Last modified: February 22, 2024