Accepted Read Data Formats

Single cell read data

Single cell read data can be submitted in BAM, CRAM or multi-fastq format.

BAM/CRAM format

Single cell read data can be submitted in the BAM or CRAM format using the following tags specified in the SAM Optional Fields Specification:

  • CB: Cell identifier

  • CR: Cellular barcode sequence bases (uncorrected)

  • CY: Phred quality of the cellular barcode sequence in the CR tag

Multi-fastq format

Multi-fastq data submissions can be made using the programmatic route or Webin-CLI. This is done by entering multiple file names and their respective read_type qualifiers. For more information please see:

Other read data

We recommend that all read data is submitted in the BAM or CRAM format. However, please note that a variety of other data formats are supported as well.

Sample de-multiplexing

Reads for different samples should be submitted using separate files. The only exception is when a BAM or CRAM file contains reads for a large number of samples intented to be always analysed together. In this case the sample associated with the read file should describe the sample group while the BAM or CRAM file should identify the sample for each read.

Standard formats

The following standard file formats are accepted and transformed into Fastq products:

  • cram

  • bam

  • fastq

CRAM format

Each submitted CRAM file must:

  • be compatible with the CRAM Format Specification

  • be readable with Samtools

  • contain only reference sequences that exist in the CRAM Reference Registry

  • be submitted as a separate run

  • use the .cram file name suffix (e.g. ‘a.cram’)

CRAM file names are required to end up with the .cram suffix (e.g. ‘a.cram’).

A CRAM index (CRAI) file is created by the archive for each submitted CRAM file and is available in the same directory as the CRAM file from which is was created. CRAM index file names start with the CRAM file name and end up with the .crai suffix (e.g. ‘a.cram.crai’ for CRAM file ‘a.cram’).

BAM format

Each submitted BAM file must:

PacBio BAM files

We support the submission of the following types of PacBio BAM files:

  • subread BAM files (*.subreads.bam)

  • CCS read BAM files (*.ccs.bam)

Fastq format

We recommend that read data is either submitted in BAM or CRAM format. However, single and paired reads are accepted as Fastq files that meet the following the requirements:

  • Quality scores must be in Phred scale.

  • Both ASCII and space delimitered decimal encoding of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.

  • No technical reads (e.g. adapters, linkers, barcodes, primers) are allowed.

  • Single reads must be submitted using a single Fastq file and can be submitted with or without read names.

  • Paired reads must be submitted using two Fastq files.

  • The first line for each read must start with ‘@’.

  • The base calls and quality scores must be separated by a line starting with ‘+’.

  • Paired read names must either use Casava 1.8 read names (regular expression: ^@(.+)( +|\\t+)([0-9]+):[YN]:[0-9]*[02468]($|:.*$) or must end with /1 or /2 optionally followed by a space and a comment.

  • Read names must not exceed a length of 256 characters.

  • The Fastq files must be compressed using gzip or bzip2.

  • The regular expression for bases is “^([ACGTNactgn.]*?)$”

Example of Fastq file containing single reads:

@read_name
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...

Example of Fastq file containing paired reads (prior to Casava 1.8):

@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...

With Casava 1.8 the format of the ‘@’ line has changed and we accept this pattern too:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Platform specific formats

Oxford Nanopore

Oxford Nanopore native data must be submitted as a single tar.gz archive containing basecalled fast5 files from Guppy, Metrichor, or Albacore.

For Metrichor, an example directory structure for run named XYZ:

XYZ/reads/downloads/fail/
XYZ/reads/downloads/pass/

How to archive all files in the XYZ downloads directory in a linux command line:

cd <directory containing XYZ directory>
tar -cvzf XYZ.tar.gz XYZ/reads/downloads/

PacBio

PacBio data submissions are supported in the platform specific native format.

One run consists of *.bax.h5, *.bas.h5 and xml files. Please note that these files must not be tarred.

SFF format

The SFF format is supported for the 454 and Ion Torrent platforms.

10x Genomics

To submit 10x Genomics data where read indexes exist, you must convert to BAM or CRAM format. The supported tags are defined in the SAM Optional Fields Specification

Deprecated formats

Read submissions are no longer accepted in the following formats:

  • SOLiD native format (support ended in 2010)

  • Illumina native format (support ended in 2010)

  • SOLiD csfasta/qual format (support ended in 2015)

  • Illumina qseq format (support ended in 2015)

  • Illumina scarf format (support ended in 2015)

  • SRF format (deprecated in 2015)

  • Complete Genomics native format (deprecated in 2021)