Accepted Read Data Formats¶
Single cell read data¶
Single cell read data can be submitted in BAM, CRAM or multi-fastq format.
BAM/CRAM format¶
Single cell read data can be submitted in the BAM or CRAM format using the following tags specified in the SAM Optional Fields Specification:
- CB: Cell identifier
- CR: Cellular barcode sequence bases (uncorrected)
- CY: Phred quality of the cellular barcode sequence in the CR tag
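As a rough illustration of how these tags appear in practice, the sketch below reads the optional fields of one SAM-formatted text line and reports which of the three tags are present. The record itself, the barcode values and the helper name `parse_optional_fields` are invented for this example; real BAM or CRAM files are binary and would normally be read with a dedicated library.

```python
def parse_optional_fields(sam_line):
    """Return {tag: value} for the optional fields of one SAM alignment line."""
    columns = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are mandatory; optional fields follow as TAG:TYPE:VALUE.
    for field in columns[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


# Hypothetical unmapped record, invented for illustration only.
example = "\t".join([
    "read1", "4", "*", "0", "0", "*", "*", "0", "0",
    "ACGTACGT", "IIIIIIII",
    "CB:Z:AAACCTGA-1", "CR:Z:AAACCTGA", "CY:Z:FFFFFFFF",
])

tags = parse_optional_fields(example)
# List any of the single cell tags that the record does not carry.
missing = [t for t in ("CB", "CR", "CY") if t not in tags]
```

The CY value carries one quality character per base of the CR value, so comparing their lengths is a cheap sanity check.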
Multi-fastq format¶
Multi-fastq data submissions can be made using the programmatic route or Webin-CLI. This is done by entering multiple file names and their respective read_type qualifiers. For more information please see:
Other read data¶
We recommend that all read data is submitted in the BAM or CRAM format.
However, please note that a variety of other data formats are supported as well.
Sample de-multiplexing¶
Reads for different samples should be submitted using separate files. The only exception is when a BAM or CRAM file contains reads for a large number of samples intended to always be analysed together. In this case the sample associated with the read file should describe the sample group, while the BAM or CRAM file should identify the sample for each read.
Standard formats¶
The following standard file formats are accepted and transformed into Fastq products:
- cram
- bam
- fastq
CRAM format¶
Each submitted CRAM file must:
- be compatible with the CRAM Format Specification
- be readable with Samtools
- contain only reference sequences that exist in the CRAM Reference Registry
- be submitted as a separate run
- use the .cram file name suffix (e.g. ‘a.cram’)
A CRAM index (CRAI) file is created by the archive for each submitted CRAM file and is available in the same directory as the CRAM file from which it was created. CRAM index file names start with the CRAM file name and end with the .crai suffix (e.g. ‘a.cram.crai’ for CRAM file ‘a.cram’).
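The naming rule for index files can be sketched in a few lines. The helper name `crai_name` is hypothetical and only mirrors the rule stated above; it does not contact the archive or create an index.

```python
def crai_name(cram_name):
    """Return the index file name expected for a submitted CRAM file name."""
    if not cram_name.endswith(".cram"):
        raise ValueError(f"CRAM file names must end with .cram: {cram_name!r}")
    # The index name is the full CRAM file name followed by the .crai suffix.
    return cram_name + ".crai"
```

For example, `crai_name("a.cram")` gives `a.cram.crai`.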
BAM format¶
Each submitted BAM file must:
- be compatible with the SAM/BAM Format Specification
- be readable with Samtools
- be submitted as a separate run
- use the .bam file name suffix (e.g. ‘a.bam’)
PacBio BAM files¶
We support the submission of the following types of PacBio BAM files:
- subread BAM files (*.subreads.bam)
- CCS read BAM files (*.ccs.bam)
Fastq format¶
We recommend that read data is submitted in either BAM or CRAM format. However, single and paired reads are accepted as Fastq files that meet the following requirements:
- Quality scores must be in Phred scale.
- Both ASCII and space-delimited decimal encoding of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
- No technical reads (e.g. adapters, linkers, barcodes, primers) are allowed.
- Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
- Paired reads must be submitted using two Fastq files.
- The first line for each read must start with ‘@’.
- The base calls and quality scores must be separated by a line starting with ‘+’.
- Paired read names must either use Casava 1.8 read names (regular expression: ^@(.+)( +|\\t+)([0-9]+):[YN]:[0-9]*[02468]($|:.*$)) or must end with /1 or /2, optionally followed by a space and a comment.
- The Fastq files must be compressed using gzip or bzip2.
- The regular expression for bases is “^([ACGTNactgn.]*?)$”
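Two of the requirements above, the bases pattern and the Phred offset, lend themselves to a quick local check before submission. The sketch below is only an assumption about how such a check could look: `guess_phred_offset` uses a simple heuristic (any quality character below ASCII 64 rules out offset 64) and is not the detection method used by the archive. The function names are hypothetical.

```python
import re

# Bases pattern as quoted in the requirements list.
BASES_RE = re.compile(r"^([ACGTNactgn.]*?)$")


def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from a sample of quality lines.

    Heuristic only: raises ValueError if no quality characters are given.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Characters below '@' (ASCII 64) cannot encode a Phred score at offset 64.
    return 33 if lowest < 64 else 64


def decode_quality(quality, offset):
    """Convert an ASCII-encoded quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]
```

A file whose sampled qualities are all high could still be offset 33, so a guess of 64 is worth confirming against the sequencing platform and software version.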
Example of Fastq file containing single reads:
@read_name
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
Example of Fastq file containing paired reads (prior to Casava 1.8):
@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
With Casava 1.8 the format of the ‘@’ line has changed and we accept this pattern too:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
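To show how the two accepted paired read name styles can be told apart, here is a minimal sketch. It assumes that the doubled backslash in the Casava 1.8 regular expression quoted in the requirements is an escaped tab (`\t`). The pattern `SLASH_NAME_RE` and the helper `read_pair_key` are hypothetical and written for this example only; they are not taken from the archive's validator.

```python
import re

# Casava 1.8 pattern from the requirements, with "\\t" read as a tab character.
CASAVA_RE = re.compile(r"^@(.+)( +|\t+)([0-9]+):[YN]:[0-9]*[02468]($|:.*$)")

# Assumed pattern for names ending in /1 or /2, optionally followed by
# a space and a comment.
SLASH_NAME_RE = re.compile(r"^@(.+)/([12])( .*)?$")


def read_pair_key(header):
    """Return (read name, mate number) for a paired read header, else None."""
    match = CASAVA_RE.match(header)
    if match:
        return match.group(1), match.group(3)
    match = SLASH_NAME_RE.match(header)
    if match:
        return match.group(1), match.group(2)
    return None
```

Two reads belong to the same pair when they share the first element of the returned tuple and differ in the second.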
Platform specific formats¶
Oxford Nanopore¶
Oxford Nanopore native data must be submitted as a single tar.gz archive containing basecalled fast5 files from Guppy, Metrichor, or Albacore.
For Metrichor, an example directory structure for a run named XYZ:
XYZ/reads/downloads/fail/
XYZ/reads/downloads/pass/
How to archive all files in the XYZ downloads directory from a Linux command line:
cd <directory containing XYZ directory>
tar -cvzf XYZ.tar.gz XYZ/reads/downloads/
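Where a script is more convenient than the shell, the same archive can be produced with Python's standard `tarfile` module. This is a sketch under the assumption that the run follows the Metrichor layout shown above; the function name `archive_downloads` is hypothetical.

```python
import tarfile
from pathlib import Path


def archive_downloads(parent_dir, run_name):
    """Create <run_name>.tar.gz in parent_dir from <run_name>/reads/downloads/."""
    parent = Path(parent_dir)
    source = parent / run_name / "reads" / "downloads"
    target = parent / f"{run_name}.tar.gz"
    with tarfile.open(target, "w:gz") as archive:
        # arcname keeps paths relative to parent_dir, matching the tar command.
        archive.add(source, arcname=f"{run_name}/reads/downloads")
    return target
```

Calling `archive_downloads(".", "XYZ")` from the directory containing XYZ writes `XYZ.tar.gz` next to it, with the pass and fail subdirectories included.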
PacBio¶
PacBio data submissions are supported in the platform specific native format.
One run consists of *.bax.h5, *.bas.h5 and xml files. Please note that these files must not be tarred.
SFF format¶
The SFF format is supported for the 454 and Ion Torrent platforms.
10x Genomics¶
To submit 10x Genomics data where read indexes exist, you must convert the data to BAM or CRAM format. The supported tags are defined in the SAM Optional Fields Specification.
Deprecated formats¶
Read submissions are no longer accepted in the following formats:
- SOLiD native format (support ended in 2010)
- Illumina native format (support ended in 2010)
- SOLiD csfasta/qual format (support ended in 2015)
- Illumina qseq format (support ended in 2015)
- Illumina scarf format (support ended in 2015)
- SRF format (deprecated in 2015)
- Complete Genomics native format (deprecated in 2021)