Archive Generated Run Files

Whenever possible, ENA provides access to two types of file for each run we present: the submitted file(s) and archive-generated file(s). Both are visible in the ENA Browser view for runs:

[Image: archive-generated-files-p01.png, the ENA Browser run view listing submitted and archive-generated files]

This page briefly explains why both types are provided and how the submitted and archive-generated files differ.

Submitted Files

The submitted files for any given run are copies of the files originally provided to us by the submitter. These files always undergo validation appropriate to their format and are presented as submitted, with no automated curation. Formats vary: files may be FASTQ, but may also be BAM, FAST5, HDF5, or other formats.

Archive-Generated Files

Providing archive-generated FASTQs for runs is a means of bringing some consistency to the data we provide. By imposing a level of uniformity on these files, we can ensure users know what to expect of them and may incorporate them into pipelines with minimal friction.

Note that archive-generated FASTQ files are not available in the following uncommon scenarios:

  • BAM/CRAM files containing @PG:longranger

  • BAM/CRAM files containing @PG:cellranger

  • BAM/CRAM files containing CB:Z,CR:Z,CY:Z,RX:Z,QX:Z tags

  • Complete Genomics native (data folder) submissions

  • PacBio native (HDF5) submissions

  • Many ONT native format submissions
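
The BAM/CRAM conditions in the list above can be checked on SAM-format text, for example the output of a SAM viewer. The following is a minimal sketch, not ENA's actual check. It assumes that "@PG:longranger" and "@PG:cellranger" mean a @PG header line whose ID or PN field mentions the program, and that any one of the listed tags is enough to match. The function name blocking_reasons is hypothetical.

```python
# Tags and programs named in the list above.
BLOCKING_TAGS = ("CB:Z", "CR:Z", "CY:Z", "RX:Z", "QX:Z")
BLOCKING_PROGRAMS = ("longranger", "cellranger")


def blocking_reasons(sam_lines):
    """Return a sorted list of the listed conditions found in SAM text lines."""
    reasons = set()
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@PG":
            # Assumption: the program is named in the ID or PN field.
            for field in fields[1:]:
                if field[:3] in ("ID:", "PN:"):
                    for program in BLOCKING_PROGRAMS:
                        if program in field[3:].lower():
                            reasons.add("@PG:" + program)
        elif not line.startswith("@"):
            # Alignment record: optional TAG:TYPE:VALUE fields follow
            # the 11 mandatory columns.
            for field in fields[11:]:
                if field[:4] in BLOCKING_TAGS:
                    reasons.add(field[:4])
    return sorted(reasons)
```

An empty result only means that none of these BAM/CRAM conditions was detected in the lines supplied; it does not guarantee that archive-generated FASTQ files exist for the run.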

Generated FASTQ Files

The number of files generated and their content vary depending on the nature of the submitted files:

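+-------------+------------------------------------+------------------------------------------+
| Number of   | FASTQ files                        | Description                              |
| application |                                    |                                          |
| reads       |                                    |                                          |
+=============+====================================+==========================================+
| 1           | <run_accession>.fastq.gz           | For experiments with a single            |
|             | or                                 | application read, all reads are made     |
|             | <run_accession>_1.fastq.gz         | available in one FASTQ file.             |
+-------------+------------------------------------+------------------------------------------+
| 2           | <run_accession>_1.fastq.gz         | Paired experiments with two application  |
|             | <run_accession>_2.fastq.gz         | reads are made available in one to three |
|             | <run_accession>.fastq.gz           | FASTQ files. When both application reads |
|             |                                    | are submitted, the first reads are in    |
|             |                                    | <run_accession>_1.fastq.gz, the second   |
|             |                                    | reads are in <run_accession>_2.fastq.gz, |
|             |                                    | and any unpaired reads are in            |
|             |                                    | <run_accession>.fastq.gz. If all         |
|             |                                    | submitted reads are unpaired, only       |
|             |                                    | <run_accession>.fastq.gz is created.     |
+-------------+------------------------------------+------------------------------------------+
| > 2         | <run_accession>_N.fastq.gz         | For experiments with more than two       |
|             |                                    | application reads (e.g. Complete         |
|             |                                    | Genomics), one FASTQ file is created for |
|             |                                    | each application read; empty FASTQ files |
|             |                                    | are not created.                         |
+-------------+------------------------------------+------------------------------------------+
| N/A         | <run_accession>_consensus.fastq.gz | ONT or PacBio consensus reads.           |
+-------------+------------------------------------+------------------------------------------+
| N/A         | <run_accession>_subreads.fastq.gz  | PacBio subreads.                         |
+-------------+------------------------------------+------------------------------------------+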
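
The naming scheme in the table above lends itself to a small classifier for downloaded files. The sketch below is illustrative only: the function name describe_generated_file is hypothetical, and the accession pattern (three uppercase letters followed by digits) is an assumption that this page does not state.

```python
import re

# File names from the table above. The accession pattern is an
# assumption made for this sketch, not a rule taken from this page.
_NAME = re.compile(
    r"^(?P<run>[A-Z]{3}\d+)"
    r"(?:_(?P<suffix>\d+|consensus|subreads))?"
    r"\.fastq\.gz$"
)


def describe_generated_file(filename):
    """Return (run accession, content kind), or None if the name does not fit."""
    match = _NAME.match(filename)
    if match is None:
        return None
    suffix = match.group("suffix")
    if suffix is None:
        # No suffix: all reads of a single-read run, or the unpaired
        # reads of a paired run.
        kind = "single or unpaired reads"
    elif suffix.isdigit():
        kind = "application read " + suffix
    else:
        kind = suffix  # "consensus" or "subreads"
    return match.group("run"), kind
```

For instance, describe_generated_file("ERR005143_2.fastq.gz") returns ("ERR005143", "application read 2"). Note that a file named <run_accession>_1.fastq.gz may belong to either a single or a paired run, so the file name alone does not determine the layout.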

FASTQ File Format

@<run accession>.<spot index> [<spot name>][/<read index>]
<bases>
+
<phred qualities, ASCII encoded starting with '!' (33)>

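+-----------------+--------------------------------------------------------------+
| Field           | Description                                                  |
+=================+==============================================================+
| <run accession> | The run accession. A spot is identified uniquely by the      |
|                 | combination of the run accession and the spot index.         |
+-----------------+--------------------------------------------------------------+
| <spot index>    | A positive integer assigned to the spots in the order in     |
|                 | which they appear in the run. A spot is identified uniquely  |
|                 | by the combination of the run accession and the spot index.  |
+-----------------+--------------------------------------------------------------+
| <spot name>     | The spot name as provided by the submitter. This field is    |
|                 | absent when the read name is missing or was removed by the   |
|                 | archive.                                                     |
+-----------------+--------------------------------------------------------------+
| <read index>    | A positive integer assigned to the application reads in the  |
|                 | order in which they appear in the spot: /1 for the first     |
|                 | application read and /2 for the second. This field is absent |
|                 | when the read name is missing or was removed by the archive. |
+-----------------+--------------------------------------------------------------+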
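
The header layout described above can be split into its four fields with a regular expression. This is a minimal sketch under stated assumptions: the names FastqHeader and parse_header are hypothetical, and a trailing "/<digits>" is always treated as the read index, which would misread a spot name that itself ends in a slash and a number.

```python
import re
from typing import NamedTuple, Optional


class FastqHeader(NamedTuple):
    run_accession: str
    spot_index: int
    spot_name: Optional[str]   # None when no read name is present
    read_index: Optional[int]  # None when no read index is present


# @<run accession>.<spot index> [<spot name>][/<read index>]
_HEADER = re.compile(
    r"^@(?P<run>[^.\s]+)\.(?P<spot>\d+)"
    r"(?: (?P<name>.+?)(?:/(?P<read>\d+))?)?$"
)


def parse_header(line):
    """Parse the first line of an archive-generated FASTQ record."""
    match = _HEADER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not an archive-generated FASTQ header: %r" % line)
    read = match.group("read")
    return FastqHeader(
        run_accession=match.group("run"),
        spot_index=int(match.group("spot")),
        spot_name=match.group("name"),
        read_index=int(read) if read is not None else None,
    )
```

Because the run accession and spot index together identify a spot, the pair (run_accession, spot_index) is a natural key for matching first and second reads, including in files where read names are absent.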

Examples

Single layout:

@ERR000017.1 IL6_554:7:1:249:322
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
??????????????????????????????>>>>>>

Paired (first read):

@ERR005143.1 ID49_20708_20H04AAXX_R1:7:1:41:356/1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

Paired (second read):

@ERR005143.1 ID49_20708_20H04AAXX_R1:7:1:41:356/2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

Single layout without read names:

@ERR000017.1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
??????????????????????????????>>>>>>

Paired without read names (first read):

@ERR005143.1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

Paired without read names (second read):

@ERR005143.1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
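
The quality lines in these examples encode one Phred score per base as an ASCII character offset by 33, as stated in the format description. The sketch below decodes such a line; the function names are hypothetical, and the conversion to an error probability uses the standard Phred definition Q = -10 log10(p), which this page does not itself spell out.

```python
def phred_scores(quality_line, offset=33):
    """Convert an ASCII-encoded quality string to integer Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probability(score):
    """Standard Phred relation: p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Last two characters of the single-layout example's quality line.
print(phred_scores("?>"))  # [30, 29]
```

A score of 30 corresponds to an estimated error probability of about 1 in 1000 for that base.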