============================= Archive Generated Run Files ============================= Whenever possible, ENA provides access to two types of file for each run we present: the submitted file(s) and archive-generated file(s). Both are visible in the ENA Browser view for runs: .. image:: images/archive-generated-files-p01.png This page serves to briefly discuss the reason for this and the differences between the submitted and archive-generated files. Submitted Files =============== The submitted files for any given run are copies of the files originally provided to us by the submitter. These files always undergo validation appropriate to their format, and are presented as-submitted with no automated curation. Formats are varied, and may be FASTQ but could also be others including BAM, FAST5, HDF5, etc. Archive-Generated Files ======================= Providing archive-generated FASTQs for runs is a means of bringing some consistency to the data we provide. By imposing a level of uniformity on these files, we can ensure users know what to expect of them and may incorporate them into pipelines with minimal friction. Note that archive-generated FASTQ will not be available in the following uncommon scenarios: - BAM/CRAM files containing @PG:longranger - BAM/CRAM files containing @PG:cellranger - BAM/CRAM files containing CB:Z,CR:Z,CY:Z,RX:Z,QX:Z tags - Complete genomics native (data folder) submissions - PacBio native (HDF5) submissions - Many ONT native format submissions Generated FASTQ Files --------------------- The number of files generated and their content varies depending on the nature of the submitted files +---------------+--------------------------------------+-----------------------------------------+ | | Number of | FASTQ Files | Description | | | Application | | | | | Reads | | | +---------------+--------------------------------------+-----------------------------------------+ | | | .fastq.gz | | For experiments with single | | 1 | | or | | application reads all reads will be | | | | _1.fastq.gz | | made available in one fastq file. | +---------------+--------------------------------------+-----------------------------------------+ | | | | Paired experiments with two | | | | | application reads will be made | | | | | available in 1-3 FASTQ files. For a | | | | | paired experiment submitted with both | | | | _1.fastq.gz | | application reads the first reads | | 2 | | .fastq.gz | | will be in _1.fastq.gz | | | | _2.fastq.gz | | file, the second reads will be in | | | | | _2.fastq.gz, and any | | | | | unpaired reads will be in .fastq.gz file. If files | | | | | from a paired experiment are | | | | | submitted and all reads are unpaired | | | | | then only a single file is created: | | | | | .fastq.gz | +---------------+--------------------------------------+-----------------------------------------+ | | | | For experiments with more than two | | > 2 | | _N.fastq.gz | | application reads (e.g. Complete | | | | | Genomics) one fastq file is created | | | | | for each application read, however, | | | | | no empty fastq files are created. | +---------------+--------------------------------------+-----------------------------------------+ | N/A | | _consensus.fastq.gz | ONT or PacBio consensus reads. | +---------------+--------------------------------------+-----------------------------------------+ | N/A | | _subreads.fastq.gz | PacBio subreads. | +---------------+--------------------------------------+-----------------------------------------+ FASTQ File Format _________________ :: @. [][/] + +-----------------+----------------------------------------------------------------------------+ | Field | Description | +-----------------+----------------------------------------------------------------------------+ | | | The run accession. A spot is identified uniquely by the combination | | | | of the run accession and the spot index | +-----------------+----------------------------------------------------------------------------+ | | | A positive integer assigned to the spots in the order in which they | | | | appear in the run. A spot is identified uniquely by the combination of | | | | the Run accession and the spot index. | +-----------------+----------------------------------------------------------------------------+ | | | The spot name as it was provided by the submitter. In cases where the | | | | read name is missing or was removed by the archive this field is not | | | | present. | +-----------------+----------------------------------------------------------------------------+ | | | A positive integer assigned to the application reads in the order in | | | | which they appear in the spot: /1 for first application read and /2 for | | | | the second application read. In cases where the read name is missing or | | | | was removed by the archive this field is not present. | +-----------------+----------------------------------------------------------------------------+ Examples ________ Single layout: :: @ERR000017.1 IL6_554:7:1:249:322 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + ??????????????????????????????>>>>>> Paired (first read): :: @ERR005143.1 ID49_20708_20H04AAXX_R1:7:1:41:356/1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh Paired (second read): :: @ERR005143.1 ID49_20708_20H04AAXX_R1:7:1:41:356/2 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh Single layout without read names: :: @ERR000017.1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + ??????????????????????????????>>>>>> Paired without read names (first read): :: @ERR005143.1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh Paired without read names (second read): :: @ERR005143.1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh