Module 3: Flat File upload - Submit an ENA Supported Sequence File

Annotated sequence entries are stored in the ENA as ENA supported sequence files. Here is an example of an HLA gene in ENA supported format. It is a text file that is machine readable because every line begins with a two-character code (ID, AC, DE ...). The ENA browser renders the text file in a friendlier, more graphical view, but the machine-readable version remains available so that automated pipelines downstream of the ENA can download and parse large numbers of sequence entries.
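To illustrate why the two-character line codes make the file easy to parse, here is a minimal Python sketch that groups the lines of one entry by their code. The function name group_by_line_code and the sample entry are made up for illustration and are not a real ENA record. The sketch assumes that XX lines are spacers, that // ends the entry, and that lines starting with blanks carry sequence data, so all three are skipped.

```python
from collections import defaultdict


def group_by_line_code(flatfile_text):
    """Map each two-character line code to the list of its line contents."""
    groups = defaultdict(list)
    for line in flatfile_text.splitlines():
        code = line[:2]
        # Skip sequence/blank lines, XX spacer lines and the // terminator.
        if code.strip() == "" or code in ("XX", "//"):
            continue
        groups[code].append(line[2:].strip())
    return dict(groups)


# Made-up entry for illustration only; not a real ENA record.
sample = (
    "ID   EXAMPLE1; SV 1; linear; genomic DNA; STD; HUM; 12 BP.\n"
    "XX\n"
    "DE   Made-up description, first line\n"
    "DE   made-up description, second line\n"
    "XX\n"
    "SQ   Sequence 12 BP;\n"
    "     acgtacgtac gt        12\n"
    "//\n"
)

for code, lines in group_by_line_code(sample).items():
    print(code, lines)
```

This only handles a single entry; a file with several entries would first need to be split on the // terminator lines.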

Create your own ENA supported sequence file

In most cases it is not necessary to submit an ENA supported sequence file, because the interactive tool Webin provides spreadsheet templates for various types of sequences. These templates are called ‘annotation checklists’, and they let you submit a tab-separated values (TSV) file that you can fill in using any spreadsheet editor. After submission via Webin or via the programmatic REST API, the TSV is converted into an ENA supported sequence file (or ‘flat file’) and validated before accessions are issued.

Not all sequence types are available as a TSV spreadsheet template/annotation checklist. For instance, the HLA gene above has multiple exons, which is difficult for us to turn into a template. Typically, complicated sequences with multiple and repeating features are the hardest to make into TSV templates. For these types of sequences you can create an ENA supported sequence file yourself and submit it to the ENA using the programmatic REST API (this is submission by “flat file upload”, previously “entry upload”).

For a list of sequence types that are available as annotation checklists (TSV spreadsheets) see here: http://www.ebi.ac.uk/ena/submit/annotation-checklists

Please do not use submission by flat file for any sequence type listed on the above webpage. The spreadsheet/annotation checklist submission route is more robust because we do the file conversion.

For examples of ENA flat files that are not available for submission using annotation checklists/TSV see here: http://www.ebi.ac.uk/ena/submit/entry-upload-templates

Pay close attention to how the flat files are formatted, and use the web page above to construct your sequence flat file, which you will then submit by flat file upload. As with a TSV/annotation checklist submission (module 2), you need to create an analysis object in XML format to wrap the ENA flat file. Please check module 2: Analysis object for more information. To see how the analysis object and the sequence entries will be accessioned, please refer to module 2: A word about Accession Numbers.

Submission by Flat File Upload

Submitting an ENA flat file works in the same way as submitting a tab-separated file, so much of the detail is in module 2. The main difference is that for TSV spreadsheet submissions the TSV file is first converted to an ENA flat file and then validated. For a submission by flat file upload, the conversion is omitted because the file is already in the ENA supported format, so the system will try to validate your ENA flat file after only minimal processing. This leaves a little more room for error, which you can avoid by following the guidelines closely.

Step 1: Create a project

As with a TSV/annotation checklist submission (module 2), a project/study is required. If you already have a study you can add your annotated sequence entries to it. If not, create one first. Use either the interactive submission route or the programmatic submission route to do this. Note the project accession number when you receive it.

Step 2: Compress and upload the sequence flat file

As with a TSV/annotation checklist submission, the sequence flat file must be compressed and uploaded to your Webin FTP directory. You may also need to calculate the MD5 checksum of the compressed file. Check here and here for instructions. In this example I have an ENA flat file called Human_parvovirus_B19_entryupload.embl which I have compressed to create the file Human_parvovirus_B19_entryupload.embl.gz. The checksum of Human_parvovirus_B19_entryupload.embl.gz is 7138bf3320cad8d215b7e9930ded114b.
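If you prefer to script this step, the compression and checksum can be sketched in Python using only the standard library. The function name compress_and_checksum is my own choice for illustration. The checksum is calculated on the compressed .gz file, because that is the file named in the analysis XML.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def compress_and_checksum(flatfile):
    """Gzip `flatfile` next to the original and return (gz_path, md5_hex)."""
    src = Path(flatfile)
    dst = src.with_name(src.name + ".gz")
    # Stream the flat file into a gzip archive without loading it all in memory.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    # Hash the compressed file in 1 MiB chunks.
    md5 = hashlib.md5()
    with dst.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    return str(dst), md5.hexdigest()
```

Calling compress_and_checksum("Human_parvovirus_B19_entryupload.embl") would write Human_parvovirus_B19_entryupload.embl.gz and return its path and MD5 value. A gzip file records a timestamp in its header, so compressing the same flat file twice can give two different checksums. Always use the checksum of the exact file you upload.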

Step 3: Create the analysis and submission XMLs

First check how the analysis file was created in module 2 step 4.

In this example the analysis file looks like this:

<?xml version = '1.0' encoding = 'UTF-8'?>
<ANALYSIS_SET>
   <ANALYSIS alias="Human_parvovirus_B19_entryupload" center_name="EBI">
      <TITLE>Human parvovirus B19 isolate IRB_1_2008 NS1 and VP1 unique region genes, partial cds</TITLE>
      <DESCRIPTION>Human parvovirus B19 isolate IRB_1_2008 NS1 and VP1 unique region genes, partial cds</DESCRIPTION>
      <STUDY_REF accession="PRJEBXXXX">
      </STUDY_REF>
      <ANALYSIS_TYPE>
         <SEQUENCE_FLATFILE/>
      </ANALYSIS_TYPE>
      <FILES>
         <FILE checksum="7138bf3320cad8d215b7e9930ded114b" checksum_method="MD5" filename="Human_parvovirus_B19_entryupload.embl.gz" filetype="flatfile"/>
      </FILES>
   </ANALYSIS>
</ANALYSIS_SET>

In this case there is no ERT number/checklist attribute because no TSV annotation checklist template is being used. Also the file type attribute is different: filetype="flatfile". The title and description can be a brief description of what is presented in the sequence flat file. Make sure to add all your own attributes and field values as the above is only for example purposes.
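If you submit flat files regularly, you may want to generate the analysis XML rather than edit it by hand. The following Python sketch builds the same structure as the example above with the standard library. The function name build_analysis_xml and its parameters are hypothetical. It only covers the elements shown in this module, so check the output against the analysis object described in module 2 before you use it.

```python
import xml.etree.ElementTree as ET


def build_analysis_xml(alias, center_name, title, description,
                       study_accession, filename, md5):
    """Return an ANALYSIS_SET document wrapping one sequence flat file."""
    root = ET.Element("ANALYSIS_SET")
    analysis = ET.SubElement(root, "ANALYSIS",
                             alias=alias, center_name=center_name)
    ET.SubElement(analysis, "TITLE").text = title
    ET.SubElement(analysis, "DESCRIPTION").text = description
    ET.SubElement(analysis, "STUDY_REF", accession=study_accession)
    # SEQUENCE_FLATFILE marks this analysis as a flat file upload.
    analysis_type = ET.SubElement(analysis, "ANALYSIS_TYPE")
    ET.SubElement(analysis_type, "SEQUENCE_FLATFILE")
    files = ET.SubElement(analysis, "FILES")
    ET.SubElement(files, "FILE", checksum=md5, checksum_method="MD5",
                  filename=filename, filetype="flatfile")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

The returned string should be written to analysis.xml with UTF-8 encoding so that it matches the declaration on the first line, for example with open("analysis.xml", "w", encoding="utf-8").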

The submission XML in this example looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION alias="entry_upload_Human_parvovirus_B19" center_name="EBI">
   <ACTIONS>
      <ACTION>
         <ADD source="analysis.xml" schema="analysis"/>
      </ACTION>
   </ACTIONS>
</SUBMISSION>

As in module 2 step 5, the submission XML needs a unique alias for the submission object and a reference to the file containing the analysis object (in this case I called it ‘analysis.xml’).
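The submission XML is small enough to write by hand, but it can also be generated. This is a minimal Python sketch that reproduces the structure of the submission XML above. The function name build_submission_xml is hypothetical, and only the single ADD action from this example is covered.

```python
import xml.etree.ElementTree as ET


def build_submission_xml(alias, center_name, analysis_file="analysis.xml"):
    """Return a SUBMISSION document with one ADD action for an analysis."""
    root = ET.Element("SUBMISSION", alias=alias, center_name=center_name)
    actions = ET.SubElement(root, "ACTIONS")
    action = ET.SubElement(actions, "ACTION")
    # `source` names the analysis XML file sent alongside this submission.
    ET.SubElement(action, "ADD", source=analysis_file, schema="analysis")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_submission_xml("entry_upload_Human_parvovirus_B19", "EBI"))
```

The alias has to be unique among your submissions, so a scripted workflow might derive it from the flat file name, as I did in this example.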

Step 4: Send both XMLs to ENA using REST API

This step is the same as module 2 step 6.

Use cURL or the web form to send the XMLs to ENA and register the flat file submission. Use the test server first. If the submission succeeds and you are happy with the receipt, proceed to submit to the production server.

In this example I obtained the following receipt:
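As a rough sketch of the cURL route, the Python function below only assembles the cURL argument list and does not send anything. Several details are assumptions to check against module 2 step 6: that the two files are posted as multipart form fields named SUBMISSION and ANALYSIS, and that your Webin user name and password are passed with cURL's -u option. The function name build_curl_command is hypothetical, and the server address is deliberately left as a parameter because it is not given in this module.

```python
def build_curl_command(submit_url, username, password,
                       submission_xml="submission.xml",
                       analysis_xml="analysis.xml"):
    """Return the cURL argument list for posting both XML files."""
    return [
        "curl",
        "-u", f"{username}:{password}",         # assumed: HTTP basic auth
        "-F", f"SUBMISSION=@{submission_xml}",  # assumed form field name
        "-F", f"ANALYSIS=@{analysis_xml}",      # assumed form field name
        submit_url,                             # test server address first
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) once you have confirmed the test server address. Keeping the address as an argument makes it easy to repeat the identical call against the production server afterwards.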

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2017-05-08T12:51:53.601+01:00" submissionFile="submission.xml" success="true">
   <ANALYSIS accession="ERZ408000" alias="Human_parvovirus_B19_entryupload" status="PRIVATE" />
   <SUBMISSION accession="ERA911540" alias="entry_upload" />
   <ACTIONS>ADD</ACTIONS>
</RECEIPT>

In this example the analysis received accession ERZ408000 and the submission received accession ERA911540. You will not need the submission accession, but the analysis accession may be useful if you need to enquire about the progress of the submission. After the sequence entries are processed they will be accessioned, and you will receive the accession (or accession range if the flat file contained multiple sequences) at the email address registered with your Webin account. Do not quote the analysis accession in any publication. Always quote the sequence accessions, which arrive later by email. You can also quote the project/study accession, especially if you have used the project to group several submissions across different domains.
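If you script your submissions, the receipt can be checked automatically. The Python sketch below reads the success attribute and the accessions from a receipt laid out like the one above. The function name parse_receipt is hypothetical, and only the elements that appear in this example receipt are handled.

```python
import xml.etree.ElementTree as ET


def parse_receipt(receipt_xml):
    """Extract the success flag and accessions from a receipt string."""
    root = ET.fromstring(receipt_xml.encode("utf-8"))
    submission = root.find("SUBMISSION")
    return {
        # The success attribute is the text "true" or "false".
        "success": root.get("success") == "true",
        # Map each analysis alias to the accession it received.
        "analyses": {a.get("alias"): a.get("accession")
                     for a in root.findall("ANALYSIS")},
        "submission": None if submission is None
        else submission.get("accession"),
    }
```

For the receipt above, the analyses entry would map Human_parvovirus_B19_entryupload to ERZ408000. Keep this analysis accession with your records in case you need to ask about the progress of the submission.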