General Pathogens Submissions Guide

Introduction 

This guide provides general information and help for submitting pathogen sequence data to the European Nucleotide Archive (ENA). All public INSDC pathogen data will be made available to browse using the Pathogens Portal.

Please see below for a specific guide for submitting pathogen-related data. The guide frequently refers to the ENA Training Modules, our general ENA submissions guide. If you have any queries or require assistance with your submission, please contact us at ena-path-collabs@ebi.ac.uk.

Tip

Looking for something else?

For pathogen-specific submissions guidance, please refer to these guides:

For small-scale SARS-CoV-2 viral data submissions, with no prior knowledge of ENA submission routes, we have developed a drag and drop submissions tool. Please complete the form if you would like to submit your data using this route.

Getting Started 

Register a submission account 

Before you can submit data to the ENA you must register a Webin submission account.

Please navigate to the Webin Portal, click the ‘Register’ button and complete the registration form.

The ENA Metadata Model 

Before submitting data to ENA, it is important to familiarise yourself with the ENA metadata model and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.

The ENA would like to introduce you to our very first tweetorial! For this #tweetorial, we will be explaining the ENA Metadata Model. When submitting data to the ENA, you need to register additional metadata so your submission is in accordance with FAIR data principles.
European Nucleotide Archive (ENA) (@ENASequence), April 13, 2022

ENA Submission routes 

ENA allows submissions via three routes, each of which is appropriate for a different set of submission types. You may be required to use more than one in the process of submitting your data:

Interactive Submissions are completed by filling out web forms directly in your browser and downloading template spreadsheets that can be completed off-line and uploaded to ENA. This is often the most accessible submission route.
Command Line Submissions use our bespoke Webin-CLI program. This validates your submissions entirely before you complete them, allowing you maximum control of the process.
Programmatic Submissions are completed by preparing your submissions as XML documents and either sending them to ENA using a program such as cURL or using the Webin Portal.

The table below outlines what can be submitted through each submission route.

	Interactive	Webin-CLI	Programmatic
Study	Y	N	Y
Sample	Y	N	Y
Read data	Y	Y	Y
Genome Assembly	N	Y	N
Transcriptome Assembly	N	Y	N
Template Sequence	N	Y	N
Other Analyses	N	N	Y
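
As a quick way to check which route applies to a given data type, the table above can be transcribed into a small lookup. This is only an illustrative sketch: the names ROUTE_SUPPORT and routes_for are hypothetical and are not part of any ENA tool.

```python
# Submission-route support, transcribed from the table above
# (True = Y, False = N). ROUTE_SUPPORT and routes_for are
# illustrative names, not part of any ENA software.
ROUTE_SUPPORT = {
    "Study": {"Interactive": True, "Webin-CLI": False, "Programmatic": True},
    "Sample": {"Interactive": True, "Webin-CLI": False, "Programmatic": True},
    "Read data": {"Interactive": True, "Webin-CLI": True, "Programmatic": True},
    "Genome Assembly": {"Interactive": False, "Webin-CLI": True, "Programmatic": False},
    "Transcriptome Assembly": {"Interactive": False, "Webin-CLI": True, "Programmatic": False},
    "Template Sequence": {"Interactive": False, "Webin-CLI": True, "Programmatic": False},
    "Other Analyses": {"Interactive": False, "Webin-CLI": False, "Programmatic": True},
}


def routes_for(data_type: str) -> list[str]:
    """Return the submission routes that accept the given data type."""
    support = ROUTE_SUPPORT[data_type]
    return [route for route, accepted in support.items() if accepted]


print(routes_for("Genome Assembly"))  # prints ['Webin-CLI']
```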

Register Metadata 

Register Study 

Data submissions to the ENA require that you register a study to contextualise and group your data. Details of how to do this can be found in our Study Registration Guide. Please ensure you describe your study adequately, as well as provide an informative title.

Your studies can now be claimed using your ORCID ID and/or assigned a DOI. Please see here and here for more information on these options.

Register Samples 

Having registered a study, please proceed to register your samples. These are metadata objects that describe the source biological material of your experiments. Following this, the sequence data can be registered (as described in later sections).

Instructions for sample registration can be found in our Sample Registration Guide. As part of this process, you must select a sample checklist to describe metadata. If you require any support regarding sample metadata, please contact ena-path-collabs@ebi.ac.uk.

For interactive submission, download the sample checklist template from the Webin Portal and, once completed, submit the checklist in .tsv format on the Webin Portal to register your samples. See programmatic sample submission if you are submitting samples programmatically.

Sample checklists 

The following Sample checklists contain mandatory, recommended and optional metadata fields (<SAMPLE_ATTRIBUTE>), with a description for each field, to help with sample metadata completion. The checklists were agreed by the Genomic Standards Consortium (GSC). In addition to the core checklist for each life domain, the GSC also provides checklist extensions which may have the metadata field you are looking for.

You can use the Sample checklists portal to browse all ENA checklists. The pathogen-specific checklists are provided below.

Checklist accession	Checklist name
ERC000028	ENA prokaryotic pathogen minimal sample checklist
ERC000029	ENA Global Microbial Identifier reporting standard checklist GMI_MDM:1.1
ERC000032	ENA Influenza virus reporting standard checklist
ERC000033	ENA virus pathogen reporting standard checklist
ERC000039	ENA parasite sample checklist
ERC000041	ENA Global Microbial Identifier Proficiency Test (GMI PT) checklist

Sample taxonomy 

Our Tips for Sample Taxonomy page provides a helpful guide for choosing the right taxonomy for your pathogen submission.

You can search for suitable taxon IDs and find more information about a taxon ID using the taxonomy API endpoints:

https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/
https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/
https://www.ebi.ac.uk/ena/taxonomy/rest/any-name/
https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/

The strain of a pathogen may be specified using the taxonomy or using the strain field in the checklists. Specifying the strain in both places will make it easier to find.

The ENA taxonomy API interface may also be used.
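
The endpoints listed above can also be queried from a script. The following is a minimal sketch using only the Python standard library. It assumes that the search term is appended to the endpoint path and that the service returns JSON; the helper names taxonomy_url and query_taxonomy are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest"
# Endpoint names taken from the list above.
ENDPOINTS = ("suggest-for-submission", "scientific-name", "any-name", "tax-id")


def taxonomy_url(endpoint: str, query: str) -> str:
    """Build the request URL, percent-encoding the search term."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown taxonomy endpoint: {endpoint}")
    # Assumption: the search term is appended to the endpoint path.
    return f"{BASE_URL}/{endpoint}/{urllib.parse.quote(query)}"


def query_taxonomy(endpoint: str, query: str, timeout: float = 30.0):
    """Fetch and decode the response (assumed to be JSON)."""
    request = urllib.request.Request(
        taxonomy_url(endpoint, query),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the URL; call query_taxonomy() to contact the service.
    print(taxonomy_url("tax-id", "9606"))
```

For example, query_taxonomy("scientific-name", "Homo sapiens") would request the record for that name, provided the service is reachable.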

Sample host 

Every pathogen checklist includes host attribute fields which can be used to describe the host. Some guidance on filling in the host fields is provided here. If you have any questions or concerns about pathogen sample metadata, please contact the helpdesk.

Pathogen checklists host fields:

host taxid: NCBI taxon id of the host, e.g. 9606
host health state: health status of the host at the time of sample collection
host scientific name: scientific name of the natural (as opposed to laboratory) host to the organism from which the sample was obtained
lab_host: scientific name of the laboratory host used to propagate the source organism from which the sample was obtained. The EBI cell line ontology may be used to find the name for the host cell line.
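
Before filling in a checklist, it can help to sanity-check the host values in a script. The sketch below is purely illustrative: the function check_host_fields and its two rules (a numeric host taxid, no empty host values) are assumptions made for this example and do not reproduce ENA's own validation.

```python
# Host field names as listed above.
HOST_FIELDS = ("host taxid", "host health state", "host scientific name", "lab_host")


def check_host_fields(sample: dict) -> list[str]:
    """Return a list of problems found in the host fields of one sample.

    The rules here are illustrative assumptions, not ENA's validation.
    """
    problems = []
    taxid = str(sample.get("host taxid", "")).strip()
    # An NCBI taxon id is a number, e.g. 9606.
    if taxid and not taxid.isdigit():
        problems.append("host taxid should be a numeric NCBI taxon id, e.g. 9606")
    # Flag host fields that were included but left blank.
    for field in HOST_FIELDS:
        if field in sample and not str(sample[field]).strip():
            problems.append(f"{field} is present but empty")
    return problems


sample = {"host taxid": "9606", "host scientific name": "Homo sapiens"}
print(check_host_fields(sample))  # prints []
```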

Submit Runs 

After registering your study and samples, you can submit your read files along with experimental (library-related) metadata. See our Read Submission Guide for detailed instructions on submitting reads.

We encourage submissions to include information on specific protocols used for the experiment. This should be provided in the library description. This can be, for example, the name and/or URL to a specific protocol. View our listing of the available full experimental metadata dictionaries.

Note

Reads submitted to ENA should not contain identifiable human reads. Please filter out human reads prior to submission - if required, here is a tool which can be used.

Submit Assembled Sequences 

The instructions below provide a quick guide to submitting a completed isolate pathogen genome assembly. This type of submission is classed as ‘clone or isolate’ ASSEMBLY_TYPE for the ENA submissions services. For submission of other types of nucleotide assembly data, please see the submission options here. For submission of targeted sequences, please refer to the targeted sequence submissions guide.

Genome assemblies must be submitted using Webin-CLI (command line interface). The guide for downloading and using Webin-CLI is here.

A note on assembly levels

This guide includes chromosome list file examples which are used for a chromosome level assembly. Note that ‘chromosome’ should here be understood as a general term for a range of complete replicons, including chromosomes of eukaryotes, prokaryotes, and viruses, as well as organellar chromosomes and plasmids. All of these may be submitted within the same chromosome-level assembly.

If your assembly is not completed, you can submit a contig or scaffold level assembly. Please refer to the explainer about assembly levels here.

Prepare files 

Assembly file 

The accepted format for an unannotated genome assembly is fasta; for an annotated genome assembly, the accepted format is an embl flat file. Please refer to the Accepted genome assembly data formats guide for information about preparing these files.

Manifest file 

The manifest file is a tab-separated .txt file for Webin-CLI assembly submission. It specifies metadata about the assembly, including the study and sample it is linked to. Please refer to the assembly manifest file guide for permitted values.

For example, the following manifest file represents a genome assembly consisting of contigs provided in one fasta file:

STUDY	TODO
SAMPLE	TODO
ASSEMBLYNAME	TODO
ASSEMBLY_TYPE	clone or isolate
COVERAGE	TODO
PROGRAM	TODO
PLATFORM	TODO
MINGAPLENGTH	optional
MOLECULETYPE	genomic DNA
DESCRIPTION	optional
RUN_REF	optional
FASTA	genome.fasta.gz
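
If you prepare many assemblies, a manifest like the one above can be generated from a script. The following is a minimal sketch, assuming one field per line with the field name and value separated by a tab; the helper format_manifest is hypothetical, and the TODO values are placeholders to be replaced with your own accessions and details.

```python
from pathlib import Path


def format_manifest(fields: dict) -> str:
    """Render manifest fields as tab-separated 'NAME<TAB>value' lines.

    Fields whose value is None or an empty string are skipped,
    which is how optional fields are left out in this sketch.
    """
    lines = [
        f"{name}\t{value}"
        for name, value in fields.items()
        if value not in (None, "")
    ]
    return "\n".join(lines) + "\n"


manifest_fields = {
    "STUDY": "TODO",            # placeholder, as in the example above
    "SAMPLE": "TODO",           # placeholder
    "ASSEMBLYNAME": "TODO",     # placeholder
    "ASSEMBLY_TYPE": "clone or isolate",
    "COVERAGE": "TODO",         # placeholder
    "PROGRAM": "TODO",          # placeholder
    "PLATFORM": "TODO",         # placeholder
    "MINGAPLENGTH": None,       # optional, omitted here
    "MOLECULETYPE": "genomic DNA",
    "DESCRIPTION": None,        # optional, omitted here
    "RUN_REF": None,            # optional, omitted here
    "FASTA": "genome.fasta.gz",
}

if __name__ == "__main__":
    # Writes manifest.txt to the current working directory.
    Path("manifest.txt").write_text(format_manifest(manifest_fields), encoding="utf-8")
```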

Chromosome list file 

The chromosome list file must be provided when the submission contains assembled chromosomes. This is a tab-separated file with up to four columns. Each row describes one replicon unit within the assembly. Please refer to the chromosome list file guide for permitted values.

By default the chromosome TOPOLOGY will be assumed to be linear, therefore if the topology is circular, it must be specified.

chr01	1	Monopartite

chr01	1	circular-Monopartite	viroid

chr01	1	Multipartite
chr02	2	Multipartite

By default, prokaryotic chromosomes and plasmids will be assumed to reside in the cytoplasm; however, the ‘plasmid’ CHROMOSOME_LOCATION may be specified. By default the TOPOLOGY will be assumed to be linear, so in this example the circular topology was specified.

chr01	1	circular-Chromosome
chr02	2	circular-Chromosome	plasmid
chr03	3	circular-Chromosome	plasmid

By default eukaryotic chromosomes will be assumed to reside in the nucleus. By default the chromosome TOPOLOGY will be assumed to be linear, but it may also be specified.

chr01	1	Linear-Chromosome
chr02	2	Linear-Chromosome
chr03	3	Linear-Chromosome
chr04	4	Linear-Chromosome
chrMi	MIT	Linear-Chromosome	Mitochondrion

If there are sequences that are associated with a specific chromosome, but whose order and orientation are unknown, you can also add an unlocalised list file to the submission. Alternatively, an AGP file may be submitted to define unplaced sequences.
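
The chromosome list rows shown in the examples above can be produced with a small helper. This is a sketch under stated assumptions: the function chromosome_row and its parameter names are hypothetical, the topology is written as a prefix on the type column as in the examples, and the permitted values should be checked against the chromosome list file guide.

```python
from typing import Optional


def chromosome_row(
    object_name: str,
    chromosome_name: str,
    chromosome_type: str,
    location: Optional[str] = None,
    topology: Optional[str] = None,
) -> str:
    """Build one tab-separated chromosome list row.

    The topology, when given, is written as a prefix on the type,
    e.g. 'circular-Chromosome', following the examples above.
    The location is the optional fourth column, e.g. 'plasmid'.
    """
    type_field = f"{topology}-{chromosome_type}" if topology else chromosome_type
    columns = [object_name, chromosome_name, type_field]
    if location:
        columns.append(location)
    return "\t".join(columns)


# Reproduces the prokaryotic example: one chromosome and two plasmids.
rows = [
    chromosome_row("chr01", "1", "Chromosome", topology="circular"),
    chromosome_row("chr02", "2", "Chromosome", location="plasmid", topology="circular"),
    chromosome_row("chr03", "3", "Chromosome", location="plasmid", topology="circular"),
]
print("\n".join(rows))
```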

Webin-CLI submission 

When you have prepared your files, including the assembly, the manifest file and any additional files for higher-level assemblies, you can validate and test your submission using the Webin-CLI -validate flag. When you are ready to submit the assembly, you can use the -submit flag.

Webin-CLI validate command:

java -jar webin-cli-6.4.0.jar -userName Webin-XXXX -password XXXX -context genome -manifest manifest.txt -validate
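
When validation and submission are run from a pipeline, the same command can be assembled in a script. The sketch below only uses the flags shown in this section (-userName, -password, -context, -manifest, -validate, -submit); the helper names webin_cli_command and run_webin_cli are hypothetical, and the jar file name and credentials are the placeholders from the command above.

```python
import subprocess


def webin_cli_command(
    jar: str,
    username: str,
    password: str,
    manifest: str,
    context: str = "genome",
    submit: bool = False,
) -> list[str]:
    """Assemble the Webin-CLI argument list; validates unless submit=True."""
    return [
        "java", "-jar", jar,
        "-userName", username,
        "-password", password,
        "-context", context,
        "-manifest", manifest,
        "-submit" if submit else "-validate",
    ]


def run_webin_cli(command: list[str]) -> int:
    """Run the command and return its exit code (requires Java and the jar)."""
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    command = webin_cli_command(
        "webin-cli-6.4.0.jar", "Webin-XXXX", "XXXX", "manifest.txt"
    )
    # Printed rather than executed, so the sketch runs without Java installed.
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with passwords or file names that contain spaces.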

Data Release and Citing 

Once the data is submitted, it will take some time to be processed and archived. If your data is set to public, it will be made public and accessible from the Pathogens Portal.

For more information about data release, please see the following pages: