General Pathogens Submissions Guide


Introduction

This guide provides general information and help for submitting pathogen sequence data to the European Nucleotide Archive (ENA). All public INSDC pathogen data will be made available to browse using the Pathogens Portal.

Please see below for a specific guide for submitting pathogen-related data. The guide frequently refers to the ENA Training Modules, our general ENA submissions guide. If you have any queries or require assistance with your submission, please contact us at ena-path-collabs@ebi.ac.uk.

Tip

Looking for something else?

For pathogen-specific submissions guidance, please refer to these guides:

For small-scale SARS-CoV-2 viral data submissions that require no prior knowledge of ENA submission routes, we have developed a drag-and-drop submissions tool. Please complete the form if you would like to submit your data using this route.

Getting Started

Register a submission account

Before you can submit data to the ENA you must register a Webin submission account.

Please navigate to the Webin Portal, click the ‘Register’ button, and complete the registration form.

The ENA Metadata Model

Before submitting data to ENA, it is important to familiarise yourself with the ENA metadata model and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.

ENA Submission routes

ENA allows submissions via three routes, each of which is appropriate for a different set of submission types. You may be required to use more than one in the process of submitting your data:

  • Interactive Submissions are completed by filling out web forms directly in your browser and downloading template spreadsheets that can be completed off-line and uploaded to ENA. This is often the most accessible submission route.

  • Command Line Submissions use our bespoke Webin-CLI program. This validates your submissions entirely before you complete them, allowing you maximum control of the process.

  • Programmatic Submissions are completed by preparing your submissions as XML documents and either sending them to ENA using a program such as cURL or using the Webin Portal.

The table below outlines what can be submitted through each submission route.

Submission type          Interactive   Webin-CLI   Programmatic
Study                    Y             N           Y
Sample                   Y             N           Y
Read data                Y             Y           Y
Genome Assembly          N             Y           N
Transcriptome Assembly   N             Y           N
Template Sequence        N             Y           N
Other Analyses           N             N           Y

Register Metadata

Register Study

Data submissions to the ENA require that you register a study to contextualise and group your data. Details of how to do this can be found in our Study Registration Guide. Please ensure you describe your study adequately, as well as provide an informative title.

Your studies can now be claimed using your ORCID ID and/or assigned a DOI. Please see here and here for more information on these options.

Register Samples

Having registered a study, please proceed to register your samples. These are metadata objects that describe the source biological material of your experiments. Following this, the sequence data can be registered (as described in later sections).

Instructions for sample registration can be found in our Sample Registration Guide. As part of this process, you must select a sample checklist to describe metadata. If you require any support regarding sample metadata, please contact ena-path-collabs@ebi.ac.uk.

For interactive submission, download the sample checklist template from the Webin Portal and, once it is completed, upload it in .tsv format on the Webin Portal to register your samples. If you are submitting samples programmatically, see programmatic sample submission.

Sample checklists

The following Sample checklists contain mandatory, recommended and optional metadata fields (<SAMPLE_ATTRIBUTE>), with a description for each field, to help with sample metadata completion. The checklists were agreed by the Genomic Standards Consortium (GSC). In addition to the core checklist for each life domain, the GSC also provides checklist extensions which may have the metadata field you are looking for.

You can use the Sample checklists portal to browse all ENA checklists. The pathogen specific checklists are provided below.

Checklist   Checklist name
ERC000028   ENA prokaryotic pathogen minimal sample checklist
ERC000029   ENA Global Microbial Identifier reporting standard checklist GMI_MDM:1.1
ERC000032   ENA Influenza virus reporting standard checklist
ERC000033   ENA virus pathogen reporting standard checklist
ERC000039   ENA parasite sample checklist
ERC000041   ENA Global Microbial Identifier Proficiency Test (GMI PT) checklist

Sample taxonomy

Our Tips for Sample Taxonomy page provides a helpful guide for choosing the right taxonomy for your pathogen submission.

You can search for suitable taxon IDs and find more information about a taxon ID using the taxonomy API endpoints:

https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/
https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/
https://www.ebi.ac.uk/ena/taxonomy/rest/any-name/
https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/
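As a rough illustration of how the endpoints listed above might be called from a script, the sketch below builds a lookup URL and, optionally, fetches the raw response. It assumes that the search term (a name or a taxon ID) is appended to the endpoint path; the function names `taxonomy_url` and `fetch` are hypothetical, and the response format is not assumed, so the body is returned as plain text.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://www.ebi.ac.uk/ena/taxonomy/rest"
ENDPOINTS = ("suggest-for-submission", "scientific-name", "any-name", "tax-id")


def taxonomy_url(endpoint, query):
    """Build a lookup URL for one of the endpoints listed in this guide.

    Assumption: the search term is appended to the endpoint path.
    """
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    # Percent-encode the term so names containing spaces are safe in a URL path.
    return f"{BASE}/{endpoint}/{quote(str(query), safe='')}"


def fetch(endpoint, query):
    """Return the raw response body as text (requires network access)."""
    with urlopen(taxonomy_url(endpoint, query), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the URLs only; call fetch() to query the service.
    print(taxonomy_url("tax-id", 9606))
    print(taxonomy_url("scientific-name", "Homo sapiens"))
```

Keeping URL construction separate from the network call makes it easy to check the request before sending it.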

The strain of a pathogen may be specified through the taxonomy or through the strain field in the checklists. Specifying it in both places will make your strain easier to find.

The ENA taxonomy API interface may also be used.

Sample host

Every pathogen checklist includes host attribute fields which can be used to describe the host. Some guidance on completing these fields is provided below. If you have any questions or concerns about pathogen sample metadata, please contact the helpdesk.

Pathogen checklists host fields:

  • host taxid: NCBI taxon ID of the host, e.g. 9606.

  • host health state: Health status of the host at the time of sample collection.

  • host scientific name: Scientific name of the natural (as opposed to laboratory) host to the organism from which the sample was obtained.

  • lab_host: Scientific name of the laboratory host used to propagate the source organism from which the sample was obtained. The EBI cell line ontology may be used to find the name for the host cell line.

Submit Runs

After registering your study and samples, you can submit your read files along with experimental (library-related) metadata. See our Read Submission Guide for detailed instructions on submitting reads.

We encourage submissions to include information on specific protocols used for the experiment. This should be provided in the library description. This can be, for example, the name and/or URL to a specific protocol. View our listing of the available full experimental metadata dictionaries.

Note

Reads submitted to ENA should not contain human-identifiable reads. Please filter out human reads prior to submission. If required, here is a tool which can be used.

Submit Assembled Sequences

The instructions below provide a quick guide to submitting a completed isolate pathogen genome assembly. This type of submission is classed as ‘clone or isolate’ ASSEMBLY_TYPE for the ENA submissions services. For submission of other types of nucleotide assembly data, please see the submission options here. For submission of targeted sequences, please refer to the targeted sequence submissions guide.

Genome assemblies must be submitted using Webin-CLI (the command line interface). The guide for downloading and using Webin-CLI is here.

A note on assembly levels

This guide includes chromosome list file examples which are used for a chromosome level assembly. Note that ‘chromosome’ should here be understood as a general term for a range of complete replicons, including chromosomes of eukaryotes, prokaryotes, and viruses, as well as organellar chromosomes and plasmids. All of these may be submitted within the same chromosome-level assembly.

If your assembly is not complete, you can submit a contig or scaffold level assembly. Please refer to the explainer about assembly levels here.

Prepare files

Assembly file

The accepted format for an unannotated genome assembly is FASTA; for an annotated genome assembly, it is an EMBL flat file. Please refer to the Accepted genome assembly data formats guide for information about preparing these files.

Manifest file

The manifest file is a tab-separated .txt file for Webin-CLI assembly submission. It specifies metadata about the assembly, including the study and sample it is linked to. Please refer to the assembly manifest file guide for permitted values.

For example, the following manifest file represents a genome assembly consisting of contigs provided in one fasta file:

STUDY   TODO
SAMPLE   TODO
ASSEMBLYNAME   TODO
ASSEMBLY_TYPE   clone or isolate
COVERAGE   TODO
PROGRAM   TODO
PLATFORM   TODO
MINGAPLENGTH   optional
MOLECULETYPE   genomic DNA
DESCRIPTION   optional
RUN_REF   optional
FASTA   genome.fasta.gz
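Because the manifest is a tab-separated file, it can be easier to generate it from a script than to type the tabs by hand. The following is a minimal sketch, not an official tool: the helper names `manifest_text` and `write_manifest` are hypothetical, the field names are taken from the example above, and every value in the usage section is a placeholder to be replaced with your own.

```python
from pathlib import Path


def manifest_text(fields):
    """Return manifest content with one 'NAME<TAB>value' line per field.

    Fields whose value is None or an empty string are skipped, which is
    convenient for the optional fields.
    """
    lines = [
        f"{name}\t{value}"
        for name, value in fields.items()
        if value not in (None, "")
    ]
    return "\n".join(lines) + "\n"


def write_manifest(path, fields):
    """Write the manifest to disk as UTF-8 text."""
    Path(path).write_text(manifest_text(fields), encoding="utf-8")


if __name__ == "__main__":
    # All values below are placeholders, mirroring the example manifest.
    example = {
        "STUDY": "TODO",
        "SAMPLE": "TODO",
        "ASSEMBLYNAME": "TODO",
        "ASSEMBLY_TYPE": "clone or isolate",
        "COVERAGE": "TODO",
        "PROGRAM": "TODO",
        "PLATFORM": "TODO",
        "MOLECULETYPE": "genomic DNA",
        "DESCRIPTION": "",  # optional, omitted from the output when empty
        "FASTA": "genome.fasta.gz",
    }
    print(manifest_text(example), end="")
```

The sketch does not check values against the permitted lists; Webin-CLI validation remains the authoritative check.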

Chromosome list file

The chromosome list file must be provided when the submission contains assembled chromosomes. This is a tab-separated file of up to four columns. Each row describes one replicon unit within the assembly. Please refer to the chromosome list file guide for permitted values.

By default, the chromosome topology is assumed to be linear. If the topology is circular, it must be specified explicitly.

chr01   1   Monopartite
chr01   1   circular-Monopartite   viroid
chr01   1   Multipartite
chr02   2   Multipartite
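The rows above can also be produced programmatically, which helps to keep the columns tab-separated and the topology prefix consistent. This is a sketch under stated assumptions: the helper name `chromosome_row` and its parameter names are hypothetical, the `circular-` prefix follows the example above, and the trailing value (such as `viroid`) is treated as an optional fourth column. Permitted values are not checked here; see the chromosome list file guide for those.

```python
def chromosome_row(object_name, chromosome_name, chromosome_type,
                   topology="linear", location=None):
    """Return one tab-separated chromosome list row.

    Linear topology is the default, so the type is written without a
    prefix; a circular replicon gets the 'circular-' prefix.
    """
    if topology not in ("linear", "circular"):
        raise ValueError(f"unsupported topology: {topology}")
    if topology == "circular":
        chromosome_type = f"circular-{chromosome_type}"
    columns = [object_name, str(chromosome_name), chromosome_type]
    # Assumption: the optional fourth column holds the location value.
    if location:
        columns.append(location)
    return "\t".join(columns)


if __name__ == "__main__":
    rows = [
        chromosome_row("chr01", 1, "Multipartite"),
        chromosome_row("chr02", 2, "Multipartite"),
    ]
    print("\n".join(rows))
```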

If there are sequences that are associated with a specific chromosome, but their order and orientation are unknown, you can also add an unlocalised list file to the submission. Alternatively, an AGP file may also be submitted to define unplaced sequences.

Webin-CLI submission

When you have prepared your files, including the assembly, the manifest file and any additional files for higher-level assemblies, you can validate and test your submission using the Webin-CLI -validate flag. When you are ready to submit the assembly, use the -submit flag.

Webin-CLI validate command:

java -jar webin-cli-6.4.0.jar -userName Webin-XXXX -password XXXX -context genome -manifest manifest.txt -validate
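If you run Webin-CLI from a script, one possible approach is to build the argument list once and switch between the -validate and -submit flags. The sketch below uses only the options shown in the command above; the helper names `webin_cli_command` and `run_webin_cli` are hypothetical, and it assumes `java` is on your PATH and the jar file is in the working directory.

```python
import subprocess


def webin_cli_command(jar, user, password, manifest, submit=False):
    """Build the Webin-CLI argument list for a genome assembly.

    Validation is the default; pass submit=True to use -submit instead.
    """
    return [
        "java", "-jar", jar,
        "-userName", user,
        "-password", password,
        "-context", "genome",
        "-manifest", manifest,
        "-submit" if submit else "-validate",
    ]


def run_webin_cli(command):
    """Run the command and return its exit code (requires Java and the jar)."""
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Placeholder credentials, as in the command above; nothing is executed.
    cmd = webin_cli_command("webin-cli-6.4.0.jar", "Webin-XXXX", "XXXX", "manifest.txt")
    print(" ".join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if the password or file paths contain spaces.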

Data Release and Citing

Once the data is submitted, it will take some time to be processed and archived. If your data is set to public, it will be made public and accessible from the Pathogens Portal.

For more information about data release, please see the following pages: