General Pathogens Submissions Guide
===================================

.. image:: images/pathogens_logo_1.png
   :width: 400
   :align: center

.. contents::
   :local:
   :depth: 3

Introduction
~~~~~~~~~~~~

This guide provides general information and help for submitting pathogen sequence data to the `European Nucleotide Archive (ENA) `_. All public `INSDC `_ pathogen data will be made available to browse using the `Pathogens Portal `_.

Please see below for a specific guide for submitting pathogen related data. The guide frequently refers to the `ENA Training Modules `_, our general ENA submissions guide.

If you have any queries or require assistance with your submission please contact us at ena-path-collabs@ebi.ac.uk.

.. tip::

   **Looking for something else?**

   For pathogen-specific submissions guidance, please refer to these guides:

   - `ENA SARS-CoV-2 submissions guide `_
   - `Monkeypox virus ENA submissions Guidance `_

   For small-scale SARS-CoV-2 viral data submissions, with no prior knowledge of ENA submission routes, we have developed a drag and drop submissions tool. Please complete the `form `_ if you would like to submit your data using this route.

Getting Started
~~~~~~~~~~~~~~~

Register a submission account
`````````````````````````````

Before you can submit data to the ENA you must `register a Webin submission account `_. Please navigate to the `Webin Portal `_ and click the ‘Register’ button and complete the registration form.

The ENA Metadata Model
``````````````````````

Before submitting data to ENA, it is important to familiarise yourself with the `ENA metadata model `_ and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.

ENA Submission routes
`````````````````````

ENA allows submissions via three routes, each of which is appropriate for a different set of submission types.
You may be required to use more than one in the process of submitting your data:

- **Interactive Submissions** are completed by filling out web forms directly in your browser and downloading template spreadsheets that can be completed off-line and uploaded to ENA. This is often the most accessible submission route.
- **Command Line Submissions** use our bespoke Webin-CLI program. This validates your submissions entirely before you complete them, allowing you maximum control of the process.
- **Programmatic Submissions** are completed by preparing your submissions as XML documents and either sending them to ENA using a program such as cURL or using the Webin Portal.

The table below outlines what can be submitted through each submission route.

+------------------------+-------------+-----------+--------------+
|                        | Interactive | Webin-CLI | Programmatic |
+------------------------+-------------+-----------+--------------+
| Study                  | **Y**       | N         | **Y**        |
+------------------------+-------------+-----------+--------------+
| Sample                 | **Y**       | N         | **Y**        |
+------------------------+-------------+-----------+--------------+
| Read data              | **Y**       | **Y**     | **Y**        |
+------------------------+-------------+-----------+--------------+
| Genome Assembly        | N           | **Y**     | N            |
+------------------------+-------------+-----------+--------------+
| Transcriptome Assembly | N           | **Y**     | N            |
+------------------------+-------------+-----------+--------------+
| Template Sequence      | N           | **Y**     | N            |
+------------------------+-------------+-----------+--------------+
| Other Analyses         | N           | N         | **Y**        |
+------------------------+-------------+-----------+--------------+

Register Metadata
~~~~~~~~~~~~~~~~~

Register Study
``````````````

Data submissions to the ENA require that you register a study to contextualise and group your data. Details of how to do this can be found in our `Study Registration Guide `_. Please ensure you describe your study adequately, as well as provide an informative title.
Your studies can now be claimed using your ORCID ID and/or assigned a DOI. Please see `here `_ and `here `_ for more information on these options.

Register Samples
````````````````

Having registered a study, please proceed to register your samples. These are metadata objects that describe the source biological material of your experiments. Following this, the sequence data can be registered (as described in later sections).

Instructions for sample registration can be found in our `Sample Registration Guide `_. As part of this process, you must select a sample checklist to describe metadata. If you require any support regarding sample metadata, please contact ena-path-collabs@ebi.ac.uk.

For **interactive submission**, download the sample checklist template from the Webin Portal and, once completed, submit the checklist in **.tsv** format on the Webin Portal to register your samples. See `programmatic sample submission `_ if you are submitting samples programmatically.

Sample checklists
'''''''''''''''''

The following sample checklists contain **mandatory**, *recommended* and optional metadata fields (````), with a description for each field, to help with sample metadata completion. The checklists were agreed by the Genomic Standards Consortium (GSC). In addition to the core checklist for each life domain, the GSC also provides checklist `extensions `_ which may have the metadata field you are looking for.

You can use the `Sample checklists portal `_ to browse all ENA checklists. The pathogen-specific checklists are provided below.

+---------------+--------------------------------------------------------------------------+
| **link**      | **Checklist name**                                                       |
+---------------+--------------------------------------------------------------------------+
| `ERC000028 `_ | ENA prokaryotic pathogen minimal sample checklist                        |
+---------------+--------------------------------------------------------------------------+
| `ERC000029 `_ | ENA Global Microbial Identifier reporting standard checklist GMI_MDM:1.1 |
+---------------+--------------------------------------------------------------------------+
| `ERC000032 `_ | ENA Influenza virus reporting standard checklist                         |
+---------------+--------------------------------------------------------------------------+
| `ERC000033 `_ | ENA virus pathogen reporting standard checklist                          |
+---------------+--------------------------------------------------------------------------+
| `ERC000039 `_ | ENA parasite sample checklist                                            |
+---------------+--------------------------------------------------------------------------+
| `ERC000041 `_ | ENA Global Microbial Identifier Proficiency Test (GMI PT) checklist      |
+---------------+--------------------------------------------------------------------------+

Sample taxonomy
'''''''''''''''

Our `Tips for Sample Taxonomy `_ page provides a helpful guide for choosing the right taxonomy for your pathogen submission.
You can search for suitable taxon IDs and find more information about a taxon ID using the taxonomy API endpoints:

::

   https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/
   https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/
   https://www.ebi.ac.uk/ena/taxonomy/rest/any-name/
   https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/

The strain of a pathogen may be specified using the taxonomy; it may also be specified using the **strain** field in the checklists. Specifying the strain in both places makes it easier to find. The `ENA taxonomy API `_ interface may also be used.

Sample host
'''''''''''

Every pathogen checklist includes host attribute fields which can be used to describe the host. Some guidance on filling in the host fields is provided below. If you have any questions or concerns about pathogen sample metadata, please contact the `helpdesk `_.

Pathogen checklists host fields:

:host taxid: NCBI taxon id of the host, e.g. 9606
:host health state: health status of the host at the time of sample collection
:host scientific name: scientific name of the natural (as opposed to laboratory) host to the organism from which the sample was obtained.
:lab_host: scientific name of the laboratory host used to propagate the source organism from which the sample was obtained. The EBI `cell line ontology `_ may be used to find the name for the host cell line

Submit Runs
~~~~~~~~~~~

After registering your study and samples, you can submit your read files along with experimental (library-related) metadata. See our `Read Submission Guide `_ for detailed instructions on submitting reads.

We encourage submissions to include information on specific protocols used for the experiment. This should be provided in the library description, for example as the name and/or URL of a specific protocol. View our listing of the available `full experimental metadata dictionaries `_.

.. note::

   Reads submitted to ENA should not contain human identifiable reads. Please filter out human reads prior to submission; if required, `here `_ is a tool which can be used.

Submit Assembled Sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~

The instructions below provide a quick guide to submitting a completed isolate pathogen genome assembly. This type of submission is classed as 'clone or isolate' **ASSEMBLY_TYPE** for the ENA submissions services. For submission of other types of nucleotide assembly data, please see the submission options `here `_. For submission of targeted sequences, please refer to the `targeted sequence submissions guide `_.

For genome assembly submission, Webin-CLI (command line interface) needs to be used. The guide for downloading and using Webin-CLI is `here `_.

.. admonition:: A note on assembly levels

   This guide includes chromosome list file examples which are used for a **chromosome** level assembly. Note that ‘chromosome’ should here be understood as a general term for a range of complete replicons, including chromosomes of eukaryotes, prokaryotes, and viruses, as well as organellar chromosomes and plasmids. All of these may be submitted within the same chromosome-level assembly. If your assembly is not completed, you can submit a **contig** or **scaffold** level assembly. Please refer to the explainer about assembly levels `here `_.

Prepare files
`````````````

Assembly file
'''''''''''''

The accepted format for an unannotated genome assembly is **fasta**; for an annotated genome assembly, the accepted format is **embl flat file**. Please refer to the `Accepted genome assembly data formats guide `_ for information about preparing these files.

Manifest file
'''''''''''''

The manifest file is a tab-separated .txt file for Webin-CLI assembly submission. It specifies metadata about the assembly, including the study and sample it is linked to. Please refer to the `assembly manifest file guide `_ for permitted values.
For example, the following manifest file represents a genome assembly consisting of contigs provided in one fasta file:

::

   STUDY           TODO
   SAMPLE          TODO
   ASSEMBLYNAME    TODO
   ASSEMBLY_TYPE   clone or isolate
   COVERAGE        TODO
   PROGRAM         TODO
   PLATFORM        TODO
   MINGAPLENGTH    optional
   MOLECULETYPE    genomic DNA
   DESCRIPTION     optional
   RUN_REF         optional
   FASTA           genome.fasta.gz

Chromosome list file
''''''''''''''''''''

The **chromosome list file** must be provided when the submission contains assembled chromosomes. This is a tab-separated file of up to four columns. Each row describes one replicon unit within the assembly. Please refer to the `chromosome list file guide `_ for permitted values.

.. tabs::

   .. tab:: Viruses

      By default the chromosome **TOPOLOGY** will be assumed to be linear; if the topology is circular, it must be specified.

      .. code:: none

         chr01   1   Monopartite

      .. code:: none

         chr01   1   circular-Monopartite   viroid

      .. code:: none

         chr01   1   Multipartite
         chr02   2   Multipartite

   .. tab:: Bacteria

      By default prokaryotic chromosomes and plasmids will be assumed to reside in the cytoplasm; however, the 'plasmid' **CHROMOSOME_LOCATION** may be specified. By default the **TOPOLOGY** will be assumed to be linear, so in this example the circular topology was specified.

      .. code:: none

         chr01   1   circular-Chromosome
         chr02   2   circular-Chromosome   plasmid
         chr03   3   circular-Chromosome   plasmid

   .. tab:: Eukaryota

      By default eukaryotic chromosomes will be assumed to reside in the nucleus. By default the chromosome **TOPOLOGY** will be assumed to be linear, but it may also be specified.

      .. code:: none

         chr01   1     Linear-Chromosome
         chr02   2     Linear-Chromosome
         chr03   3     Linear-Chromosome
         chr04   4     Linear-Chromosome
         chrMi   MIT   Linear-Chromosome   Mitochondrion

If there are sequences that are associated with a specific chromosome, but whose order and orientation are unknown, you can also add an `unlocalised list file `_ to the submission. Alternatively, an `AGP file `_ may also be submitted to define unplaced sequences.
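
When many assemblies are submitted, the manifest and chromosome list files described above can be generated with a short script rather than written by hand. The following is a minimal sketch only and is not part of Webin-CLI: the function names ``format_manifest`` and ``format_chromosome_list`` are hypothetical, all values are placeholders, and the column order (object name, chromosome name, chromosome type, optional location) is assumed from the examples in this guide. Please check the manifest and chromosome list file guides for the permitted values.

```python
"""Sketch: build the text of a manifest file and a chromosome list file."""


def format_manifest(fields):
    """Return manifest text with one 'FIELD<TAB>value' pair per line."""
    return "".join(f"{name}\t{value}\n" for name, value in fields.items())


def format_chromosome_list(replicons):
    """Return chromosome list text with one tab-separated row per replicon.

    Each replicon is a tuple of three or four strings. The assumed column
    order is: object name, chromosome name, chromosome type and, optionally,
    chromosome location.
    """
    rows = []
    for replicon in replicons:
        if len(replicon) not in (3, 4):
            raise ValueError(
                f"expected 3 or 4 columns, got {len(replicon)}: {replicon!r}"
            )
        rows.append("\t".join(replicon) + "\n")
    return "".join(rows)


if __name__ == "__main__":
    # Placeholder values only; replace TODO with your own accessions and names.
    manifest = format_manifest({
        "STUDY": "TODO",
        "SAMPLE": "TODO",
        "ASSEMBLYNAME": "TODO",
        "ASSEMBLY_TYPE": "clone or isolate",
        "COVERAGE": "TODO",
        "PROGRAM": "TODO",
        "PLATFORM": "TODO",
        "MOLECULETYPE": "genomic DNA",
        "FASTA": "genome.fasta.gz",
    })
    # Rows mirror the bacterial example: one chromosome and one plasmid.
    chromosome_list = format_chromosome_list([
        ("chr01", "1", "circular-Chromosome"),
        ("chr02", "2", "circular-Chromosome", "plasmid"),
    ])
    print(manifest, end="")
    print(chromosome_list, end="")
```

The functions return text instead of writing files, so the output can be inspected before it is saved under whatever file names your manifest refers to.
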

Webin-CLI submission
````````````````````

When you have prepared your files, including the assembly, the manifest file and any additional files for higher-level assemblies, you can validate and test your submission using the Webin-CLI ``-validate`` flag. When you are ready to submit the assembly, you can use the ``-submit`` flag.

**Webin-CLI validate command:**

::

   java -jar webin-cli-6.4.0.jar -userName Webin-XXXX -password XXXX -context genome -manifest manifest.txt -validate

Data Release and Citing
~~~~~~~~~~~~~~~~~~~~~~~

Once the data is submitted, it will take some time to be processed and archived. If your data is set to public, it will be made public and accessible from the Pathogens Portal.

For more information about data release, please see the following pages:

- `Data Release Policies `_
- `Accession numbers `_
- `Citing and Orcid data claiming `_