Accepted Genome Assembly Data Formats
Introduction
The advice here is appropriate for submission of complete or near-complete replicons, including plasmids, organelles, complete viral genomes, viral segments/replicons, bacteriophages, prokaryotic and eukaryotic genomes. Chromosomes include organelles (e.g. mitochondrion and chloroplast), plasmids and viral segments.
Genome assembly data files might contain:
Contig sequences
Scaffold sequences
Chromosome sequences
Unlocalised sequences
Functional annotation
Submission of single records that represent all of the unplaced scaffolds, or all of the scaffolds that belong to a particular chromosome but are not localized to a specific position on the chromosome, are not accepted. These records do not represent biological objects and should therefore be split into individual records for each scaffold.
You can use the below file formats to submit genome assemblies. Follow the links to learn more about formatting them:
FASTA File: Unannotated assemblies should be submitted as a FASTA file
Flat File: Annotated assemblies must be submitted as an EMBL flat file
AGP File: The assembly of scaffolds or chromosomes from contigs can be described using an AGP file
Chromosome List File: Must be provided when the submission contains assembled chromosomes
Unlocalised List File: Should be provided when the submission contains chromosomes with unlocalised sequences
Please note that all data files must be compressed with GZIP.
Some additional information is provided in the appendices:
FASTA File
Unannotated sequences should be submitted as a FASTA file. These sequences can be either contig or chromosome sequences. The FASTA format consists of two lines per record, the first being a sequence identifier and the second being the sequence itself. Ensure the sequence contains only valid nucleotide characters and no whitespace or newline characters.
Flat File
Annotated sequences can only be submitted in the EMBL flat file format. For the full range of features and qualifiers available for flat files and their expected content, please see WebFeat. The complete flatfile manual is available here.
The feature table annotation must conform to the INSDC Feature Table Definition.
Some tools to help you create flat files are described in our Third Party Tools page.
Chromosome List File
The chromosome list file must be provided when the submission contains assembled chromosomes.
The file is a tab separated text file (USASCII7) up to four columns. An example chromosome list file, describing a eukaryote with four linear nuclear chromosomes and one linear mitochondrial chromosomes:
chr01 1 Linear-Chromosome
chr02 2 Linear-Chromosome
chr03 3 Linear-Chromosome
chr04 4 Linear-Chromosome
chrMi MIT Linear-Chromosome Mitochondrion
Please read on for information on the content of the chromosome list file columns
OBJECT_NAME (first column): The unique sequence name, matching with the sequence name in your FASTA file (‘>’ line) or EMBL flat file (‘AC * ‘ line).
CHROMOSOME_NAME (second column): The chromosome name. The value will appear as the /chromosome, /plasmid or /segment qualifier in the EMBL-Bank flat files. Names must:
match the pattern: ^[A-Za-z0-9][A-Za-z0-9_#-.]*$
be shorter than 33 characters
be unique within an assembly
not contain any of the following as part of their name (case insensitive):
‘chr’
‘chrm’
‘chrom’
‘chromosome’
‘linkage group’
‘linkage-group’
‘linkage_group’
‘plasmid’
CHROMOSOME_TYPE (third column): Allowed values:
chromosome
plasmid
linkage_group
monopartite
segmented
multipartite
TOPOLOGY (CHROMOSOME_TYPE modifier):
Topology is not a separate column but can be specified as a modifier to the chromosome type
Options are ‘linear’ or ‘circular’, default is linear
Must not conflict with any value specified in flat file
Contigs, scaffolds and transcriptome sequences are always linear: entering ‘circular’ here will be overriden
CHROMOSOME_LOCATION (optional fourth column): By default eukaryotic chromosomes will be assumed to reside in the nucleus and prokaryotic chromosomes and plasmids in the cytoplasm. Allowed values:
Macronuclear
Nucleomorph
Mitochondrion
Kinetoplast
Chloroplast
Chromoplast
Plastid
Virion
Phage
Proviral
Prophage
Viroid
Cyanelle
Apicoplast
Leucoplast
Proplastid
Hydrogenosome
Chromatophore
AGP File
You may use an AGP file to describe the assembly of scaffolds from contigs, or of chromosomes from scaffolds.
AGP files can be validated using the NCBI AGP validator.
The AGP file can also be used to define sequences as unplaced. Unplaced sequences are those known to be part of the assembly, but it is unknown which chromosome they belong to.
Unlocalised List File
This file should be provided when the submission contains chromosomes with unlocalised sequences. Unlocalised sequences are contigs or scaffolds that are associated with a specific chromosome but for which order and orientation is unknown. An example unlocalised list file:
cb25.NA_084 III
cb25.NA_093 III
cb25.NA_108 III
The unlocalised list file is a tab separated text file (USASCII7) containing the following columns:
OBJECT_NAME (first column): the unique sequence name matching a FASTA header or flatfile
AC *
lineCHROMOSOME_NAME (second column): the unique chromosome name associated with this sequence. This must match with a CHROMOSOME_NAME in the chromosome list file.
Appendix: Unique Sequence Names
All sequences within one genome assembly submission must be identified by a unique sequence name provided in the FASTA, AGP or flat files.
It is essential that the sequence names are unique and used consistently between files. For example, the chromosome list file must refer to the chromosome sequences being submitted in FASTA, AGP or flat files using the unique entry name. Similarly, an AGP file must refer to scaffolds or contigs using unique entry names.
FASTA
The sequence name is extracted from the header line starting with >
.
For example, the following sequence has name contig1
:
>contig1
AAACCCGGG...
AGP
The sequence name is extracted from the 1st (object) column.
Flat Files
The sequence name is extracted from the AC *
line . The sequence name must be prefixed with a _
when using the flat file format.
For example, the following sequence has name contig1
:
AC * _contig1
Note that for the AC *
line, the ‘AC’ must be followed by exactly one space, an asterisk (*) character, and then
one more space.
Appendix: Definition of Terms
Term |
Definition |
---|---|
Assembly |
A set of chromosome assemblies, unlocalized and unplaced sequences,
alternate loci and patches that represent a genome.
|
Assembly chain |
The major and minor releases form an assembly chain. For example, the
assembly accession for GRCh37 major release is GCA_000001405.1. The
assembly accession consists of two parts: the assembly chain accession
(GCA_000001405) and the assembly version (.1). The assembly version is
incremented for each minor release while the assembly chain accession
remains unchanged.
|
Chromosome |
An assembled pseudomolecule that represents a biological chromosome.
Most of the chromosome is expected to be represented by sequenced bases,
although some gaps may still be present.
|
Placed sequence |
A sequence that has a known chromosomal location and orientation.
|
Unplaced sequence |
A sequence that is not associated with any specific chromosome.
|
Unlocalised sequence |
A sequence that is associated with a specific chromosome without
being ordered or oriented on that chromosome.
|