Accepted Genome Assembly Data Formats
The advice here is appropriate for submission of complete or near-complete replicons, including plasmids, organelles, complete viral genomes, viral segments/replicons, bacteriophages, prokaryotic and eukaryotic genomes. Chromosomes include organelles (e.g. mitochondrion and chloroplast), plasmids and viral segments.
Genome assembly data files might contain:
Single records that represent all of the unplaced scaffolds, or all of the scaffolds that belong to a particular chromosome but are not localised to a specific position on that chromosome, are not accepted. These records do not represent biological objects and should therefore be split into individual records, one for each scaffold.
You can use the file formats below to submit genome assemblies. Each is described in more detail in the sections that follow:
FASTA File: Unannotated assemblies should be submitted as a FASTA file
Flat File: Annotated assemblies must be submitted as an EMBL flat file
AGP File: The assembly of scaffolds or chromosomes from contigs can be described using an AGP file
Chromosome List File: Must be provided when the submission contains assembled chromosomes
Unlocalised List File: Should be provided when the submission contains chromosomes with unlocalised sequences
Please note that all data files must be compressed with GZIP.
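As an illustration of the compression step, a minimal Python sketch using only the standard library is shown here. The function name gzip_file and the choice to write the compressed copy next to the original file are assumptions made for this example, not part of the submission requirements.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a gzip-compressed copy of `source` next to it and return its path."""
    # Hypothetical naming choice: append ".gz" to the original file name.
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        # Stream the bytes so large assemblies are not loaded into memory.
        shutil.copyfileobj(src, dst)
    return target
```

The same function could be applied to each FASTA, flat, AGP or list file before upload.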
Some additional information is provided in the appendices: Unique Sequence Names and Definition of Terms.
FASTA File
Unannotated sequences should be submitted as a FASTA file. These sequences can be either contig or chromosome sequences. Each FASTA record consists of two lines: the first is the sequence identifier line (starting with ‘>’) and the second is the sequence itself. Ensure the sequence contains only valid nucleotide characters, with no whitespace or newline characters within it.
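The FASTA requirements above could be checked locally with something like the following Python sketch. It is only an illustration: the function name check_two_line_fasta is hypothetical, and the accepted alphabet (the IUPAC nucleotide codes, in either case) is an assumption, since the exact set of valid characters is not listed here.

```python
import re

# Assumption: "valid nucleotide characters" means the IUPAC nucleotide codes,
# in either case. Adjust this set if the submission service is stricter.
VALID_SEQUENCE = re.compile(r"[ACGTURYSWKMBDHVNacgturyswkmbdhvn]+")


def check_two_line_fasta(lines):
    """Return a list of problems found in an iterable of FASTA lines."""
    problems = []
    rows = [line.rstrip("\n") for line in lines]
    if len(rows) % 2 != 0:
        problems.append("odd number of lines: expected header/sequence pairs")
    # Walk the file two lines at a time: identifier line, then sequence line.
    for i in range(0, len(rows) - 1, 2):
        header, sequence = rows[i], rows[i + 1]
        if not header.startswith(">") or len(header) == 1:
            problems.append(f"line {i + 1}: header must start with '>' followed by a name")
        if not VALID_SEQUENCE.fullmatch(sequence):
            problems.append(f"line {i + 2}: sequence has invalid or whitespace characters")
    return problems
```

An empty list means no problems were found by these particular checks; it does not guarantee that the file will pass validation on submission.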
Flat File
Annotated sequences can only be submitted in the EMBL flat file format. For the full range of features and qualifiers available for flat files, and their expected content, please see WebFeat. A complete flat file manual is also available.
The feature table annotation must conform to the INSDC Feature Table Definition.
Some tools to help you create flat files are described in our Third Party Tools page.
Chromosome List File
The chromosome list file must be provided when the submission contains assembled chromosomes.
The file is a tab-separated text file (USASCII7) with up to four columns. An example chromosome list file, describing a eukaryote with four linear nuclear chromosomes and one linear mitochondrial chromosome:
chr01	1	Linear-Chromosome
chr02	2	Linear-Chromosome
chr03	3	Linear-Chromosome
chr04	4	Linear-Chromosome
chrMi	MIT	Linear-Chromosome	Mitochondrion
Please read on for information on the content of the chromosome list file columns:
OBJECT_NAME (first column): The unique sequence name, matching the sequence name in your FASTA file (‘>’ line) or EMBL flat file (‘AC * ’ line).
CHROMOSOME_NAME (second column): The chromosome name. The value will appear as the /chromosome, /plasmid or /segment qualifier in the EMBL-Bank flat files. Names must:
match the pattern: ^[A-Za-z0-9][A-Za-z0-9_#\-.]*$
be shorter than 33 characters
be unique within an assembly
not contain any of the following as part of their name (case insensitive):
CHROMOSOME_TYPE (third column): Allowed values:
TOPOLOGY (CHROMOSOME_TYPE modifier):
Topology is not a separate column but can be specified as a modifier to the chromosome type
Options are ‘linear’ or ‘circular’, default is linear
Must not conflict with any value specified in the flat file
Contigs, scaffolds and transcriptome sequences are always linear: entering ‘circular’ here will be overridden
CHROMOSOME_LOCATION (optional fourth column): By default eukaryotic chromosomes will be assumed to reside in the nucleus and prokaryotic chromosomes and plasmids in the cytoplasm. Allowed values:
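The column rules above lend themselves to a quick local check before submission. The following Python sketch is one possible approach under stated assumptions: the function names are hypothetical, each row is expected to have three or four columns, the hyphen in the name pattern is treated as a literal character, and the topology modifier is assumed to be written as a hyphen-separated prefix, as in the ‘Linear-Chromosome’ example. It does not check the allowed CHROMOSOME_TYPE or CHROMOSOME_LOCATION values, or the disallowed name parts.

```python
import re

# Pattern from the CHROMOSOME_NAME rules, with the hyphen escaped so that it
# is a literal character rather than a range (an interpretation, not confirmed).
NAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_#\-.]*$")


def split_topology(chromosome_type):
    """Split a type such as 'Linear-Chromosome' into (topology, type)."""
    prefix, sep, rest = chromosome_type.partition("-")
    if sep and prefix.lower() in ("linear", "circular"):
        return prefix.lower(), rest
    return "linear", chromosome_type  # no modifier given: default is linear


def check_chromosome_list(text):
    """Return a list of problems found in chromosome list file content."""
    problems, seen = [], set()
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        columns = line.split("\t")
        # Assumption: three columns are required and the fourth is optional.
        if not 3 <= len(columns) <= 4:
            problems.append(f"line {number}: expected 3 or 4 tab-separated columns")
            continue
        name = columns[1]
        if not NAME_PATTERN.fullmatch(name):
            problems.append(f"line {number}: invalid chromosome name {name!r}")
        if len(name) >= 33:
            problems.append(f"line {number}: chromosome name must be shorter than 33 characters")
        if name in seen:
            problems.append(f"line {number}: duplicate chromosome name {name!r}")
        seen.add(name)
    return problems
```

For instance, split_topology("Linear-Chromosome") returns ("linear", "Chromosome"), and check_chromosome_list returns an empty list for the five-row example file shown earlier.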
AGP File
You may use an AGP file to describe the assembly of scaffolds from contigs, or of chromosomes from scaffolds.
AGP files can be validated using the NCBI AGP validator.
The AGP file can also be used to define sequences as unplaced. Unplaced sequences are those known to be part of the assembly, but it is unknown which chromosome they belong to.
Unlocalised List File
This file should be provided when the submission contains chromosomes with unlocalised sequences. Unlocalised sequences are contigs or scaffolds that are associated with a specific chromosome but for which order and orientation are unknown. An example unlocalised list file:
cb25.NA_084	III
cb25.NA_093	III
cb25.NA_108	III
The unlocalised list file is a tab-separated text file (USASCII7) containing the following columns:
OBJECT_NAME (first column): the unique sequence name, matching a FASTA header (‘>’ line) or flat file (‘AC * ’ line)
CHROMOSOME_NAME (second column): the unique chromosome name associated with this sequence. This must match a CHROMOSOME_NAME in the chromosome list file.
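The requirement that every CHROMOSOME_NAME in the unlocalised list file matches one in the chromosome list file can be checked with a short script. The Python sketch below is one way to do this; the function names are hypothetical and both files are assumed to have already been read into strings.

```python
def chromosome_names(chromosome_list_text):
    """Collect the second-column names from chromosome list file content."""
    return {
        line.split("\t")[1]
        for line in chromosome_list_text.splitlines()
        if line.count("\t") >= 1
    }


def unmatched_unlocalised(unlocalised_text, chromosome_list_text):
    """Return rows (as tuples) that are malformed or name an unknown chromosome."""
    known = chromosome_names(chromosome_list_text)
    unmatched = []
    for line in unlocalised_text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        # A valid row has exactly two columns: OBJECT_NAME and CHROMOSOME_NAME.
        if len(parts) != 2 or parts[1] not in known:
            unmatched.append(tuple(parts))
    return unmatched
```

An empty result means that every unlocalised sequence refers to a chromosome name that is present in the chromosome list file.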
Appendix: Unique Sequence Names
All sequences within one genome assembly submission must be identified by a unique sequence name provided in the FASTA, AGP or flat files.
It is essential that the sequence names are unique and used consistently between files. For example, the chromosome list file must refer to the chromosome sequences being submitted in FASTA, AGP or flat files using the unique entry name. Similarly, an AGP file must refer to scaffolds or contigs using unique entry names.
In FASTA files, the sequence name is extracted from the header line starting with ‘>’.
For example, the following sequence has name
In AGP files, the sequence name is extracted from the 1st (object) column.
In flat files, the sequence name is extracted from the AC * line. The sequence name must be prefixed with a ‘_’ when using the flat file format.
For example, the following sequence has name contig1:
AC * _contig1
Note that for the AC * line, the ‘AC’ must be followed by exactly one space, an asterisk (*) character, and then one more space.
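The extraction rules in this appendix can be summarised in a short Python sketch, which may help when checking that names are used consistently between files. The function names are hypothetical, and taking the FASTA name as the text between ‘>’ and the first whitespace is an assumption based on common practice rather than something stated here.

```python
# Prefix of the flat file line: 'AC', one space, '*', one space, then the
# '_' that must precede the sequence name.
FLAT_FILE_PREFIX = "AC * _"


def fasta_name(header_line):
    """Name from a FASTA header line, or None if it is not a header."""
    if not header_line.startswith(">"):
        return None
    # Assumption: the name ends at the first whitespace after '>'.
    fields = header_line[1:].split()
    return fields[0] if fields else None


def agp_name(agp_line):
    """Name from an AGP line: the first (object) column."""
    return agp_line.rstrip("\n").split("\t")[0]


def flat_file_name(ac_line):
    """Name from an 'AC * _name' line, or None if the line does not match."""
    line = ac_line.rstrip("\n")
    if not line.startswith(FLAT_FILE_PREFIX):
        return None
    return line[len(FLAT_FILE_PREFIX):].strip() or None
```

With these helpers, flat_file_name("AC * _contig1") returns "contig1", which can then be compared with the names taken from the corresponding FASTA or AGP files.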
Appendix: Definition of Terms
Assembly: A set of chromosome assemblies, unlocalised and unplaced sequences,
alternate loci and patches that represent a genome.
The major and minor releases form an assembly chain. For example, the
assembly accession for GRCh37 major release is GCA_000001405.1. The
assembly accession consists of two parts: the assembly chain accession
(GCA_000001405) and the assembly version (.1). The assembly version is
incremented for each minor release while the assembly chain accession
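To illustrate the two-part structure of an assembly accession, the following Python sketch splits an accession into its chain accession and version. The function name split_assembly_accession is hypothetical and the error handling is only one possible choice.

```python
def split_assembly_accession(accession):
    """Split an accession such as 'GCA_000001405.1' into (chain, version)."""
    # The version is the part after the last dot; the rest is the chain accession.
    chain, sep, version = accession.rpartition(".")
    if not sep or not chain or not version.isdigit():
        raise ValueError(f"not a versioned assembly accession: {accession!r}")
    return chain, int(version)
```

For the GRCh37 example above, split_assembly_accession("GCA_000001405.1") returns ("GCA_000001405", 1).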
Chromosome: An assembled pseudomolecule that represents a biological chromosome.
Most of the chromosome is expected to be represented by sequenced bases,
although some gaps may still be present.
Placed sequence: A sequence that has a known chromosomal location and orientation.
Unplaced sequence: A sequence that is not associated with any specific chromosome.
Unlocalised sequence: A sequence that is associated with a specific chromosome without
being ordered or oriented on that chromosome.