Tuesday 24 July 2012

Navigating microbial genomes on the NCBI FTP site

If you are a bioinformatician working in microbial genomics, then you should know this URL:


If you click on the URL, you get a big list of folders, and it does look like a mess. But for those of us in microbial genomics there are a few key folders worth knowing about, and probably worth mirroring on your own servers:
  1. ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
  2. ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/
  3. ftp://ftp.ncbi.nih.gov/genomes/Plasmids/
  4. ftp://ftp.ncbi.nih.gov/genomes/Viruses/
  5. ftp://ftp.ncbi.nih.gov/genomes/Fungi/
  6. ftp://ftp.ncbi.nih.gov/genomes/Fungi_DRAFT/
Most of my work is in bacterial genomics, so I'll discuss the contents of the first four folders only. I'll leave the last two to an experienced mycogenomicist.

1. Bacteria

This directory contains a folder for each completed bacterial genome. That is, the genome has been finished to a single DNA sequence per replicon (usually just one chromosome) and is fully annotated. There are currently around 1000 completed bacterial genomes, of which I've been involved in about 10.

Let's have a look at one. I chose Dickeya dadantii because it's a lovely sounding alliteration for a plant pathogen: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Dickeya_dadantii_3937_uid52537/

NC_014500.asn 15.9 MB 13/06/2012 12:11:00
NC_014500.faa 1.7 MB 13/06/2012 12:11:00
NC_014500.ffn 4.5 MB 19/11/2011 11:00:00
NC_014500.fna 4.8 MB 29/09/2010 10:00:00
NC_014500.frn 49.1 kB 29/09/2010 10:00:00
NC_014500.gbk 16.7 MB 13/06/2012 12:11:00
NC_014500.gff 1.8 MB 03/04/2012 03:41:00
NC_014500.ptt 407 kB 10/03/2012 13:18:00
NC_014500.rnt 7.1 kB 29/09/2010 10:00:00
NC_014500.rpt 281 B 25/04/2011 10:00:00
NC_014500.val 7.0 MB 13/06/2012 12:11:00

You can see a bunch of files, all with the same prefix (NC_014500) and a bunch of different suffixes or file extensions (gbk, gff) - some of which should be familiar to you. The NC_014500 is the RefSeq accession ID for the single chromosome of Dickeya dadantii. The most important files are:
  • fna : FASTA file of the chromosomal sequence (think "n" = nucleotide)
  • gbk : Genbank file containing meta-data, sequence, and annotations
  • gff : GFF3 file containing annotations only (coordinates relative to the .fna file)
  • faa : FASTA file of the translated coding regions (proteins) annotated in the .gbk/.gff (think "aa" = amino acids)
In terms of usefulness, the .gbk file contains (nearly) all the information that the other files contain - the .faa and .fna files are easily generated from the .gbk using BioPerl etc. If you want to get the .gbk files for all the finished genomes, you can download the tarball NCBI provides: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz
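As a rough illustration of how the .fna can be regenerated from the .gbk, here is a minimal Python sketch that uses only the standard library instead of BioPerl. The function name gbk_to_fasta is made up for this example, and it assumes a single-record GenBank flat file with a LOCUS line and an ORIGIN block terminated by "//". A real toolkit such as BioPerl handles far more cases (multiple records, feature tables, translations for the .faa).

```python
def gbk_to_fasta(gbk_text, width=70):
    """Build a FASTA string from the LOCUS name and ORIGIN block of one record.

    Assumption: gbk_text holds exactly one GenBank record.
    """
    name = None
    seq_parts = []
    in_origin = False
    for line in gbk_text.splitlines():
        if line.startswith("LOCUS"):
            # The second whitespace-separated token is the locus name.
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # end of record
        elif in_origin:
            # Sequence lines look like "        1 gatcctccat atacaacggt";
            # drop the leading coordinate and join the 10-base blocks.
            seq_parts.extend(line.split()[1:])
    if name is None:
        raise ValueError("no LOCUS line found")
    sequence = "".join(seq_parts).upper()
    # Wrap the sequence at a fixed width, as FASTA files conventionally do.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"
```

Reading NC_014500.gbk into a string and passing it to gbk_to_fasta should give something close to the .fna file, although the header line will only contain the locus name and not NCBI's full description.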

2. Bacteria_DRAFT

This directory contains folders for each draft bacterial genome. That is, the genome has been de novo assembled into contigs/scaffolds (e.g. using Newbler for 454 data) but has not been, and probably never will be, finished. They are usually annotated, either by the submitter or automatically by NCBI, but sometimes there may be only sequences. There are about 2600 draft genomes currently.

Here are the contents of the Thiocapsa marina str. 5811 genome folder - it's a purple sulphur coccus from the Mediterranean Coast if you are interested.

NZ_AFWV00000000.asn        13.5 kB 03/04/2012 03:19:00
NZ_AFWV00000000.contig.asn.tgz 1.7 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.faa.tgz 1.0 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.ffn.tgz 1.5 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.fna.tgz 1.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.frn.tgz 4.1 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbk.tgz 4.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbs.tgz 4.2 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.gff.tgz 393 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.ptt.tgz 119 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.rnt.tgz 1.5 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.rpt.tgz 2.5 kB 21/07/2012 02:13:00
NZ_AFWV00000000.contig.val.tgz 1.6 MB 21/07/2012 02:13:00
NZ_AFWV00000000.gbk         4.7 kB 03/04/2012 03:19:00
NZ_AFWV00000000.rpt         257 B 03/04/2012 03:19:00
NZ_AFWV00000000.val         6.0 kB 03/04/2012 03:19:00

This folder looks a bit different to the finished genomes. It has a .gbk file, but you will notice it is quite small (4700 bytes), and if you look at it, you can see it has no sequence or annotation, only some meta-data and a reference to "WGS  NZ_AFWV01000001-NZ_AFWV01000062". This means that this genome record consists of 62 other records, one for each contig in the assembly. These are stored in the compressed tar file NZ_AFWV00000000.contig.gbk.tgz as follows:

% tar ztf NZ_AFWV00000000.contig.gbk.tgz
NZ_AFWV01000001.gbk
NZ_AFWV01000002.gbk
NZ_AFWV01000003.gbk
...
NZ_AFWV01000061.gbk
NZ_AFWV01000062.gbk

So, in summary, instead of getting a nice neat single .gbk or .faa file for each replicon as you do for the completed genomes, you get a tarball of files for each assembly, with each file representing a contig in the draft genome. Any extra chromosomes or plasmids will be mixed in the bag of contigs.
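To check that a contig tarball really holds every record named in the WGS line, something like the following Python sketch could be used. The helper names expand_wgs_range and missing_contigs are invented for this example, and it assumes that both ends of the range share a prefix followed by a zero-padded number, and that each tar member is named after its accession plus a file extension, as in the listing above.

```python
import os
import re
import tarfile

def expand_wgs_range(wgs_range):
    """Expand 'NZ_AFWV01000001-NZ_AFWV01000062' into every accession in between."""
    first, last = wgs_range.split("-")
    # Split each accession into a non-numeric prefix and its trailing digits.
    prefix_a, num_a = re.fullmatch(r"(.*?)(\d+)", first).groups()
    prefix_b, num_b = re.fullmatch(r"(.*?)(\d+)", last).groups()
    if prefix_a != prefix_b or len(num_a) != len(num_b):
        raise ValueError("range endpoints do not share a prefix")
    width = len(num_a)  # keep the zero padding of the original accessions
    return [f"{prefix_a}{n:0{width}d}" for n in range(int(num_a), int(num_b) + 1)]

def missing_contigs(wgs_range, tgz_fileobj):
    """Return accessions in the WGS range that have no member in the tarball."""
    expected = expand_wgs_range(wgs_range)
    with tarfile.open(fileobj=tgz_fileobj, mode="r:gz") as tar:
        # Strip any directory part and the extension from each member name.
        present = {os.path.splitext(os.path.basename(n))[0] for n in tar.getnames()}
    return [acc for acc in expected if acc not in present]
```

Opening NZ_AFWV00000000.contig.gbk.tgz in binary mode and passing the file object along with the range from the WGS line should return an empty list if all 62 contigs are present.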

3. Plasmids

The Plasmids folder is not known to many people; frankly, it seems a bit hidden away. It contains ~3000 completed plasmid sequences. Confusingly, ~1000 of these are duplicated from the Bacteria folder (as the plasmid was sequenced with its parent), while the other ~2000 are novel. Even more annoying is that the folder structure is different:

faa/ 21/07/2012 19:39:00
fna/ 21/07/2012 19:40:00
gbk/ 21/07/2012 19:41:00
...
plasmids.all.faa.tar.gz 43.2 MB 23/07/2012 19:43:00
plasmids.all.fna.tar.gz 75.1 MB 23/07/2012 19:43:00
plasmids.all.gbk.tar.gz 199 MB 23/07/2012 19:43:00
...

Now we have a folder for each file extension, each of which contains ~3000 files. So the files for a particular plasmid are spread out over multiple folders. Fortunately they provide compressed tar files of the whole archive to download directly: plasmids.all.gbk.tar.gz
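If you would rather have the per-plasmid view back, the per-extension layout can be regrouped by accession. This is a minimal Python sketch: the function name group_by_accession is made up, it assumes each file is named as the accession plus its extension, and the accessions in the usage note are placeholders rather than real plasmid records.

```python
import posixpath
from collections import defaultdict

def group_by_accession(paths):
    """Map each accession to {extension: path} across the per-extension folders."""
    grouped = defaultdict(dict)
    for path in paths:
        # "gbk/NC_000000.gbk" -> accession "NC_000000", extension "gbk"
        filename = posixpath.basename(path)
        accession, ext = posixpath.splitext(filename)
        grouped[accession][ext.lstrip(".")] = path
    return dict(grouped)
```

For example, group_by_accession(["faa/NC_000000.faa", "gbk/NC_000000.gbk"]) collects both paths under the single key "NC_000000", which is closer to how the Bacteria folder is organised.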

4. Viruses

Some of you may be wondering why I am including Viruses in this story. Well, some viruses infect Bacteria too - they are called bacteriophage. There are ~3000 folders in the Viruses division, but not all of them are bacteriophage. A simple grep for "phage" suggests ~600 are bacterial viruses. The folder structure is the same as for the finished Bacteria genomes.
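The grep for "phage" can be mimicked in a couple of lines of Python once you have the list of folder names. The function name phage_folders is invented for this sketch, and it matches case-insensitively (like grep -i), which may differ slightly from a plain grep. It is only a name-based heuristic, so a bacterial virus whose folder name lacks the word "phage" will be missed.

```python
def phage_folders(folder_names):
    """Return the folder names that contain 'phage', ignoring case."""
    return [name for name in folder_names if "phage" in name.lower()]

def phage_fraction(folder_names):
    """Fraction of folders whose name mentions 'phage' (0.0 for an empty list)."""
    if not folder_names:
        return 0.0
    return len(phage_folders(folder_names)) / len(folder_names)
```

Feeding in the directory listing of the Viruses folder, one name per list element, gives the candidate bacteriophage folders and a rough idea of what share of the division they make up.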

It is important to realise that most of these virus sequences are natively dsDNA and will also appear integrated into the chromosomal DNA of many of the entries in Bacteria and Bacteria_DRAFT.