Sunday, 6 October 2013

Understanding SNPs and INDELs in microbial genomes

Introduction

Variants are differences between two genomes. Here I describe two important types of nucleotide-level variants (SNPs and INDELs) and how they affect microbial genomes.

SNPs

A SNP (pronounced "snip") is a single nucleotide polymorphism. This is when a single base differs between two genomes, and the DNA around that base is otherwise unchanged.

Genome 1 | DNA | ATGCTATAGTAAATCTGCGCTAGCT
Genome 2 | DNA | ATGCTATAGTAAATGTGCGCTAGCT
                               |
                           SNP(C=>G)  
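
As a rough illustration of the definition above, here is a minimal Python sketch that compares two sequences of equal length and reports each mismatched base. The function name find_snps is my own invention for this example, and it assumes the two sequences are already aligned with no INDELs between them.

```python
def find_snps(ref, alt):
    """Return (position, ref_base, alt_base) for every mismatch.

    Positions are 1-based. Assumes both sequences are the same length,
    i.e. they differ only by substitutions.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be the same length (no INDELs)")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(ref, alt)) if a != b]


genome1 = "ATGCTATAGTAAATCTGCGCTAGCT"
genome2 = "ATGCTATAGTAAATGTGCGCTAGCT"

snps = find_snps(genome1, genome2)
for pos, ref_base, alt_base in snps:
    print(f"SNP at position {pos}: {ref_base}=>{alt_base}")
```

Real genomes are never this neatly aligned, so this is only a toy; it shows what a SNP is, not how to find one in practice.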

In coding-dense genomes like those of microbes, most SNPs will be within protein coding regions. Thus the SNP will change a codon, and potentially change the amino acid it codes for. If the amino acid coded for does not change, it is called a synonymous SNP (as the new codon is a 'synonym' for the same amino acid). If it does change, it is called a non-synonymous SNP.

Genome 1 | DNA | ATG AAA GTT GAT GAC CAG CAT TCC CCA TGA
Genome 2 | DNA | ATG AAA GTC GAT GAC CAG CAT TAC CCA TGA
                         ..|                 .|.  
                       SNP(T=>C)          SNP(C=>A)
                         ..|                 .|.
Genome 1 |  AA |  M   K   V   D   D   Q   H   S   P   *
Genome 2 |  AA |  M   K   V   D   D   Q   H   Y   P   *
                          |                   |
                         SYN               NON-SYN
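
The SYN / NON-SYN labelling in the diagram can be sketched in Python as below. This is only an illustration: the names CODON_TABLE and classify_snps are made up for this example, the table is the standard genetic code built from a compact string, and the code assumes both sequences are in-frame coding sequences of the same length.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snps(ref_cds, alt_cds):
    """Compare two in-frame coding sequences codon by codon.

    Returns (codon_number, ref_codon, alt_codon, ref_aa, alt_aa, kind)
    for each codon that differs; codon numbers are 1-based.
    """
    results = []
    for i in range(0, len(ref_cds) - 2, 3):
        ref_codon, alt_codon = ref_cds[i:i + 3], alt_cds[i:i + 3]
        if ref_codon != alt_codon:
            ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
            kind = "SYN" if ref_aa == alt_aa else "NON-SYN"
            results.append((i // 3 + 1, ref_codon, alt_codon, ref_aa, alt_aa, kind))
    return results


genome1 = "ATG AAA GTT GAT GAC CAG CAT TCC CCA TGA".replace(" ", "")
genome2 = "ATG AAA GTC GAT GAC CAG CAT TAC CCA TGA".replace(" ", "")

results = classify_snps(genome1, genome2)
for row in results:
    print(row)
```

Comparing whole codons rather than single bases means two SNPs in the same codon are judged together, which is usually what you want when asking whether the amino acid changes.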

A non-synonymous SNP can drastically alter the function of a protein because sometimes a single amino acid difference can modify the structure/shape of a protein. It could even affect the RNA transcript itself, causing it to be translated at lower efficiency or not at all. SNPs in promoter regions (-35, -10) and the ribosome binding site (RBS) can have similar effects.

A good rule of thumb is that SNPs in the 3rd position of a codon are often synonymous, due to the particular pattern of degeneracy in the genetic code. If two SNPs occur right next to each other, the variant is sometimes called a multiple nucleotide polymorphism (MNP).
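
The 3rd-position rule of thumb can be checked by brute force. The sketch below (the function name is hypothetical, and it uses the standard genetic code) tries every possible single-base change in every one of the 64 codons and counts how many leave the encoded amino acid unchanged, separately for each codon position. For simplicity it treats the stop signal as if it were just another amino acid.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_counts_by_position():
    """Count synonymous single-base changes at codon positions 1, 2 and 3."""
    counts = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a change
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa:
                    counts[pos] += 1
    return counts


counts = synonymous_counts_by_position()
total_per_position = 64 * 3  # 64 codons, 3 alternative bases each
for pos, n in enumerate(counts, start=1):
    print(f"codon position {pos}: {n}/{total_per_position} changes are synonymous")
```

Under this simple counting, about two thirds of 3rd-position changes are synonymous, while changes at the 1st and 2nd positions almost always alter the amino acid.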

INDELs

An INDEL (INsertion/DELetion) is where a single base has been deleted from, or inserted into, one genome relative to another. It is a symmetrical relationship, as a deletion in one genome corresponds to an insertion in the other. I reckon it should be called a deletion/insertion polymorphism (DIP) too, so we can all snack on SNPs and DIPs :-)

                           DEL(A)
                             |
Genome 1 | DNA | ATGCTATAGTAA-TCTGCGCTAGCT
Genome 2 | DNA | ATGCTATAGTAAATCTGCGCTAGCT
                             |
                           INS(A)  
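
To make the symmetry concrete, here is a small Python sketch that walks along a gapped pairwise alignment and reports each gap column from both points of view. The function name find_indels and the label strings are my own; it assumes the alignment has already been produced by some other tool and uses "-" for gaps.

```python
def find_indels(aligned1, aligned2, gap="-"):
    """Report gap columns in a pairwise alignment.

    Returns (column, description, base) tuples; columns are 1-based
    alignment columns, and base is the nucleotide present in the
    sequence that does not have the gap.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    events = []
    for col, (a, b) in enumerate(zip(aligned1, aligned2), start=1):
        if a == gap and b != gap:
            events.append((col, "DEL in 1 / INS in 2", b))
        elif b == gap and a != gap:
            events.append((col, "INS in 1 / DEL in 2", a))
    return events


genome1 = "ATGCTATAGTAA-TCTGCGCTAGCT"
genome2 = "ATGCTATAGTAAATCTGCGCTAGCT"

events = find_indels(genome1, genome2)
print(events)
```

Whether you call the event an insertion or a deletion depends only on which genome you treat as the reference.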

While a SNP will usually change a protein only slightly or not at all, an INDEL will nearly always have a drastic effect on a protein. Because codons are groups of 3 nucleotides, removing or adding 1 nucleotide shifts the reading frame for every codon downstream; this is called a frame-shift mutation. This usually results in the protein being either extended or truncated.

Genome 1 | DNA | ATG AAA GTT GAT GAC CAG CAT TCC CCA TGA
Genome 1 |  AA |  M   K   V   D   D   Q   H   S   P   *

Genome 2 | DNA | ATG AAA GTC -AT GAC CAG CAT TAC CCA TGA
                             |                          
                           DEL(G)          
                             |
Genome 2 | DNA | ATG AAA GTC ATG ACC AGC ATT ACC CAT GA? ??? ??? ???
Genome 2 |  AA |  M   K   V   M   T   S   I   T   H   X   X   X   X
                                                      |
                                            STOP Loss & read-through
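
The frame-shift above can be reproduced with a few lines of Python. This is a sketch under some assumptions: CODON_TABLE and translate are names I made up, the table is the standard genetic code, and translate simply ignores any incomplete codon left at the end of the sequence.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate every complete codon; trailing 1 or 2 bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


genome1 = "ATG AAA GTT GAT GAC CAG CAT TCC CCA TGA".replace(" ", "")
# Genome 2 as drawn in the diagram; dropping the "-" closes up the deletion.
genome2 = "ATG AAA GTC -AT GAC CAG CAT TAC CCA TGA".replace(" ", "").replace("-", "")

protein1 = translate(genome1)
protein2 = translate(genome2)
print(protein1)
print(protein2)
```

The second protein has no "*" in it: the original stop codon is now out of frame, so translation would carry on into whatever sequence follows.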

In the example above, the protein was extended into a new frame, giving it a different C-terminal end than normal. Translation will eventually hit another stop codon just by chance. In the example below, the frame-shift instead introduces a premature STOP codon, so we end up with a shorter reading frame.

Genome 3 | DNA | ATG AAA GTC GAAT GAC CAG CAT TAC CCA TGA
                               |
                             INS(A)
                               |
Genome 3 | DNA | ATG AAA GTC GAA  TGA CCA GCA TTA CCC ATG          
Genome 3 |  AA |  M   K   V   E    *   P   A   L   P   M
                                   |
                         STOP Gain & truncation
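
The truncation can be sketched the same way, this time stopping at the first in-frame stop codon as a ribosome would. Again the names (CODON_TABLE, translate_to_stop) are hypothetical and the standard genetic code is assumed; the "parent" sequence is simply Genome 3 without the inserted A.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_to_stop(dna):
    """Translate from the first base and halt at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


parent = "ATG AAA GTC GAT GAC CAG CAT TAC CCA TGA".replace(" ", "")
genome3 = "ATG AAA GTC GAAT GAC CAG CAT TAC CCA TGA".replace(" ", "")

parent_protein = translate_to_stop(parent)
truncated_protein = translate_to_stop(genome3)
print(len(parent_protein), parent_protein)
print(len(truncated_protein), truncated_protein)
```

A single inserted base is enough to cut the protein from nine amino acids down to four.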

Because the stop codon is no longer where it needs to be, these genes may not ever be transcribed or translated. In that case they are called pseudogenes.

If multiple adjacent bases are deleted (or inserted) together, it is sometimes called a micro-deletion (or micro-insertion), or collectively a micro-INDEL. A micro-INDEL of length 3 (or any multiple of 3) occasionally occurs in bacterial evolution, as it keeps the protein translation in frame.
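
Why length 3 is special can be seen by splitting a sequence into codons before and after a deletion. In this sketch the helper names codons and delete are invented for the example, and positions are 0-based string offsets.

```python
def codons(dna):
    """Split a sequence into complete codons, dropping any leftover bases."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


def delete(dna, start, length):
    """Remove `length` bases beginning at 0-based offset `start`."""
    return dna[:start] + dna[start + length:]


cds = "ATG AAA GTT GAT GAC CAG CAT TCC CCA TGA".replace(" ", "")

one_base = delete(cds, 9, 1)     # removes the G of the 4th codon (GAT)
three_bases = delete(cds, 9, 3)  # removes the whole 4th codon (GAT)

print(codons(cds))
print(codons(one_base))     # every codon after the deletion is different
print(codons(three_bases))  # one codon missing, the rest untouched
```

A deletion whose length is a multiple of 3 costs the protein a few amino acids but leaves everything downstream in frame, which is far more likely to be tolerated.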

Structural Variation

SNPs and INDELs are about low-level genomic variation. It is also possible to look at structural variants which affect the genome at larger scales. Events like gene duplications, tandem repeats, transposon insertions, inversions, and other chromosomal rearrangements are all important to consider, but this post will leave those issues for another day.

Conclusion

SNPs and INDELs are small differences between genomes. They are important drivers of bacterial evolution, by modifying how or whether genes are transcribed and translated. In my next post I will introduce my new tool Snippy for discovering these differences efficiently.

4 comments:

  1. Hello Dr Torsten,
    I am working on SNP and indel detection for a non-model plant organism. After producing the VCF file I think I have hit a road block! The format is very complex and I am wondering where and how to proceed further with the .vcf files I possess.
    I am using these .vcf files in IGV, but IGV says that it fails to detect an index file and I get a blank screen. Any suggestions?
    Thanks in advance

    Replies
    1. This web page explains how to index your VCF files for IGV:
      http://www.broadinstitute.org/igv/VCF

  2. Thanks! This was a helpful review. Loved the SNP DIP joke :0)

  3. My daughter has multiple CBS insertions and a COMT insertion. Trying to make sense of what this means....
