Saturday, 8 September 2012

Using Velvet with mate-pair sequences

Introduction

Illumina sequencing instruments (HiSeq, MiSeq, Genome Analyzer) can produce three main types of reads when sequencing genomic DNA:
  1. Single-end
    Each "read" is a single sequence from one end of a DNA fragment. The fragment is usually 200-800bp long, with the amount being read can be chosen between 50 and 250 bp.
  2. Paired-end
    Each "read" is two sequences (a pair) from each end of the same genomic DNA fragment (more info). The distance between the reads on the original genome sequence is equal to the length of the DNA fragment that was sequenced (usually 200-800 bp).
  3. Mate-pair:
    Like paired-end reads, each "read" is two sequences from each end of the same DNA fragment, but the DNA fragment has been engineered from a circularization process (more info) such that the distance between the reads on the original genome sequence is much longer (say 3000-10000 bp) than the proxy DNA fragment (200-800 bp).

Single-end library ("SE")

When we got the original Illumina Genome Analyzer, all it could do was 36 bp single-end reads, and each lane gave us a massive 250 Mbp, and we had to walk 7 miles through snow in the dark to get it. Ok, that last bit is clearly false as we don't get snow in Australia and we speak metric here, but the point is that there is still plenty of legacy SE data around, and SE reads are still used in RNA-Seq sometimes. Let's imagine our data was provided as a standard FASTQ file called SE.fq:

velveth Dir 31 -short -fastq SE.fq
velvetg Dir -exp_cov auto -cov_cutoff auto

I strongly recommend enabling the -exp_cov auto and -cov_cutoff auto options. They will almost always improve the quality of your assemblies.

Paired-end library ("PE")

Paired-end reads are the standard output of most Illumina sequencers these days, currently 2x100bp for the HiSeq and 2x150bp for the GAIIx and MiSeq, but they are all migrating to 2x250bp soon. The two sequences per paired read are typically distributed in two separate files, the "1" file contains all the "left" reads and the "2" file contains all the corresponding "right" reads. Let's imagine our paired-end run gave us two files in standard FASTQ format, PE_1.fq and PE_2.fq:

velveth Dir 31 -shortPaired -separate -fastq PE_1.fq PE_2.fq
velvetg Dir -exp_cov auto -cov_cutoff auto

Previously you had to interleaved the left and right files for Velvet, but we recently added support to Velvet for the -separate option which we hope is now saving time and disk space throughout the Velvetsphere!

Mate-pair library ("MP")

Mate-pair reads are extremely valuable in a de novo setting as they provide long-range information about the genome, and can help link contigs together into larger scaffolds. They have been used reliably for years on the 454 FLX platform, but used less often on the Illumina platform. I think the main reasons for this are the poorer reliability of the Illumina mate-pair protocol and the larger amount of DNA required compared to a PE library.

We can consider MP reads as the same as PE reads, but with a larger distance between them ("insert size"). But there is one technical difference due to the circularization procedure used in their preparation. PE reads are oriented "opp-in" (L=>.....<=R), whereas MP reads are oriented "opp-out" (L<=.....=>R). Velvet likes its paired reads to be in opp-in orientation, so we need to reverse-complement all our MP reads first, which I do here with the EMBOSS "revseq" tool.

revseq -sequence MP_1.fq -outseq rcMP_1.fq -notag
revseq -sequence MP_2.fq -outseq rcMP_2.fq -notag
velveth Dir 31 -shortPaired -separate -fastq rcMP_1.fq rcMP_2.fq
velvetg Dir -exp_cov auto -cov_cutoff auto -shortMatePaired yes


Early Illumina MP libraries are often contaminated with PE reads (the so-called shadow library) which are the result of imperfect selection of biotin-marked fragments in the circularization process. There is a special option in velvetg (not velveth) called -shortMatePaired added by  Sylvain Foret which informs Velvet that a paired channel is MP but may contain  PE reads, which helps it to account for them and avoid mis-scaffolding. I recommend using this no matter how pure you think your MP library is.

Combining them all! (SE + PE + MP)

When de novo assembling multiple libraries in Velvet, you should order them from smallest insert size to largest insert size. For our case, this means the SE first, then the PE, then the MP. Each library must go in its own "channel" as it comes from a differently prepared DNA library. Channels are specified in Velvet with a numerical suffix on the read-type parameter (absence means channel 0):

velveth \
  Dir 31 \
  -short -fastq SE.fq \
  -shortPaired2 -separate -fastq PE_1.fq PE_2.fq \
  -shortPaired3 -separate -fastq rcMP_1.fq rcMP_2.fq 

velvetg \
  Dir \
  -exp_cov auto -cov_cutoff auto \
  -shortMatePaired3 yes

Note that the -shortMatePaired option has been allocated to channel 3 now (the -shortPaired3 library) as that is the MP channel.

Conclusions

It's relatively to get up and running with Velvet, but when your projects become more complicated, the methods in this post should help you. But if you prefer a nice GUI to take care of most of the issues discussed here, I recommend using our Velvet GUI called VAGUE (Velvet Assembler Graphical User Environment). 



43 comments:

  1. That's a very informative post. Do you know how would you handle SOLiD reads? Reverse complimenting SOLiD reads is not the same as that for Illumina. Also, can you do an assembly with mate=paired reads alone?

    Is VAGUE able to handle SOLiD data?

    ReplyDelete
  2. Great post Torsten. Do you have any actual stats on how much better this makes an assembly with real world data? PS. I cant wait for a decent miSEQ MP protocol, PE is rubbish for de novo assembly.

    Linda, Velvet does a terrible job of SOLiD data. Your best option is to assemble a genome with illumina or 454 or torrent data, then overlay the SOLiD LMP data and use it to fix some errors, and to order your contigs.

    ReplyDelete
  3. De novo assembly of SOLiD colour space data is fraught with problems. All assembly software I know only works in base space. The SOLiD community site has details of a pipeline to convert/assemble/merge but I don't think it works very well.

    You are right in the the orientation of SOLiD mate pairs is different again, I think it is (L=> R=>).

    Yes, you can do an assembly of just MP reads.

    Ultimately de novo assembly works best with very long SE reads. PE and MP reads are a "hack" around this. The variability in the insert/fragment size makes linking PE/MP reads a bit tricky, although Pevzner is working on "rectangle graphs" for this sort of data.

    ReplyDelete
  4. I really like the '-separate' option! Is it true that this option is not documented in the manual?

    ReplyDelete
  5. That is false. The -separate option is in the Velvet manual (Manual.pdf) since version 1.2.07.

    https://github.com/tseemann/velvet/raw/master/Manual.pdf

    ReplyDelete
    Replies
    1. Just an FYI... The link you give above for the Velvet manual points to the August 29, 2008 version, which has no mention of the -separate flag.

      Delete
  6. Hi Torsten,

    Thanks for the very informative post! It was just what (I thought) I needed. I'm trying to use Velvetoptimiser to assemble a small eukaryotic genome from two Illumia HiSeq libs- one PE and one MP.

    Do you know if/how I can pass the Velvetg cmds. for the MP lib. as you did? It seems that one can pass most Velvet cmds. via Velvetoptimiser, but there is not a lot of documentation on doing something like this... at least I've yet to find it.
    Thanks
    Walt

    ReplyDelete
  7. Yes, you can pass through the velveth and velvetg options in this post to VelvetOptimiser with the -f and -o options. These are all described the the VOpt manual in the tarball you download. If you have any trouble, just email Simon Gladman with the command line that failed. His email is in the manual.

    But first see if you can assemble without VOpt for a single K value. If you can't get that working, there is no point putting it into VOpt.

    ReplyDelete
    Replies
    1. Thanks, Toresten, that's exactly what I've been doing- assembling w/o VO. I also just tried to use the -separate flags (I had shuffled data before), but apparently the Velvet version on our cluster (1.2.01-62-mp) is way out of date and doesn't support that, which probably explains why I was having issues passing some cmds. with VO. For example, using -shortpaired2, which won't work, but -shortpaired3 will... go figure!

      Thanks again!

      Delete
  8. Hi Torsten, thanks for you informative post
    one little thing:
    when using revseq you have to specify both input and output format correctly and explicitely, otherwise (e.g. with your code) you end up with a fasta file...

    revseq -sequence MP.fq -sformat1 fastq-sanger -outseq rc_MP.fq -osformat2 fastq-sanger -notag

    would be correct for newer illumina data.
    Cheers

    ReplyDelete
  9. Harald,
    Thanks for the information on EMBOSS revseq. To be honest, I don't use revseq - I use my own tool - but I wanted my post to be portable. I should have tested it properly beforehand.
    Torsten.

    ReplyDelete
  10. This comment has been removed by the author.

    ReplyDelete
  11. This comment has been removed by the author.

    ReplyDelete
  12. Hi,
    So I am trying to assemble a low complexity genome. The simulated paired end reads (illumina Miseq) which I am getting out of metasim shows a good genomic coverage. But it gives just one fasta file which I am using as an input for velveth.
    My aim is to get a good contig length distribution which I am not getting. I guess I am doing something wrong in giving input parameters. Is it that velvet does not accept a single fasta file for paired read information? Do I have to input 2 different files (forward read and reverse read files)?

    Thanks
    Ashwani

    ReplyDelete
  13. Velvet can accept paired-end reads in two ways. The first is as separate left and right files (via the -separate option). The second is via a single interleaved/shuffled file (default).

    A large low complexity genome will never give a good assembly using only paired end reads because there are too many repeats which can not be resolved.

    ReplyDelete
    Replies
    1. Hi Torsten,

      Is it like velvet performs best with very short reads (say 36-75bp)?What about reads longer than 150bp, because when I vary the read length to a higher region (above 150bp), the assembly degrades.

      Thanks
      Ashwani

      Delete
  14. I use Velvet with 150bp and even 250bp MiSeq reads without any problems. With these length reads the best K-mer value is around K=100 rather than K=31 like it used to be with very short 36bp reads. But to achieve K=100, you still need to ensure your depth is high enough to make the K-mer coverage (not read coverage) high enough (50-100).

    ReplyDelete
    Replies
    1. I am using velvet to assemble fungal genome (~30 million reads). I have compiled velvet with 'MAXKMERLENGTH=165' and 'openmp=1' to handle my 150bp reads. I am getting memory error ( velveth: Can't malloc 207 chars: Cannot allocate memory). Below is my script that shows the memory I requested.

      #PBS -q standard
      #PBS -l jobtype=cluster_only

      ### Set the number of cpus that will be used.
      #PBS -l select=1:ncpus=4:mem=8Gb:pcmem=4gb
      #PBS -l pvmem=4gb
      #PBS -l cput=30:0:0
      #PBS -l walltime=24:0:0

      source /usr/share/Modules/init/bash

      module load unsupported
      module load nirav/velvet/1.2.10

      #Get things ready to use threaded (OMP) version of Velvet
      #Set OMP_THREAD_LIMIT--will be the same as ppn above
      export OMP_THREAD_LIMIT=4

      #Set OMP_NUM_THREADS--will be 1 lower than ppn
      NUM_THREADS=3
      export OMP_NUM_THREADS=$NUM_THREADS

      # Write something to record what thread limits were used.
      echo Limiting Velvet to $PBS_NP threads total with $NUM_THREADS slave threads.

      cd $PBS_O_WORKDIR/1st_assembly

      velveth fungal 125,155,10 -fastq.gz -shortPaired -separate /rsgrps2/sjmiller/Illumina_JeeHan/160224_M03132_0042_000000000-ALYFG_Analysis_fastq_files/Parasite_S1_L001_R1_001.fastq.gz \
      /rsgrps2/sjmiller/Illumina_JeeHan/160224_M03132_0042_000000000-ALYFG_Analysis_fastq_files/Parasite_S1_L001_R1_001.fastq.gz

      I know I am requesting way too less memory but my question is what is appropriate memory that could handle my dataset?

      Thanks,
      M

      Delete
    2. This comment has been removed by the author.

      Delete
  15. Hi Torsten,

    As stated by you I tried k mer value of around 100 and it worked for a metagenome (say 10 genomes). The meta velvet also gave good results. The number of reads in this case was 1 million. But as I increased the read number to > 4 million the assembly degrades, even if I use a very large k mer value. So apart from K mer is there any other parameter which needs to be varied?

    Thanks
    Ashwani

    ReplyDelete
  16. It is difficult to know what is going on with your data. I'm not really sure what you mean by "assembly degrades"?

    As you add more reads, more "lower proportion" organisms may pass the cutoffs that Velvet is using, and you will get more contigs.

    ReplyDelete
    Replies
    1. Hi,
      Thanks for the reply!!
      To be more specific I am aiming at a good contig length distribution. Say if I use 4 million reads, at k mer size around 200 (my read length is 220bp), I am getting a decent contig distribution but not a good one.
      Now if I think of increasing the read count further, I am not sure what parameters apart from k mer size I should vary. How the k mer coverage in these scenarios affect contig length distribution?

      Thanks
      Ashwani

      Delete
  17. Just the info I was after - thanks!

    With regard to the question about longer kmer values though (ie ~100), won't the increased probability of encountering a sequencing error start to create problems in the assembly. There will also be less overlaps that fit into this big kmer value. With the exception of long repeats, I would have thought inappropriate overlaps greater than 50 bp would be rare, so not much is gained anyway. Hope that makes sense!

    ReplyDelete
  18. You are right that the probability of erroneous K-mers does increase with K. You can only increase K if you also increase your read coverage too, so that we get enough correct K-mers (signal) to counteract the false K-mers (noise). Read clipping and read correction are often used to drastically reduce the number of false K-mers being put into the graph, which reduces memory usage and improves graph simplification.

    For bacteria we usually have lots of coverage, sometimes > 1000x so using higher K-mers improves the assembly in practice. The error rate of Illumina is well below 1% now, so false K-mers are less likely than one might first expect (after clipping). Some studies have shown that K=24 is "mostly unique" in genomes, but mostly doesn't make them all unique.

    Thanks for taking the time to comment.

    Torst

    ReplyDelete
  19. In which version there is the possibility of using the -separate option for paired-ends reads? in Version 1.2.03 I get an unknown option error for that.

    ReplyDelete
  20. Hi Torsten, great post. One thing I'd point out is that Velvet does not compile with 3 categories (for 3 libraries) by default (it defaults to 2) so to use more than 2 libraries in an assembly, you need to compile velvet with the CATEGORIES option.

    ReplyDelete
    Replies
    1. As per manual: make 'CATEGORIES=57' (in the unlikely event you have 57 separate libraries to assemble)

      Delete
  21. Hi Torsten,
    Thanks for the this informative post. Do you know any tool designed to trim (and reverse the reads if necessary) for adapters specifically for illumina mate-pair libraries ? Based on the junction adapter(s) present in the read MPs might have different orientation so reversing all the reads seems not the best option. I could write my own code for this but I wanted to check first to see if there is anything available.

    Thanks
    Hamdi

    ReplyDelete
  22. Hi Torsten,
    Thank you for such an informative post.
    I am using metavelvet to assemble metagenome generated by Ion Proton, there sre 23 million SE reads.
    I have used the following commands:
    velveth runAll_a/ 21 -short ionExpress_nofilter.fasta

    velvetg runAll_a/ -exp_cov auto

    meta-velvetg runAll_a/

    The last line of the output is:

    Final graph has 4272026 nodes and n50 of 101, max 1932, total 178452348, using 19006699/23565054 reads

    I am confused or rather surprised with this line. Isn't the contigs are too many.
    Also, when I ran the python script "scriptEstimatedCovMulti.py" I am getting peak as[-1] and I am not sure what that is.
    The coverage is also very low.
    I would like to know if the method I am adopting is correct or I need to modify some steps to get improved results.
    I will be thankful if any one could help me figure out if I am on right track.

    Thanks!

    ReplyDelete
    Replies
    1. Your result is not surprising. Velvet is not well suited to Proton data, it works best with Illumina. Also Velvet is not designed for meta-genomes, which have a mixture of species of different coverages.

      You should consider a more specific tool, eg. Meta-Velvet or Ray-Meta. The Khmer project also has a protocol for assembling meta-genomes: https://khmer-protocols.readthedocs.org/en/latest/metagenomics/

      Delete
    2. Thank you for the reply.
      I have used Meta-Velvet here at the last step, but the results or the contigs which are produced after it and after velvetg are same in number.

      I tried the same approach with Miseq data but it looked same.
      I will try Ray-Meta.

      Thanks!

      Delete
  23. Hi Torsten,
    Thanks for the this informative post.
    I now using hybrid data of pair-end and mate-pair reads to assemble. I want to perform error correct before assemble. Should I reverse complement the mate-pair reads before I correct the error for all of my data. Or can I correct them and then reverse complement the mate-pair data before assemble?

    ReplyDelete
    Replies
    1. Trim the reads first, then correct the reads, then reverse complement for the assembler last.

      Delete
  24. Hi Torsten,
    I was wondering how using PE and MP reads of different lengths would affect the best k-mer size? If you have, say, a best k-mer size for the PE reads that is longer than the MP read length.

    ReplyDelete
    Replies
    1. Most modern assemblers no longer use a single k-mer value. They combine assemblies of multiple k-mer values.

      You may want to assemble just the PE reads first into contigs, then use an external scaffolder to use the MP reads to join and order contigs.

      Delete
  25. Hi Torsten,

    I started to do velvet recently, i want to know all the basic things about Velvet. i hate the official manual which one is i downloaded from the Velevt official page. Could you pls sugeest me any of the Articles and Velvet manual...

    Best,

    K.Kathir

    ReplyDelete
  26. Hi Torsten,
    recently i was trying to assemble my genome with 3 libraries,and how to determin the -ins_length option if i want to assemble with all the libraries? Is it necessary ao add this option to improve my assembly results? if i assemble with separate library ,how could i combine these results into one ?
    Looking forward to your reply

    ReplyDelete
  27. Hello Torsten,
    You are really doing a wonderful work by writing here . It really helps the beginners in de novo assembly.Here, i have one question that why all 3 libraries (SE reads, PE reads and Mate PE reads) of an organism should be combined and assembled? Combining 3 libraries will improve the assembly quality? could you answer me this?

    ReplyDelete
    Replies
    1. De novo genome assembly would be easy if there were no repeats in the genome, and no errors in the reads. We can improve error rate with better technology, but we can never change the genomes. Repeats produce ambiguous hubs in the assembly graph. The only way to untangle a repeat of length L is to have reads, or "read spans" that are LONGER than L bp. That's why Mate Pair reads are used, as they span up to 40 kbp of the genome, wherease Paired End reads only reach ~600bp.



      You might like to read some of the slides I have given on this topic:
      http://www.slideshare.net/torstenseemann/presentations

      Delete