Sunday, 11 November 2012

Tools to merge overlapping paired-end reads

Introduction

In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into lots more short pieces of DNA. The resultant set of DNA is called a "library" and the short pieces are called "fragments". Each of  the fragments in the library are then sequenced individually and in parallel. There are two ways of sequencing a fragment - either just from one end, or from both ends of a fragment. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F) or otherwise the sequence will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads. This means you need F to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine, there is a still a safety margin in the middle.

However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired end reads will overlap. Secondly, the new enyzmatic Nextera library preparation system produces a wide spread of F sizes compared to the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.

The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F, is a bit like a single-read of length F, except a bunch of bases in the middle of it are unknown, and how many of them there are is only roughly known (as libraries are only nominally of length F, each read will vary). This gives the reads a longer context, and this particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools will get confused if you give them overlapping pairs, and if we could overlap them and turn them into longer single-end reads, many tools will produce better results, and faster.


The tools

Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!



Features to look for

  • Keeps original IDs in merged reads
  • Outputs the un-overlapped paired reads
  • Ability to strip adaptors first
  • Rescores the Phred qualities across the overlapped region
  • Parameters to control the overlap sensitivity
  • Handle .gz and .bz2 compressed files
  • Multi-threading support
  • Written in C/C++ (faster compiled) rather than Python/Perl (slower)

33 comments:

  1. Allpaths-LG can now also do fragment filling:
    http://www.broadinstitute.org/software/allpaths-lg/blog/?p=577

    ReplyDelete
  2. So I'm trying to assemble a metagenome. THought maybe I'll run my shotgun sequences through flash or cope before.
    The default output of Flash contains one fastq file with combined sequences and two with those that remind uncombined. (From both input fastq-s then)
    Would you just use the combined sequences fastq to feed to Velvet or SOAPdenovo etc. assembler or would you cocatenate all three fastqs as it seems to me the uncombined files still combine bits and pieces that could be useful to assembly no matter that they are shorter?

    ReplyDelete
  3. Assembling a meta-genome is difficult at the best of times! Velvet can usually cope ok with overlapping paired end reads. But it probably do better if you do pre-overlap them with FLASH. You can give Velvet both the overlapped (SE) and non-overlapped (PE) reads:

    velveth Dir K
    -short -fastq COMBINED.fq
    -shortPaired -fastq -separate NOTCOMBINED_R1.fq NOTCOMBINED_R2.fq
    velvetg Dir -exp_cov auto -cov_cutoff auto

    However, metagenomes are tricky, as different genomes will have different abundances in the mixture... you may want to consider partitioning the reads first

    http://www.pnas.org/content/early/2012/07/25/1121464109.abstract

    or using digital normalization to equalize the abundance with KHMER:

    https://khmer.readthedocs.org/en/latest/guide.html#metagenome-assembly

    ReplyDelete
  4. I thought on using COPE as it seems to produce better N50 after assembly than flash (as per their own paper), then follow it with SOAPdenovo.
    Digital normalization is soemthing that crossed my mind too as I got across this artcile 1. I'll take a look at those posted by You now too.
    I was thinking of using Quake as the very first step. Just installed it have to look that it hopefully wouldn't mess with the order of paired end reads in the files.


    1.Brown, C. T., Howe, A., Zhang, Q., Pyrkosz, A. B., & Brom, T. H. (2012). A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. Genomics.

    ReplyDelete
  5. If you want to consider read error correction, you may wish to consider trying Musket: http://www.ncbi.nlm.nih.gov/pubmed/23202746

    ReplyDelete
  6. great post, thanks for sharing it.

    ReplyDelete
  7. Thanks for sharing. Does someone know the script to call samples one by one in flash to join the reads ?

    Thanks

    ReplyDelete
  8. Wasim, you need to learn how to use a loop command like "for" in a shell script: http://www.thegeekstuff.com/2011/07/bash-for-loop-examples/

    ReplyDelete
  9. ABySS includes a tool, abyss-mergepairs, for merging overlapping paired-end reads.

    https://github.com/bcgsc/abyss/blob/master/Align/mergepairs.cc

    ReplyDelete
  10. Another tool called PEAR has also been published:
    http://bioinformatics.oxfordjournals.org/content/early/2013/11/10/bioinformatics.btt593.full
    Looks like it doesn't need much prior information, and axhaustively tries all possibilities.

    ReplyDelete
  11. Great stuff, across the ditch we're starting to run 2x300bp PE reads on MiSeqs, for small eukaryotic genomes (eg fungi).

    ReplyDelete
  12. Also, with regards to Jens post, can we please put a stake through the heart of the N50 statistic. There may be a single number that tells you how "good" an assembly is, but N50 ain't it. Maybe try some of the newer programs like ALE, CGAL, REAPR to assess the "goodness" of an assembly.

    ReplyDelete
  13. Question, Illumina has a built in script into their MSR software (for the miseq) that processes the PE files into a single file with all of the reads. According to Illumina and some others, the script uses the coordinate positions of the read clusters on the flowcell (which apparently is buried somewhere in the FASTQ file ) to quickly synthesize the new composite/joined/stitched reads. Any comments on how this algorithm stacks up against the traditional methods?

    ReplyDelete
    Replies
    1. The FASTQ IDs have a bunch of numbers with colons between them, these are flow cell coordinates. I'm not sure how this helps with stitching, as the Read1 and Read2 come from the same cluster, and have the same FASTQ IDs. Can this stitching script be enabled from the MiSeq GUI ?

      Delete
    2. Hi. Here is a little information about the MiSeq's behaviour with regard to read stitching (see p7-8 "Read Stitching"), and how to enable it for the FastQ generation workflow.

      http://supportres.illumina.com/documents/documentation/software_documentation/miseqreporter/miseq-reporter-generatefastq-workflow-guide-15042322-b.pdf

      Delete
  14. This comment has been removed by the author.

    ReplyDelete
  15. Is it a good idea to make F=180bp where L=101 to get the overlapping reads ?

    ReplyDelete
    Replies
    1. Munna, that should work OK.

      However I would recommend using PEAR now, as it exhaustively tests all the overlaps and does not require you to supply any parameters.

      Delete
  16. Dear Dr. Seeman,

    Indeed a nice article. may you add PEAR now in the main text because it will help very initially to choose it as a program to work. Thanks

    ReplyDelete
  17. Dr. Seeman,

    In unpublished collaborative work, I had implemented such merging of paired-end / mate-pair reads to form merged, longer reads back in 2008-2009 as part of a WWW-accessible EST (Expressed Sequence Tags) analyses pipeline called "ArthropodEST" [ http://arthropodest.ksu.edu/ ], to help improve contig assemblies, using FASTA files containing labeled forward and reverse reads. We (collaborators and self) had called it "pre-pairing" (i.e., preparing the reads for better assembly) occurring just before contig-assembly using CAP3 assembler. The motivation was to improve contig-assembly by "pre-pairing", because these reads were otherwise either in different contigs or remained as singletons. At that time, we used to receive FASTA sequences without accompanying quality scores, so we figured out an algorithm to form the consensus merged sequence based on published knowledge that these reads have low-quality sequence within about 20% of read-ends. This "pre-pairing" algorithm used BLASTN to align the forward-reverse pair at an e-value of 1e-04. If the resulting top hit overlap is atleast 20 bases long, occurring completely within 20% of the ends of both reads of the pair, then a merged consensus sequence for the pair was determined. Otherwise, failing all these conditions, these reads would remain as-is.

    Sanjay.

    ReplyDelete
    Replies
    1. You were ahead of your time!
      We need faster methods now due to millions of sequences.

      Delete
  18. This comment has been removed by the author.

    ReplyDelete
  19. Does it make sense to try overlap when F=5000 bp and L=101?

    ReplyDelete
    Replies
    1. No. In that case you have "long mate pair" reads where R1 and R2 are not near each on the genome. It makes no sense to overlap them.

      Delete
  20. can we use pear reads merger and then abyss-konnector to use fill the gap between reads and then abyss-pe to assembled data?

    ReplyDelete
    Replies
    1. No. If there are gaps between R1 and R2 you can not overlap them.

      Delete
    2. sorry no gaps but abyss-konnector connect paired-end reads by navigating a Bloom filter de Bruijn graph so can i use pear --> musket --> abyss-konnector-->abyss-pe in pipeline?
      i tried this with my dataset and i got good results at hihger kmer as most of the data was merged by abyss-konnector but i want to be sure about abyss konnector ?

      Delete
    3. You need to be careful of PEAR because it will do the wrong overlap in regions with a short tandem repeat. A paired-end assembler is less likely to make this mistake.

      Delete

  21. Hi, I am doing a denovo genome assembly. I have (35-300bp) paired-end reads from Illumina Miseq.
    I am running softwares upon software and changing parâmeters to achieve better results.

    But as I go through these softwares and meet new parameters, some of them stands as doubts to me.
    Here they are:

    1-If my paired-end reads are overlapping, how can I now it from them by themselves?
    If they are overlapping and I use COPE,FLASH, etc to do overlap, will this step cause any problem
    to my reads?

    2- I am using SSPACE to do scaffolds. One of the parameters SSPACE needs in its library file is the
    "inner mate distance". Some one told me it is something like 0-300 bp.

    I tried insert this value as 200 and it worked. But, How can I know if this is the better? How can I know
    the inner mate distance based only in the information of my fastq files?


    Note: I did not make the library. I received these files from other people to assemble the genome.

    Thank oyu since now.

    ReplyDelete
    Replies
    1. Yes FLASH can be used to overlap them. It could sometimes do the wrong overlap, but not common. you need to align your reads to a reference genome to get a good idea of the inner distance for SSPACE.

      Delete
  22. Great and informative post! I am attempting to assemble a ~30Gb genome using low depth (20x) 150bp PE illumina reads. The data is in two groups v1 (roughly 15x coverage) and v2 (the remaining 5x coverage) producing two pairs of read files. I've trimmed the adaptors and low quality reads as well as merged the pairs in each version into a single file but as the merged pairs are differing lengths, I am unable to assemble properly. The two versions were created using different primers, as well, so the reads for each pair will not merge properly with each other. Do you have any suggestions on how I can assemble with low coverage and these data? Thank you!

    ReplyDelete
    Replies
    1. Before you overlap reads, I recommend you use ABYSS to do a rough assembly of your data. It is very low coverage so it may be hard to get a good result. If that works ok, then use DiscovarDenovo on the original reads. It does not matter if the reads are different lengths.

      Delete