The Genome Factory: Tools to merge overlapping paired-end reads

Sunday, 11 November 2012

Introduction

In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into many shorter pieces. The resulting set of DNA is called a "library" and the short pieces are called "fragments". Each fragment in the library is then sequenced individually and in parallel. There are two ways of sequencing a fragment: from just one end, or from both ends. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F), otherwise the sequencer will run out of DNA to read! Typical Illumina fragment libraries use F ~ 450bp, but this is variable. For paired-end reads, you want F to be long enough to fit two reads, which means F should be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine: there is still a safety margin in the middle.

However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired-end reads will overlap. Secondly, the new enzymatic Nextera library preparation system produces a wider spread of F sizes than the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.
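As a rough sanity check on the numbers above, the expected overlap between the two reads of a pair can be estimated from F and L alone. The sketch below is an idealised calculation only: the function name expected_overlap is made up for this example, and it ignores adaptors, quality trimming and the spread of fragment sizes within a real library.

```sh
# expected_overlap F L
# Prints how many bases the two reads of a pair share, given a fragment
# length F and a read length L (both in bp). Idealised: no adaptors,
# no trimming, both reads exactly L bases long.
expected_overlap() {
    f=$1
    l=$2
    if [ "$f" -ge $((2 * l)) ]; then
        echo 0                  # reads do not meet: unsequenced gap in the middle
    elif [ "$f" -le "$l" ]; then
        echo "$f"               # each read spans the whole fragment
    else
        echo $((2 * l - f))     # partial overlap in the middle
    fi
}

expected_overlap 450 100   # prints 0
expected_overlap 450 250   # prints 50
expected_overlap 100 150   # prints 100
```

With a Nextera-style spread of fragment sizes, running this over a range of F values for a fixed L shows why only part of such a library ends up as overlapping pairs.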

The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F is a bit like a single read of length F, except that a bunch of bases in the middle are unknown, and the number of unknown bases is only roughly known (libraries are only nominally of length F; each fragment will vary). This gives the reads a longer context, which particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools get confused if you give them overlapping pairs. If we could overlap them and turn them into longer single-end reads, many tools would produce better results, and faster.
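To make the idea of "overlapping" a pair concrete, here is a toy sketch of the core step: reverse-complement the second read, look for the longest stretch where the end of read 1 matches the start of that reverse complement, and join the two. It is deliberately naive and is not how any of the tools below are implemented. It demands an exact match, ignores base qualities, and does not handle fragments shorter than the read length. The function names revcomp and merge_pair are made up for this example, and it assumes rev, tr and awk are available.

```sh
# revcomp SEQ: reverse-complement a DNA string (N is left unchanged).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# merge_pair R1 R2 [MIN_OVERLAP]
# R1 and R2 are the two reads as they come off the sequencer (R2 is read
# from the opposite strand). Prints the merged sequence, or prints nothing
# and returns 1 if no exact overlap of at least MIN_OVERLAP (default 10)
# bases is found.
merge_pair() {
    printf '%s %s %s\n' "$1" "$(revcomp "$2")" "${3:-10}" | awk '{
        r1 = $1; r2 = $2; min = $3 + 0
        n1 = length(r1); n2 = length(r2)
        max = (n1 < n2) ? n1 : n2
        # try the longest candidate overlap first
        for (o = max; o >= min; o--) {
            # do the last o bases of r1 equal the first o bases of r2?
            if (substr(r1, n1 - o + 1) == substr(r2, 1, o)) {
                print r1 substr(r2, o + 1)
                exit 0
            }
        }
        exit 1
    }'
}

# A 20bp fragment sequenced as 2x14bp, so the reads share 8 bases:
merge_pair ACGTACGTTTGACC TAGCCTGGTCAAAC 5   # prints ACGTACGTTTGACCAGGCTA
```

The merged read has length 2L minus the overlap, which is the fragment length F. Real tools additionally tolerate mismatches in the overlap and decide which base and quality to keep at each overlapping position.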

The tools

Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!

PEAR (Paired-End Read Merger)
http://sco.h-its.org/exelixis/web/software/pear/doc.html (* this is what I use)
COPE (Connecting Overlapping Paired End reads)
http://sourceforge.net/projects/coperead/
SeqPrep
https://github.com/jstjohn/SeqPrep
FLASH (Fast Length Adjustment of Short Reads to Improve Genome Assemblies)
http://www.cbcb.umd.edu/software/flash
fastq-join (part of ea-utils)
http://code.google.com/p/ea-utils/wiki/FastqJoin
PANDAseq
https://github.com/neufeld/pandaseq
stitch (now defunct, merged into PANDAseq)
https://github.com/audy/stitch
mergePairs.py
http://code.google.com/p/standardized-velvet-assembly-report/source/browse/trunk/mergePairs.py

Features to look for

Keeps original IDs in merged reads
Outputs the un-overlapped paired reads
Ability to strip adaptors first
Rescores the Phred qualities across the overlapped region
Parameters to control the overlap sensitivity
Handle .gz and .bz2 compressed files
Multi-threading support
Written in C/C++ (faster compiled) rather than Python/Perl (slower)

39 comments:

jvhaarst, 13 November 2012 at 20:55
Allpaths-LG can now also do fragment filling:
http://www.broadinstitute.org/software/allpaths-lg/blog/?p=577

Unknown, 27 March 2013 at 18:45
So I'm trying to assemble a metagenome. Thought maybe I'll run my shotgun sequences through FLASH or COPE first.
The default output of FLASH contains one FASTQ file with the combined sequences and two with those that remain uncombined (one from each input FASTQ).
Would you feed just the combined FASTQ to Velvet, SOAPdenovo or another assembler, or would you concatenate all three FASTQ files? It seems to me the uncombined files still contain bits and pieces that could be useful to the assembly, even though they are shorter.

Torsten Seemann, 28 March 2013 at 08:50
Assembling a metagenome is difficult at the best of times! Velvet can usually cope OK with overlapping paired-end reads, but it will probably do better if you pre-overlap them with FLASH. You can give Velvet both the overlapped (SE) and non-overlapped (PE) reads:

velveth Dir K \
  -short -fastq COMBINED.fq \
  -shortPaired -fastq -separate NOTCOMBINED_R1.fq NOTCOMBINED_R2.fq
velvetg Dir -exp_cov auto -cov_cutoff auto

However, metagenomes are tricky, as different genomes will have different abundances in the mixture. You may want to consider partitioning the reads first:

http://www.pnas.org/content/early/2012/07/25/1121464109.abstract

or using digital normalization to equalize the abundance with khmer:

https://khmer.readthedocs.org/en/latest/guide.html#metagenome-assembly


Unknown, 28 March 2013 at 19:40
I was thinking of using COPE, as it seems to produce a better N50 after assembly than FLASH (as per their own paper), and then following it with SOAPdenovo.
Digital normalization is something that crossed my mind too, as I came across this article 1. I'll take a look at the ones you posted too.
I was also thinking of using Quake as the very first step. I just installed it and have to check that it won't mess with the order of the paired-end reads in the files.

1. Brown, C. T., Howe, A., Zhang, Q., Pyrkosz, A. B., & Brom, T. H. (2012). A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. Genomics.

Torsten Seemann, 29 March 2013 at 07:45
If you want to consider read error correction, you may wish to consider trying Musket: http://www.ncbi.nlm.nih.gov/pubmed/23202746

iammaxwelldemon, 9 May 2013 at 01:51
great post, thanks for sharing it.

Unknown, 23 July 2013 at 00:23
Thanks for sharing. Does anyone know of a script to run FLASH on samples one by one to join the reads?

Thanks

Torsten Seemann, 26 July 2013 at 15:25
Wasim, you need to learn how to use a loop command like "for" in a shell script: http://www.thegeekstuff.com/2011/07/bash-for-loop-examples/
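
As a starting point, a "for" loop over samples might look like the sketch below. It makes several assumptions that will need adjusting: the files are named SAMPLE_R1.fastq and SAMPLE_R2.fastq, they all sit in one directory, and the function name flash_all and the "merged" output folder are invented for this example. The -o (output prefix) and -d (output directory) options are FLASH's own. By default the function only prints the commands so they can be checked first; set RUN=1 to actually execute them.

```sh
# flash_all DIR
# For every DIR/<sample>_R1.fastq that has a matching <sample>_R2.fastq,
# print the FLASH command that would merge that pair.
# Set RUN=1 in the environment to execute the commands instead.
flash_all() {
    dir=$1
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2="${r1%_R1.fastq}_R2.fastq"         # swap the suffix to find the mate file
        sample=$(basename "$r1" _R1.fastq)
        if [ ! -e "$r2" ]; then
            echo "skipping $sample: no R2 file" >&2
            continue
        fi
        if [ "${RUN:-0}" = 1 ]; then
            flash -o "$sample" -d "$dir/merged" "$r1" "$r2"
        else
            echo flash -o "$sample" -d "$dir/merged" "$r1" "$r2"
        fi
    done
}

# Dry run:   flash_all /path/to/fastq
# For real:  RUN=1 flash_all /path/to/fastq
```

Each sample then gets its own set of output files named after the sample prefix, so nothing is overwritten between iterations.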

Shaun Jackman, 8 August 2013 at 03:49
ABySS includes a tool, abyss-mergepairs, for merging overlapping paired-end reads.

https://github.com/bcgsc/abyss/blob/master/Align/mergepairs.cc

Torsten Seemann, 14 November 2013 at 15:33
Another tool called PEAR has also been published:
http://bioinformatics.oxfordjournals.org/content/early/2013/11/10/bioinformatics.btt593.full
It looks like it doesn't need much prior information, and exhaustively tries all possibilities.

Dan, 6 December 2013 at 06:13
Great stuff, across the ditch we're starting to run 2x300bp PE reads on MiSeqs, for small eukaryotic genomes (eg fungi).

Dan, 6 December 2013 at 06:21
Also, with regard to Jens's post, can we please put a stake through the heart of the N50 statistic? There may be a single number that tells you how "good" an assembly is, but N50 ain't it. Maybe try some of the newer programs like ALE, CGAL or REAPR to assess the "goodness" of an assembly.

Unknown, 18 February 2014 at 09:28
Question: Illumina has a script built into their MSR software (for the MiSeq) that processes the PE files into a single file with all of the reads. According to Illumina and some others, the script uses the coordinate positions of the read clusters on the flowcell (which apparently are buried somewhere in the FASTQ file) to quickly synthesize the new composite/joined/stitched reads. Any comments on how this algorithm stacks up against the traditional methods?

Munna, 25 March 2014 at 17:26
Is it a good idea to make F=180bp where L=101 to get overlapping reads?

Unknown, 8 April 2014 at 19:28
Dear Dr. Seemann,

A nice article indeed. Could you add PEAR to the main text now? It would help readers choose it as a program to work with from the start. Thanks

Unknown, 12 March 2015 at 02:11
Does it make sense to try to overlap when F=5000 bp and L=101?

Unknown, 28 April 2015 at 00:58
Can we use the PEAR read merger, then abyss-konnector to fill the gap between reads, and then abyss-pe to assemble the data?

Unknown, 8 July 2015 at 04:24

Hi, I am doing a de novo genome assembly. I have paired-end reads (35-300bp) from an Illumina MiSeq.
I am running one piece of software after another and changing parameters to achieve better results.

But as I go through these tools and meet new parameters, some of them leave me with doubts.
Here they are:

1- If my paired-end reads are overlapping, how can I tell from the reads themselves?
If they are overlapping and I use COPE, FLASH, etc. to overlap them, will this step cause any problems
for my reads?

2- I am using SSPACE to do scaffolding. One of the parameters SSPACE needs in its library file is the
"inner mate distance". Someone told me it is something like 0-300 bp.

I tried setting this value to 200 and it worked. But how can I know if this is the best value? How can I work out
the inner mate distance based only on the information in my FASTQ files?

Note: I did not make the library. I received these files from other people to assemble the genome.

Thank you in advance.
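
One rough way to approach both questions, offered as a sketch under stated assumptions rather than a definitive recipe: the inner distance between two mates is the fragment length minus both read lengths (F - 2L), so a negative value means the reads overlap. And because a merged read is as long as the fragment it came from, the mean length of the merged reads from one of the tools above gives an estimate of F for the pairs that did overlap. That estimate is biased towards short fragments, since pairs from long fragments never merge. The function names inner_distance and mean_read_length are made up for this example, the FASTQ file is assumed to use four lines per record, and whether a given scaffolder expects this exact quantity should be checked in its own documentation.

```sh
# inner_distance F L
# Gap between the inner ends of the two reads, in bp.
# A negative result means the reads overlap by that many bases.
inner_distance() {
    echo $(( $1 - 2 * $2 ))
}

# mean_read_length FASTQ
# Mean sequence length in an uncompressed FASTQ file with 4 lines per
# record. Run on merged reads, this approximates the mean fragment length
# of the pairs that overlapped.
mean_read_length() {
    awk 'NR % 4 == 2 { total += length($0); n++ }
         END { if (n) printf "%.1f\n", total / n }' "$1"
}

inner_distance 450 100   # prints 250
inner_distance 180 101   # prints -22
```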

Unknown, 29 August 2016 at 15:53
Great and informative post! I am attempting to assemble a ~30Gb genome using low-depth (20x) 150bp PE Illumina reads. The data are in two groups, v1 (roughly 15x coverage) and v2 (the remaining 5x coverage), producing two pairs of read files. I've trimmed the adaptors and low-quality reads, and merged the pairs in each version into a single file, but as the merged pairs have differing lengths, I am unable to assemble properly. The two versions were also created using different primers, so the reads for each pair will not merge properly with each other. Do you have any suggestions on how I can assemble with low coverage and these data? Thank you!

Torsten Seemann, 22 October 2016 at 16:01
You were ahead of your time!
We need faster methods now due to millions of sequences.

Unknown, 28 March 2018 at 09:21
Hi! I'm using QIIME1 to analyse my data and I am trying to merge my read files into one using SeqPrep and generate an updated barcode file with it but I only seem to get the merged file and not the updated barcode file? Any tips? :)