The Genome Factory: May 2012

While browsing SeqAnswers.com today I came across a post in which Uwe Appelt provided a couple of lines of Unix shell wizardry to solve a problem. What caught my attention was the following:

paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq

Now, I've written a reasonable number of shell one-liners in my life, but I'd never seen this before. I've used the paste command a couple of times, but clearly its potential did not sink in! Here is the man page description for paste:

Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.

So what's happening here? Well, by convention many Unix commands treat the "-" character as meaning "read from STDIN instead of a named file". Here Uwe is providing paste with four filenames, each of which is the same stdin filehandle. Because paste takes one line from each "file" in turn, and all four share a single stream, lines 1..4 of in.fq are put onto one line (with tab separators), lines 5..8 on the next line, and so on. Now our stream has the four lines of each FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in my example. Once that's all done, we need to put it back into the standard 4-line FASTQ format, which is as simple as converting the tabs "\t" back to newlines "\n" with the tr command.
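To make this concrete, here is a minimal sketch on a made-up two-record FASTQ file. The file names demo.fq and demo.filtered.fq, the read names, and the 6-base length cut-off are all assumptions chosen for illustration, and awk is just one possible stand-in for filter.

```shell
# Write a tiny, made-up FASTQ file (two records) for illustration.
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'ACGT' '+' 'IIII' > demo.fq

# One record per line: ID, sequence, "+" line and qualities, separated by TABs.
paste - - - - < demo.fq

# awk stands in for "filter": keep records whose sequence (column 2)
# is at least 6 bases long, then restore the 4-line layout.
paste - - - - < demo.fq \
  | awk -F '\t' 'length($2) >= 6' \
  | tr "\t" "\n" > demo.filtered.fq
```

With this input, only the read1 record should end up in demo.filtered.fq, since read2 is only 4 bases long.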

Example 1: FASTQ to FASTA

A common thing to do is convert FASTQ to FASTA, and we don't always have our favourite tool or script to do this when we aren't on our own servers:

paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

paste converts the input FASTQ into a 4-column file
cut extracts just column 1 (the ID) and column 2 (the sequence)
sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
tr converts the 2 columns back into 2 lines
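
The steps above can be tried on a small made-up input. The file names demo.fq and demo.fa and the read names are assumptions for illustration only.

```shell
# Made-up two-record FASTQ input for illustration.
printf '%s\n' \
  '@read1' 'ACGTACGT' '+read1' 'IIIIIIII' \
  '@read2' 'TTGCA' '+read2' 'IIIII' > demo.fq

# Same pipeline as above, applied to the demo file.
paste - - - - < demo.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > demo.fa

cat demo.fa
# >read1
# ACGTACGT
# >read2
# TTGCA
```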

And because the shell command above uses a pipe connecting four commands (paste, cut, sed, tr), the operating system will run them all concurrently, which can make it run faster on a multi-core machine, assuming your disk I/O can keep up.
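
A rough way to see that the stages of a pipeline really do run side by side is to give each stage an artificial one-second delay. This is only a sketch: the sleep calls are stand-ins for real work, and the measured time depends on how busy the machine is.

```shell
# Each of the three stages sleeps for one second. Run one after another
# they would need about three seconds; run concurrently, about one.
start=$(date +%s)
{ sleep 1; echo done; } | { sleep 1; cat; } | { sleep 1; cat; } > /dev/null
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

On an otherwise idle machine the reported time should be well under the three seconds a strictly sequential run would take.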

Example 2: Removing redundant FASTQ ID in line 3

The third line in the FASTQ format is somewhat redundant - it is usually a duplicate of the first line, except with "+" instead of "@" to denote that a quality string is coming next rather than an ID. Most parsers ignore it, and happily accept a blank ID after the "+", which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to convert them, here's how we can do it with our new paste trick:

paste -d ' ' - - - - < in.fq | sed 's/ +[^ ]*/ +/' | tr " " "\n" > out.fq

paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character (this assumes the ID lines themselves contain no spaces)
sed finds and replaces the "+DUPE_ID" line with just a "+"
tr converts the 4 columns back into 4 lines
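
If your ID lines do contain spaces, a space separator will split them when tr runs. The sketch below is a variant of my own, not Uwe's command: it keeps TAB as the separator and uses awk to rebuild each record. The file names and the header text are made up for illustration.

```shell
# Made-up record whose ID line contains a space.
printf '%s\n' \
  '@read1 1:N:0' 'ACGTACGT' '+read1 1:N:0' 'IIIIIIII' > demo.fq

# Keep TAB as the separator so spaces in the ID survive, then print
# columns 1, 2 and 4 with a bare "+" in place of column 3.
paste - - - - < demo.fq \
  | awk -F '\t' -v OFS='\n' '{ print $1, $2, "+", $4 }' > demo.slim.fq

cat demo.slim.fq
# @read1 1:N:0
# ACGTACGT
# +
# IIIIIIII
```

Setting OFS to a newline lets awk write the four fields straight back out as four lines, so no final tr step is needed.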

That's it for today, hope you learnt something, because I certainly did.

Sunday 20 May 2012

Cool use of Unix paste with NGS sequences