Thursday, 10 November 2011

Compressing FASTQ reads by splitting into homogeneous streams

Today I took a FASTQ file with 3.5M reads, which was Read 1 from a paired-end Illumina 100bp run; it was about 883 MB in size. As many have shown before me, GZIP compresses it to about 1/4 of the size, and BZIP2 to about 1/5. All sizes below are in kilobytes.
  • 883252 R1.fastq
  • 233296 R1.fastq.gz
  • 182056 R1.fastq.bz2
I then split the read file into 3 separate files: (1) the ID line, with the mandatory '@' removed, (2) the sequence line, uppercased for consistency, and (3) the quality line, unchanged. I ignored the 3rd line of each FASTQ entry, as it is redundant. This knocked about 1% off the total size.
  • 189588 id.txt
  • 341756 seq.txt
  • 341756 qual.txt
  • 873100 TOTAL
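The split described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the numbers in this post: the function name `split_fastq` is made up, and it assumes every record is exactly four lines (no wrapped sequence or quality lines).

```python
import io


def split_fastq(fastq, ids, seqs, quals):
    """Split 4-line FASTQ records into ID, sequence and quality streams.

    Hypothetical helper; assumes strict 4-line records.
    """
    while True:
        header = fastq.readline()
        if not header:
            break  # end of input
        seq = fastq.readline()
        fastq.readline()  # 3rd line ('+...') is redundant, so it is dropped
        qual = fastq.readline()
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of record: %r" % header)
        ids.write(header[1:].rstrip("\n") + "\n")    # drop the mandatory '@'
        seqs.write(seq.rstrip("\n").upper() + "\n")  # uppercase for consistency
        quals.write(qual.rstrip("\n") + "\n")        # quality kept unchanged


# Tiny in-memory demo; real use would pass open file handles instead.
demo = io.StringIO("@read1\nacgtn\n+read1\nIIIII\n")
id_out, seq_out, qual_out = io.StringIO(), io.StringIO(), io.StringIO()
split_fastq(demo, id_out, seq_out, qual_out)
```

For a real file you would open `R1.fastq` for reading and `id.txt`, `seq.txt` and `qual.txt` for writing, and pass those four handles in the same order.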
Next, I compressed each of the three streams (ID, sequence, quality) independently with GZIP. The idea is that dictionary-based compression schemes should work better on homogeneous data streams than on the same data interleaved in one stream. As you can see, this does improve things by about 11%, but the result is still not as good as BZIP2 without de-interleaving.
  •  20608 id.txt.gz
  •  84096 qual.txt.gz
  • 102040 seq.txt.gz
  • 206644 TOTAL (was 233296 combined)
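If you want to try the same comparison on your own reads without touching the disk, something like the following should do. It is a rough sketch: `compressed_sizes` is a made-up name, the records are assumed to be already parsed into `(id, seq, qual)` tuples, and the interleaved form is rebuilt with a bare `+` on the 3rd line, so the absolute numbers will not match the command-line tools exactly.

```python
import bz2
import gzip


def compressed_sizes(records, compress):
    """Return (interleaved_size, [id_size, seq_size, qual_size]) in bytes.

    records  : list of (id, seq, qual) string tuples (assumed already parsed)
    compress : any bytes -> bytes function, e.g. gzip.compress or bz2.compress
    """
    # One interleaved FASTQ-like stream, 4 lines per record.
    interleaved = "".join("@%s\n%s\n+\n%s\n" % r for r in records)
    # Three homogeneous streams: all IDs, all sequences, all qualities.
    streams = ["".join(field + "\n" for field in column)
               for column in zip(*records)]
    whole = len(compress(interleaved.encode("ascii")))
    split = [len(compress(s.encode("ascii"))) for s in streams]
    return whole, split


# Toy input, only to show the call; use real reads for meaningful sizes.
toy = [("read%d" % i, "ACGTTGCA" * 5, "IIIIHHHH" * 5) for i in range(100)]
gz_whole, gz_split = compressed_sizes(toy, gzip.compress)
bz_whole, bz_split = compressed_sizes(toy, bz2.compress)
```

Comparing `gz_whole` with `sum(gz_split)` gives the same kind of "combined versus de-interleaved" figure as the totals in this post.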
If we use BZIP2 to compress the de-interleaved streams, it does only about 3% better than on the single interleaved stream. This is testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP.
  •  16560 id.txt.bz2
  •  66812 qual.txt.bz2
  •  93564 seq.txt.bz2
  • 176936 TOTAL (was 182056 combined)
So in summary, we've re-learnt that BZIP2 is better than GZIP, and that they are both doing quite well adapting to the three interleaved data types in a FASTQ file.

1 comment:

  1. The total for the deinterleaved bzip2 method is 4.04 bits per nucleotide in the original file. The DNA is 2.14 bits of that, the qualities are 1.53, and the IDs are 0.37. I think sorting the reads lexicographically would probably help both gzip (LZ77) and bzip2 (BWT+).
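The sorting idea in the comment is easy to test on the sequence stream. A minimal sketch, assuming the reads are already held in memory as `(id, seq, qual)` tuples; `seq_stream_size` is a made-up name. Note that sorting throws away the original read order, so for paired-end data the Read 2 file would need the same permutation applied.

```python
import bz2


def seq_stream_size(records, sort_reads, compress=bz2.compress):
    """Compressed size in bytes of the sequence stream.

    records    : list of (id, seq, qual) string tuples (assumed in memory)
    sort_reads : if True, order the reads lexicographically by sequence first
    compress   : any bytes -> bytes function (bz2.compress by default)
    """
    if sort_reads:
        records = sorted(records, key=lambda r: r[1])
    data = "".join(seq + "\n" for _id, seq, _qual in records)
    return len(compress(data.encode("ascii")))
```

Calling it twice on the same reads, once with `sort_reads=False` and once with `sort_reads=True`, shows whether sorting helps for a given data set; passing `compress=gzip.compress` (after `import gzip`) runs the same check for gzip.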