Softpanorama

May the source be with you, but remember the KISS principle ;-)
Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

The FASTQ Format

The most common format for DNA sequencing data are FASTA and FASTQ. Both are text-based formats. FASTQ is an extension of FASTQ. It adds quality scores.

For a given FASTQ file, every four lines represent a single DNA sequence.

The general syntax of a FASTQ file is as follows:

<fastq>:= <block>+
<block>:=@<seqname>\n<seq>\n[<seqname>]\n<qual>\n
<seqname>:= [A-Za-z0-9_.:-]+
<seq>:= [A-Za-z\n\.~]+
<qual>:= [!-~\n]+

We will discuss here only one  topic: FASTA/FASTQ compression.  currently there is no standardized compressor for FASTA/FASTQ files and often general purpose archivers are used as a substitute. Typically gzip/pigz (the power of inertia), more rarely bzip2/pbzip2 (the best combination of speed and compression ratio for this type of data), even more rarely xz; the latter has much slower compression speed than bzip2).

Here is a relevant quote about bzip2 from the chapter 8 (Contextual Data Transforms Practical Implementations) of the book Understanding Compression Data Compression for Modern Developers Colt by McAnlis, Aleks Haecky  (O'Reilly, 2016)

BWT and DNA

BWT has always been an edge-case of compression. Its initial existence showed really good results for text-based data, but it could never compete from a performance perspective with other algorithms such as GZIP. As such, BWT (or bzip2, the dominant BWT encoder) never really took the compression world by storm.

That is, until humans began sequencing deoxyribonucleic acid, or DNA.

Human DNA has a pretty simple setup with only four basic nucleotide bases, labeled A, C, G, and T. A given genome is basically a massive string containing these four symbols in various orderings. How much? Well, the human genome contains about 3.1647 billion DNA base pairs.

It turns out that BWT’s block-sorting algorithm is an ideal transform that could be applied to DNA to make it more compressible, searchable, and retrievable. (There’s actually a boat-load of papers proving this.) The reduction in size and availability for fast reads are of high importance when aligning reads of new genomes against a reference.

This just goes to show how there’s no single silver bullet when it comes to data compression. Each stream of information has its own variable characteristics and responds differently to different transforms and encoders. Although BWT might not have taken the web away from its cousin, GZIP, it stands alone as an important factor in the next few decades of bioinformatics.

Good overview of compression of FASTQ is given in LFQC a lossless compression algorithm for FASTQ files Bioinformatics Oxford Academic

Compression of nucleotide sequences has been an interesting problem for a long time. Cox etal. (2012) apply Burrows–Wheeler Transform (BWT) for compression of genomic sequences. GReEn (Pinho etal., 2012) is a reference-based sequence compression method offering a very good compression ratio. (Compression ratio refers to the ratio of the original data size to the compressed data size). Compressing the sequences along with quality scores and identifiers is a very different problem. Tembe etal. (2010) have attached the Q-scores with the corresponding DNA bases to generate new symbols out of the base-Q-score pair. They encoded each such distinct symbol using Huffman encoding (Huffman, 1952). Deorowicz and Grabowski (2011) divided the quality scores into three sorts of quality streams:

  1. quasi-random with mild dependence on quality score positions.
  2. quasi-random quality scores with strings ending with several # characters.
  3. quality scores with strong local correlations within individual records.

To represent case 2 they use a bit flag. Any string with that specific bit flag is processed to remove all trailing #. They divide the quality scores into individual blocks and apply Huffman coding. Tembe etal. (2010) and Deorowicz and Grabowski (2011) also show that general purpose compression algorithms do not perform well and domain specific algorithms are indeed required for efficient compression of FASTQ files. In their papers they demonstrate that significant improvements in compression ratio can be achieved using domain specific algorithms compared with bzip2 and gzip.

The literature on FASTQ compression can be divided into two categories, namely lossless and lossy. A lossless compression scheme is one where we preserve the entire data in the compressed file. On the contrary, lossy compression techniques allow some of the less important components of data to be lost during compression. Kozanitis etal. (2011) perform randomized rounding to perform lossy encoding. Asnani etal. (2012) introduce a lossy compression technique for quality score encoding. Wan etal. (2012) proposed both lossy and lossless transformations for sequence compression and encoding. Hach etal. (2012) presented a ‘boosting’ scheme which reorganizes the reads so as to achieve a higher compression speed and compression rate, independent of the compression algorithm in use. When it comes to medical data compression it is very difficult to identify which components are unimportant. Hence, many researchers believe that lossless compression techniques are particularly needed for biological/medical data. Quip (Jones etal., 2012) is one such lossless compression tool. It separately compresses the identifier, sequence and quality scores. Quip makes use of Markov Chains for encoding sequences and quality scores. DSRC (Deorowicz and Grabowski, 2011) is also considered a state of the art lossless compression algorithm which we compare with ours in this article. Bonfield and Mahoney (2013) have come up with a set of algorithms (named Fqzcomp and Fastqz) to compress FASTQ files recently. They perform identifier compression by storing the difference between the current identifier and the previous identifier. Sequence compression is performed by using a set of techniques including base pair compaction, encoding and an order-k model. Apart from hashing, they have also used a technique for encoding quality values by prediction.

See also Note on 2 bit compression of FASTA files

The FASTQ format extends FASTA by including a numeric quality score to each base in the sequence. The FASTQ format is widely used to store high-throughput sequencing data, which is reported with a per-base quality score indicating the confidence of each base call. Unfortunately like FASTA, FASTQ has variants and pitfalls that can make the seemingly simple format frustrating to work with.

Each entry in a FASTQ file consists of four lines:
  1. Sequence identifier. It begins with @ followed by a sequence identifier and an optional description.
  2. Read sequence. This is essentially a sequence of letters.
  3. Quality score identifier line (consisting only of a +)
  4. Quality score. This is a string of characters. Line 2 and line 4 must be of equal lengths.

 

The FASTQ format looks like:

@DJB775P1:248:D0MDGACXX:7:1202:12362:49613 1
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA 2
+ 3
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA 4
@DJB775P1:248:D0MDGACXX:7:1202:12782:49716
CTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTAGATCGG
+
IIIIIIIIIIIIIIIHHHHHHFFFFFFEECCCCBCECCCCCCCCCCCCCCCC
1
The description line, beginning with @. This contains the record identifier and other information.
2
Sequence data, which can be on one or many lines.
3
The line beginning with +, following the sequence line(s) indicates the end of the sequence. In older FASTQ files, it was common to repeat the description line here, but this is redundant and leads to unnecessarily large FASTQ files.
4
The quality data, which can also be on one or many lines, but must be the same length as the sequence. Each numeric base quality is encoded with ASCII characters using a scheme we’ll discuss later (“Base Qualities”).

As with FASTA, it’s a common convention in FASTQ files to split description lines by the first space into two parts: the record identifier and comment.

FASTQ is deceivingly tricky to parse correctly. A common pitfall is to treat every line that begins with @ as a description line. However, @ is also a valid quality character. FASTQ sequence and quality lines can wrap around to the next line, so it’s possible that a line beginning with @ could be a quality line and not a header line. Consequently, writing a parser that always treats lines beginning with @ as header lines can lead to fragile and incorrect parsing. However, we can use the fact that the number of quality score characters must be equal to the number of sequence characters to reliably parse this format—which is how the readfq parser introduced later on works.

 

Base Qualities

Each sequence base of a FASTQ entry has a corresponding numeric quality score in the quality line(s). Each base quality scores is encoded as a single ASCII character. Quality lines look like a string of random characters, like the fourth line here:

@AZ1:233:B390NACCC:2:1203:7689:2153
GTTGTTCTTGATGAGCCATGAGGAAGGCATGCCAAATTAAAATACTGGTGCGAATTTAAT
+
CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIIGIGIJIJIIIIJIJJIJIIH

(This FASTQ entry is in this chapter’s README file if you want to follow along.)

Remember, ASCII characters are just represented as integers between 0 and 127 under the hood (see man ascii for more details). Because not all ASCII characters are printable to screen (e.g., character echoing "\07" makes a “ding” noise), qualities are restricted to the printable ASCII characters, ranging from 33 to 126 (the space character, 32, is omitted).

All programming languages have functions to convert from a character to its decimal ASCII representation, and from ASCII decimal to character. In Python, these are the functions ord() and chr(), respectively. Let’s use ord() in Python’s interactive interpreter to convert the quality characters to a list of ASCII decimal representations:

>>> qual = "JJJJJJJJJJJJGJJJJJIIJJJJJIGJJJJJIJJJJJJJIJIJJJJHHHHHFFFDFCCC"
>>> [ord(b) for b in qual]
[74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 71, 74, 74, 74, 74, 74, 73,
 73, 74, 74, 74, 74, 74, 73, 71, 74, 74, 74, 74, 74, 73, 74, 74, 74, 74, 74,
 74, 74, 73, 74, 73, 74, 74, 74, 74, 72, 72, 72, 72, 72, 70, 70, 70, 68, 70,
 67, 67, 67]

Unfortunately, converting these ASCII values to meaningful quality scores can be tricky because there are three different quality schemes: Sanger, Solexa, and Illumina (see Table 10-2). The Open Bioinformatics Foundation (OBF), which is responsible for projects like Biopython, BioPerl, and BioRuby, gives these the names fastq-sanger, fastq-solexa, and fastq-illumina. Fortunately, the bioinformatics field has finally seemed to settle on the Sanger encoding (which is the format that the quality line shown here is in), so we’ll step through the conversion process using this scheme.

Table 10-2. FASTQ quality schemes (adapted from Cock et al., 2010 with permission)
Name ASCII character range Offset Quality score type Quality score range
Sanger, Illumina (versions 1.8 onward) 33–126 33 PHRED 0–93
Solexa, early Illumina (before 1.3) 59–126 64 Solexa 5–62
Illumina (versions 1.3–1.7) 64–126 64 PHRED 0–62

First, we need to subtract an offset to convert this Sanger quality score to a PHRED quality score. PHRED was an early base caller written by Phil Green, used for fluorescence trace data written by Phil Green. Looking at Table 10-2, notice that the Sanger format’s offset is 33, so we subtract 33 from each quality score:

>>> phred = [ord(b)-33 for b in qual]
>>> phred
[41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 38, 41, 41, 41, 41, 41, 40,
 40, 41, 41, 41, 41, 41, 40, 38, 41, 41, 41, 41, 41, 40, 41, 41, 41, 41, 41,
 41, 41, 40, 41, 40, 41, 41, 41, 41, 39, 39, 39, 39, 39, 37, 37, 37, 35, 37,
 34, 34, 34]

Now, with our Sanger quality scores converted to PHRED quality scores, we can apply the following formula to convert quality scores to the estimated probability the base is correct:

P = 10-Q/10

To go from probabilities to qualities, we use the inverse of this function:

Q = -10 log10P

In our case, we want the former equation. Applying this to our PHRED quality scores:

>>> [10**(-q/10) for q in phred]
[1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05,
 1e-05, 0.0001, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 0.0001, 0.0001, 1e-05,
 1e-05, 1e-05, 1e-05, 1e-05, 0.0001, 0.0001, 1e-05, 1e-05, 1e-05, 1e-05,
 1e-05, 0.0001, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 1e-05, 0.0001,
 1e-05, 0.0001, 1e-05, 1e-05, 1e-05, 1e-05, 0.0001, 0.0001, 0.0001, 0.0001,
 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001]

Converting between Illumina (version 1.3 to 1.7) quality data is an identical process, except we use the offset 64 (see Table 10-2). The Solexa conversion is a bit trickier because this scheme doesn’t use the PHRED function that maps quality scores to probabilities. Instead, it uses Q = (10P/10 + 1)-1. See Cock et al., 2010 for more details about this format.


Top Visited
Switchboard
Latest
Past week
Past month

NEWS CONTENTS

Old News ;-)

Recommended Links

Google matched content

Softpanorama Recommended

Top articles

Sites

FASTQ Files