May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Compression of FASTA/FASTQ files




DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous applied fields such as medical diagnosis, biotechnology, forensics, and virology. The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental in the sequencing of complete DNA sequences, or genomes, of numerous types and species of life, including the human genome and other complete DNA sequences of many animal, plant, and microbial species.

The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based on two-dimensional chromatography. Following the development of fluorescence-based sequencing methods with a DNA sequencer, DNA sequencing has become easier and orders of magnitude faster.

Here is some background information on the topic (borrowed from the paper DNA Sequence Compression using the Burrows-Wheeler Transform by Don Adjeroh, Yong Zhang, et al.):

The biological activity of every living organism is controlled by billions of individual cells. The control-center of each cell is the deoxyribonucleic acid (DNA) that contains a complete set of instructions needed to direct the functioning of each and every one of the cells. The chemical composition of the DNA is the same for all living organisms. The DNA of every living organism contains four basic nucleotide bases: adenine, cytosine, guanine, and thymine, usually abbreviated using the symbols A, C, G and T respectively.

The DNA sequence is usually divided into a number of chromosomes, which in turn contain genes. Genes are sequences of base pairs that contain instructions on how to produce proteins. They are also related to heredity. The areas of the DNA that contain genes are thus called coding areas, while the remaining parts are called non-coding areas. In higher-level eukaryotes, genes are usually spliced up into alternating regions of exons and introns. The introns are noncoding DNA and are cut out before the messenger ribonucleic acid (mRNA) leaves the nucleus for the ribosome, where the protein specified by the mRNA is synthesized. The information in the mRNA thus represents the exons which are then used to make proteins. The non-coding area (sometimes called "junk DNA") contains redundant repeating sequences. They are estimated to make up at least 50% of the human genome.

From a computational viewpoint, a biological sequence can be viewed mainly as a one-dimensional sequence of symbols, for instance with an alphabet of 4 symbols for DNA or RNA, and 20 symbols for proteins. Biological sequences typically contain different types of repetitions and other hidden regularities. Long runs of tandem repeats and of randomly interspersed repeats are prominent features of DNA sequences. The family of Alu repeats (typically about 300 bases in length) is typical of short interspersed repeat sequences (SINEs - short interspersed nuclear elements). These have been estimated to make up about 9% of the human genome, thus out-numbering the proportion of protein coding regions [Herzel94es]. There are also the long interspersed repeat sequences (LINEs - long interspersed nuclear elements), which are usually more than 6000 bases in length. In the human genome, the L1 family is the most common LINE, with about 60,000 to 100,000 occurrences. There are also short repeats (sometimes called "random repeats"), attributed to the fact that typical sequences and genomes are orders of magnitude larger than the alphabet size (4 in this case).

We will discuss only one topic here: compression of FASTA/FASTQ files. Both are text formats, described in more detail in The FASTA Format and The FASTQ Format.

Neither format is formally standardized, but for both a de facto standard exists, as do utilities for converting files to it.

The most common format for DNA sequencing data is now FASTQ. FASTQ files introduce an additional symbol, N, meaning an undefined base. To avoid enlarging the alphabet, N can be recovered from the quality score: a base with quality zero is decoded as N. (This convention should be enforced.)

In a typical FASTQ file, every four lines represent a single DNA sequence. The general syntax of a FASTQ file is as follows:

<fastq>   := <block>+
<block>   := @<seqname>\n<seq>\n+[<seqname>]\n<qual>\n
<seqname> := [A-Za-z0-9_.:-]+
<seq>     := [A-Za-z\n\.~]+
<qual>    := [!-~\n]+
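
As an illustration of the grammar, here is a minimal Python sketch of a reader for the common four-lines-per-record layout. It is not a reference implementation: the function name read_fastq_4line is made up for this example, the character classes are adapted from the grammar above, only the first whitespace-delimited token of the header is checked against the name pattern (an assumption, since real headers often carry a description after a space), and wrapped sequence or quality lines are deliberately not supported.

```python
import io
import re

# Character classes adapted from the grammar; line breaks inside <seq> and
# <qual> are left out because each record is assumed to span exactly 4 lines.
SEQNAME = re.compile(r"[A-Za-z0-9_.:-]+")
SEQ = re.compile(r"[A-Za-z.~]+")
QUAL = re.compile(r"[!-~]+")


def read_fastq_4line(handle):
    """Yield (name, seq, qual) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        header = header.rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the 4-line layout")
        fields = header[1:].split()
        name = fields[0] if fields else ""
        if not (SEQNAME.fullmatch(name) and SEQ.fullmatch(seq)
                and QUAL.fullmatch(qual)):
            raise ValueError(f"record {name!r} violates the grammar")
        if len(seq) != len(qual):
            raise ValueError(f"record {name!r}: sequence/quality length mismatch")
        yield name, seq, qual


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGTN\n+\nIIII!\n")
    for record in read_fastq_4line(demo):
        print(record)
```

The length check reflects the requirement that every base has exactly one quality character.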

FASTA/FASTQ compression ratio

A four-symbol alphabet (A, C, G, T) can be encoded using two bits per letter. Since each letter occupies eight bits in a text file, this gives a compression ratio of about four, and that level represents the "minimum acceptable level" of compression of FASTA files no matter what algorithm is used.
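
The two-bit encoding can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the names pack_2bit and unpack_2bit are hypothetical, the code assignment (A=0, C=1, G=2, T=3) is an arbitrary choice, and symbols such as N, header lines, and line breaks are not handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary but fixed assignment
BASES = "ACGT"


def pack_2bit(seq):
    """Pack a string over {A, C, G, T} into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Recover the first `length` bases from bytes produced by pack_2bit."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

The original length has to be stored separately, because the last byte may contain padding bits.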

But FASTA/FASTQ data are not completely random. They convey important information about the organism, and therefore contain the repetitions, palindromes, and other regular structures inherent in biological sequences. These imply additional redundancy, which provides an avenue for raising the compression ratio to six or more.

But identifying such structures is tricky, and not all short reads in FASTA/FASTQ files contain enough of them to justify the computational cost. Repetitions across short reads can be quite distant and require building a dictionary. In addition, repetitions of up to roughly eight symbols probably do not improve the compression ratio much, because the compressor now has to distinguish between a literal string and a reference to a previous occurrence or dictionary entry, and that creates overhead. The effect is visible in gzip, which usually does not reach a compression ratio of four on FASTA/FASTQ files.

So achieving a compression ratio of, say, six for FASTA/FASTQ files is not that easy, and doing so consistently is currently out of reach for generic methods (see below). That's the law of diminishing returns in action.

At the same time, there is no widely accepted specialized algorithm that achieves a compression ratio of six or better without an undue slowdown in compression speed. The area of specialized algorithms is balkanized, and researchers often do not sustain interest in their compression utility and the corresponding algorithms for more than six years. There is a lot of abandonware in the field, some of it with promising ideas.

BWT and FASTA/FASTQ compression

The Burrows–Wheeler transform (BWT) usually shows good results for text-based data, but it is slow, and on general text the gains are not enough to compete with the Deflate algorithm used in gzip. It really shines, however, in compressing FASTA/FASTQ files, as DNA consists of only four basic nucleotides, labeled A, C, G, and T. A FASTA file is basically a massive string containing these four symbols in various combinations.

BWT’s block-sorting algorithm is viewed by some as an ideal transform to apply to FASTA/FASTQ data to make it more compressible (see, for example, the paper DNA Sequence Compression using the Burrows-Wheeler Transform).
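
To make the transform concrete, here is a naive Python sketch of the BWT and its inverse. It sorts all rotations explicitly, which is far too slow and memory-hungry for real sequence data (practical tools such as bzip2 use much more efficient block sorting), and the "$" end marker is an assumption of this example rather than part of any file format.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the BWT by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original text is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]
```

The transform itself does not shrink anything; it tends to group identical symbols into runs, which a later stage (in bzip2, move-to-front coding followed by Huffman coding) can then encode compactly.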

It is important to note that bzip2 is parallelizable, and its parallel version (pbzip2) competes in speed with pigz, the parallel version of gzip.

Major categories of FASTA/FASTQ files compression methods and programs

There are two major categories of FASTA/FASTQ files compression methods and programs:

Algorithms that do well with text compression are probably worth investigating, insofar as uncompressed FASTQ is structured text.

A good overview of FASTQ compression is given in the paper "LFQC: a lossless compression algorithm for FASTQ files" (Bioinformatics, Oxford Academic):

Compression of nucleotide sequences has been an interesting problem for a long time. Cox et al. (2012) apply Burrows–Wheeler Transform (BWT) for compression of genomic sequences. GReEn (Pinho et al., 2012) is a reference-based sequence compression method offering a very good compression ratio. (Compression ratio refers to the ratio of the original data size to the compressed data size). Compressing the sequences along with quality scores and identifiers is a very different problem. Tembe et al. (2010) have attached the Q-scores with the corresponding DNA bases to generate new symbols out of the base-Q-score pair. They encoded each such distinct symbol using Huffman encoding (Huffman, 1952). Deorowicz and Grabowski (2011) divided the quality scores into three sorts of quality streams:

  1. quasi-random with mild dependence on quality score positions.
  2. quasi-random quality scores with strings ending with several # characters.
  3. quality scores with strong local correlations within individual records.

To represent case 2 they use a bit flag. Any string with that specific bit flag is processed to remove all trailing #. They divide the quality scores into individual blocks and apply Huffman coding. Tembe et al. (2010) and Deorowicz and Grabowski (2011) also show that general purpose compression algorithms do not perform well and domain specific algorithms are indeed required for efficient compression of FASTQ files. In their papers they demonstrate that significant improvements in compression ratio can be achieved using domain specific algorithms compared with bzip2 and gzip.
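
The flag-and-strip idea for case 2 can be illustrated with a small Python sketch. This is not the DSRC code: the function names are hypothetical, and the sketch assumes the decoder knows the read length (for example from the sequence line), so the removed run of # characters does not have to be stored.

```python
def strip_trailing_hashes(qual):
    """Return (flag, body); flag is True when trailing '#' were removed."""
    body = qual.rstrip("#")
    return len(body) != len(qual), body


def restore_trailing_hashes(flag, body, read_length):
    """Rebuild the quality string, padding with '#' when the flag is set."""
    if not flag:
        return body
    return body + "#" * (read_length - len(body))
```

Only the shortened body then needs to go through the entropy coder.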

The literature on FASTQ compression can be divided into two categories, namely lossless and lossy. A lossless compression scheme is one where we preserve the entire data in the compressed file. On the contrary, lossy compression techniques allow some of the less important components of data to be lost during compression. Kozanitis et al. (2011) perform randomized rounding to perform lossy encoding. Asnani et al. (2012) introduce a lossy compression technique for quality score encoding. Wan et al. (2012) proposed both lossy and lossless transformations for sequence compression and encoding. Hach et al. (2012) presented a ‘boosting’ scheme which reorganizes the reads so as to achieve a higher compression speed and compression rate, independent of the compression algorithm in use. When it comes to medical data compression it is very difficult to identify which components are unimportant. Hence, many researchers believe that lossless compression techniques are particularly needed for biological/medical data. Quip (Jones et al., 2012) is one such lossless compression tool. It separately compresses the identifier, sequence and quality scores. Quip makes use of Markov Chains for encoding sequences and quality scores. DSRC (Deorowicz and Grabowski, 2011) is also considered a state of the art lossless compression algorithm which we compare with ours in this article. Bonfield and Mahoney (2013) have come up with a set of algorithms (named Fqzcomp and Fastqz) to compress FASTQ files recently. They perform identifier compression by storing the difference between the current identifier and the previous identifier. Sequence compression is performed by using a set of techniques including base pair compaction, encoding and an order-k model. Apart from hashing, they have also used a technique for encoding quality values by prediction.
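
The idea of storing only the difference between consecutive identifiers can be illustrated with a simple prefix-based sketch in Python. This is not the Fqzcomp or Fastqz algorithm; it is one simple way to exploit the fact that neighbouring read identifiers usually share a long common prefix, and the function names are made up for this example.

```python
import os  # os.path.commonprefix compares strings character by character


def delta_encode_ids(ids):
    """Encode each identifier as (shared prefix length, remaining suffix)."""
    prev = ""
    encoded = []
    for ident in ids:
        shared = len(os.path.commonprefix([prev, ident]))
        encoded.append((shared, ident[shared:]))
        prev = ident
    return encoded


def delta_decode_ids(encoded):
    """Rebuild the identifiers from (shared prefix length, suffix) pairs."""
    prev = ""
    ids = []
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        ids.append(prev)
    return ids
```

After this step, most identifiers shrink to a small number plus a short suffix, which a general-purpose coder compresses far better than the full strings.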

See also Note on 2 bit compression of FASTA files

Compression of FASTA/FASTQ  files using generic compression programs

Currently there is no standardized compressor for FASTA/FASTQ files, and general-purpose archivers are often used as a substitute. Of the three most popular archivers (gzip/pigz, bzip2/pbzip2, and xz), pbzip2 has an edge.

Speed, size, and Qratio are relative to gzip -6. Qratio is relative speed divided by the square of relative size.

#  Sample  Compressor  Parameters  Orig size  Compressed  Compress   Compress  Decompress  Speed rel.  Size rel.  Qratio  Fraction of  Compression
                                   (GB)       size (GB)   time       speed     time        to gzip -6  to gzip            original     ratio
                                                          (min:sec)  (MB/sec)  (min:sec)                                  size
1  FASTQ   gzip        -6          7.43       2.08        18:03      6.86      1:14        1.00        1.00       1.00    0.28         3.58
2  FASTQ   pigz        (default)   7.43       2.08        1:16       97.78     0:51        14.39       1.00       14.37   0.28         3.57
3  FASTQ   pigz        -9          7.43       2.04        2:32       48.89     0:51        7.20        0.98       7.47    0.27         3.64
4  FASTQ   bzip2       (default)   7.43       1.53        20:42      5.98      7:21        0.88        0.74       1.63    0.21         4.86
5  FASTQ   pbzip2      (default)   7.43       1.60        1:51       66.95     0:56        9.86        0.77       16.63   0.22         4.64
6  FASTQ   xz          -9          7.43       1.34        400:05     0.31      3:19        0.05        0.64       0.11    0.18         5.56
7  FASTQ   xz          (default)   7.43       1.53        206:01     0.60      3:19        0.09        0.74       0.16    0.21         4.86
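
The derived columns of the table can be recomputed with a few lines of Python. This is a sketch under stated assumptions: 1 GB is taken as 1000 MB, the function names are made up for this example, and because the table was evidently computed from unrounded measurements, values recomputed from the rounded figures may differ slightly from the printed ones.

```python
def metrics(orig_gb, comp_gb, seconds):
    """Return (compression speed in MB/sec, compression ratio) for one run."""
    speed = orig_gb * 1000 / seconds  # assumes 1 GB = 1000 MB
    ratio = orig_gb / comp_gb
    return speed, ratio


def qratio(speed, size, base_speed, base_size):
    """Relative speed divided by the square of relative compressed size."""
    rel_speed = speed / base_speed
    rel_size = size / base_size
    return rel_speed / rel_size ** 2


if __name__ == "__main__":
    # gzip -6 row: 7.43 GB compressed to 2.08 GB in 18 min 3 sec.
    gzip_speed, gzip_ratio = metrics(7.43, 2.08, 18 * 60 + 3)
    # pbzip2 row: 7.43 GB compressed to 1.60 GB in 1 min 51 sec.
    pbzip2_speed, _ = metrics(7.43, 1.60, 1 * 60 + 51)
    print(round(gzip_speed, 2), round(gzip_ratio, 2))
    print(round(qratio(pbzip2_speed, 1.60, gzip_speed, 2.08), 2))
```

Squaring the relative size weights compressed size more heavily than speed, which is why pbzip2 scores higher than pigz despite being slower.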


Old News ;-)

[May 20, 2018] How to use mkfifo named pipes with

May 20, 2018 |

24th February 2016

PRINSEQ is a utility written in Perl for preprocessing NGS reads, including reads in FASTQ format.
It can read sequences both from files and from stdin (if you only have 1 sequence).

I wanted to use it with compressed (gzipped/bzipped2) FASTQ input files.
As I do not need to store decompressed input files, the most efficient solution is to use pipes.
This works well for a single file, but not for 2 files (paired-end reads).

For 2 files, named pipes (also known as FIFOs) can be used.
You can create a named pipe in Linux with the help of the mkfifo command, for example mkfifo R1_decompressed.fastq.
To use it, start decompressing something into it (either in a different terminal, or in the background), for example zcat R1.fastq.gz > R1_decompressed.fastq &
We can call this a writing/generating process, because it writes into a pipe.
(If you are writing software to use named pipes, any processes writing into them should be started in a new thread, as they will block until all the data is consumed.)
Now if you give R1_decompressed.fastq as a file argument to some other program, it will see the decompressed content (e.g. wc -l R1_decompressed.fastq will tell you the number of lines in the decompressed file); we can call a program reading from the named pipe a reading/consuming process.
As soon as the consuming process has consumed (read) all of the data, the writing/generating process will finally exit.
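
The same writer/reader arrangement can be sketched in Python, following the advice above to run the writer in its own thread. This is a minimal illustration for Unix-like systems only (os.mkfifo is not available on Windows); the function name count_lines_via_fifo and the temporary file layout are assumptions of this example.

```python
import gzip
import os
import shutil
import tempfile
import threading


def count_lines_via_fifo(gz_path):
    """Decompress gz_path into a named pipe and count the lines read from it."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "decompressed.fastq")
    os.mkfifo(fifo_path)  # same effect as the mkfifo command

    def writer():
        # Opening a FIFO for writing blocks until a reader opens it,
        # which is why the writer runs in a separate thread.
        with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        # The reader sees the pipe as an ordinary file path.
        with open(fifo_path, "rb") as fifo:
            n_lines = sum(1 for _ in fifo)
    finally:
        thread.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
    return n_lines
```

For paired-end data one would create two pipes and two writer threads, and pass both pipe paths to the consuming program.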

This, however, does not work with PRINSEQ (version 0.20.4 or earlier), which fails with a broken pipe error.

[May 13, 2018] What is the difference between FASTA, FASTQ, and SAM file formats

May 13, 2018 |

Konrad Rudolph ,Jun 2, 2017 at 12:16

Let's start with what they have in common: All three formats store
  1. sequence data, and
  2. sequence metadata.

Furthermore, all three formats are text-based.

However, beyond that all three formats are different and serve different purposes.

Let's start with the simplest format:


FASTA stores a variable number of sequence records, and for each record it stores the sequence itself, and a sequence ID. Each record starts with a header line whose first character is >, followed by the sequence ID. The next lines of a record contain the actual sequence.

The Wikipedia article gives several examples for peptide sequences, but since FASTQ and SAM are used exclusively (?) for nucleotide sequences, here's a nucleotide example:

>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)

The ID can be in any arbitrary format, although several conventions exist.
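
The record structure described above (a header line starting with >, followed by one or more sequence lines) can be read with a short Python sketch. The function name read_fasta is made up for this example, and splitting the header at the first space into an ID and a free-text description is one of the common conventions, assumed here rather than required by the format.

```python
import io


def _make_record(header, chunks):
    """Split the header at the first space and join the sequence lines."""
    seq_id, _, description = header.partition(" ")
    return seq_id, description, "".join(chunks)


def read_fasta(handle):
    """Yield (seq_id, description, sequence) for each record in a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())  # sequence may wrap over many lines
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    demo = io.StringIO(">seq1 demo record\nACGT\nAC\n>seq2\nGG\n")
    for record in read_fasta(demo):
        print(record)
```

Because the generator yields one record at a time, a large file never has to be held in memory as a whole.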

In the context of nucleotide sequences, FASTA is mostly used to store reference data; that is, data extracted from a curated database; the above is adapted from GtRNAdb (a database of tRNA sequences).


FASTQ was conceived to solve a specific problem of FASTA files: when sequencing, the confidence in a given base call (that is, the identity of a nucleotide) varies. This is expressed in the Phred quality score. FASTA had no standardised way of encoding this. By contrast, a FASTQ record contains a quality score for each nucleotide of the sequence.
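
How a quality character maps to a Phred score and an error probability can be shown in a few lines of Python. The sketch assumes the Sanger / Illumina 1.8+ encoding, in which the score is the character's ASCII code minus 33; older Illumina pipelines used a different offset, so the offset is a parameter. The function names are hypothetical.

```python
def phred_scores(qual, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)
```

With this encoding, the character I corresponds to a score of 40, that is, an estimated error probability of one in ten thousand.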

A FASTQ record has the following format:

  1. A line starting with @, containing the sequence ID.
  2. One or more lines that contain the sequence.
  3. A new line starting with the character +, and being either empty or repeating the sequence ID.
  4. One or more lines that contain the quality scores.

Here's an example of a FASTQ file with two records:


FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. As a consequence, the sequence and quality scores are usually put into a single line each, and indeed many tools assume that each record in a FASTQ file is exactly four lines long, even though this isn't guaranteed.
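
Since the four-line layout is not guaranteed, a tolerant reader has to cope with wrapped sequence and quality lines. The Python sketch below is one way to do that, assuming well-formed input; the function name read_fastq_multiline is made up for this example. The key point is that @ and + are themselves valid quality characters, so the end of the quality string is found by length rather than by looking for the next header.

```python
import io


def read_fastq_multiline(handle):
    """Yield (header, seq, qual) from FASTQ whose records may wrap over lines."""
    lines = (ln.rstrip("\n") for ln in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        seq_parts = []
        for ln in lines:
            if ln.startswith("+"):
                break  # separator reached, sequence is complete
            seq_parts.append(ln)
        else:
            raise ValueError("record ended before the '+' separator")
        seq = "".join(seq_parts)
        qual = ""
        # Collect quality lines until there is one character per base.
        while len(qual) < len(seq):
            ln = next(lines, None)
            if ln is None:
                raise ValueError("record ended inside the quality string")
            qual += ln
        if len(qual) != len(seq):
            raise ValueError("quality string longer than sequence")
        yield header[1:], seq, qual


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\nAC\n+\nIIII\n@I\n")
    for record in read_fastq_multiline(demo):
        print(record)
```

A reader that simply treats every line starting with @ as a header would misparse the demo record, whose second quality line begins with @.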

As for FASTA, the format of the sequence ID isn't standardised, but different producers of FASTQ use fixed notations that follow strict conventions.


SAM files are so complex that a complete description [PDF] takes 15 pages. So here's the short version.

The original purpose of SAM files is to store mapping information for sequences from high-throughput sequencing. As a consequence, a SAM record needs to store more than just the sequence and its quality, it also needs to store information about where and how a sequence maps into the reference.

Unlike the previous formats, SAM is tab-delimited, and each record, consisting of 11 mandatory fields optionally followed by further tagged fields, fills exactly one line. Here's an example (tabs replaced by fixed-width spacing):

r001  99  chr1  7 30  17M         =  37  39  TTAGATAAAGGATACTG   IIIIIIIIIIIIIIIII
r002  0   chrX  9 30  3S6M1P1I4M  *  0   0   AAAAGATAAGGATA      IIIIIIIIII6IBI    NM:i:1

For a description of the individual fields, refer to the documentation. The relevant bit is this: SAM can express exactly the same information as FASTQ, plus, as mentioned, the mapping information. However, SAM is also used to store read data without mapping information.
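
As a rough illustration of the record layout, the following Python sketch splits one alignment line into the 11 mandatory fields (named as in the SAM specification) and keeps any remaining columns as raw optional tags. It is not a replacement for a real SAM/BAM library; the function names are hypothetical, and header lines are not handled.

```python
import re

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one tab-delimited SAM alignment line into a dict of fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM record needs at least 11 fields")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional fields, left unparsed
    return record


def cigar_query_length(cigar):
    """Sum the lengths of CIGAR operations that consume read bases."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MIS=X")
```

For the second example record above, the CIGAR string 3S6M1P1I4M accounts for 3 + 6 + 1 + 4 = 14 read bases, which matches the length of its sequence and quality strings.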

In addition to sequence records, SAM files can also contain a header, which stores information about the reference that the sequences were mapped to, and the tool used to create the SAM file. Header information precedes the sequence records, and consists of lines starting with @.

SAM itself is almost never used as a storage format; instead, files are stored in BAM format, which is a compact binary representation of SAM. It stores the same information, just more efficiently, and in conjunction with a search index, allows fast retrieval of individual records from the middle of the file (= fast random access). BAM files are also much more compact than compressed FASTQ or FASTA files.

The above implies a hierarchy in what the formats can store: FASTA ⊂ FASTQ ⊂ SAM.

In a typical high-throughput analysis workflow, you will encounter all three file types:

  1. FASTA to store the reference genome/transcriptome that the sequence fragments will be mapped to.
  2. FASTQ to store the sequence fragments before mapping.
  3. SAM/BAM to store the sequence fragments after mapping.

Scott Gigante ,Aug 17, 2017 at 6:01

FASTQ is used for long-read sequencing as well, which could have a single record being thousands of 80-character lines long. Sometimes these are split by line breaks, sometimes not. – Scott Gigante Aug 17 '17 at 6:01

Konrad Rudolph ,Aug 17, 2017 at 10:03

@ScottGigante I alluded to this by saying that the sequence can take up several lines. – Konrad Rudolph Aug 17 '17 at 10:03

Scott Gigante ,Aug 17, 2017 at 13:22

Sorry, should have clarified: I was just referring to the line "FASTQ files are (almost?) exclusively used to store short-read data from high-throughput sequencing experiments." Definitely not exclusively. – Scott Gigante Aug 17 '17 at 13:22

Konrad Rudolph ,Feb 21 at 17:06

@charlesdarwin I have no idea. The line with the plus sign is completely redundant. The original developers of the FASTQ format probably intended it as a redundancy to simplify error checking (= to see if the record was complete) but it fails at that. In hindsight it shouldn't have been included. Unfortunately we're stuck with it for now. – Konrad Rudolph Feb 21 at 17:06

Wouter De Coster ,Feb 21 at 23:16

@KonradRudolph as far as I know fastq is a combination of fasta and qual files, see also This explains the header of the quality part. It, however, doesn't make sense we're stuck with it... – Wouter De Coster Feb 21 at 23:16

eastafri ,May 16, 2017 at 18:57

In a nutshell,

FASTA file format is a DNA sequence format for specifying or representing DNA sequences and was first described by Pearson (Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448)

FASTQ is another DNA sequence file format that extends the FASTA format with the ability to store the sequence quality. The quality scores are often represented as ASCII characters which correspond to a Phred score.

Both FASTA and FASTQ are common sequence representation formats and have emerged as key data interchange formats for molecular biology and bioinformatics.

SAM is a format for representing sequence alignment information from a read aligner. It represents sequence information with respect to a given reference sequence. The information is stored in a series of tab-delimited ASCII columns. The full SAM format specification is available at

user172818 ♦ ,May 16, 2017 at 19:07

On a historical note, the Sanger Institute first used the FASTQ format. – user172818 ♦ May 16 '17 at 19:07

Konrad Rudolph ,Jun 2, 2017 at 10:43

SAM can also (and is increasingly used for it, see PacBio) store unaligned sequence information, and in this regard equivalent to FASTQ. – Konrad Rudolph Jun 2 '17 at 10:43

bli ,Jun 2, 2017 at 11:30

Note that fasta is also often used for protein data, not just DNA. – bli Jun 2 '17 at 11:30

BaCh ,May 16, 2017 at 18:53

Incidentally, the first part of your question is something you could have looked up yourself as the first hits on Google of "NAME format" point you to primers on Wikipedia, no less. In future, please do that before asking a question.
  1. FASTA
  2. FASTQ
  3. SAM

FASTA (officially) just stores the name of a sequence and the sequence; unofficially, people also add comment fields after the name of the sequence. FASTQ was invented to store both sequence and associated quality values (e.g. from sequencing instruments). SAM was invented to store alignments of (small) sequences (e.g. generated from sequencing), with associated quality values and some further data, onto larger sequences, called reference sequences, the latter being anything from a tiny virus sequence to ultra-large plant sequences.

Alon Gelber ,May 16, 2017 at 19:50

FASTA and FASTQ formats are both file formats that contain sequencing reads, while SAM files are these reads aligned to a reference sequence. In other words, FASTA and FASTQ are the "raw data" of sequencing while SAM is the product of aligning the sequencing reads to a refseq.

A FASTA file contains a read name followed by the sequence. An example of one of these reads for RNASeq might be:

>Flow cell number: lane number: chip coordinates etc.

The FASTQ version of this read will have two more lines, one + as a space holder and then a line of quality scores for the base calls. The qualities are given as characters with '!' being the lowest and '~' being the highest, in increasing ASCII value. It would look something like this

@Flow cell number: lane number: chip coordinates etc.

A SAM file has many fields for each alignment, the header begins with the @ character. The alignment contains 11 mandatory fields and various optional ones. You can find the spec file here: .

Often you'll see BAM files which are just compressed binary versions of SAM files. You can view these alignment files using various tools, such as SAMtools, IGV or the UCSC Genome Browser.

As to the benefits, FASTA/FASTQ vs. SAM/BAM is comparing apples and oranges. I do a lot of RNASeq work so generally we take the FASTQ files and align them to a refseq using an aligner such as STAR which outputs SAM/BAM files. There's a lot you can do with just these alignment files, looking at expression, but usually I'll use a tool such as RSEM to "count" the reads from various genes to create an expression matrix, samples as columns and genes as rows. Whether you get FASTQ or FASTA files just depends on your sequencing platform. I've never heard of anybody really using the quality scores.

Konrad Rudolph ,Jun 2, 2017 at 10:47

Careful, the FASTQ format description is wrong: a FASTQ record can span more than four lines; also, + isn't a placeholder, it's a separator between the sequence and the quality score, with an optional repetition of the record ID following it. Finally, the quality score string has to be the same length as the sequence. – Konrad Rudolph Jun 2 '17 at 10:47

[May 10, 2018] FAST (FAST Analysis of Sequences Toolbox), built on BioPerl, provides open source command-line tools to filter, transform, annotate and analyze biological sequence data by Peter Becich •

May 10, 2018 |

Peter Becich (UC Merced) wrote:

FAST (FAST Analysis of Sequences Toolbox), built on BioPerl, provides open source command-line tools to filter, transform, annotate and analyze biological sequence data. Modeled after the GNU (GNU's Not Unix) Textutils such as grep, cut, and tr, FAST tools such as fasgrep, fascut, and fastr make it easy to rapidly prototype expressive bioinformatic workflows in a compact and generic command vocabulary. Compact combinatorial encoding of data workflows with FAST commands can facilitate better documentation and reproducibility of bioinformatic protocols, supporting better transparency in big biological data science. Interface self-consistency and conformity with conventions of GNU, Matlab, Perl, BioPerl, R and GenBank, help make FAST easy to learn. FAST automates numerical, text-based, sequence-based and taxonomic searching, sorting, selection and transformation of sequence records and alignment sites based on indices, ranges, tags and feature annotations, and analytics for composition and codon usage. Automated content- and feature-based extraction of sites and support for molecular population genetic statistics makes FAST useful for molecular evolutionary analysis. FAST is portable, easy to install, and secure, with stable releases posted to CPAN and development on Github.

The default data exchange format in FAST is Multi-FastA (specifically, a restriction of BioPerl FastA format). Sanger and Illumina 1.8+ FASTQ formatted files are also supported. The command-line basis of FAST makes it easier for non-programmer biologists to interactively investigate and control biological data at the speed of thought.

FAST can be found in the CPAN. Contributions are welcomed and appreciated.

See the FAST Cookbook for examples, or Lawrence et al., "FAST: FAST Analysis of Sequences Toolbox", to appear.


yes I might have confounded two different issues here - among the reasons that I don't use javascript is that I don't know it would hold up in realistic data scenarios - but that is not relevant to charms since those are not supposed to be used as a large scale data analysis platform.

long terms I see charms as a platform to demonstrate, teach and show concepts through the web. There will be "official" charms available by default but at the same time any user will be able to create their own library and others will have the option of loading these for their own use with one click.

Within their own library users may implement everything in javascript (that is the simplest approach) or they can choose to connect to a RPC or HTTP service (on a different machine) via a simple API. When choosing these latter the Biostar server will act as a proxy (since javacript cannot connect to different domains) and submits the data to the user's server, once the result is back displays it to the user. The returned data may be a rich HTML document, it gets injected into the user's page.

The entire concept is not fully worked out yet - what I would like is to have a methodology to let someone try a tool, evaluate and demonstrate the use and operation of a software or library without asking the user to install it.

written 3.1 years ago by Istvan Albert

Hmm.. When will a biostar user want to call charms? What "concepts" are you thinking to "demonstrate, teach and show" through the web? If it is just for demonstration, it would be easier to show that on my own server (e.g. via ipython) as I need it anyway to talk to the biostar server? And if the purpose is for teaching, we are not really talking about huge data sets. Javascript would be a better fit than adding an extra layer interfacing to other languages over the internet? I guess I need to see concrete use cases to understand what charms is for. Anyway, good that you are thinking about this.

written 3.1 years ago by lh3

The rationale comes from teaching more than one bioinformatics course a year, and it always gets me just how unexpectedly difficult it is to perform some of the simplest tasks unless we first spend several sessions making sure everyone's system is set up just right, with the right tools, libraries etc. installed. Half of the course is about unix, and frankly some people don't get unix at all - and I think that should be fine as well. One should be able to analyze data without understanding unix.

Example: Say I knew the names of two genes and I want to find out which one is longer. I don't know how to do this in a simple way. I can only do it by making use of a few decades of unix and other programming, where I chain together a series of pipe-delimited commands that in turn rely on a series of tools and libraries that I have preinstalled (some after quite the battle) on my system. Or perhaps one can get there by clicking about on webpages.

But what if there were a charm where getting sequence XYZ works in the browser by writing fetch(XYZ) and I could get the length of it by writing length(fetch(XYZ)) .

And then if I want to align the two I could just say align(s1, s2) and performs the alignment right there.

It is true that writing it in straight javascript would be the best solution - it is more a matter of time and resources to put towards that - hence I want to add the option of reuse existing libraries in a simple way.

written 3.1 years ago by Istvan Albert

I see. That makes sense. What complex functionalities you need for teaching but are not easy to reimplement in javascript?

written 3.1 years ago by lh3

In the end it is not just about the implementation but the usability of it. For example provides an aligner, but in its current form it takes a lot of work to understand the results.

The library produces a series of objects that have all kinds of attributes and methods. What it does not seem to provide is a simple way to produce biologically meaningful, interpretable results, like say a blast output, a tabular blast output, or a SAM output.

written 3.1 years ago by Istvan Albert

I am not sure NtSeq does standard alignment. It seems to be using pattern matching. I have just written a real DP-based algorithm: The APIs are C-like (actually most of my javascript programs look like C because I don't know much about javascript). You don't need deep knowledge in javascript to understand what it is doing.

written 3.1 years ago by lh3

whoa, this is so cool, it is exactly how I think about it. In fact this will be the way I will teach alignments from now on.

I have added your library to the charms - with an example usage (you may need to force refresh the page to get the newest javascript).

As an example I have created both a simplified aligner and formatter on top of the library that may be useful for some people but not others. The goal is to make it easy for others to customize and share the way that they think about making use of these libraries.

I will be working on a feature where a user can upload a javascript file into the charms and other users will be able to simply click it and specify that, when they visit the charms, they want that user's file to be activated as well. Now this new library may override the original formatting or it may choose to add a new formatting option with a different name.

written 3.1 years ago by Istvan Albert

Maybe I misunderstand Charms, but I thought they are meant to demonstrate installable packages, not replace them. I see the short explanations "Charms are small programs written in a javascript based language" and "It can also perform calls to external programs" and took this to mean that Charms are Javascript wrappers of bioinformatics packages written in a variety of languages.

[May 09, 2018] FAST FAST Analysis of Sequences Toolbox


Tool: FASTA and FASTQ tools, posted 2.9 years ago by Kamil (Boston):

Many developers have created tools for manipulating FASTA and FASTQ files. This is a comprehensive list of all the publicly available projects:

Java, Go, C/C++, Python, Perl

[May 09, 2018] How To Parse Fasta Files In Perl


Question: How To Parse Fasta Files In Perl, asked 7.9 years ago by nikulina (Cambridge):

Dear colleagues! I have a file with lots of sequences in FASTA format. I want to write a perl script to analyze each sequence (to count the length of a certain fragment). So, how can I manage to treat each sequence as a variable? Should I use an array to read my file?

So, here is my script. It might be not very nice, but it works. I would like to modify it in order to work with FASTA data.

$string_filename = 'file.txt';  
open(FILE, $string_filename);  
@array = FILE;     
close FILE;  
foreach $string(@array){  
$R = length $string;  
if ( $string =~ /ggc/ ) {   
$M = $';   
$W = length $M;  
if ( $string =~ /atg/ ) {   
$K = $`;   
$Z = length $K;  
$x = $W + $Z - $R;     
print " \n\ the distance is the following: \n\n ";  
print $x;  
} else {  
print "\n\ I couldn\'t find the start codone.\n\n";  
} else {  
print "\n\ I couldn\'t find the binding site.\n\n"; }  

I will be grateful for your help :)

written 7.9 years ago by nikulina

Could you also show us an example sequence for which this code works? If the code is supposed to do what I think it is supposed to do, I think there may be quite a few problems with it.

written 7.9 years ago by Neilfws

Are you really sure none of the 10000 topics about how to parse file XXX matched your needs?

written 6.4 years ago by Fabian Bull

Answer by Neilfws (Sydney, Australia), written 7.9 years ago:

First, there is no need to reinvent the wheel. As Stefano wrote, Bioperl will parse fasta sequences for you and do a whole lot more besides. Once installed, it is as simple as:

use Bio::SeqIO;
# Stream the FASTA file one record at a time
my $seqio = Bio::SeqIO->new(-file => "file.fa", -format => 'Fasta');
while (my $seq = $seqio->next_seq) {
  my $string = $seq->seq;
  # do stuff with $string
}

Second, there are some issues with your code. It should be "@array = <FILE>" - although as Stefano points out, you should not read the whole file into an array.

So far as I can tell, you are trying to find sub-sequences which begin "atg" and end with "ggc". Some other issues with your code:

  1. It seems to assume that there is only one each of "atg" and "ggc", because you use if() to match the regular expressions, not while().
  2. It returns negative values for length of the sub-sequence. Is this what you want? It is unclear whether you are looking for "atg" which lie upstream of "ggc" or whether they can be at any position in the sequence.
  3. It looks as though you are looking for start codons. There may be alternatives to atg: gtg or ttg.
  4. Your regular expressions are case-sensitive and would miss, for example, ATG.

Assuming that you are trying to find the region atg -> ggc, you could try something like:

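# $string must already hold the sequence (no "my" here, or the match runs on an empty variable)
while ($string =~ /atg(.*?)ggc/gi) {   # .*? is non-greedy: each atg pairs with the nearest following ggc
  # do something with match
  # e.g. match start = $-[0]+1, match end = $+[0]
}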

That example uses the special Perl variables @- and @+ to get match positions, but Bioperl will also provide you with plenty of methods for analysing sub-sequences.

written 7.9 years ago by Neilfws
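
The same idea (find every atg ... ggc region, ignore case, and report 1-based start and end positions) can be sketched in Python with the standard `re` module. The function name `find_regions` is hypothetical, and the non-greedy `(.*?)` is a design choice made here so that each start motif pairs with the nearest following end motif; matches do not overlap.

```python
import re


def find_regions(seq, start="atg", end="ggc"):
    """Yield (start_pos, end_pos, text) for each start...end region.

    Positions are 1-based and inclusive, matching the $-[0]+1 / $+[0]
    convention used in the Perl snippet above.
    """
    pattern = re.compile(re.escape(start) + "(.*?)" + re.escape(end),
                         re.IGNORECASE)
    for m in pattern.finditer(seq):
        yield m.start() + 1, m.end(), m.group(0)


if __name__ == "__main__":
    for region in find_regions("ccATGaaaGGCttatgcggc"):
        print(region)
```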

Thank you for your attention to my question. In fact I would like to find the distance between binding sites for RNAP II and the start of transcription in certain human genes. So, I used some sample motifs ('ggc' and 'atg') in my perl script, just to make the task easier and to test how it works in this simple variant. The binding site is situated before the start codon, that's why I didn't take into consideration those variants where the first motif is situated after the second. Once more, thank you for your help.

written 7.9 years ago by nikulina

Answer by Stefano Berri (Cambridge, UK), written 7.9 years ago:

If you are planning to read and manipulate a lot of files with fasta sequences, do it properly. Use Bioperl. It makes life easier (see an example here). It takes some time to set it up and learn the "philosophy" behind it, but then you can do much more: read from NCBI/EMBL, read/write to different formats... all with the same interface. Already debugged for you.

Also, if you use big files, don't do this:

open(FILE, $string_filename); @array = FILE;

It will load the whole file in memory. Nowadays fasta files might be Huuuuuge.

written 7.9 years ago by Stefano Berri
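
To illustrate the memory point, here is a minimal sketch of a streaming FASTA reader in Python. It holds only one record in memory at a time, however large the file is. The name `read_fasta` is hypothetical; for real work a maintained parser (such as the Bioperl route recommended above) is the safer choice.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time.

    `handle` is any iterable of text lines, e.g. an open file.
    Only the current record is kept in memory.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    demo = io.StringIO(">s1 first record\nACGT\nAC\n>s2\nGG\n")
    for name, seq in read_fasta(demo):
        print(name, len(seq))
```

With a file on disk the call would look like `with open("file.fa") as fh: for name, seq in read_fasta(fh): ...`, where the file name is only an example.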

Thank you! Indeed, I recognise that my variant is not very convenient and consumes lots of memory. I'll try to examine BioPerl and use it for further tasks.

written 7.9 years ago by nikulina

Answer by Hanif Khalak (Doha, QA), written 7.9 years ago:

Along the lines of answers to this question, you can read/process one FASTA sequence at a time. I'd modify your code like this:

$string_filename = 'file.txt';
open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");

local $/ = "\n>";  # read by FASTA record

while (my $seq = <FILE>) {   # read from the file opened above
    chomp $seq;           # strip the trailing "\n>" record separator
    $seq =~ s/^>*.+\n//;  # remove FASTA header
    $seq =~ s/\n//g;  # remove endlines

    $R = length $seq;
    if ( $seq =~ /ggc/ ) {
        $M = $';          # text after the binding site
        $W = length $M;
        if ( $seq =~ /atg/ ) {
            $K = $`;      # text before the start codon
            $Z = length $K;
            $x = $W + $Z - $R;
            print "\n the distance is the following: $x\n\n";
        } else {
            print "\n I couldn't find the start codon.\n\n";
        }
    } else {
        print "\n I couldn't find the binding site.\n\n";
    }

}  # end while

close FILE;

written 7.9 years ago by Hanif Khalak
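
The arithmetic x = W + Z - R in the code above is easier to follow when written out. If the first ggc starts at 0-based index p and the first atg starts at index q, then W = R - (p + 3) and Z = q, so x = q - (p + 3): the number of bases between the end of the binding site and the start codon, negative when atg comes first. A small Python sketch of the same calculation follows; the name `motif_gap` is hypothetical, and unlike the Perl code it lowercases the sequence first, which is an assumption added here.

```python
def motif_gap(seq, site="ggc", start="atg"):
    """Bases between the first `site` and the first `start` (x = W + Z - R).

    Returns None if either motif is missing. The result is negative when
    the start motif lies upstream of the site.
    """
    s = seq.lower()          # assumption: treat the sequence case-insensitively
    p = s.find(site)         # 0-based index of the binding site
    q = s.find(start)        # 0-based index of the start codon
    if p < 0 or q < 0:
        return None
    r = len(s)               # R: total sequence length
    w = r - (p + len(site))  # W: bases after the site (length of $')
    z = q                    # Z: bases before the start codon (length of $`)
    return w + z - r         # simplifies to q - (p + len(site))


if __name__ == "__main__":
    print(motif_gap("aaggcttttatgcc"))
```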

Thank you! it works!

written 7.9 years ago by nikulina

Answer by Tarah, written 6.4 years ago:

Can I please add a question to this? What if you want to remove that string/sequence that you are looking for? I have a control phage in my illumina data that I want to remove, but am having a hard time finding out how to do this. Thanks so much!

[May 09, 2018] at master · 4ureliek-Fasta


#!/usr/bin/perl -w
# Author : Aurelie Kapusta
# email : [email protected]
# Purpose : To extract sequences from a fasta or fastq file with filters on headers (matching IDs, containing a word etc - see usage)
use strict;
use warnings;
use Carp;
use Getopt::Long;
use Bio::SeqIO;
my $version = " 2.2 " ;
my $scriptname = " " ;
my $changelog = "
# - v1.0 = 19 Mar 2015
# Basically merging 6 different scripts in one... It was a mess
# - v2.0 = 17 Jul 2015
# Merging can be messy too! Introduction of bugs. The -inv option didn't work.
# Also, allow the -m IDfile to be a fasta file
# Usage update
# - v2.1 = 12 Apr 2016
# fastq option
# - v2.1 = 13 Apr 2016
# grep option; faster indeed when very large fastq file, but still super slow
# TO DO: a bio db and not a SeqIO
\n " ;
my $usage = " \n Usage [ $version ]:
perl -in <fa> -m <X> [-file] [-out <X>] [-fq] [-grep] [-desc] [-both] [-regex] [-inv] [-noc] [-chlog] [-v] [-h]
This script allows to extract fasta sequences from a file.
- matching ID (from command line or using another fasta file or a file containing a list of IDs using -file)
- containing a word in the ID or in the description (-desc), or in both (-both)
- the complement of that (meaning, extract when it does not match), option -inv (inverse match)
Note that for a given fasta header:
>ID description
The ID corresponds to anything before the first space, description is anything that's after (even if spaces)
To extract all sequences containing ERV or LTR in IDs only:
perl -in fastafile.fa -m ERV,LTR -regex -v
To extract all sequences that don't have the word \" virus \" in the description or in the ID
perl -in fastafile.fa -m virus -both -inv -v
To extract all sequences that have their ID listed in a file
perl -in fastafile.fa -m list.txt -v
To extract all sequences that have their full header listed in a file
perl -in fastafile.fa -m list.txt -both -v
-in => (STRING) input fasta file
-m => (STRING) provide (i) a word or a list of words, or (ii) a path to a file
(i) in command line: you can set several words using , (comma) as a separator.
For example: -m ERV,LTR
Note that there can't be spaces in the command line, or they have to be escaped with \
(ii) a file: it can be a fasta/fastq file, or simply a file with a list of IDs (one column)
If the \" > \" or @ is kept with the ID, then all lines need to have it (unless -grep)
Headers can contain:
- fasta/fastq IDs only (no spaces) [default search is done against IDs only]
- full fasta headers (use -both to match both, otherwise only ID is looked at)
- descriptions only (spaces allowed) if -desc is set
Note that you need to use the -file flag
-file => (BOOL) choose this if -m corresponds to a file
-out => (STRING) to set the name of the output file (default = input.extract.fa)
-fq => (BOOL) if input file is in fastq format; output will also be fastq
-grep => (BOOL) Choose this with -fq to use grep instead of using BioSeq
But this is even slower on large files.
Only relevant if -fq is set as well, because the sequences
will be extracted using grep -A 3 for each word set with -m
(extracting line that matches + 3 lines after the match)
Also, this makes irrelevant the use of these options:
-desc, -both, -regex, -inv, -noc
-desc => (BOOL) to look for match in the description and not the header
-both => (BOOL) to look into both headers and description
-regex => (BOOL) to look for containing the word and not an exact match
Special characters in names or descriptions will be an issue;
the only ones that are taken care of are: | / . [ ]
-inv => (BOOL) to extract what DOES NOT match
-noc => (BOOL) to ignore case in matching
-chlog => (BOOL) print updates
-v => (BOOL) verbose mode, make the script talk to you
-v => (BOOL) print version if only option
-h|help => (BOOL) print this help \n\n " ;
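
The usage text above describes how a FASTA header splits into an ID (everything before the first space) and a description (everything after), and how matches can be exact, substring-based, or inverted. The following Python sketch illustrates that matching logic only. It is not the script's implementation, and the names `split_header` and `matches` and their keyword arguments are hypothetical, loosely modelled on the -desc, -both, -regex and -inv options.

```python
def split_header(header):
    """Split '>ID description' into (ID, description).

    The ID is everything before the first space; the description is the rest.
    """
    ident, _, desc = header.lstrip(">").partition(" ")
    return ident, desc


def matches(header, words, in_desc=False, both=False, contains=False,
            invert=False):
    """Decide whether a record with this header should be extracted.

    in_desc  - look at the description instead of the ID
    both     - look at the full header (ID plus description)
    contains - substring match instead of exact match
    invert   - keep what does NOT match
    """
    ident, desc = split_header(header)
    if both:
        target = header.lstrip(">")
    elif in_desc:
        target = desc
    else:
        target = ident
    if contains:
        hit = any(word in target for word in words)
    else:
        hit = target in words
    return hit != invert


if __name__ == "__main__":
    print(split_header(">seq1 human virus protein"))
    print(matches(">ERV1_LTR endogenous retrovirus", ["ERV", "LTR"],
                  contains=True))
```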

Frederick Sanger - Wikipedia

Frederick Sanger OM CH CBE FAA (/ˈsæŋər/; 13 August 1918 – 19 November 2013) was a British biochemist who twice won the Nobel Prize in Chemistry, one of only two people to have done so in the same category (the other is John Bardeen in physics),[4] the fourth person overall with two Nobel Prizes, and the third person overall with two Nobel Prizes in the sciences. In 1958, he was awarded a Nobel Prize in Chemistry "for his work on the structure of proteins, especially that of insulin". In 1980, Walter Gilbert and Sanger shared half of the chemistry prize "for their contributions concerning the determination of base sequences in nucleic acids". The other half was awarded to Paul Berg "for his fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant DNA".[5]

Early life and education

Frederick Sanger was born on 13 August 1918 in Rendcomb, a small village in Gloucestershire, England, the second son of Frederick Sanger, a general practitioner, and his wife, Cicely Sanger (née Crewdson).[6] He was one of three children. His brother, Theodore, was only a year older, while his sister May (Mary) was five years younger.[7] His father had worked as an Anglican medical missionary in China but returned to England because of ill health. He was 40 in 1916 when he married Cicely who was four years younger. Sanger's father converted to Quakerism soon after his two sons were born and brought up the children as Quakers. Sanger's mother was the daughter of a wealthy cotton manufacturer and had a Quaker background, but was not a Quaker.[7]

When Sanger was around five years old the family moved to the small village of Tanworth-in-Arden in Warwickshire. The family was reasonably wealthy and employed a governess to teach the children. In 1927, at the age of nine, he was sent to the Downs School, a residential preparatory school run by Quakers near Malvern. His brother Theo was a year ahead of him at the same school. In 1932, at the age of 14, he was sent to the recently established Bryanston School in Dorset. This used the Dalton system and had a more liberal regime which Sanger much preferred. At the school he liked his teachers and particularly enjoyed scientific subjects.[7] Able to complete his School Certificate a year early, for which he was awarded seven credits, Sanger was able to spend most of his last year of school experimenting in the laboratory alongside his chemistry master, Geoffrey Ordish, who had originally studied at Cambridge University and been a researcher in the Cavendish Laboratory. Working with Ordish made a refreshing change from sitting and studying books and awakened Sanger's desire to pursue a scientific career.[8]

In 1936 Sanger went to St John's College, Cambridge to study natural sciences. His father had attended the same college. For Part I of his Tripos he took courses in physics, chemistry, biochemistry and mathematics but struggled with physics and mathematics. Many of the other students had studied more mathematics at school. In his second year he replaced physics with physiology. He took three years to obtain his Part I. For his Part II he studied biochemistry and obtained a 1st Class Honours. It was a relatively new department founded by Gowland Hopkins with enthusiastic lecturers who included Malcolm Dixon, Joseph Needham and Ernest Baldwin.[7]

Both his parents died from cancer during his first two years at Cambridge. His father was 60 and his mother was 58. As an undergraduate Sanger's beliefs were strongly influenced by his Quaker upbringing. He was a pacifist and a member of the Peace Pledge Union. It was through his involvement with the Cambridge Scientists' Anti-War Group that he met his future wife, Joan Howe, who was studying economics at Newnham College. They courted while he was studying for his Part II exams and married after he had graduated in December 1940. Under the Military Training Act 1939 he was provisionally registered as a conscientious objector, and again under the National Service (Armed Forces) Act 1939, before being granted unconditional exemption from military service by a tribunal. In the meantime he undertook training in social relief work at the Quaker centre, Spicelands, Devon and served briefly as a hospital orderly.[7]

Sanger began studying for a PhD in October 1940 under N.W. "Bill" Pirie. His project was to investigate whether edible protein could be obtained from grass. After little more than a month Pirie left the department and Albert Neuberger became his adviser.[7] Sanger changed his research project to study the metabolism of lysine[9] and a more practical problem concerning the nitrogen of potatoes.[10] His thesis had the title, "The metabolism of the amino acid lysine in the animal body". He was examined by Charles Harington and Albert Charles Chibnall and awarded his doctorate in 1943.[7]

Sequencing insulin

Neuberger moved to the National Institute for Medical Research in London, but Sanger stayed in Cambridge and in 1943 joined the group of Charles Chibnall, a protein chemist who had recently taken up the chair in the Department of Biochemistry. Chibnall had already done some work on the amino acid composition of bovine insulin[11] and suggested that Sanger look at the amino groups in the protein. Insulin could be purchased from the pharmacy chain Boots and was one of the very few proteins that were available in a pure form. Up to this time Sanger had been funding himself. In Chibnall's group he was initially supported by the Medical Research Council and then from 1944 until 1951 by a Beit Memorial Fellowship for Medical Research.[6]

Sanger's first triumph was to determine the complete amino acid sequence of the two polypeptide chains of bovine insulin, A and B, in 1952 and 1951, respectively.[12][13] Prior to this it was widely assumed that proteins were somewhat amorphous. In determining these sequences, Sanger proved that proteins have a defined chemical composition.[7]

To get to this point, Sanger refined a partition chromatography method first developed by Richard Laurence Millington Synge and Archer John Porter Martin to determine the composition of amino acids in wool. Sanger used a chemical reagent 1-fluoro-2,4-dinitrobenzene (now, also known as Sanger's reagent, fluorodinitrobenzene, FDNB or DNFB), sourced from poisonous gas research by Bernhard Charles Saunders at the Chemistry Department at Cambridge University.

Sanger's reagent proved effective at labelling the N-terminal amino group at one end of the polypeptide chain.[14] He then partially hydrolysed the insulin into short peptides, either with hydrochloric acid or using an enzyme such as trypsin. The mixture of peptides was fractionated in two dimensions on a sheet of filter paper, first by electrophoresis in one dimension and then, perpendicular to that, by chromatography in the other. The different peptide fragments of insulin, detected with ninhydrin, moved to different positions on the paper, creating a distinct pattern that Sanger called "fingerprints". The peptide from the N-terminus could be recognised by the yellow colour imparted by the FDNB label and the identity of the labelled amino acid at the end of the peptide determined by complete acid hydrolysis and discovering which dinitrophenyl-amino acid was there.[7]

By repeating this type of procedure Sanger was able to determine the sequences of the many peptides generated using different methods for the initial partial hydrolysis. These could then be assembled into the longer sequences to deduce the complete structure of insulin. Finally, because the A and B chains are physiologically inactive without the three linking disulfide bonds (two interchain, one intrachain on A), Sanger and coworkers determined their assignments in 1955.[15][16]

Sanger's principal conclusion was that the two polypeptide chains of the protein insulin had precise amino acid sequences and, by extension, that every protein had a unique sequence.

It was this achievement that earned him his first Nobel prize in Chemistry in 1958.[17] This discovery was crucial for the later sequence hypothesis of Crick for developing ideas of how DNA codes for proteins.[18]

Sequencing RNA

From 1951 Sanger was a member of the external staff of the Medical Research Council[6] and when they opened the Laboratory of Molecular Biology in 1962, he moved from his laboratories in the Biochemistry Department of the university to the top floor of the new building. He became head of the Protein Chemistry division.[7]

Prior to his move, Sanger began exploring the possibility of sequencing RNA molecules and began developing methods for separating ribonucleotide fragments generated with specific nucleases. This work he did while trying to refine the sequencing techniques he had developed during his work on insulin.[18]

The key challenge in the work was finding a pure piece of RNA to sequence. In the course of the work he discovered in 1964, with Kjeld Marcker, the formylmethionine tRNA which initiates protein synthesis in bacteria.[19]

He was beaten in the race to be the first to sequence a tRNA molecule by a group led by Robert Holley from Cornell University, who published the sequence of the 77 ribonucleotides of alanine tRNA from Saccharomyces cerevisiae in 1965.[20] By 1967 Sanger's group had determined the nucleotide sequence of the 5S ribosomal RNA from Escherichia coli, a small RNA of 120 nucleotides.[21]

Sequencing DNA

He then turned to sequencing DNA, which would require an entirely different approach. He looked at different ways of using DNA polymerase I from E. coli to copy single stranded DNA.[22] In 1975, together with Alan Coulson, he published a sequencing procedure using DNA polymerase with radiolabelled nucleotides that he called the "Plus and Minus" technique.[23][24] This involved two closely related methods that generated short oligonucleotides with defined 3' termini. These could be fractionated by electrophoresis on a polyacrylamide gel and visualised using autoradiography. The procedure could sequence up to 80 nucleotides in one go and was a big improvement on what had gone before, but was still very laborious. Nevertheless, his group were able to sequence most of the 5,386 nucleotides of the single-stranded bacteriophage φX174.[25] This was the first fully sequenced DNA-based genome. To their surprise they discovered that the coding regions of some of the genes overlapped with one another.[3]

In 1977 Sanger and colleagues introduced the "dideoxy" chain-termination method for sequencing DNA molecules, also known as the "Sanger method".[24][26] This was a major breakthrough and allowed long stretches of DNA to be rapidly and accurately sequenced.

It earned him his second Nobel prize in Chemistry in 1980, which he shared with Walter Gilbert and Paul Berg.[5] The new method was used by Sanger and colleagues to sequence human mitochondrial DNA (16,569 base pairs)[27] and bacteriophage λ (48,502 base pairs).[28] The dideoxy method was eventually used to sequence the entire human genome.[29]

Postgraduate students

During the course of his career Sanger supervised more than ten PhD students, two of whom went on to also win Nobel Prizes. His first graduate student was Rodney Porter who joined the research group in 1947.[3] Porter later shared the 1972 Nobel Prize in Physiology or Medicine with Gerald Edelman for his work on the chemical structure of antibodies.[30] Elizabeth Blackburn studied for a PhD in Sanger's laboratory between 1971 and 1974.[3][31] She shared the 2009 Nobel Prize in Physiology or Medicine with Carol W. Greider and Jack W. Szostak for her work on telomeres and the action of telomerase.[32]

Awards and honours

As of 2015, Sanger is the only person to have been awarded the Nobel Prize in Chemistry twice, and one of only four two-time Nobel laureates: The other three were Marie Curie (Physics, 1903 and Chemistry, 1911), Linus Pauling (Chemistry, 1954 and Peace, 1962) and John Bardeen (twice Physics, 1956 and 1972).[4]
Elected Fellow of the Royal Society (FRS) in 1954[3]
Commander of the Order of the British Empire – 1963
Order of the Companions of Honour – 1981
Order of Merit – 1986
Corresponding Fellow of the Australian Academy of Science – 1982
William Bate Hardy Prize – 1976
Nobel Prize in Chemistry – 1958, 1980
Corday–Morgan Medal – 1951
Royal Medal – 1969
Gairdner Foundation International Award – 1971
Copley Medal – 1977
G.W. Wheland Award – 1978
Louisa Gross Horwitz Prize of Columbia University – 1979
Albert Lasker Award for Basic Medical Research – 1979
Association of Biomolecular Resource Facilities Award – 1994
Citation for Chemical Breakthrough Award from the Division of History of Chemistry of the American Chemical Society – 2016[33][34][35]
The Wellcome Trust Sanger Institute (formerly the Sanger Centre) is named in his honour.

Personal life

Sanger married Margaret Joan Howe in 1940. She died in 2012. They had three children - Robin, born in 1943, Peter born in 1946 and Sally Joan born in 1960.[6] He said that his wife had "contributed more to his work than anyone else by providing a peaceful and happy home."[36]

Later life

Sanger retired in 1983, aged 65, to his home, "Far Leys", in Swaffham Bulbeck outside Cambridge.[3]

In 1992, the Wellcome Trust and the Medical Research Council founded the Sanger Centre (now the Sanger Institute), named after him.[37] The Institute is located on the Wellcome Trust Genome Campus near Hinxton, only a few miles from Sanger's home. He agreed to having the Centre named after him when asked by John Sulston, the founding director, but warned, "It had better be good."[37] It was opened by Sanger in person on 4 October 1993, with a staff of fewer than 50 people, and went on to take a leading role in the sequencing of the human genome.[37] The Institute now has over 900 people and is one of the world's largest genomic research centres.

Sanger said he found no evidence for a God so he became an agnostic.[38] In an interview published in the Times newspaper in 2000 Sanger is quoted as saying: "My father was a committed Quaker and I was brought up as a Quaker, and for them truth is very important. I drifted away from those beliefs – one is obviously looking for truth, but one needs some evidence for it. Even if I wanted to believe in God I would find it very difficult. I would need to see proof."[39]

He declined the offer of a knighthood, as he did not wish to be addressed as "Sir". He is quoted as saying, "A knighthood makes you different, doesn't it, and I don't want to be different." In 1986, he accepted the award of an Order of Merit, which can have only 24 living members.[36][38][39]

In 2007 the British Biochemical Society was given a grant by the Wellcome Trust to catalogue and preserve the 35 laboratory notebooks in which Sanger recorded his research from 1944 to 1983. In reporting this matter, Science noted that Sanger, "the most self-effacing person you could hope to meet", was spending his time gardening at his Cambridgeshire home.[40]

Sanger died in his sleep at Addenbrooke's Hospital in Cambridge on 19 November 2013.[36][41] As noted in his obituary, he had described himself as "just a chap who messed about in a lab",[42] and "academically not brilliant".[43]

Recommended Links



DNA sequencing - Wikipedia

Frederick Sanger - Wikipedia

Frederick Sanger interviewed by Alan Macfarlane, 24 August 2007 (film), also available on Video on YouTube. Duration 57 minutes.

Compression of FASTQ and SAM Format Sequencing Data

Storage and transmission of the data produced by modern DNA sequencing instruments has become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs (Fastqz, fqzcomp, and samcomp) are freely available on SourceForge.

Performance compariso

  1. Deorowicz S (2013) Grabowski S (2013) Data compression for sequencing data. Algorithms for molecular biology: AMB 8(1):25. doi: 10.1186/1748-7188-8-25.
  2. 2.

    Loh PR, Baym M, Berger B (2012) Compressive genomics. Nat Biotechnol 30(7):627–30. doi: 10.1038/nbt.2241.

  3. 3.

    RAID Incorporated (2015) Storing and managing petabytes of genome sequencing data. Tech. rep. [Online]. Accessed on 23 March 2015

  4. 4.

    Brandon MC, Wallace DC, Baldi P (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics (Oxford, England) 25(14):1731–1738. doi: 10.1093/bioinformatics/btp319.

  5. 5.

    Baxevanis Andreas D, Ouellette Francis BF (2004) Bioinformatics: a practical guide to the analysis of genes and proteins, 2nd edn. doi: 10.1007/s10439-006-9105-9

  6. 6.

    Format specification (FASTQ) (2014) [Online]. Accessed on 23 Sept 2014

  7. 7.

    Howison M, Zapata F, Dunn CW (2013) Toward a statistically explicit understanding of de novo sequence assembly. Bioinformatics 29(23):2959–2963. doi: 10.1093/bioinformatics/btt525 CrossRefGoogle Scholar

  8. 8.

    Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26(10):1135–1145. doi: 10.1038/nbt1486 CrossRefGoogle Scholar

  9. 9.

    Bakr NS, Sharawi AA (2013) DNA lossless compression algorithms: review. Am J Bioinform Res 3(3):72–81. doi: 10.5923/j.bioinformatics.20130303.04 Google Scholar

  10. 10.

    1000 Genomes (2014) A deep catalog of human genetic variation. [Online]. Accessed on 03 Oct 2014

  11. 11.

    Encyclopedia of DNA Elements (ENCODE) (2014) [Online]. Accessed on 03 Oct 2014

  12. 12.

    Genomics England (2014). [Online]. Accessed on 03 Oct 2014

  13. 13.

    ICGC Cancer Genome Projects (2014) [Online]. Accessed on 03 Oct 2014

  14. 14.

    Wandelt S, Bux M, Leser U (2013) Trends in genome compression. Curr Bioinform 1–24 .

  15. 15.

    Kaipa KK, Lee K, Ahn T, Narayanan R (2010) System for random access dna sequence compression. IEEE international conference on bioinformatics and biomedicine workshops system, pp 853–854.

  16. 16.

    Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory I(3):337–338Google Scholar

  17. 17.

    Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536MathSciNetCrossRefMATHGoogle Scholar

  18. 18.

    Huffman DA (1952) A method for the construction of minimum-redundancy codes. Proc IRE 27:1098,1101Google Scholar

  19. 19.

    Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucl Acids Res 40(22):1–9. doi: 10.1093/nar/gks754 CrossRefGoogle Scholar

  20. 20.

    Pinho AJ, Pratas D, Garcia SP (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucl Acids Res 40(4):1–8. doi: 10.1093/nar/gkr1124 CrossRefGoogle Scholar

  21. 21.

    Christley S, Lu Y, Li C, Xie X (2009) Human genomes as email attachments. Bioinformatics 25(2):274–275. doi: 10.1093/bioinformatics/btn582 CrossRefGoogle Scholar

  22. 22.

    Chen X, Kwong S, Li M (2001) A compression algorithm for DNA sequences. IEEE engineering in medicine and biology magazine 20:61–66. doi: 10.1109/51.940049.

  23.

    Chen X, Kwong S, Li M (1999) A compression algorithm for DNA sequences and its applications in genome comparison. Genome informatics. Workshop on genome informatics 10:51–61.

  24.

    Grumbach S, Tahi F (1994) A new challenge for compression algorithms: genetic sequences. Inf Process Manage 30:875–886.

  25.

    Giancarlo R, Rombo SE, Utro F (2014) Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief Bioinform 15(3):390–406. doi: 10.1093/bib/bbt088.

  26.

    Matsumoto T, Sadakane K, Imai H (2000) Biological sequence compression algorithms. Genome Inf 11:43–52. doi: 10.11234/gi1990.11.43.

  27.

    Deorowicz S, Grabowski S (2011) Compression of DNA sequence reads in FASTQ format. Bioinformatics 27(6):860–862. doi: 10.1093/bioinformatics/btr014.

  28.

    Bhola V, Bopardikar AS, Narayanan R, Lee K, Ahn T (2011) No-reference compression of genomic data stored in FASTQ format. Proceedings-2011 IEEE international conference on bioinformatics and biomedicine, BIBM 2011, pp 147–150. doi: 10.1109/BIBM.2011.110.

  29.

    Yanovsky V (2011) ReCoil-an algorithm for compression of extremely large datasets of DNA data. Algorithms Mol Biol 6(1):23. doi: 10.1186/1748-7188-6-23.

  30.

    Mohammed MH, Dutta A, Bose T, Chadaram S, Mande SS (2012) DELIMINATE-a fast and efficient method for loss-less compression of genomic sequences: sequence analysis. Bioinformatics (Oxford, England) 28(19):2527–9. doi: 10.1093/bioinformatics/bts467.

  31.

    Grassi E, Gregorio FD, Molineris I (2012) KungFQ: a simple and powerful approach to compress fastq files. IEEE/ACM Trans Comput Biol Bioinform 9(6):1837–1842. doi: 10.1109/TCBB.2012.123.

  32.

    Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLoS ONE 8(3):1–11. doi: 10.1371/journal.pone.0059190.

  33.

    Roguski L, Deorowicz S (2014) DSRC 2-industry oriented compression of FASTQ files. Bioinformatics 30(15):2213–2215. doi: 10.1093/bioinformatics/btu208.

  34.

    Sardaraz M, Tahir M, Ikram AA. Advances in high throughput DNA sequence data compression. J Bioinform Comput Biol 0(0):1630002. doi: 10.1142/S0219720016300021. (PMID: 26846812)

  35.

    Zhu Z, Zhang Y, Ji Z, He S, Yang X (2013) High-throughput DNA sequence data compression. Brief Bioinform 16(1). doi: 10.1093/bib/bbt087.

  36.

    Adler M. PIGZ documentation. [Online]. Accessed on 03 Dec 2014

  37.

    Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012) Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics (Oxford, England) 28(11):1415–1419. doi: 10.1093/bioinformatics/bts173.

  38.

    Dutta A, Haque MM, Bose T, Reddy C, Mande SS (2015) Fqc: a novel approach for efficient compression, archival, and dissemination of fastq datasets. J Bioinform Comput Biol 13(03):1541003.

  39.

    Hach F, Numanagić I, Alkan C, Sahinalp SC (2012) SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28(23):3051–7. doi: 10.1093/bioinformatics/bts593.

  40.

    Howison M (2013) High-throughput compression of FASTQ data with SeqDB. IEEE/ACM Trans Comput Biol Bioinform 10(1):213–218. doi: 10.1109/TCBB.2012.160.

  41.

    Janin L, Schulz-Trieglaff O, Cox AJ (2014) BEETL-fastq: a searchable compressed archive for DNA reads. Bioinformatics (Oxford, England) 1–6. doi: 10.1093/bioinformatics/btu387.

  42.

    Linux man page (2014) Pbzip2: parallel bzip2 file compressor. [Online]. Accessed on 03 Dec 2014

  43.

    Nicolae M, Pathak S, Rajasekaran S (2015) LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31(20):3276–3281. doi: 10.1093/bioinformatics/btv384.

  44.

    Oberhumer M (2015) LZO real-time data compression library. [Online]. Accessed on 03 March 2016

  45.

    Pavlov I (2016) 7-zip. [Online]. Accessed on 03 March 2016

  46.

    WinRAR archiver, a powerful tool to process RAR and ZIP files. [Online]. Accessed on 03 Dec 2014

  47.

    Zhang Y, Patel K, Endrawis T, Bowers A, Sun Y (2015) A FASTQ compressor based on integer-mapped k-mer indexing for biologist. Gene 579(1):75–81. doi: 10.1016/j.gene.2015.12.053.

  48.

    Zhan X, Yao D (2014) A novel method to compress high-throughput DNA sequence read archive. In: Software intelligence technologies and applications international conference on frontiers of internet of things 2014, international conference on, pp 58–61. doi: 10.1049/cp.2014.1536.

  49.

    Daily K, Rigor P, Christley S, Xie X, Baldi P (2010) Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinformatics 11(1):514. doi: 10.1186/1471-2105-11-514.

  50.

    Guo G, Qiu S, Ye Z, Wang B, Fang L, Lu M, See S, Mao R (2013) GPU-accelerated adaptive compression framework for genomics data. 2013 IEEE international conference on big data GPU-accelerated, pp 181–186.

  51.

    Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G (2011) Compressing genomic sequence fragments using SlimGene. J Comput Biol 18(3):401–13. doi: 10.1089/cmb.2010.0253.

  52.

    Sakib MN, Tang J, Zheng WJ, Huang CT (2011) Improving transmission efficiency of large sequence alignment/map (SAM) files. PLoS ONE 6(12):2–5. doi: 10.1371/journal.pone.0028251.

  53.

    Vey G (2009) Differential direct coding: a compression algorithm for nucleotide sequence data. Database: the journal of biological databases and curation 2009:bap013. doi: 10.1093/database/bap013.

  54.

    Awan F, Mukherjee A (2002) Lossless compression handbook, chap. text compression, pp 227–245. Communications, networking and multimedia. Elsevier Science, London.

  55.

    Carus A, Mesut A (2010) Fast text compression using multiple static dictionaries. Inf Technol J 9:1013–1021.

  56.

    Crochemore M, Lecroq T (2012) The computer science and engineering handbook, chap. pattern matching and text compression algorithms, pp 3–77. CRC Press, Boca Raton.

  57.

    Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Tech. rep., Digital Equipment Corporation, California.

  58.

    7-zip SourceForge editor's review (2016) [Online]. Accessed on 03 March 2016

  59.

    Mahoney M (2016) Data compression explained. [Online]. Accessed on 03 March 2016

  60.

    Selva JJ, Chen X (2013) SRComp: short read sequence compression using burstsort and Elias omega coding. PLoS ONE 8(12):1–7. doi: 10.1371/journal.pone.0081414.

  61.

    Tembe W, Lowey J, Suh E (2010) G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26(17):2192–2194. doi: 10.1093/bioinformatics/btq346.

  62.

    Mahoney MV (2005) Adaptive weighing of context models for lossless data compression. Florida Tech., Melbourne CS-2005-16(x):1–6.

  63.

    Salomon D, Bryant D, Motta G (2010) Handbook of data compression. Springer, London.

  64.

    Batu T, Ergun F, Sahinalp C (2006) Oblivious string embeddings and edit distance approximations. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms, pp 792–801. Society for Industrial and Applied Mathematics.

  65.

    Biocancer Research Journal: Transcriptoma (2014) [Online]. Accessed on 03 Dec 2014

  66.

    Pevsner J (2009) Bioinformatics and functional genomics, 2nd edn. Springer, Berlin.

  67.

    Deorowicz S, Grabowski S (2011) Robust relative compression of genomes with random access. Bioinformatics 27(21):2979–2986. doi: 10.1093/bioinformatics/btr505.




Copyright © 1996-2021 by Softpanorama Society. The site was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Copyright of original materials belongs to their respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

You can use PayPal to buy a cup of coffee for the authors of this site


The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of Google's privacy policy. If you do not want to be tracked by Google, please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: February 27, 2021