DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or
technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA. The
advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.
Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous applied fields such as medical
diagnosis, biotechnology, forensics, and virology. The rapid speed of sequencing attained with modern
DNA sequencing technology has been instrumental in the sequencing of complete DNA sequences, or genomes, of numerous types and
species of life, including the human genome and the complete DNA sequences of many other animal, plant, and microbial species.
The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based on two-dimensional
chromatography. Following the development of fluorescence-based sequencing methods with a DNA sequencer, DNA sequencing has become
easier and orders of magnitude faster.
The biological activity of every living organism is controlled by billions of individual cells. The control center of each cell
is the deoxyribonucleic acid (DNA), which contains a complete set of instructions needed to direct the functioning of each and every
one of those cells. The chemical composition of DNA is the same for all living organisms. The DNA of every living organism
contains four basic nucleotide bases: adenine, cytosine, guanine, and thymine, usually abbreviated using the symbols A, C, G and T respectively.
The DNA sequence is usually divided into a number of chromosomes, which in turn contain genes. Genes are sequences of base pairs
that contain instructions on how to produce proteins. They are also related to heredity. The areas of the DNA that contain genes are
thus called coding areas, while the remaining parts are called non-coding areas. In higher-level eukaryotes, genes are usually
spliced up into alternating regions of exons and introns. The introns are noncoding DNA and are cut out before the messenger ribonucleic acid (mRNA) leaves the nucleus for the ribosome, where the protein specified
by the mRNA is synthesized. The information in the mRNA thus represents the exons, which are then used to make proteins. The
non-coding area (sometimes called "junk DNA") contains redundant repeating sequences, estimated to make up at least 50% of
the human genome.
From a computational viewpoint, a biological sequence can be viewed mainly as a one-dimensional sequence of symbols, with an
alphabet of 4 symbols for DNA or RNA and 20 symbols for proteins. Biological sequences typically contain different types of repetitions and
other hidden regularities. Long runs of tandem repeats and of randomly interspersed repeats are prominent features of DNA sequences.
The family of Alu repeats (typically about 300 bases in length) is typical of short interspersed repeat sequences (SINEs - short
interspersed nuclear elements). These have been estimated to make up about 9% of the human genome, thus out-numbering the proportion
of protein coding regions[Herzel94es]. There are also the long interspersed repeat sequences (LINEs - long interspersed nuclear
elements), which are usually more than 6000 bases in length. In the human genome, the L1 family is the most common LINE, with about
60,000 to 100,000 occurrences. There are also short repeats (sometimes called "random repeats"), attributed to the fact that
typical sequences and genomes are orders of magnitude larger than the alphabet size (4 in this case).
We will discuss here only one topic: compression of FASTA/FASTQ files. Both are text representations, described in more detail below.
They are not formally standardized, but in both cases a de-facto standard exists, and utilities for conversion to this de-facto
standard exist too.
The most common format for DNA sequencing data is now the FASTQ format. In FASTQ files an additional symbol N is introduced, meaning
an undetermined ("not defined") base. In order not to increase the size of the alphabet, the symbol N can be recovered from the
quality score: N should always have quality zero (this convention should be enforced).
In a FASTQ file, every four lines represent a single DNA sequence (read).
The general syntax of a FASTQ record is as follows:
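An illustrative record (the identifier and bases are made up; note the undetermined base N carrying '!', the zero quality in Sanger encoding):

@read_1
ACGTNACGTACGT
+
IIII!IIIIIIII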
The four-symbol alphabet (A, C, G, T) can be encoded using two bits per letter. That would achieve a compression ratio of around four,
and that level represents the "minimum acceptable level" of compression of FASTA files, no matter what algorithm is used.
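A minimal sketch of such 2-bit packing in Python (illustrative only; it assumes the input contains nothing but A, C, G and T, while real packers must also handle N and lowercase letters):

# Pack 4 bases per byte (2 bits each): the ~4x baseline ratio.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i+4]
        b = 0
        for ch in chunk:
            b = (b << 2) | CODE[ch]
        b <<= 2 * (4 - len(chunk))   # pad the final byte with zero bits
        out.append(b)
    return bytes(out)

def unpack(data, n):
    bases = []
    for b in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(b >> shift) & 3])
    return "".join(bases[:n])        # drop the padding

seq = "GATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), "bases ->", len(packed), "bytes")   # 7 bases -> 2 bytes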
But FASTA/FASTQ files are not completely random. They convey important information about the organism, and because of this they
contain the repetitions, palindromes and other regular structures inherent in biological sequences. These imply additional
redundancies that can provide an avenue for increasing the compression ratio to six or more.
But the identification of such sequences is tricky, and not all short reads in FASTA/FASTQ contain enough of them to justify the
computational costs. Repetitions that span short reads can be quite distant and require building a dictionary. Also, due to the
additional overhead, repetitions of up to roughly eight symbols do not increase the compression ratio much, as you now need to
distinguish between a "literal string" and a reference to a previous occurrence or dictionary entry. That creates an overhead,
the effect of which is visible in gzip, which usually does not achieve a compression ratio of four on FASTA/FASTQ files.
So achieving a compression ratio of, say, six for FASTA files is not that easy, and consistently doing so is currently out of
scope for generic methods (see below). That's the law of diminishing returns in action.
At the same time there is no widely accepted specialized algorithm that achieves a compression ratio of six or better without
an undue slowdown of compression speed. The area of specialized algorithms is balkanized, and researchers often do not sustain
interest in their compression utility and the corresponding algorithms for more than six years. There is a lot of abandonware in
the field, some of it with promising ideas.
The Burrows–Wheeler transform (BWT) usually shows good results for text-based data, but it is slow and the gains are not
enough to compete with the Deflate algorithm used in gzip. It really shines, however, in compressing FASTA/FASTQ files, as DNA
consists of only four basic nucleotides, labeled A, C, G, and T. A FASTA file is basically a massive string containing these four
symbols in various combinations.
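To illustrate why, here is a naive Python sketch of the forward transform (illustrative only: real tools build a suffix array instead of materializing all rotations, and follow the transform with move-to-front and entropy coding stages):

# Naive Burrows-Wheeler transform: sort all rotations of the string and
# take the last column. Symbols that share a right-hand context end up
# adjacent, producing runs that the later stages compress well.
def bwt(s, eos="$"):
    s += eos                         # unique end-of-string sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("GATTACAGATTACA"))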
There are two major categories of FASTA/FASTQ files compression methods and programs:
reference-free methods. These in turn can be generic or specialized. Generic tools can be single-threaded or
multi-threaded; the latter take advantage of the multi-core CPUs that are now dominant (for example, gzip is single-threaded, while
pigz is a parallelized version that produces, and can decompress, gz files).
generic. Typically this is gzip/pigz (the power of inertia), more rarely
bzip2/pbzip2 (the best combination of speed and compression ratio for this type of data), even more rarely xz; the latter has a much
slower compression speed than pbzip2. Those are generally limited to a compression ratio of five (the archived file size
is 20% or more of the original). Of them pbzip2 is currently the undisputed leader. Here is a relevant quote about bzip2 from chapter 8 ("Contextual Data Transforms: Practical Implementations") of the book
Understanding Compression: Data Compression for Modern Developers by Colt McAnlis and Aleks Haecky (O'Reilly, 2016):
BWT and DNA
BWT has always been an edge-case of compression. Its initial existence showed really good results for text-based data, but it
could never compete from a performance perspective with other algorithms such as GZIP. As such, BWT (or bzip2, the dominant BWT
encoder) never really took the compression world by storm.
That is, until humans began sequencing deoxyribonucleic acid, or DNA.
Human DNA has a pretty simple setup with only four basic nucleotide bases, labeled A, C, G, and T. A given genome is basically
a massive string containing these four symbols in various orderings. How much? Well, the human genome contains about 3.1647
billion DNA base pairs.
It turns out that BWT’s block-sorting algorithm is an ideal transform that could
be applied to DNA to make it more compressible, searchable, and retrievable. (There’s actually a boat-load of papers proving
this.) The reduction in size and availability for fast reads are of high importance when aligning reads of new genomes against a
reference.
This just goes to show how there’s no single silver bullet when it comes to data compression. Each stream of information has
its own variable characteristics and responds differently to different transforms and encoders. Although BWT might not have taken
the web away from its cousin, GZIP, it stands alone as an important factor in the next few decades of bioinformatics.
specialized methods that are based on the specific characteristics of FASTA/FASTQ files (FASTQZ, etc.),
such as the small alphabet size (mainly four nucleotides: A (adenine), T (thymine), C (cytosine) and G (guanine)),
repeats and palindromes. They can achieve a higher compression ratio, but often at the cost of
additional time during both compression and decompression. The simplest reference-free specialized method is based on conversion of sequences into 2-bit values, with exceptions (for
example, the letter N) encoded via special commands appended after the compressed string, which overwrite the decoded string in the specified
places. It can achieve a compression ratio of 4, but postprocessing the encoded sequence with generic archivers does not improve it
much. See for example fapack, written by Mahoney as part of the fastqz project (fastqz - FASTQ
compressor).
fapack is a program for packing FASTA files at 4 bases (A,C,G,T) per byte. fapack also accepts lowercase (a,c,g,t),
used to indicate repeats. This produces a larger reference genome but generally better compression.
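The exception mechanism can be sketched as follows (a hypothetical layout, not fapack's actual on-disk format; pack() and unpack() are from the earlier 2-bit sketch):

# Hypothetical N-exception scheme: map N to A for 2-bit packing and keep
# a side list of N positions that patches the decoded string afterwards.
def encode_with_exceptions(seq):
    n_positions = [i for i, ch in enumerate(seq) if ch == "N"]
    return pack(seq.replace("N", "A")), n_positions, len(seq)

def decode_with_exceptions(packed, n_positions, length):
    seq = list(unpack(packed, length))
    for i in n_positions:
        seq[i] = "N"                 # overwrite in the specified places
    return "".join(seq)

packed, exc, n = encode_with_exceptions("ACGTNNACGT")
assert decode_with_exceptions(packed, exc, n) == "ACGTNNACGT"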
reference-based methods, which exploit one or a set of reference sequence(s), either from an open source database or one of
the FASTA/FASTQ files existing in the dataset, used as the reference for a set of FASTA files from a particular organism. They can
generally provide a much higher compression ratio than the other methods.
Algorithms that do well with text compression are probably
worth investigating, insofar as uncompressed FASTQ is structured text.
Compression of nucleotide sequences has been an interesting problem for a long time.
Cox et al. (2012) apply the Burrows–Wheeler transform (BWT) for compression of genomic sequences.
GReEn (Pinho et al., 2012) is a reference-based sequence compression method offering a very good
compression ratio. (Compression ratio refers to the ratio of the original data size to the
compressed data size). Compressing the sequences along with quality scores and identifiers is a
very different problem.
Tembe et al. (2010) attached the Q-scores to the corresponding DNA bases to generate
new symbols out of each base/Q-score pair. They encoded each such distinct symbol using
Huffman encoding (Huffman, 1952).
Deorowicz and Grabowski (2011) divided the quality scores into three sorts of quality streams:
1. quasi-random quality scores with a mild dependence on position;
2. quasi-random quality scores with strings ending in several # characters;
3. quality scores with strong local correlations within individual records.
To represent case 2 they use a bit flag. Any string with that specific bit flag is processed to
remove all trailing #. They divide the quality scores into individual blocks and apply Huffman
coding.
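The case-2 transform can be sketched like this (an illustration of the idea only; the actual DSRC record layout differs):

# Strip the run of trailing '#' characters from a quality string, record
# its length, and mark the record with a bit flag so the decoder knows
# to restore the run.
def split_trailing_hashes(qual):
    stripped = qual.rstrip("#")
    n_hashes = len(qual) - len(stripped)
    return n_hashes > 0, stripped, n_hashes   # (flag, body, run length)

def restore_trailing_hashes(flag, body, n_hashes):
    return body + "#" * n_hashes if flag else body

print(split_trailing_hashes("IIIIHHGGFFEE####"))  # (True, 'IIIIHHGGFFEE', 4)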
Tembe et al. (2010) and
Deorowicz and Grabowski (2011) also show that general purpose compression algorithms do not
perform well and domain specific algorithms are indeed required for efficient compression of FASTQ
files. In their papers they demonstrate that significant improvements in compression ratio can be
achieved using domain specific algorithms compared with bzip2 and gzip.
The literature on FASTQ compression can be divided into two categories, namely lossless and
lossy. A lossless compression scheme is one where we preserve the entire data in the compressed
file. By contrast, lossy compression techniques allow some of the less important components of
the data to be lost during compression.
Kozanitis et al. (2011) use randomized rounding to perform lossy encoding.
Asnani et al. (2012) introduce a lossy compression technique for quality score encoding.
Wan et al. (2012) proposed both lossy and lossless transformations for sequence compression
and encoding.
Hach et al. (2012) presented a "boosting" scheme which reorganizes the reads so as to achieve a
higher compression speed and compression rate, independent of the compression algorithm in use.
When it comes to medical data compression it is very difficult to identify which components are
unimportant. Hence, many researchers believe that lossless compression techniques are particularly
needed for biological/medical data. Quip (Jones et al., 2012) is one such lossless compression tool. It separately compresses the
identifier, sequence and quality scores. Quip makes use of Markov chains for encoding sequences and
quality scores. DSRC (Deorowicz
and Grabowski, 2011) is also considered a state-of-the-art lossless compression algorithm.
Bonfield and Mahoney (2013) have recently come up with a set of algorithms (named Fqzcomp and Fastqz) to compress
FASTQ files. They perform identifier compression by storing the difference between the
current identifier and the previous identifier. Sequence compression is performed by using a set of
current identifier and the previous identifier. Sequence compression is performed by using a set of
techniques including base pair compaction, encoding and an order-k model. Apart from
hashing, they have also used a technique for encoding quality values by prediction.
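The identifier trick can be sketched in Python like this (a simplification; the real Fqzcomp/Fastqz identifier models tokenize the fields and code numeric deltas): consecutive identifiers usually share a long common prefix, so it suffices to store the length of the shared prefix plus the new suffix.

# Delta-code a FASTQ identifier against the previous one: store only the
# shared-prefix length and the differing suffix.
def delta_encode(prev, cur):
    k = 0
    while k < min(len(prev), len(cur)) and prev[k] == cur[k]:
        k += 1
    return k, cur[k:]

def delta_decode(prev, k, suffix):
    return prev[:k] + suffix

a = "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345"
b = "@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338"
k, suffix = delta_encode(a, b)
assert delta_decode(a, k, suffix) == b   # only the suffix bytes are stored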
Currently there is no standardized compressor for FASTA/FASTQ files, and general purpose archivers are often used as a substitute.
Out of the troika of the most popular archivers (gzip/pigz, bzip2/pbzip2 and xz), pbzip2 has an edge.
[Benchmark table omitted; its columns were: sample, compressor, parameters, original size, compressed size, compression time, compression speed, decompression time, speed relative to gzip -6, size relative to gzip, and Q-ratio (speed divided by the square of size), all relative to gzip.]
prinseq-lite.pl is a utility written in Perl for preprocessing NGS reads, including reads in FASTQ format.
It can read sequences both from files and from stdin (if you have only one input file).
I wanted to use it with compressed (gzipped/bzipped2) FASTQ input files.
As I do not need to store decompressed input files, the most efficient solution is to use
pipes.
This works well for a single file, but not for 2 files (paired-end reads).
For 2 files, named pipes (also known as FIFOs) can be used.
You can create a named pipe in Linux with the help of mkfifo command, for example
mkfifo R1_decompressed.fastq .
To use it, start decompressing something into it (either in a different terminal, or in
background), for example zcat R1.fastq.gz > R1_decompressed.fastq & ;
we can call this a writing/generating process, because it writes into a pipe.
(If you are writing software to use named pipes, any processes writing into them should be
started in a new thread, as they will block until all the data is consumed.)
Now if you give the R1_decompressed.fastq as a file argument to some other program, it will see
decompressed content (e.g. wc -l R1_decompressed.fastq will tell you the number of
lines in the decompressed file); we can call program reading from the named pipe a
reading/consuming process.
As soon as the consuming process has consumed (read) all of the data, the writing/generating
process will finally exit.
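If you script this pattern, a minimal Python sketch might look like the following (the file names and the consuming command are illustrative):

import os
import subprocess

# Named-pipe plumbing for paired-end reads: two writers feed two FIFOs,
# and one consumer reads them as if they were plain files.
pairs = [("R1.fastq.gz", "R1_decompressed.fastq"),
         ("R2.fastq.gz", "R2_decompressed.fastq")]

for _, fifo in pairs:
    if not os.path.exists(fifo):
        os.mkfifo(fifo)

# Start the writing/generating processes in the background; opening a
# FIFO for writing blocks until a reader appears, so the open happens
# in each child shell, not in this script.
writers = [subprocess.Popen("zcat %s > %s" % (src, fifo), shell=True)
           for src, fifo in pairs]

# The reading/consuming process sees the decompressed content:
subprocess.run(["wc", "-l"] + [fifo for _, fifo in pairs])

for w in writers:
    w.wait()
for _, fifo in pairs:
    os.remove(fifo)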
This, however, does not work with prinseq-lite.pl (version 0.20.4 or earlier), which fails with a broken pipe error.
Let's start with what the three formats (FASTA, FASTQ and SAM) have in common: all three store sequence data and sequence metadata.
Furthermore, all three formats are text-based.
However, beyond that all three formats are different and serve different purposes.
Let's start with the simplest format:
FASTA
FASTA stores a variable number of sequence records, and for each record it stores the
sequence itself, and a sequence ID. Each record starts with a header line whose first
character is > , followed by the sequence ID. The next lines of a record
contain the actual sequence.
The Wikipedia article gives several examples for peptide sequences, but since FASTQ and SAM are used
exclusively (?) for nucleotide sequences, here's a nucleotide example:
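(The record below is an illustrative stand-in in the same style as the original GtRNAdb example.)

>tRNA-Ala-AGC-1-1 (illustrative)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGCCCTGGGTTCGATCC
CCAGCACCTCCA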
In the context of nucleotide sequences, FASTA is mostly used to store reference data; that
is, data extracted from a curated database; the above is adapted from GtRNAdb (a database of tRNA sequences).
FASTQ
FASTQ was conceived to solve a specific problem of FASTA files: when sequencing, the
confidence in a given base call (that is, the identity of a
nucleotide) varies. This is expressed in the Phred quality score . FASTA had no
standardised way of encoding this. By contrast, a FASTQ record contains a sequence of quality
scores for each nucleotide.
A FASTQ record has the following format:
1. A line starting with @, containing the sequence ID.
2. One or more lines that contain the sequence.
3. A new line starting with the character +, and being either empty or repeating the sequence ID.
4. One or more lines that contain the quality scores.
Here's an example of a FASTQ file with two records:
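(An illustrative stand-in for the original example; note that the second record repeats the sequence ID after the +.)

@read_1
ACTGGCATTCACGGGCAATCTGACTACT
+
IIIIIHHGGFFEEDDCCBBAA@@??>>=
@read_2
TTGACCTAGCCATTGAGCAACTGGACTA
+read_2
!!""##$$%%&&''(())**++,,--..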
FASTQ files are mostly used to store short-read data from high-throughput sequencing
experiments. As a consequence, the sequence and quality scores are usually put into a single
line each, and indeed many tools assume that each record in a FASTQ file is exactly four
lines long, even though this isn't guaranteed.
SAM
SAM files are so complex that a complete description [PDF] takes 15 pages. So here's the short version.
The original purpose of SAM files is to store mapping information for sequences from
high-throughput sequencing. As a consequence, a SAM record needs to store more than just the
sequence and its quality, it also needs to store information about where and how a sequence
maps into the reference.
Unlike the previous formats, SAM is tab-based, and each record, consisting of 11 mandatory fields plus
optional fields, fills exactly one line. Here's an example (tabs replaced by fixed-width spacing):
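(An illustrative single record standing in for the original example; the 11 mandatory fields are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL.)

read_1  0  chr1  11213  60  28M  *  0  0  ACTGGCATTCACGGGCAATCTGACTACT  IIIIIHHGGFFEEDDCCBBAA@@??>>=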
For a description of the individual fields, refer to the documentation. The relevant bit
is this: SAM can express exactly the same information as FASTQ, plus, as mentioned, the
mapping information. However, SAM is also used to store read data without mapping
information.
In addition to sequence records, SAM files can also contain a header, which
stores information about the reference that the sequences were mapped to, and the tool used
to create the SAM file. Header information precedes the sequence records and consists of lines
starting with @.
SAM itself is almost never used as a storage format; instead, files are stored in BAM
format, which is a compact binary representation of SAM. It stores the same information, just
more efficiently, and in conjunction with a search index allows fast retrieval of
individual records from the middle of the file (= fast random access). BAM files are also much
more compact than compressed FASTQ or FASTA files.
The above implies a hierarchy in what the formats can store: FASTA ⊂ FASTQ
⊂ SAM.
In a typical high-throughput analysis workflow, you will encounter all three file
types:
FASTA to store the reference genome/transcriptome that the sequence fragments will be
mapped to.
FASTQ to store the sequence fragments before mapping.
SAM/BAM to store the sequence fragments after mapping.
FASTQ is used for long-read sequencing as well, which could have a single record being
thousands of 80-character lines long. Sometimes these are split by line breaks, sometimes
not. – Scott Gigante
Aug 17 '17 at 6:01
Sorry, should have clarified: I was just referring to the line "FASTQ files are (almost?)
exclusively used to store short-read data from high-throughput sequencing experiments."
Definitely not exclusively. – Scott Gigante
Aug 17 '17 at 13:22
@charlesdarwin I have no idea. The line with the plus sign is completely redundant. The
original developers of the FASTQ format probably intended it as a redundancy to simplify
error checking (= to see if the record was complete) but it fails at that. In hindsight it
shouldn't have been included. Unfortunately we're stuck with it for now. – Konrad
Rudolph
Feb 21 at 17:06
FASTA file format is a DNA sequence format for specifying or representing DNA
sequences and was first described by Pearson (Pearson,W.R. and Lipman,D.J. (1988)
Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85,
2444–2448)
FASTQ is another DNA sequence file format that extends the FASTA format with
the ability to store the sequence quality. The quality scores are often represented as ASCII
characters which correspond to a Phred score.
Both FASTA and FASTQ are common sequence representation formats and have emerged as key
data interchange formats for molecular biology and bioinformatics.
SAM is a format for representing sequence alignment information from a read
aligner. It represents sequence information with respect to a given reference sequence. The
information is stored in a series of tab-delimited ASCII columns. The full SAM format
specification is available at http://samtools.sourceforge.net/SAM1.pdf
SAM can also store unaligned sequence information (and it is increasingly used for this, see PacBio),
and in this regard it is equivalent to FASTQ. – Konrad Rudolph
Jun 2 '17 at 10:43
Incidentally, the first part of your question is something you could have looked up yourself
as the first hits on Google of "NAME format" point you to primers on Wikipedia, no less. In
future, please do that before asking a question.
FASTA (officially) just stores the name of a sequence and the sequence; unofficially,
people also add comment fields after the name of the sequence. FASTQ was invented to store
both sequence and associated quality values (e.g. from sequencing instruments). SAM was
invented to store alignments of (small) sequences (e.g. generated from sequencing), with
associated quality values and some further data, onto larger sequences, called reference
sequences, the latter being anything from a tiny virus sequence to ultra-large plant
sequences.
FASTA and FASTQ formats are both file formats that contain sequencing reads, while SAM files
are these reads aligned to a reference sequence. In other words, FASTA and FASTQ are the "raw
data" of sequencing, while SAM is the product of aligning the sequencing reads to a refseq.
A FASTA file contains a read name followed by the sequence. An example of one of these
reads for RNASeq might be:
>Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
The FASTQ version of this read will have two more lines, one with + as a placeholder and then
a line of quality scores for the base calls. The qualities are given as characters, with '!'
being the lowest and '~' being the highest, in increasing ASCII value. It would look
something like this:
@Flow cell number: lane number: chip coordinates etc.
ATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTAATTGGCTA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
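The mapping from quality character to numeric score is a fixed ASCII offset; a tiny Python sketch for the Sanger/Illumina 1.8+ encoding (offset 33):

# Phred score = ASCII code - 33 in the Sanger / Illumina 1.8+ encoding,
# so '!' (ASCII 33) is quality 0 and '~' (ASCII 126) is quality 93.
def phred(ch, offset=33):
    return ord(ch) - offset

print(phred("!"), phred("I"), phred("~"))   # 0 40 93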
A SAM file has many fields for each alignment; the header lines begin with the @ character. The
alignment contains 11 mandatory fields and various optional ones. You can find the spec file
here: https://samtools.github.io/hts-specs/SAMv1.pdf.
Often you'll see BAM files, which are just compressed binary versions of SAM files. You can
view these alignment files using various tools, such as SAMtools, IGV or the UCSC Genome
Browser.
As to the benefits, FASTA/FASTQ vs. SAM/BAM is comparing apples and oranges. I do a lot of
RNASeq work, so generally we take the FASTQ files and align them to a refseq using an aligner
such as STAR, which outputs SAM/BAM files. There's a lot you can do with just these alignment
files, such as looking at expression, but usually I'll use a tool such as RSEM to "count" the reads
from various genes to create an expression matrix, with samples as columns and genes as rows.
Whether you get FASTQ or FASTA files just depends on your sequencing platform. I've never
heard of anybody really using the quality scores.
Careful, the FASTQ format description is wrong: a FASTQ record can span more than four lines;
also, + isn't a placeholder, it's a separator between the sequence and the
quality score, with an optional repetition of the record ID following it. Finally, the
quality score string has to be the same length as the sequence. – Konrad Rudolph
Jun 2 '17 at 10:47
FAST (FAST Analysis of Sequences Toolbox), built on BioPerl, provides open source command-line tools to filter, transform,
annotate and analyze biological sequence data. Modeled after the GNU (GNU's Not Unix) Textutils such as grep, cut, and
tr, FAST tools such as fasgrep, fascut, and fastr make it easy to rapidly prototype expressive bioinformatic workflows
in a compact and generic command vocabulary. Compact combinatorial encoding of data workflows with FAST commands can facilitate
better documentation and reproducibility of bioinformatic protocols, supporting better transparency in big biological data
science. Interface self-consistency and conformity with conventions of GNU, Matlab, Perl, BioPerl, R and GenBank help
make FAST easy to learn. FAST automates numerical, text-based, sequence-based and taxonomic searching, sorting, selection
and transformation of sequence records and alignment sites based on indices, ranges, tags and feature annotations, and
analytics for composition and codon usage. Automated content- and feature-based extraction of sites and support for molecular
population genetic statistics makes FAST useful for molecular evolutionary analysis. FAST is portable, easy to install,
and secure, with stable releases posted to CPAN and development on Github.
The default data exchange format in FAST is Multi-FastA (specifically, a restriction of BioPerl FastA format). Sanger
and Illumina 1.8+ FASTQ formatted files are also supported. The command-line basis of FAST makes it easier for non-programmer
biologists to interactively investigate and control biological data at the speed of thought.
Yes, I might have confounded two different issues here. Among the reasons that I don't use javascript is that I don't
know whether it would hold up in realistic data scenarios, but that is not relevant to charms, since those are not supposed to
be used as a large-scale data analysis platform.
Long term, I see charms as a platform to demonstrate, teach and show concepts through the web. There will be "official"
charms available by default, but at the same time any user will be able to create their own library, and others will have
the option of loading these for their own use with one click.
Within their own library users may implement everything in javascript (that is the simplest approach), or they can choose
to connect to an RPC or HTTP service (on a different machine) via a simple API. When choosing the latter, the Biostar server
will act as a proxy (since javascript cannot connect to different domains) and submits the data to the user's server; once
the result is back it displays it to the user. The returned data may be a rich HTML document, which gets injected into the user's
page.
The entire concept is not fully worked out yet - what I would like is to have a methodology to let someone try a tool,
evaluate and demonstrate the use and operation of a software or library without asking the user to install it.
– Istvan Albert
Hmm.. When will a biostar user want to call charms? What "concepts" are you thinking to "demonstrate, teach and show"
through the web? If it is just for demonstration, it would be easier to show that on my own server (e.g. via ipython) as
I need it anyway to talk to the biostar server? And if the purpose is for teaching, we are not really talking about huge
data sets. Javascript would be a better fit than adding an extra layer interfacing to other languages over the internet?
I guess I need to see concrete use cases to understand what charms is for. Anyway, good that you are thinking about this.
– lh3
The rationale comes from teaching more than one bioinformatics course a year: it always strikes me how unexpectedly
difficult it is to perform some of the simplest tasks unless we first spend several sessions making sure everyone's system
is set up just right, with the right tools and libraries installed, etc. Half of the course is about unix, and frankly some people
don't get unix at all - and I think that should be fine as well. One should be able to analyze data without understanding
unix.
Example: Say I knew the names of two genes and I want to find out which one is longer. I don't know how to do this in
a simple way. I can only do it by making use of a few decades of unix and other programming, where I chain together a series
of pipe-delimited commands that in turn rely on a series of tools and libraries that I have preinstalled (some after
quite the battle) on my system. Or perhaps one can get there by clicking about on webpages.
But what if there were a charm where getting sequence XYZ works in the browser by writing fetch(XYZ) and
I could get the length of it by writing length(fetch(XYZ)) .
And then if I want to align the two I could just say align(s1, s2) and it performs the alignment right there.
It is true that writing it in straight javascript would be the best solution - it is more a matter of the time and resources
to put towards that - hence I want to add the option of reusing existing libraries in a simple way.
– Istvan Albert
I see. That makes sense. What complex functionalities do you need for teaching that are not easy to reimplement in javascript?
– lh3
In the end it is not just about the implementation but the usability of it. For example,
http://keithwhor.github.io/NtSeq/ provides an aligner, but
in its current form it takes a lot of work to understand the results.
The library produces a series of objects that have all kinds of attributes and methods. What it does not seem to provide
is a simple way to produce biologically meaningful, interpretable results - like, say, a blast output, or a tabular blast output,
or a SAM output.
I am not sure NtSeq does standard alignment. It seems to be using pattern matching. I have just written a real DP-based
algorithm: https://github.com/lh3/bioseq-js The APIs are
C-like (actually most of my javascript programs look like C because I don't know much about javascript). You don't need
deep knowledge in javascript to understand what it is doing.
– lh3
whoa, this is so cool, it is exactly how I think about it. In fact this will be the way I will teach alignments from
now on.
I have added your library to the charms - with an example usage (you may need to force refresh the page to get the newest
javascript).
As an example I have created both a simplified aligner and formatter on top of the library that may be useful for some
people but not others. The goal is to make it easy for others to customize and share the way that they think about making
use of these libraries.
I will be working on a feature where a user can upload a javascript file into the charms, and other users will be able
to simply click it and specify that, when they visit the charms, they want that user's file to be activated as well. Now
this new library may override the original formatting, or it may choose to add a new formatting option with a different
name.
– Istvan Albert
Maybe I misunderstand Charms, but I thought they are meant to demonstrate installable packages, not replace them. I
see the short explanations "Charms are small programs written in a javascript based language" and "It can also perform
calls to external programs" and took this to mean that Charms are Javascript wrappers of bioinformatics packages written
in a variety of languages.
BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA
sequence data. BBTools can handle common sequencing file formats such as fastq, fasta, sam, scarf,
fasta+qual, compressed or raw, with autodetection of quality encoding and interleaving. It is written
in Java and works on any platform supporting Java, including Linux, MacOS, and Microsoft Windows;
there are no dependencies other than Java (version 7 or higher). Program descriptions and
options are shown when running the shell scripts with no parameters.
SeqKit is a cross-platform, ultrafast, and practical FASTA/Q manipulation tool that helps
researchers complete a wide range of FASTA/Q file processing tasks. The toolkit supports plain or gzip-compressed
input and output from either standard streams or files; therefore, it can easily be used in
command-line pipes.
Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It
seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
This package provides a number of small and efficient programs to perform common tasks with high
throughput sequencing data in the FASTQ format. All of the programs work with typical FASTQ files as
well as gzipped FASTQ files.
Bioawk is an extension to
Brian
Kernighan's awk
, adding the support of several common biological data formats, including optionally
gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few
built-in functions and a command line option to use TAB as the input/output delimiter. When the new
functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.
Seqmagick is a kickass little utility built in the spirit of
imagemagick
to
expose the file format conversion in Biopython in a convenient way. Instead of having a big mess of
scripts, there is one that takes arguments.
The Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy
and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in
such a way that the data stream can be passed through several different Biopieces, each performing one
specific task.
The Bio::ToolBox libraries provide an abstraction layer over a variety of different specialized
BioPerl-style modules. For example, there is a special emphasis on the collection of data values for
defined genomic coordinate regions, regardless of whether the values come from a GFF
database, Bam file, BigWig file, etc.
Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter
trimming, etc. Primarily written to support an Illumina based pipeline - but should work with any
FASTQs.
The FAST Analysis of Sequences Toolbox (FAST) is a set of Unix tools (for example fasgrep, fascut,
fashead and fastr) for sequence bioinformatics modeled after the Unix textutils (such as grep, cut,
head, tr, etc). FAST workflows are designed for "inline" (serial) processing of flatfile biological
sequence record databases per-sequence, rather than per-line, through Unix command pipelines. The
default data exchange format is multifasta (specifically, a restriction of BioPerl FastA format). FAST
tools expose the power of Perl and BioPerl for sequence analysis to non-programmers in an easy-to-learn
command-line paradigm.
Question: How To Parse Fasta Files In Perl
(asked 7.9 years ago by nikulina, who wrote:)
Dear colleagues! I have a file with lots of sequences in FASTA format. I want to write a perl script to
analyze each sequence (to count the length of a certain fragment). So, how can I manage to treat each
sequence as a variable? Should I use an array to read my file?
So, here is my script. It might be not very nice, but it works. I would like to modify it in order to
work with FASTA data.
$string_filename = 'file.txt';
open(FILE, $string_filename);
@array = FILE;
close FILE;

foreach $string (@array) {
    $R = length $string;
    if ( $string =~ /ggc/ ) {
        $M = $';
        $W = length $M;
        if ( $string =~ /atg/ ) {
            $K = $`;
            $Z = length $K;
            $x = $W + $Z - $R;
            print " \n the distance is the following: \n\n ";
            print $x;
        } else {
            print "\n I couldn\'t find the start codone.\n\n";
        }
    } else {
        print "\n I couldn\'t find the binding site.\n\n";
    }
}
exit;
Could you also show us an example sequence for which this code works? If the code is supposed to
do what I think it is supposed to do, I think there may be quite a few problems with it.
– Neilfws
Are you really sure that none of the 10000 topics about how to parse file XXX matched your needs?
– Fabian Bull
Neilfws (Sydney, Australia) wrote:
First, there is no need to reinvent the wheel. As Stefano wrote, Bioperl will parse fasta sequences for
you and do a whole lot more besides. Once installed, it is as simple as:
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "file.fa", '-format' => 'Fasta');
while (my $seq = $seqio->next_seq) {
    my $string = $seq->seq;
    # do stuff with $string
}
Second, there are some issues with your code. It should be "@array = <FILE>" - although as Stefano
points out, you should not read the whole file into an array.
So far as I can tell, you are trying to find sub-sequences which begin "atg" and end with "ggc". Some
other issues with your code:
It seems to assume that there is only one each of "atg" and "ggc", because you use if() to match the
regular expressions, not while().
It returns negative values for length of the sub-sequence. Is this what you want? It is unclear
whether you are looking for "atg" which lie upstream of "ggc" or whether they can be at any position in
the sequence.
It looks as though you are looking for start codons. There may be alternatives to atg: gtg or ttg.
Your regular expressions are case-sensitive and would miss, for example, ATG.
Assuming that you are trying to find the region atg -> ggc, you could try something like:

while ($string =~ /atg(.*)ggc/gi) {
    # do something with match
    # e.g. match start = $-[0]+1, match end = $+[0]
}
That example uses the special Perl variables @- and @+ to get match positions, but Bioperl will also
provide you with plenty of methods for analysing sub-sequences.
– Neilfws
Thank you for your attention to my question. In fact I would like to find the distance between
binding sites for RNAP II and the start of transcription in certain human genes. So, I used some sample
motifs ('ggc' and 'atg') in my perl script, just to make the task easier and to test how it works in
this simple variant. The binding site is situated before the start codon; that's why I didn't take
into consideration those variants where the first motif is situated after the second. Once more,
thank you for your help.
If you are planning to read and manipulate a lot of files with fasta sequences, do it properly. Use
Bioperl. It makes life easier (see
an example
here
). It takes some time to set it up and learn the "philosophy" behind it, but then you can do much more: read
from NCBI/EMBL, read/write to different formats... all with the same interface. Already debugged for you.
Also, if you use big files, don't do this:
open(FILE, $string_filename); @array = FILE;
It will load the whole file in memory. Nowadays fasta files might be Huuuuuge.
Thank you! Indeed I recognise that my variant is not very convenient and consumes lots of memory.
I'll try to examine BioPerl and use it for further tasks.
Can I please add a question to this? What if you want to remove that string/sequence that you are
looking for? I have a control phage in my illumina data that I want to remove, but am having a hard time
finding out how to do this. Thanks so much!
Frederick Sanger OM CH CBE FAA (/ˈsæŋər/; 13 August 1918 – 19 November 2013) was a British biochemist who twice won the Nobel
Prize in Chemistry, one of only two people to have done so in the same category (the other is John Bardeen in physics),[4] the
fourth person overall with two Nobel Prizes, and the third person overall with two Nobel Prizes in the sciences. In 1958, he
was awarded a Nobel Prize in Chemistry "for his work on the structure of proteins, especially that of insulin". In 1980,
Walter Gilbert and Sanger shared half of the chemistry prize "for their contributions concerning the determination of base
sequences in nucleic acids". The other half was awarded to Paul Berg "for his fundamental studies of the biochemistry of
nucleic acids, with particular regard to recombinant DNA".[5]
Early life and education
Frederick Sanger was born on 13 August 1918 in Rendcomb, a small village in Gloucestershire, England, the second son of
Frederick Sanger, a general practitioner, and his wife, Cicely Sanger (née Crewdson).[6] He was one of three children. His
brother, Theodore, was only a year older, while his sister May (Mary) was five years younger.[7] His father had worked as an
Anglican medical missionary in China but returned to England because of ill health. He was 40 in 1916 when he married Cicely
who was four years younger. Sanger's father converted to Quakerism soon after his two sons were born and brought up the
children as Quakers. Sanger's mother was the daughter of a wealthy cotton manufacturer and had a Quaker background, but was
not a Quaker.[7]
When Sanger was around five years old the family moved to the small village of Tanworth-in-Arden in Warwickshire. The family
was reasonably wealthy and employed a governess to teach the children. In 1927, at the age of nine, he was sent to the Downs
School, a residential preparatory school run by Quakers near Malvern. His brother Theo was a year ahead of him at the same
school. In 1932, at the age of 14, he was sent to the recently established Bryanston School in Dorset. This used the Dalton
system and had a more liberal regime which Sanger much preferred. At the school he liked his teachers and particularly enjoyed
scientific subjects.[7] Able to complete his School Certificate a year early, for which he was awarded seven credits, Sanger
was able to spend most of his last year of school experimenting in the laboratory alongside his chemistry master, Geoffrey
Ordish, who had originally studied at Cambridge University and been a researcher in the Cavendish Laboratory. Working with
Ordish made a refreshing change from sitting and studying books and awakened Sanger's desire to pursue a scientific career.[8]
In 1936 Sanger went to St John's College, Cambridge to study natural sciences. His father had attended the same college.
For Part I of his Tripos he took courses in physics, chemistry, biochemistry and mathematics but struggled with physics and
mathematics. Many of the other students had studied more mathematics at school. In his second year he replaced physics with
physiology. He took three years to obtain his Part I. For his Part II he studied biochemistry and obtained a 1st Class Honours.
It was a relatively new department founded by Gowland Hopkins with enthusiastic lecturers who included Malcolm Dixon, Joseph
Needham and Ernest Baldwin.[7]
Both his parents died from cancer during his first two years at Cambridge. His father was 60 and his mother was 58. As an
undergraduate Sanger's beliefs were strongly influenced by his Quaker upbringing. He was a pacifist and a member of the Peace
Pledge Union. It was through his involvement with the Cambridge Scientists' Anti-War Group that he met his future wife, Joan
Howe, who was studying economics at Newnham College. They courted while he was studying for his Part II exams and married
after he had graduated in December 1940. Under the Military Training Act 1939 he was provisionally registered as a
conscientious objector, and again under the National Service (Armed Forces) Act 1939, before being granted unconditional
exemption from military service by a tribunal. In the meantime he undertook training in social relief work at the Quaker
centre, Spicelands, Devon and served briefly as a hospital orderly.[7]
Sanger began studying for a PhD in October 1940 under N.W. "Bill" Pirie. His project was to investigate whether edible protein
could be obtained from grass. After little more than a month Pirie left the department and Albert Neuberger became his
adviser.[7] Sanger changed his research project to study the metabolism of lysine[9] and a more practical problem concerning
the nitrogen of potatoes.[10] His thesis had the title, "The metabolism of the amino acid lysine in the animal body". He was
examined by Charles Harington and Albert Charles Chibnall and awarded his doctorate in 1943.[7]
Sequencing insulin
Neuberger moved to the National Institute for Medical Research in London, but Sanger stayed in Cambridge and in 1943 joined
the group of Charles Chibnall, a protein chemist who had recently taken up the chair in the Department of Biochemistry.
Chibnall had already done some work on the amino acid composition of bovine insulin[11] and suggested that Sanger look at the
amino groups in the protein. Insulin could be purchased from the pharmacy chain Boots and was one of the very few proteins
that were available in a pure form. Up to this time Sanger had been funding himself. In Chibnall's group he was initially
supported by the Medical Research Council and then from 1944 until 1951 by a Beit Memorial Fellowship for Medical Research.[6]
Sanger's first triumph was to determine the complete amino acid sequence of the two polypeptide chains of bovine insulin, A
and B, in 1952 and 1951, respectively.[12][13] Prior to this it was widely assumed that proteins were somewhat amorphous. In
determining these sequences, Sanger proved that proteins have a defined chemical composition.[7]
To get to this point, Sanger refined a partition chromatography method first developed by Richard Laurence Millington Synge
and Archer John Porter Martin to determine the composition of amino acids in wool. Sanger used a chemical reagent
1-fluoro-2,4-dinitrobenzene (now, also known as Sanger's reagent, fluorodinitrobenzene, FDNB or DNFB), sourced from poisonous
gas research by Bernhard Charles Saunders at the Chemistry Department at Cambridge University.
Sanger's reagent proved effective at labelling the N-terminal amino group at one end of the polypeptide chain.[14] He then
partially hydrolysed the insulin into short peptides, either with hydrochloric acid or using an enzyme such as trypsin. The
mixture of peptides was fractionated in two dimensions on a sheet of filter paper, first by electrophoresis in one dimension
and then, perpendicular to that, by chromatography in the other. The different peptide fragments of insulin, detected with
ninhydrin, moved to different positions on the paper, creating a distinct pattern that Sanger called "fingerprints". The
peptide from the N-terminus could be recognised by the yellow colour imparted by the FDNB label and the identity of the
labelled amino acid at the end of the peptide determined by complete acid hydrolysis and discovering which dinitrophenyl-amino
acid was there.[7]
By repeating this type of procedure Sanger was able to determine the sequences of the many peptides generated using
different methods for the initial partial hydrolysis. These could then be assembled into the longer sequences to deduce the
complete structure of insulin. Finally, because the A and B chains are physiologically inactive without the three linking
disulfide bonds (two interchain, one intrachain on A), Sanger and coworkers determined their assignments in 1955.[15][16]
Sanger's principal conclusion was that the two polypeptide chains of the protein insulin had precise amino acid sequences
and, by extension, that every protein had a unique sequence.
It was this achievement that earned him his first Nobel prize in Chemistry in 1958.[17] This discovery was crucial for the
later sequence hypothesis of Crick for developing ideas of how DNA codes for proteins.[18]
Sequencing RNA
From 1951 Sanger was a member of the external staff of the Medical Research Council[6] and when they opened the Laboratory
of Molecular Biology in 1962, he moved from his laboratories in the Biochemistry Department of the university to the top floor
of the new building. He became head of the Protein Chemistry division.[7]
Prior to his move, Sanger began exploring the possibility of sequencing RNA molecules and began developing methods for
separating ribonucleotide fragments generated with specific nucleases. This work he did while trying to refine the sequencing
techniques he had developed during his work on insulin.[18]
The key challenge in the work was finding a pure piece of RNA to sequence. In the course of the work he discovered in 1964,
with Kjeld Marcker, the formylmethionine tRNA which initiates protein synthesis in bacteria.[19]
He was beaten in the race to
be the first to sequence a tRNA molecule by a group led by Robert Holley from Cornell University, who published the sequence
of the 77 ribonucleotides of alanine tRNA from Saccharomyces cerevisiae in 1965.[20] By 1967 Sanger's group had determined the
nucleotide sequence of the 5S ribosomal RNA from Escherichia coli, a small RNA of 120 nucleotides.[21]
Sequencing DNA
He then turned to sequencing DNA, which would require an entirely different approach. He looked at different ways of using
DNA polymerase I from E. coli to copy single stranded DNA.[22] In 1975, together with Alan Coulson, he published a sequencing
procedure using DNA polymerase with radiolabelled nucleotides that he called the "Plus and Minus" technique.[23][24] This
involved two closely related methods that generated short oligonucleotides with defined 3' termini. These could be
fractionated by electrophoresis on a polyacrylamide gel and visualised using autoradiography. The procedure could sequence up
to 80 nucleotides in one go and was a big improvement on what had gone before, but was still very laborious. Nevertheless, his
group were able to sequence most of the 5,386 nucleotides of the single-stranded bacteriophage φX174.[25] This was the first
fully sequenced DNA-based genome. To their surprise they discovered that the coding regions of some of the genes overlapped
with one another.[3]
In 1977 Sanger and colleagues introduced the "dideoxy" chain-termination method for sequencing DNA molecules, also known as
the "Sanger method".[24][26] This was a major breakthrough and allowed long stretches of DNA to be rapidly and accurately
sequenced.
It earned him his second Nobel prize in Chemistry in 1980, which he shared with Walter Gilbert and Paul Berg.[5]
The new method was used by Sanger and colleagues to sequence human mitochondrial DNA (16,569 base pairs)[27] and bacteriophage
λ (48,502 base pairs).[28] The dideoxy method was eventually used to sequence the entire human genome.[29]
Postgraduate students
During the course of his career Sanger supervised more than ten PhD students, two of whom went on to also win Nobel Prizes.
His first graduate student was Rodney Porter who joined the research group in 1947.[3] Porter later shared the 1972 Nobel
Prize in Physiology or Medicine with Gerald Edelman for his work on the chemical structure of antibodies.[30] Elizabeth
Blackburn studied for a PhD in Sanger's laboratory between 1971 and 1974.[3][31] She shared the 2009 Nobel Prize in Physiology
or Medicine with Carol W. Greider and Jack W. Szostak for her work on telomeres and the action of telomerase.[32]
Awards and honours
As of 2015, Sanger is the only person to have been awarded the Nobel Prize in Chemistry twice, and one of only four
two-time Nobel laureates: the other three were Marie Curie (Physics, 1903 and Chemistry, 1911), Linus Pauling (Chemistry, 1954
and Peace, 1962) and John Bardeen (Physics, 1956 and 1972).[4]
Elected Fellow of the Royal Society (FRS) – 1954[3]
Commander of the Order of the British Empire – 1963
Order of the Companions of Honour – 1981
Order of Merit – 1986
Corresponding Fellow of the Australian Academy of Science – 1982
William Bate Hardy Prize – 1976
Nobel Prize in Chemistry – 1958, 1980
Corday–Morgan Medal – 1951
Royal Medal – 1969
Gairdner Foundation International Award – 1971
Copley Medal – 1977
G.W. Wheland Award – 1978
Louisa Gross Horwitz Prize of Columbia University – 1979
Albert Lasker Award for Basic Medical Research – 1979
Association of Biomolecular Resource Facilities Award – 1994
Citation for Chemical Breakthrough Award from the Division of History of Chemistry of the American Chemical Society – 2016[33][34][35]
The Wellcome Trust Sanger Institute (formerly the Sanger Centre) is named in his honour.
Personal life
Sanger married Margaret Joan Howe in 1940. She died in 2012. They had three children - Robin, born in 1943, Peter born in
1946 and Sally Joan born in 1960.[6] He said that his wife had "contributed more to his work than anyone else by providing a
peaceful and happy home."[36]
Later life
Sanger retired in 1983, aged 65, to his home, "Far Leys", in Swaffham Bulbeck outside Cambridge.[3]
In 1992, the Wellcome Trust and the Medical Research Council founded the Sanger Centre (now the Sanger Institute), named
after him.[37] The Institute is located on the Wellcome Trust Genome Campus near Hinxton, only a few miles from Sanger's home.
He agreed to having the Centre named after him when asked by John Sulston, the founding director, but warned, "It had better
be good."[37] It was opened by Sanger in person on 4 October 1993, with a staff of fewer than 50 people, and went on to take a
leading role in the sequencing of the human genome.[37] The Institute now has over 900 people and is one of the world's
largest genomic research centres.
Sanger said he found no evidence for a God so he became an agnostic.[38] In an interview published in the Times newspaper
in 2000 Sanger is quoted as saying: "My father was a committed Quaker and I was brought up as a Quaker, and for them truth is
very important. I drifted away from those beliefs – one is obviously looking for truth, but one needs some evidence for it.
Even if I wanted to believe in God I would find it very difficult. I would need to see proof."[39]
He declined the offer of a knighthood, as he did not wish to be addressed as "Sir". He is quoted as saying, "A knighthood
makes you different, doesn't it, and I don't want to be different." In 1986, he accepted the award of an Order of Merit, which
can have only 24 living members.[36][38][39]
In 2007 the British Biochemical Society was given a grant by the Wellcome Trust to catalogue and preserve the 35 laboratory
notebooks in which Sanger recorded his research from 1944 to 1983. In reporting this matter, Science noted that Sanger, "the
most self-effacing person you could hope to meet", was spending his time gardening at his Cambridgeshire home.[40]
Sanger died in his sleep at Addenbrooke's Hospital in Cambridge on 19 November 2013.[36][41] As noted in his obituary, he
had described himself as "just a chap who messed about in a lab",[42] and "academically not brilliant".[43]
Storage and transmission of the data produced by modern DNA sequencing instruments has become a major
concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present
several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared
against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and
other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ
compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz:
https://sourceforge.net/projects/fastqz/, fqzcomp:
https://sourceforge.net/projects/fqzcomp/, and samcomp:
https://sourceforge.net/projects/samcomp/.
Baxevanis AD, Ouellette FBF (2004) Bioinformatics: a practical guide to the analysis of genes and proteins, 2nd edn. doi: 10.1007/s10439-006-9105-9
Kaipa KK, Lee K, Ahn T, Narayanan R (2010) System for random access DNA sequence compression. IEEE international conference on bioinformatics and biomedicine workshops, pp 853–854. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5703942
16. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343
17. Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536
18. Huffman DA (1952) A method for the construction of minimum-redundancy codes. Proc IRE 40(9):1098–1101
19. Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucl Acids Res 40(22):1–9. doi: 10.1093/nar/gks754
20. Pinho AJ, Pratas D, Garcia SP (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucl Acids Res 40(4):1–8. doi: 10.1093/nar/gkr1124
Chen X, Kwong S, Li M (1999) A compression algorithm for DNA sequences and its applications in genome comparison. Workshop on genome informatics 10:51–61. http://www.ncbi.nlm.nih.gov/pubmed/11072342
Giancarlo R, Rombo SE, Utro F (2014) Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief Bioinform 15(3):390–406. doi: 10.1093/bib/bbt088
Bhola V, Bopardikar AS, Narayanan R, Lee K, Ahn T (2011) No-reference compression of genomic data stored in FASTQ format. Proceedings of the 2011 IEEE international conference on bioinformatics and biomedicine (BIBM 2011), pp 147–150. doi: 10.1109/BIBM.2011.110
Mohammed MH, Dutta A, Bose T, Chadaram S, Mande SS (2012) DELIMINATE - a fast and efficient method for loss-less compression of genomic sequences. Bioinformatics 28(19):2527–2529. doi: 10.1093/bioinformatics/bts467. http://www.ncbi.nlm.nih.gov/pubmed/22833526
31. Grassi E, Gregorio FD, Molineris I (2012) KungFQ: a simple and powerful approach to compress fastq files. IEEE/ACM Trans Comput Biol Bioinform 9(6):1837–1842. doi: 10.1109/TCBB.2012.123
Dutta A, Haque MM, Bose T, Reddy C, Mande SS (2015) FQC: a novel approach for efficient compression, archival, and dissemination of fastq datasets. J Bioinform Comput Biol 13(3):1541003
WinRAR archiver, a powerful tool to process RAR and ZIP files. http://www.rarlab.com/ [Online]. Accessed 03 Dec 2014
47. Zhang Y, Patel K, Endrawis T, Bowers A, Sun Y (2015) A FASTQ compressor based on integer-mapped k-mer indexing for biologists. Gene 579(1):75–81. doi: 10.1016/j.gene.2015.12.053
48. Zhan X, Yao D (2014) A novel method to compress high-throughput DNA sequence read archive. In: Software intelligence technologies and applications & international conference on frontiers of internet of things 2014, pp 58–61. doi: 10.1049/cp.2014.1536
Guo G, Qiu S, Ye Z, Wang B, Fang L, Lu M, See S, Mao R (2013) GPU-accelerated adaptive compression framework for genomics data. 2013 IEEE international conference on big data, pp 181–186
Awan F, Mukherjee A (2002) Lossless compression handbook, chap. Text compression, pp 227–245. Communications, networking and multimedia. Elsevier Science, London
55. Carus A, Mesut A (2010) Fast text compression using multiple static dictionaries. Inf Technol J 9:1013–1021
56. Crochemore M, Lecroq T (2012) The computer science and engineering handbook, chap. Pattern matching and text compression algorithms, pp 3–77. CRC Press, Boca Raton
57. Burrows M, Wheeler DA (1994) A block-sorting lossless data compression algorithm. Tech. rep., Digital Equipment Corporation, California
Tembe W, Lowey J, Suh E (2010) G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26(17):2192–2194. doi: 10.1093/bioinformatics/btq346
Salomon D, Bryant D, Motta G (2010) Handbook of data compression. Springer, London
64. Batu T, Ergun F, Sahinalp C (2006) Oblivious string embeddings and edit distance approximations. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms, pp 792–801. Society for Industrial and Applied Mathematics