Compression of FASTA/FASTQ files

Softpanorama: May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix


#  Sample  Compressor  Parameters  Orig size  Compressed  Compress time  Compress speed  Decompress time  Relative  Relative  Qratio  Compressed/  Compression
                                   (GB)       size (GB)   (min:sec)      (MB/sec)        (min:sec)        speed     size              original     ratio
1  FASTQ   gzip        -6          7.43       2.08        18:03          6.86            1:14             1.00      1.00      1.00    0.28         3.58
2  FASTQ   pigz        (none)      7.43       2.08        1:16           97.78           0:51             14.39     1.00      14.37   0.28         3.57
3  FASTQ   pigz        -9          7.43       2.04        2:32           48.89           0:51             7.20      0.98      7.47    0.27         3.64
4  FASTQ   bzip2       (none)      7.43       1.53        20:42          5.98            7:21             0.88      0.74      1.63    0.21         4.86
5  FASTQ   pbzip2      (none)      7.43       1.60        1:51           66.95           0:56             9.86      0.77      16.63   0.22         4.64
6  FASTQ   xz          -9          7.43       1.34        400:05         0.31            3:19             0.05      0.64      0.11    0.18         5.56
7  FASTQ   xz          (none)      7.43       1.53        206:01         0.60            3:19             0.09      0.74      0.16    0.21         4.86

Relative speed and relative size are measured against gzip -6. Qratio is relative speed divided by the square of relative size. Compressed/original is the fraction of the original size occupied by the compressed archive.
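
The derived columns of the benchmark can be recomputed from the raw measurements. The following is a minimal Python sketch, assuming 1 GB = 1000 MB (which reproduces the MB/sec column for gzip) and taking relative speed as the ratio of compression throughputs. The class and function names are illustrative and do not come from the original benchmark. Recomputed values can differ from the table in the last digits, presumably because the table was built from unrounded sizes.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    orig_gb: float        # original size, GB
    comp_gb: float        # compressed size, GB
    comp_seconds: float   # wall-clock compression time, seconds


def speed_mb_per_sec(run: Run) -> float:
    # Assumption: 1 GB = 1000 MB.
    return run.orig_gb * 1000 / run.comp_seconds


def compression_ratio(run: Run) -> float:
    # Original size divided by compressed size.
    return run.orig_gb / run.comp_gb


def qratio(run: Run, baseline: Run) -> float:
    # Relative speed divided by the square of relative size,
    # both measured against the baseline run.
    rel_speed = speed_mb_per_sec(run) / speed_mb_per_sec(baseline)
    rel_size = run.comp_gb / baseline.comp_gb
    return rel_speed / rel_size ** 2


# Raw values taken from rows 1 and 5 of the table.
gzip6 = Run("gzip -6", 7.43, 2.08, 18 * 60 + 3)
pbzip2 = Run("pbzip2", 7.43, 1.60, 1 * 60 + 51)

for run in (gzip6, pbzip2):
    print(f"{run.name}: {speed_mb_per_sec(run):.2f} MB/sec, "
          f"ratio {compression_ratio(run):.2f}, "
          f"Qratio {qratio(run, gzip6):.2f}")
```

Squaring the relative size makes Qratio weight archive size more heavily than speed, so a slightly slower compressor that produces a noticeably smaller archive can still score higher.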


	#!/usr/bin/perl -w

	#######################################################
	# Author : Aurelie Kapusta
	# email : [email protected]
	# Purpose : To extract sequences from a fasta or fastq file with filters on headers (matching IDs, containing a word etc - see usage)
	#######################################################
	use strict;
	use warnings;
	use Carp;
	use Getopt::Long;
	use Bio::SeqIO;

	my $version = "2.2";
	my $scriptname = "FetchSeqs.pl";

	# UPDATES
	my $changelog = "
	# - v1.0 = 19 Mar 2015
	# Basically merging 6 different scripts in one... It was a mess
	# - v2.0 = 17 Jul 2015
	# Merging can be messy too! Introduction of bugs. The -inv option didn't work.
	# Also, allow the -m IDfile to be a fasta file
	# Usage update
	# - v2.1 = 12 Apr 2016
	# fastq option
	# - v2.1 = 13 Apr 2016
	# grep option; faster indeed when very large fastq file, but still super slow

	# TO DO: a bio db and not a SeqIO
	\n " ;

	my $usage = " \n Usage [ $version ]:
	perl FetchSeqs.pl -in <fa> -m <X> [-file] [-out <X>] [-fq] [-grep] [-desc] [-both] [-regex] [-inv] [-noc] [-chlog] [-v] [-h]

	This script allows you to extract fasta sequences from a file:
	- matching ID (from command line or using another fasta file or a file containing a list of IDs using -file)
	- containing a word in the ID or in the description (-desc), or in both (-both)
	- the complement of that (meaning, extract when it does not match), option -inv (inverse match)

	Note that for a given fasta header:
	>ID description
	The ID corresponds to anything before the first space, description is anything that's after (even if spaces)

	Examples:
	To extract all sequences containing ERV or LTR in IDs only:
	perl fasta_FetchSeqs.pl -in fastafile.fa -m ERV,LTR -regex -v
	To extract all sequences that don't have the word \" virus \" in the description or in the ID
	perl fasta_FetchSeqs.pl -in fastafile.fa -m virus -both -inv -v
	To extract all sequences that have their ID listed in a file
	perl fasta_FetchSeqs.pl -in fastafile.fa -m list.txt -v
	To extract all sequences that have their full header listed in a file
	perl fasta_FetchSeqs.pl -in fastafile.fa -m list.txt -both -v

	MANDATORY:
	-in => (STRING) input fasta file
	-m => (STRING) provide (i) a word or a list of words, or (ii) a path to a file
	(i) in command line: you can set several words using , (comma) as a separator.
	For example: -m ERV,LTR
	Note that there can't be spaces in the command line, or they have to be escaped with \
	(ii) a file: it can be a fasta/fastq file, or simply a file with a list of IDs (one column)
	If the \" > \" or @ is kept with the ID, then all lines need to have it (unless -grep)
	Headers can contain:
	- fasta/fastq IDs only (no spaces) [default search is done against IDs only]
	- full fasta headers (use -both to match both, otherwise only ID is looked at)
	- descriptions only (spaces allowed) if -desc is set
	Note that you need to use the -file flag

	OPTIONAL:
	-file => (BOOL) choose this if -m corresponds to a file
	-out => (STRING) to set the name of the output file (default = input.extract.fa)
	-fq => (BOOL) if input file is in fastq format; output will also be fastq
	-grep => (BOOL) Choose this with -fq to use grep instead of using BioSeq
	But this is even slower on large files.
	Only relevant if -fq is set as well, because the sequences
	will be extracted using grep -A 3 for each word set with -m
	(extracting line that matches + 3 lines after the match)
	Also, this makes irrelevant the use of these options:
	-desc, -both, -regex, -inv, -noc
	-desc => (BOOL) to look for match in the description and not the header
	-both => (BOOL) to look into both headers and description
	-regex => (BOOL) to look for containing the word and not an exact match
	Special characters in names or descriptions will be an issue;
	the only ones that are taken care of are: \| / . [ ]
	-inv => (BOOL) to extract what DOES NOT match
	-noc => (BOOL) to ignore case in matching
	-chlog => (BOOL) print updates
	-v => (BOOL) verbose mode, make the script talk to you
	-v => (BOOL) print version if only option
	-h\|help => (BOOL) print this help \n\n " ;
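
The header convention in the usage text (the ID is everything before the first space, the description is everything after it) and the matching modes can be illustrated outside of Perl. The following is a minimal Python sketch, not a port of FetchSeqs.pl. The function names `split_header`, `read_fasta` and `fetch` and their parameters are hypothetical, only plain FASTA is handled, and with `regex=True` each word is treated as a plain substring, which only loosely mirrors -regex, -desc, -both, -inv and -noc.

```python
import io


def split_header(line):
    """Split '>ID description' into (ID, description)."""
    header = line[1:].rstrip("\n")
    seq_id, _, desc = header.partition(" ")
    return seq_id, desc


def read_fasta(handle):
    """Yield (header_line, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def fetch(handle, words, desc=False, both=False, regex=False,
          inv=False, noc=False):
    """Yield (ID, description, sequence) for records matching any word."""
    for header, seq in read_fasta(handle):
        seq_id, description = split_header(header)
        if both:
            target = header[1:].rstrip("\n")   # full header
        elif desc:
            target = description
        else:
            target = seq_id                    # default: ID only
        if noc:
            target = target.lower()
        hit = False
        for word in words:
            w = word.lower() if noc else word
            # regex=True: "contains the word"; otherwise exact match
            if (w in target) if regex else (w == target):
                hit = True
                break
        if hit != inv:                         # inv flips the selection
            yield seq_id, description, seq


# Made-up sample data for illustration only.
sample = io.StringIO(
    ">ERV1 element\nACGT\nACGT\n"
    ">seq2 human virus\nGGCC\n"
    ">LTR7 long terminal repeat\nTTAA\n"
)
ids = [rec[0] for rec in fetch(sample, ["ERV", "LTR"], regex=True)]
print(ids)
```

The call at the end corresponds to the first usage example (`-m ERV,LTR -regex`): it selects records whose ID contains either word and ignores the descriptions.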


Introduction

FASTA/FASTQ compression ratio

BWT and FASTA/FASTQ compression

Major categories of FASTA/FASTQ files compression methods and programs

BWT and DNA

Compression of FASTA/FASTQ files using generic compression programs

NEWS CONTENTS

Old News ;-)

[May 20, 2018] How to use mkfifo named pipes with prinseq-lite.pl

May 20, 2018 | bogdan.org.ua

[May 13, 2018] What is the difference between FASTA, FASTQ, and SAM file formats

May 13, 2018 | bioinformatics.stackexchange.com

[May 10, 2018] FAST (FAST Analysis of Sequences Toolbox), built on BioPerl, provides open source command-line tools to filter, transform, annotate and analyze biological sequence data by Peter Becich

May 10, 2018 | www.biostars.org

[May 09, 2018] FAST: FAST Analysis of Sequences Toolbox

May 09, 2018 | www.biostars.org

[May 09, 2018] How To Parse Fasta Files In Perl

May 09, 2018 | www.biostars.org

[May 09, 2018] Fasta/fasta_FetchSeqs.pl at master · 4ureliek/Fasta

May 09, 2018 | github.com

Frederick Sanger - Wikipedia
