Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers
May the source be with you, but remember the KISS principle ;-)

Compression Algorithms


Algorithms and Data Structures Recommended Books Recommended Links Recommended Papers Magazines eBooks FAQs eBooks
Huffman LZW LZH Bijective RLE JPEG History Humor Etc

These page is very incomplete and other pages give a much better overview of the various compression algorithms. The following types of compression are well documented elsewhere:

Bijective compression means that for any file X, F( F'( X ) ) == X. (F is either the compressor or decompressor, and F' is its opposite). this defines  lossless compression.  This type of compression is important in compressing files and in crypto algorithms.

Compression now is used as a standard first phase of encryption of large chunks of texts. It eliminates redundancy and thus creates significant additional difficulties for attempts to break particular cipher in many case making them simply impossible (using salt to make it unique and if a non standard algorithms with no headers is used). It destroys the notion of the dictionary of plaintext.  

The more we know the better we compress

Lossless compression reduces bits by identifying and eliminating statistical redundancy. The more heavily the source (let's say a text file) text is prepossessed and studied, the more efficiently it can be compressed. So efficiently of compression is reverse proportional to the speed of the compression. For example, no generic algorithm can compete with algorithms designed for compressing text string on regular text articles and books.  You can create even more specialized algorithm, for example specifically designed to compress Unix syslog files and it will beat any generic text compression algorithm. and so on and so forth. .

Compression always rely on non-randomness of the text: random sequence of bytes is poorly compressible. 

The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage.[7] DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, but compression can be slow. DEFLATE is used in PKZIP, Gzip and PNG. LZW (Lempel–Ziv–Welch) is used in GIF images. Also noteworthy is the LZR (Lempel-Ziv–Renau) algorithm, which serves as the basis for the Zip method.[citation needed] LZ methods use a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). Current LZ-based coding schemes that perform well are Brotli and LZX. LZX is used in Microsoft's CAB format.

Compression as creation of a special abstract machine and encoding the source as a program for this machine

In general the compression can be viewed as a conversion of static text into a special program.

In other words compression consists of converting a static text into some kind of code for a specially constructed (may be unique for this case) abstract machine.  Programs in this abstract machine are shorter then the original text while output of the program after processing by an interpreter (decompressor) is exactly the same (original) text. 

For example if we first process the text and construct a dictionary of words used and then replace each word with the index to a dictionary that will be a primitive compression that might work well on text with a small dictionary and frequently repeated words. For example it might work well for the answer to some lengthy form like "Yes No n/a Yes No n/a n/a Yes No No No Yes n/a n/a No". In this case the dictionary consists of three entries and each reply can be encoded in two bits. 

It is important to understand that the more you know about the structure of the text, the more specialized processor you and construct and the higher level of compression you can achieve. For example, if you know that a particular text is, say, an http log then you can compress it several times better then using any generic algorithms.

Compression and pattern matching

Compression is also connected with the pattern matching. The notion of a longest common subsequence (LCS) of two strings is widely used to compare files and can be used in compression by splitting text into several parts if diffing them against each other. In the most primitive case such diffing can be based on lines or paragraphs.

The "diff" command of UNIX system implement an algorithm of finding the shortest sequence of editing command able to convert one string to another. Informally, the result of a diff algorithm gives the minimum number of operations (insert a symbol, or delete a symbol) to transform one string into the other. This is related to an edit distance, called the Levenshtein distance, with the additional operation of substitution, and with weights associated to operations.

Hirschberg (1975) presents the computation of the LCS in linear space.  This ideas can be very efficiently used for compression of http logs (for example Apache logs) and, say, windows of the last 1K lines often contains at least one string that is very close to the current (in Levenshhtein metric) so that you can encode the whole string by the reference to the "etalon" string and a few operations needs to convert "etalon" to the current string.

In its turn, the search of LCS is connected with efficient string matching algorithms. The first discovered linear-time string-matching algorithm is from Morris and Pratt (1970). It has been improved by Knuth, Morris, and Pratt (1976). The search behaves like a recognition process by automaton, and a character of the text is compared to a character of the pattern no more than \log_\Phi(m+1) (\Phi is the golden ratio (1+\sqrt 5)/2).  The Boyer and Moore's algorithm (1977) is considered as the most efficient string-matching algorithm in usual applications. A simplified version of it (or the entire algorithm) is often implemented in text editors for the "search" and "substitute" commands. Several variants of Boyer and Moore's algorithm avoid the quadratic behavior when searching for all occurrences of the pattern.  In 1975, Aho and Corasick designed an O(n\log\sigma) algorithm to solve this problem, with a running time independent of the number of patterns. It is implemented by the "fgrep" command under the UNIX operating system. In applications where the text is to be searched for several patterns, it is the text that needs to be preprocessed. Even if no further information is known on their syntactic structure, it is possible and indeed extremely efficient to built an index that supports searches. Data structures to represent indexes on text files are: suffix trees (Weiner 1973, McCreight 1976, Ukkonen 1994), direct acyclic word graph (Blumer et al., 1985), suffix automata (Crochemore, 1986), and suffix arrays (Manber and Myers, 1993). All algorithms (except for suffix arrays) build the index in time O(n\log\sigma).

Notion of the optimal compression algorithm

If you can create a dictionary of symbols or words in the text (or just split the source into meaningful, repeatable chunks) the most optimal type of encoding is Huffman encoding.  please note that for ordinary English text calculating the frequency of letters and compression based on this freqncy is far from optimal approach. Words represent far better target. If separation of into "words" is impossible then "digraphs" are better deal then single letters.

Huffman coding - Wikipedia, the free encyclopedia

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding and/or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".[1]

The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in linear time to the number of input weights if these weights are sorted.[2] However, although optimal among methods encoding symbols separately, Huffman coding is not always optimal among all compression methods.

... ... ...


A source generates 4 different symbols { a 1 , a 2 , a 3 , a 4 } {\displaystyle \{a_{1},a_{2},a_{3},a_{4}\}} with probability { 0.4 ; 0.35 ; 0.2 ; 0.05 } {\displaystyle \{0.4;0.35;0.2;0.05\}} . A binary tree is generated from left to right taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:



a1 0
a2 10
a3 110
a4 111

The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n {\displaystyle n} . A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n {\displaystyle n} leaf nodes and n − 1 {\displaystyle n-1} internal nodes. A Huffman tree that omits unused symbols produces the most optimal code lengths.

The process essentially begins with the leaf nodes containing the probabilities of the symbol they represent, then a new node whose children are the 2 nodes with smallest probability is created, such that the new node's probability is equal to the sum of the children's probability. With the previous 2 nodes merged into one node (thus not considering them anymore), and with the new node being now considered, the procedure is repeated until only one node remains, the Huffman tree.

The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:
1.Create a leaf node for each symbol and add it to the priority queue.
2.While there is more than one node in the queue: 1.Remove the two nodes of highest priority (lowest probability) from the queue
2.Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
3.Add the new node to the queue.

3.The remaining node is the root node and the tree is complete.

Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.

If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:
1.Start with as many leaves as there are symbols.
2.Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
3.While there is more than one node in the queues: 1.Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
2.Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
3.Enqueue the new node into the rear of the second queue.

4.The remaining node is the root node; the tree has now been generated.

Although linear-time given sorted input, in the general case of arbitrary input, using this algorithm requires pre-sorting. Thus, since sorting takes O(n log n) time in the general case, both methods have the same overall complexity.

In many cases, time complexity is not very important in the choice of algorithm here, since n here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded); whereas complexity analysis concerns the behavior when n grows to be very large.

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.

Here's an example of optimized Huffman coding using the French subject string "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs". Note that the original Huffman coding tree structure would be different from the given example:


Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must be somehow reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori. A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using canonical encoding, the compression model can be precisely reconstructed with just B 2 B {\displaystyle B2^{B}} bits of information (where B {\displaystyle B} is the number of bits per symbol). Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well. In any case, since the compressed data can include unused "trailing bits" the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code length optimality, however).

Main properties

The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. This requires that a frequency table must be stored with the compressed text. See the Decompression section above for more information about the various techniques employed for this purpose.


Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e., a stream of unrelated symbols) with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown. Also, if symbols are not independent and identically distributed, a single code may be insufficient for optimality. Other methods such as arithmetic coding and LZW coding often have better compression capability: Both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, useful when input probabilities are not precisely known or vary significantly within the stream. However, these methods have higher computational complexity. Also, both arithmetic coding and LZW were historically a subject of some concern over patent issues. However, as of mid-2010, the most commonly used techniques for these alternatives to Huffman coding have passed into the public domain as the early patents have expired.

However, the limitations of Huffman coding should not be overstated; it can be used adaptively, accommodating unknown, changing, or context-dependent probabilities. In the case of known independent and identically distributed random variables, combining symbols ("blocking") reduces inefficiency in a way that approaches optimality as the number of symbols combined increases. Huffman coding is optimal when each input symbol is a known independent and identically distributed random variable having a probability that is an the inverse of a power of two.

Prefix codes tend to have inefficiency on small alphabets, where probabilities often fall between these optimal points. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2−1 = 0.5, making the upper limit of inefficiency unbounded. These situations often respond well to a form of blocking called run-length encoding; for the simple case of Bernoulli processes, Golomb coding is a probably optimal run-length code.

For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. This reflects the fact that compression is not possible with such an input.


Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time. An exhaustive list of papers on Huffman coding and its variations is given by "Code and Parse Trees for Lossless Source Encoding".[4]

n-ary Huffman coding

The n-ary Huffman algorithm uses the {0, 1, ... , n − 1} alphabet to encode message and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n equals 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of source words can properly form an n-ary tree for Huffman coding. In these cases, additional 0-probability place holders must be added. This is because the tree must form an n to 1 contractor; for binary coding, this is a 2 to 1 contractor, and any sized set can form such a contractor. If the number of source words is congruent to 1 modulo n-1, then the set of source words will form a proper Huffman tree.

Adaptive Huffman coding

A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates. It is used rarely in practice, since the cost of updating the tree makes it slower than optimized adaptive arithmetic coding, that is more flexible and has a better compression.

Huffman template algorithm

Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only that the weights form a totally ordered commutative monoid, meaning a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing max i [ w i + l e n g t h ( c i ) ] {\displaystyle \max _{i}\left[w_{i}+\mathrm {length} \left(c_{i}\right)\right]} , a problem first applied to circuit design.

Length-limited Huffman coding/minimum variance Huffman coding

Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O ( n L ) {\displaystyle O(nL)} , where L {\displaystyle L}  is the maximum length of a codeword. No algorithm is known to solve this problem in linear or linearithmic time, unlike the presorted and unsorted conventional Huffman problems, respectively.

Huffman coding with unequal letter costs

In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.

Huffman coding with unequal letter costs is the generalization without this assumption: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding, though it has been solved by Karp whose solution has been refined for the case of integer costs by Golin.

Optimal alphabetic binary trees (Hu–Tucker coding)

In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, A = { a , b , c } {\displaystyle A=\left\{a,b,c\right\}}  could not be assigned code H ( A , C ) = { 00 , 1 , 01 } {\displaystyle H\left(A,C\right)=\left\{00,1,01\right\}} , but instead should be assigned either H ( A , C ) = { 00 , 01 , 1 }   or H ( A , C ) = { 0 , 10 , 11 }  . This is also known as the Hu–Tucker problem, after T. C. Hu and Alan Tucker, the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem,[5] which has some similarities to Huffman algorithm, but is not a variation of this algorithm. These optimal alphabetic binary trees are often used as binary search trees.

The canonical Huffman code

Main article: Canonical Huffman code

If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found from calculating these lengths, rendering Hu–Tucker coding unnecessary. The code resulting from numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman-Shannon-Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon-Fano coding. The Huffman-Shannon-Fano code corresponding to the example is { 000 , 001 , 01 , 10 , 11 }  , which, having the same codeword lengths as the original solution, is also optimal. But in canonical Huffman code, the result is { 110 , 111 , 00 , 01 , 10 }


  1. Aho, A.V. 1990. Algorithms for finding patterns in strings. In: Handbook of Theoretical Computer Science, Algorithms and Complexity, Vol. A, ch. 5, pp 255--330. J. van Leeuwen ed., Elsevier, Amsterdam.
  2. Bell, T.C., Cleary J.G. and Witten, I.H. 1990. Text Compression. Prentice Hall, Englewood Cliffs, New Jersey.
  3. Cormen, T.H., Leiserson C.E. and Rivest, R.L. 1990. Introduction to Algorithms, ch. 34, pp 853--885. MIT Press.
  4. Crochemore, M. and Rytter W. 1994. Text Algorithms. Oxford University Press.
  5. Gonnet, G.H. and Baeza-Yates, R.A. 1991. Handbook of Algorithms and Data Structures, ch. 7, pp 251--288. Addison-Wesley.
  6. Nelson, M. 1992. The Data Compression Book. M&T Books.
  7. Sedgewick R. 1990. Algorithms in C, ch. 19 and 22. Addison-Wesley.
  8. Stephen, G.A. 1994. String Searching Algorithms. World Scientific Press

HTTP compression

HTTP compression - Wikipedia, the free encyclopedia

Content-Encoding tokens[edit]

The official list of tokens available to servers and client is maintained by IANA,[4] and it includes:

In addition to these, a number of unofficial or non-standardized tokens are used in the wild by either servers or clients:

Servers that support HTTP compression[edit]

The compression in HTTP can also be achieved by using the functionality of server-side scripting languages like PHP, or programming languages like Java.

Top Visited
Past week
Past month


Old News ;-)

[May 04, 2018] Why are tar archive formats switching to xz compression to replace bzip2 and what about gzip

May 04, 2018 |

のbるしtyぱんky ,Jan 6, 2014 at 18:39

More and more tar archives use the xz format based on LZMA2 for compression instead of the traditional bzip2(bz2) compression. In fact made a late " Good-bye bzip2 " announcement, 27th Dec. 2013 , indicating kernel sources would from this point on be released in both tar.gz and tar.xz format - and on the main page of the website what's directly offered is in tar.xz .

Are there any specific reasons explaining why this is happening and what is the relevance of gzip in this context?

> ,

For distributing archives over the Internet, the following things are generally a priority:
  1. Compression ratio (i.e., how small the compressor makes the data);
  2. Decompression time (CPU requirements);
  3. Decompression memory requirements; and
  4. Compatibility (how wide-spread the decompression program is)

Compression memory & CPU requirements aren't very important, because you can use a large fast machine for that, and you only have to do it once.

Compared to bzip2, xz has a better compression ratio and lower (better) decompression time. It, however -- at the compression settings typically used -- requires more memory to decompress [1] and is somewhat less widespread. Gzip uses less memory than either.

So, both gzip and xz format archives are posted, allowing you to pick:

There isn't really a realistic combination of factors that'd get you to pick bzip2. So its being phased out.

I looked at compression comparisons in a blog post . I didn't attempt to replicate the results, and I suspect some of it has changed (mostly, I expect xz has improved, as its the newest.)

(There are some specific scenarios where a good bzip2 implementation may be preferable to xz: bzip2 can compresses a file with lots of zeros and genome DNA sequences better than xz. Newer versions of xz now have an (optional) block mode which allows data recovery after the point of corruption and parallel compression and [in theory] decompression. Previously, only bzip2 offered these. [2] However none of these are relevant for kernel distribution)

1: In archive size, xz -3 is around bzip -9 . Then xz uses less memory to decompress. But xz -9 (as, e.g., used for Linux kernel tarballs) uses much more than bzip -9 . (And even xz -0 needs more than gzip -9 ).

2: F21 System Wide Change: lbzip2 as default bzip2 implementation

> ,

First of all, this question is not directly related to tar . Tar just creates an uncompressed archive, the compression is then applied later on.

Gzip is known to be relatively fast when compared to LZMA2 and bzip2. If speed matters, gzip (especially the multithreaded implementation pigz ) is often a good compromise between compression speed and compression ratio. Although there are alternatives if speed is an issue (e.g. LZ4).

However, if a high compression ratio is desired LZMA2 beats bzip2 in almost every aspect. The compression speed is often slower, but it decompresses much faster and provides a much better compression ratio at the cost of higher memory usage.

There is not much reason to use bzip2 any more, except of backwards compatibility. Furthermore, LZMA2 was desiged with multithreading in mind and many implementations by default make use of multicore CPUs (unfortunately xz on Linux does not do this, yet). This makes sense since the clock speeds won't increase any more but the number of cores will.

There are multithreaded bzip2 implementations (e.g. pbzip ), but they are often not installed by default. Also note that multithreaded bzip2 only really pay off while compressing whereas decompression uses a single thread if the file was compress using a single threaded bzip2 , in contrast to LZMA2. Parallel bzip2 variants can only leverage multicore CPUs if the file was compressed using a parallel bzip2 version, which is often not the case.

Slyx ,Jan 6, 2014 at 19:14

Short answer : xz is more efficient in terms of compression ratio. So it saves disk space and optimizes the transfer through the network.
You can see this Quick Benchmark so as to discover the difference by practical tests.

Mark Warburton ,Apr 14, 2016 at 14:15

LZMA2 is a block compression system whereas gzip is not. This means that LZMA2 lends itself to multi-threading. Also, if corruption occurs in an archive, you can generally recover data from subsequent blocks with LZMA2 but you cannot do this with gzip. In practice, you lose the entire archive with gzip subsequent to the corrupted block. With an LZMA2 archive, you only lose the file(s) affected by the corrupted block(s). This can be important in larger archives with multiple files.


[May 04, 2018] Gzip vs Bzip2 vs XZ Performance Comparison

Notable quotes:
"... With XZ it is possible to specify the amount of threads to run which can greatly increase performance ..."
May 04, 2018 |

Gzip vs Bzip2 vs XZ Performance Comparison Posted by Jarrod on September 17, 2015 Leave a comment (21) Go to comments Gzip, Bzip2 and XZ are all popular compression tools used in UNIX based operating systems, but which should you use? Here we are going to benchmark and compare them against each other to get an idea of the trade off between the level of compression and time taken to achieve it.

For further information on how to use gzip, bzip2 or xz see our guides below:

The Test Server

The test server was running CentOS 7.1.1503 with kernel 3.10.0-229.11.1 in use, all updates to date are fully applied. The server had 4 CPU cores and 16GB of available memory, during the tests only one CPU core was used as all of these tools run single threaded by default, while testing this CPU core would be fully utilized. With XZ it is possible to specify the amount of threads to run which can greatly increase performance , for further information see example 9 here .

All tests were performed on linux-3.18.19.tar, a copy of the Linux kernel from . This file was 580,761,600 Bytes in size prior to compression.

The Benchmarking Process

The linux-3.18.19.tar file was compressed and decompressed 9 times each by gzip, bzip2 and xz at each available compression level from 1 to 9. A compression level of 1 indicates that the compression will be fastest but the compression ratio will not be as high so the file size will be larger. Compression level 9 on the other hand is the best possible compression level, however it will take the longest amount of time to complete.

There is an important trade off here between the compression levels between CPU processing time and the compression ratio. To get a higher compression ratio and save a greater amount of disk space, more CPU processing time will be required. To save and reduce CPU processing time a lower compression level can be used which will result in a lower compression ratio, using more disk space.

Each time the compression or decompression command was run, the 'time' command was placed in front so that we could accurately measure how long the command took to execute.

Below are the commands that were run for compression level 1:

time bzip2 -1v linux-3.18.19.tar
time gzip -1v linux-3.18.19.tar
time xz -1v linux-3.18.19.tar

All commands were run with the time command, verbosity and the compression level of -1 which was stepped through incrementally up to -9. To decompress, the same command was used with the -d flag.

The versions pf these tools were gzip 1.5, bzip2 1.0.6, and xz (XZ Utils) 5.1.2alpha.


The raw data that the below graphs have been created from has been provided in tables below and can also be accessed in this spreadsheet .

Compressed Size

The table below indicates the size in bytes of the linux-3.18.19.tar file after compression, the first column numbered 1..9 shows the compression level passed in to the compression tool.

gzip bzip2 xz
1 153617925 115280806 105008672
2 146373307 107406491 100003484
3 141282888 103787547 97535320
4 130951761 101483135 92377556
5 125581626 100026953 85332024
6 123434238 98815384 83592736
7 122808861 97966560 82445064
8 122412099 97146072 81462692
9 122349984 96552670 80708748
Compression Time

First we'll start with the compression time, this graph shows how long it took for the compression to complete at each compression level 1 through to 9.

gzip bzip2 xz
1 13.213 78.831 48.473
2 14.003 77.557 65.203
3 16.341 78.279 97.223
4 17.801 79.202 196.146
5 22.722 80.394 310.761
6 30.884 81.516 383.128
7 37.549 82.199 416.965
8 48.584 81.576 451.527
9 54.307 82.812 500.859
Gzip vs Bzip2 vs XZ Compression Time

Gzip vs Bzip2 vs XZ Compression Time

So far we can see that gzip takes slightly longer to complete as the compression level increases, bzip2 does not change very much, while xz increases quite significantly after a compression level of 3.

Compression Ratio

Now that we have an idea of how long the compression took we can compare this with how well the file compressed. The compression ratio represents the percentage that the file has been reduced to. For example if a 100mb file has been compressed with a compression ratio of 25% it would mean that the compressed version of the file is 25mb.

gzip bzip2 xz
1 26.45 19.8 18.08
2 25.2 18.49 17.21
3 24.32 17.87 16.79
4 22.54 17.47 15.9
5 21.62 17.22 14.69
6 21.25 17.01 14.39
7 21.14 16.87 14.19
8 21.07 16.73 14.02
9 21.06 16.63 13.89
Gzip vs Bzip2 vs XZ Compression Ratio

Gzip vs Bzip2 vs XZ Compression Ratio

The overall trend here is that with a higher compression level applied, the lower the compression ratio indicating that the overall file size is smaller. In this case xz is always providing the best compression ratio, closely followed by bzip2 with gzip coming in last, however as shown in the compression time graph xz takes a lot longer to get these results after compression level 3.

Compression Speed

The compression speed in MB per second can also be observed.

gzip bzip2 xz
1 43.95 7.37 11.98
2 41.47 7.49 8.9
3 35.54 7.42 5.97
4 32.63 7.33 2.96
5 25.56 7.22 1.87
6 18.8 7.12 1.52
7 15.47 7.07 1.39
8 11.95 7.12 1.29
9 10.69 7.01 1.16
Gzip vs Bzip2 vs XZ Compression Speed

Gzip vs Bzip2 vs XZ Compression Speed Decompression Time

Next up is how long each file compressed at a particular compression level took to decompress.

gzip bzip2 xz
1 6.771 24.23 13.251
2 6.581 24.101 12.407
3 6.39 23.955 11.975
4 6.313 24.204 11.801
5 6.153 24.513 11.08
6 6.078 24.768 10.911
7 6.057 23.199 10.781
8 6.033 25.426 10.676
9 6.026 23.486 10.623
Gzip vs Bzip2 vs XZ Decompression Time

Gzip vs Bzip2 vs XZ Decompression Time

In all cases the file decompressed faster if it had been compressed with a higher compression level. Therefore if you are going to be serving out a compressed file over the Internet multiple times it may be worth compressing it with xz with a compression level of 9 as this will both reduce bandwidth over time when transferring the file, and will also be faster for everyone to decompress.

Decompression Speed

The decompression speed in MB per second can also be observed.

1 85.77 23.97 43.83
2 88.25 24.1 46.81
3 90.9 24.24 48.5
4 91.99 24 49.21
5 94.39 23.7 52.42
6 95.55 23.45 53.23
7 95.88 25.03 53.87
8 96.26 22.84 54.4
9 96.38 24.72 54.67
Gzip vs Bzip2 vs XZ Decompression Speed

Gzip vs Bzip2 vs XZ Decompression Speed Performance Differences and Comparison

By default when the compression level is not specified, gzip uses -6, bzip2 uses -9 and xz uses -6. The reason for this is pretty clear based on the results. For gzip and xz -6 as a default compression method provides a good level of compression yet does not take too long to complete, it's a fair trade off point as higher compression levels take longer to process. Bzip2 on the other hand is best used with the default compression level of 9 as is also recommended in the manual page, the results here confirm this, the compression ratio increases but the time taken is almost the same and differs by less than a second between levels 1 to 9.

In general xz achieves the best compression level, followed by bzip2 and then gzip. In order to achieve better compression however xz usually takes the longest to complete, followed by bzip2 and then gzip.

xz takes a lot more time with its default compression level of 6 while bzip2 only takes a little longer than gzip at compression level 9 and compresses a fair amount better, while the difference between bzip2 and xz is less than the difference between bzip2 and gzip making bzip2 a good trade off for compression.

Interestingly the lowest xz compression level of 1 results in a higher compression ratio than gzip with a compression level of 9 and even completes faster. Therefore using xz with a compression level of 1 instead of gzip for a better compression ratio in a faster time.

Based on these results, bzip2 is a good middle ground for compression, gzip is only a little faster while xz may not really worth it at its higher default compression ratio of 6 as it takes much longer to complete for little extra gain.

However decompressing with bzip2 takes much longer than xz or gzip, xz is a good middle ground here while gzip is again the fastest.


So which should you use? It's going to come down to using the right tool for the job and the particular data set that you are working with.

If you are interactively compressing files on the fly then you may want to do this quickly with gzip -6 (default compression level) or xz -1, however if you're configuring log rotation which will run automatically over night during a low resource usage period then it may be acceptable to use more CPU resources with xz -9 to save the greatest amount of space possible. For instance compress the Linux kernel with xz, in this case spending extra time to compress the file well once makes sense when it will be downloaded and decompressed thousands of times resulting in bandwidth savings yet still decent decompression speeds.

Based on the results here, if you're simply after being able to compress and decompress files as fast as possible with little regard to the compression ratio, then gzip is the tool for you. If you want a better compression ratio to save more disk space and are willing to spend extra processing time to get it then xz will be best to use. Although xz takes the longest to compress at higher compression levels, it has a fairly good decompression speed and compresses quite fast at lower levels. Bzip2 provides a good trade off between compression ratio and processing speed however it takes the longest to decompress so it may be a good option if the content that is being compressed will be infrequently decompressed.

In the end the best option will come down to what you're after between processing time and compression ratio. With disk space continually becoming cheaper and available in larger sizes you may be fine with saving some CPU resources and processing time to store slightly larger files. Regardless of the tool that you use, compression is a great resource for saving storage space.

[May 04, 2018] Lempel–Ziv–Markov chain algorithm - Wikipedia

May 04, 2018 |

Lempel–Ziv–Markov chain algorithm From Wikipedia, the free encyclopedia Jump to: navigation , search "LZMA" redirects here. For the airport with the ICAO code "LZMA", see Martin Airport (Slovakia) .
hide This article has multiple issues. Please help improve it or discuss these issues on the talk page . ( Learn how and when to remove these template messages )
Wiki letter w.svg This article's lead section does not adequately summarize key points of its contents . Please consider expanding the lead to provide an accessible overview of all important aspects of the article. Please discuss this issue on the article's talk page . (July 2014)
This article is written like a manual or guidebook . Please help rewrite this article from a descriptive, neutral point of view , and remove advice or instruction. (July 2014) ( Learn how and when to remove this template message )
( Learn how and when to remove this template message )

The Lempel–Ziv–Markov chain algorithm ( LZMA ) is an algorithm used to perform lossless data compression . It has been under development either since 1996 or 1998 [1] and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio (generally higher than bzip2 ) [2] [3] and a variable compression-dictionary size (up to 4 GB ), [4] while still maintaining decompression speed similar to other commonly used compression algorithms. [5]

LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters. LZMA2 supports arbitrarily scalable multithreaded compression and decompression and efficient compression of data which is partially incompressible.

Contents [ hide ] Overview [ edit ]
This section needs additional citations for verification . Please help improve this article by adding citations to reliable sources . Unsourced material may be challenged and removed. (July 2010) ( Learn how and when to remove this template message )

LZMA uses a dictionary compression algorithm (a variant of LZ77 with huge dictionary sizes and special support for repeatedly used match distances), whose output is then encoded with a range encoder , using a complex model to make a probability prediction of each bit. The dictionary compressor finds matches using sophisticated dictionary data structures, and produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder: many encodings are possible, and a dynamic programming algorithm is used to select an optimal one under certain approximations.

Prior to LZMA, most encoder models were purely byte-based (i.e. they coded each bit using only a cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase: this is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context. Furthermore, compared to classic dictionary compression (such as the one used in zip and gzip formats), the dictionary sizes can be and usually are much larger, taking advantage of the large amount of memory available on modern systems.

Compressed format overview [ edit ]
This section needs additional citations for verification . Please help improve this article by adding citations to reliable sources . Unsourced material may be challenged and removed. (July 2010) ( Learn how and when to remove this template message )

In LZMA compression, the compressed stream is a stream of bits, encoded using an adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modeled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type.

There are 7 types of packets: [ citation needed ]

packed code (bit sequence) packet name packet description
0 + byteCode LIT A single byte encoded using an adaptive binary range coder.
1+0 + len + dist MATCH A typical LZ77 sequence describing sequence length and distance.
1+1+0+0 SHORTREP A one-byte LZ77 sequence. Distance is equal to the last used LZ77 distance.
1+1+0+1 + len LONGREP[0] An LZ77 sequence. Distance is equal to the last used LZ77 distance.
1+1+1+0 + len LONGREP[1] An LZ77 sequence. Distance is equal to the second last used LZ77 distance.
1+1+1+1+0 + len LONGREP[2] An LZ77 sequence. Distance is equal to the third last used LZ77 distance.
1+1+1+1+1 + len LONGREP[3] An LZ77 sequence. Distance is equal to the fourth last used LZ77 distance.

LONGREP[*] refers to LONGREP[0-3] packets, *REP refers to both LONGREP and SHORTREP, and *MATCH refers to both MATCH and *REP.

LONGREP[n] packets remove the distance used from the list of the most recent distances and reinsert it at the front, to avoid useless repeated entry, while MATCH just adds the distance to the front even if already present in the list and SHORTREP and LONGREP[0] don't alter the list.

The length is encoded as follows:

Length code (bit sequence) Description
0+ 3 bits The length encoded using 3 bits, gives the lengths range from 2 to 9.
1+0+ 3 bits The length encoded using 3 bits, gives the lengths range from 10 to 17.
1+1+ 8 bits The length encoded using 8 bits, gives the lengths range from 18 to 273.

As in LZ77, the length is not limited by the distance, because copying from the dictionary is defined as if the copy was performed byte by byte, keeping the distance constant.

Distances are logically 32-bit and distance 0 points to the most recently added byte in the dictionary.

The distance encoding starts with a 6-bit "distance slot", which determines how many further bits are needed. Distances are decoded as a binary concatenation of, from most to least significant, two bits depending on the distance slot, some bits encoded with fixed 0.5 probability, and some context encoded bits, according to the following table (distance slots 0−3 directly encode distances 0−3).

6-bit distance slot Highest 2 bits Fixed 0.5 probability bits Context encoded bits
0 00 0 0
1 01 0 0
2 10 0 0
3 11 0 0
4 10 0 1
5 11 0 1
6 10 0 2
7 11 0 2
8 10 0 3
9 11 0 3
10 10 0 4
11 11 0 4
12 10 0 5
13 11 0 5
14–62 (even) 10 ((slot / 2) − 5) 4
15–63 (odd) 11 (((slot − 1) / 2) − 5) 4
Decompression algorithm details [ edit ]
This section possibly contains original research . Please improve it by verifying the claims made and adding inline citations . Statements consisting only of original research should be removed. (April 2012) ( Learn how and when to remove this template message )

No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text.

The description below is based on the compact XZ Embedded decoder by Lasse Collin included in the Linux kernel source [6] from which the LZMA and LZMA2 algorithm details can be relatively easily deduced: thus, while citing source code as reference isn't ideal, any programmer should be able to check the claims below with a few hours of work.

Range coding of bits [ edit ]

LZMA data is at the lowest level decoded one bit at a time by the range decoder, at the direction of the LZMA decoder.

Context-based range decoding is invoked by the LZMA algorithm passing it a reference to the "context", which consists of the unsigned 11-bit variable prob (typically implemented using a 16-bit data type) representing the predicted probability of the bit being 0, which is read and updated by the range decoder (and should be initialized to 2^10, representing 0.5 probability).

Fixed probability range decoding instead assumes a 0.5 probability, but operates slightly differently from context-based range decoding.

The range decoder state consists of two unsigned 32-bit variables, range (representing the range size), and code (representing the encoded point within the range).

Initialization of the range decoder consists of setting range to 2 32 − 1, and code to the 32-bit value starting at the second byte in the stream interpreted as big-endian; the first byte in the stream is completely ignored.

Normalization proceeds in this way:

  1. Shift both range and code left by 8 bits
  2. Read a byte from the compressed stream
  3. Set the least significant 8 bits of code to the byte value read

Context-based range decoding of a bit using the prob probability variable proceeds in this way:

  1. If range is less than 2^24, perform normalization
  2. Set bound to floor( range / 2^11) * prob
  3. If code is less than bound :
    1. Set range to bound
    2. Set prob to prob + floor((2^11 - prob ) / 2^5)
    3. Return bit 0
  4. Otherwise (if code is greater than or equal to the bound ):
    1. Set range to range - bound
    2. Set code to code - bound
    3. Set prob to prob - floor( prob / 2^5)
    4. Return bit 1

Fixed-probability range decoding of a bit proceeds in this way:

  1. If range is less than 2^24, perform normalization
  2. Set range to floor( range / 2)
  3. If code is less than range :
    1. Return bit 0
  4. Otherwise (if code is greater or equal than range ):
    1. Set code to code - range
    2. Return bit 1

The Linux kernel implementation of fixed-probability decoding in rc_direct, for performance reasons, doesn't include a conditional branch, but instead subtracts range from code unconditionally, and uses the resulting sign bit to both decide the bit to return, and to generate a mask that is combined with code and added to range .

Note that:

  1. The division by 2^11 when computing bound and floor operation is done before the multiplication, not after (apparently to avoid requiring fast hardware support for 32-bit multiplication with a 64-bit result)
  2. Fixed probability decoding is not strictly equivalent to context-based range decoding with any prob value, due to the fact that context-based range decoding discards the lower 11 bits of range before multiplying by prob as just described, while fixed probability decoding only discards the last bit
Range coding of integers [ edit ]

The range decoder also provides the bit-tree, reverse bit-tree and fixed probability integer decoding facilities, which are used to decode integers, and generalize the single-bit decoding described above. To decode unsigned integers less than limit , an array of ( limit − 1) 11-bit probability variables is provided, which are conceptually arranged as the internal nodes of a complete binary tree with limit leaves.

Non-reverse bit-tree decoding works by keeping a pointer to the tree of variables, which starts at the root. As long as the pointer doesn't point to a leaf, a bit is decoded using the variable indicated by the pointer, and the pointer is moved to either the left or right children depending on whether the bit is 0 or 1; when the pointer points to a leaf, the number associated with the leaf is returned.

Non-reverse bit-tree decoding thus happens from most significant to least significant bit, stopping when only one value in the valid range is possible (this conceptually allows to have range sizes that are not powers of two, even though LZMA doesn't make use of this).

Reverse bit-tree decoding instead decodes from least significant bit to most significant bits, and thus only supports ranges that are powers of two, and always decodes the same number of bits. It is equivalent to performing non-reverse bittree decoding with a power of two limit , and reversing the last log2( limit ) bits of the result.

Note that in the rc_bittree function in the Linux kernel, integers are actually returned in the [ limit , 2 * limit ) range (with limit added to the conceptual value), and the variable at index 0 in the array is unused, while the one at index 1 is the root, and the left and right children indices are computed as 2 i and 2 i + 1. The rc_bittree_reverse function instead adds integers in the [0, limit ) range to a caller-provided variable, where limit is implicitly represented by its logarithm, and has its own independent implementation for efficiency reasons.

Fixed probability integer decoding simply performs fixed probability bit decoding repeatedly, reading bits from the most to the least significant.

LZMA configuration [ edit ]

The LZMA decoder is configured by an lclppb "properties" byte and a dictionary size.

The value of the lclppb byte is lc + lp * 9 + pb * 9 * 5, where:

In non-LZMA2 streams, lc must not be greater than 8, and lp and pb must not be greater than 4. In LZMA2 streams, ( lc + lp ) and pb must not be greater than 4.

In the 7-zip LZMA file format, configuration is performed by a header containing the "properties" byte followed by the 32-bit little-endian dictionary size in bytes. In LZMA2, the properties byte can optionally be changed at the start of LZMA2 LZMA packets, while the dictionary size is specified in the LZMA2 header as later described.

LZMA coding contexts [ edit ]

The LZMA packet format has already been described, and this section specifies how LZMA statistically models the LZ-encoded streams, or in other words which probability variables are passed to the range decoder to decode each bit.

Those probability variables are implemented as multi-dimensional arrays; before introducing them, a few values that are used as indices in these multidimensional arrays are defined.

The state value is conceptually based on which of the patterns in the following table match the latest 2-4 packet types seen, and is implemented as a state machine state updated according to the transition table listed in the table every time a packet is output.

The initial state is 0, and thus packets before the beginning are assumed to be LIT packets.

state previous packets next state when next packet is
4th previous 3rd previous 2nd previous previous LIT MATCH LONGREP[*] SHORTREP
0 LIT LIT LIT 0 7 8 9
1 MATCH LIT LIT 0 7 8 9
2 LONGREP[*] LIT LIT 0 7 8 9
4 MATCH LIT 1 7 8 9
5 LONGREP[*] LIT 2 7 8 9
7 LIT MATCH 4 10 11 11
8 LIT LONGREP[*] 5 10 11 11
9 LIT SHORTREP 6 10 11 11
10 *MATCH MATCH 4 10 11 11
11 *MATCH *REP 5 10 11 11

The pos_state and literal_pos_state values consist of respectively the pb and lp (up to 4, from the LZMA header or LZMA2 properties packet) least significant bits of the dictionary position (the number of bytes coded since the last dictionary reset modulo the dictionary size). Note that the dictionary size is normally the multiple of a large power of 2, so these values are equivalently described as the least significant bits of the number of uncompressed bytes seen since the last dictionary reset.

The prev_byte_lc_msbs value is set to the lc (up to 4, from the LZMA header or LZMA2 properties packet) most significant bits of the previous uncompressed byte.

The is_REP value denotes whether a packet that includes a length is a LONGREP rather than a MATCH.

The match_byte value is the byte that would have been decoded if a SHORTREP packet had been used (in other words, the byte found at the dictionary at the last used distance); it is only used just after a *MATCH packet.

literal_bit_mode is an array of 8 values in the 0-2 range, one for each bit position in a byte, which are 1 or 2 if the previous packet was a *MATCH and it is either the most significant bit position or all the more significant bits in the literal to encode/decode are equal to the bits in the corresponding positions in match_byte , while otherwise it is 0; the choice between the 1 or 2 values depends on the value of the bit at the same position in match_byte .

The literal/Literal set of variables can be seen as a "pseudo-bit-tree" similar to a bit-tree but with 3 variables instead of 1 in every node, chosen depending on the literal_bit_mode value at the bit position of the next bit to decode after the bit-tree context denoted by the node.

The claim, found in some sources, that literals after a *MATCH are coded as the XOR of the byte value with match_byte is incorrect; they are instead coded simply as their byte value, but using the pseudo-bit-tree just described and the additional context listed in the table below.

The probability variable groups used in LZMA are those:

XZ name LZMA SDK name parameterized by used when coding mode if bit 0 then if bit 1 then
is_match IsMatch state , pos_state packet start bit LIT *MATCH
is_rep IsRep state after bit sequence 1 bit MATCH *REP
is_rep0 IsRepG0 state after bit sequence 11 bit SHORTREP/


is_rep0_long IsRep0Long state , pos_state after bit sequence 110 bit SHORTREP LONGREP[0]
is_rep1 IsRepG1 state after bit sequence 111 bit LONGREP[1] LONGREP[2/3]
is_rep2 IsRepG2 state after bit sequence 1111 bit LONGREP[2] LONGREP[3]
literal Literal prev_byte_lc_msbs , literal_pos_state , literal_bit_mode [bit position], bit-tree context after bit sequence 0 256 values pseudo-bit-tree literal byte value
dist_slot PosSlot min( match_length , 5), bit-tree context distance: start 64 values bit-tree distance slot
dist_special SpecPos distance_slot , reverse bit-tree context distance: 4-13 distance slots (( distance_slot >> 1) − 1)-bit reverse bit-tree low bits of distance
dist_align Align reverse bit-tree context distance: 14+ distance slots, after fixed probability bits 4-bit reverse bit-tree low bits of distance
len_dec.choice LenChoice is_REP match length: start bit 2-9 length 10+ length
len_dec.choice2 LenChoice2 is_REP match length: after bit sequence 1 bit 10-17 length 18+ length
len_dec.low LenLow is_REP , pos_state , bit-tree context match length: after bit sequence 0 8 values bit-tree low bits of length
len_dec.mid LenMid is_REP , pos_state , bit-tree context match length: after bit sequence 10 8 values bit-tree middle bits of length
len_dec.high LenHigh is_REP , bit-tree context match length: after bit sequence 11 256 values bit-tree high bits of length
LZMA2 format [ edit ]

The LZMA2 container supports multiple runs of compressed LZMA data and uncompressed data. Each LZMA compressed run can have a different LZMA configuration and dictionary. This improves the compression of partially or completely incompressible files and allows multithreaded compression and multithreaded decompression by breaking the file into runs that can be compressed or decompressed independently in parallel.

The LZMA2 header consists of a byte indicating the dictionary size:

LZMA2 data consists of packets starting with a control byte, with the following values:

Bits 5-6 for LZMA chunks can be:

LZMA state resets cause a reset of all LZMA state except the dictionary, and specifically:

Uncompressed chunks consist of:

LZMA chunks consist of:

xz and 7z formats [ edit ]

The . xz format, which can contain LZMA2 data, is documented at , [7] while the .7z file format, which can contain either LZMA or LZMA2 data, is documented in the 7zformat.txt file contained in the LZMA SDK. [8]

Compression algorithm details [ edit ]

Similar to the decompression format situation, no complete natural language specification of the encoding techniques in 7-zip or xz seems to exist, other than the one attempted in the following text.

The description below is based on the XZ for Java encoder by Lasse Collin, [9] which appears to be the most readable among several rewrites of the original 7-zip using the same algorithms: again, while citing source code as reference isn't ideal, any programmer should be able to check the claims below with a few hours of work.

Range encoder [ edit ]

The range encoder cannot make any interesting choices, and can be readily constructed based on the decoder description.

Initialization and termination are not fully determined; the xz encoder outputs 0 as the first byte which is ignored by the decompressor, and encodes the lower bound of the range (which matters for the final bytes).

The xz encoder uses an unsigned 33-bit variable called low (typically implemented as a 64-bit integer, initialized to 0), an unsigned 32-bit variable called range (initialized to 2 32 − 1), an unsigned 8-bit variable called cache (initialized to 0), and an unsigned variable called cache_size which needs to be large enough to store the uncompressed size (initialized to 1, typically implemented as a 64-bit integer).

The cache / cache_size variables are used to properly handle carries, and represent a number defined by a big-endian sequence starting with the cache value, and followed by cache_size 0xff bytes, which has been shifted out of the low register, but hasn't been written yet, because it could be incremented by one due to a carry.

Note that the first byte output will always be 0 due to the fact that cache and low are initialized to 0, and the encoder implementation; the xz decoder ignores this byte.

Normalization proceeds in this way:

  1. If low is less than (2^32 - 2^24):
    1. Output the byte stored in cache to the compressed stream
    2. Output cache_size - 1 bytes with 0xff value
    3. Set cache to bits 24-31 of low
    4. Set cache_size to 0
  2. If low is greater or equal than 2^32:
    1. Output the byte stored in cache plus one to the compressed stream
    2. Output cache_size - 1 bytes with 0 value
    3. Set cache to bits 24-31 of low
    4. Set cache_size to 0
  3. Increment cache_size
  4. Set low to the lowest 24 bits of low shifted left by 8 bits
  5. Set range to range shifted left by 8 bits

Context-based range encoding of a bit using the prob probability variable proceeds in this way:

  1. If range is less than 2^24, perform normalization
  2. Set bound to floor( range / 2^11) * prob
  3. If encoding a 0 bit:
    1. Set range to bound
    2. Set prob to prob + floor((2^11 - prob ) / 2^5)
  4. Otherwise (if encoding a 1 bit):
    1. Set range to range - bound
    2. Set low to low + bound
    3. Set prob to prob - floor( prob / 2^5)

Fixed-probability range encoding of a bit proceeds in this way:

  1. If range is less than 2^24, perform normalization
  2. Set range to floor( range / 2)
  3. If encoding a 1 bit:
    1. Set low to low + range

Termination proceeds this way:

  1. Perform normalization 5 times

Bit-tree encoding is performed like decoding, except that bit values are taken from the input integer to be encoded rather than from the result of the bit decoding functions.

For algorithms that try to compute the encoding with the shortest post-range-encoding size, the encoder also needs to provide an estimate of that.

Dictionary search data structures [ edit ]

The encoder needs to be able to quickly locate matches in the dictionary. Since LZMA uses very large dictionaries (potentially on the order of gigabytes) to improve compression, simply scanning the whole dictionary would result in an encoder too slow to be practically usable, so sophisticated data structures are needed to support fast match searches.

Hash chains [ edit ]

The simplest approach, called "hash chains", is parameterized by a constant N which can be either 2, 3 or 4, which is typically chosen so that 2^(8× N ) is greater than or equal to the dictionary size.

It consists of creating, for each k less than or equal to N , a hash table indexed by tuples of k bytes, where each of the buckets contains the last position where the first k bytes hashed to the hash value associated with that hash table bucket.

Chaining is achieved by an additional array which stores, for every dictionary position, the last seen previous position whose first N bytes hash to the same value of the first N bytes of the position in question.

To find matches of length N or higher, a search is started using the N -sized hash table, and continued using the hash chain array; the search stop after a pre-defined number of hash chain nodes has been traversed, or when the hash chains "wraps around", indicating that the portion of the input that has been overwritten in the dictionary has been reached.

Matches of size less than N are instead found by simply looking at the corresponding hash table, which either contains the latest such match, if any, or a string that hashes to the same value; in the latter case, the encoder won't be able to find the match. This issue is mitigated by the fact that for distant short matches using multiple literals might require less bits, and having hash conflicts in nearby strings is relatively unlikely; using larger hash tables or even direct lookup tables can reduce the problem at the cost of higher cache miss rate and thus lower performance.

Note that all matches need to be validated to check that the actual bytes match currently at that specific dictionary position match, since the hashing mechanism only guarantees that at some past time there were characters hashing to the hash table bucket index (some implementations may not even guarantee that, because they don't initialize the data structures).

LZMA uses Markov chains , as implied by "M" in its name.

Binary trees [ edit ]

The binary tree approach follows the hash chain approach, except that it logically uses a binary tree instead of a linked list for chaining.

The binary tree is maintained so that it is always both a search tree relative to the suffix lexicographic ordering, and a max-heap for the dictionary position [10] (in other words, the root is always the most recent string, and a child cannot have been added more recently than its parent): assuming all strings are lexicographically ordered, these conditions clearly uniquely determine the binary tree (this is trivially provable by induction on the size of the tree).

Since the string to search for and the string to insert are the same, it is possible to perform both dictionary search and insertion (which requires to rotate the tree) in a single tree traversal.

Patricia tries [ edit ]

Some old LZMA encoders also supported a data structure based on Patricia tries , but such support has since been dropped since it was deemed inferior to the other options. [10]

LZMA encoder [ edit ]

LZMA encoders can freely decide which match to output, or whether to ignore the presence of matches and output literals anyway.

The ability to recall the 4 most recently used distances means that, in principle, using a match with a distance that will be needed again later may be globally optimal even if it is not locally optimal, and as a result of this, optimal LZMA compression probably requires knowledge of the whole input and might require algorithms too slow to be usable in practice.

Due to this, practical implementations tend to employ non-global heuristics.

The xz encoders use a value called nice_len (the default is 64): when any match of length at least nice_len is found, the encoder stops the search and outputs it, with the maximum matching length.

Fast encoder [ edit ]

The XZ fast encoder [11] (derived from the 7-zip fast encoder) is the shortest LZMA encoder in the xz source tree.

It works like this:

  1. Perform combined search and insertion in the dictionary data structure
  2. If any repeated distance matches with length at least nice_len :
    • Output the most frequently used such distance with a REP packet
  3. If a match was found of length at least nice_len :
    • Output it with a MATCH packet
  4. Set the main match to the longest match
  5. Look at the nearest match of every length in decreasing length order, and until no replacement can be made:
    • Replace the main match with a match which is one character shorter, but whose distance is less than 1/128 the current main match distance
  6. Set the main match length to 1 if the current main match is of length 2 and distance at least 128
  7. If a repeated match was found, and it is shorter by at most 1 character than the main match:
    • Output the repeated match with a REP packet
  8. If a repeated match was found, and it is shorter by at most 2 characters than the main match, and the main match distance is at least 512:
    • Output the repeated match with a REP packet
  9. If a repeated match was found, and it is shorter by at most 3 characters than the main match, and the main match distance is at least 32768:
    • Output the repeated match with a REP packet
  10. If the main match size is less than 2 (or there isn't any match):
    • Output a LIT packet
  11. Perform a dictionary search for the next byte
  12. If the next byte is shorter by at most 1 character than the main match, with distance less than 1/128 times the main match distance, and if the main match length is at least 3:
    • Output a LIT packet
  13. If the next byte has a match at least as long as the main match, and with less distance than the main match:
    • Output a LIT packet
  14. If the next byte has a match at least one character longer than the main match, and such that 1/128 of its distance is less or equal than the main match distance:
    • Output a LIT packet
  15. If the next byte has a match more than one character longer than the main match:
    • Output a LIT packet
  16. If any repeated match is shorter by at most 1 character than the main match:
    • Output the most frequently used such distance with a REP packet
  17. Output the main match with a MATCH packet
Normal encoder [ edit ]

The XZ normal encoder [12] (derived from the 7-zip normal encoder) is the other LZMA encoder in the xz source tree, which adopts a more sophisticated approach that tries to minimize the post-range-encoding size of the generated packets.

Specifically, it encodes portions of the input using the result of a dynamic programming algorithm, where the subproblems are finding the approximately optimal encoding (the one with minimal post-range-encoding size) of the substring of length L starting at the byte being compressed.

The size of the portion of the input processed in the dynamic programming algorithm is determined to be the maximum between the longest dictionary match and the longest repeated match found at the start position (which is capped by the maximum LZMA match length, 273); furthermore, if a match longer than nice_len is found at any point in the range just defined, the dynamic programming algorithm stops, the solution for the subproblem up to that point is output, the nice_len -sized match is output, and a new dynamic programming problem instance is started at the byte after the match is output.

Subproblem candidate solutions are incrementally updated with candidate encodings, constructed taking the solution for a shorter substring of length L', extended with all possible "tails", or sets of 1-3 packets with certain constraints that encode the input at the L' position. Once the final solution of a subproblem is found, the LZMA state and least used distances for it are computed, and are then used to appropriately compute post-range-encoding sizes of its extensions.

At the end of the dynamic programming optimization, the whole optimal encoding of the longest substring considered is output, and encoding continues at the first uncompressed byte not already encoded, after updating the LZMA state and least used distances.

Each subproblem is extended by a packet sequence which we call "tail", which must match one of the following patterns:

1st packet 2nd packet 3rd packet

The reason for not only extending with single packets is that subproblems only have the substring length as the parameter for performance and algorithmic complexity reasons, while an optimal dynamic programming approach would also require to have the last used distances and LZMA state as parameter; thus, extending with multiple packets allows to better approximate the optimal solution, and specifically to make better use of LONGREP[0] packets.

The following data is stored for each subproblem (of course, the values stored are for the candidate solution with minimum price ), where by "tail" we refer to the packets extending the solution of the smaller subproblem, which are described directly in the following structure:

XZ for Java member name description
price quantity to be minimized: number of post-range-encoding bits needed to encode the string
optPrev uncompressed size of the substring encoded by all packets except the last one
backPrev -1 if the last packet is LIT, 0-3 if it is a repetition using the last used distance number 0-3, 4 + distance if it is a MATCH (this is always 0 if prev1IsLiteral is true, since the last packet can only be a LONGREP[0] in that case)
prev1IsLiteral true if the "tail" contains more than one packet (in which case the one before the last is a LIT)
hasPrev2 true if the "tail" contains 3 packets (only valid if prev1IsLiteral is true)
optPrev2 uncompressed size of the substring encoded by all packets except the "tail" (only valid if prev1IsLiteral and hasPrev2 are true)
backPrev2 -1 if the first packet in the "tail" is LIT, 0-3 if it is a repetition using the last used distance number 0-3, 4 + distance if it is a MATCH (only valid if prev1IsLiteral and hasPrev2 are true)
reps[4] the values of the 4 last used distances after the packets in the solution (computed only after the best subproblem solution has been determined)
state the LZMA state value after the packets in the solution (computed only after the best subproblem solution has been determined)

Note that in the XZ for Java implementation, the optPrev and backPrev members are reused to store a forward single-linked list of packets as part of outputting the final solution.

LZMA2 encoder [ edit ]

The XZ LZMA2 encoder processes the input in chunks (of up to 2 MB uncompressed size or 64 KB compressed size, whichever is lower), handing each chunk to the LZMA encoder, and then deciding whether to output an LZMA2 LZMA chunk including the encoded data, or to output an LZMA2 uncompressed chunk, depending on which is shorter (LZMA, like any other compressor, will necessarily expand rather than compress some kinds of data).

The LZMA state is reset only in the first block, if the caller requests a change of properties and every time a compressed chunk is output. The LZMA properties are changed only in the first block, or if the caller requests a change of properties. The dictionary is only reset in the first block.

Upper encoding layers [ edit ]

Before LZMA2 encoding, depending on the options provided, xz can apply the BCJ filter, which filters executable code to replace relative offsets with absolute ones that are more repetitive, or the delta filter, which replaces each byte with the difference between it and the byte N bytes before it.

Parallel encoding is performed by dividing the file in chunks which are distributed to threads, and ultimately each encoded (using, for instance, xz block encoding) separately, resulting in a dictionary reset between chunks in the output file.

7-Zip reference implementation [ edit ]

The LZMA implementation extracted from 7-Zip is available as LZMA SDK. It was originally dual-licensed under both the GNU LGPL and Common Public License , [13] with an additional special exception for linked binaries, but was placed by Igor Pavlov in the public domain on December 2, 2008, with the release of version 4.62. [8]

LZMA2 compression, which is an improved version of LZMA, [14] is now the default compression method for the .7z format, starting with version 9.30 on October 26th, 2012. [15]

The reference open source LZMA compression library is written in C++ and has the following properties:

In addition to the original C++ , the LZMA SDK contains reference implementations of LZMA compression and decompression ported to ANSI C , C# , and Java . [8] There are also third-party Python bindings for the C++ library, as well as ports of LZMA to Pascal , Go and Ada . [16] [17] [18] [19]

The 7-Zip implementation uses several variants of hash chains , binary trees and Patricia tries as the basis for its dictionary search algorithm.

Decompression-only code for LZMA generally compiles to around 5 KB, and the amount of RAM required during decompression is principally determined by the size of the sliding window used during compression. Small code size and relatively low memory overhead, particularly with smaller dictionary lengths, and free source code make the LZMA decompression algorithm well-suited to embedded applications.

Other implementations [ edit ]

In addition to the 7-Zip reference implementation, the following support the LZMA format.

LZHAM [ edit ]

LZHAM (LZ, Huffman, Arithmetic, Markov), is an LZMA-like implementation that trades compression throughput for very high ratios and higher decompression throughput. [23]

References [ edit ]
  1. Jump up ^ Igor Pavlov has asserted multiple times on SourceForge that the algorithm is his own creation. (2004-02-19). "LZMA spec?" . Archived from the original on 2009-08-25 . Retrieved 2013-06-16 .
  2. ^ Jump up to: a b Lasse Collin (2005-05-31). "A Quick Benchmark: Gzip vs. Bzip2 vs. LZMA" . Retrieved 2015-10-21 . - LZMA Unix Port was finally replaced by xz which features better and faster compression; from here we know even LZMA Unix Port was a lot better than gzip and bzip2.
  3. Jump up ^ Klausmann, Tobias (2008-05-08). "Gzip, Bzip2 and Lzma compared" . Blog of an Alpha animal . Retrieved 2013-06-16 .
  4. Jump up ^ Igor Pavlov (2013). "7z Format" . Retrieved 2013-06-16 .
  5. Jump up ^ Mahoney, Matt. "Data Compression Explained" . Retrieved 2013-11-13 .
  6. Jump up ^ Collin, Lasse; Pavlov, Igor. "lib/xz/xz_dec_lzma2.c" . Retrieved 2013-06-16 .
  7. Jump up ^ "The .xz File Format" . 2009-08-27 . Retrieved 2013-06-16 .
  8. ^ Jump up to: a b c Igor Pavlov (2013). "LZMA SDK (Software Development Kit)" . Retrieved 2013-06-16 .
  9. Jump up ^ "XZ in Java" . Retrieved 2013-06-16 . [ permanent dead link ]
  10. ^ Jump up to: a b Solomon, David (2006-12-19). Data Compression: The Complete Reference (4 ed.). Springer Publishing . p. 245. ISBN 978-1846286025 .
  11. Jump up ^ Collin, Lasse; Pavlov, Igor. "" . Archived from the original on 2012-07-16 . Retrieved 2013-06-16 .
  12. Jump up ^ Collin, Lasse; Pavlov, Igor. "" . Archived from the original on 2012-07-08 . Retrieved 2013-06-16 .
  13. Jump up ^ "Browse /LZMA SDK/4.23" . Sourceforge . Retrieved 2014-02-12 .
  14. Jump up ^ "Inno Setup Help" . . Retrieved 2013-06-16 . LZMA2 is a modified version of LZMA that offers a better compression ratio for uncompressible data (random data expands about 0.005%, compared to 1.35% with original LZMA), and optionally can compress multiple parts of large files in parallel, greatly increasing compression speed but with a possible reduction in compression ratio.
  15. Jump up ^ "HISTORY of the 7-Zip" . 2012-10-26 . Retrieved 2013-06-16 .
  16. Jump up ^ Bauch, Joachim (2010-04-07). "PyLZMA – Platform independent python bindings for the LZMA compression library" . Retrieved 2013-06-16 .
  17. Jump up ^ Birtles, Alan (2006-06-13). "Programming Help: Pascal LZMA SDK" . Retrieved 2013-06-16 .
  18. Jump up ^ Vieru, Andrei (2012-06-28). "compress/lzma package for Go 1" . Retrieved 2013-06-16 .
  19. Jump up ^ "Zip-Ada" .
  20. Jump up ^ Guillem Jover. "Accepted dpkg 1.17.0 (source amd64 all)" . Debian Package QA . Retrieved 2015-10-21 .
  21. Jump up ^ Diaz, Diaz. "Lzip Benchmarks" . LZIP (nongnu).
  22. Jump up ^ "What is a Zipx File?" . . Retrieved 2016-03-14 .
  23. Jump up ^ "LZHAM – Lossless Data Compression Codec" . Richard Geldreich. LZHAM is a lossless data compression codec written in C/C++ with a compression ratio similar to LZMA but with 1.5x-8x faster decompression speed.
External links [ edit ]
[ show ] Archive formats
Archiving only
Compression only
Archiving and compression
Software packaging and distribution
Document packaging and distribution
[ show ] Data compression methods
Entropy type
Dictionary type
Other types
Codec parts
Codec parts
Retrieved from "–Ziv–Markov_chain_algorithm&oldid=831085554 " Categories : Hidden categories: Navigation menu Personal tools Namespaces Variants Views More Navigation Interaction Tools Print/export Languages Edit links

[Mar 13, 2018] New Ubuntu Installs Could Be Speed Up by 10% with the Zstd Compression Algorithm

Mar 13, 2018 |

(Mar 12, 2018, 11:00) ( 0 talkbacks )

zstd is an open-source lossless data compression algorithm designed to offer fast real-time compression and decompression speeds, even faster than xz or gzip.

[May 13, 2011] Lzip

Lzip is a lossless data compressor based on the LZMA algorithm, with very safe integrity checking and a user interface similar to the one of gzip or bzip2. Lzip decompresses almost as quickly as gzip and compresses better than bzip2, which makes it well suited for software distribution and data archiving. Lziprecover is a data recovery tool for lzip compressed files able to repair slightly damaged files, recover badly damaged files from two or more copies, and extract undamaged members from multi-member files.


Re: Other LZMA tools

I think you are right that the standalone 'lzma' program (replacing the older lzmash) has a very basic data format. But still, it works, and is the more established tool. I would be happy for lzip to replace it if lzip is better, but to do that it should include support for decompressing legacy .lzma files.

(I note that the gzip format has provision for alternative compression methods but nobody ever seems to use it.)

> As for lrzip, it is actually

> an extension of rzip---and the two are

> more of a proof-of-concept than a

> realworld-workable format.

The file format may be basic but the tool is very good. It usually compresses better than plain LZMA (the algorithm, used in both lzma-utils and lzip) and faster too. LZMA is better for all-purpose use but for batch compression tasks where you don't mind relatively high memory usage, lrzip can give a big improvement. For some Subversion dump files I back up overnight it gave a fourfold increase in compression for about the same speed.


Re: Other LZMA tools

As I gather, lzma-utils-produced files lack magic identification bytes and a checksum, and if you believe forum archives, lzma-utils did not manage to come up with a suitable new format in more than two years. It is about time lzip came along-7z sounds nice too, but seems to have gotten no ground in the Unix world due to subpar unix integration.

As for lrzip, it is actually an extension of rzip-and the two are more of a proof-of-concept than a realworld-workable format.

[Jan 30, 2008] History of Data Compression in Japan

In 1987, I was asked by a magazine editor to write an article about data compression. I wrote a manuscript and an accompanying program, sent them to the editor, and forgot about them. The next time I heard from him I was told that the magazine was discontinued. So I uploaded my program, a simple implementation of the LZSS compression algorithm (see below) to PC-VAN, a big Japanese BBS run by NEC. That was May 1, 1988.

Soon a number of hobby programmers gathered and began improving on that program. The project culminated in Kazuhiko Miki's archiver LArc, which was fairly widey used in Japan. (Dr. Miki was then a medical specialist working at a governmental office. I heard he left office and began work on freeware/shareware promotion.)

The LZSS algorithm is based on a very simple idea. Suppose I'm going to write "compression" here. But probably I've already used that word before in this file. If I used that word 57 characters before, I might as well write "go 57 characters backward, and read 11 characters," or <57,11> for short. In general, when I've already used the string of characters among the recent 4096 characters, say, I encode the string by a <position,length> pair.

In Storer's [8] terminology, this is a sliding dictionary algorithm, analyzed first by Ziv and Lempel [14] and then by Storer and Szymanski [9], among others.

Later versions of my LZSS implementations and Miki's LArc used binary search trees to make string search faster; see Bell [1].

Incidentally, there are two distinct Ziv-Lempel (LZ) methods: sliding dictionary [14] and dynamic dictionary [15] in Storer's [8] terminology. The LZW algorithm [12] belongs to the latter. Most pre-LHarc compression tools, such as 'compress', 'ARC', and 'PKARC', used LZW.

During the summer of 1988, I wrote another compression program, LZARI. This program is based on the following observation: Each output of LZSS is either a single character or a <position,length> pair. A single character can be coded as an integer between 0 and 255. As for the <length> field, if the range of <length> is 2 to 257, say, it can be coded as an integer between 256 and 511. Thus, I can say that there are 512 kinds of "characters," and the "characters" 256 through 511 are accompanied by a <position> field. These 512 "characters" can be Huffman-coded, or better still, algebraically coded. The <position> field can be coded in the same manner. In LZARI I used an adaptive algebraic compression [13], [2] to encode the "characters," and static algebraic compression to encode the <position> field. (There were several versions of LZARI; some of them were slightly different from the above description.) The compression of LZARI was very tight, though rather slow.

Haruyasu Yoshizaki (Yoshi), a physician and guru hobby programmer, worked very hard to make LZARI faster. Most importantly, he replaced LZARI's algebraic compression by dynamic Huffman coding.

His program, LZHUF, was very successful. It was much faster than my LZARI. As for compression ratio, Huffman cannot beat algebraic compression, but the difference turned out to be very small.

Yoshi rewrote the compression engine of LZHUF in assembler, and added a nifty user interface. His archiver, LHarc, soon became the de facto standard among Japanese BBS users. After Prof. Kenjirou Okubo, a mathematician, introduced LHarc to the United States, it became world-famous. Other vendors began using similar techniques: sliding dictionary plus statistical compressions such as Huffman and Shannon-Fano. (I wondered why they used Shannon-Fano rather than Huffman which is guaranteed to compress tighter than Shannon-Fano. As it turned out, a then-popular book on compression published in U.S. contained a wrong description and buggy sample programs, such that Shannon-Fano outperformed (buggy) Huffman on many files.)

Although LHarc was much faster than LZARI, we weren't quite satisfied with its speed. Because LHarc was based on dynamic Huffman, it had to update Huffman tree every time it received a character. Yoshi and I tried other dynamic Huffman algorithms [5], [10], [11], but improvements were not as great as we desired.

So I took a different step: replacing LHarc's dynamic Huffman by a static Huffman method.

Traditional static Huffman coding algorithm first scans the input file to count character distribution, then builds Huffman tree and encodes the file. In my approach, the input file is read only once. It is first compressed by a sliding dictionary method like LZARI and LHarc, and at the same time the distributions of the "characters" (see above) and positions are counted. The output of this process is stored in main memory. When the buffer in memory is full (or the input is exhausted), the Huffman trees are constructed, and the half-processed content of the buffer is actually compressed and output.

In static Huffman, the Huffman tree must be stored in the compressed file. In the traditional approach this information consumes hundreds of bytes. My approach was to standardize Huffman trees so that (1) each left subtree is no deeper than its right counterpart, and (2) the leaves at the same level are sorted in ascending order. In this way the Huffman tree can be uniquely specified by the lengths of the codewords. Moreover, the resulting table is again compressed by the same Huffman algorithm.

To make the decoding program simpler, the Huffman tree is adjusted so that the codeword lengths do not exceed 16 bits. Since this adjusting is rarely needed, the algorithm is made very simple. It does not create optimal length-limited Huffman trees; see e.g. [6] for an optimal algorithm. Incidentally, my early program had a bug here, which was quickly pointed out and corrected by Yoshi.

The sliding dictionary algorithm is also improved by Yoshi using a "PATRICIA tree" data structure; see McCreight [7] and Fiala and Greene [4].

After completing my algorithm, I learned that Brent [3] also used a sliding dictionary plus Huffman coding. His method, SLH, is simple and elegant, but since it doesn't find the most recent longest match, the distribution of match position becomes flat. This makes the second-stage Huffman compression less efficient.

On the basis of these new algorithms, Yoshi began to rewrite his LHarc, but it took him so long (remember he was a busy doctor!) that I decided to write my own archiver. My archiver was quite recklessly named 'ar'. (Actually I appended version numbers as in 'ar002' for version 0.02.) I should have named it 'har' (after my name), say, because 'ar' collides with the name of UNIX's archiver. I didn't want my program to compete with LHarc, but I wanted many people to try the algorithm, so I wrote it in pure ANSI C. This is the reason 'ar' lacked many bells and whistles necessary for a real archiver.

Note: The version of 'ar002' most often found in the U.S. had a bug. Line 24 of maketbl.c should read, of course,
    while (i <= 16) {
        weight[i] = 1U << (16 - i);  i++;
Somehow the bug didn't show up when compiled by Turbo C.

Yoshi finally showed us his new archiver written in C. It was tentatively named LHx. He then rewrote the main logic in assembler. Yoshi and I wrote an article describing his new archiver, which would be named LH, in the January, 1991, issue of "C Magazine" (in Japanese). The suffix 'arc' of LHarc was deliberately dropped because the people who sold ARC did not want others to use the name.

Then we learned that for the new DOS 5.0, LH meaned LoadHigh, an internal command. We decided to rename LH to LHA.

Also, I was told that the algorithm described in Fiala and Greene [4] got patented ("Textual Substitution Data Compression With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990. Actually they got three patents! The other two were: "Start, Step, Stop Unary Encoding for Data Compression," Application Ser. No. 07/187,697, and "Search Tree Data Structure Encoding for Textual Substitution Data Compression Systems," Application Ser. No. 07/187,699.)

Furthermore, I learned that the original Ziv-Lempel compression method (Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch, 4,558,302, 12/1985) were patented. I also heard that Richard Stallman, of the Free Software Foundation, author of the EMACS editor and leader of the GNU project, ceased to use 'compress' program any more because its LZW algorithm got patented.

Are algorithms patentable? (See [16].) If these patents should turn out to be taken seriously, all compression programs now in use may infringe some of these patents. (Luckily, not all claims made by those algorithm patents seems to be valid.)

The foregoing is a slight modification of what I wrote in 1991. The year 1991 was a very busy year for me. In 1992, I joined the faculty of Matsusaka University. This opportunity should have given me more free time, but as it turned out I got ever busier. I stopped hacking on my compression algorithms; so did Yoshi.

Luckily, all good things in LHA were taken over, and all bad things abandoned, by the new great archiver zip and the compression tool gzip. I admire the efforts of Jean-loup Gailly and others.

A brief historical comment on PKZIP: At one time a programmer for PK and I were in close contact. We exchanged a lot of ideas. No wonder PKZIP and LHA are so similar.

Another historical comment: LHICE and ICE are definitely not written by Yoshi (or me or anyone I know). I think they are faked versions of LHarc.


Timothy C. Bell. Better OPM/L text compression. IEEE Transactions on Communications, COM-34(12):1176--1182, 1986.
Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice Hall, 1990.
R. P. Brent. A linear algorithm for data compression. The Australian Computer Journal, 19(2):64--68, 1987.
Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490--505, 1989.
Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163--180, 1985.
Lawrence L. Larmore and Daniel S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the Association for Computing Machinery, 37(3):464--473, 1990.
Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262--272, 1976.
James A. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD., 1988.
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the Association for Computing Machinery, 29(4):928--951, 1982.
Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes. Journal of the Association for Computing Machinery, 34(4):825--845, 1987.
Jeffrey Scott Vitter. Algorithm 673: Dynamic Huffman coding. ACM Transactions on Mathematical Software, 15(2):158--167, 1989.
Terry A. Welch. A technique for high-performance data compression. IEEE Computer}, 17(6):8--19, 1984.
Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520--540, 1987.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337--343, 1977.
Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530--536, 1978.
Edward N. Zalta. Are algorithms patentable? Notices of the American Mathematical Society, 35(6):796--799, 1988.

Haruhiko Okumura,

Last modified: Tue Mar 17 17:02:03 1998

Arturo Campos home page

The column "html" shows the size of the article in html format and if clicked opens it in a new window. The "zip" column contains the zipped article and images or source if any. Note that brighter rows mean new or updated article. If you take the time to read the articles, please take the time to tell me what do you think of them.
Title Html Zip Last update
Arithmetic coding, entropy coder 20k 7k 22-Jul-1999
Bwt, block reduction 21k 8k 22-Jul-1999
Canonical huffman. 15k 5k 23-Jul-1999
Crc-32, the standard Crc 6k 7k 10-Aug-1999
Finite context modeling 37k 12k 16-Nov-1999
Flexible parsing, improvement of lz 10k 4k 24-Jul-1999
Lz77, also called lzss 40k 14k 23-Jul-1999
Lzw, for gif decoding 27k 9k 23-Jul-1999
Mtf, a transformation scheme 9k 3k 24-Jul-1999
Implementing ppmc with hash tables 111k 327k 21-March-2000
Quasi Static model 19k 6k 13-Aug-1999
Range coder 24k 8k 17-Nov-1999
Rle, Basic scheme 7k 3k 24-Jul-1999

As you can read in the index I'm going to publish as soon as possible new articles, if you want to get an email when this happens, please fill this form, moreover you can use it to tell me which file format do you prefer. If you have any question about the articles feel free to email me.

My goal is to offer the best articles about compression for free. Thus I need your feedback on the articles to further improve them: What you don't know. What was confusing or unclear. What difficulties you found while implementing it. Or what you miss. Teeling me so is the only way of further improving articles for you, and depends only of you.

Note that since my first articles both my knowledge about compression and english have greatly improved, thus you can notice a difference in quality between old articles or the latest ones. Unfortunately I don't have the time to rewrite all the articles (as I would like to) so instead please tell me which articles do you find worth rewriting, and I'll do so.

[Oct 31, 2004] Data Compression -- mini-tutorial by Greg Goebel.

The spread of computing has led to an explosion in the volume of data to be stored on hard disks and sent over the Internet. This growth has led to a need for "data compression", that is, the ability to reduce the amount of storage or Internet bandwidth required to handle data.

This document provides a survey of data compression techniques. The focus is on the most prominent data compression schemes, particularly popular archivers, JPEG image compression, and MPEG video and audio compression.

[April 22, 2000] Slashdot Phillip W. Katz, Creator Of PKZIP, Dead At 37

Gone are the days... (Score:5) by zpengo (99887) on Saturday April 22 2000, @06:31PM (#1116033) Homepage

Oh, the memories...

I used to go down to the local computer store, which had bins and bins of the latest shareware, all on precious 5 1/4 disks. Each one held some sort of magic that would transform my XT with Hercules graphics into a completely absorbing experience.

Video games, clones of major applications, dinky little Pascal compilers, my first version of Spacewar....

But there was a key to all of that magic. Back then, there were no auto-installing CDs. There was no "setup.exe" There would just be a single file, with that ever-familiar extension: ".ZIP"

I had been on the scene long enough to know what was up, so I not only had PKZIP/PKUNZIP installed on my 4 meg harddrive, but I even had it in the PATH.

A few keystrokes later, the magic was unlocked.

We don't know how much we owe to this great man. I genuinely mourn his passing.

37 (Score:4) by roman_mir (125474) on Saturday April 22 2000, @06:36PM (#1116035) Homepage Journal

So many celebrities, poets, actors, revolutionaries, wariers, politicians etc have died on 33 and 37, I tell you, if you pass 37 you'll probably live a long life.

(to those of us who remember Vladimir Visotskiy) Na zifre 37, kovaren bog, rebrom vopros postavil: ili, ili
Na etom rubeje legli i Bairon i Rembo a nineshnie kak-to proskochili...

Sad day.... by maelstrom (Score:1) Saturday April 22 2000, @06:37PM

*sigh* (Score:5) by Seumas (6865) on Saturday April 22 2000, @06:37PM (#1116040)

This is the first I've heard of his death and I have to say that it really makes me feel sad. I'm not aware of much that he's done outside of PKZIP, but I sure remember using ZIP for everything online (especially when a 2400 baud modem was considered fast and a zipped file could half your online time).

Huffman, Postel, Stevens . . . Now P.W. Katz. I feel guilty for not ever considering any of these people beyond what their program does or does not do for me -- or how I benefitted from their books, until after their death. To think that while we're all out there unzipping our latest copy of the Jargon file or stashing a bunch of porn in a password protected ZIP file, this guy was suffering a serious problem which eventually took his life at the age of *thirty-seven*.

I'm only 22. I spend all my time working at a desk. I haven't been in-shape for almost six years. I could be next. I could be next and I haven't offered a damn thing to the computer or internet community. These people -- and many others, have.

I hope that we'll remember these things in subsequent posts in reply to this article. The last thing we need is another disgustingly barbaric replay of the posts we saw when W. Richard Stevens died.

I hope you have peace, Phillip.

W. Richard Stevens Slashdot Article []
W. Richard Stevens Home Page []
David Huffman Slashdot Article []
Jon Postel Slashdot Article []
Jon Postel's Home Page []

Re:PK vs. SEA controversy (Score:4) by farrellj (563) on Saturday April 22 2000, @07:04PM (#1116054) Homepage Journal

O.K., here is the story as I remember it.

Phil wrote a better compression program that was compatible with System Enhancements Associates (SEA) program called ARC. So they litigated. And so Phil went off and found a better algorithem for compression, and brought out PKZIP.

Many people in the BBS community thought that SEA was a little heavyhanded (Perception, I don't know the reality), and moved to PKZIP. Others moved over for the speed and the better compression. The rest is history.

See also "arc wars" []MIT Jargon File ver 299. This story seems to have been dropped from the current Jargon File [] for some reason.

Farrell McGovern

Former Sysop, Data/SFnet (One of the first few hundred Fidonet BBSs!) and Solsbury Hill, founding member of PODSnet.

This is really sad! (Score:5) by DeepDarkSky (111382) on Saturday April 22 2000, @07:04PM (#1116055)

I definitely remember Phil Katz and all the controversy surrounding him, and how grateful I was to have discovered his programs. I remember the first compression program which was SEA's ARC program. It was very slow. Then my friend and I discovered PKARC and PKXARC, which were much faster than ARC. As PKARC gained popularity because of its overall superiority, SEA sued Phil Katz, and he in turn created PKPAK/PKUNPAK (I think it was still paired like that). Tha PKPAK series didn't last long. The PKZIP series came out next, and that was the series that created the ubiquitous ZIP format that we see today. If I remember correctly, PKZ204G was the last official DOS version of the program, and there were plenty of trojans, etc. that were going around, and Phil created self-authenticating zip files, etc. Lots of neat little cool things. I also remember that other programs were giving PK a run for his money, such as ARJ and LHARC, but they never achieved the overall speed/performance/compression that PKZIP ever did (they were often better in one thing or another but not overall). Then WINZIP came out, and I kind lost sight of PK.

I still have thousands of ZIP files that were zipped with PKZIP. If it wasn't for him, I wouldn't have been as into computers as I am, it was because of those early days of playing around with PKARC and PKXARC that really got me started. I am terribly sad to see him go and in such (I think) indignant way.

Re:Would like to know the rest of the story (Score:3) by Trepidity (597) <delirium-slashdot.hackish@org> on Saturday April 22 2000, @07:18PM (#1116062) Homepage

Well, PKZip seems to have stopped development around 1993, well before WinZip became popular. PKZIp v2.04g was pretty much the last version I know of, and it came out february 1, 1993. Up until then there had been fairly frequent updates, but throughout 1993, 1994, and 1995, PKZIp v2.04g for DOS remained the standard compression tool.

Only then, after 2-3 years of no updates, did other tools like WinZip become popular. PKZIp finally made a Windows product in 1997 or 1998, but they were long gone by then. I'm not sure what led to the development halt, but the original stuff is fine coding...

Re:Would like to know the rest of the story (Score:5) by Harinath (26296) on Saturday April 22 2000, @07:28PM (#1116067) Homepage

IIRC, the ZIP file format was made public domain, thus allowing the Info-ZIP people to write a program that reads ZIP archives, which in turn allowed WinZip to not have to license software from PKware.

Unlike LZW, the ZIP "deflate" algorithms (LZ77 + Shannon Fano encoding) are unemcumbered. These compression algorithms are used in GNU Zip (gzip) partly for that reason. I think gzip can even read .zip archives with only one file inside. The zip algorithm is also in the zlib library, which is used in the PNG format, for one. The "deflate" algorithm is also described in RFC 1951.

So, thanks PK, for providing one of the tools that enable us to thumb our noses at Unisys :-)

Some more links... (Score:2) by Megane (129182) on Saturday April 22 2000, @07:31PM (#1116069)

From the Milwaukee Journal Sentinel:
The original obituary notice []
A more complete article []

One interesting quote:
"It was just a hobby," he said. "I didn't expect it to turn into a business."

I had a moderately successful shareware program myself during the '80s, and it sure didn't help my life much. Fortunately I have no interest in booze or drugs -- they just get in the way of hacking. And also fortunately, I let it go when it wasn't successful any more. Maybe a little later than I should have, but I did move on.

Phillip Katz's patents (Score:2) by ContinuousPark (92960) on Saturday April 22 2000, @07:32PM (#1116071)

Couldn't find much so far about him but I came across this page where several patents on data compression algorithms are mentioned and this led me to one of his patents [], a so-called string searcher and compressor.

It would be interesting to know if what the patent decribes is the technology behind PKZIP.

I first found out about this on CNET's front page; there was this little message in memoriam of PK but I don't think it was mentioned on; that was strange. This is a sad event and I think it would be more convenient and respectful if we didn't get to know the details of his death just because it turned out to be a morbid and attention-attracting story for the media.

Re:How many people here registered pkzip? (Score:1) by Anonymous Coward on Saturday April 22 2000, @06:53PM (#1116049)

i didn't either.

do you think it would be appropriate to register now, and contribute the check to his estate?

[April 19, 2000] News - Paid Death Notices - Katz, Phillip W.

Katz, Phillip W.
Publication Date: April 19, 2000
Age 37. Passed away unexpectedly on Fri., April 14, 2000. Beloved son of Hildegard and beloved brother of Cynthia. Also survived by other relatives and friends. Phil was a graduate of UWM Computer Science Engineering Program. He was the author of the PKZIP/PKUNZIP software and owner of PKWARE Inc. Co. Private services have been held. Memorials to the charity of your choice would be appreciated.

ZWASKA FUNERAL HOME 354-5330 Serving the Family

The Rise and Fall of a Software Star - Phil Katz Loved Code -- and Liquor

LZW and GIF explained John Barkaus's "LZW and GIF explained"

>From: john@cooper.cooper.EDU (John Barkaus)
Subject: GIF file format responses 5/5
Date: 21 Apr 89 20:58:01 GMT
Organization: The Cooper Union (NY, NY)

LZW and GIF explained----Steve Blackstock

I hope this little document will help enlighten those of you out there who want to know more about the Lempel-Ziv Welch compression algorithm, and, specifically, the implementation that GIF uses.

Before we start, here's a little terminology, for the purposes of this document:

a fundamental data element. In normal text files, this is just a single byte. In raster images, which is what we're interested in, it's an index that specifies the color of a given pixel. I'll refer to an arbitray character as "K".
a stream of characters, as in a data file.
a number of continuous characters, anywhere from one to very many characters in length. I can specify an arbitrary string as "[...]K".
almost the same as a string, but with the implication that a prefix immediately precedes a character, and a prefix can have a length of zero. So, a prefix and a character make up a string. I will refer to an arbitrary prefix as "[...]".
a single-character string. For most purposes, this is a character, but we may occasionally make a distinction. It is [...]K, where [...] is empty.
a number, specified by a known number of bits, which maps to a string.
the output stream of codes, as in the "raster data"
a code and its string.
"string table":
a list of entries; usually, but not necessarily, unique.

That should be enough of that.

LZW is a way of compressing data that takes advantage of repetition of strings in the data. Since raster data usually contains a lot of this repetition, LZW is a good way of compressing and decompressing it.

For the moment, lets consider normal LZW encoding and decoding. GIF's variation on the concept is just an extension from there.

LZW manipulates three objects in both compression and decompression: the charstream, the codestream, and the string table. In compression, the charstream is the input and the codestream is the output. In decompression, the codestream is the input and the charstream is the output. The string table is a product of both compression and decompression, but is never passed from one to the other.

The first thing we do in LZW compression is initialize our string table. To do this, we need to choose a code size (how many bits) and know how many values our characters can possibly take. Let's say our code size is 12 bits, meaning we can store 0->FFF, or 4096 entries in our string table. Lets also say that we have 32 possible different characters. (This corresponds to, say, a picture in which there are 32 different colors possible for each pixel.) To initialize the table, we set code#0 to character#0, code #1 to character#1, and so on, until code#31 to character#31. Actually, we are specifying that each code from 0 to 31 maps to a root. There will be no more entries in the table that have this property.

Now we start compressing data. Let's first define something called the "current prefix". It's just a prefix that we'll store things in and compare things to now and then. I will refer to it as "[.c.]". Initially, the current prefix has nothing in it. Let's also define a "current string", which will be the current prefix plus the next character in the charstream. I will refer to the current string as "[.c.]K", where K is some character. OK, look at the first character in the charstream. Call it P. Make [.c.]P the current string. (At this point, of course, it's just the root P.) Now search through the string table to see if [.c.]P appears in it. Of course, it does now, because our string table is initialized to have all roots. So we don't do anything. Now make [.c.]P the current prefix. Look at the next character in the charstream. Call it Q. Add it to the current prefix to form [.c.]Q, the current string. Now search through the string table to see if [.c.]Q appears in it. In this case, of course, it doesn't. Aha! Now we get to do something. Add [.c.]Q (which is PQ in this case) to the string table for code#32, and output the code for [.c.] to the codestream. Now start over again with the current prefix being just the root Q. Keep adding characters to [.c.] to form [.c.]K, until you can't find [.c.]K in the string table. Then output the code for [.c.] and add [.c.]K to the string table. In pseudo-code, the algorithm goes something like this:

  1. Initialize string table;
  2. [.c.] <- empty;
  3. K <- next character in charstream;
  4. Is [.c.]K in string table?
    • yes:
      • [.c.] <- [.c.]K;
      • go to [3];
    • no:
      • add [.c.]K to the string table;
      • output the code for [.c.] to the codestream;
      • [.c.] <- K;
      • go to [3];

It's as simple as that! Of course, when you get to step [3] and there aren't any more characters left, you just output the code for [.c.] and throw the table away. You're done.

Wanna do an example? Let's pretend we have a four-character alphabet: A,B,C,D. The charstream looks like ABACABA. Let's compress it. First, we initialize our string table to: #0=A, #1=B, #2=C, #3=D. The first character is A, which is in the string table, so [.c.] becomes A. Next we get AB, which is not in the table, so we output code #0 (for [.c.]), and add AB to the string table as code #4. [.c.] becomes B. Next we get [.c.]A = BA, which is not in the string table, so output code #1, and add BA to the string table as code #5. [.c.] becomes A. Next we get AC, which is not in the string table. Output code #0, and add AC to the string table as code #6. Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the table. Output #2 for C, and add CA to table as code#7. Now [.c.] becomes A. Next we get AB, which IS in the string table, so [.c.] gets AB, and we look at ABA, which is not in the string table, so output the code for AB, which is #4, and add ABA to the string table as code #8. [.c.] becomes A. We can't get any more characters, so we just output #0 for the code for A, and we're done. So, the codestream is #0#1#0#2#4#0.

A few words (four) should be said here about efficiency: use a hashing strategy. The search through the string table can be computationally intensive, and some hashing is well worth the effort. Also, note that "straight LZW" compression runs the risk of overflowing the string table - getting to a code which can't be represented in the number of bits you've set aside for codes. There are several ways of dealing with this problem, and GIF implements a very clever one, but we'll get to that.

An important thing to notice is that, at any point during the compression, if [...]K is in the string table, [...] is there also. This fact suggests an efficient method for storing strings in the table. Rather than store the entire string of K's in the table, realize that any string can be expressed as a prefix plus a character: [...]K. If we're about to store [...]K in the table, we know that [...] is already there, so we can just store the code for [...] plus the final character K.

Ok, that takes care of compression. Decompression is perhaps more difficult conceptually, but it is really easier to program.

Here's how it goes: We again have to start with an initialized string table. This table comes from what knowledge we have about the charstream that we will eventually get, like what possible values the characters can take. In GIF files, this information is in the header as the number of possible pixel values. The beauty of LZW, though, is that this is all we need to know. We will build the rest of the string table as we decompress the codestream. The compression is done in such a way that we will never encounter a code in the codestream that we can't translate into a string.

We need to define something called a "current code", which I will refer to as "<code>", and an "old-code", which I will refer to as "<old>". To start things off, look at the first code. This is now <code>. This code will be in the intialized string table as the code for a root. Output the root to the charstream. Make this code the old-code <old>. *Now look at the next code, and make it <code>. It is possible that this code will not be in the string table, but let's assume for now that it is. Output the string corresponding to <code> to the codestream. Now find the first character in the string you just translated. Call this K. Add this to the prefix [...] generated by <old> to form a new string [...]K. Add this string [...]K to the string table, and set the old-code <old> to the current code <code>. Repeat from where I typed the asterisk, and you're all set. Read this paragraph again if you just skimmed it!!! Now let's consider the possibility that <code> is not in the string table. Think back to compression, and try to understand what happens when you have a string like P[...]P[...]PQ appear in the charstream. Suppose P[...] is already in the string table, but P[...]P is not. The compressor will parse out P[...], and find that P[...]P is not in the string table. It will output the code for P[...], and add P[...]P to the string table. Then it will get up to P[...]P for the next string, and find that P[...]P is in the table, as the code just added. So it will output the code for P[...]P if it finds that P[...]PQ is not in the table. The decompressor is always "one step behind" the compressor. When the decompressor sees the code for P[...]P, it will not have added that code to it's string table yet because it needed the beginning character of P[...]P to add to the string for the last code, P[...], to form the code for P[...]P. However, when a decompressor finds a code that it doesn't know yet, it will always be the very next one to be added to the string table. So it can guess at what the string for the code should be, and, in fact, it will always be correct. If I am a decompressor, and I see code#124, and yet my string table has entries only up to code#123, I can figure out what code#124 must be, add it to my string table, and output the string. If code#123 generated the string, which I will refer to here as a prefix, [...], then code#124, in this special case, will be [...] plus the first character of [...]. So just add the first character of [...] to the end of itself. Not too bad. As an example (and a very common one) of this special case, let's assume we have a raster image in which the first three pixels have the same color value. That is, my charstream looks like: QQQ.... For the sake of argument, let's say we have 32 colors, and Q is the color#12. The compressor will generate the code sequence 12,32,.... (if you don't know why, take a minute to understand it.) Remember that #32 is not in the initial table, which goes from #0 to #31. The decompressor will see #12 and translate it just fine as color Q. Then it will see #32 and not yet know what that means. But if it thinks about it long enough, it can figure out that QQ should be entry#32 in the table and QQ should be the next string output. So the decompression pseudo-code goes something like:

  1. Initialize string table;
  2. get first code: <code>;
  3. output the string for <code> to the charstream;
  4. <old> = <code>;
  5. <code> <- next code in codestream;
  6. does <code> exist in the string table?
    • yes:
      • output the string for <code> to the charstream;
      • [...] <- translation for <old>;
      • K <- first character of translation for <code>;
      • add [...]K to the string table;
      • <old> <- <code>;
    • no:
      • [...] <- translation for <old>;
      • K <- first character of [...];
      • output [...]K to charstream and add it to string table;
      • <old> <- <code>
  7. go to [5];

Again, when you get to step [5] and there are no more codes, you're finished. Outputting of strings, and finding of initial characters in strings are efficiency problems all to themselves, but I'm not going to suggest ways to do them here. Half the fun of programming is figuring these things out!

Now for the GIF variations on the theme. In part of the header of a GIF file, there is a field, in the Raster Data stream, called "code size". This is a very misleading name for the field, but we have to live with it. What it is really is the "root size". The actual size, in bits, of the compression codes actually changes during compression/decompression, and I will refer to that size here as the "compression size". The initial table is just the codes for all the roots, as usual, but two special codes are added on top of those. Suppose you have a "code size", which is usually the number of bits per pixel in the image, of N. If the number of bits/pixel is one, then N must be 2: the roots take up slots #0 and #1 in the initial table, and the two special codes will take up slots #4 and #5. In any other case, N is the number of bits per pixel, and the roots take up slots #0 through #(2**N-1), and the special codes are (2**N) and (2**N + 1). The initial compression size will be N+1 bits per code. If you're encoding, you output the codes (N+1) bits at a time to start with, and if you're decoding, you grab (N+1) bits from the codestream at a time. As for the special codes: <CC> or the clear code, is (2**N), and <EOI>, or end-of-information, is (2**N + 1). <CC> tells the compressor to re-initialize the string table, and to reset the compression size to (N+1). <EOI> means there's no more in the codestream. If you're encoding or decoding, you should start adding things to the string table at <CC> + 2. If you're encoding, you should output <CC> as the very first code, and then whenever after that you reach code #4095 (hex FFF), because GIF does not allow compression sizes to be greater than 12 bits. If you're decoding, you should reinitialize your string table when you observe <CC>. The variable compression sizes are really no big deal. If you're encoding, you start with a compression size of (N+1) bits, and, whenever you output the code (2**(compression size)-1), you bump the compression size up one bit. So the next code you output will be one bit longer. Remember that the largest compression size is 12 bits, corresponding to a code of 4095. If you get that far, you must output <CC> as the next code, and start over. If you're decoding, you must increase your compression size AS SOON AS YOU write entry #(2**(compression size) - 1) to the string table. The next code you READ will be one bit longer. Don't make the mistake of waiting until you need to add the code (2**compression size) to the table. You'll have already missed a bit from the last code. The packaging of codes into a bitsream for the raster data is a potential stumbling block for the novice encoder or decoder. The lowest order bit in the code should coincide with the lowest available bit in the first available byte in the codestream. For example, if you're starting with 5-bit compression codes, and your first three codes are, say, <abcde>, <fghij>, <klmno>, where e, j, and o are bit#0, then your codestream will start off like:

      byte#0: hijabcde
      byte#1: .klmnofg

So the differences between straight LZW and GIF LZW are: two additional special codes and variable compression sizes. If you understand LZW, and you understand those variations, you understand it all!

Just as sort of a P.S., you may have noticed that a compressor has a little bit of flexibility at compression time. I specified a "greedy" approach to the compression, grabbing as many characters as possible before outputting codes. This is, in fact, the standard LZW way of doing things, and it will yield the best compression ratio. But there's no rule saying you can't stop anywhere along the line and just output the code for the current prefix, whether it's already in the table or not, and add that string plus the next character to the string table. There are various reasons for wanting to do this, especially if the strings get extremely long and make hashing difficult. If you need to, do it.

Hope this helps out. ----steve blackstock

Recommended Links

Google matched content

Softpanorama Recommended

Top articles



compression FAQ


Data Compression Debra A. Lelewer and Daniel S. Hirschberg

Sometimes you can find old O'Reilly graphic formats book on the web too:

Encyclopedia of Graphics File Formats, 2nd Edition
The Complete Reference on CD-ROM with Links to Internet Resources

By James D. Murray, William vanRyper
2nd Edition May 1996

It contains a chapter about graphic algorithms, see TOC

Recommended Papers

AnandTech - All About Compression, Part I

Games++ Games & Game Programming-Algorithms, Tutorials, Source Code, Direct X, C-C++, Java, and Much More.

E186-1 Michael Schindler

Huffman code

Queue, Huffman, Compression and Encryption Applets in C++

Forums - Efficient Huffman implementation is certainly possible to construct the Huffman code without an explicit tree. The main observation is that there is a canonical Huffman code entirely defined by the code lengths of the letters. And, there is a simple data structure to compute the latter in linear time after a sorting phase without explicitly constructing the tree. This does save half the memory required for an explicit tree. For example, see my Huffman implementation in the Vcodex package at



LZW(Lempel-Ziv-Welch) is the most common algorithm used in computer graphics. This lossless method of data compression is found in several image file formats, such as GIF and TIFF, and is also part of the V.42bis modem compression standard and PostScript Level 2.

In 1977, Abraham Lempel and Jakob Ziv created the first of what we now call the LZ family of substitutional compressors. The LZ77 compression algorithms are commonly found in text compression and archiving programs, such as compress, zoo, lha, pkzip, and arj. The LZ78 compression algorithms are more commonly used to compress binary data, such as bitmaps.

In 1984, while working for Unisys, Terry Welch modified the LZ78 compressor for implementation in high-performance disk controllers. The result was the LZW algorithm that is commonly found today. It is covered by U.S. Patent 4,558,302 (plus its foreign counterparts, issued or pending). All patents are held by Unisys Corporation. That means that without obtaining a license from Unisys you formally cannot read or write a GIF file :-(. See TF_WP_unisys for more information.

Bijective compression

I recommend first check - Tutorials, Reference, Presentations

Creating a ''One to One Adaptive Huffman Compression-Decompression Program for the real world of 8 bit byte files

The Mandala Centre - Compression and Security - One on one compression FAQ

One to One Compression -- This site discusses a characteristic of some compression algorithms that the author refers to as One to One compression. In a nutshell, this property means that for any file X, F( F'( X ) ) == X. (F is either the compressor or decompressor, and F' is its opposite number.) This is definitely not the case for most conventional compression algorithms.

Bijective Arithmetic Encoding with Really Good End Treatment - Adaptive Huffman Coding
... Adaptive Huffman Encoding, Rate, A library to perform adpative ... an implementation of
Vitter's dynamic Huffman compressor, adapted so that it is bijective. ...


Phil Katz - Wikipedia

patenting gifs, lzw compression, unisys gif patent, software for gifs, products using lzw compression algorithm, png file format, development of png file format

Gifs and patents

The patent for LZW compression algorithm is held by Unisys since 1985. Compuserve Gif uses this algorithm. In 1994, Unisys implemented their patent rights by asking for license fees from developers of products that read or write gifs. This announcement took developers by surprise and caused quite a stir on the Internet. There were rumours that Unisys might charge web developers for usage of gifs on their sites. At the same time, it was argued that Gifs are the products of LZW compression and the patent does not extend to the usage of the end product. Actually, web developers had nothing to fear since Unisys planned to collect license fees only from software companies whose products employ LZW algorithm.

Web developers should not be worried. They are free to use gifs on their web sites. However, if you've developed a software that creates or modifies gifs, you would have to pay licensing fees to Unisys.

The business acumen of the people at Unisys has to be admired. It seems that they had waited for Gifs to become popular and beneficial (from a web developers' point of view) before implementing the patent rights. However, there was an interesting and fortunate (?) side effect of this story. It lead to the development of the PNG file format. PNG is a much better and more versatile image format than both JPG and GIF. It has all the bells and whistles of these two file formats.

At the time of writing the browser support for PNG is still quite patchy, though the future does look bright.

LZW compression: conjugations patented

Google Groups View Thread LZW Patent Expiry

Can the LZW algorithm be used freely in all countries after the US
patent expiry in June 2003? I have gathered this information from

1. The main US patent expires the 20 June 2003. The valid time can not
be extended, if they do not make adjustments/extensions to the patent.
US patent:

2. There are other US patents covering this field, but these are filed
in 1999, and are therefore not enforceable. Are there other US

3. Unisys claim that they have patents in Canada, France, Germany,
U.K. and Italy, but they have never sued any company in European
court though they have had chances. There are some references in the
US patent: CA1223965, DE3476617D, EP0129439. Are these patents
enforceable? Are there other patents?

4. Patents are pending in Japan. The US patent refers to these:
JP2610084B2, JP5068893B, JP60116228, JP7249996, JP8237138. Do we have
to worry about these?

Can anyone confirm the information above and answer the questions?
Does this mean that we can use the lzw algorithm freely in all
countries after June 2003?


FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusivly for research and educational purposes.   If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner. 

ABUSE: IPs or network segments from which we detect a stream of probes might be blocked for no less then 90 days. Multiple types of probes increase this period.  


Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy


War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes


Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law


Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least

Copyright © 1996-2016 by Dr. Nikolai Bezroukov. was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.

The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to make a contribution, supporting development of this site and speed up access. In case is down you can use the at


The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: May, 04, 2018