Predecessors
The LZW algorithm is based on LZ78 algorithm, which was published by
Abraham Lempel and Jacob Ziv in 1978. Terry Welch obtained a patent on an
LZW implementation in 1983, and the algorithm was published the following
year. Both of these algorithms (along with LZ78's predecessor, LZ77) come
from a class of compression algorithms called dictionary coders, which use
the fact that most inputs contain many sequences of characters which
appear multiple times as a means to reduce file size.
LZ78
As LZW is so closely related to the somewhat simpler LZ78 algorithm, it
worthwhile to understand how LZ78 works before moving on to LZW. LZ78 is
based on a dictionary which maps binary codes to sequences of characters
and vice versa. This dictionary is initialized to include all 256 possible
characters, plus a code to indicate the end of the input file. As
compression progresses, new sequences are added to the the dictionary up
until it runs out of unique binary codes for these sequences that can be
represented in some number of bits (often 12). Once the dictionary is
full, no new entries are added, and compression continues using the (now
unchanging) dictionary.
Compression with the LZ78 algorithm works as follows:
- Find the longest prefix of the input that exists in the
dictionary.
- Output the code corresponding to that prefix.
- If this prefix ends at the end of the input, output the code for the
prefix followed by the end-of-file code and stop. Otherwise, output the
code corresponding to the next single character in the input.
- Add the first prefix followed by the character just output to the
dictionary (if it isn't full), assigning this sequence the next
available code.
- Move position in the input to after the single character that was
output and repeat.
Decompression simply reverses these steps:
- Read the next two codes from the input. The second must represent a
single character, as can be seen from looking at the compression part
of the algorithm, unless the end of the file has been reached, in
which case there may only be one code, or the second one may be the
end-of-file code.
- Output the character sequences in the dictionary that correspond to
these two codes.
- Add the character sequence for the first code with the character for
the second code appended to it to the dictionary (if possible),
assigning this new sequence the next available code.
- Repeat.
An Example
As an example to show how compression and decompression work, consider a
reduced alphabet with only 3 characters, X, Y,
and Z, and a dictionary of 16 entries (i.e. 4 bit codes),
represented by the 16 hexidecimal digits. The message being encoded will
be
XYZYZYXXYZXYZYYYYYYXYZY
The operation of the compressor will be on the left, and the decompressor
on the right. The first line is the input, the next the output, and the
table shows the state of the dictionary at the end of the step.
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
01121504271119183
XYYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
Skipping ahead a few steps, as they work like the ones before...
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | XYZY | 8 |
Y | 1 | YY | 9 |
Z | 2 | YYY | A |
[EOF] | 3 | | |
XY | 4 | | |
ZY | 5 | | |
ZYX | 6 | | |
XYZ | 7 | | |
LZW
LZW is a simple variant of LZ78: when a code is output, instead of then
outputting the code for the next single character in the input, it uses
the next character as the beginning of a new character sequence. When
compressing, this change is quite simple to implement. When decompressing,
however, there is a complication: when the decompressor receives a code,
what does it add to the dictionary? In most cases, it can just read the
next code, then add the sequence corresponding to the previous code
followed by the first character of the new sequence. The decompressor is
then always one dictionary entry behind the compressor, so it is possible,
if the compressor adds a new entry to the dictionary and then emits it as
the next code, for the decompressor to encounter a code that is not in its
dictionary. That particular case, the compressor knows that the unknown
entry must begin with the character sequence that it just received. The
last character of the new entry, then, must be the same as the first,
because when the compressor added this code, the character after the known
code would be the beginning of the current character sequence.
An Example
Consider compressing the same example as was used for LZ78 using LZW. Take
note of the fact that, unlike above, the compressor has one extra entry in
the dictionary compared to the decompressor:
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
Again skipping ahead, now to an instance of the special case where the
decompressor encounters a code not yet in its dictionary:
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X | 0 | YX | 8 |
Y | 1 | XX | 9 |
Z | 2 | XYZ | A |
[EOF] | 3 | ZX | B |
XY | 4 | XYZY | C |
YZ | 5 | YY | D |
ZY | 6 | YYY | E |
YZY | 7 | YYYX | F |
and so on. In this case LZW only saves a couple of codes as compared to
LZ78, but then the message being encoded is extremely short, and the
dictionary is extremely small. For larger inputs, LZW improves on LZ78 in
two ways. First, it fills the dictionary up faster, which can improve the
compression ratio initially because having more entries in the dictionary
means more, longer character sequences that can be represented by a single
code. Second, there is no requirement that every other code represent a
single character, which hurts the compression ratio because a code for a
single character in the output is longer than the usual single-byte
representation of that character because of the need to distinguish it
from codes representing multiple characters.
A copy of the slides from my in-class presentation on this material can be
found here.