Predecessors

The LZW algorithm is based on LZ78 algorithm, which was published by Abraham Lempel and Jacob Ziv in 1978. Terry Welch obtained a patent on an LZW implementation in 1983, and the algorithm was published the following year. Both of these algorithms (along with LZ78's predecessor, LZ77) come from a class of compression algorithms called dictionary coders, which use the fact that most inputs contain many sequences of characters which appear multiple times as a means to reduce file size.

LZ78

As LZW is so closely related to the somewhat simpler LZ78 algorithm, it worthwhile to understand how LZ78 works before moving on to LZW. LZ78 is based on a dictionary which maps binary codes to sequences of characters and vice versa. This dictionary is initialized to include all 256 possible characters, plus a code to indicate the end of the input file. As compression progresses, new sequences are added to the the dictionary up until it runs out of unique binary codes for these sequences that can be represented in some number of bits (often 12). Once the dictionary is full, no new entries are added, and compression continues using the (now unchanging) dictionary.

Compression with the LZ78 algorithm works as follows:

Find the longest prefix of the input that exists in the dictionary.
Output the code corresponding to that prefix.
If this prefix ends at the end of the input, output the code for the prefix followed by the end-of-file code and stop. Otherwise, output the code corresponding to the next single character in the input.
Add the first prefix followed by the character just output to the dictionary (if it isn't full), assigning this sequence the next available code.
Move position in the input to after the single character that was output and repeat.

Decompression simply reverses these steps:

Read the next two codes from the input. The second must represent a single character, as can be seen from looking at the compression part of the algorithm, unless the end of the file has been reached, in which case there may only be one code, or the second one may be the end-of-file code.
Output the character sequences in the dictionary that correspond to these two codes.
Add the character sequence for the first code with the character for the second code appended to it to the dictionary (if possible), assigning this new sequence the next available code.
Repeat.

An Example

As an example to show how compression and decompression work, consider a reduced alphabet with only 3 characters, X, Y, and Z, and a dictionary of 16 entries (i.e. 4 bit codes), represented by the 16 hexidecimal digits. The message being encoded will be
XYZYZYXXYZXYZYYYYYYXYZY
The operation of the compressor will be on the left, and the decompressor on the right. The first line is the input, the next the output, and the table shows the state of the dictionary at the end of the step.

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

0121504271119183

XYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

01121504271119183

XYYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

0121504271119183

XYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

0121504271119183

XYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

Skipping ahead a few steps, as they work like the ones before...

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

0121504271119183

XYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

0121504271119183

XYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

0121504271119183

XYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

XYZYZYXXYZXYZYYYYYYXYZY

0121504271119183

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

0121504271119183

XYZYZYXXYZXYZYYYYYYXYZY

X	0	XYZY	8
Y	1	YY	9
Z	2	YYY	A
[EOF]	3
XY	4
ZY	5
ZYX	6
XYZ	7

LZW

LZW is a simple variant of LZ78: when a code is output, instead of then outputting the code for the next single character in the input, it uses the next character as the beginning of a new character sequence. When compressing, this change is quite simple to implement. When decompressing, however, there is a complication: when the decompressor receives a code, what does it add to the dictionary? In most cases, it can just read the next code, then add the sequence corresponding to the previous code followed by the first character of the new sequence. The decompressor is then always one dictionary entry behind the compressor, so it is possible, if the compressor adds a new entry to the dictionary and then emits it as the next code, for the decompressor to encounter a code that is not in its dictionary. That particular case, the compressor knows that the unknown entry must begin with the character sequence that it just received. The last character of the new entry, then, must be the same as the first, because when the compressor added this code, the character after the known code would be the beginning of the current character sequence.

An Example

Consider compressing the same example as was used for LZ78 using LZW. Take note of the fact that, unlike above, the compressor has one extra entry in the dictionary compared to the decompressor:

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

Again skipping ahead, now to an instance of the special case where the decompressor encounters a code not yet in its dictionary:

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

XYZYZYXXYZXYZYYYYYYXYZY

01251042A1DEC3

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

01251042A1DEC3

XYZYZYXXYZXYZYYYYYYXYZY

X	0	YX	8
Y	1	XX	9
Z	2	XYZ	A
[EOF]	3	ZX	B
XY	4	XYZY	C
YZ	5	YY	D
ZY	6	YYY	E
YZY	7	YYYX	F

and so on. In this case LZW only saves a couple of codes as compared to LZ78, but then the message being encoded is extremely short, and the dictionary is extremely small. For larger inputs, LZW improves on LZ78 in two ways. First, it fills the dictionary up faster, which can improve the compression ratio initially because having more entries in the dictionary means more, longer character sequences that can be represented by a single code. Second, there is no requirement that every other code represent a single character, which hurts the compression ratio because a code for a single character in the output is longer than the usual single-byte representation of that character because of the need to distinguish it from codes representing multiple characters.

A copy of the slides from my in-class presentation on this material can be found here.

LZW Compression

Final Project: Tufts University COMP150 - Special Topics Data Structures and Algorithms

Predecessors

LZ78

An Example

LZW

An Example