LZW Compression

Final Project: Tufts University COMP150 - Special Topics Data Structures and Algorithms

Predecessors

The LZW algorithm is based on LZ78 algorithm, which was published by Abraham Lempel and Jacob Ziv in 1978. Terry Welch obtained a patent on an LZW implementation in 1983, and the algorithm was published the following year. Both of these algorithms (along with LZ78's predecessor, LZ77) come from a class of compression algorithms called dictionary coders, which use the fact that most inputs contain many sequences of characters which appear multiple times as a means to reduce file size.

LZ78

As LZW is so closely related to the somewhat simpler LZ78 algorithm, it worthwhile to understand how LZ78 works before moving on to LZW. LZ78 is based on a dictionary which maps binary codes to sequences of characters and vice versa. This dictionary is initialized to include all 256 possible characters, plus a code to indicate the end of the input file. As compression progresses, new sequences are added to the the dictionary up until it runs out of unique binary codes for these sequences that can be represented in some number of bits (often 12). Once the dictionary is full, no new entries are added, and compression continues using the (now unchanging) dictionary.

Compression with the LZ78 algorithm works as follows:

Decompression simply reverses these steps:

An Example

As an example to show how compression and decompression work, consider a reduced alphabet with only 3 characters, X, Y, and Z, and a dictionary of 16 entries (i.e. 4 bit codes), represented by the 16 hexidecimal digits. The message being encoded will be
XYZYZYXXYZXYZYYYYYYXYZY
The operation of the compressor will be on the left, and the decompressor on the right. The first line is the input, the next the output, and the table shows the state of the dictionary at the end of the step.

XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
01121504271119183
XYYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7

Skipping ahead a few steps, as they work like the ones before...

XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
XYZYZYXXYZXYZYYYYYYXYZY
0121504271119183
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7
0121504271119183
XYZYZYXXYZXYZYYYYYYXYZY
X 0 XYZY 8
Y 1 YY 9
Z 2 YYY A
[EOF] 3
XY 4
ZY 5
ZYX 6
XYZ 7

LZW

LZW is a simple variant of LZ78: when a code is output, instead of then outputting the code for the next single character in the input, it uses the next character as the beginning of a new character sequence. When compressing, this change is quite simple to implement. When decompressing, however, there is a complication: when the decompressor receives a code, what does it add to the dictionary? In most cases, it can just read the next code, then add the sequence corresponding to the previous code followed by the first character of the new sequence. The decompressor is then always one dictionary entry behind the compressor, so it is possible, if the compressor adds a new entry to the dictionary and then emits it as the next code, for the decompressor to encounter a code that is not in its dictionary. That particular case, the compressor knows that the unknown entry must begin with the character sequence that it just received. The last character of the new entry, then, must be the same as the first, because when the compressor added this code, the character after the known code would be the beginning of the current character sequence.

An Example

Consider compressing the same example as was used for LZ78 using LZW. Take note of the fact that, unlike above, the compressor has one extra entry in the dictionary compared to the decompressor:

XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F

Again skipping ahead, now to an instance of the special case where the decompressor encounters a code not yet in its dictionary:

XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
XYZYZYXXYZXYZYYYYYYXYZY
01251042A1DEC3
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F
01251042A1DEC3
XYZYZYXXYZXYZYYYYYYXYZY
X 0 YX 8
Y 1 XX 9
Z 2 XYZ A
[EOF] 3 ZX B
XY 4 XYZY C
YZ 5 YY D
ZY 6 YYY E
YZY 7 YYYX F

and so on. In this case LZW only saves a couple of codes as compared to LZ78, but then the message being encoded is extremely short, and the dictionary is extremely small. For larger inputs, LZW improves on LZ78 in two ways. First, it fills the dictionary up faster, which can improve the compression ratio initially because having more entries in the dictionary means more, longer character sequences that can be represented by a single code. Second, there is no requirement that every other code represent a single character, which hurts the compression ratio because a code for a single character in the output is longer than the usual single-byte representation of that character because of the need to distinguish it from codes representing multiple characters.


A copy of the slides from my in-class presentation on this material can be found here.