LZW Compression

Final Project: Tufts University COMP150 - Special Topics Data Structures and Algorithms

Lossless Data Compression

As the name would suggest, the objective of lossless data compression is to take an input and represent it as succinctly as possible such that the original data can be reconstructed perfectly based only on the compressed data and knowledge of how it was compressed. In contrast, lossy data compression attempts to reconstruct a good approximation to the original input, allowing a much greater degree of compression at the cost of some accuracy; this type of compression is common in multimedia applications, where the discarded data consists of fine details in an image or video that are not easily noticeable.

In lossless compression, the main measure of effectiveness is the compression ratio: the ratio of a file's uncompressed size to its compressed size. Thus, a compression ratio of 1 means that the file is the same size after compression as before, a compression ratio of 2 means that it is half its original size after compression, and so on. The compression ratio that a compressor can achieve depends greatly on the data being compressed, but there are various benchmarks (for example, Maximum Compression) that measure the ratios compression utilities achieve on an approximation of real-world data.
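As a concrete illustration, the ratio can be computed directly from the two file sizes. The short Python sketch below simply follows the definition above; the file names are hypothetical.

    import os

    def compression_ratio(original_path, compressed_path):
        # Ratio of uncompressed size to compressed size:
        # 1 means no change in size, 2 means the compressed file is half as large.
        return os.path.getsize(original_path) / os.path.getsize(compressed_path)

    # Hypothetical file names, purely for illustration.
    print(compression_ratio("book.txt", "book.txt.lzw"))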

The underlying reason that lossless data compression works is that most files contain decidedly non-random data, and thus contain patterns that can be used to represent the data in fewer bits. For example, the source for this webpage is written in plain text, and each character is represented as a byte. A byte can represent 256 distinct values, but there are many fewer than 256 (indeed, fewer than 128) distinct characters in the website source, so it is possible to uniquely represent each character in fewer than eight bits. The ways in which different compression algorithms exploit this redundancy vary greatly. If characters appear with radically different frequencies in the input file, it is possible to represent the more common ones with shorter bit sequences than the rarer ones; this is the basis of Huffman coding. Alternatively, if certain sequences of characters occur frequently, these sequences can (in any of various ways) be stored just once and then referenced when they reappear; this is the basis of dictionary coders, including the LZW algorithm.
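To make the distinct-character observation concrete, here is a small Python sketch (the input file name is hypothetical) that counts the distinct byte values in a file and the number of bits a fixed-width code would need to cover just those values:

    import math

    def distinct_symbol_stats(path):
        # Read the file as raw bytes and collect the set of distinct values present.
        with open(path, "rb") as f:
            distinct = set(f.read())
        # A fixed-width code needs ceil(log2(k)) bits to distinguish k symbols.
        bits_needed = max(1, math.ceil(math.log2(len(distinct))))
        return len(distinct), bits_needed

    # Hypothetical input file; a typical plain-text page uses well under 128 distinct bytes.
    count, bits = distinct_symbol_stats("index.html")
    print(f"{count} distinct byte values; {bits} bits per character suffice, versus 8 in the original")

Huffman coding and dictionary coders such as LZW go further than this fixed-width bound by exploiting symbol frequencies and repeated sequences, respectively.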