Obtaining and Building the Source Code
The source code for the compressor and the data analysis code are both available on GitHub here. Building the compressor requires g++; it has been tested with GCC 6.3, and should also compile correctly with GCC 5. The analysis code requires Wolfram Mathematica. It has only been tested on Mathematica 11, though it should work with any reasonably recent version.
Once the source code has been acquired, move into the src directory and run make. This will create the bin directory alongside the src directory, and the lzw executable inside of it.
Invocation and Options
The lzw program should be invoked as follows:
Statistics File Format
The files generated with the -s flag above during compression are normal ASCII comma-separated values files. Each line corresponds to the state of the compressor after emitting a code. Note that in general all sizes are measured in bits. The columns represent, in order:
- Step: the number of codes that have been emitted so far. Note that this is not the position in the input file, and for modes with variable-width codes it doesn't directly correspond to position in the output file either.
- Last input size: the number of bits represented by the code that was just emitted. For instance, if the code represented a sequence of 4 characters, the last input size would be 32.
- Last code size: the number of bits in the last code that was emitted.
- Total input size: the total number of input bits that have been compressed so far. If the statistics log interval is set to 1 (see the -i option above), this should equal the sum of all of the last input size entries so far.
- Total output size: the total number of bits that have been output into the compressed file so far. Again, if the statistics log interval is 1, this should be the sum of all last code size entries so far.
- Dictionary size: the total number of entries in the dictionary so far, including all of the initial dictionary entries.
- Last dictionary entry size: the number of bits in the entry added to the dictionary when emitting the current code, or zero if no entry was added (for instance, if the dictionary was full).
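The column layout above can be read back with a few lines of Python. The following is only a minimal sketch, not part of the project: the field names, the helper functions, and the sample rows are all invented for illustration, and it assumes the file has no header row and that every field is an integer. It parses the seven columns in order, checks that the running totals match the per-step sizes (which should hold when the log interval is 1), and computes an overall output-to-input ratio from the last row.

```python
import csv
import io
from typing import List, NamedTuple


class StatRow(NamedTuple):
    # Hypothetical field names; the order follows the column list above.
    step: int
    last_input_bits: int
    last_code_bits: int
    total_input_bits: int
    total_output_bits: int
    dict_entries: int
    last_entry_bits: int


def parse_stats(text: str) -> List[StatRow]:
    """Parse statistics CSV text, assuming no header row and integer fields."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        rows.append(StatRow(*(int(field) for field in record)))
    return rows


def totals_consistent(rows: List[StatRow]) -> bool:
    """Check that each total equals the running sum of the per-step sizes.

    This is only expected to hold when the log interval is 1.
    """
    running_in = running_out = 0
    for row in rows:
        running_in += row.last_input_bits
        running_out += row.last_code_bits
        if running_in != row.total_input_bits:
            return False
        if running_out != row.total_output_bits:
            return False
    return True


def compression_ratio(rows: List[StatRow]) -> float:
    """Output bits divided by input bits, taken from the last logged row."""
    last = rows[-1]
    return last.total_output_bits / last.total_input_bits


# Invented sample rows, not real compressor output.
sample = "1,8,9,8,9,257,16\n2,16,9,24,18,258,24\n3,8,9,32,27,258,0\n"
rows = parse_stats(sample)
```

With a real statistics file, the text passed to parse_stats would come from reading the file written with the -s flag. If the file turns out to include a header row, that line would need to be skipped before converting fields to integers.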