If LZ compression is compilation
, what other instructions are useful to add? I've spent the last several years, off and on, trying to figure this out.
LZSS has LIT [byte]
and COPY [length, distance]
. LZ_XOR is like LZSS
but with one new instruction the decompressor can execute. This variant adds XOR [length, distance, 1 or more bytes]
XOR (or ADD) gives the system partial matches, which is particularly useful on GPU texture data, log files, object/executable files, bitmaps, and audio files. In LZ, the COPY instruction must refer to dictionary positions that perfectly match the input file bytes (otherwise, the decompressor will copy the wrong bytes!). LZ_XOR can use any
match distance/length pair, even referring to bytes in the dictionary which don't perfectly match (or match at all
!) the bytes to compress, although not every distance/length pair will result in an efficient partial match. The compiler's goal is to find those XOR's that result in partial matches which are efficient to code.
A LZ_XOR compiler ("compressor" in LZ-parlance) can optimize for minimum total Hamming distance
between the lookahead buffer and the previously encoded/emitted bytes in the sliding dictionary. Or, the compressor can first optimize for minimal Hamming distance in one pass, and then minimal bit prices in a second optimal parsing (optimizing compiler) pass.
LZ_XOR is surprisingly strong and flexible:
1. You can constrain the XOR instruction in the compiler (parser) to only use a limited # of unique symbols.
This property is valuable when implementing shuffle-based Huffman decoding in AVX2/AVX-512, which is limited to only 16 or 32 unique symbols.
2. XOR is a superset of plain LZ: With 8-bit symbols you don't need LIT's or COPY instructions at all ("pure" LZ_XOR, or XOR-only). This simplifies the decompressor (eliminating unpredictable jumps), and simplifies the instruction stream so less signaling needs to be encoded. Everything is an XOR so there's no need to tell the decompressor that the instruction is a LIT or COPY.
3. Most of the usual tricks used to increase LZSS's ratio by other codecs (more contexts, LZMA-like state machines+contexts, REP matches, circular match history buffers, fast entropy coding, larger dictionaries, etc.) are compatible with LZ_XOR too. I've implemented and tested several LZ_XOR variants with GPU parsing that are LZ4-like and Zstd-like, along with ideas from LZMA/LZHAM.
4. A file compressed with LZ_XOR will have significantly less overall instructions to execute vs. LZSS to decompress the file (roughly 30-60% less). This results in a decompressor which spends more time copying or XOR'ing bytes, and less time unpacking and interpreting the compressed instruction stream.
5. Another strong variant on some files (audio, bitmaps) is LZ_ADD. Instead of optimizing for minimum total Hamming distance, you optimize for minimum total absolute error.
6. LZ_XOR is strong on uncompressible files (with very few matches), such as GPU texture data. These are the files that LZ4 fails to compress well.
7. LZ_XOR is like LZMA's or LZHAM's LAM's ("Literals after Matches") taken to the next level.
Example disassembly of a simple LZ_XOR system with only LIT and XOR constrained to only use 32 unique XOR bytes (no copies - they are implemented as XOR's with all-0 XOR bytes):
Another example that uses XOR, COPY, and LITS, along with REP0/1 distances (from LZMA), and a significantly more complex control/instruction stream:
Here's "Alice in Wonderland" compressed with plain LZSS on the left, and LZ_ADD on the right. Notice how much faster it makes progress through the file vs. LZSS:
Compressing the 25 char ASCII string "executable files go here". First run uses LZ-style LIT+COPY plus a new instruction, ADD. Second run just uses LIT+COPY:
XOR-only decompression can be done in two simple steps, using a single output buffer: first entropy decode the XOR bytes into a buffer the size of the output block. This can be done at ~1 GB/sec. using a fast order-0 Huffman or FSE decoder, such as this one
. The bulk of the XOR byte values will be 0. The frequency histogram will have one enormous spike with a quick falloff.
Next, execute the XOR control stream and XOR the bytes in the sliding dictionary with the "patch" bytes in the lookahead buffer. (The "patch" bytes are in the brackets in the above disassemblies. In XOR-only there's guaranteed to be a single XOR byte for every byte in the block.) This step is roughly 1-1.5 GiB/sec. It's in-place so it's very cache friendly.
LZ_XOR places very
heavy demands on the compressor's parser to find the partial matches with minimal Hamming distance (or emitted bits), making it a good fit for GPU parsing. My experimental compiler currently evaluates every single possible way to code the lookahead bytes against all the bytes in the sliding dictionary. It uses Dijkstra's shortest path algorithm
, where each edge is an instruction, the costs are bit prices, and the nodes are lookahead byte positions. There are numerous heuristics and search optimizations that can be done to speed up partial matching. This is the Approximate String Matching Problem
LZ_XOR gives an RDO GPU texture compressor
a lot more freedom to distort a texture block's encoded bytes to increase ratio. With plain LZ/LZSS-style systems, your primary tool to increase the compression ratio (and trade off texture quality) is to increase the number and length of the LZ matches in the encoded texture data. With LZ_XOR you can replace bytes with other bytes which minimize the Hamming distance between the block you're encoding and the already coded blocks in the sliding dictionary. This is a more flexible way of increasing distortion without slamming in large 3-4 byte matches.
While building the above experimental codecs, I also would use the GPU to compute Hamming Correlation Matrix visualizations. This is for the first 494 bytes of alice29.txt (Alice in Wonderland). Each row represents how much the 18 byte pattern starting at that file offset correlates with all the previous byte patterns. White=low Hamming distance, black=high distance.