Tuesday, January 18, 2022

Lagrangian RDO PNG

It turns out PNG is very amenable to RDO (rate-distortion optimization) approaches, but few have really tried.

This is something I've been wanting to try for a while. This experiment only injects 3-pixel matches into the PNG Paeth (#4) predictor bytes. It uses an accurate Deflate bit-price model, computed by first compressing the image to 24bpp PNG using predictor #4; the 3-pixel matches are then inserted.

Original PNG (oxipng): 16.06bpp


Lagrangian RDO preprocess on Paeth bytes+oxipng: 6.17bpp 28.141 dB


Note that Blogger seems to losslessly recompress these PNGs (why?), but if you run oxipng on them you get the expected size:
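The core decision the preprocessor makes can be sketched in a few lines. This is a hypothetical toy, not the codec's actual code (the function and table names are mine): for each candidate pixel match, accept the replacement only if it lowers the Lagrangian cost J = D + lambda * R, where D is the pixel distortion and R is the Deflate bit price of the replacement predictor bytes.

```python
# Hypothetical sketch of the Lagrangian decision. Names are illustrative.

def bit_price(residuals, price_table):
    # Bit cost of coding each residual byte under a static price model
    # (e.g. measured from a first-pass Deflate compression of the image).
    return sum(price_table[b] for b in residuals)

def rdo_choose(orig_residuals, orig_pixels, candidates, price_table, lam):
    """Pick the (residuals, pixels) option minimizing J = D + lam * R."""
    best = (orig_residuals, orig_pixels)
    best_j = lam * bit_price(orig_residuals, price_table)  # D = 0 for original
    for cand_res, cand_pix in candidates:
        d = sum((a - b) ** 2 for a, b in zip(orig_pixels, cand_pix))
        j = d + lam * bit_price(cand_res, price_table)
        if j < best_j:
            best, best_j = (cand_res, cand_pix), j
    return best

# Toy price model: zero residuals are cheap to Deflate, everything else isn't.
price = [1 if b == 0 else 8 for b in range(256)]
```

A large lambda favors rate (the cheap all-zero residual run wins despite its distortion); a small lambda favors keeping the original pixels.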


Monday, January 10, 2022

LZ_XOR on canterbury corpus

LZ_XOR 128KB dictionary, AVX2, BMI1, mid-level CPU parsing, Ice Lake CPU (Core i7 1065G7 @ 1.3GHz, Dell Laptop).


Only the XOR bytes are entropy coded; everything else (the control stream, the usually rare literal runs) is sent byte-wise. It uses 6-bit length-limited prefix codes in 16 streams, AVX2 gathers, and shuffle-based LUTs. I also have a two-gather version (one gather to get the bits, another to do the Huffman lookups) that decodes 2 symbols per gather, which is slightly faster (2.2 GiB/sec. vs. 1.9 GiB/sec.), but only on large buffers. I posted a pic of the cppspmd_fast inner loop on my Twitter.

BMI1 made very little if any difference that I could detect.

The compressor isn't optimized yet. It's like 100KB/sec. and all on one thread. That's next. LZ_XOR trades off strong parsing in the encoder for fewer instructions to execute in the decompressor, longer XOR matches, and usually rare (0.5-1%) literals.

Saturday, January 8, 2022

One nice property of highly constrained length limited prefix codes

I realized earlier that I don't need AVX2 for really fast Huffman (really, length-limited prefix) decoding. Using a 6-bit max Huffman code size, a 12-bit decode table (16KB, which fits in the cache) is guaranteed to always give you 2 symbols per lookup. With "normal" fast Huffman decoding using a 12-bit table/max code size, you get only 1 or 2. Big difference.
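Here's a toy illustration of why the guarantee holds, using a hypothetical 4-symbol canonical code (MSB-first, lengths well under the 6-bit cap): since no codeword exceeds 6 bits, the top 6 bits of any 12-bit window always contain a complete code, and so do the bits that follow it, so every table entry can hold two symbols plus the total bits consumed.

```python
# Sketch of the 2-symbols-per-lookup table build. Assumes a Kraft-complete,
# MSB-first canonical code with max length 6; layout/names are illustrative.

def build_single_table(code_lens, codes):
    # Classic 6-bit one-symbol table: index by the next 6 bits -> (sym, len).
    table = [None] * 64
    for sym, (length, code) in enumerate(zip(code_lens, codes)):
        for fill in range(1 << (6 - length)):
            table[(code << (6 - length)) | fill] = (sym, length)
    return table

def build_pair_table(code_lens, codes):
    # 12-bit table: every entry decodes exactly two symbols, because no
    # codeword exceeds 6 bits, so both halves of the window are decodable.
    one = build_single_table(code_lens, codes)
    table = [None] * 4096
    for bits in range(4096):
        s0, l0 = one[bits >> 6]                  # first code: top 6 bits
        s1, l1 = one[(bits >> (6 - l0)) & 63]    # second: the next 6 bits
        table[bits] = (s0, s1, l0 + l1)          # 2 symbols, bits consumed
    return table
```

The decoder then just indexes with the next 12 bits, emits two symbols, and advances the bit position by the stored length, with no per-symbol branch.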

This means fewer conditionals during decoding, and higher baseline throughput. The downside is you have to figure out how to exploit a constrained decoder like this, but I've already found one way (LZ_XOR). LZ_XOR is magical and fits a length-limited prefix decoder design like this well because during your partial matching searches you can just skip any non-codable partial matches (i.e. ones that use XOR delta bytes you can't code).

LZ_XOR has so much flexibility that it can just toss uncodable partial matches and it still has plenty of usable things it can code. LZ_XOR's delta bytes are limited to 32 unique decodable symbols - which turns out to be plenty.

The compiler's instruction selector can shorten a long XOR match to make it codable using other instructions.

At 6 bits, the algorithm you use to length-limit the Huffman codes really matters. The one I'm using now (from miniz) is dead simple and not very efficient. I need Package-Merge.
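For reference, package-merge is short enough to sketch in full. This is a minimal, unoptimized boundary package-merge (illustrative, not the codec's code): given symbol frequencies and a length limit, it returns optimal length-limited code lengths.

```python
# Boundary package-merge sketch: build L levels of "packages", then count
# how often each leaf appears in the cheapest 2n-2 items of the final level.

def package_merge(freqs, max_len):
    """Optimal length-limited prefix code lengths (Larmore-Hirschberg)."""
    n = len(freqs)
    if n == 1:
        return [1]
    assert n <= (1 << max_len), "alphabet too large for this length limit"
    # Each item is (weight, tuple of leaf symbols it contains).
    leaves = sorted((f, (i,)) for i, f in enumerate(freqs))
    lst = leaves                                  # level 1
    for _ in range(max_len - 1):                  # levels 2..max_len
        packaged = [(a[0] + b[0], a[1] + b[1])
                    for a, b in zip(lst[::2], lst[1::2])]
        lst = sorted(leaves + packaged)
    lens = [0] * n
    for _, syms in lst[:2 * n - 2]:               # cheapest 2n-2 items
        for s in syms:
            lens[s] += 1                          # leaf count = code length
    return lens
```

The resulting lengths satisfy the Kraft inequality with equality, so a canonical code can be built from them directly.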

Also, Universal codes can be decoded without lookups (to decode the index, anyway) using CLZ instructions. Unfortunately, doing CLZ with anything less than AVX-512 seems to require exploiting the FP instructions. In my tests with LZ_XOR, various modified types of Universal codes are only 1-4% less efficient than Huffman.
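Elias gamma is the simplest example of CLZ-driven decoding. In this hypothetical sketch, Python's int.bit_length() stands in for the hardware CLZ/LZCNT instruction (CLZ(x) = 32 - bit_length(x) on a 32-bit word); codes are assumed to fit in 32 bits, i.e. values below 2^16.

```python
def clz32(x):
    # Stand-in for the CLZ/LZCNT instruction on a 32-bit word.
    return 32 - x.bit_length()

def gamma_encode(values):
    """Pack Elias gamma codes (values >= 1) MSB-first into one big int."""
    bits, nbits = 0, 0
    for v in values:
        length = 2 * v.bit_length() - 1   # (bl-1) zeros, then v in bl bits
        bits = (bits << length) | v
        nbits += length
    return bits, nbits

def gamma_decode(bits, nbits):
    out, pos = [], 0
    while pos < nbits:
        # MSB-align the next (up to) 32 unread bits into a window; bits
        # already consumed shift past bit 31 and are masked away.
        take = min(32, nbits - pos)
        window = ((bits >> (nbits - pos - take)) << (32 - take)) & 0xFFFFFFFF
        lz = clz32(window)                # the CLZ step: no table lookup
        length = 2 * lz + 1               # total code length in bits
        out.append(window >> (32 - length))
        pos += length
    return out
```

The leading-zero count alone gives both the code length and the value's alignment, which is why no decode table is needed.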

Friday, January 7, 2022

LZ_XOR on enwik8

First results of LZ_XOR on enwik8 (a common 100MB test file of Wikipedia data, higher ratio is smaller file):

lzham_codec_devel: 74.93%  .205 GiB/sec decompress
Zstd:              69.03%  .639 
LZ_XOR 1/10:       63.72% 1.188 (36,277,397 bytes)
miniz (Deflate):   63.54%  .288 
LZ_XOR:            63.08% 1.204 (36,922,969 bytes)
LZ4:               58.09% 2.369 

In this run LZ_XOR used a 128KB dictionary and I limited its parsing depth to speed up the encode.

Options:

LZ4_HC: level LZ4HC_CLEVEL_MAX
Zstd: level 11
lzham_codec_devel: defaults
LZ_XOR: 128KB, GPU assisted, exhaustive partial match searching disabled

Some LZ_XOR statistics:

Total LIT: 263887
Total XOR: 2380411
Total COPY: 9544076

Total used distance history: 1238007

Total LIT bytes: 755210
Total XOR bytes: 15157561
Total COPY bytes: 84087229

Total instructions: 12188374

Thursday, January 6, 2022

Other LZ_XOR variants

Other LZ instruction set variants that can utilize XOR:

1. ROLZ_XOR: Reduced-Offset LZ that uses XOR partial matches. Simplifies the search, but the decompressor has to keep the offsets table up to date, which costs extra memory and time. Can issue just XORs, or LIT/XOR, or LIT/COPY/XOR.

I implemented a ROLZ variant of LZHAM a few years back but the decompressor overhead wasn't acceptable to me (otherwise it worked surprisingly well).

2. LZRW_XOR: A variant of LZRW that issues XOR or XOR+LIT instructions instead of matches. The compression search problem is simplified, although the system is more complex vs. plain LZRW. XOR is more flexible vs. COPY (any distance can be utilized), so this should see higher gain vs. plain LZRW with a large enough hash table.


LZ_XOR

If LZ compression is compilation, what other instructions are useful to add? I've spent the last several years, off and on, trying to figure this out.

LZSS has LIT [byte] and COPY [length, distance]. LZ_XOR is like LZSS but with one new instruction the decompressor can execute. This variant adds XOR [length, distance, 1 or more bytes].

XOR (or ADD) gives the system partial matches, which is particularly useful on GPU texture data, log files, object/executable files, bitmaps, and audio files. In LZ, the COPY instruction must refer to dictionary positions that perfectly match the input file bytes (otherwise, the decompressor will copy the wrong bytes!). LZ_XOR can use any match distance/length pair, even referring to bytes in the dictionary which don't perfectly match (or match at all!) the bytes to compress, although not every distance/length pair will result in an efficient partial match. The compiler's goal is to find those XOR's that result in partial matches which are efficient to code.

An LZ_XOR compiler ("compressor" in LZ parlance) can optimize for minimum total Hamming distance between the lookahead buffer and the previously encoded/emitted bytes in the sliding dictionary. Or, the compressor can first optimize for minimal Hamming distance in one pass, and then for minimal bit prices in a second optimal parsing (optimizing compiler) pass.
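As a sketch of the first (Hamming) pass, here's a hypothetical brute-force searcher (the names are mine; the real compressor does this at scale, with GPU help). It scores every window in the sliding dictionary by the number of bytes that differ from the lookahead:

```python
def hamming(a, b):
    # Byte-level Hamming distance: count of byte positions that differ.
    return sum(x != y for x, y in zip(a, b))

def best_partial_match(data, pos, match_len, max_dist):
    """Return (distance, cost) of the dictionary window whose XOR patch
    against data[pos:pos+match_len] touches the fewest bytes."""
    target = data[pos:pos + match_len]
    best_dist, best_cost = 0, match_len + 1
    for src in range(max(0, pos - max_dist), pos):
        # Overlapping sources (src + match_len > pos) are fine, as in LZ:
        # the decompressor produces output bytes left to right.
        cost = hamming(data[src:src + match_len], target)
        if cost < best_cost:
            best_dist, best_cost = pos - src, cost
    return best_dist, best_cost
```

A perfect match comes back with cost 0 (an ordinary COPY); any other window is a candidate XOR whose cost is the number of nonzero patch bytes to entropy code.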

LZ_XOR is surprisingly strong and flexible:

1. You can constrain the XOR instruction in the compiler (parser) to only use a limited # of unique symbols. 

This property is valuable when implementing shuffle-based Huffman decoding in AVX2/AVX-512, which is limited to only 16 or 32 unique symbols.

2. XOR is a superset of plain LZ: With 8-bit symbols you don't need LIT's or COPY instructions at all ("pure" LZ_XOR, or XOR-only). This simplifies the decompressor (eliminating unpredictable jumps), and simplifies the instruction stream so less signaling needs to be encoded. Everything is an XOR so there's no need to tell the decompressor that the instruction is a LIT or COPY.

3. Most of the usual tricks used to increase LZSS's ratio by other codecs (more contexts, LZMA-like state machines+contexts, REP matches, circular match history buffers, fast entropy coding, larger dictionaries, etc.) are compatible with LZ_XOR too. I've implemented and tested several LZ_XOR variants with GPU parsing that are LZ4-like and Zstd-like, along with ideas from LZMA/LZHAM. 

4. A file compressed with LZ_XOR will have significantly fewer instructions to execute vs. LZSS to decompress the file (roughly 30-60% fewer). This results in a decompressor which spends more time copying or XOR'ing bytes, and less time unpacking and interpreting the compressed instruction stream.

5. Another strong variant on some files (audio, bitmaps) is LZ_ADD. Instead of optimizing for minimum total Hamming distance, you optimize for minimum total absolute error.

6. LZ_XOR is strong on incompressible files (with very few matches), such as GPU texture data. These are the files that LZ4 fails to compress well.

7. LZ_XOR is like LZMA's or LZHAM's LAMs ("Literals after Matches") taken to the next level. 
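The difference between item 5's metric and LZ_XOR's is easy to show concretely (hypothetical helper names). On smooth data the XOR metric scores every slightly-off byte as a full mismatch, while the ADD metric sees small deltas that are cheap to code:

```python
def xor_cost(src, dst):
    # LZ_XOR's first-pass score: bytes that differ (byte-level Hamming).
    return sum(a != b for a, b in zip(src, dst))

def add_cost(src, dst):
    # LZ_ADD's score: total absolute error, a better fit for audio/bitmaps.
    return sum(abs(a - b) for a, b in zip(src, dst))

# A smooth ramp shifted by small deltas, e.g. consecutive audio samples:
src, dst = [10, 20, 30, 40], [11, 22, 33, 44]
```

Under XOR every byte of this window is "wrong"; under ADD the window is a near-match whose small delta bytes cluster near zero and entropy code cheaply.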

Example disassembly of a simple LZ_XOR system with only LIT and XOR, constrained to only use 32 unique XOR bytes (no copies - they are implemented as XORs with all-0 XOR bytes):


Another example that uses XOR, COPY, and LITs, along with REP0/1 distances (from LZMA), and a significantly more complex control/instruction stream:


Here's "Alice in Wonderland" compressed with plain LZSS on the left, and LZ_ADD on the right. Notice how much faster it makes progress through the file vs. LZSS:


Compressing the 24-character ASCII string "executable files go here". The first run uses LZ-style LIT+COPY plus a new instruction, ADD. The second run uses just LIT+COPY:


XOR-only decompression can be done in two simple steps, using a single output buffer: first entropy decode the XOR bytes into a buffer the size of the output block. This can be done at ~1 GB/sec. using a fast order-0 Huffman or FSE decoder, such as this one. The bulk of the XOR byte values will be 0. The frequency histogram will have one enormous spike with a quick falloff.

Next, execute the XOR control stream and XOR the bytes in the sliding dictionary with the "patch" bytes in the lookahead buffer. (The "patch" bytes are in the brackets in the above disassemblies. In XOR-only there's guaranteed to be a single XOR byte for every byte in the block.) This step is roughly 1-1.5 GiB/sec. It's in-place so it's very cache friendly.
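A toy model of the two steps (assuming step 1 has already entropy-decoded one "patch" byte per output byte; the structure and names are illustrative, not the codec's):

```python
def xor_only_decode(control, patches, first):
    """XOR-only step 2. control: list of (length, distance) instructions;
    patches: one entropy-decoded patch byte per output byte; first: bytes
    emitted verbatim to seed the dictionary (they have no history yet)."""
    out = bytearray(first)
    pos = len(out)
    for length, dist in control:
        for _ in range(length):
            # Each output byte is a dictionary byte XOR'ed with its patch.
            # A zero patch byte reproduces a plain LZ copy for free.
            out.append(out[-dist] ^ patches[pos])
            pos += 1
    return bytes(out)
```

Because the mostly-zero patch bytes come from one contiguous, already-decoded buffer, this second pass is branch-light and cache friendly, matching the ~1-1.5 GiB/sec. figure above in spirit.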

LZ_XOR places very heavy demands on the compressor's parser to find the partial matches with minimal Hamming distance (or emitted bits), making it a good fit for GPU parsing. My experimental compiler currently evaluates every single possible way to code the lookahead bytes against all the bytes in the sliding dictionary. It uses Dijkstra's shortest path algorithm, where each edge is an instruction, the costs are bit prices, and the nodes are lookahead byte positions. There are numerous heuristics and search optimizations that can be done to speed up partial matching. This is the Approximate String Matching Problem.
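Since every instruction advances the write position, the shortest-path search can be sketched as a left-to-right relaxation over a DAG (a hypothetical skeleton; real edge prices come from the entropy coder's bit-price model):

```python
# Nodes are byte positions 0..n, edges are candidate instructions priced in
# bits. With forward-only edges, Dijkstra degenerates into a simple DP.

def optimal_parse(n, candidates, lit_price=9):
    """candidates(pos) yields (advance, price) pairs: instructions coding
    `advance` bytes starting at pos. A 1-byte literal is the fallback edge."""
    INF = float("inf")
    cost = [INF] * (n + 1)
    back = [None] * (n + 1)
    cost[0] = 0
    for pos in range(n):
        if cost[pos] == INF:
            continue
        for advance, price in [(1, lit_price)] + list(candidates(pos)):
            nxt = min(pos + advance, n)
            if cost[pos] + price < cost[nxt]:
                cost[nxt] = cost[pos] + price
                back[nxt] = pos
    # Walk back pointers to recover the chosen instruction boundaries.
    path, pos = [], n
    while pos:
        path.append(pos)
        pos = back[pos]
    return cost[n], path[::-1]
```

The heavy lifting is all inside `candidates`: generating cheap, codable partial matches is exactly the approximate-string-matching problem the post describes.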

LZ_XOR gives an RDO GPU texture compressor a lot more freedom to distort a texture block's encoded bytes to increase ratio. With plain LZ/LZSS-style systems, your primary tool to increase the compression ratio (and trade off texture quality) is to increase the number and length of the LZ matches in the encoded texture data. With LZ_XOR you can replace bytes with other bytes which minimize the Hamming distance between the block you're encoding and the already coded blocks in the sliding dictionary. This is a more flexible way of increasing distortion without slamming in large 3-4 byte matches.

While building the above experimental codecs, I also would use the GPU to compute Hamming Correlation Matrix visualizations. This is for the first 494 bytes of alice29.txt (Alice in Wonderland). Each row represents how much the 18 byte pattern starting at that file offset correlates with all the previous byte patterns. White=low Hamming distance, black=high distance.