Saturday, January 8, 2022

One nice property of highly constrained length limited prefix codes

I realized earlier that I don't need AVX2 for really fast Huffman (really length limited prefix) decoding. Using a 6-bit max Huffman code size, a 12-bit decode table (16KB - which fits in the cache) is guaranteed to always give you 2 symbols per lookup. With "normal" fast Huffman decoding using a 12-bit table/max code size, you only get 1 or 2. Big difference.

This means less conditionals during decoding, and higher baseline throughput. The downside is you have to figure out how to exploit a constrained decoder like this, but I've already found one way (LZ_XOR). LZ_XOR is magical and fits a length limited prefix decoder design like this well because during your partial matching searches you can just skip any non-codable partial matches (i.e. ones that use XOR delta bytes you can't code).

LZ_XOR has so much flexibility that it can just toss uncodable partial matches and it still has plenty of usable things it can code. LZ_XOR's delta bytes are limited to 32 unique decodable symbols - which turns out to be plenty.

The compiler's instruction selector can shorten a long XOR match to make it codable using other instructions.

At 6-bits the algorithm you use to length limit the Huffman codes really matters. The one I'm using now (from miniz) is dead simple and not very efficient. I need Package-Merge.

Also, Universal codes can be decoded without using lookups (to decode the index anyway) using CLZ instructions. Unfortunately to do CLZ with anything less than AVX512 seems to require exploiting the FP instructions. In my tests with LZ_XOR various modified types of Universal codes are only 1-4% less efficient than Huffman.

No comments:

Post a Comment