John Brooks (CEO of Blue Shift Inc.) and I were discussing LZHAM's current decompression performance vs. Google's Brotli codec and we came up with these ideas:
- Basic block optimization:
The current decompressor's inner loop is a general purpose coroutine based design, so a single implementation can handle both streaming and non-streaming scenarios. This is hurting perf. because the coroutine structure consists of a huge switch statement, which causes the compiler to dump locals to registers (and read them back in) a lot.
I'm going to add an alternative non-coroutine version of the inner loop that won't support streaming to optimize the very common memory to memory scenario. (LZHAM already has a few non-streaming optimizations in there, but it still uses a huge switch statement that breaks up the inner loop into tons of basic blocks.)
- LZHAM doesn't have any SIMD optimizations in its Huffman routines. I've been hesitant to use SIMD code anywhere in LZHAM because it complicates testing, but some of the Huffman routines should be easy to optimize with SIMD code.
- Finally, LZHAM's small block performance suffers vs. LZMA, Brotli, or zlib because it must compute Huffman tables on the fly at a fairly high frequency near the beginning of streams. There's a simple solution to this, which is to use precomputed Huffman table codelengths at the start of streams, then switch to dynamically updated Huffman tables once it makes sense to do so.
I already have the code to do this stuff (from the original LZHAM prototype), but it would require breaking v1.0 format compatibility. (And I'm not going to break format compatibility - if/when I do the new thing will have a different name.)