The is just a sketch of an idea. It might not pan out, there are too many details to know until I try it. LZHAM is an old codec now, but maybe I could stretch it a bit further before moving on to a new design.
It should be possible to decode multiple segments of each LZHAM block simultaneously with vector instructions. Porting the decoder to C, then vectorizing the decoder loop with SPMD (using iscp) shouldn't be that hard. The compressed stream will need some hints so it knows where each lane needs to start decoding from.
So for 8 parallel decodes, at the beginning of the block you memcpy() the arithmetic and Huffman statistics into 8 copies, run the SPMD decoder on say 1-2K bytes per lane, then sync the statistics after 8 1-2K blocks are processed. Then repeat the process.
Lane 0 will be able to process match copies from any previously decompressed bytes, but lane 1 can't access lane 0's decoded bytes, and lane 2 can't access lane 1/0's, etc. That'll definitely lower ratio. However, it should be possible for the compressor to simulate lane 0's decompressor and predict which bytes are available in lane 1 at each point in the compressed data stream.
The various if() statements in the decoder's inner loop won't be particularly coherent, which would impact lane efficiency. But even a 2-3x improvement in decoding speed would be pretty nice.