LZ_XOR 128KB dictionary, AVX2, BMI1, mid-level CPU parsing, Ice Lake CPU (Core i7 1065G7 @ 1.3GHz, Dell Laptop).
Only the XOR bytes are entropy coded; everything else (the control stream and the usually rare literal runs) is sent byte-wise. It uses 6-bit length-limited prefix codes in 16 streams, AVX2 gathers and shuffle-based LUTs. I also have a two-gather version (one gather to get the bits, another to do the Huffman lookups) that decodes 2 symbols per gather, which is slightly faster (2.2 GiB/sec. vs. 1.9 GiB/sec.) but only on large buffers. I posted a pic of the cppspmd_fast inner loop on my Twitter.
BMI1 made very little if any difference that I could detect.
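For readers who haven't seen table-driven prefix decoding, here is a minimal scalar sketch of the idea behind a 6-bit length-limited code: because no code is longer than 6 bits, a single 64-entry table indexed by the next 6 bits resolves any symbol in one lookup. This is not the AVX2 decoder described above. It is one stream, MSB-first, with canonical code assignment, and the names (`LutEntry`, `build_lut`, `decode_prefix`) and the bit order are assumptions made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kMaxLen = 6;  // length limit, so the table has 2^6 = 64 entries

struct LutEntry { uint8_t sym; uint8_t len; };  // len == 0 marks an unused slot

// Build the 64-entry table from per-symbol code lengths (0 = symbol unused).
// Codes are assigned canonically: shorter codes first, then by symbol index.
std::array<LutEntry, 64> build_lut(const std::vector<uint8_t>& lens) {
    std::array<LutEntry, 64> lut{};
    uint32_t code = 0;
    for (int len = 1; len <= kMaxLen; ++len) {
        for (size_t s = 0; s < lens.size(); ++s) {
            assert(lens[s] <= kMaxLen);
            if (lens[s] != len) continue;
            // Every 6-bit pattern whose top `len` bits equal `code` maps here.
            uint32_t first = code << (kMaxLen - len);
            uint32_t count = 1u << (kMaxLen - len);
            assert(first + count <= 64);  // lengths must satisfy Kraft's inequality
            for (uint32_t i = 0; i < count; ++i)
                lut[first + i] = { static_cast<uint8_t>(s), static_cast<uint8_t>(len) };
            ++code;
        }
        code <<= 1;
    }
    return lut;
}

// Decode up to n symbols from an MSB-first bitstream with one lookup per symbol.
std::vector<uint8_t> decode_prefix(const std::vector<uint8_t>& in, size_t n,
                                   const std::array<LutEntry, 64>& lut) {
    std::vector<uint8_t> out;
    uint64_t buf = 0;  // valid bits are kept at the top of the word
    int avail = 0;     // number of valid bits in buf
    size_t pos = 0;
    while (out.size() < n) {
        while (avail <= 56 && pos < in.size()) {  // refill a byte at a time
            buf |= static_cast<uint64_t>(in[pos++]) << (56 - avail);
            avail += 8;
        }
        LutEntry e = lut[buf >> 58];              // peek the next 6 bits
        if (e.len == 0 || e.len > avail) break;   // invalid code or truncated input
        out.push_back(e.sym);
        buf <<= e.len;                            // consume only the bits the code used
        avail -= e.len;
    }
    return out;
}
```

With lengths {1, 2, 3, 3} the canonical codes are 0, 10, 110 and 111, so the two bytes 0x5B 0x80 decode to the symbols 0, 1, 2, 3, 0. A SIMD version would run the same lookup across many streams at once, with the table fetch done by a gather or a byte shuffle instead of a scalar load.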
The compressor isn't optimized yet. It's around 100KB/sec., all on one thread. That's next. LZ_XOR trades off strong parsing in the encoder for fewer instructions to execute in the decompressor, longer XOR matches, and usually rare (0.5-1%) literals.
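To make the "XOR match" idea concrete, here is a rough sketch of how a decompressor could apply one, assuming a match rebuilds each output byte as the byte `dist` positions back XORed with one residual byte. The function name `apply_xor_match` and its signature are hypothetical, and this says nothing about the actual control stream layout. A residual of 0 means the byte matched exactly, which is why a match can keep going through a few mismatches instead of ending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `len` bytes to `out`. Each new byte is the already-decoded byte
// `dist` positions back, XORed with the corresponding residual byte.
// Hypothetical helper: the real format's fields and limits are not shown here.
void apply_xor_match(std::vector<uint8_t>& out, size_t dist,
                     const uint8_t* xor_bytes, size_t len) {
    assert(dist >= 1 && dist <= out.size());
    for (size_t i = 0; i < len; ++i) {
        // Reading byte by byte lets overlapping matches (dist < len) work.
        uint8_t src = out[out.size() - dist];
        out.push_back(static_cast<uint8_t>(src ^ xor_bytes[i]));
    }
}
```

For example, with "abcd" already decoded, a match of distance 4 and length 4 with residuals {0, 0, 1, 0} appends "abbd": three bytes copy through unchanged and the third is patched by its residual.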