Richard Geldreich's Blog: May 2023

Friday, May 5, 2023

LZ_XOR/LZ_ADD progress

I'm tired of all the endless LZ clones, so I'm trying something different.

I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)

These new instructions are like the usual LZ matches, except XOR's are followed by a list of entropy coded byte values that are XOR'd to the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win (to a big win), but I'm still benchmarking it.

The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own - because there's nobody to copy ideas from. The usual bag of parsing heuristic tricks that everybody copies from LZMA don't work anymore, or have to be modified.

One prototype is byte oriented and is somewhat fast to decompress (>1 GiB/sec.), the other is like LZMA and uses a bitwise range coder. Fuzzy matching is difficult but I've made a lot of headway. It's no longer a terrifying search problem, now it's just scary.

The ratio of XOR's vs. literals or COPY ops highly depends on the source data. On plain text XOR's are weak and not worth the trouble. They're extremely strong on audio and image data, and they excel on binary or structured content.

With the LZMA-like codec LZ_XOR instructions using mostly 0 delta bytes can become so cheap to code they can be preferred over COPY's, which is at first surprising to see. It can be cheaper to extend an LZ_XOR with some more delta bytes vs. truncating one and starting a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's.

COPY ops must stop on the first mismatch, while XOR ops can match right through minor mismatches and still have net gain. Adding XOR ops can drastically reduce the # of overall instructions the VM ("decompressor") has to process, and also gives the parser an expanded amount of freedom to trade off reduced instructions vs. ratio. It's not all about ratio, it's about decompression speed.

Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.

The other improvement that shows net gain on many files is to emit an optional "history distance delta offset" value. This allows the encoder to specify a [-128,127] offset relative to one of the "REP" match history distances. The offset is entropy coded.