I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)
These new instructions are like the usual LZ matches, except XOR's are followed by a list of entropy coded byte values that are XOR'd to the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win (to a big win), but I'm still benchmarking it.
The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own - because there's nobody to copy ideas from. The usual bag of parsing heuristic tricks that everybody copies from LZMA don't work anymore, or have to be modified.
One prototype is byte oriented and is somewhat fast to decompress (>1 GiB/sec.), the other is like LZMA and uses a bitwise range coder. Fuzzy matching is difficult but I've made a lot of headway. It's no longer a terrifying search problem, now it's just scary.
With the LZMA-like codec LZ_XOR instructions using mostly 0 delta bytes can become so cheap to code they can be preferred over COPY's, which is at first surprising to see. It can be cheaper to extend an LZ_XOR with some more delta bytes vs. truncating one and starting a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's.
COPY ops must stop on the first mismatch, while XOR ops can match right through minor mismatches and still have net gain. Adding XOR ops can drastically reduce the # of overall instructions the VM ("decompressor") has to process, and also gives the parser an expanded amount of freedom to trade off reduced instructions vs. ratio. It's not all about ratio, it's about decompression speed.
Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.
No comments:
Post a Comment