Friday, May 12, 2023

The claimed rANS decoding advantages are being exaggerated

To put it simply:

SSE 4.1, on an Intel Ice Lake CPU:

SIMD Range: 64 streams: 738 MiB/sec., 436,349 bytes

SIMD Range: 16 streams: 619.1 MiB/sec.

SIMD rANS: 8 streams: 556.5 MiB/s, 435,626 bytes

SIMD Huffman: 32 streams: 1.1 GiB/sec.

The vectorized 24-bit Pavlov/Subbotin range coder is here

The rANS decoder I'm comparing against appears to be well implemented. (If I could use rANS, which I can't because it's patented, I would start with this code.) It could be unrolled more to 16 (vs. 8) or more streams, and that's going to boost decoding perf. some. Let's say ~25% faster, so ~695 MiB/sec. Range is still comparable, and Huffman is of course faster. The benchmark data is showing it's pointless to risk infringement.

Also, if you show me probably any vectorized rANS decoder, many of the vectorization techniques also easily apply to SIMD range and Huffman decoders too. They each have to do many of the same things, and this helps equalize the perceived performance advantages of rANS decoding. 

Ultimately working around range coding's division requirement is not the biggest problem. It's the gathers and keeping all the lanes normalized and fed efficiently - which are the same problems the other decoders face. Once you solve these problems for one entropy coder you can apply them to the others. Many of the implementation differences of the actual entropy coding technique melt away once vectorized.

Note encoding is a different beast. I've been unable to see a way to easily interleave the output bytes from X parallel range coders. The SSE 4.1 range sample code linked to above has to swizzle the bytes after encoding into a single stream, however there is no special side band or signaling information needed (just like with rANS). (This is just like my LZHAM codec, which has been interleaving arithmetic and Huffman coded streams with no extra signaling or side band data for many years now - including before FSE/ANS was a thing.) With rANS encoding interleaving the outputs seems very easy. With more research into other entropy coders perhaps it's possible to eliminate this advantage too.

A straightforward AVX-2 port of this SSE 4.1 range coder is ~65% faster on Ice Lake, putting it roughly in the same ballpark as the AVX-2 rANS decoders I've seen benchmarked. This makes sense because many of the bottlenecks are similar and have little to do with the actual entropy coding technique.

(Note I could upload my SSE 4.1 Huffman decoder to github, but honestly 1.1 GiB/sec. is not a big deal. Fast scalar Huffman on my machine gets ~800 MiB/sec. SSE 4.1 lacks easy variable per-lane bit shifting which hurts Huffman a lot. Straightforward AVX-2 Huffman decoding with an 8-bit alphabet gets ~1.8 GiB/sec. on this machine.)

Friday, May 5, 2023

LZ_XOR/LZ_ADD progress

I'm tired of all the endless LZ clones, so I'm trying something different.

I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)

These new instructions are like the usual LZ matches, except XOR's are followed by a list of entropy coded byte values that are XOR'd to the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win (to a big win), but I'm still benchmarking it.

The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own - because there's nobody to copy ideas from. The usual bag of parsing heuristic tricks that everybody copies from LZMA don't work anymore, or have to be modified.

One prototype is byte oriented and is somewhat fast to decompress (>1 GiB/sec.), the other is like LZMA and uses a bitwise range coder. Fuzzy matching is difficult but I've made a lot of headway. It's no longer a terrifying search problem, now it's just scary.

The ratio of XOR's vs. literals or COPY ops highly depends on the source data. On plain text XOR's are weak and not worth the trouble. They're extremely strong on audio and image data, and they excel on binary or structured content. 

With the LZMA-like codec LZ_XOR instructions using mostly 0 delta bytes can become so cheap to code they can be preferred over COPY's, which is at first surprising to see. It can be cheaper to extend an LZ_XOR with some more delta bytes vs. truncating one and starting a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's. 

COPY ops must stop on the first mismatch, while XOR ops can match right through minor mismatches and still have net gain. Adding XOR ops can drastically reduce the # of overall instructions the VM ("decompressor") has to process, and also gives the parser an expanded amount of freedom to trade off reduced instructions vs. ratio. It's not all about ratio, it's about decompression speed.

Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.

The other improvement that shows net gain on many files is to emit an optional "history distance delta offset" value. This allows the encoder to specify a [-128,127] offset relative to one of the "REP" match history distances. The offset is entropy coded.