Sunday, April 30, 2023

Somewhere Range Coding went off the rails

I've lost track of the number of Range Coders I've seen floating around over the past ~25 years. Subbotin's coders, released in the late '90s/early 2000s, seem to be very influential. Each design makes different tradeoffs, and those tradeoffs greatly impact a vectorized implementation. There is no single definitive range coder design, and I've seen 3 very different ways to handle carries. I learned a lot implementing my first vectorized range decoder.

Most of the tradeoffs boil down to how carries are handled and how many state registers the decoder uses. The three carry approaches I've seen:

- Delay the carries during encoding (Subbotin's earliest "carry-less" or delayed-carry encoder that I can find, which is very clever - sketched below).
- Fix up the already-encoded bytes by propagating the carries backwards in memory during encoding (LZMA's binary range coder, which is similar to Subbotin's).
- Move the carry logic into the decoder (another Subbotin design, the "Russian People's Range Coder").

There are also decoder variants that track low/code/range vs. just code/range (LZMA's vs. Subbotin's earliest).
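
Here's a minimal sketch of that first, delayed-carry approach. This is my illustration of the general technique, not Subbotin's actual code; emit() is a stand-in for the real output sink. The encoder holds back the last byte plus any run of 0xFF bytes; when a carry finally arrives it bumps the held byte, the pending 0xFFs wrap to 0x00s, and nothing already written ever needs to be touched:

    #include <stdint.h>
    #include <stddef.h>

    void emit(uint8_t b); /* stand-in for the real byte sink */

    typedef struct {
        uint64_t low;        /* 33 bits live here so the carry is visible */
        uint8_t  cache;      /* last byte, held back instead of written */
        size_t   pending_ff; /* run of 0xFF bytes held after 'cache' */
        int      started;    /* nothing is held before the first flush */
    } enc_state;

    static void shift_low(enc_state *e) {
        if ((uint32_t)e->low < 0xFF000000u || (e->low >> 32)) {
            uint8_t carry = (uint8_t)(e->low >> 32); /* 0 or 1 */
            if (e->started)
                emit((uint8_t)(e->cache + carry));
            for (; e->pending_ff; e->pending_ff--)
                emit((uint8_t)(0xFF + carry)); /* wraps to 0x00 on carry */
            e->cache = (uint8_t)(e->low >> 24);
            e->started = 1;
        } else {
            e->pending_ff++; /* 0xFF byte: a later carry could still ripple through it */
        }
        e->low = (e->low << 8) & 0xFFFFFFFFu;
    }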

Importantly, range coders using this renormalization method during decoding are unnecessarily complex:

https://github.com/jkbonfield/rans_static/blob/master/arith_static.c#L222

Here's a different Subbotin variant that also uses overly complex and obfuscated renormalization logic during decoding. (Please just stop, guys - this is the wrong direction.)

Why is all that messy conditional logic in the decoder's renormalization step? It appears to be carry related, but the carries should have been handled during encoding in some way. None of it is necessary for a working, correct range coder.

Instead, do this, from Subbotin's first delayed-carry range coder. It's easily vectorizable, and he only uses 2 variables (code and range). It's quite similar to LZMA's and LZHAM's, except it doesn't need to propagate carries backwards through the buffer containing the output bytes.

https://gist.github.com/richgel999/d522e318bf3ad67e019eabc13137350a
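
For reference, here's roughly what that decoder core boils down to - a simplified scalar sketch of the general two-register design, not the gist's exact code. The 24-bit renormalization threshold and the frequency/total interface are assumptions on my part:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        const uint8_t *buf; /* encoded bytes */
        size_t pos, size;
        uint32_t code;  /* the decoder's sliding window into the stream */
        uint32_t range; /* current interval size */
    } range_dec;

    static uint8_t next_byte(range_dec *d) {
        return (d->pos < d->size) ? d->buf[d->pos++] : 0;
    }

    static void dec_init(range_dec *d) {
        d->code = 0;
        d->range = 0xFFFFFFFFu;
        for (int i = 0; i < 4; i++) /* prime 'code' with the first 4 bytes */
            d->code = (d->code << 8) | next_byte(d);
    }

    /* Returns a value in [0, total); the caller maps it to a symbol. */
    static uint32_t dec_freq(range_dec *d, uint32_t total) {
        d->range /= total;
        uint32_t f = d->code / d->range;
        return (f < total) ? f : total - 1;
    }

    /* Remove the decoded symbol's interval [cum_lo, cum_hi), renormalize. */
    static void dec_update(range_dec *d, uint32_t cum_lo, uint32_t cum_hi) {
        d->code -= cum_lo * d->range;
        d->range *= (cum_hi - cum_lo);
        /* The entire renormalization step - no carry logic needed here. */
        while (d->range < (1u << 24)) {
            d->code = (d->code << 8) | next_byte(d);
            d->range <<= 8;
        }
    }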

Another important note: if you switch to 24-bit registers vs. 32-bit, you don't need to use integer division. (Note that division isn't needed at all for a binary range coder. It's also not strictly needed for a range coder that only supports small alphabets. The division is what enables a fast decoding table for large alphabets.) Instead you can use single-precision FP division and truncate the result. This option is crucial for vectorization on modern CPUs. Modern vectorized single-precision division is fairly fast now. Or, said a different way: it's not so slow that it's a show stopper.
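
As a scalar sketch of the idea (my illustration - exact bounds depend on your coder's configuration): with both operands below 2^24 they're exactly representable in a float's 24-bit mantissa, and round-to-nearest can still land the quotient one above the true floor, so a single compare-and-subtract repairs it. In a vectorized decoder this becomes a vdivps plus a compare and a subtract:

    #include <stdint.h>

    /* Floor division via single-precision FP, valid for a, b < 2^24, b != 0. */
    static uint32_t fdiv_floor24(uint32_t a, uint32_t b) {
        uint32_t q = (uint32_t)((float)a / (float)b);
        if (q * b > a) /* q <= a/b + 1, so q*b <= a + b < 2^25: no overflow */
            q--;
        return q;
    }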

In vectorized implementations of both Huffman and range decoding, most of the instructions aren't related to the entropy coding method itself. Instead they're spent on common chores: fetching from the decoder's lookup table (using some sort of manual or explicit gather), and fetching source bytes and distributing them to the lanes during renormalization. The FP division is a small part of the whole process.
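
For instance, the table fetch in an AVX2 build is a single gather per 8 lanes. A simplified sketch (the function name and the packed {symbol, cum_lo, freq} entry layout are hypothetical):

    #include <immintrin.h> /* AVX2: compile with -mavx2 */
    #include <stdint.h>

    /* One gather replaces eight scalar table loads. */
    static __m256i fetch_table_entries(const uint32_t *table, __m256i indices) {
        return _mm256_i32gather_epi32((const int *)table, indices, 4);
    }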

What seems to slow down vectorized range decoding vs. Huffman is the more complex renormalization, which in my implementation can fetch anywhere from 0 to 2 bytes per step. By comparison, my vectorized Huffman decoders fetch either exactly 0 or exactly 2 bytes per step.
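
To show the contrast, here's a simplified Huffman-side refill (a sketch assuming a 64-bit bit buffer and codes capped at 16 bits). It's all-or-nothing, so every SIMD lane can take the same path, while the range decoder's byte-wise renormalization loop runs a data-dependent 0, 1, or 2 times:

    #include <stdint.h>

    /* With codes limited to 16 bits, a decode step consumes at most 16 bits,
       so topping the buffer back up is always exactly 0 or 2 bytes. */
    static void huff_refill(uint64_t *bitbuf, uint32_t *nbits, const uint8_t **src) {
        if (*nbits <= 48) { /* room for 16 more bits */
            uint32_t two = (uint32_t)((*src)[0] | ((*src)[1] << 8));
            *bitbuf |= (uint64_t)two << *nbits;
            *nbits += 16;
            *src += 2;
        }
    }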

All of these details are starting to matter now that rANS coding is falling under patent protection. Range Coding is a ~44-year-old technique.
