To put it simply:
SSE 4.1, on an Intel Ice Lake CPU:
SIMD Range: 64 streams: 738 MiB/sec., 436,349 bytes
SIMD Range: 16 streams: 619.1 MiB/sec.
SIMD rANS: 8 streams: 556.5 MiB/s, 435,626 bytes
SIMD Huffman: 32 streams: 1.1 GiB/sec.
The vectorized 24-bit Pavlov/Subbotin range coder is here.
The rANS decoder I'm comparing against appears to be well implemented. (If I could use rANS, which I can't because it's patented, I would start with this code.) It could be unrolled more to 16 (vs. 8) or more streams, and that's going to boost decoding perf. some. Let's say ~25% faster, so ~695 MiB/sec. Range is still comparable, and Huffman is of course faster. The benchmark data is showing it's pointless to risk infringement.
Also, if you show me probably any vectorized rANS decoder, many of the vectorization techniques also easily apply to SIMD range and Huffman decoders too. They each have to do many of the same things, and this helps equalize the perceived performance advantages of rANS decoding.
Ultimately working around range coding's division requirement is not the biggest problem. It's the gathers and keeping all the lanes normalized and fed efficiently - which are the same problems the other decoders face. Once you solve these problems for one entropy coder you can apply them to the others. Many of the implementation differences of the actual entropy coding technique melt away once vectorized.
Note encoding is a different beast. I've been unable to see a way to easily interleave the output bytes from X parallel range coders. The SSE 4.1 range sample code linked to above has to swizzle the bytes after encoding into a single stream, however there is no special side band or signaling information needed (just like with rANS). (This is just like my LZHAM codec, which has been interleaving arithmetic and Huffman coded streams with no extra signaling or side band data for many years now - including before FSE/ANS was a thing.) With rANS encoding interleaving the outputs seems very easy. With more research into other entropy coders perhaps it's possible to eliminate this advantage too.
A straightforward AVX-2 port of this SSE 4.1 range coder is ~65% faster on Ice Lake, putting it roughly in the same ballpark as the AVX-2 rANS decoders I've seen benchmarked. This makes sense because many of the bottlenecks are similar and have little to do with the actual entropy coding technique.
(Note I could upload my SSE 4.1 Huffman decoder to github, but honestly 1.1 GiB/sec. is not a big deal. Fast scalar Huffman on my machine gets ~800 MiB/sec. SSE 4.1 lacks easy variable per-lane bit shifting which hurts Huffman a lot. Straightforward AVX-2 Huffman decoding with an 8-bit alphabet gets ~1.8 GiB/sec. on this machine.)
