Monday, April 24, 2023

Vectorized interleaved Range Coding using SSE 4.1

In order to avoid the current (and upcoming) ANS/rANS entropy coding patent minefield, we're avoiding it and using vectorized Range Coding instead. Here's a 24-bit SSE 4.1 example using 16 interleaved streams. This example decoder gets 550-700 megabytes/sec. with 8-bit alphabets on various Intel/AMD CPU's I've tried:

More on the rANS patent situation (from early 2022):

This decoder design is practical on any CPU or GPU that supports fast hardware integer or float division. It explicitly uses 24-bit registers to sidestep issues with float divides. I've put much less work on optimizing the encoder, but the key step (the post-encode byte swizzle) is the next bottleneck to address.

No comments:

Post a Comment