This decoder design is practical on any CPU or GPU that supports fast hardware integer or float division. It explicitly uses 24-bit registers to sidestep issues with float divides. I've put much less work on optimizing the encoder, but the key step (the post-encode byte swizzle) is the next bottleneck to address.
Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Monday, April 24, 2023
Vectorized interleaved Range Coding using SSE 4.1
In order to avoid the current (and upcoming) ANS/rANS entropy coding patent minefield, we're avoiding it and using vectorized Range Coding instead. Here's a 24-bit SSE 4.1 example using 16 interleaved streams. This example decoder gets 550-700 megabytes/sec. with 8-bit alphabets on various Intel/AMD CPU's I've tried:
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment