Richard Geldreich's Blog
Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Tuesday, December 5, 2023
Announcing CPNG, "Compatible Network Graphics", a backwards compatible fork of PNG
1. Constrained Deflate for huge (10-25x) encoding speedups vs. most other libraries, and 2-3x decoding speedups: this has shipped already in fpng/fpnge.
Friday, May 12, 2023
The claimed rANS decoding advantages are being exaggerated
To put it simply:
SSE 4.1, on an Intel Ice Lake CPU:
- SIMD Range: 64 streams: 738 MiB/sec., 436,349 bytes
- SIMD Range: 16 streams: 619.1 MiB/sec.
- SIMD rANS: 8 streams: 556.5 MiB/sec., 435,626 bytes
- SIMD Huffman: 32 streams: 1.1 GiB/sec.
The vectorized 24-bit Pavlov/Subbotin range coder is here.
The rANS decoder I'm comparing against appears to be well implemented. (If I could use rANS, which I can't because it's patented, I would start with this code.) It could be unrolled to 16 (vs. 8) or more streams, which would boost decoding performance somewhat, say ~25% faster, or ~695 MiB/sec. Range coding is still comparable, and Huffman is of course faster. The benchmark data shows it's pointless to risk infringement.
Also, show me nearly any vectorized rANS decoder and many of its vectorization techniques will apply just as easily to SIMD range and Huffman decoders. They all have to do many of the same things, which helps equalize the perceived performance advantages of rANS decoding.
Ultimately working around range coding's division requirement is not the biggest problem. It's the gathers and keeping all the lanes normalized and fed efficiently - which are the same problems the other decoders face. Once you solve these problems for one entropy coder you can apply them to the others. Many of the implementation differences of the actual entropy coding technique melt away once vectorized.
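To make those shared bottlenecks concrete, here's a minimal AVX2 sketch (my own illustration, not the linked decoder) of the lane-maintenance work any of these vectorized decoders ends up doing each step: gather 32-bit table entries for all lanes, compute a mask of which lanes fell below the renormalization threshold, and feed those lanes more input bytes. The DecoderLanes8 layout, the per-lane input pointers, and the renorm_threshold value are assumptions made purely for illustration.

#include <immintrin.h>
#include <cstdint>

struct DecoderLanes8
{
    __m256i state;          // one 32-bit decoder state per lane
    const uint8_t* in[8];   // per-lane input pointers (purely illustrative layout)
};

// One step of the shared work: gather table entries indexed by the low bits of
// each lane's state, then top up any lane whose state fell below the
// renormalization threshold (assumed < 2^31 so signed compares are fine).
static inline __m256i gather_and_renorm(DecoderLanes8& d, const uint32_t* table,
                                        uint32_t index_mask, uint32_t renorm_threshold)
{
    // The gathers: this is the scattered memory traffic all three decoders share.
    __m256i idx = _mm256_and_si256(d.state, _mm256_set1_epi32((int)index_mask));
    __m256i entries = _mm256_i32gather_epi32((const int*)table, idx, 4);

    // Which lanes need renormalizing?
    __m256i need = _mm256_cmpgt_epi32(_mm256_set1_epi32((int)renorm_threshold), d.state);
    int need_mask = _mm256_movemask_ps(_mm256_castsi256_ps(need));

    // Keep the lanes fed: scalar fallback here; real decoders batch this part too.
    alignas(32) uint32_t lanes[8];
    _mm256_store_si256((__m256i*)lanes, d.state);
    for (int i = 0; i < 8; i++)
        if (need_mask & (1 << i))
            lanes[i] = (lanes[i] << 8) | *d.in[i]++;
    d.state = _mm256_load_si256((const __m256i*)lanes);

    return entries;
}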
Note that encoding is a different beast. I've been unable to find a way to easily interleave the output bytes from X parallel range coders. The SSE 4.1 range sample code linked above has to swizzle the bytes into a single stream after encoding; however, no special side-band or signaling information is needed (just as with rANS). (This is just like my LZHAM codec, which has been interleaving arithmetic and Huffman coded streams with no extra signaling or side-band data for many years now, including before FSE/ANS was a thing.) With rANS, interleaving the encoders' outputs seems very easy. With more research into other entropy coders, perhaps it's possible to eliminate this advantage too.
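As a concreteness aid, here's a hedged sketch of one way such a post-encode swizzle with no side-band data could be organized (an assumption about the general approach, not the linked sample's actual code): because the encoder's renormalizations mirror the decoder's, each lane can log which decode step each of its output bytes belongs to, and the final stream is just those bytes listed in the order the interleaved decoder will consume them.

#include <cstdint>
#include <vector>

struct RenormEvent { uint32_t step; uint8_t lane; uint8_t byte; };

// Merge per-lane byte logs into one stream: at each decode step, lanes that
// renormalized contribute their byte(s) in lane order. (Quadratic scan for
// clarity; a real implementation would bucket the events.)
std::vector<uint8_t> swizzle_streams(const std::vector<RenormEvent>& events,
                                     uint32_t num_steps, uint32_t num_lanes)
{
    std::vector<uint8_t> out;
    for (uint32_t step = 0; step < num_steps; step++)
        for (uint32_t lane = 0; lane < num_lanes; lane++)
            for (const RenormEvent& e : events)
                if (e.step == step && e.lane == lane)
                    out.push_back(e.byte);
    return out;
}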
A straightforward AVX-2 port of this SSE 4.1 range coder is ~65% faster on Ice Lake, putting it roughly in the same ballpark as the AVX-2 rANS decoders I've seen benchmarked. This makes sense because many of the bottlenecks are similar and have little to do with the actual entropy coding technique.
(Note I could upload my SSE 4.1 Huffman decoder to github, but honestly 1.1 GiB/sec. is not a big deal. Fast scalar Huffman on my machine gets ~800 MiB/sec. SSE 4.1 lacks easy variable per-lane bit shifting which hurts Huffman a lot. Straightforward AVX-2 Huffman decoding with an 8-bit alphabet gets ~1.8 GiB/sec. on this machine.)
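For reference, here's a generic illustration of the per-lane shift gap mentioned above (plain intrinsics usage, not the author's decoder): AVX2's VPSRLVD shifts each lane right by its own count in one instruction, while SSE 4.1 has to emulate it, shown here with the simplest extract/shift/insert fallback (other emulations use multiplies or float tricks).

#include <immintrin.h>
#include <cstdint>

// AVX2: variable per-lane right shift in a single instruction.
static inline __m256i srlv8_avx2(__m256i v, __m256i counts)
{
    return _mm256_srlv_epi32(v, counts);    // VPSRLVD
}

// SSE 4.1: no such instruction, so the simplest emulation goes through scalar
// extracts and inserts (counts assumed < 32, e.g. Huffman code lengths).
static inline __m128i srlv4_sse41(__m128i v, __m128i counts)
{
    __m128i r = v;
    r = _mm_insert_epi32(r, (int)((uint32_t)_mm_extract_epi32(v, 0) >> _mm_extract_epi32(counts, 0)), 0);
    r = _mm_insert_epi32(r, (int)((uint32_t)_mm_extract_epi32(v, 1) >> _mm_extract_epi32(counts, 1)), 1);
    r = _mm_insert_epi32(r, (int)((uint32_t)_mm_extract_epi32(v, 2) >> _mm_extract_epi32(counts, 2)), 2);
    r = _mm_insert_epi32(r, (int)((uint32_t)_mm_extract_epi32(v, 3) >> _mm_extract_epi32(counts, 3)), 3);
    return r;
}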
Friday, May 5, 2023
LZ_XOR/LZ_ADD progress
Sunday, April 30, 2023
Somewhere Range Coding went off the rails
Importantly, range coders using this renormalization method during decoding are unnecessarily complex:
https://github.com/jkbonfield/rans_static/blob/master/arith_static.c#L222
https://gist.github.com/richgel999/d522e318bf3ad67e019eabc13137350a
Another important note: if you switch to 24 bits vs. 32 bits, you don't need to use integer division. (Note division isn't needed for a binary range coder. It's also not strictly needed for a range coder that only supports small alphabets. The division enables using a fast decoding table for large alphabets.) Instead you can use single precision division and truncate the results. This option is crucial for vectorization on modern CPUs. Modern vectorized single precision division is fairly fast now; put another way, it's not so slow that it's a showstopper.
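Here's a minimal, self-checking sketch of that idea as I read it (an illustration, not the author's code): with a 24-bit coder, both operands of the decoder's variable-divisor divide stay below 2^24, so they're exactly representable in a float's 24-bit mantissa and the truncated single precision quotient matches integer division. A vectorized decoder does the same thing with packed float division (_mm_div_ps / _mm256_div_ps).

#include <cassert>
#include <cstdint>

// Integer reference: the decoder's variable-divisor divide.
static inline uint32_t div_int(uint32_t code, uint32_t r) { return code / r; }

// Float replacement, valid while both operands stay below 2^24 (and r != 0).
static inline uint32_t div_flt(uint32_t code, uint32_t r)
{
    return (uint32_t)((float)code / (float)r);  // truncates toward zero
}

int main()
{
    // Spot check the equivalence over a few million 24-bit operand pairs.
    for (uint32_t code = 1; code < (1u << 24); code += 4099)
        for (uint32_t r = 1; r < (1u << 24); r += 7919)
            assert(div_int(code, r) == div_flt(code, r));
    return 0;
}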
Comparing Vectorized Huffman and Range Decoding vs. rANS (Or: rANS entropy coding is overrated)
SSE 4.1 Huffman/Range Decoding
- Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)
AVX-2 Huffman/Range Decoding
- Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes
Notes: The range coder uses 12 bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.
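As an aside on the 13-bit limit: a plausible motivation (an assumption on my part, not a description of this particular decoder) is that it keeps single-table decoding cheap, since one (1 << 13)-entry lookup built from the code lengths resolves any symbol. A scalar sketch, assuming LSB-first bit packing:

#include <cstdint>
#include <vector>

struct TableEntry { uint16_t symbol; uint8_t length; };

static const uint32_t MAX_CODE_BITS = 13;   // max code size from the post

// Decode one symbol: peek 13 bits, one table lookup, consume length bits.
// The caller refills bit_buf from the stream when bits_left runs low.
static inline uint32_t decode_symbol(const std::vector<TableEntry>& table,
                                     uint64_t& bit_buf, uint32_t& bits_left)
{
    const TableEntry& e = table[(uint32_t)bit_buf & ((1u << MAX_CODE_BITS) - 1)];
    bit_buf >>= e.length;
    bits_left -= e.length;
    return e.symbol;
}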
Collet's Scalar Huffman/FSE Decoding
Using his fast scalar Huffman and FSE encoders (FSE -b -h, or FSE -b):
- Huffman: 840 MiB/sec., 439,150 bytes
- FSE: 365 MiB/sec., 437,232 bytes
Giesen's Vectorized rANS Decoding
Using exam_simd_sse41, running on the same machine using WSL and compiled with clang v10:
- rANS: 213.9 MiB/s, 435,604 bytes
- interleaved rANS: 324.7 MiB/s, 435,606 bytes
- SIMD rANS: 556.5 MiB/s, 435,626 bytes
Monday, April 24, 2023
Vectorized interleaved Range Coding using SSE 4.1
Friday, April 21, 2023
Faster LZ is not the answer to 150-250+ GB video game downloads
https://aras-p.info/blog/2023/01/31/Float-Compression-2-Oodleflate/
(Note these benchmarks are great and extremely useful.)
Also, these LZ codecs are too fast. They are designed for fast loading and streaming off SSDs. Who cares about shaving off a few hundred milliseconds (or even a second) when it takes hours or days to download the product onto the SSD?
Somebody could develop a 10x faster Oodle (or an Oodle that compresses 1-2% better) and we're still going to wait many hours or days to actually use the product. And then there are the constant updates. This approach doesn't scale.
It's fun and sexy to work on faster LZ, but the real problem (and the real value add) doesn't call for better or more lossless tech. This is a distraction. If trends continue, downloads and updates will be measured in fractions of a week, or in more than a week.