Tuesday, December 5, 2023
Announcing CPNG, "Compatible Network Graphics", a backwards compatible fork of PNG
1. Constrained Deflate for huge (10-25x) encoding speedups vs. most other libraries, and 2-3x decoding speedups: this has shipped already in fpng/fpnge.
Friday, May 12, 2023
The claimed rANS decoding advantages are being exaggerated
To put it simply:
SSE 4.1, on an Intel Ice Lake CPU:
- SIMD Range: 64 streams, 738 MiB/sec., 436,349 bytes
- SIMD Range: 16 streams, 619.1 MiB/sec.
- SIMD rANS: 8 streams, 556.5 MiB/sec., 435,626 bytes
- SIMD Huffman: 32 streams, 1.1 GiB/sec.
The vectorized 24-bit Pavlov/Subbotin range coder is here.
The rANS decoder I'm comparing against appears to be well implemented. (If I could use rANS, which I can't because it's patented, I would start with this code.) It could be unrolled to 16 (vs. 8) or more streams, which would boost decoding performance somewhat - let's say ~25% faster, so ~695 MiB/sec. Range is still comparable, and Huffman is of course faster. The benchmark data shows it's pointless to risk infringement.
Also, show me just about any vectorized rANS decoder and many of its vectorization techniques will apply just as easily to SIMD range and Huffman decoders. They each have to do many of the same things, which helps equalize the perceived performance advantages of rANS decoding.
Ultimately working around range coding's division requirement is not the biggest problem. It's the gathers and keeping all the lanes normalized and fed efficiently - which are the same problems the other decoders face. Once you solve these problems for one entropy coder you can apply them to the others. Many of the implementation differences of the actual entropy coding technique melt away once vectorized.
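As a rough illustration of that shared bottleneck (this is not code from any of the decoders benchmarked here), the per-lane decoding-table lookup that every vectorized entropy decoder ends up doing maps to a single gather on AVX-2, while on SSE 4.1 it has to be emulated with scalar extracts and inserts:

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical 8-lane decoding-table lookup. 'table' holds packed 32-bit
// entries (e.g. symbol plus length or frequency); 'idx' holds one table index
// per lane, already masked to the table size. Requires AVX2.
static inline __m256i lookup8(const uint32_t *table, __m256i idx)
{
    // Scale of 4 because the table entries are 4 bytes apart.
    return _mm256_i32gather_epi32(reinterpret_cast<const int *>(table), idx, 4);
}
```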
Note that encoding is a different beast. I've been unable to find a way to easily interleave the output bytes from X parallel range coders. The SSE 4.1 range sample code linked above has to swizzle the bytes into a single stream after encoding, although no special side-band or signaling information is needed (just like with rANS). (This is just like my LZHAM codec, which has been interleaving arithmetic and Huffman coded streams with no extra signaling or side-band data for many years now - since before FSE/ANS was a thing.) With rANS, interleaving the encoders' outputs seems very easy. With more research into other entropy coders perhaps this advantage can be eliminated too.
A straightforward AVX-2 port of this SSE 4.1 range coder is ~65% faster on Ice Lake, putting it roughly in the same ballpark as the AVX-2 rANS decoders I've seen benchmarked. This makes sense because many of the bottlenecks are similar and have little to do with the actual entropy coding technique.
(Note I could upload my SSE 4.1 Huffman decoder to github, but honestly 1.1 GiB/sec. is not a big deal. Fast scalar Huffman on my machine gets ~800 MiB/sec. SSE 4.1 lacks easy variable per-lane bit shifting which hurts Huffman a lot. Straightforward AVX-2 Huffman decoding with an 8-bit alphabet gets ~1.8 GiB/sec. on this machine.)
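For what it's worth, the per-lane shift that SSE 4.1 is missing looks like this on AVX-2 (a generic sketch, not my decoder's actual inner loop):

```cpp
#include <immintrin.h>

// Each lane holds its own bit buffer; 'code_lengths' holds the number of bits
// the just-decoded symbol consumed in that lane. _mm256_srlv_epi32 (AVX2) does
// a different shift per lane, which SSE 4.1 can't express directly.
static inline __m256i consume_bits(__m256i bitbuf, __m256i code_lengths)
{
    return _mm256_srlv_epi32(bitbuf, code_lengths);
}
```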
Friday, May 5, 2023
LZ_XOR/LZ_ADD progress
Sunday, April 30, 2023
Somewhere Range Coding went off the rails
Importantly, range coders using this renormalization method during decoding are unnecessarily complex:
https://github.com/jkbonfield/rans_static/blob/master/arith_static.c#L222
https://gist.github.com/richgel999/d522e318bf3ad67e019eabc13137350a
Another important note: if you switch to 24 bits vs. 32 bits, you don't need integer division. (Division isn't needed for a binary range coder, and it's not strictly needed for a range coder that only supports small alphabets; the division is what enables a fast decoding table for large alphabets.) Instead you can use single-precision division and truncate the results. This option is crucial for vectorization on modern CPUs, where vectorized single-precision division is now fairly fast - or said differently, it's not so slow that it's a show stopper.
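Here's a minimal scalar sketch of that idea, assuming 12-bit probabilities and the usual clamp; it illustrates the trick, it is not the linked decoder's actual code:

```cpp
#include <cstdint>

static const uint32_t kProbBits  = 12;            // assumed probability precision
static const uint32_t kProbScale = 1u << kProbBits;

// With a 24-bit range, both code_offset and (range >> kProbBits) fit in a
// float's 24-bit mantissa, so they are exactly representable.
static inline uint32_t range_decode_freq(uint32_t code_offset, uint32_t range)
{
    const uint32_t r = range >> kProbBits;

    // Integer version: uint32_t freq = code_offset / r;
    // Float version: truncate the single-precision quotient. This is what makes
    // the decoder vectorizable - SIMD has no 32-bit integer divide, but packed
    // float division is reasonably fast on modern CPUs.
    uint32_t freq = (uint32_t)((float)code_offset / (float)r);

    // Clamp to the top of the probability range, as range decoders typically do.
    return freq < kProbScale - 1 ? freq : kProbScale - 1;
}
```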
Comparing Vectorized Huffman and Range Decoding vs. rANS (Or: rANS entropy coding is overrated)
SSE 4.1 Huffman/Range Decoding
- Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)
AVX-2 Huffman/Range Decoding
- Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes
Notes: The range coder uses 12 bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.
Collet's Scalar Huffman/FSE Decoding
Using his fast scalar Huffman and FSE encoders (FSE -b -h, or FSE -b):
- Huffman: 840 MiB/sec., 439,150 bytes
- FSE: 365 MiB/sec., 437,232 bytes
Giesen's Vectorized rANS Decoding
Using exam_simd_sse41, running on the same machine using WSL and compiled with clang v10:
- rANS: 213.9 MiB/sec., 435,604 bytes
- interleaved rANS: 324.7 MiB/sec., 435,606 bytes
- SIMD rANS: 556.5 MiB/sec., 435,626 bytes
Monday, April 24, 2023
Vectorized interleaved Range Coding using SSE 4.1
Friday, April 21, 2023
Faster LZ is not the answer to 150-250+ GB video game downloads
https://aras-p.info/blog/2023/01/31/Float-Compression-2-Oodleflate/
(Note these benchmarks are great and extremely useful.)
Also, these LZ codecs are too fast. They are designed for fast loading and streaming off SSD's. Who cares about shaving off a few hundred ms (or a second) when it takes hours or days to download the product onto the SSD?
Somebody could develop a 10x faster Oodle (or an Oodle that compresses 1-2% better) and we're still going to wait many hours or days to actually use the product. And then there's the constant updates. This approach doesn't scale.
It's fun and sexy to work on faster LZ, but the real problem (and value add) doesn't call for better or more lossless tech. This is a distraction. If trends continue, downloads and updates will be measured in sizable fractions of a week, or more than a week.
Sunday, April 2, 2023
The Dark Horse of the Image Codec World: Near-Lossless Image Formats Using Ultra-Fast LZ Codecs
I think simple ultra-high speed lossy (or near-lossless) image codecs, built from the new generation of fast LZ codecs, are going to become more relevant in the future.
Computing bottlenecks change over time. As disk space, disk bandwidth, and internet bandwidth increase, older image codecs that squeeze every last bit out of the resulting file become less valuable for many use cases. Eventually websites or products using noticeably lossy compressed images will be less attractive. The bandwidth savings from overly lossy image codecs will become meaningless, and the CPU/user time and battery or grid energy spent on complex decompression steps will be wasted.
Eventually, much simpler codecs with lower (weaker) compression ratios, less introduced distortion, and blazingly fast decompression are going to become more common. This core concept motivates this work.
One way to construct a simple lossless or lossy image codec with fast decompression is to combine a custom encoding tool with the popular lossless LZ4 codec. LZ4's compression and decompression is extremely fast, and the library is reliable, updated often, extensively fuzz tested, and very simple to use.
To make it lossy, the encoder needs to precondition the image data so that when it's subsequently compressed by LZ4, the proportion of 4+ byte matches vs. literals is increased compared to the original image data. Over the past year I've been constructing LZ preconditioners and building new LZ codecs that lend themselves to this preconditioning step.
Such a codec will not be able to compete against JPEG, WebP, JPEG 2000, etc. for perceived quality per bit. However, it'll be extremely fast to decode, very simple, and will likely not bloat executables because the LZ4 library is already present in many codebases. Using LZ4 introduces no new security risks.
This LZ preconditioning step must be done in a way that minimizes obvious visible artifacts, as perceived by the Human Visual System (HVS). This tradeoff of increasing distortion to reduce the bitrate is a classic application of rate-distortion theory. It's well known in video coding, and now in GPU texture encoding (which I introduced in 2009 with my "crunch" compression library).
The rdopng tool on github supports creating lossy LZ4 compressed RGB/RGBA images using a simple rate-distortion model. (Lossless is next, but I wanted to solve the harder problem first.) During the preconditioning step, the LZ4 rate in bits is approximated using a sliding dictionary and a match finder. For each potential match replacement which would introduce distortion into the lookahead buffer, the preconditioner approximates the introduced visual error by computing color error distances in a scaled Oklab perceptual colorspace. (Oklab is one of the most powerful colorspaces I've used for this sort of work. There are better colorspaces for compression, but Oklab is simple to use and well-documented.)
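As a hedged sketch of that error term: the sRGB-to-Oklab conversion below follows Björn Ottosson's published Oklab reference, while the per-channel weights are hypothetical stand-ins for whatever scaling rdopng actually uses:

```cpp
#include <cmath>

struct Oklab { float L, a, b; };

static inline float srgb_to_linear(float v)
{
    return (v <= 0.04045f) ? v / 12.92f : std::pow((v + 0.055f) / 1.055f, 2.4f);
}

// 8-bit sRGB to Oklab (coefficients from Ottosson's reference implementation).
static Oklab srgb_to_oklab(float r8, float g8, float b8)
{
    const float r = srgb_to_linear(r8 / 255.0f);
    const float g = srgb_to_linear(g8 / 255.0f);
    const float b = srgb_to_linear(b8 / 255.0f);

    const float l = std::cbrt(0.4122214708f * r + 0.5363325363f * g + 0.0514459929f * b);
    const float m = std::cbrt(0.2119034982f * r + 0.6806995451f * g + 0.1073969566f * b);
    const float s = std::cbrt(0.0883024619f * r + 0.2817188376f * g + 0.6299787005f * b);

    return { 0.2104542553f * l + 0.7936177850f * m - 0.0040720468f * s,
             1.9779984951f * l - 2.4285922050f * m + 0.4505937099f * s,
             0.0259040371f * l + 0.7827717662f * m - 0.8086757660f * s };
}

// Weighted squared Oklab distance between an original and a candidate pixel.
static float oklab_error(const Oklab &x, const Oklab &y)
{
    const float kLW = 1.0f, kAW = 1.0f, kBW = 1.0f; // hypothetical channel weights
    const float dL = x.L - y.L, da = x.a - y.a, db = x.b - y.b;
    return kLW * dL * dL + kAW * da * da + kBW * db * db;
}
```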
Perceptually, distortions introduced into regions surrounded by high-frequency detail are less noticeable than distortions in smooth or gradient regions. Before preconditioning, the encoder computes two error scaling masks that indicate which areas of the image contain large or small gradients/smooth regions. These masks suppress the introduction of distortion (via longer or more matches) where it would be too noticeable to the HVS. This step has a large impact on bitrate and can be improved.
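A toy illustration of such a mask (not rdopng's actual masks): measure local activity per pixel and shrink the allowed error budget where the neighborhood is smooth:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Per-pixel activity is the max absolute luma delta to the 8 neighbors.
// Smooth/gradient areas get a small scale (small error budget); busy areas
// get a larger one. The 0.25 and 32 constants are arbitrary placeholders.
std::vector<float> compute_error_scale_mask(const uint8_t *luma, int w, int h)
{
    std::vector<float> scale(size_t(w) * size_t(h), 1.0f);
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            const int c = luma[y * w + x];
            int act = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    act = std::max(act, std::abs(int(luma[(y + dy) * w + (x + dx)]) - c));
            scale[size_t(y) * size_t(w) + size_t(x)] =
                0.25f + 0.75f * float(std::min(act, 32)) / 32.0f;
        }
    }
    return scale;
}
```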
To speed up encoding, the preconditioner only examines a window region above and to the left of the lookahead buffer. LZ4's unfortunate minimum match size of 4 bytes complicates encoding of 24bpp RGB images. Encoding is not very fast due to this search process, but it's possible to thread it by working on different regions of the image in parallel. The encoder is a proof of concept and testing ground, and not as fast as it could be, but it works.
The encoder also supports angular error metrics for encoding tangent space normal maps.
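A hedged sketch of what an angular metric for tangent-space normals can look like (the exact metric the encoder uses isn't shown here): unpack both texels to vectors and compare the angle between them instead of per-channel color error:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical angular error between two 8-bit tangent-space normal map texels:
// unpack [0,255] to [-1,1], normalize, and return the angle between them in degrees.
static float normal_angular_error_deg(const uint8_t a[3], const uint8_t b[3])
{
    const float ax = a[0] / 127.5f - 1.0f, ay = a[1] / 127.5f - 1.0f, az = a[2] / 127.5f - 1.0f;
    const float bx = b[0] / 127.5f - 1.0f, by = b[1] / 127.5f - 1.0f, bz = b[2] / 127.5f - 1.0f;
    const float la = std::sqrt(ax * ax + ay * ay + az * az);
    const float lb = std::sqrt(bx * bx + by * by + bz * bz);
    if (la == 0.0f || lb == 0.0f)
        return 0.0f;
    const float d = std::clamp((ax * bx + ay * by + az * bz) / (la * lb), -1.0f, 1.0f);
    return std::acos(d) * (180.0f / 3.14159265f);
}
```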
LZ4I images are trivial to decode in any language. A LZ4I image consists of a simple header followed by the LZ4-compressed 24bpp RGB (R first) or RGBA pixel data.
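The exact header layout isn't reproduced here, but decoding reduces to a single LZ4_decompress_safe() call. A minimal sketch, assuming a hypothetical header struct that supplies width, height, and component count:

```cpp
#include <cstdint>
#include <vector>
#include "lz4.h"

struct lz4i_header_t { uint32_t width, height, comps; }; // hypothetical layout

// Returns tightly packed RGB (R first) or RGBA pixels, or an empty vector on error.
std::vector<uint8_t> lz4i_decode(const lz4i_header_t &hdr, const uint8_t *payload, int payload_size)
{
    const int dst_size = int(hdr.width * hdr.height * hdr.comps);
    std::vector<uint8_t> pixels(size_t(dst_size));

    // LZ4_decompress_safe() is the same call used for the decode-rate figures below.
    const int got = LZ4_decompress_safe(reinterpret_cast<const char *>(payload),
                                        reinterpret_cast<char *>(pixels.data()),
                                        payload_size, dst_size);
    if (got != dst_size)
        pixels.clear();
    return pixels;
}
```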
Some example encodings:
Lossy LZ4I: 42.630 dB (Rec. 709 luma), 12.985 bits/pixel, 1.1 gigapixels/sec. decompression rate using LZ4_decompress_safe() (on a mobile Core i7-1065G7 at its 1.3 GHz base clock):
Biased delta image:
Greyscale histogram of biased delta image:
Thursday, March 30, 2023
EA/Microsoft Neural Network GPU Texture Compression Patents
Microsoft: https://patents.google.com/patent/US10504248B2/en
What this boils down to: techniques like TexNN are potentially patented. This work was done in 2017/2018 but it wasn't public until 2019:
https://cs.unc.edu/~psrihariv/publication/texnn/
Both my "crunch" library and Basis Universal bypass the need for fast real-time texture compression entirely. They compress directly in the GPU texture domain. This approach is used by Unity and many AAA video games:
https://www.mobygames.com/person/190072/richard-geldreich/
I've been researching the next series of codecs for Basis Universal. This is why I wrote RDO PNG, QOI and LZ4, and bc7enc_rdo.
I am strongly anti-software patent. All the EA/MS patents will do is push developers towards open solutions, which will likely be better in the long run anyway. Their patents add nothing and were obvious. These corporations incentivize their developers to patent everything they can get their hands on (via bonuses etc.), which ultimately results in exploiting the system by patenting trivial or obvious ideas. Ultimately this slows innovation and encourages patent wars, which is bad for everybody.
It's possible to speed up a BC7 encoder without using neural networks. An encoder can try the first 16 partition patterns and find which is best. The results can be used to predict which of the more complex patterns are likely to improve the result (a sketch of the idea follows below). See the table and code here - this works:
https://github.com/richgel999/bc7enc_rdo/blame/master/bc7enc.cpp#L1714
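As a rough sketch of that two-level idea (this is not the actual bc7enc.cpp code; the prediction table and trial-encode callback are stand-ins for what the encoder already has):

```cpp
#include <cstdint>
#include <cfloat>

// 'predictions' stands in for a table mapping each of the first 16 partitions
// to a few likely higher-index partitions; 'trial_err(p)' trial-encodes the
// block with partition p and returns its error.
template <typename TrialErrFn>
uint32_t pick_bc7_partition(TrialErrFn trial_err, const uint8_t (*predictions)[4])
{
    uint32_t best_part = 0;
    float best_err = FLT_MAX;

    // Level 1: evaluate only the first 16 partition patterns.
    for (uint32_t p = 0; p < 16; p++) {
        const float e = trial_err(p);
        if (e < best_err) { best_err = e; best_part = p; }
    }

    // Level 2: the winner predicts a handful of the remaining 48 patterns
    // worth trying; everything else is skipped.
    for (uint32_t i = 0; i < 4; i++) {
        const uint32_t p = predictions[best_part][i];
        const float e = trial_err(p);
        if (e < best_err) { best_err = e; best_part = p; }
    }
    return best_part;
}
```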
It's potentially possible to use a 3 or 4 level hierarchy to determine the BC7 partition pattern. bc7enc.cpp only uses 2 levels. This would reduce the # of partition patterns to examine to only a handful.
To cut down the # of BC7 modes to check, you can first rule out mode 7 because it's only useful for blocks containing alpha. Then try mode 6. If it's good enough, stop. Then try mode 1. If that's good enough, stop, etc. Only a subset of blocks need to use the complex multiple subset modes. In many cases the 3 subset modes can be ignored entirely with little noticeable impact on the results. The component "rotation" feature is usually low value.
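A hedged sketch of that mode-culling order for an opaque block; the error threshold and the trial_err callback are assumptions, not bc7enc's actual interface:

```cpp
#include <cstdint>

// trial_err(mode) is assumed to trial-encode the block in that mode and
// return its error. Mode 7 is ruled out up front for opaque blocks, since it
// only earns its keep when the block contains alpha.
template <typename TrialErrFn>
uint32_t pick_bc7_mode_opaque(TrialErrFn trial_err, float good_enough_err)
{
    // Cheap single-subset mode first; stop as soon as it's good enough.
    if (trial_err(6) <= good_enough_err)
        return 6;

    // Then the common two-subset mode.
    if (trial_err(1) <= good_enough_err)
        return 1;

    // Only the remaining blocks pay for the other two-subset mode. The
    // three-subset modes (0/2) and the "rotation" modes (4/5) are often
    // skippable with little visible impact, so they aren't tried here.
    return 3;
}
```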
TexNN was first published here: