Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Friday, May 5, 2023
LZ_XOR/LZ_ADD progress
Sunday, April 30, 2023
Somewhere Range Coding went off the rails
Importantly, range coders that use this renormalization method are unnecessarily complex during decoding:
https://github.com/jkbonfield/rans_static/blob/master/arith_static.c#L222
https://gist.github.com/richgel999/d522e318bf3ad67e019eabc13137350a
Another important note: if you switch to 24 bits vs. 32 bits, you don't need to use integer division. (Note that division isn't needed for a binary range coder. It's also not strictly needed for a range coder that only supports small alphabets. The division enables using a fast decoding table for large alphabets.) Instead, you can use single-precision division and truncate the results. This option is crucial for vectorization on modern CPUs. Modern vectorized single-precision division is fairly fast now; put differently, it's not so slow that it's a show stopper.
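To make this concrete, here's a minimal sketch (not the decoder linked above) of the symbol-lookup step of a 24-bit range decoder with 12-bit probabilities, with the integer division swapped for a truncated single-precision division:

```cpp
// Minimal sketch, assuming a 24-bit range decoder with 12-bit (power-of-2)
// cumulative probabilities. Because code and range always stay below 2^24,
// both are exactly representable in a float, so the integer division in the
// symbol lookup can be replaced with a truncated single-precision division
// (which vectorizes well). Renormalization (not shown) keeps range large.
#include <cstdint>

static const uint32_t kProbBits = 12;                // 12-bit probabilities
static const uint32_t kProbScale = 1u << kProbBits;  // 4096

struct range_decoder_24
{
    uint32_t m_code;   // always < 2^24
    uint32_t m_range;  // always < 2^24, kept large by byte-wise renormalization

    // Returns the cumulative frequency of the next symbol, in [0, kProbScale).
    inline uint32_t decode_get_cum_freq()
    {
        // Power-of-2 total, so this "division" is just a shift.
        m_range >>= kProbBits;

        // Integer version: uint32_t f = m_code / m_range;
        // Float version: both operands are exact in single precision.
        uint32_t f = (uint32_t)((float)m_code / (float)m_range);

        // Guard against the truncated quotient landing on kProbScale.
        return (f < kProbScale) ? f : (kProbScale - 1);
    }
};
```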
Comparing Vectorized Huffman and Range Decoding vs. rANS (Or: rANS entropy coding is overrated)
SSE 4.1 Huffman/Range Decoding
- Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)
AVX-2 Huffman/Range Decoding
- Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes
Notes: The range coder uses 12 bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.
Collet's Scalar Huffman/FSE Decoding
Using his fast scalar Huffman and FSE encoders (FSE -b -h, or FSE -b):
- Huffman: 840 MiB/sec., 439,150 bytes
- FSE: 365 MiB/sec., 437,232 bytes
Giesen's Vectorized rANS Decoding
Using exam_simd_sse41, running on the same machine using WSL and compiled with clang v10:
- rANS: 213.9 MiB/s, 435,604 bytes
- interleaved rANS: 324.7 MiB/s, 435,606 bytes
- SIMD rANS: 556.5 MiB/s, 435,626 bytes
Monday, April 24, 2023
Vectorized interleaved Range Coding using SSE 4.1
Friday, April 21, 2023
Faster LZ is not the answer to 150-250+ GB video game downloads
https://aras-p.info/blog/2023/01/31/Float-Compression-2-Oodleflate/
(Note these benchmarks are great and extremely useful.)
Also, these LZ codecs are too fast. They are designed for fast loading and streaming off SSDs. Who cares about shaving off a few hundred ms (or a second) when it takes hours or days to download the product onto the SSD in the first place?
Somebody could develop a 10x faster Oodle (or an Oodle that compresses 1-2% better) and we're still going to wait many hours or days to actually use the product. And then there are the constant updates. This approach doesn't scale.
It's fun and sexy to work on faster LZ, but the real problem (and value add) doesn't call for better or more lossless tech. This is a distraction. If current trends continue, downloads and updates will be measured in fractions of a week, or more than a week.
Sunday, April 2, 2023
The Dark Horse of the Image Codec World: Near-Lossless Image Formats Using Ultra-Fast LZ Codecs
I think simple ultra-high speed lossy (or near-lossless) image codecs, built from the new generation of fast LZ codecs, are going to become more relevant in the future.
Computing bottlenecks change over time. As disk space, disk bandwidth, and internet bandwidth increase, older image codecs that squeeze every last bit out of the resulting file become less valuable for many use cases. Eventually, websites and products using noticeably lossy compressed images will be less attractive. The bandwidth savings from overly lossy image codecs will become meaningless, and the CPU/user time and battery or grid energy spent on complex decompression steps will be wasted.
Eventually, much simpler codecs with lower (weaker) compression ratios that introduce less distortion, but have blazingly fast decompression rates, are going to become more common. This core concept motivates this work.
One way to construct a simple lossless or lossy image codec with fast decompression is to combine a custom encoding tool with the popular lossless LZ4 codec. LZ4's compression and decompression is extremely fast, and the library is reliable, updated often, extensively fuzz tested, and very simple to use.
To make it lossy, the encoder needs to precondition the image data so that, when it's subsequently compressed by LZ4, the proportion of 4+ byte matches vs. literals is increased compared to the original image data. I've been constructing LZ Preconditioners, and building new LZ codecs that lend themselves to this preconditioning step, over the past year.
Such a codec will not be able to compete against JPEG, WebP, JPEG 2000, etc. for perceived quality per bit. However, it'll be extremely fast to decode, very simple, and will likely not bloat executables because the LZ4 library is already present in many codebases. Using LZ4 introduces no new security risks.
This LZ preconditioning step must be done in a way that minimizes obvious visible artifacts, as perceived by the Human Visual System (HVS). This tradeoff of increasing distortion to reduce the bitrate is a classic application of rate-distortion theory. This is well-known in video coding, and now in GPU texture encoding (which I introduced in 2009 with my "crunch" compression library).
The rdopng tool on github supports creating lossy LZ4 compressed RGB/RGBA images using a simple rate-distortion model. (Lossless is next, but I wanted to solve the harder problem first.) During the preconditioning step, the LZ4 rate in bits is approximated using a sliding dictionary and a match finder. For each potential match replacement which would introduce distortion into the lookahead buffer, the preconditioner approximates the introduced visual error by computing color error distances in a scaled Oklab perceptual colorspace. (Oklab is one of the most powerful colorspaces I've used for this sort of work. There are better colorspaces for compression, but Oklab is simple to use and well-documented.)
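For reference, here's a small sketch of the kind of color error computation involved, using Björn Ottosson's published sRGB to Oklab conversion. The per-channel weights are illustrative placeholders, not rdopng's actual scaling:

```cpp
// Minimal sketch of a perceptual color distance in a scaled Oklab space.
// The sRGB -> Oklab conversion follows Björn Ottosson's published reference;
// the channel weights below are illustrative placeholders, not rdopng's values.
#include <cmath>

struct oklab { float L, a, b; };

static inline float srgb_to_linear(float c)
{
    return (c <= 0.04045f) ? (c / 12.92f) : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

static oklab srgb_to_oklab(float r8, float g8, float b8)
{
    const float r = srgb_to_linear(r8 / 255.0f);
    const float g = srgb_to_linear(g8 / 255.0f);
    const float b = srgb_to_linear(b8 / 255.0f);

    const float l = std::cbrt(0.4122214708f * r + 0.5363325363f * g + 0.0514459929f * b);
    const float m = std::cbrt(0.2119034982f * r + 0.6806995451f * g + 0.1073969566f * b);
    const float s = std::cbrt(0.0883024619f * r + 0.2817188376f * g + 0.6299787005f * b);

    return { 0.2104542553f * l + 0.7936177850f * m - 0.0040720468f * s,
             1.9779984951f * l - 2.4285922050f * m + 0.4505937099f * s,
             0.0259040371f * l + 0.7827717662f * m - 0.8086757660f * s };
}

// Weighted squared distance; lightness errors are weighted more heavily than
// chroma errors (the 2.0/1.0/1.0 weights are purely illustrative).
static float oklab_distance_sq(const oklab &x, const oklab &y)
{
    const float dL = x.L - y.L, da = x.a - y.a, db = x.b - y.b;
    return 2.0f * dL * dL + 1.0f * da * da + 1.0f * db * db;
}
```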
Perceptually, distortions introduced into regions of an image surrounded by high-frequency detail are less noticeable than distortions in regions containing smooth or gradient features. Before preconditioning, the encoder computes two error scaling masks which indicate which areas of the image contain large or small gradients/smooth regions. These scaling masks suppress introducing distortions (by using longer or more matches) where doing so would be too noticeable to the HVS. This step has a large impact on bitrate and can be improved.
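A minimal sketch of this idea (the actual mask computation isn't described here): derive a local activity measure from the luma, then give low-activity (smooth) regions a larger error weight so distortions there are suppressed:

```cpp
// Minimal sketch of an HVS-motivated error scaling mask, assuming a simple
// local-activity measure (max minus min luma over a 3x3 neighborhood). Smooth
// regions get a high error weight (distortion is suppressed there); busy
// high-frequency regions get a low weight. The 1.0..4.0 weight range and the
// activity normalization are illustrative placeholders.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<float> compute_error_weights(const uint8_t *rgb, int width, int height)
{
    std::vector<float> luma(width * height), weights(width * height);

    for (int i = 0; i < width * height; i++)
        luma[i] = 0.299f * rgb[i * 3 + 0] + 0.587f * rgb[i * 3 + 1] + 0.114f * rgb[i * 3 + 2];

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            float lo = 255.0f, hi = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
            {
                for (int dx = -1; dx <= 1; dx++)
                {
                    const int sx = std::clamp(x + dx, 0, width - 1);
                    const int sy = std::clamp(y + dy, 0, height - 1);
                    const float l = luma[sy * width + sx];
                    lo = std::min(lo, l);
                    hi = std::max(hi, l);
                }
            }
            // Low activity -> weight near 4.0 (protect smooth areas/gradients),
            // high activity -> weight near 1.0 (distortion hides in detail).
            const float activity = std::min((hi - lo) / 32.0f, 1.0f);
            weights[y * width + x] = 4.0f - 3.0f * activity;
        }
    }
    return weights;
}
```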
To speed up encoding, the preconditioner only examines a window region above and to the left of the lookahead buffer. LZ4's unfortunate minimum match size of 4 bytes complicates encoding of 24bpp RGB images. Encoding is not very fast due to this search process, but it's possible to thread it by working on different regions of the image in parallel. The encoder is a proof of concept and testing ground, not as fast as it could be, but it works.
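Tying these pieces together, the accept/reject decision for a candidate match replacement can be expressed as the usual Lagrangian rate-distortion test (a sketch of one common formulation, not necessarily rdopng's exact rule):

```cpp
// Sketch of a Lagrangian rate-distortion acceptance test for a candidate match
// replacement. 'distortion' is the mask-weighted Oklab error the replacement
// would introduce, 'bits_saved' is the approximate LZ4 rate reduction, and
// 'lambda' is the user quality parameter. All names here are illustrative.
static inline bool accept_match_replacement(float distortion, float bits_saved, float lambda)
{
    // Accept the replacement if the rate savings, scaled by lambda, outweigh
    // the perceptual error it introduces.
    return distortion <= lambda * bits_saved;
}
```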
The encoder also supports angular error metrics for encoding tangent space normal maps.
LZ4I images are trivial to decode in any language. A LZ4I image consists of a simple header followed by the LZ4 compressed 24bpp RGB (R first) or RGBA pixel data:
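As a rough sketch of how trivial decoding is: read the header, then hand the rest of the file to LZ4. The header fields below are assumptions for illustration only; see rdopng for the actual LZ4I header layout.

```cpp
// Hypothetical decode sketch. The lz4i_header fields shown here (magic, width,
// height, component count) are assumptions for illustration only; the actual
// LZ4I header is defined by the rdopng tool. Requires the LZ4 library.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <lz4.h>

struct lz4i_header           // hypothetical layout, for illustration
{
    uint32_t m_magic;
    uint32_t m_width;
    uint32_t m_height;
    uint32_t m_comps;        // 3 = RGB (R first), 4 = RGBA
};

bool decode_lz4i(const uint8_t *data, size_t size, std::vector<uint8_t> &pixels,
                 uint32_t &width, uint32_t &height, uint32_t &comps)
{
    if (size < sizeof(lz4i_header))
        return false;

    lz4i_header hdr;
    std::memcpy(&hdr, data, sizeof(hdr));

    width = hdr.m_width;
    height = hdr.m_height;
    comps = hdr.m_comps;
    pixels.resize((size_t)width * height * comps);

    // The rest of the file is a single LZ4 block holding the raw pixel data.
    const int res = LZ4_decompress_safe(
        (const char *)(data + sizeof(hdr)), (char *)pixels.data(),
        (int)(size - sizeof(hdr)), (int)pixels.size());

    return res == (int)pixels.size();
}
```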
Some example encodings:
Lossy LZ4I: 42.630 dB (709 Y), 12.985 bits/pixel, 1.1 gigapixels/sec. decompression rate using LZ4_decompress_safe() (on a mobile Core i7-1065G7 at 1.3GHz base clock):
Biased delta image:
Greyscale histogram of biased delta image:
Thursday, March 30, 2023
EA/Microsoft Neural Network GPU Texture Compression Patents
Microsoft: https://patents.google.com/patent/US10504248B2/en
What this boils down to: techniques like TexNN are potentially patented. TexNN was first published here (the work was done in 2017/2018, but it wasn't public until 2019):
https://cs.unc.edu/~psrihariv/publication/texnn/
Both my "crunch" library and Basis Universal bypass the need for fast real-time texture compression entirely. They compress directly in the GPU texture domain. This approach is used by Unity and many AAA video games:
https://www.mobygames.com/person/190072/richard-geldreich/
I've been researching the next series of codecs for Basis Universal. This is why I wrote my RDO PNG, QOI, and LZ4 tools, and bc7enc_rdo.
I am strongly anti-software patent. All the EA/MS patents will do is push developers towards open solutions, which will likely be better in the long run anyway. Their patents add nothing and were obvious. These corporations incentivize their developers to patent everything they can get their hands on (via bonuses etc.), which results in exploiting the system by patenting trivial or obvious ideas. Ultimately this slows innovation and encourages patent wars, which is bad for everybody.
It's possible to speed up a BC7 encoder without using neural networks. An encoder can first try the first 16 partition patterns, and find which is best. The results can be used to predict which more complex patterns are likely to improve the results. See the table and code here - this works:
https://github.com/richgel999/bc7enc_rdo/blame/master/bc7enc.cpp#L1714
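Roughly, the idea looks like this (the real prediction table and scoring live in bc7enc.cpp at the link above; the table and evaluator below are placeholders):

```cpp
// Sketch of the 2-level partition estimation described above. The prediction
// table and the per-partition evaluator are hypothetical stand-ins; the actual
// table is in bc7enc.cpp at the linked line.
#include <cstdint>

// Stand-in for the encoder's partition scoring routine (endpoint/selector
// optimization for one 2-subset partition, returning total squared error).
typedef uint64_t (*partition_eval_fn)(const uint8_t block_rgba[16 * 4], uint32_t partition);

// Hypothetical table: for each of the first 16 partitions, a short list of
// higher-numbered partitions worth trying if that partition scored best.
static const uint8_t g_predicted_partitions[16][4] = {}; // placeholder values

uint32_t estimate_bc7_partition(const uint8_t block_rgba[16 * 4], partition_eval_fn eval)
{
    uint32_t best_partition = 0;
    uint64_t best_err = UINT64_MAX;

    // Level 1: exhaustively score only the first 16 of the 64 partition patterns.
    for (uint32_t p = 0; p < 16; p++)
    {
        const uint64_t err = eval(block_rgba, p);
        if (err < best_err) { best_err = err; best_partition = p; }
    }

    // Level 2: the level-1 winner predicts a handful of related patterns to try.
    for (uint32_t i = 0; i < 4; i++)
    {
        const uint32_t p = g_predicted_partitions[best_partition][i];
        const uint64_t err = eval(block_rgba, p);
        if (err < best_err) { best_err = err; best_partition = p; }
    }

    return best_partition;
}
```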
It's potentially possible to use a 3 or 4 level hierarchy to determine the BC7 partition pattern. bc7enc.cpp only uses 2 levels. This would reduce the # of partition patterns to examine to only a handful.
To cut down the # of BC7 modes to check, you can first rule out mode 7 because it's only useful for blocks containing alpha. Then try mode 6. If it's good enough, stop. Then try mode 1. If that's good enough, stop, etc. Only a subset of blocks need to use the complex multiple subset modes. In many cases the 3 subset modes can be ignored entirely with little noticeable impact on the results. The component "rotation" feature is usually low value.
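A sketch of that mode-culling order (the per-mode trial function and the "good enough" threshold are assumed placeholders, not bc7enc_rdo's actual logic):

```cpp
// Sketch of the BC7 mode-culling order described above. try_mode() stands in
// for a full per-mode encode returning squared error; the error threshold is a
// hypothetical knob, not bc7enc_rdo's actual value.
#include <cstdint>

typedef uint64_t (*bc7_mode_trial_fn)(const uint8_t block_rgba[16 * 4], uint32_t mode);

uint32_t choose_bc7_mode(const uint8_t block_rgba[16 * 4], bool block_has_alpha,
                         bc7_mode_trial_fn try_mode, uint64_t good_enough_err)
{
    // Mode 7 is only useful for blocks containing alpha, so rule it out up
    // front for opaque blocks. (Alpha blocks would go down a 4/5/6/7 path,
    // simplified away here.)
    if (block_has_alpha)
        return 7;

    // Try the cheap single-subset mode 6 first and stop early if it's good enough.
    uint64_t best_err = try_mode(block_rgba, 6);
    uint32_t best_mode = 6;
    if (best_err <= good_enough_err)
        return 6;

    // Next the 2-subset mode 1, again with an early out.
    const uint64_t err1 = try_mode(block_rgba, 1);
    if (err1 < best_err) { best_err = err1; best_mode = 1; }
    if (err1 <= good_enough_err)
        return 1;

    // Fall back to the remaining 2-subset mode 3; the 3-subset modes (0 and 2)
    // can usually be skipped entirely with little visible impact.
    const uint64_t err3 = try_mode(block_rgba, 3);
    if (err3 < best_err) { best_err = err3; best_mode = 3; }

    return best_mode;
}
```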