Tuesday, January 27, 2026

bc7f: A New Real-Time Analytical BC7 Encoder

bc7f: Prediction, Not Search

The portable, non-SIMD bc7f encoder relies on an analytical, statistics-driven error model rather than iterative search. This full-featured (all BC7 modes, all mode features, all dual-plane channels, all partition patterns), strictly bounded O(1) real-time encoder uses simple closed-form expressions to predict which BC7 mode family (4/5, 0/2, 1/3/7, or 6) is worth considering. It then estimates the block's SSE/MSE for each candidate using lightweight block statistics derived from covariance analysis, together with the mode's weight and endpoint quantization characteristics. All of this happens before any BC7 mode is encoded. In purely analytical mode, bc7f predicts, encodes the input to a single BC7 mode configuration (without any decoding or error measurement), and returns.

BC7 block decoding is an affine interpolation between quantized endpoints using quantized weights, which allows first-order error propagation to be modeled directly. For a given block, the encoder computes basic statistics such as the covariance of the input texels; the principal axis derived from the covariance is used both for endpoint fitting and to estimate the orthogonal least-squares (“line fit”) residual error as trace(covariance) − λ₁. Quantization noise from endpoints and weights is modeled independently using uniform quantization assumptions, with endpoint error contributing an additive term and weight/index error contributing a span-dependent term proportional to the squared endpoint distance. These closed-form estimates are sufficient to predict relative SSE across BC7 mode families, partitions, and dual-plane configurations without trial encodes. As a result, bc7f can select parameters and emit a single BC7 block in strictly bounded time, producing deterministic, high-quality results without brute-force search or refinement.
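
To make the idea concrete, here is a rough sketch of what such a closed-form estimate can look like. This is my own simplified illustration, not bc7f's actual model: the function names, the constants, and the way PredictSSE weights the three terms are assumptions, and a real encoder would also account for per-mode weight tables, partitions, and dual-plane channels.

    // Hypothetical sketch of a closed-form block error estimate: line-fit residual from
    // the covariance, plus endpoint and index quantization noise terms. Simplified for
    // illustration; not bc7f's actual model.
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    struct BlockStats {
        float cov[3][3];   // mean-centered second moments of the 16 texels, divided by 16
        float trace;       // cov[0][0] + cov[1][1] + cov[2][2]
        float lambda1;     // largest eigenvalue (variance along the principal axis)
    };

    // Estimate the largest eigenvalue of the 3x3 covariance with a few power iterations.
    static float LargestEigenvalue(const float c[3][3]) {
        float v[3] = { 1.0f, 1.0f, 1.0f };
        float q = 0.0f;
        for (int iter = 0; iter < 8; iter++) {
            float t[3] = { 0.0f, 0.0f, 0.0f };
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    t[i] += c[i][j] * v[j];
            float len = std::sqrt(t[0] * t[0] + t[1] * t[1] + t[2] * t[2]);
            if (len < 1e-12f) return 0.0f;
            for (int i = 0; i < 3; i++) v[i] = t[i] / len;
            q = len;   // |C*v| converges toward lambda1 for a symmetric PSD covariance
        }
        return q;
    }

    static void ComputeStats(const uint8_t texels[16][3], BlockStats& s) {
        float mean[3] = { 0.0f, 0.0f, 0.0f };
        for (int p = 0; p < 16; p++)
            for (int c = 0; c < 3; c++) mean[c] += texels[p][c];
        for (int c = 0; c < 3; c++) mean[c] *= 1.0f / 16.0f;

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) s.cov[i][j] = 0.0f;
        for (int p = 0; p < 16; p++)
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    s.cov[i][j] += (texels[p][i] - mean[i]) * (texels[p][j] - mean[j]);
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) s.cov[i][j] *= 1.0f / 16.0f;

        s.trace = s.cov[0][0] + s.cov[1][1] + s.cov[2][2];
        s.lambda1 = LargestEigenvalue(s.cov);
    }

    // Predict a block's SSE for a candidate configuration before encoding anything.
    // endpointBits/indexBits are stand-ins for a mode's quantization characteristics.
    static float PredictSSE(const BlockStats& s, int endpointBits, int indexBits) {
        const float N = 16.0f;

        // 1) Orthogonal least-squares ("line fit") residual: trace(cov) - lambda1, per texel.
        float lineFitSSE = (s.trace - s.lambda1) * N;

        // 2) Endpoint quantization: uniform step over [0,255], variance ~ step^2 / 12.
        float epStep = 255.0f / float((1 << endpointBits) - 1);
        float endpointSSE = (epStep * epStep / 12.0f) * 3.0f * N;

        // 3) Index/weight quantization: a span-dependent term proportional to the squared
        //    endpoint distance. The span is crudely estimated here from the principal-axis
        //    variance (uniform-distribution assumption: span^2 ~ 12 * lambda1).
        float span2 = s.lambda1 * 12.0f;
        float wStep = 1.0f / float((1 << indexBits) - 1);
        float indexSSE = (wStep * wStep / 12.0f) * span2 * N;

        return lineFitSSE + endpointSSE + indexSSE;
    }

    int main() {
        uint8_t block[16][3];
        for (int p = 0; p < 16; p++)              // simple gradient test block
            for (int c = 0; c < 3; c++) block[p][c] = (uint8_t)((p % 4) * 80);
        BlockStats s;
        ComputeStats(block, s);
        printf("fine   (7-bit endpoints, 4-bit indices): predicted SSE %.1f\n", PredictSSE(s, 7, 4));
        printf("coarse (5-bit endpoints, 2-bit indices): predicted SSE %.1f\n", PredictSSE(s, 5, 2));
        return 0;
    }

Given estimates of this form for each candidate mode configuration, selecting the winner is a handful of comparisons, which is what keeps the whole encode strictly bounded.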

bc7f is significantly faster than bc7e.ispc Level 1, yet because it exploits the entire BC7 format it isn't as brittle. It's a "one-shot", non-AbS (analysis by synthesis) but full-featured encoder. The follow-up, "bc7g", is in the works and will also be released as open source.

Binomial first developed these techniques for our full-featured (all block sizes) ASTC encoder, which is vastly more complex, and later used them to implement bc7f. We expect these predictive, analytical encoding techniques to be rapidly adopted.

Friday, May 5, 2023

LZ_XOR/LZ_ADD progress

I'm tired of all the endless LZ clones, so I'm trying something different.

I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)

These new instructions are like the usual LZ matches, except XOR's are followed by a list of entropy coded byte values that are XOR'd against the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win, sometimes a big one, but I'm still benchmarking.
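
To make the op concrete, here's a tiny sketch of what applying one LZ_XOR instruction could look like on the decoder side. This is my own illustration of the idea, not the prototype's code; the function name and signature are made up.

    // Hypothetical decoder-side helper: apply one LZ_XOR op. It behaves like a normal LZ
    // match at distance 'dist', except each copied byte is XOR'd with an entropy-decoded
    // delta byte. A plain COPY is the special case where every delta is zero.
    #include <cstdint>
    #include <cstdio>

    static void ApplyLZXor(uint8_t* dst, uint32_t dstOfs, uint32_t dist, uint32_t len,
                           const uint8_t* deltas)
    {
        for (uint32_t i = 0; i < len; i++)
            dst[dstOfs + i] = (uint8_t)(dst[dstOfs + i - dist] ^ deltas[i]);
    }

    int main() {
        uint8_t out[6] = { 'a', 'b', 'c' };                     // literals already decoded
        const uint8_t deltas[3] = { 0, 0, (uint8_t)('c' ^ 'd') };
        ApplyLZXor(out, 3, 3, 3, deltas);                       // "abc" reused to emit "abd"
        printf("%.6s\n", out);                                  // prints "abcabd"
        return 0;
    }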

The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own, because there's nobody to copy ideas from. The usual parsing heuristic tricks that everybody copies from LZMA no longer work, or have to be modified.

One prototype is byte-oriented and somewhat fast to decompress (>1 GiB/sec.); the other is like LZMA and uses a bitwise range coder. Fuzzy matching is difficult, but I've made a lot of headway. It's no longer a terrifying search problem; now it's just scary.

The ratio of XOR's vs. literals or COPY ops depends heavily on the source data. On plain text XOR's are weak and not worth the trouble. They're extremely strong on audio and image data, and they excel on binary or structured content.

With the LZMA-like codec, LZ_XOR instructions using mostly zero delta bytes can become so cheap to code that they're preferred over COPY's, which is surprising to see at first. It can be cheaper to extend an LZ_XOR with a few more delta bytes than to truncate it and start a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's.

COPY ops must stop on the first mismatch, while XOR ops can match right through minor mismatches and still show a net gain. Adding XOR ops can drastically reduce the overall number of instructions the VM (the "decompressor") has to process, and it also gives the parser far more freedom to trade off instruction count vs. ratio. It's not all about ratio; it's about decompression speed.

Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.

The other improvement that shows net gain on many files is to emit an optional "history distance delta offset" value. This allows the encoder to specify a [-128,127] offset relative to one of the "REP" match history distances. The offset is entropy coded.
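
A sketch of how the decoder might resolve such a distance. Again, this is my own illustration rather than the codec's actual code, and the names are invented:

    // Hypothetical: resolve a match distance sent as a small signed delta relative to one
    // of the recent ("REP") match history distances. The delta itself is entropy coded.
    #include <cstdint>
    #include <cstdio>

    static uint32_t ResolveMatchDistance(const uint32_t repDists[4], uint32_t repSlot, int8_t delta)
    {
        return (uint32_t)((int64_t)repDists[repSlot] + delta);   // delta is in [-128, 127]
    }

    int main() {
        const uint32_t repDists[4] = { 1024, 512, 64, 8 };
        printf("%u\n", ResolveMatchDistance(repDists, 1, -3));   // prints 509
        return 0;
    }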

Sunday, April 30, 2023

Somewhere Range Coding went off the rails

I've lost track of the number of Range Coders I've seen floating around for the past ~25 years. Subbotin's coders released in the late 90's/early 2000's seem to be very influential. Each implementation has different tradeoffs that greatly impact a vectorized implementation. There is no single definitive range coder design, and I've seen 3 very different ways to handle carries. I learned a lot implementing my first vectorized range decoder.

Most of the tradeoffs boil down to how carries are handled or how many state registers are used by the decoder. You can delay the carries during encoding (Subbotin's earliest "carry-less" or delayed-carry encoder that I can find, which is very clever), fix up the already-encoded bytes by propagating the carries backwards in memory during encoding (LZMA's binary range coder, which is similar to Subbotin's), or move the carry logic into the decoder (another Subbotin variant called the "Russian People's Range Coder"). There are also decoder variants that track low/code/range vs. just code/range (LZMA's and Subbotin's earliest).

Importantly, range coders using this renormalization method during decoding are unnecessarily complex:

https://github.com/jkbonfield/rans_static/blob/master/arith_static.c#L222

Here's a different Subbotin variant that also uses overly complex and obfuscated renormalization logic during decoding. (Please just stop guys - this is the wrong direction.)

Why is all that messy conditional logic in the decoder's renormalization step? It appears to be carry related, but the carries should have been handled in some way during encoding. None of it is necessary to create a working, correct range coder.

Instead, do this, from Subbotin's first delayed-carry range coder. It's easily vectorizable, and the decoder only uses two variables (code and range). It's quite similar to LZMA's and LZHAM's, except it doesn't need to propagate carries back through the buffer containing the output bytes.

https://gist.github.com/richgel999/d522e318bf3ad67e019eabc13137350a
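
For reference, here's roughly the shape of that kind of decoder. This is my paraphrase of the general two-register design for illustration, not a copy of the gist; the matching encoder (which is where the carries get dealt with) is omitted, and it must partition the range with exactly the same arithmetic.

    // Sketch of a two-register (code, range) range decoder. Renormalization is a simple
    // "shift in another byte while range is too small" loop, with no carry fixups, because
    // the carries were already handled on the encoder side.
    #include <cstdint>

    struct RangeDecoder {
        static const uint32_t kTopValue = 1u << 24;

        const uint8_t* buf = nullptr;
        const uint8_t* bufEnd = nullptr;
        uint32_t code = 0;
        uint32_t range = 0xFFFFFFFFu;

        uint8_t NextByte() { return (buf < bufEnd) ? *buf++ : 0; }

        void Init(const uint8_t* p, const uint8_t* pEnd) {
            buf = p; bufEnd = pEnd;
            code = 0; range = 0xFFFFFFFFu;
            for (int i = 0; i < 4; i++) code = (code << 8) | NextByte();
        }

        // Scale the range by the total frequency count, then return the cumulative
        // frequency the current code falls into (this is where the division lives).
        uint32_t GetFreq(uint32_t totFreq) {
            range /= totFreq;
            return code / range;
        }

        // Remove the decoded symbol's interval [cumFreq, cumFreq + freq) and renormalize.
        void Decode(uint32_t cumFreq, uint32_t freq) {
            code -= cumFreq * range;
            range *= freq;
            while (range < kTopValue) {
                code = (code << 8) | NextByte();
                range <<= 8;
            }
        }
    };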

Another important note: if you switch to 24-bit registers instead of 32-bit, you don't need integer division. (Division isn't needed for a binary range coder at all. It's also not strictly needed for a range coder that only supports small alphabets; the division is what enables a fast decoding table for large alphabets.) Instead you can use single precision division and truncate the results. This option is crucial for vectorization on modern CPU's. Modern vectorized single precision division is fairly fast now; put another way, it's not so slow that it's a showstopper.
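
A minimal sketch of that replacement, assuming the decoder's code and scaled range values are both kept below 2^24 and the divisor is nonzero; the helper name is mine:

    // With the decoder's values limited to 24 bits, both operands fit exactly in a float's
    // 24-bit significand, so the quotient can be computed with a single-precision divide
    // plus a truncating convert. In a vectorized decoder this is one SIMD division across
    // all lanes instead of a scalar integer divide per lane.
    #include <cstdint>

    static inline uint32_t DivTrunc24(uint32_t code24, uint32_t r24) {
        // Assumes code24 < 2^24, r24 < 2^24, r24 != 0.
        return (uint32_t)((float)code24 / (float)r24);
    }

In SSE/AVX the truncating convert maps to _mm_cvttps_epi32, so the scalar pattern above translates directly into a packed divide followed by a packed convert.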


In vectorized implementations of both Huffman and range decoding, most of the instructions aren't related to the entropy coding method itself. Instead they're common operations: fetching from the decoder's lookup table (using some sort of manual or explicit gather), and fetching source bytes and distributing them to the lanes during renormalization. The FP division is a small part of the whole process.

What seems to slow down vectorized range decoding is the more complex renormalization vs. Huffman, which in my implementation can fetch anywhere from 0 to 2 bytes per step. By comparison, my vectorized Huffman decoders fetch either 0 or 2 bytes per step.

All of these details are starting to matter now that rANS coding is falling under patent protection. Range Coding is a roughly 44-year-old technique.

Comparing Vectorized Huffman and Range Decoding vs. rANS (Or: rANS entropy coding is overrated)

The point here is to show that both Huffman and range decoding are vectorizable and competitive with rANS in software.

What I care about is fast entropy decoding. Even scalar encoding of any of these techniques is already fast enough for most use cases. For the distribution use case, you encode once on some server(s) and then distribute to millions of users.

I've constructed fast SSE 4.1 and AVX-2 vectorized decoders for range coding and Huffman coding. The SSE 4.1 vectorized 24-bit range decoder (using just 16 streams for simplicity) is here on github. This code is easily ported to AVX-2 and Huffman in a day or two. The Steam Hardware Survey shows that 99.58% of surveyed users (as of 3/23) support SSE 4.1, so it seems pointless to focus on or benchmark scalar decoders anymore.

There's nothing particularly special or exotic about these implementations. The range coder is LZMA's binary range coder (which is also used by LZHAM), which I adapted to handle 8-bit symbols with 12-bit probabilities and 24-bit registers. It appears quite similar to a range coder released by Dmitry Subbotin, except it handles carries differently. The Huffman codes are length-limited using the Package-Merge algorithm.

The tricky renormalization step (where each vector lane is fed with more bytes from the source streams) is basically the same between rANS, Huffman and range decoding. Range decoding needs integer division which is scary at first, but the CPU engineers have solved that problem (if all you need is 24-bits). Also, each decoder uses some sort of table to accelerate decoding. 

The vectorized versions are surprisingly similar once implemented. They do some math, gather from a table, write the output symbols, do some more math, then fetch some bytes from the source buffer, renormalize, and distribute those bytes to the lanes in parallel. Once you've implemented one well, you've more or less got the others.

Here are some benchmarks on the same machine (a Dell Ice Lake laptop, Core i7 1065G7) on Calgary Corpus book1 (768,771 bytes). All coders use 8-bit symbols. 

SSE 4.1 Huffman/Range Decoding

  • Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
  • Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)


AVX-2 Huffman/Range Decoding

  • Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
  • Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes

Notes: The range coder uses 12 bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.


Collet's Scalar Huffman/FSE Decoding

Using his fast scalar Huffman and FSE coders (FSE -b -h, or FSE -b):
  • Huffman: 840 MiB/sec., 439,150 bytes
  • FSE: 365 MiB/sec., 437,232 bytes
Also see FPC on github by Konstantinos Agiannis, which gets ~814 MiB/sec. on a Core i5 for scalar length-limited Huffman decoding.

Giesen's Vectorized rANS Decoding

Using exam_simd_sse41, running on the same machine using WSL and compiled with clang v10:

  • rANS: 213.9 MiB/s, 435,604 bytes
  • interleaved rANS: 324.7 MiB/s, 435,606 bytes
  • SIMD rANS: 556.5 MiB/s, 435,626 bytes

Under Windows 10 all code was built with MSVC 2019 x64.

It's easy to construct a Huffman decoding table that supplies 2-3 symbols per decode step.  (My Huffman benchmarks don't demonstrate this ability, but they could if I limited the max Huffman code size to 6 or 7 bits, which is practical for some problems.) AFAIK this ability has not been demonstrated with rANS. Until it is, Huffman isn't going anywhere. Considering rANS is patented, it's just not worth the risk and trouble for questionable decoding gains when usable alternatives exist. I'll revisit this in 20+ years (the lifetime of patents in the US).
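
Here's a toy sketch of the multi-symbol table idea (written for illustration, not taken from my decoders). With a hypothetical 4-symbol code whose max length is 3 bits, a 6-bit table index always covers at least two complete codes, so each lookup can emit two symbols:

    // Toy sketch of a Huffman table whose entries hold up to two decoded symbols.
    // Hypothetical canonical code: A=0, B=10, C=110, D=111 (max length 3), so a 6-bit
    // window contains at least two full codes except at the very end of the stream.
    #include <cstdint>
    #include <cstdio>

    struct Code  { uint8_t sym; uint8_t len; uint8_t bits; };      // MSB-first code
    struct Entry { uint8_t syms[2]; uint8_t count; uint8_t bitsUsed; };

    static const Code kCodes[4] = { {'A',1,0x0}, {'B',2,0x2}, {'C',3,0x6}, {'D',3,0x7} };

    // Greedily decode up to two symbols from a 6-bit window (MSB-first).
    static Entry DecodeWindow(uint32_t window) {
        Entry e = { {0,0}, 0, 0 };
        uint32_t pos = 0;
        while (e.count < 2) {
            bool matched = false;
            for (int k = 0; k < 4; k++) {
                const Code& c = kCodes[k];
                if (pos + c.len > 6) continue;
                uint32_t peek = (window >> (6 - pos - c.len)) & ((1u << c.len) - 1);
                if (peek == c.bits) { e.syms[e.count++] = c.sym; pos += c.len; matched = true; break; }
            }
            if (!matched) break;
        }
        e.bitsUsed = (uint8_t)pos;
        return e;
    }

    int main() {
        Entry table[64];
        for (uint32_t i = 0; i < 64; i++) table[i] = DecodeWindow(i);   // built once

        // Bitstream for "ABCDAD": 0 10 110 111 0 111 = 0b0101101110111 (13 bits, MSB-first).
        const uint32_t stream = 0x0B77;
        const int streamLen = 13, numSyms = 6;

        int pos = 0, decoded = 0;
        while (decoded < numSyms) {
            uint32_t window = 0;
            for (int b = 0; b < 6; b++) {                 // peek 6 bits, zero-padded at the end
                int idx = pos + b;
                int bit = (idx < streamLen) ? (int)((stream >> (streamLen - 1 - idx)) & 1) : 0;
                window = (window << 1) | (uint32_t)bit;
            }
            const Entry& e = table[window];               // one lookup, up to two symbols
            for (int s = 0; s < e.count; s++) putchar(e.syms[s]);
            decoded += e.count;
            pos += e.bitsUsed;
        }
        putchar('\n');                                    // prints "ABCDAD"
        return 0;
    }

A production decoder would of course peek from a real bit reader and pick the window width from the code length limit, but the table shape (symbols, count, bits consumed per entry) is the whole trick.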