Tuesday, February 10, 2026

ASTC Texture Sampling with Deblocking in a Simple Pixel Shader

Deblocking is a standard feature in modern image/video codecs, and now developers can benefit from deblocking on GPU textures, either while transcoding to other formats like BC7, or while sampling ASTC textures directly.

This demo with source code shows how to sample ASTC textures (or really any block-based GPU texture format, of any block size) with deblocking applied in a simple pixel shader. It's intended for the larger ASTC block sizes, i.e. beyond 6x6. It greatly reduces block artifacts, allowing larger block sizes to be used across a wider range of content, which ultimately lowers bitrates, memory bandwidth, and download sizes. It's fully compatible with mipmap filtering.
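
The post doesn't describe the filter itself (that's in the linked repo), but the general flavor of per-sample deblocking can be sketched. The C++ fragment below is purely illustrative and is not the repo's shader or algorithm: it blends a texel with the corresponding texel in the adjacent block, with a weight that ramps up toward the block edge. Only the horizontal direction is shown, and the fetch callback and weight formula are assumptions.

    // Hand-wavy illustration of per-sample deblocking (NOT the repo's actual filter).
    // Only the horizontal block edges are handled, for brevity.
    #include <algorithm>

    struct RGBA { float r, g, b, a; };

    static RGBA lerp(const RGBA& a, const RGBA& b, float t)
    {
        return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
                 a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
    }

    // fetch(x, y) returns the decoded texel at integer coordinates (caller clamps/wraps).
    static RGBA sample_deblocked_x(RGBA (*fetch)(int, int), int x, int y, int block_w)
    {
        RGBA c = fetch(x, y);

        int xin = x % block_w;                                 // position within this block
        bool left_half = xin < block_w / 2;
        float edge_dist = left_half ? (xin + 0.5f) : (block_w - xin - 0.5f);
        float w = std::max(0.0f, 0.5f - edge_dist / block_w);  // 0 deep inside the block, ~0.5 at the edge

        // Corresponding texel in the neighboring block across the nearest vertical edge.
        RGBA n = fetch(x + (left_half ? -block_w : block_w), y);
        return lerp(c, n, w);
    }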

https://github.com/BinomialLLC/basis_universal/tree/master/shader_deblocking

This is a form of "GPU texture compression-aware shading" or "GPU format-informed reconstruction".


Tuesday, January 27, 2026

bc7f: A New Real-Time Analytical BC7 Encoder

bc7f: Prediction, Not Search

The portable, non-SIMD bc7f encoder relies on an analytical, statistics-driven error model rather than iterative search. This full-featured (all BC7 modes, all mode features, all dual-plane channels, all partition patterns), strictly bounded O(1) real-time encoder exploits simple closed-form expressions to predict which BC7 mode family (4/5, 0/2, 1/3/7, or 6) is worth considering. It then estimates the block’s SSE/MSE for each candidate using lightweight block statistics derived from covariance analysis, together with the mode’s weight and endpoint quantization characteristics. All of this is performed before any BC7 mode is actually encoded. In purely analytical mode, bc7f predicts, encodes the input to a single BC7 mode configuration (without any decoding or error measurement), and returns.

BC7 block decoding is an affine interpolation between quantized endpoints using quantized weights, which allows first-order error propagation to be modeled directly. For a given block, the encoder computes basic statistics such as the covariance of the input texels; the principal axis derived from the covariance is used both for endpoint fitting and to estimate the orthogonal least-squares (“line fit”) residual error as trace(covariance) − λ₁. Quantization noise from endpoints and weights is modeled independently using uniform quantization assumptions, with endpoint error contributing an additive term and weight/index error contributing a span-dependent term proportional to the squared endpoint distance. These closed-form estimates are sufficient to predict relative SSE across BC7 mode families, partitions, and dual-plane configurations without trial encodes. As a result, bc7f can select parameters and emit a single BC7 block in strictly bounded time, producing deterministic, high-quality results without brute-force search or refinement.
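
As a rough illustration of the closed-form estimate described above, here is a self-contained C++ sketch. It computes the block's channel covariance, extracts the principal-axis variance with a few power iterations, and forms an SSE prediction from the line-fit residual plus endpoint and weight quantization terms. The 4x4 assumption, the uniform-quantization constants, and the span approximation are mine, not bc7f's actual model; the point is only to show the shape of the computation, which happens before anything is encoded.

    // Illustrative closed-form SSE estimate for a 4x4 block. This is a sketch of the
    // idea only, NOT bc7f's code; the constants and exact weighting are assumptions.
    #include <cmath>

    struct BlockStats
    {
        double mean[4];     // per-channel mean
        double cov[4][4];   // channel covariance matrix over the 16 texels
        double trace;       // trace of the covariance
        double lambda1;     // largest eigenvalue (variance along the principal axis)
    };

    static BlockStats compute_stats(const double texels[16][4], int num_comps)
    {
        BlockStats s = {};
        for (int i = 0; i < 16; i++)
            for (int c = 0; c < num_comps; c++)
                s.mean[c] += texels[i][c] / 16.0;

        for (int i = 0; i < 16; i++)
            for (int r = 0; r < num_comps; r++)
                for (int c = 0; c < num_comps; c++)
                    s.cov[r][c] += (texels[i][r] - s.mean[r]) * (texels[i][c] - s.mean[c]) / 16.0;

        for (int c = 0; c < num_comps; c++)
            s.trace += s.cov[c][c];

        // A few power iterations are enough to estimate the largest eigenvalue.
        double v[4] = { 1, 1, 1, 1 };
        for (int iter = 0; iter < 8; iter++)
        {
            double t[4] = {}, len = 0;
            for (int r = 0; r < num_comps; r++)
                for (int c = 0; c < num_comps; c++)
                    t[r] += s.cov[r][c] * v[c];
            for (int c = 0; c < num_comps; c++)
                len += t[c] * t[c];
            len = std::sqrt(len);
            if (len < 1e-12)
                break;
            for (int c = 0; c < num_comps; c++)
                v[c] = t[c] / len;
            s.lambda1 = len;
        }
        return s;
    }

    // Predict the SSE of one candidate configuration from its endpoint/weight precision.
    static double estimate_sse(const BlockStats& s, int num_comps, int endpoint_bits, int weight_bits)
    {
        // Orthogonal least-squares ("line fit") residual: trace(cov) - lambda1, summed over 16 texels.
        double line_fit_sse = 16.0 * (s.trace - s.lambda1);

        // Endpoint quantization: uniform quantizer noise, step^2/12 per channel per texel (additive term).
        double ep_step = 255.0 / ((1 << endpoint_bits) - 1);
        double ep_sse = 16.0 * num_comps * (ep_step * ep_step) / 12.0;

        // Weight/index quantization: span-dependent term proportional to the squared endpoint distance.
        // The span is approximated from the principal-axis variance (span^2 ~ 12 * lambda1; an assumption).
        double span_sq = 12.0 * s.lambda1;
        double w_step = 1.0 / ((1 << weight_bits) - 1);
        double wt_sse = 16.0 * (w_step * w_step) / 12.0 * span_sq;

        return line_fit_sse + ep_sse + wt_sse;
    }

A one-shot encoder in this style would evaluate such an estimate for each mode family, partition, and dual-plane candidate, pick the minimum, and encode that single configuration with no decode or error measurement afterwards.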

bc7f is significantly faster than bc7e.ispc Level 1, yet because it exploits the entire BC7 format it isn’t as brittle. It's a “one-shot”, non-AbS (analysis by synthesis), yet full-featured encoder. The follow-up, “bc7g”, is in the works, and it will be released as open source as well.

Binomial first developed these techniques for our full-featured (all block sizes) ASTC encoder, which is vastly more complex, and later used them to implement bc7f. We expect these predictive, analytical encoding techniques to be rapidly adopted.

Friday, May 5, 2023

LZ_XOR/LZ_ADD progress

I'm tired of all the endless LZ clones, so I'm trying something different.

I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)

These new instructions are like the usual LZ matches, except that XOR's are followed by a list of entropy-coded byte values that are XOR'd with the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win (sometimes a big win), but I'm still benchmarking.
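
The decode side of such an op is tiny. Below is a minimal C++ sketch, assuming a byte-oriented output buffer and already-decoded delta bytes; the prototypes' actual bitstream layout and entropy coding aren't shown here.

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Apply one LZ_XOR instruction: copy 'len' bytes starting 'dist' bytes back in the
    // output (the sliding dictionary), XORing each copied byte with a delta byte.
    // With all-zero delta bytes this degenerates into an ordinary COPY/match.
    // An LZ_ADD variant would compute (uint8_t)(out[src + i] + delta_bytes[i]) instead.
    static void apply_lz_xor(std::vector<uint8_t>& out, size_t dist, size_t len,
                             const uint8_t* delta_bytes)
    {
        size_t src = out.size() - dist;
        for (size_t i = 0; i < len; i++)
            out.push_back(out[src + i] ^ delta_bytes[i]);    // byte-by-byte so overlapping matches work
    }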

The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own, because there's nobody to copy ideas from. The usual bag of parsing heuristic tricks that everybody copies from LZMA no longer works, or has to be modified.

One prototype is byte-oriented and somewhat fast to decompress (>1 GiB/sec.); the other is LZMA-like and uses a bitwise range coder. Fuzzy matching is difficult, but I've made a lot of headway. It's no longer a terrifying search problem; now it's just scary.

The ratio of XOR's vs. literals or COPY ops depends heavily on the source data. On plain text XOR's are weak and not worth the trouble. They're extremely strong on audio and image data, and they excel on binary or structured content.

With the LZMA-like codec, LZ_XOR instructions using mostly zero delta bytes can become so cheap to code that they can be preferred over COPY's, which is surprising to see at first. It can be cheaper to extend an LZ_XOR with a few more delta bytes than to truncate it and start a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's.

COPY ops must stop at the first mismatch, while XOR ops can match right through minor mismatches and still show a net gain. Adding XOR ops can drastically reduce the overall number of instructions the VM (the "decompressor") has to process, and it also gives the parser more freedom to trade off instruction count against ratio. It's not all about ratio; it's also about decompression speed.
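
To make the contrast concrete, here's a small sketch of the match-extension side: a plain COPY candidate stops at the first differing byte, while an XOR candidate keeps going and records nonzero deltas. The mismatch budget below is a made-up stand-in for whatever rate-based heuristic a real parser would use; it is not the prototypes' actual logic.

    #include <cstdint>
    #include <cstddef>

    // Extend an LZ_XOR match candidate. Unlike a COPY it may run through mismatching
    // bytes, turning each one into a nonzero delta byte. 'max_mismatches' is an
    // illustrative budget only; a real parser would weigh estimated coded cost instead.
    static size_t extend_fuzzy_match(const uint8_t* cur, const uint8_t* dict_src,
                                     size_t max_len, size_t max_mismatches,
                                     uint8_t* deltas_out)
    {
        size_t len = 0, mismatches = 0;
        while (len < max_len)
        {
            uint8_t delta = cur[len] ^ dict_src[len];
            if (delta && (++mismatches > max_mismatches))
                break;
            deltas_out[len++] = delta;    // zero deltas are nearly free to entropy code
        }
        return len;
    }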

Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.

The other improvement that shows a net gain on many files is to emit an optional "history distance delta offset" value. This allows the encoder to specify a [-128,127] offset relative to one of the "REP" match history distances. The offset is entropy coded.
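
On the decode side that reconstruction is a one-liner. A minimal sketch with illustrative names (the actual REP history size and the entropy coding of the offset aren't specified in the post):

    #include <cstdint>

    struct MatchHistory { uint32_t rep_dist[4]; };    // most recently used match distances

    // Reconstruct a match distance from a REP history slot plus a small signed,
    // entropy-coded offset in [-128, 127]. A zero offset is a plain REP match.
    static uint32_t decode_match_distance(const MatchHistory& hist, unsigned rep_index,
                                          int delta_offset)
    {
        return (uint32_t)((int64_t)hist.rep_dist[rep_index] + delta_offset);
    }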