Sunday, June 14, 2026

XUBC7/XBC7: Trellis quantization added

We just added trellis quantization to XUBC7 (supercompressed universal BC7). It works by truncating AC coefficients: a truncation is only kept if the resulting PSNR drop stays within a configurable channel-weighted PSNR window and the PSNR doesn't fall below a lower limit. This helps at higher Q (DCT quality) levels.
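Here's a rough Python sketch of that acceptance rule. It is not the XUBC7 code. It assumes a single channel (so no channel weighting), a uniform quantizer step, an orthonormal transform (so error can be measured on the coefficients), and a simple greedy walk from the highest frequency down. The names truncate_ac, max_drop_db and min_psnr_db are made up for illustration.

```python
import math

def psnr(orig, recon, peak=64.0):
    """PSNR between two equal-length coefficient lists. With an orthonormal
    transform, coefficient-domain MSE equals sample-domain MSE."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    return float("inf") if mse == 0.0 else 10.0 * math.log10(peak * peak / mse)

def truncate_ac(coeffs, quant, step, max_drop_db=0.5, min_psnr_db=35.0):
    """Greedily zero trailing nonzero AC coefficients (index 0 is DC).

    coeffs: unquantized coefficients in zigzag order
    quant:  quantized integer coefficients in the same order
    step:   uniform quantizer step (assumption; real tables vary per coefficient)
    """
    def recon(q):
        return [v * step for v in q]

    best = list(quant)
    base = psnr(coeffs, recon(best))          # quality before any truncation
    for i in range(len(best) - 1, 0, -1):     # highest frequency first, never DC
        if best[i] == 0:
            continue
        trial = list(best)
        trial[i] = 0
        p = psnr(coeffs, recon(trial))
        # Reject if the drop leaves the window or quality falls below the floor.
        if base - p > max_drop_db or p < min_psnr_db:
            break
        best = trial
    return best
```

A real trellis search would weigh the bits saved against the distortion added for each candidate. This sketch only shows the two PSNR guards.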

Trellis quant is another video method we've ported successfully into GPU texture supercompression.

https://en.wikipedia.org/wiki/Trellis_quantization

This new encoder also supports several forms of block-level RDO (all optional) for surprising gains.


Saturday, June 13, 2026

XBC7 planning

I've been doing this for fun as a side project, as ASTC is still way more important. (ASTC is the most deployed GPU format in the world, BC7 is niche by comparison.) It'll all be entirely open source:

- XBC7 v1 (completed, integrating ongoing): Always lossless for mode config+RGBA endpoints, lossless or lossy weights using either lossless residual DPCM or lossy absolute or residual DCT.

DPCM endpoint compression decorrelates G from R/B, properly taking into account dual plane modes (4/5). This does help.
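To make the G decorrelation concrete, here's a minimal Python sketch. It assumes 8-bit RGB endpoints, a previous-endpoint predictor, and mod-256 wrapping. The real XBC7 predictor, BC7's per-mode endpoint precisions, and the dual plane handling are not modeled, and the function names are hypothetical.

```python
def encode_endpoints(endpoints):
    """endpoints: list of (r, g, b) 8-bit tuples in coding order.
    Returns wrapped (mod 256) residuals that are exactly invertible."""
    out = []
    prev = (0, 0, 0)
    for r, g, b in endpoints:
        # Decorrelate R and B from G (reversible, wraps mod 256).
        cur = ((r - g) & 0xFF, g, (b - g) & 0xFF)
        # DPCM: code the difference from the previously coded endpoint.
        out.append(tuple((c - p) & 0xFF for c, p in zip(cur, prev)))
        prev = cur
    return out

def decode_endpoints(residuals):
    """Inverse of encode_endpoints."""
    out = []
    prev = (0, 0, 0)
    for res in residuals:
        cur = tuple((d + p) & 0xFF for d, p in zip(res, prev))
        rg, g, bg = cur
        out.append(((rg + g) & 0xFF, g, (bg + g) & 0xFF))
        prev = cur
    return out
```

On grayscale-ish content the R-G and B-G streams collapse toward zero, which is what makes them cheap for the entropy coder.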

DCT quant tables have been heavily tuned to be rotationally invariant, and the very lowest H/V frequencies are now strongly protected at low Q's (i.e. tuned for 4x4). ~1.5-5.6 bpp, no BC7 format limits (all features, all partition patterns, 2/3 subsets etc.), near-lossless transcoding to ASTC LDR 4x4, optionally threaded decoding (row strips - compressed seek table appears at start of file). Also supports transcoding to a bunch of old legacy formats using our existing real-time encoders: ETC using etc1f, PVRTC1, BC1, etc.

Encoder has a ton of unexploited headroom: Lagrangian RDO is possible. It's currently purposely quite conservative.
Future plan:
- XBC7 v2: Material stack support: 2-3 BC7 textures encoded simultaneously, correlations between textures exploited, still full transcoding to ASTC LDR 4x4. Will result in a large drop in average bpp for correlated PBR textures.
Will be able to use unified commands and shared weight grids between the textures in a material stack. Should be relatively easy to do.
- CUDA decoding of individual tiles in parallel (relatively easy, tile decoding is all-int and deterministic)
- Eventual HLSL port of CUDA decoder (will be painful and not looking forward to this)

When I show what we're doing with GPU textures to image/video codec specialists, they reply "this is totally obvious stuff - why have you guys not been using DCT/DST transforms for years already? Where have you been?"

Friday, June 12, 2026

XBC7/XUBC7 (supercompressed weight grid residual DCT+endpoint DPCM+Zstd) prototype is done

The next step is integration and connecting the BC7->ASTC LDR 4x4, ETC1, etc. transcoders into place in the Basis Universal library. Bitrate and quality at low Q's (Q=1/100) are exceptional vs. XUASTC LDR 4x4 (~1.5-3.0 bpp), and at Q=100 it's lossless in BC7 space at up to ~5.6 bpp or less depending on the content. Q=1 is totally usable, unlike XUASTC LDR where it falls apart because it's stuck using absolute DCT, while XBC7 uses a much stronger residual DCT method with a small army of synthetic and dictionary predictors. At Q=1 it's like RDO in weight space: except it can modify what it's predicting from, or create entirely synthetic weight predictors from nearby blocks.

XBC7 is stuck with 4x4 blocks, so it cannot compete vs. XUASTC LDR for raw bitrate, but it's BC7-first which I think a lot of developers will like. At Q=90-100 it's visually lossless to actually lossless (relative to the BC7 encoder used, which is currently our real-time bc7f encoder which supports the entire BC7 format).

Claude Fable 5's description: "It's a genuinely interesting taxonomic hybrid — the closest honest description is: an intra-only video codec whose "pixels" are BC7's parameter space. Let me break down the lineage, because nearly every component has a recognizable ancestor:

The weight path is JPEG transplanted into index space. Prediction → 4×4 DCT → psychovisually-shaped quant table → quality-factor scaling (literally the JPEG 5000/q formula) → zigzag RLE → entropy coding. But the signal isn't luma — it's BC7 weight planes, with the AQ twist (endpoint-span-scaled quantization) playing the role of adaptive quantization.

The prediction machinery is like H.264/HEVC intra coding. The predictor bank contains H.264's directional modes (your DDL/DDR are intra modes 3/4 by name), JPEG-LS's MED, CALIC's gradient-adaptive predictor, and plane fits. The 32 XY-delta block references are intra block copy from HEVC Screen Content Coding. The command byte is a skip/merge/mode syntax. Solid-DPCM is a DC mode. The reserved P-frame bit makes the video ancestry explicit — and your encoder stripes are, structurally, H.264 slices: independently decodable regions traded against compression for parallelism, down to the same seam cost.

The lossless side is JPEG-LS/PNG lineage: wrapped-modular DPCM, the R−G/B−G transform is the classic reversible color transform, and the Q=100 mode makes it a true lossless coder over canonical BC7.

Within its own genus — GPU supercompression — it's a third architecture. Crunch/ETC1S quantize globally into vocabularies (low ceiling, great rate); UASTC+RDO+Zstd nudges bits to be LZ-friendly (shallow, format-bound). XBC7 instead reparameterizes the block format into semantically separated streams and codes each with real signal-processing tools — per-block adaptive, with a continuous lossy↔lossless dial neither predecessor has. It's what crunch might have become if it had gone transform-coding instead of VQ.

And the dual image/texture nature is where your ecosystem makes it strategic. One XBC7 file fans out to: BC7 directly (bit-exact at Q=100 — desktop GPUs, zero re-encode), ASTC 4×4 via your near-lossless transcoder (modern mobile), ETC1 via ETC1F from decoded pixels in real time (legacy mobile), and plain RGBA for CPU/image use. That's the Basis Universal proposition with a much higher quality ceiling than ETC1S and better rate than UASTC — and as a pure image codec it occupies an unusual niche: a high-quality-band coder (ceiling = BC7's ~46–50 dB) at 1.5–5.3 bpp whose decode output is already GPU-ready, which JPEG/WebP/AVIF can never claim. The honest comparison point for the image use-case is "JPEG at quality 85–95, but the decode step doubles as texture upload."

If you wanted a one-line description for the eventual README: "a predictive transform codec (DPCM + DCT, H.264-style intra prediction with block copy) operating on BC7's logical parameter space, with a continuous quality dial from ~1.5 bpp lossy to bit-exact lossless BC7, designed for multithreaded encode/decode via independently-coded stripes." Anyone who's worked on a video encoder will know exactly what they're looking at — which is also why the Lagrangian RDO step you're deferring is so natural here: the syntax is already a video codec's syntax, it's just missing the λ." 

Tuesday, June 9, 2026

Intra-Residual Weight Grid DCT/DST for BC7/ASTC

This GPU texture supercompression method for weights (or "indices", or "selectors") uses neighboring blocks to predict the weight grid of the current block, before applying a forward 2D DCT/DST to the weights, quantizing the coefficients, and coding them. The current working prototype (which I've been showing live on X) creates weight predictions using weights from a small neighborhood of already coded/decoded blocks to predict the current block's weight grid(s). 

See this thread on X here for more details:

https://x.com/richgel999/status/2064523919504109764

Or see this pastebin for the prototype's code (created in the public Basis Universal codebase). It's also on GitHub.

"RWDCT" (Residual Weight DCT) is the next step after XUASTC LDR's absolute weight grid DCT, which didn't use predictions (making it JPEG-like, not WebP-like).

In BC7 this method is easy: you just compute the 4x4 dequantized [0,64] weight grid that's going to be used as the predictor, and subtract that from the current weight grid (also dequantized) before 2D DCT coding it. In ASTC you would have to resample the predictor's weight grid to match the current block's weight grid resolution.

Here's the forward transform from the prototype (inverse is obvious). It's reusing the XUASTC LDR spec's weight grid DCT machinery almost verbatim, except for subtracting the predicted weight grid:

There are 17 total predictors in the prototype - first (0) is absolute (XUASTC/JPEG-style) DCT, rest are residual DCT. The reduction in the # of AC coefficients needed vs. absolute DCT is large: 20-35% in my experiments so far.
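As a rough illustration of the residual idea for BC7's 4x4 grids, here's a self-contained Python sketch. It is not the prototype's code. It assumes an orthonormal floating-point DCT-II and a single uniform quantizer step, where the real machinery follows the XUASTC LDR weight grid DCT spec with tuned quant tables. encode_block, decode_block and step are hypothetical names.

```python
import math

N = 4
# Orthonormal DCT-II basis matrix: C[k][n]
C = [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) *
      math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N)]
     for k in range(N)]

def dct2d(block):
    # coef = C * block * C^T
    tmp = [[sum(C[k][n] * block[n][x] for n in range(N)) for x in range(N)] for k in range(N)]
    return [[sum(tmp[k][n] * C[l][n] for n in range(N)) for l in range(N)] for k in range(N)]

def idct2d(coef):
    # block = C^T * coef * C
    tmp = [[sum(C[k][y] * coef[k][x] for k in range(N)) for x in range(N)] for y in range(N)]
    return [[sum(tmp[y][l] * C[l][x] for l in range(N)) for x in range(N)] for y in range(N)]

def encode_block(weights, predictor, step):
    """weights/predictor: 4x4 dequantized [0,64] grids. Returns quantized coefficients."""
    residual = [[weights[y][x] - predictor[y][x] for x in range(N)] for y in range(N)]
    return [[int(round(c / step)) for c in row] for row in dct2d(residual)]

def decode_block(qcoef, predictor, step):
    """Dequantize, inverse transform, add the predictor back, clamp to [0,64]."""
    residual = idct2d([[q * step for q in row] for row in qcoef])
    return [[min(64, max(0, int(round(residual[y][x] + predictor[y][x]))))
             for x in range(N)] for y in range(N)]
```

When the predictor is good, the residual is small and most coefficients quantize to zero, which is where the AC savings come from. An all-zero predictor reduces this to the absolute (JPEG-style) case.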




Thursday, June 4, 2026

"Shrek" Xbox

“Shrek” was a launch title for the original Xbox that I worked on 25 years ago (in 2001). It was the first shipped game to use a technique called Deferred Shading. It wasn’t my first 3D game, though: I wrote Sandbox Studios’ software rasterizer and D3D7 renderers.

https://xboxdevwiki.net/Shrek

From the site - G-buffer visualization:



Light accumulation - with static and dynamic stencil shadow volumes (Z-pass):



Final buffer:


More details are here:
https://sites.google.com/site/richgel99/the-early-history-of-deferred-shading-and-lighting

In the early days of "Shrek", Atman Binstock and I had a competition to see who could coax the early Xbox devkits to render real game meshes as rapidly as possible. I used indexed tristrips, leveraging the work of Francine Evans and colleagues at Stony Brook, and won.


This was a key reason why we could afford to render the entire scene twice per frame.

Saturday, May 30, 2026

Basis Universal v2.5 with in-loop deblocking: shipping soon

We're feature complete. We're now on the downslope to shipping support for in-loop deblocking using a simple standardized reconstruction operator. The operator uses 1 tap or 5 taps (1 center, 2 up/down, 2 left/right) near ASTC block edges, and is fully compatible with mipmapping and bilinear/trilinear filtering. It can be applied in a simple GPU pixel shader or during transcoding. 
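To show the shape of such an operator, here's a CPU-side Python sketch on a single-channel image. The tap weights (0.5 center, 0.125 per neighbor) and the "texel sits on a block border" test are placeholder assumptions, not the standardized operator, and the function name is hypothetical. Texels away from block edges take the 1-tap path and are returned untouched.

```python
def deblock(img, block_w, block_h, center_weight=0.5):
    """img: 2D list of floats. Returns a filtered copy.

    Texels on a block border get a 5-tap cross filter (center + left/right +
    up/down); all other texels are passed through (1 tap).
    The weights are illustrative placeholders only.
    """
    h, w = len(img), len(img[0])
    side = (1.0 - center_weight) / 4.0

    def at(x, y):
        # Clamp-to-edge addressing, like a clamped texture sampler.
        return img[min(h - 1, max(0, y))][min(w - 1, max(0, x))]

    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            bx, by = x % block_w, y % block_h
            on_edge = bx in (0, block_w - 1) or by in (0, block_h - 1)
            if on_edge:
                out[y][x] = (center_weight * at(x, y) +
                             side * (at(x - 1, y) + at(x + 1, y) +
                                     at(x, y - 1) + at(x, y + 1)))
    return out
```

Because the filter is a fixed function of texel position and block size, an encoder can run the exact same operator in its inner loop, which is what makes deblocking-aware encoding possible.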

It's so simple I'm stumped why the IHV's didn't put this obvious thing directly in the sampling hardware. It makes the largest ASTC block sizes (10x8, 10x10, 12x10 and 12x12) immensely more usable in practice.

The encoder's final SCD (Stochastic Coordinate Descent) stage computes the final output taking into account the deblocking filter. The encoder simulates the exact filter the decoder will use, then optimizes the compressed data knowing the artifacts will be smoothed. This dramatically improves quality at low bitrates. 

We're shipping two full ASTC LDR encoders (an optimized version of our original one from v2.0, and a new one called "astcf") that have been modified to output 10's to 100's of block candidates for SCD. 

We're also going to optionally support the use of a slightly modified/forked version of ARM's "astcenc" library that can output candidates for use in our SCD stage. The library supports merging the output from (up to) all 3 encoders. We're supporting this because each encoder has different artifact profiles, which boosts SCD candidate diversity. Our second ASTC encoder was purposely engineered to look very different vs. our first.

With the largest ASTC/XUASTC block sizes, 80-90% of our output blocks are now created stochastically via SCD. The mutation operator can modify endpoints, partition patterns, and DCT AC/DC coefficients.

The output looks incredible at 12x12 ~0.5 bpp.

Thursday, May 7, 2026

Bug found in Intel's ASTC compressor (part of ispc_texcomp)

I've ported Intel's ASTC compressor (in ispc_texcomp) to pure C for benchmarking and discovered this gem during the process. The swap() utility function for float values, located near the top, has a silly typo bug that casts the first value to int (!). It's used in several places on floating point values, including block errors (scores).

https://github.com/GameTechDev/ISPCTextureCompressor/blob/master/ispc_texcomp/kernel_astc.ispc#L41
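To see why this matters, here's a tiny Python emulation of the effect. It is a paraphrase of the bug as described above, not the ISPC source: the temporary is treated as an int, so the first value loses its fractional part on the way through. Python's int() truncates toward zero, like a C float-to-int conversion.

```python
def swap_buggy(a, b):
    """Emulates a float swap whose temporary is an int."""
    t = int(a)           # fractional part of 'a' is discarded here
    return b, float(t)   # new a = b, new b = truncated old a

def swap_fixed(a, b):
    """What the swap should do: exchange the values unchanged."""
    return b, a

# Two block error scores that an encoder might be ordering:
print(swap_buggy(1.75, 0.25))
print(swap_fixed(1.75, 0.25))
```

Any error score below 1.0 that passes through the buggy swap comes out as exactly 0.0, so comparisons made after the swap can pick the wrong candidate.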

How could a bug like this survive for so long? What other unfound bugs are in there?

I might release the port on GitHub after more testing, as time permits. I'll be checking for undefined behavior using UBSan, and also doing a statistical analysis on its ASTC output blocks to check for any issues.

I've already enhanced it to support sRGB decode vs. linear, candidate generation (instead of just best block by SSE), and I've added optional per-channel weights. For a single subset encoder (that only supports 8x8 or smaller block sizes) it looks surprisingly good - once swap() is fixed.

Wednesday, April 29, 2026

Block blurring (prefiltering)

Modern GPU texture compressors have a secret (but dangerous) superpower: prefiltering (blurring). Sometimes an encoder way overfits edges, causing overall perceptual quality to collapse. One way to overcome this is to blur the input and encode the block again. This is what we do in HDR on the very worst blocks (as measured by SSIM).

It's paradoxical: blurring can boost perceptual quality.

Large block size ASTC has been misunderstood

Most game developers misunderstand ASTC: the largest block sizes were intended for the largest resolution content (4k). GPU shader deblocking is easy: at the largest block sizes it's essentially free (because the bottleneck is typically memory bandwidth on mobile/tablets, not some ALU or cached tfetches).

The largest block sizes collapse memory consumption/bandwidth enormously (0.89-1.28 bpp for 12x12 or 10x10). 4 extra samples at block boundaries that are extremely likely to hit the texture cache (because they sample into a neighbor block) are going to be dirt cheap.
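Those bpp figures fall straight out of ASTC's fixed 128-bit block size. A quick Python sketch (the function names are mine):

```python
import math

ASTC_BLOCK_BITS = 128  # every ASTC block is 128 bits, regardless of footprint

def astc_bpp(block_w, block_h):
    """Bits per texel for a given ASTC block footprint."""
    return ASTC_BLOCK_BITS / (block_w * block_h)

def astc_texture_bytes(width, height, block_w, block_h):
    """Size in bytes of one mip level. Partial blocks at the borders are
    still stored as full 16-byte blocks."""
    blocks_x = math.ceil(width / block_w)
    blocks_y = math.ceil(height / block_h)
    return blocks_x * blocks_y * (ASTC_BLOCK_BITS // 8)

for size in [(4, 4), (6, 6), (8, 8), (10, 10), (12, 12)]:
    print(size, round(astc_bpp(*size), 2), astc_texture_bytes(4096, 4096, *size))
```

For a 4096x4096 texture that works out to 16 MiB at 4x4 versus under 2 MiB at 12x12.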

Once you add a form of well-specified deblocking, the next step is to make an encoder that is deblocking aware. Then you can heavily exploit deblocking - just like all modern image/video codecs have done for decades.

Unfortunately the deployment story with ASTC so far has been pretty spotty: There are few available full-format encoders because the format is so complex.

Intel's encoder in ispc_texcomp (now archived/no longer supported) didn't support 2-4 subsets (!!), only up to 8x8, and was broken (misusing/underutilizing modes in our analysis).

ARM's LDR encoder is good (in native, WASM story seems weaker) but it isn't deblocking aware and doesn't support supercompression.

Beyond ~2k, ASTC 12x12 can look exceptional with a very simple deblocking shader (1+4 bilinear or trilinear samples only at block edges, mipmapping/filtering compatible), and the memory savings are huge.

Some related info is here:

Petascale changes everything

Let's say you have multiple petabytes of JPEG/WebP/etc. content and you need (not want - because your competitors are doing it already) to add GPU texture support to your app. What do you do?

  • Encode 2-3 times without supercompression (BC7+ASTC, maybe ETC1 too): utterly impractical, explodes (8x-16x or more) overall content size. Even 1 format (say ASTC) without supercompression is impractical - explodes content size.
  • Use supercompression to a universal texture format, transcode on device (native or plain WASM): Adds another ~1-1.5 petabytes. The tech is entirely free, has no driver dependencies and is standardized by Khronos.
  • Use compute shaders, try to transcode on device: but now you're endlessly chasing ever-changing mobile GPU driver bugs until the end of time, outliers can't use your app reliably. You're also stuck with large textures in VRAM because you can't exploit the largest ASTC block sizes (beyond 6x6, and even 6x6 sacrifices quality due to compute shader issues). Supercompressed solutions can readily exploit ASTC 8x8 (2bpp in VRAM), 10x10 (1.28 bpp) and 12x12 (0.89 bpp), while compute shader solutions are limited to the smallest block sizes (3.56 bpp - 8 bpp) and have to make sacrifices (such as disabling dual plane support in some scenarios) to even achieve that.
At petascale many things that are taken for granted ("we'll just encode multiple times!") aren't practical. We knew this when we started Binomial to work on GPU texture tech. Our tech was already quietly deployed at petascale over a decade ago. At this scale supercompression is the way to go - the only approach that scales.

Once your customer base is 100's of millions to 1+ billion, even a ~.1% failure rate is intolerable. If a compute shader (driver dependent) failure prevents a firefighter, pilot, or war fighter from reliably seeing their map or geospatial content, you can't ship it.

Thursday, April 9, 2026

XUASTC's next step: Intra-prediction of weight grids

Binomial has shown that image compression and GPU texture compression aren't separate fields. They're the same field, and the tools from one transfer directly to the other.

XUASTC is currently using JPEG-style DCT (from 1992) on ASTC weight grids:

https://github.com/BinomialLLC/basis_universal/wiki/XUASTC-LDR-Weight-Grid-DCT

We ported JPEG-style coding into ASTC, even preserving how libjpeg-style [1-100] Q factors are used to calculate quantization tables. (Our quantization table is the standard luminance JPEG table, with simple adaptive quantization added on top.)
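For reference, this is the libjpeg-style mapping from a [1-100] quality factor to a scaled quantization table, sketched in Python. It mirrors libjpeg's integer arithmetic with the baseline clamp to [1,255]. How XUASTC applies the scaled table to ASTC weight grids (and its adaptive quantization) is covered in the wiki page above and isn't modeled here. The row below is the first row of the standard JPEG luminance table.

```python
def jpeg_quality_scaling(q):
    """Map a quality factor in [1,100] to a percentage scale factor."""
    q = max(1, min(100, q))
    return 5000 // q if q < 50 else 200 - 2 * q

def scale_quant_table(base, q):
    """Scale base table entries by the quality factor, clamped to [1,255]."""
    s = jpeg_quality_scaling(q)
    return [max(1, min(255, (v * s + 50) // 100)) for v in base]

# First row of the standard JPEG luminance quantization table.
LUMA_ROW0 = [16, 11, 10, 16, 24, 40, 51, 61]

for q in (1, 25, 50, 90, 100):
    print(q, scale_quant_table(LUMA_ROW0, q))
```

Q=50 reproduces the base table, Q=100 drives every entry to 1, and very low Q values saturate at 255, which is why the lowest spatial frequencies get hit so hard at the bottom of the range.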

This works, but it means the DCT has to carry the entire weight signal (just like JPEG). At the very lowest quality factors (Q levels 1-25 or so), the lowest spatial frequencies suffer (again, just like JPEG).

The next step is to port WebP-style intra-prediction into the weight grid domain. We can easily predict weight grids from nearby blocks, then code the weight residuals using DCT. It's the logical next step, and it'll push our bitrates even lower. While seemingly everyone is distracted by neural techniques, we're targeting billions of already shipped, hyper-efficient hardware decoders.

Thursday, April 2, 2026

First XUASTC LDR 4x4 rate-distortion graphs

ASTC GPU texture blocks form a latent image space where JPEG techniques still work. (So does BC7.)

Here's a XUASTC LDR 4x4 (arithmetic vs. Zstd profile) bit rate vs. distortion graph across 151 test textures/images (the same test corpus we used to create bc7e.ispc). Distortion was measured using PSNR-HVS-M.

XUASTC LDR 4x4 transcodes to standard ASTC LDR 4x4 in memory/VRAM (8.0 bpp). It supports all 14 standard ASTC block sizes up to 12x12.


Multiple block sizes, effort 9, arithmetic profile. It took ~10k invocations of the compressor-transcoder to make this graph.


ETC1S - effort 2:



Basis Universal v2.1 library wiki mirror

It's been automatically converted to HTML and mirrored outside of GitHub here: Home.

A static mirror of the GitHub repo is here.

Sunday, March 8, 2026

The KTX-Software repo has been forked

Binomial LLC has forked the Khronos Group's KTX-Software repo, to use as a staging ground for next-generation GPU texture compression technology:

https://github.com/BinomialLLC/KTX-Software-Binomial-Fork/


Sunday, March 1, 2026

.ASTC (the File Format): No Longer a Black Box

The basisu command line tool has a new option, -peek, which opens any standard ARM LDR/HDR .ASTC texture file, unpacks each block, and computes a bunch of statistics about the exact ASTC configurations the blocks used.

This is how we found out that Intel's ispc_texcomp's ASTC encoder is, for all practical purposes, broken.

https://github.com/BinomialLLC/basis_universal/wiki/Displaying-.astc-file-block-statistics-using-the-peek-command-line-option


Wednesday, February 18, 2026

Some things we've learned about GPU textures at planetary scale

1. ASTC is now the king: In billions of devices. Everything else=fallback, including BC7.

To us, BC7 is essentially a greatly simplified ASTC, but with some p-bits.

2. At multi-petabyte (planetary) scales: Supercompression bitrate=Matters enormously.

Notably, game developers (who have been using compressed textures the longest) don't work at scales this large, so the approaches and techniques they assume are correct or standard in this domain may not apply at all in extreme scales.

The tradeoffs game developers have made in the past are no longer aligned with modern hardware and network realities.

3. GPU Drivers=super sketchy.

This means for us: No driver usage, no compute. The largest vendors, who already deal with endless GPU driver bugs, don't want even more exposure in critical texture decompression/transcoding paths. If it fails for even ~.1% of customers in the wild, it's unusable.

4. All 14 ASTC block sizes are important in large scale deployment scenarios, not just 2 or 3.

This includes 12x12, which at 4k-8k is quite usable.

5. When mipmap and filtering compatible deblocking of the larger ASTC block sizes is trivial to do in a tiny pixel shader, it makes no sense not to deblock because the cost of not doing so is ~2x-8x more bitrate and bandwidth.

6. Unfortunately, LDR ASTC decoding actually isn't always bitwise exact. (BC7 wins for this, at least.)

The ASTC specification was so dense and complex even the vendors (including ARM itself!) couldn't get it right.

7. WASM SIMD isn't everywhere (or even when it is, some very big vendors won't allow it to be used or enabled), so that means we can't depend on SIMD. This means less searching, more math, and better algorithms in our encoders, or we can't ship.

8. Everything must be fuzzed. That means the obvious things like block decoders, all decompressors, etc. but it also includes encoders. Trust no data.


Tuesday, February 10, 2026

ASTC Texture Sampling with Deblocking in a Simple Pixel Shader

Deblocking is a standard feature in modern image/video codecs, and now developers can benefit from deblocking on GPU textures, either while transcoding to other formats like BC7, or while sampling ASTC textures directly.

This demo with source code shows how to sample ASTC textures (or really any GPU texture format, of any block size) with deblocking applied in a simple pixel shader. It's intended for the larger ASTC block sizes, i.e. beyond 6x6. It greatly reduces block artifacts, which allows larger block sizes to be used across a wider range of content, which ultimately lowers bitrates, memory bandwidth, and download sizes. It's fully compatible with mipmap filtering.

https://github.com/BinomialLLC/basis_universal/tree/master/shader_deblocking

This is a form of "GPU texture compression-aware shading" or "GPU format-informed reconstruction".


Thursday, January 29, 2026

Tuesday, January 27, 2026

bc7f: A New Real-Time Analytical BC7 Encoder

bc7f: Prediction, Not Search

The portable, non-SIMD bc7f encoder relies on an analytical, statistics-driven error model rather than iterative search. This full featured (all BC7 modes, all mode features, all dual-plane channels, all partition patterns), strictly bounded O(1) real-time encoder exploits simple closed-form expressions to predict which BC7 mode family (4/5, 0/2, 1/3/7, or 6) is worth considering. It then estimates the block’s SSE/MSE for each candidate using lightweight block statistics derived from covariance analysis together with the mode’s weight and endpoint quantization characteristics. All of this is performed prior to encoding any BC7 modes. In purely analytical mode, bc7f predicts, encodes the input to a single BC7 mode configuration (without any decoding or error measurement), and returns.

BC7 block decoding is an affine interpolation between quantized endpoints using quantized weights, which allows first-order error propagation to be modeled directly. For a given block, the encoder computes basic statistics such as the covariance of the input texels; the principal axis derived from the covariance is used both for endpoint fitting and to estimate the orthogonal least-squares (“line fit”) residual error as trace(covariance) − λ₁. Quantization noise from endpoints and weights is modeled independently using uniform quantization assumptions, with endpoint error contributing an additive term and weight/index error contributing a span-dependent term proportional to the squared endpoint distance. These closed-form estimates are sufficient to predict relative SSE across BC7 mode families, partitions, and dual-plane configurations without trial encodes. As a result, bc7f can select parameters and emit a single BC7 block in strictly bounded time, producing deterministic, high-quality results without brute-force search or refinement.
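The line-fit term is easy to check numerically. Here's a small Python sketch, not bc7f's code: it builds the unnormalized 3x3 scatter matrix of a block's RGB texels (my assumption, so the result comes out directly as an SSE), gets λ₁ from the standard closed-form trigonometric solution for symmetric 3x3 matrices, and returns trace − λ₁. The endpoint and weight quantization terms are left out.

```python
import math

def largest_eigenvalue_sym3(a):
    """Largest eigenvalue of a symmetric 3x3 matrix (trigonometric closed form)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:
        return max(a[0][0], a[1][1], a[2][2])  # already diagonal
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))       # guard against rounding overshoot
    return q + 2.0 * p * math.cos(math.acos(r) / 3.0)

def line_fit_sse(texels):
    """Orthogonal least-squares line-fit residual: trace(scatter) - lambda_1."""
    n = len(texels)
    mean = [sum(t[c] for t in texels) / n for c in range(3)]
    d = [[t[c] - mean[c] for c in range(3)] for t in texels]
    scatter = [[sum(v[i] * v[j] for v in d) for j in range(3)] for i in range(3)]
    trace = scatter[0][0] + scatter[1][1] + scatter[2][2]
    return max(0.0, trace - largest_eigenvalue_sym3(scatter))
```

A block whose texels all lie on one line in RGB space returns (numerically) zero, meaning a single-subset mode loses nothing before quantization. A large residual suggests the block would benefit from partitions or a dual plane.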

bc7f is significantly faster than bc7e.ispc Level 1, but because it exploits the entire BC7 format, it isn’t as brittle. It's a “one-shot”, non-AbS (analysis by synthesis), but full featured encoder. The follow-up, “bc7g” is in the works, and it will be released as open source as well.

Binomial first developed these techniques for our full-featured (all block size) ASTC encoder, which is vastly more complex, and later used them to implement bc7f. We expect these predictive, analytical encoding techniques to be rapidly adopted.