Saturday, January 25, 2020

Universal ASTC (UASTC) Tech Details

We have reached an exciting milestone: We now have a working HQ universal encoder that supports both ASTC and BC7 for RGB/RGBA textures. It's currently a bit slow and it doesn't support RDO yet, but it works. Quality is extremely high (BC7 grade, no block artifacts) and the encoder's behavior is stable across a wide range of RGB/RGBA inputs including XYZ normal maps.

We've settled on the 15 standard ASTC modes listed below, which we're calling "UASTC". They are 100% standard ASTC configurations, so any ASTC encoder could be modified to limit itself to output just these modes (out of the hundreds available). Any BC7 encoder could be modified too, once it supports ASTC's endpoint quantization tables, ASTC's 4-bit weight table, and ASTC's 16-bit endpoint interpolation. (That's how we prototyped this system.)

Average RGB PSNRs across 33 test textures:

Original->Near-optimal BC7: 46.67 dB (our high quality SIMD BC7 codec in "slow" mode)
Original->UASTC: 45.14 dB
UASTC->ASTC: 45.14 dB (always lossless)
UASTC->BC7: 44.41 dB

Original->Near-optimal BC1: 36.96 dB (stb_dxt STB_DXT_HIGHQUAL)
UASTC->Near-optimal BC1: 36.20 dB

This ASTC subset's quality is on average only ~1.5 dB lower than near-optimal BC7 for opaque content, but it's ~8.2 dB higher than near-optimal BC1 (45.14 vs. 36.96). Both RGB and RGBA content look *really* good. Our experience building several production BC7 encoders helped guide us to the right ASTC modes.
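For reference, here is a minimal sketch of how an RGB PSNR figure like the ones above can be computed. It assumes the common definition (peak value 255, mean squared error taken over all R, G and B samples); the exact averaging used for the numbers above isn't stated, and `rgb_psnr` is just an illustrative name.

```python
import math

def rgb_psnr(original, decoded):
    """PSNR in dB between two equal-length lists of (r, g, b) 8-bit pixels."""
    if len(original) != len(decoded) or not original:
        raise ValueError("pixel lists must be non-empty and the same length")
    sq_err = 0
    for a, b in zip(original, decoded):
        sq_err += sum((x - y) ** 2 for x, y in zip(a, b))
    # Mean squared error over every R, G and B sample (assumed convention).
    mse = sq_err / (3 * len(original))
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(255.0 ** 2 / mse)
```

A per-texture value computed this way would then be averaged across the test set.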

These modes are easily converted directly to a BC7 texture encoding with no pixel-wise recompression, with low quality loss (less than .75 dB on average). To convert to BC7, the endpoints are scaled, you compute the optimal p-bits (if the BC7 mode has any; this is simple) to represent the ASTC endpoints, and then you either clone the ASTC indices or translate them with a tiny table. Transcoding to BC7 is very simple stuff, and doesn't require the large precomputed tables that Basis Universal's ETC1S solution needs.
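The p-bit step can be sketched roughly as follows. This is not the actual transcoder code: it assumes the ASTC endpoint has already been scaled to 8 bits per channel and that the target is a BC7 mode 6 style endpoint (7 bits per channel plus one p-bit shared by all channels of that endpoint). `best_pbit` is a hypothetical helper name.

```python
def best_pbit(endpoint):
    """Pick the p-bit (0 or 1) that best represents an 8-bit endpoint.

    endpoint: tuple of 8-bit channel values.
    Returns (pbit, quantized_7bit_channels, squared_error).
    """
    best = None
    for p in (0, 1):
        quantized = []
        err = 0
        for v in endpoint:
            # Nearest 7-bit x such that the decoded value (x << 1) | p is close to v.
            x = min(127, (v - p + 1) >> 1)
            decoded = (x << 1) | p
            quantized.append(x)
            err += (decoded - v) ** 2
        if best is None or err < best[2]:
            best = (p, tuple(quantized), err)
    return best
```

Both candidates are tried per endpoint and the one with the lower squared error wins, which is cheap enough to do per block at transcode time.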

We're not encoding these modes to the standard ASTC block format (although we could), because it's too slow to unpack with the CPU. Instead, we use a simple 128-bit/block BC7-like block format for the UASTC mode/endpoints/weights/partition index. Worst case the packed UASTC data takes 112-113 bits, leaving around 15-16 bits for other things.

We have an interesting plan on how to support ETC1/2 at high quality (way better than ETC1S) with fast transcoding. We can take the 15-16 bits left over in our custom block format to store ETC1/2 hints. These hints greatly accelerate real-time high quality ETC1/2 compression (by ~30x for ETC1 vs. a brute force encoder). The UASTC compressor will re-encode the final UASTC block to ETC1/2 and then determine the set of ETC1/2 hints that result in the lowest ETC1/2 error. 

The next major step for us is to sit down and implement ETC1/2 to make sure this plan works well on a wide range of inputs.

As this is a universal GPU texture compression system it will support ALL LDR GPU texture formats, like Basis Universal does. Here's the plan for the other formats:

ETC2 R11 and RG11 might be able to reuse the ETC1/2 hints.

We have already prototyped BC1 and found a way to make that very fast in the 1-subset cases. For the other relatively rare 2/3-subset UASTC cases we'll need to use PCA+least squares. Real-time BC3-5 are fast. 

PVRTC1, and the other niche/obsolete formats (like PVRTC2, ATC, etc.) will use solutions already implemented in Basis Universal.

UASTC mode constraints/notes:

1. All blocks are always LDR 4x4 pixels, and all UASTC modes use integer weight bits for compatibility with BC7. 

2. Only uses Color Endpoint Mode (CEM) 8 or 12 (RGB/RGBA Direct) to simplify the encoder/transcoder. The other CEM's don't help enough to justify the added complexity.

3. CEM 8 and 12 support Blue Contraction, which is never utilized in UASTC. Instead, we swap the subset's endpoints if the MSB of the last weight index is 1 (exactly like BC7). This guarantees the last weight index has an MSB of 0, so we don't need to store it in the packed block format.

The UASTC->ASTC transcoder needs to check the endpoints to see if blue contraction would kick in. If so, it'll need to invert the weight indices and swap the subset's endpoints.

4. The 2- and 3-subset modes are constrained to the partition patterns that ASTC and BC7 have in common, which we've documented on our blog and on Twitter. Total of 60 patterns (30+11+19).

5. Mode 7 uses a 3-subset BC7 mode, but only a 2-subset ASTC mode. Two of the BC7 subset endpoints are set to equal colors to simplify the 3-subset partition pattern into a 2-subset pattern that's compatible with ASTC. This gives us 19 more useful partitions.

6. Opaque encodings get transcoded to BC7 modes 1,2,3,5,6. Alpha encodings transcode to BC7 modes 5,6,7. BC7 modes 0 and 4 are unused. 

7. When the # of weight bits differs between the BC7/ASTC encodings, we choose the closest BC7 weight (just a simple lookup into a static 4- or 8-entry table). Note that BC7 and ASTC use the same 2-bit and 3-bit weight tables. Some ASTC 4-bit table entries differ by ±1 compared to BC7, but the encoder can work around this.

8. BC7 and ASTC interpolate endpoints in a similar way, except ASTC endpoints are scaled up to 16-bits before interpolation and then only the top 8-bits are used. This is a surprisingly minor difference that a good encoder can work around by choosing the lowest overall BC7 error from the hundreds/thousands of possible UASTC configurations/partition patterns/endpoints/etc.

9. Strong encoders can compute both ASTC and transcoded BC7 error to choose UASTC encodings that result in minimal BC7 error. (This isn't necessary, it just helps a little.)

10. A driver could easily transcode UASTC texture data to ASTC or BC7 completely transparently to the user. The blocks are completely independent and the transcode step can be done with SIMD operations.
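Notes 7 and 8 above can be checked with a few lines of code. The sketch below is only an illustration, not the encoder's code: the BC7 weight tables are the ones from the BC7 format documentation, `astc_weights` follows ASTC's weight unquantization for power-of-two ranges (replicate the index bits to 6 bits, then add 1 to values above 32), and the two interpolation functions assume plain 8-bit LDR endpoints with sRGB decoding off. All function names are made up for this example.

```python
BC7_WEIGHTS = {
    2: [0, 21, 43, 64],
    3: [0, 9, 18, 27, 37, 46, 55, 64],
    4: [0, 4, 9, 13, 17, 21, 26, 30, 34, 38, 43, 47, 51, 55, 60, 64],
}

def astc_weights(bits):
    """Unquantized ASTC weights (0..64) for a 1- to 4-bit weight range."""
    out = []
    for i in range(1 << bits):
        v, n = i, bits
        while n < 6:               # replicate the index bits until >= 6 bits
            v = (v << bits) | i
            n += bits
        w = v >> (n - 6)           # keep the top 6 bits
        if w > 32:
            w += 1                 # so that the largest weight is 64, not 63
        out.append(w)
    return out

def closest_bc7_indices(astc_bits, bc7_bits):
    """Static table mapping each ASTC weight index to the nearest BC7 index."""
    table = BC7_WEIGHTS[bc7_bits]
    return [min(range(len(table)), key=lambda j: abs(table[j] - w))
            for w in astc_weights(astc_bits)]

def bc7_interp(e0, e1, w):
    """BC7: interpolate the 8-bit endpoints directly."""
    return ((64 - w) * e0 + w * e1 + 32) >> 6

def astc_interp(e0, e1, w):
    """ASTC: expand to 16 bits, interpolate, keep the top 8 bits."""
    c0 = (e0 << 8) | e0
    c1 = (e1 << 8) | e1
    return (((64 - w) * c0 + w * c1 + 32) >> 6) >> 8

# Largest per-entry difference between the ASTC and BC7 4-bit tables.
max_weight_diff = max(abs(a - b) for a, b in zip(astc_weights(4), BC7_WEIGHTS[4]))
```

For example, `closest_bc7_indices(1, 2)` gives the 2-entry table used when a 1-bit ASTC weight plane is written into 2-bit BC7 indices. With identical weights, the two interpolators never differ by more than 1 for 8-bit endpoints, which matches the "surprisingly minor difference" described in note 8.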

UASTC modes:

 0. DualPlane: 0, WeightRange: 8 (16), Partitions: 1, EndpointRange: 19 (192)  MODE6 RGB
 1. DualPlane: 0, WeightRange: 2 (4), Partitions: 1, EndpointRange: 20 (256)   MODE3
 2. DualPlane: 0, WeightRange: 5 (8), Partitions: 2, EndpointRange: 8 (16)     MODE1
 3. DualPlane: 0, WeightRange: 2 (4), Partitions: 3, EndpointRange: 7 (12)     MODE2
 4. DualPlane: 0, WeightRange: 2 (4), Partitions: 2, EndpointRange: 12 (40)    MODE3
 5. DualPlane: 0, WeightRange: 5 (8), Partitions: 1, EndpointRange: 20 (256)   MODE6 RGB
 6. DualPlane: 1, WeightRange: 2 (4), Partitions: 1, EndpointRange: 18 (160)   MODE5
 7. DualPlane: 0, WeightRange: 2 (4), Partitions: 2, EndpointRange: 12 (40)    MODE2

 8. Solid Color RGBA (MODE5 or MODE6)

 9. DualPlane: 0, WeightRange: 2 (4), Partitions: 2, EndpointRange: 8 (16)     MODE7
10. DualPlane: 0, WeightRange: 8 (16), Partitions: 1, EndpointRange: 13 (48)   MODE6
11. DualPlane: 1, WeightRange: 2 (4), Partitions: 1, EndpointRange: 13 (48)    MODE5
12. DualPlane: 0, WeightRange: 5 (8), Partitions: 1, EndpointRange: 19 (192)   MODE6
13. DualPlane: 1, WeightRange: 0 (2), Partitions: 1, EndpointRange: 20 (256)   MODE5
14. DualPlane: 0, WeightRange: 2 (4), Partitions: 1, EndpointRange: 20 (256)   MODE6
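The table above is easy to turn into data and sanity check. The sketch below transcribes it (using level counts rather than ASTC range indices) and derives the BC7 mode sets from note 6. It assumes modes 0-7 are the opaque ones and modes 9-14 the alpha ones, which is how the table appears to be grouped; the field names are my own, and "Partitions" is stored as `subsets`.

```python
from collections import namedtuple

Mode = namedtuple("Mode", "index dual_plane weight_levels subsets endpoint_levels bc7_mode")

UASTC_MODES = [
    Mode(0, 0, 16, 1, 192, 6),
    Mode(1, 0, 4, 1, 256, 3),
    Mode(2, 0, 8, 2, 16, 1),
    Mode(3, 0, 4, 3, 12, 2),
    Mode(4, 0, 4, 2, 40, 3),
    Mode(5, 0, 8, 1, 256, 6),
    Mode(6, 1, 4, 1, 160, 5),
    Mode(7, 0, 4, 2, 40, 2),
    # Mode 8 (solid color) is left out: it has no weights and maps to MODE5 or MODE6.
    Mode(9, 0, 4, 2, 16, 7),
    Mode(10, 0, 16, 1, 48, 6),
    Mode(11, 1, 4, 1, 48, 5),
    Mode(12, 0, 8, 1, 192, 6),
    Mode(13, 1, 2, 1, 256, 5),
    Mode(14, 0, 4, 1, 256, 6),
]

def weight_bits(mode):
    """Total weight bits in a 4x4 block before dropping any implied MSBs."""
    bits_per_weight = mode.weight_levels.bit_length() - 1  # levels are powers of two
    planes = 2 if mode.dual_plane else 1
    return 16 * planes * bits_per_weight

opaque_bc7 = sorted({m.bc7_mode for m in UASTC_MODES if m.index <= 7})
alpha_bc7 = sorted({m.bc7_mode for m in UASTC_MODES if m.index >= 9})
```

With this transcription, no mode needs more than 64 weight bits per block.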

This is designed to be an RDO codec (the RDO stage isn't implemented yet), so we're depending on a good LZ codec for compression. To implement multiple quality levels the current plan is to use an LZ dictionary simulator, bit price estimator, and Lagrangian optimization to choose block selector bytes which have been recently emitted into the output data stream. The quality level will control the error threshold used to choose "good enough" selectors which we've already sent (so they'll be cheap for LZ to encode). We've implemented this before in Basis BC1, but that was with already quantized selectors. So there will be some things to figure out.
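As a very rough illustration of the Lagrangian part of that plan (not the planned implementation, which still has to be worked out), the choice between fresh selectors and recently emitted ones can be framed as minimizing distortion plus lambda times the estimated bit cost. The candidate tuples and the numbers in the usage example are made up.

```python
def pick_selectors(candidates, lam):
    """Return the candidate with the lowest distortion + lam * bits.

    candidates: iterable of (name, distortion, estimated_bits) tuples, where
    estimated_bits would come from an LZ bit price estimator.
    lam: Lagrange multiplier; larger values favor cheaper (reused) selectors.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical example: exact selectors are accurate but expensive to code,
# while selectors that were recently emitted are cheap LZ matches.
example = [
    ("new selectors", 10.0, 32.0),
    ("recent match A", 14.0, 9.0),
    ("recent match B", 40.0, 6.0),
]
```

With `lam = 0` the exact selectors always win; raising `lam` (the quality knob in this sketch) shifts the choice toward selectors the LZ codec has already seen.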

This system is designed to be compatible with and explicitly exploit KTX2's support for RDO compression on top of block-based formats.

Wednesday, October 2, 2019

Parsing ASTC's overly restrictive end user license

We've been reviewing the licensing situation for all the GPU texture formats Basis Universal supports. (This is basically every LDR GPU format in existence, so this isn't easy.) Most formats are covered by various open Khronos API standards and standard documents and have been fully documented in a variety of very permissive open source works and publications.

However, the ASTC reference encoder, documentation and specification have their own End User License Agreement, which I believe makes ASTC unique. This license explains what you can and cannot legally do with the ASTC texture format. It's distributed with ARM's "astc-encoder" project on GitHub:

At first glance, after a casual reading, you may think this legal agreement grants the end user permission to do basically anything they want with ASTC. Actually, it's very restrictive. There's *a lot* you can't legally do with ASTC.

Here are the key/core lines of the license that matter the most (anything in bold is by me). This is just a subset of the full license linked above:

"Authorised Purpose" means the use of the Software solely to develop products and tools which implement the Khronos ASTC specification to;

(i) compress texture images into ASTC format ("Compression Results");
(ii) distribute such Compression Results to third parties; and
(iii) decompress texture images stored in ASTC format.

"Software" means the source code and Software binaries accompanying this Licence, and any printed, electronic or online documentation supplied with it, in all cases relating to the MALI ASTC SPECIFICATION AND SOFTWARE CODEC.

ARM hereby grants to you, subject to the terms and conditions of this Licence, a nonexclusive, nontransferable, free of charge, royalty free, worldwide licence to use, copy, modify and (subject to Clause 3 below) distribute the Software solely for the Authorised Purpose.
No right is granted to use the Software to develop hardware.
Notwithstanding the foregoing, nothing in this Licence prevents you from using the Software to develop products that conform to an application programming interface specification issued by The Khronos Group Inc. ("Khronos"), provided that you have licences to develop such products under the relevant Khronos agreements.

TITLE AND RESERVATION OF RIGHTS: You acquire no rights to the Software other than as expressly provided by this Licence. ...

What does all this legalese actually mean? First, note under "Definitions" that "Software" actually means astc-encoder, its documentation, and the Mali ASTC Specification. It doesn't mean just code, it means the docs and spec too.

As far as we can tell, this license means you can only legally use astc-encoder and the Mali ASTC Specification to compress texture images into the ASTC format to create Compression Results, then distribute these Compression Results to third parties. Then you can decompress texture images stored in ASTC format. That's it. Notice the key "and" word under Clause 1 (Definitions): "(ii) distribute such Compression Results to third parties; and".  It's not "or".

You can't do anything else with the "Software" (meaning the astc-encoder, docs, or most importantly the ASTC specification!), because those use cases have been expressly forbidden by Clause 3.

This license apparently forbids all sorts of practical real world use cases, like: real-time encoding textures to ASTC on end-user devices, transcoding from other texture formats to ASTC, compressing ASTC using a .CRN-like system and decompressing or transcoding that to ASTC, or processing or converting ASTC data at run-time.

You also cannot compress anything but "Texture Images" into the ASTC format, which is quite restrictive. If your input signal isn't a texture image, well you're out of luck (at least through this license).

Under Clause 2, there's this paragraph that feels crudely hacked into the license contract: "Notwithstanding the foregoing, nothing in this Licence prevents you from using the Software to develop products that conform to an application programming interface specification issued by The Khronos Group Inc. ("Khronos"), provided that you have licences to develop such products under the relevant Khronos agreements." 

So this basically means "in spite of what was just said or written, nothing in this Licence prevents you from using the Software to develop products that conform to a Khronos API, provided that you have licences to develop such products under the relevant Khronos agreements". However, there are many use cases that don't involve directly calling a Khronos API. Basis Universal doesn't call any Khronos APIs at all. If you are using a rendering API that isn't a Khronos standard, you're out of luck.

Thinking about this further, what actually is a product that "conforms" to a Khronos API? Is this a driver? I can't tell. Does a video game product, or a mapping product conform to a Khronos API? Even if you conform, you still need a license from Khronos.

If you develop a real-time ASTC encoder library or product that will be deployed on end-user devices that doesn't conform to a Khronos API (and have a Khronos license), you are not covered by this license. We think. Because this license sucks.

Another confusing issue we see with this license, under DEFINITIONS: "Authorised Purpose" means the use of the Software solely to develop products and tools which implement the Khronos ASTC specification". Then it says "in all cases relating to the MALI ASTC SPECIFICATION AND SOFTWARE CODEC". So is it the Mali ASTC Specification or the Khronos ASTC Specification? 

My guess is that ARM's lawyers weren't filled in on all the modern ways developers can encode, transcode, and manipulate compressed texture data. Also, the document feels rushed and it's obvious that no engineer involved with ASTC with experience reading legal documents has actually sat down and parsed this thing.

As the situation stands right now, you cannot do much with ASTC except encode it offline, distribute this data, and then use it on a device. If your product uses a Khronos API, you may be able to do more, but I can't really tell for sure.

The whole situation is very fuzzy for what is supposed to be an open, royalty free standard.

Note our IP lawyer is still reviewing this license document. (We're actually spending money on this - that's how serious this is to us.)

Tuesday, October 1, 2019

ARM's ASTC encoder patents - is it safe to write encoders for this format?

I put this on Twitter earlier. I found this very disturbing comment in the Arm ASTC Encoder:

/**
 * Functions for finding dominant direction of a set of colors.
 *
 * Uses Arm patent pending method.
 */

Source code link.

Wow. I immediately stopped looking at this code and deleted it all once I saw this comment. I will never look at this code again in any way. So basically, ARM seems to be patenting some variant of PCA (Principal Component Analysis)? This is the first software GPU texture *encoder* I've seen that explicitly states that it uses patent pending algorithms.

ASTC is supposed to be "royalty-free". Yet, if I implement an ASTC encoder that uses PCA, will ARM sue us for patent infringement?

I was very excited about ASTC, but now it's totally clouded by this encoder patent issue. I cannot support a supposed "royalty-free" standard that apparently has encoder patents hanging over its head. We need ARM to fix this, to basically clarify what's going on here, and make a public statement that software developers can write encoders for its format without being sued because they infringed on ARM encoder patents.

You know, just to illustrate what a slippery slope encoder patents are and why they suck for everybody: We could have patented the living daylights out of our texture encoders, our universal codec, etc. It would have been no problem whatsoever. We could take this entire field and patent it up so tight that nobody could write a practical open or closed source GPU texture encoder without having to pay up. We could then sue for patent infringement any IHV's which ship drivers that implement run-time texture compressors, transcoders, or converters that use our patented encoding algorithms.

However we didn't want to ignite a texture encoder/texture compression patent war, and I'm very allergic to software patents.

The sad reality is, if the IHV's are going to start patenting the algorithms in their reference GPU texture *encoders*, we will have no choice but to start patenting every single GPU texture encoding and transcoding algorithm we can. For defensive purposes, so we can survive.

Taking this further, we could then turn this encoder patent landgrab into a significant part of our business model. These patents are worth several million each to the big tech corps during acquisitions. We could sell out our encoders and patents to the biggest buyer and retire.

Our defense to the software development community would be: "ARM started patenting their encoders first, not us. We needed defensive encoder patents to survive, just in case they sued."

After parsing the astc-encoder's license a few times, it appears we can legally use the ASTC specification to write our own 100% independent ASTC encoders and distribute the resulting compressed texture data. That's great. But if I go and write (for example) a BC7 texture encoder that accidentally infringes on ARM's encoder patents over their variation of PCA, I'm still screwed.

BTW - The author of the "Slug" text rendering library has started to patent his algorithms. (I only point this out to show that it's very possible for a tiny middleware company to easily acquire patents.) Personally, I'm against software patents, and I hope ARM fixes this.

Monday, September 30, 2019

Unified texture encoder for BC7 and ASTC 4x4

So far, it looks possible to unify a very strong subset of BC7 and ASTC 4x4. Such an encoder would be very useful, even if it didn't support rate distortion optimization. I've been looking at this problem on and off for months, and I'm convinced that there's something really interesting here.

First, it turns out there are 30 2-subset partition patterns in common between ASTC 4x4 and BC7:

This is a collection of very strong patterns. Considering ASTC's partition pattern generator is basically a fancy rand() function, this is surprisingly good! We've got almost half of BC7's 64 patterns in there!

Secondly, the way ASTC and BC7 convert indices (weights) to 6-bit scales, and the way they both interpolate endpoints is extremely similar. So similar that I believe ASTC indices could be converted directly to BC7's with little to no expensive per-pixel CPU work required.

Finally, I wrote a little app that examines hundreds of valid ASTC configurations, trying to find which configurations resemble the strongest and most useful BC7 modes. Here they are:

So all the important things between BC7 and ASTC 4x4 are basically the same or similar enough. ASTC's endpoint ranges are all over the map, but that's fine because in most cases BC7's endpoint precision is actually higher than ASTC's. A unified encoder could just optimize for lowest error across *both* formats simultaneously, and output both ASTC and BC7 endpoints with the same selectors. Or, we could only output ASTC's endpoints and just scale them to BC7's.

The next step is to write an ASTC encoder that is limited to these 12 configs and see how strong it is. After this, I need to see if this ASTC texture data can be directly and quickly converted to BC7 texture data with little loss. (Without any recompression to BC7, i.e. we just convert and copy over the endpoints/partition table index/per-pixel selector indices and that's it.) So far, I think this is all possible.

If all this actually works, it'll give us the foundation we need to build the next version of .basis. We could quantize the unified ASTC/BC7 selector/endpoint data and store it in a ".basis2" file. We then can convert this high-quality texture data to the other formats using fast block encoders (like we do already with PVRTC1 RGB/RGBA and PVRTC2 RGBA).

We could even store hints in the .basis2 file that accelerate conversion to some formats. For example we could store optimized BC1 endpoints in the endpoint codebook. Or we could store the optimal ETC1 base color/table indices, etc. Determining the per-pixel selectors for these formats is cheap once you have this info.

I think that with a strong ASTC 4x4 12-mode encoder that supports perceptual colorspace metrics, we could actually beat (or get really close to) ispc_texcomp's BC7 encoder (which only supports linear RGB metrics). I think this encoder would get within a few dB of max achievable BC7.

If the system's quality isn't high enough, we could always tack on more ASTC modes, as long as they can be easily transcoded to one of the BC7 modes without expensive operations.

It's too bad that BC7 isn't well supported in WebGL yet. The extensions are there, but the browser support isn't yet. I have no idea why as the format is basically ubiquitous on desktop GPU's now, and it's the highest quality LDR texture format available. For WebGL we still need very strong BC1-5 support for desktops until this situation changes.

Wednesday, April 24, 2019

Idea for next texture compression experiment (originally published 9/11/16)

Right now, I've got a GPU texture in a simple ETC1 subset that is easily converted to most other GPU formats:

Base color: 15-bits, 5:5:5 RGB
Intensity table index: 3-bits
Selectors: 2-bits/texel

Most importantly, this is a "single subset" encoding, using BC7 terminology. BC7 supports between 1-3 subsets per block. A subset is just a colorspace line represented by two R,G,B endpoint colors.
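To make the format concrete, here is a small sketch of how the four colors of such a block follow from the base color and intensity table index. The modifier table is the standard ETC1 one; the function names are mine, and the colors are returned from darkest to brightest, which is not the order of ETC1's actual 2-bit selector codes.

```python
# Standard ETC1 intensity modifier tables: (small, large) magnitude per table index.
ETC1_INTENSITY = [(2, 8), (5, 17), (9, 29), (13, 42),
                  (18, 60), (24, 80), (33, 106), (47, 183)]

def expand5(v):
    """Expand a 5-bit component to 8 bits by bit replication."""
    return (v << 3) | (v >> 2)

def block_colors(base555, table_index):
    """Four RGB colors of a single-subset block, darkest first."""
    small, large = ETC1_INTENSITY[table_index]
    base = [expand5(c) for c in base555]
    colors = []
    for modifier in (-large, -small, small, large):
        # The same modifier is added to R, G and B, then clamped to 0..255.
        colors.append(tuple(min(255, max(0, c + modifier)) for c in base))
    return colors
```

Each texel's 2-bit selector then picks one of these four colors, which all lie on a line parallel to the gray axis.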

This format is easily converted to DXT1 using a table lookup. It's also the "base" of the universal GPU texture format I've been thinking about, because it's the data needed for DXT1 support. The next step is to experiment with attempting to refine this base data to better take advantage of the full ETC1 specification. So let's try adding two subsets to each block, with two partitions (again using BC7 terminology), top/bottom or left/right, which are supported by both ETC1 and BC7.

For example, we can code this base color, then delta code the 2 subset colors relative to this base. We'll also add a couple more intensity indices, which can be delta coded against the base index. Another bit can indicate which ETC1 block color encoding "mode" should be used (individual 4:4:4 4:4:4 or differential 5:5:5 3:3:3) to represent the subset colors in the output block.
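The mode bit mentioned above could be chosen along these lines. This is only a sketch of the idea under the assumption that both subset colors are available as 5:5:5 values: ETC1's differential mode stores the second color as a 3-bit signed delta per component (range -4..3), so it is usable only when every delta fits; otherwise both colors are requantized to 4:4:4. The helper names are hypothetical.

```python
def to_444(c555):
    """Requantize a 5:5:5 color to the nearest 4:4:4 color (compared in 8-bit space)."""
    out = []
    for v in c555:
        v8 = (v << 3) | (v >> 2)           # 5-bit -> 8-bit
        out.append(min(range(16), key=lambda x: abs(x * 17 - v8)))  # 4-bit -> 8-bit is x * 17
    return tuple(out)

def choose_color_mode(c0_555, c1_555):
    """Return ('differential', base, delta) or ('individual', c0_444, c1_444)."""
    delta = tuple(b - a for a, b in zip(c0_555, c1_555))
    if all(-4 <= d <= 3 for d in delta):
        return ("differential", tuple(c0_555), delta)
    return ("individual", to_444(c0_555), to_444(c1_555))
```

Differential mode keeps full 5-bit precision for nearby subset colors, while individual mode trades precision for the ability to place the two colors far apart.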

In DXT1 mode, we can ignore this extra delta coded data and just convert the basic (single subset) base format. In ETC1/BC7/ASTC modes, we can use the extra information to support 2 subsets and 2 partitions.

Currently, the idea is to share the same selector indices between the single subset (DXT1) and two subset (BC7/ASTC/full ETC1) encodings. This will constrain how well this idea works, but I think it's worth trying out.

To add more quality to the 2 subset mode, we can delta code (maybe with some fancy per-pixel prediction) another array of selectors in some way. We can also add support for more partitions (derived from BC7's or ASTC's), too.

Few more random thoughts on a "universal" GPU texture format (originally published 9/9/16)

In my experiments, a simple but usable subset of ETC1 can be easily converted to DXT1, BC7, and ATC. And after studying the standard, it very much looks like the full ETC1 format can be converted into BC7 with very little loss. (And when I say "converted", I mean using very little CPU, just basically some table lookup operations over the endpoint and selector entries.)

ASTC seems to be (at first glance) around as powerful as BC7, so converting the full ETC1 format to ASTC with very little loss should be possible. (Unfortunately ASTC is so dense and complex that I don't have time to determine this for sure yet.)

So I'm pretty confident now that a universal format could be compatible with ASTC, BC7, DXT1, ETC1, and ATC. The only other major format that I can't fit into this scheme easily is my old nemesis, PVRTC.

Obviously this format won't look as good compared to a dedicated, single format encoder's output. So what? There are many valuable use cases that don't require super high quality levels. This scheme purposely trades off a drop in quality for interchange and distribution.

Additionally, with a crunch-style encoding method, only the endpoint (and possibly the selector) codebook entries (of which there are usually only hundreds, possibly up to a few thousand in a single texture) would need to be converted to the target format. So the GPU format conversion step doesn't actually need to be insanely fast.

Another idea is to just unify ASTC and BC7, two very high quality formats. The drop in quality due to unification would be relatively much less significant with this combination. (But how valuable is this combo?)


More universal GPU texture format stuff (originally published 9/9/16)

Some BC7 format references:

Source to CPU and shader BC7 (and other format) encoders/decoders:

Khronos texture format references, including BC6H and BC7:

It may be possible to add ETC1-style subblocks into a universal GPU texture format, in a way that can be compressed efficiently and still converted on the fly to DXT1. Converting full ETC1 (with subblocks and per-subblock base colors) directly to BC7 at high quality looks easy because of BC7's partition table support. BC7 tables 0 and 13 (in 2 subset mode) perfectly match the ETC1 subblock orientations.
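The claim about patterns 0 and 13 is easy to verify. The sketch below reads "tables 0 and 13" as BC7's 2-subset partition patterns 0 and 13 (values taken from the BC7 partition table) and compares them with the two ETC1 subblock layouts selected by the flip bit. Texels are listed in row-major order; `etc1_subblock_map` is an illustrative name.

```python
# BC7 2-subset partition patterns, one subset index per texel, row-major 4x4.
BC7_PARTITION_0 = [0, 0, 1, 1] * 4        # left half / right half
BC7_PARTITION_13 = [0] * 8 + [1] * 8      # top half / bottom half

def etc1_subblock_map(flip):
    """Subblock index (0 or 1) of each texel in an ETC1 block, row-major.

    flip == 0: two 2x4 subblocks side by side (left/right).
    flip == 1: two 4x2 subblocks stacked (top/bottom).
    """
    return [(y // 2) if flip else (x // 2)
            for y in range(4) for x in range(4)]

# flip bit -> BC7 partition index with the same texel grouping
FLIP_TO_BC7_PARTITION = {0: 0, 1: 13}
```

So a converter only has to look at the flip bit to pick the BC7 partition index, and each ETC1 subblock becomes one BC7 subset.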

Any DX11 class or better GPU supports BC7, so on these GPU's the preferred output format can be BC7. DXT1 can be viewed as a legacy lower quality fallback for older GPU's.

Also, I limited the per-block (or per-subblock) base colors to 5:5:5 to simplify the experiments in my previous posts. Maybe storing 5:5:5 (for ETC1/DXT1) with 1-3 bit per-component deltas could improve the output for BC7/ASTC.

Also, one idea for alpha channel support in a universal GPU format: Store a 2nd ETC1 texture, containing the alpha channel. There's nothing to do when converting to ETC1, because using two ETC1 textures for color+alpha is a common pattern. (And, this eats two samplers, which sucks.)

When converting to DXT5's alpha block (DXT5A blocks - and yes I know there are BCx format equivalents but I'm using crnlib terms here), just use another ETC1 block color/intensity selector index to DXT5A mapping table. This table will be optimized for grayscale conversion. BC7 has very flexible alpha support so it should be a straightforward conversion.

The final thing to figure out is ASTC, but OMG that format looks daunting. Reminds me of MPEG/JPEG specs.