Wednesday, April 24, 2019

Idea for next texture compression experiment (originally published 9/11/16)

Right now, I've got a GPU texture in a simple ETC1 subset that is easily converted to most other GPU formats:

Base color: 15 bits, 5:5:5 RGB
Intensity table index: 3 bits
Selectors: 2 bits/texel
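
For concreteness, here's a minimal sketch of the per-block payload this subset implies (the struct and field names are mine, for illustration only):

  #include <cstdint>

  // 15 + 3 + 32 = 50 bits of payload per 4x4 block (before any
  // crunch-style entropy coding).
  struct Etc1SubsetBlock
  {
      uint16_t base_color;  // 5:5:5 RGB base color (15 bits used)
      uint8_t  intensity;   // 3-bit ETC1 intensity table index (0-7)
      uint32_t selectors;   // 16 texels * 2-bit selectors
  };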

Most importantly, this is a "single subset" encoding, using BC7 terminology. BC7 supports from 1 to 3 subsets per block. A subset is just a colorspace line represented by two R,G,B endpoint colors.

This format is easily converted to DXT1 using a table lookup. It's also the "base" of the universal GPU texture format I've been thinking about, because it's the data needed for DXT1 support. The next step is to experiment with refining this base data to better take advantage of the full ETC1 specification. So let's try adding two subsets to each block, with two partitions (again using BC7 terminology), top/bottom or left/right, which are supported by both ETC1 and BC7.

For example, we can code this base color, then delta code the two subset colors relative to it. We'll also add two more intensity indices, which can be delta coded against the base index. Another bit can indicate which ETC1 block color encoding "mode" (individual 4:4:4 4:4:4, or differential 5:5:5 3:3:3) should be used to represent the subset colors in the output block.
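
A rough sketch of the extra per-block "upgrade" data this implies (all field widths and names here are guesses for illustration, not a final bit layout):

  #include <cstdint>

  // Hypothetical extension data, delta coded against the base block.
  struct Etc1TwoSubsetExtension
  {
      uint8_t partition;          // 1 bit: top/bottom or left/right
      uint8_t mode;               // 1 bit: individual vs. differential
      int8_t  delta_color[2][3];  // 2 subset colors, deltas vs. the base color
      int8_t  delta_intensity[2]; // 2 intensity indices, deltas vs. the base index
  };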

In DXT1 mode, we can ignore this extra delta coded data and just convert the basic (single subset) base format. In ETC1/BC7/ASTC modes, we can use the extra information to support 2 subsets and 2 partitions.

Currently, the idea is to share the same selector indices between the single subset (DXT1) and two subset (BC7/ASTC/full ETC1) encodings. This will constrain how well this idea works, but I think it's worth trying out.

To add more quality to the 2 subset mode, we can delta code (maybe with some fancy per-pixel prediction) another array of selectors in some way. We can also add support for more partitions (derived from BC7's or ASTC's), too.

Few more random thoughts on a "universal" GPU texture format (originally published 9/9/16)

In my experiments, a simple but usable subset of ETC1 can be easily converted to DXT1, BC7, and ATC. And after studying the standard, it very much looks like the full ETC1 format can be converted into BC7 with very little loss. (And when I say "converted", I mean using very little CPU - basically just some table lookup operations over the endpoint and selector entries.)

ASTC seems to be (at first glance) around as powerful as BC7, so converting the full ETC1 format to ASTC with very little loss should be possible. (Unfortunately ASTC is so dense and complex that I don't have time to determine this for sure yet.)

So I'm pretty confident now that a universal format could be compatible with ASTC, BC7, DXT1, ETC1, and ATC. The only other major format that I can't fit into this scheme easily is my old nemesis, PVRTC.

Obviously this format won't look as good as a dedicated, single-format encoder's output. So what? There are many valuable use cases that don't require super high quality levels. This scheme purposely trades some quality for interchange and distribution.

Additionally, with a crunch-style encoding method, only the endpoint (and possibly the selector) codebook entries (of which there are usually only hundreds, possibly up to a few thousand in a single texture) would need to be converted to the target format. So the GPU format conversion step doesn't actually need to be insanely fast.
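
In code terms, the transcoder would only loop over the codebooks, not every block. A sketch (all types and the conversion function here are hypothetical stand-ins for the table-lookup machinery described in the experiments below):

  #include <cstdint>
  #include <vector>

  struct Etc1Endpoint  { uint16_t base_555; uint8_t intensity; }; // codebook entry
  struct Dxt1Endpoints { uint16_t low_color, high_color; };

  // Hypothetical: the ETC1->DXT1 table-lookup conversion.
  Dxt1Endpoints convert_endpoint_to_dxt1(const Etc1Endpoint &e);

  void transcode_endpoints(const std::vector<Etc1Endpoint> &codebook,
                           std::vector<Dxt1Endpoints> &out)
  {
      // Convert each codebook entry once; typically only hundreds to a few
      // thousand entries per texture, so speed barely matters here.
      out.resize(codebook.size());
      for (size_t i = 0; i < codebook.size(); i++)
          out[i] = convert_endpoint_to_dxt1(codebook[i]);
  }

Each block then just references a converted entry by codebook index, so the per-block work at load time stays trivial.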

Another idea is to just unify ASTC and BC7, two very high quality formats. The drop in quality due to unification would be relatively much less significant with this combination. (But how valuable is this combo?)

(This blog post was originally mirrored here: http://geldreich1.rssing.com/chan-32192436/all_p6.html#item116)

More universal GPU texture format stuff (originally published 9/9/16)

Some BC7 format references:
https://msdn.microsoft.com/en-us/library/hh308954(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/hh308953.aspx

Source to CPU and shader BC7 (and other format) encoders/decoders:
https://github.com/Microsoft/DirectXTex

Khronos texture format references, including BC6H and BC7:
https://www.khronos.org/registry/dataformat/specs/1.1/dataformat.1.1.pdf

It may be possible to add ETC1-style subblocks into a universal GPU texture format, in a way that can be compressed efficiently and still converted on the fly to DXT1. Converting full ETC1 (with subblocks and per-subblock base colors) directly to BC7 at high quality looks easy because of BC7's partition table support. BC7 tables 0 and 13 (in 2 subset mode) perfectly match the ETC1 subblock orientations.
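
For reference, those two patterns from the BC7 2-subset partition table look like this (0/1 marking which subset each texel belongs to):

  Partition 0:        Partition 13:
  0 0 1 1             0 0 0 0
  0 0 1 1             0 0 0 0
  0 0 1 1             1 1 1 1
  0 0 1 1             1 1 1 1

Partition 0 matches ETC1's flip=0 case (two 2x4 subblocks side by side), and partition 13 matches flip=1 (two 4x2 subblocks stacked top/bottom).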

Any DX11-class or better GPU supports BC7, so on these GPUs the preferred output format can be BC7. DXT1 can be viewed as a legacy, lower quality fallback for older GPUs.

Also, I limited the per-block (or per-subblock) base colors to 5:5:5 to simplify the experiments in my previous posts. Maybe storing 5:5:5 (for ETC1/DXT1) with 1-3 bit per-component deltas could improve the output for BC7/ASTC.

Also, one idea for alpha channel support in a universal GPU format: Store a 2nd ETC1 texture, containing the alpha channel. There's nothing to do when converting to ETC1, because using two ETC1 textures for color+alpha is a common pattern. (And, this eats two samplers, which sucks.)

When converting to DXT5's alpha blocks (DXT5A blocks - and yes, I know there are BCx format equivalents, but I'm using crnlib terms here), just use another lookup table mapping the ETC1 block color/intensity/selector indices to DXT5A data. This table will be optimized for grayscale conversion. BC7 has very flexible alpha support, so it should be a straightforward conversion.
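
A sketch of what one entry of that alpha table could hold (the layout, names, and table size are my guesses; since the alpha ETC1 texture is grayscale, the table would only need the 5-bit base value and 3-bit intensity index, i.e. 32*8 entries):

  #include <cstdint>

  struct Dxt5aEntry
  {
      uint8_t alpha0, alpha1;     // precomputed DXT5A endpoints
      uint8_t selector_remap[4];  // ETC1 2-bit selector -> DXT5A 3-bit selector
  };

  // Built offline with a DXT5A endpoint optimizer, optimized for grayscale.
  extern const Dxt5aEntry g_etc1_to_dxt5a[32 * 8];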

The final thing to figure out is ASTC, but OMG that format looks daunting. Reminds me of MPEG/JPEG specs.

Universal texture compression: 5th experiment (originally published 9/15/16)


I outlined a plan for my next texture compression experiment in a previous post ("Idea for next texture compression experiment", above). I modified my ETC1 packer so it accepts an optional parameter which forces the encoder to use a set of predetermined selectors, instead of allowing it to use whatever selectors it likes.

The idea is, I can take an ETC1 texture using a subset of the full format (no flips and only a single base color/intensity index - basically a single partition/single subset format, using BC7 terminology) and "upgrade" it to higher quality without modifying the selector indices. I think this is one critical step to making a practical universal texture format that supports both DXT1 and ETC1.
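
The modified packer interface amounts to something like this (a sketch; this is not rg_etc1's actual API, and the name is mine):

  #include <cstdint>

  // Pack a 4x4 block as full ETC1, but force the given 2-bit selectors
  // (taken from the single-subset encoding) to be used verbatim. Only the
  // base colors, intensity indices, flip, and diff/individual mode vary.
  uint64_t pack_etc1_block_constrained(const uint32_t pixels[16],
                                       const uint8_t forced_selectors[16]);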

Turns out, this idea works better than I thought it would. The ETC1 subset encoding gets 33.265 dB, while the "upgraded" version (using the same selectors as the subset encoding) gets 34.315 dB, a big gain. (Which isn't surprising, because the ETC1 subset encoding doesn't take full advantage of the format.) The nearly-optimal ETC1 encoding gets 35.475 dB, so there is still some quality left on the table here.

The ETC1 subset texture converted to DXT1 comes in at 32.971 dB. I'm not worried about having the best DXT1 quality, because I'm going to support ASTC and BC7 too, and (at the minimum) they can be directly converted from the "upgraded" ETC1 encoding that this experiment is about.

I need to think about the next step from here. I now know I can build a crunch-like format that supports DXT1, ETC1, and ATC. These experiments have opened up a bunch of interesting product and open source library ideas. Proving that BC7 support is also practical to add should be easy. ASTC is so darned complex that I'm hesitant to do it for "fun".

1. ETC1 (subset):

Max:  80, Mean: 3.809, MSE: 30.663, RMSE: 5.537, PSNR: 33.265

Its selectors:


2. ETC1 (full format, constrained selectors) - optimizer was constrained to always use the subset encoding's selectors:


Max:  85, Mean: 3.435, MSE: 24.076, RMSE: 4.907, PSNR: 34.315

Its selectors (should be the same as #1's):


Biased delta between the ETC1 subset and ETC1 full encoding with constrained selectors - so we can see what pixels have benefited from the "upgrade" pass:


3. ETC1 (full format, unconstrained selectors) - packed using an improved version of rg_etc1 in highest quality mode:


Max:  80, Mean: 3.007, MSE: 18.432, RMSE: 4.293, PSNR: 35.475

Delta between the best ETC1 encoding (#3) and the ETC1 encoding using constrained selectors (#2):

Direct conversion of ETC1 to DXT1 texture data: 4th experiment (originally published 9/11/16)

In this experiment, I've worked on reducing the size of the lookup table used to quickly convert a subset of ETC1 texture data (using only a single 5:5:5 base color, one 3-bit intensity table index, and 2-bit selectors) directly to DXT1 texture data. The ETC1 encoder is now able to simultaneously optimize for both formats, which is what lets me shrink the conversion table. To accomplish this, I've modified the ETC1 base color/intensity optimizer function so it also factors the DXT1 block encoding error into each trial's computed ETC1 error.

The overall trial error reported back to the encoder in this experiment was etc_error*16+dxt_error. The ETC1->DXT1 lookup table is now 3.75MB, with precomputed DXT1 low/high endpoints for three used selector ranges: 0-3, 0-2, 1-3. My previous experiment had 10 precomputed ranges, which seemed impractically large. I'm unsure which set of ranges is really needed or optimal yet. Even just one (0-3) seems to work OK, but with more artifacts on very high contrast blocks.
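
Inside the optimizer's trial loop, the change is just the error metric (a sketch; the helper name is mine, but the 16x weight is the one quoted above):

  #include <cstdint>

  // Combined metric used to rank each candidate base color/intensity pair:
  // weight the ETC1 error 16x so ETC1 quality still dominates, while
  // candidates that convert badly to DXT1 get penalized.
  inline uint64_t combined_trial_error(uint64_t etc_error, uint64_t dxt_error)
  {
      return etc_error * 16 + dxt_error;
  }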

Anyhow, here's kodim18.

ETC1 subset:


Max:  80, Mean: 3.809, MSE: 30.663, RMSE: 5.537, PSNR: 33.265

DXT1:


Max:  76, Mean: 3.952, MSE: 32.806, RMSE: 5.728, PSNR: 32.971

ETC1 block selector range usage histogram:
0-3: 19161
1-3: 3012
0-2: 2403

Direct conversion of ETC1 to DXT1 texture data: 3rd experiment (originally published 9/7/16)

I've changed the lookup table used to convert to DXT1. Each cell in the 256K entry table (32*32*32*8, for each 5:5:5 base color and 3-bit intensity table entry in my ETC1 subset format) now contains 10 entries, to account for each combination of actually used ETC1 selector ranges in a block:

 { 0, 0 },
 { 1, 1 },
 { 2, 2 },
 { 3, 3 },
 { 0, 3 },
 { 1, 3 },
 { 2, 3 },
 { 0, 2 },
 { 0, 1 },
 { 1, 2 }

The first 4 entries here account for blocks that get encoded into a single color. The next entry accounts for blocks which use all selectors, then { 1, 3 } accounts for blocks which only use selectors 1,2,3, etc.

So for example, when converting from ETC1, if only selector 2 was actually used in a block, the ETC1->DXT1 converter uses a set of DXT1 low/high colors optimized for that particular use case. If all selectors were used, it uses entry #4, etc. The downsides to this technique are the extra CPU expense in the ETC1->DXT1 converter to determine the range of used selectors, and the extra memory to hold a larger table.
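
Determining the used range is cheap. A sketch (the inner table simply encodes the { lo, hi } list above in the order given):

  #include <algorithm>
  #include <cstdint>

  // Returns an index 0-9 into the 10-entry { lo, hi } list above, given a
  // block's 16 2-bit selectors. The -1 cells are unreachable since lo <= hi.
  static int used_selector_range(const uint8_t selectors[16])
  {
      int lo = 3, hi = 0;
      for (int i = 0; i < 16; i++)
      {
          lo = std::min<int>(lo, selectors[i]);
          hi = std::max<int>(hi, selectors[i]);
      }
      static const int8_t s_range_index[4][4] = {
          {  0,  8,  7,  4 },
          { -1,  1,  9,  5 },
          { -1, -1,  2,  6 },
          { -1, -1, -1,  3 },
      };
      return s_range_index[lo][hi];
  }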

Note the ETC1 encoder is still not aware at all that its output will also be DXT1 coded. That's the next experiment. I don't think using this larger lookup table is necessary; a smaller table should hopefully be OK if the ETC1 subset encoder is aware of the DXT1 artifacts it's introducing in each trial. Another idea is to use a simple table most of the time, and only access the larger/deeper conversion table on blocks which use the brighter ETC1 intensity table indices (the ones with more error, like 5-7).

ETC1 (subset):


Error: Max:  80, Mean: 3.802, MSE: 30.247, RMSE: 5.500, PSNR: 33.324

ETC1 texture directly converted to DXT1:


Error: Max:  73, Mean: 3.939, MSE: 32.218, RMSE: 5.676, PSNR: 33.050

I experimented with allowing the DXT1 optimizer (used to build the lookup table) to use 3-color blocks. This is actually a big deal for this use case, because the transparent selector's color is black (0,0,0). ETC1's saturation to 0 or 255 after adding the intensity table values creates "strange" block colors (away from the block's colorspace line), and this trick allows the DXT1 optimizer to work around that issue better. I'm not using this trick above, though.
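
For context, 3-color mode in DXT1 is signaled purely by endpoint ordering, which is why the table-building optimizer can opt into it per block:

  #include <cstdint>

  // DXT1 mode selection is implied by the encoded 5:6:5 endpoints:
  // color0 > color1 selects 4-color mode; color0 <= color1 selects
  // 3-color mode, where selector 3 decodes to black (transparent in
  // the 1-bit alpha variant).
  inline bool is_three_color_mode(uint16_t color0, uint16_t color1)
  {
      return color0 <= color1;
  }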

I started seriously looking at the BC7 texture format's details today. It's complex, but nowhere near as complex as ASTC. I'm very tempted to try converting my ETC1 subset to that format next.

Also, if you're wondering why I'm working on this stuff: I want to write one .CRN-like encoder that supports efficient transcoding into as many GPU formats as possible. It's a lot of work to write these encoders, and the idea of that work's value getting amplified across a huge range of platforms and devices is very appealing. A universal format's quality won't be the best, but it may be practical to add a losslessly encoded "fixup" chunk to the end of the universal file. This could improve quality for a specific GPU format. 

Direct conversion of ETC1 to DXT1 texture data: 2nd experiment (originally published 9/06/16)

I lowered the ETC1 encoder's quality setting, so it doesn't try varying the block color so much during endpoint optimization. The DXT1 artifacts in my first experiment are definitely improved, although the overall quality is reduced. I also enabled usage of 3-color DXT1 blocks (although that was very minor).

Perhaps the right solution (that preserves quality but avoids the artifacts) is to add ETC1->DXT1 error evaluation to the ETC1 encoder, so it's aware of how much DXT1 error each ETC1 trial block color has.

ETC1 (subset):


Error: Max: 101, Mean: 4.036, MSE: 34.999, RMSE: 5.916, PSNR: 32.690

Converted directly to DXT1 using an 18-bit lookup table:


Error: Max: 107, Mean: 4.239, MSE: 38.930, RMSE: 6.239, PSNR: 32.228

Another ETC1:


Error: Max: 121, Mean: 4.220, MSE: 45.108, RMSE: 6.716, PSNR: 31.588

DXT1:


Error: Max: 117, Mean: 4.403, MSE: 48.206, RMSE: 6.943, PSNR: 31.300

Direct conversion of ETC1 to DXT1 texture data (originally published 9/11/16)

In this experiment, I limited my ETC1 encoder to only use a subset of the full format: differential mode, no flipping, with the diff color always set to (0,0,0). So all we use in the ETC1 format is the 5:5:5 base color, the 3-bit intensity table index, and the 16 2-bit selectors. This is the same subset used in my earlier post on ETC1 endpoint clusterization.

This limits the ETC1 encoder to only utilizing 4 colors per block, just like DXT1. These 4 colors are on a line parallel to the grayscale axis. Fully lossless conversion (of this ETC1 subset format) to DXT1 is not possible in all cases, but it may be possible to do a "good enough" conversion.

The ETC1->DXT1 conversion step uses a precomputed 18-bit lookup table (5*3+3 bits) to accelerate the conversion of the ETC1 base color, intensity table index, and selectors to DXT1 low/high color endpoints and selectors. Each table entry contains the best DXT1 low/high color endpoints to use, along with a 4 entry table specifying which DXT1 selector to use for each ETC1 selector. I used crunch's DXT1 endpoint optimizer to build this table.
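
Putting the pieces together, the per-block conversion is essentially this (a sketch; the struct layout and names are mine, not crunch's):

  #include <cstdint>

  struct Dxt1Entry
  {
      uint16_t low_color, high_color;  // precomputed DXT1 endpoints (5:6:5)
      uint8_t  selector_remap[4];      // ETC1 selector -> DXT1 selector
  };

  // 18-bit index: 5:5:5 base color plus 3-bit intensity table index.
  extern const Dxt1Entry g_etc1_to_dxt1[1 << 18];

  static void convert_block(uint16_t base_555, uint32_t intensity,
                            const uint8_t etc1_selectors[16],
                            uint16_t &dxt1_low, uint16_t &dxt1_high,
                            uint32_t &dxt1_selectors)
  {
      const Dxt1Entry &e = g_etc1_to_dxt1[((uint32_t)base_555 << 3) | intensity];
      dxt1_low = e.low_color;
      dxt1_high = e.high_color;
      dxt1_selectors = 0;
      for (int i = 0; i < 16; i++)
          dxt1_selectors |= (uint32_t)e.selector_remap[etc1_selectors[i]] << (i * 2);
  }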

ETC1 (subset):

Error: Max:  80, Mean: 3.802, MSE: 30.247, RMSE: 5.500, PSNR: 33.324



Converted directly to DXT1 using the lookup table approach, then decoded (in software using crnlib):

Error: Max:  73, Mean: 3.966, MSE: 32.873, RMSE: 5.733, PSNR: 32.962


Delta image:


Grayscale delta histogram:


There are some block artifacts to work on, but this is great progress for 1 hour of work. (Honestly, I would have been pretty worried if there weren't any artifacts to figure out on my first test!)

These results are extremely promising. The next step is to work on the artifacts and do more testing. If this conversion step can be made to work well enough it means that a lossy "universal crunch" format that can be quickly and efficiently transcoded to either DXT1 or ETC1 is actually possible.