Thursday, April 16, 2020

This benchmark compares stb_dxt v1.09, icbc, rgbcx v1.12, the original crunch, and Unity's optimized variant of crunch. Both 4-color and 3-color blocks can be used, but transparent texels are not used to encode black/dark texels in this benchmark. It covers a diverse assortment of 100 textures (not just images).
Same benchmark except this time with 3-color transparent texels used for black or dark texels in rgbcx (purple samples):
Here's an update, now with nvdxt.exe (black sample) and ispc_texcomp (brown sample). Note that the nvdxt.exe time is approximate, because I had to spawn nvdxt.exe as a separate process and it loads a .png and saves a .dds file. I spawned it twice: once without timing it, then immediately again while timing it.
nvdxt.exe command line:
nvdxt.exe -nomipmap -quality_highest -rms_threshold 50 -file image.png -output nvcompressed.dds -dxt1c -weight 1.0 1.0 1.0
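For reference, here's a minimal sketch of the spawn-twice-and-time approach described above (my own illustration in C++; the actual benchmark harness may differ):

#include <chrono>
#include <cstdio>
#include <cstdlib>

int main()
{
    // Command line taken from the post above.
    const char* cmd = "nvdxt.exe -nomipmap -quality_highest -rms_threshold 50 "
                      "-file image.png -output nvcompressed.dds -dxt1c -weight 1.0 1.0 1.0";

    // First run is untimed; the second, timed run then sees a warm file cache.
    std::system(cmd);

    const auto start = std::chrono::high_resolution_clock::now();
    std::system(cmd);
    const auto end = std::chrono::high_resolution_clock::now();

    std::printf("nvdxt.exe: %.3f secs (approximate - includes process spawn and .png/.dds I/O)\n",
        std::chrono::duration<double>(end - start).count());
    return 0;
}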
Thursday, April 9, 2020
BC1 encoding initial endpoint determination benchmark
Benchmark of BC1 encoders using different methods to determine the initial endpoints:
stb_dxt.h PCA: 35.754 dB, .551 us/block
rgbcx.h PCA: 35.794 dB, .651 us/block
rgbcx.h PCA+inset: 35.925 dB, .640 us/block
rgbcx.h 2D LS+inset+opt round: 35.920 dB, .541 us/block
rgbcx.h bounds+inset+XY covar: 35.836 dB, .472 us/block
This is across 100 textures, so even small avg. improvements are significant. Amazingly, the inset method (a few lines of code) buys rgbcx.h PCA .131 dB! All encoders should be doing this. You *must* pay attention to every little detail in these texture encoders.
Quality is performance in competitive texture block encoding, so even small boosts in quality allow us to dial down the # of total orders to check for the same average quality. This leads to a more competitive encoder.
Methods:
- bounds+inset+XY covar method is Castano's/van Waveren's.
All encoders should be applying the "inset" method described in their paper, because from a quantization perspective it makes perfect sense.
- 2D LS is Humus's method, ported to mostly integer math, with added inset + optimal rounding to 565.
- stb_dxt.h and rgbcx.h PCA is 3D integer PCA (3x3 covariance + 4 power iterations, picking 2 colors along the principal axis).
- PCA+inset+optimal rounding does PCA, picks 2 colors, then lerps between the 2 colors at 1/16 and 15/16, then optimally rounds to 565.
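To make the inset + optimal rounding steps concrete, here's a rough per-channel sketch (my own illustration, not code from any of the encoders above; the exact constants and clamping vary between implementations):

#include <algorithm>
#include <cstdint>

// Quantize an 8-bit component to 5 or 6 bits so that the value the GPU expands
// back to 8 bits is as close as possible to the original ("optimal rounding",
// rather than a plain right shift).
static uint8_t quant5(uint32_t c) { return (uint8_t)((c * 31 + 127) / 255); }
static uint8_t quant6(uint32_t c) { return (uint8_t)((c * 63 + 127) / 255); }

// Bounding-box endpoints for one channel of a 4x4 block, inset by ~1/16th of
// the extent so the quantized endpoints sit closer to the values the
// interpolated palette will actually reproduce.
static void bounds_inset(const uint8_t chan[16], uint8_t& lo, uint8_t& hi)
{
    lo = 255; hi = 0;
    for (int i = 0; i < 16; i++)
    {
        lo = std::min(lo, chan[i]);
        hi = std::max(hi, chan[i]);
    }
    const int inset = (hi - lo) >> 4;
    lo = (uint8_t)std::min(255, (int)lo + inset);
    hi = (uint8_t)std::max(0, (int)hi - inset);
}

This is only the endpoint setup; the selector search and any refinement passes run afterward.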
Wednesday, April 8, 2020
AMD GPU BC1 decoding lookup tables
Here are the lookup tables you can use to determine how AMD GPUs decode BC1 textures: https://pastebin.com/raw/LSgn0ent
These tables were gathered straight from a Radeon RX 580 by using a small D3D9 app that rendered a textured BC1 quad with point sampling and did a CPU readback. I used this same D3D9 app on an NVidia 1080 and the pixels I read back exactly matched what the NV BC1 formulas on the web predicted, so I'm confident in the approach.
For selectors 0 and 1, the 5->8 and 6->8 endpoint conversion just uses bitshifts/ORs (same as ideal BC1). For 4-color selector 2, use the tables. For selector 3, just invert the low/high endpoints. (I've verified you can do this.) For 3-color selector 2, use the tables.
To access the tables, index them with [color0_component * 32 + color1_component] for the 5-bit components, or * 64 for the 6-bit green component:
Block Compression (Direct3D 10) - Win32 apps | docs.microsoft.com
Converting the tables to formulas sounds like an interesting puzzle.
Example showing exactly how to use the tables to decode AMD BC1:
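Since the original example isn't reproduced here, this is a hypothetical sketch of how per-component tables with that indexing could be applied when decoding the 4-color selector 2 value (the parameter names and layout assumptions are mine; see the pastebin for the actual tables):

#include <cstdint>

// Decode the 4-color selector 2 color using measured per-component tables.
// tab5 has 32*32 entries indexed [c0 * 32 + c1]; tab6 has 64*64 entries
// indexed [c0 * 64 + c1].
static void decode_amd_sel2_4color(
    uint16_t color0, uint16_t color1,
    const uint8_t* tab5, const uint8_t* tab6,
    uint8_t out_rgb[3])
{
    // Unpack the 5:6:5 endpoints.
    const uint32_t r0 = (color0 >> 11) & 31, g0 = (color0 >> 5) & 63, b0 = color0 & 31;
    const uint32_t r1 = (color1 >> 11) & 31, g1 = (color1 >> 5) & 63, b1 = color1 & 31;

    // Selector 2 comes straight from the measured tables.
    out_rgb[0] = tab5[r0 * 32 + r1];
    out_rgb[1] = tab6[g0 * 64 + g1];
    out_rgb[2] = tab5[b0 * 32 + b1];

    // Per the notes above, selector 3 in 4-color mode is the same lookup with
    // the endpoint components swapped (tab5[r1 * 32 + r0], etc.), and
    // selectors 0/1 use the ideal shift/OR 5->8 and 6->8 expansion.
}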
Tuesday, April 7, 2020
CPU BC1 Encoding Pareto Frontier
rgbcx.h now defines the BC1 Pareto Frontier for high quality CPU BC1 encoding (i.e. no other available practical high quality CPU encoder beats it on both performance and quality):
Data:
I didn't include AMD Compressonator's encoder because in previous benchmarks (conducted by others) it was beaten by a weaker version of rgbcx.h for both perf. and quality.
The overall CPU BC1 Pareto frontier is defined by ispc_texcomp (at low quality: ~33.1 dB) and rgbcx for any higher quality level. We're going to need SIMD to compete against ispc_texcomp BC1 (a weak stb_dxt clone), which is my next major goal.
To get rgbcx to compete against icbc for max. quality I had to add prioritized cluster fit support for 3-color blocks (not just 4).
It's possible to permit rgbcx to go to even higher quality levels by enlarging the total ordering tables. They're currently limited to 32 entries per total ordering.
I think rgbcx.h's max quality is slightly higher than icbc's HQ mode because prioritized cluster fit can afford to do optimal rounding and evaluate accurate MSE errors in every trial. Regular cluster fit can't afford to do so because it has to evaluate so many total orderings.
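For context, this is roughly what "accurate MSE in every trial" means; a sketch of my own (not rgbcx.h's actual code) that quantizes candidate endpoints to 565, expands them the way an ideal decoder would, and measures the true block error:

#include <cmath>
#include <cstdint>

static uint8_t expand5(uint32_t v) { return (uint8_t)((v << 3) | (v >> 2)); }
static uint8_t expand6(uint32_t v) { return (uint8_t)((v << 2) | (v >> 4)); }

// Total squared RGB error of a 4x4 block against the 4-color palette formed by
// two 565 endpoints (e0, e1), picking the best selector per texel.
static uint64_t block_error_4color(const uint8_t pixels[16][3], uint16_t e0, uint16_t e1)
{
    uint8_t pal[4][3];
    pal[0][0] = expand5(e0 >> 11); pal[0][1] = expand6((e0 >> 5) & 63); pal[0][2] = expand5(e0 & 31);
    pal[1][0] = expand5(e1 >> 11); pal[1][1] = expand6((e1 >> 5) & 63); pal[1][2] = expand5(e1 & 31);
    for (int c = 0; c < 3; c++)
    {
        // One common rounding of the ideal 2/3:1/3 interpolants.
        pal[2][c] = (uint8_t)((pal[0][c] * 2 + pal[1][c] + 1) / 3);
        pal[3][c] = (uint8_t)((pal[0][c] + pal[1][c] * 2 + 1) / 3);
    }

    uint64_t total = 0;
    for (int i = 0; i < 16; i++)
    {
        uint32_t best = UINT32_MAX;
        for (int s = 0; s < 4; s++)
        {
            const int dr = pixels[i][0] - pal[s][0];
            const int dg = pixels[i][1] - pal[s][1];
            const int db = pixels[i][2] - pal[s][2];
            const uint32_t err = (uint32_t)(dr * dr + dg * dg + db * db);
            if (err < best) best = err;
        }
        total += best;
    }
    return total;
}

// PSNR in dB for one block (the benchmarks above report averages of this kind
// of metric; the exact averaging may differ).
static double block_psnr(uint64_t total_sq_err)
{
    const double mse = (double)total_sq_err / (16.0 * 3.0);
    return (mse > 0.0) ? 10.0 * std::log10((255.0 * 255.0) / mse) : 100.0;
}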
Links:
rgbcx: https://github.com/richgel999/bc7enc
libsquish: https://github.com/richgel999/libsquish
icbc: https://github.com/castano/icbc/blob/master/icbc.h
Saturday, April 4, 2020
New BC1 benchmark
Optimizing BC1 encoding is still useful and interesting because the same core algorithms are used in BC7 and ASTC/UASTC encoders. Most improvements made to BC1 encoding carry over nicely to the 2-bit and 3-bit selector modes of other formats.
Here's my latest benchmark:
The highest performing samples (above 37 dB) are rgbcx in 3-color block mode, where it can use transparent black colors (selector 3) for opaque black or very dark texels. (The only other BC1 encoder that might support this mode is the one in NVidia Texture Tools, but I'm not sure.) This technically turns opaque textures into textures with a useless alpha channel, but if the engine or shader just ignores alpha then this mode performs exceptionally well in the average case. The flags are cEncodeBC1Use3ColorBlocksForBlackPixels | cEncodeBC1Use3ColorBlocks.
This mode is super useful because it allows the 3-color block encoder to focus the endpoints on the brighter texels within the block, potentially greatly increasing quality. Blocks with very dark or black texels are common in practice.
If your engine supports ignoring the alpha channel in sampled BC1 textures then everyone using BC1 should be using encoders that support this.
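As a rough sketch of the idea (my own illustration, not rgbcx.h's implementation; the darkness threshold is an assumption):

#include <cstdint>

// In 3-color + transparent-black mode, very dark texels can be mapped to
// selector 3 (decoded as transparent black), so the endpoint fit only has to
// cover the remaining, brighter texels.
static const int kBlackThreshold = 16; // assumed cutoff, purely illustrative

static void classify_black_texels(const uint8_t pixels[16][3], uint8_t selectors[16],
                                  uint8_t bright[16][3], int& num_bright)
{
    num_bright = 0;
    for (int i = 0; i < 16; i++)
    {
        if ((pixels[i][0] < kBlackThreshold) && (pixels[i][1] < kBlackThreshold) && (pixels[i][2] < kBlackThreshold))
        {
            selectors[i] = 3; // transparent black in 3-color mode
        }
        else
        {
            selectors[i] = 0; // placeholder - chosen later by the selector search
            for (int c = 0; c < 3; c++)
                bright[num_bright][c] = pixels[i][c];
            num_bright++;
        }
    }
    // The endpoint optimizer then runs only on bright[0..num_bright-1].
}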
Data:
rgbcx.h flags:
- h is cEncodeBC1HighQuality
- ut is cEncodeBC1UseLikelyTotalOrderings
- ub is cEncodeBC1Use3ColorBlocksForBlackPixels
- 3 is cEncodeBC1Use3ColorBlocks
From the benchmarks I've seen it appears NVidia Texture Tools BC1 is around the same perf. as libsquish at slightly higher quality:
I believe this was rgbcx using 10 total orderings (the default setting). The max is 32, and every additional total ordering increases average quality. So at higher settings rgbcx is likely competitive against nvtt while being faster.
I'm currently working on integrating NVTT into my test app.