Wednesday, February 3, 2021

RDO BC1-BC7 progress

I've been making progress on my first RDO BC7 encoder. I started working on RDO BC1-7 years ago, but I put this work on hold to open source Basis Universal. BasisU was way more important from a business perspective. (Games are fun and all, but the game business doesn't pay and web and mapping are where the eyeballs are at.)

RDO BC1-5 are done and already checked into the bc7enc_rdo repo. The test app in this repo currently supports only RDO BC1 and BC4, but I'll add in BC3/5 very soon (they are just trivial variations of BC1/4). I'm hoping the KTX2 guys will add this encoder to their repo, so I don't have to create yet another command line tool that supports mipmaps, reading/writing DDS/KTX, etc. RDO BC1-5 are implemented as post-processors, so they are compatible with any other non-RDO BC1-5 encoder.

For my first RDO BC7 encoder, I've modified bc7enc's BC7 encoder (which purposely only supports 4 modes: 1/5/6/7) to support optional per-mode error weights, and 6-bit endpoint components with fixed 0/1 p-bits in mode 6. These two simple changes immediately reduce LZ compressed file sizes by around 5-10% with Deflate, with no perf. impact. I may support doing something like this for the other modes. I also implemented Castano's optimal endpoint rounding method, because why not.

The next step is creating a post-processor that accepts an array of encoded BC7 blocks, and modifies them for higher quality per compressed bit by increasing LZ matches. The post-processor function will support all the modes, although I'm testing primarily with bc7enc at the moment. Merging selector bits with previously encoded blocks is the simplest thing to do, which I just got working for any mode. 

I'm using the usual Lagrangian multiplier method (j=D+l*R, where D=MSE, R=predicted bits, l=lambda). Here's a good but dense article on rate distortion methods and theory: Rate-distortion methods for image and video compression, by Ortega and Ramchandran (1998). I first read this years ago while working on Halo Wars 1's texture compression system, which was like crunch's .CRN mode but simpler. None of this stuff is new, and the image and video folks have been doing it for decades.
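As a rough sketch of how that cost function gets used: each candidate encoding of a block gets a distortion D and a predicted bit cost R, and the candidate with the lowest j wins. The `Candidate` struct and `pick_best_candidate()` name below are made up for illustration and are not the bc7enc_rdo API; how D and R are actually measured is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One candidate encoding of a block (hypothetical struct, for illustration only).
struct Candidate
{
    float mse;   // D: distortion of this candidate vs. the source block
    float bits;  // R: predicted LZ-compressed size of this candidate, in bits
};

// Returns the index of the candidate with the lowest cost j = D + lambda * R.
// Ties keep the earliest candidate.
inline size_t pick_best_candidate(const std::vector<Candidate>& candidates, float lambda)
{
    assert(!candidates.empty());

    size_t best = 0;
    float best_j = candidates[0].mse + lambda * candidates[0].bits;

    for (size_t i = 1; i < candidates.size(); i++)
    {
        const float j = candidates[i].mse + lambda * candidates[i].bits;
        if (j < best_j)
        {
            best_j = j;
            best = i;
        }
    }
    return best;
}
```

With lambda=0 this always picks the lowest-MSE candidate (plain non-RDO behavior). Raising lambda makes cheaper-to-code candidates win even when they have more error.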

I first implemented the Lagrangian multiplier method in 2017, as a postprocess on top of crunch's BC1 RDO mode which we sent to a few companies. The Lagrangian multiplier method itself is easy, but estimating LZ bitrate and especially handling smooth blocks is tricky. The current smooth block method I'm using computes the standard deviation of each color component in a block, takes the maximum, and from that scalar computes a per-block MSE error scale. This artificially amplifies computed errors on smooth blocks, which is a hack, but it does seem to work. This hurts R-D performance but something must be done or smooth blocks turn to shit.
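Here's a minimal sketch of that kind of smooth block scale. The max-of-component-std-devs part follows the description above. The mapping from that scalar to a scale is an assumption: this version uses a plain linear ramp from the max scale (flat block) down to 1.0 (at or above the std dev threshold), which may not match the shipping curve. Function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Standard deviation of one 8-bit component over a 4x4 block.
// pPixels points to 16 RGBA texels (64 bytes); comp selects 0=R, 1=G, 2=B.
inline float component_std_dev(const uint8_t* pPixels, int comp)
{
    float sum = 0.0f, sum_sq = 0.0f;
    for (int i = 0; i < 16; i++)
    {
        const float v = pPixels[i * 4 + comp];
        sum += v;
        sum_sq += v * v;
    }
    const float mean = sum / 16.0f;
    // Clamp at 0 to guard against tiny negative values from rounding.
    const float var = std::max(0.0f, sum_sq / 16.0f - mean * mean);
    return std::sqrt(var);
}

// Per-block MSE scale: max_mse_scale for a perfectly flat block, 1.0 once the
// largest component std dev reaches max_std_dev.
// ASSUMPTION: the ramp between those two points is linear.
inline float smooth_block_mse_scale(const uint8_t* pPixels, float max_std_dev, float max_mse_scale)
{
    assert(max_std_dev > 0.0f && max_mse_scale >= 1.0f);

    float s = 0.0f;
    for (int c = 0; c < 3; c++)
        s = std::max(s, component_std_dev(pPixels, c));

    const float t = std::min(s / max_std_dev, 1.0f);
    return max_mse_scale * (1.0f - t) + t;
}
```

The returned scale would multiply the block's MSE before it goes into j=D+l*R, so flat blocks look more expensive to damage than busy ones.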

Some RDO BC1 and BC7 examples on kodim23, which has lots of smooth blocks:

RDO BC1: 8KB dictionary, lambda=1.0, max smooth block MSE scale=10.2, max std dev=18.0, linear metrics
38.763 RGB dB, 2.72 bits/texel (Deflate, miniz max compression)


RDO BC7 modes 1+6: 8KB dictionary, lambda=1.0, max smooth block MSE scale=19.2, max std dev=18.0, -u4, linear metrics
42.659 RGB dB, 5.05 bits/texel (Deflate, miniz max compression)
Mode 1: 1920 blocks
Mode 6: 22656 blocks



RDO BC7 modes 1+6: 8KB dictionary, lambda=2.0, max smooth block MSE scale=20.4, max std dev=18.0, -u4, linear metrics
40.876 RGB dB, 4.41 bits/texel (Deflate, miniz max compression)
Mode 1: 1920 blocks
Mode 6: 22656 blocks


To get an idea how bad things can get if you don't do anything to handle smooth blocks, here's BC7 modes 1+6: lambda=1.0, no smooth block error scaling (max MSE scale=1.0):
38.469 RGB dB, 3.49 bits/texel


I'm showing kodim23 because how you handle smooth blocks in this method is paramount. 92% of kodim23's blocks are treated as smooth blocks (because the max component standard deviation of any block is <= 18). This means that most of the MSE errors being computed and plugged into the Lagrangian calculation are being artificially scaled up. There must be a better way, but at least it's simple. (By comparison, crunch's clusterization-based method didn't do anything special for smooth blocks - it just worked.)

I'm still tuning how smooth blocks are handled. Being too conservative with smooth blocks can cause very noticeable block artifacts at higher lambdas. Being too liberal with smooth blocks causes the R-D efficiency (quality per LZ bit) to go down:


Here are some R-D curves from my early BC1 RDO results. rgbcx.h's RDO BC1 is clearly beating crunch's 13 year old RDO BC1 implementation, achieving higher quality per LZ compressed bit. crunch is based off endpoint/selector clusterization+refinement with no direct awareness of what LZ is going to do with the output data, so this isn't surprising. 


(These images are way too big, but I'm too tired to resize them.)

For some historical background, the crunch library (which also supports RDO BC1-5) has always computed and displayed LZMA statistics on the compressed texture's output data. (As a side note, there's no point using Unity's crunch repo for RDO BC1-5 - they didn't optimize RDO, just .CRN.) The entire goal of crunch, from the beginning, was RDO BC1-5. It achieved this indirectly by causing many blocks, especially nearby ones, to use the same endpoint/selector bits, increasing the ratio of LZ matches vs. literals. I remember being very excited by RDO texture encoders in 2009, because I realized how useful they would be to video game developers. For a fun but practical Windows demo I wrote years ago of crunch's RDO encoder (written using managed C++ of all things), check out ddsexport.

Anyhow, the next step is to further enhance RDO BC7 opaque, then dive into alpha. I'll be open sourcing the RDO BC7 postprocessor within a week. After this I'm going to write a second stronger version.

I suspect everybody will switch to RDO texture encoders at some point. Selector RDO with optional endpoint refinement is very easy to do on all the GPU texture formats, even PVRTC1.

I recently went back and updated the UASTC RDO encoder to use the same basic options (lambda and smooth block settings) as my RDO BC1-7 encoders. The original UASTC RDO encoder controlled quality vs. bitrate in a different way. These changes will be in basisu v1.13, which should be released on github hopefully by next week (once we get approval from the company we're working with).

Saturday, January 23, 2021

How to benchmark or use the UASTC encoder in the Basis Universal library

UASTC is a subset of LDR ASTC with a 4x4 block size. It's always 8bpp and very high quality. If your engine/product/benchmark supports BC7 or LDR ASTC 4x4, trying out our UASTC encoder/transcoder (without using .basis or .KTX2 at all) is pretty simple:

Compile/link in the Basis Universal encoder and transcoder .cpp files (or put them into libs). Call basisu_encoder_init() at startup.

To encode 4x4 blocks to the 8bpp UASTC format, call encode_uastc():
https://github.com/BinomialLLC/basis_universal/blob/master/encoder/basisu_uastc_enc.h

To decode UASTC blocks to raw 32bpp pixels, call

bool unpack_uastc(const uastc_block& blk, color32* pPixels, bool srgb);

Set the "srgb" flag to always false right now, because that's what the UASTC encoder assumes it will be set to. (We're fixing this for the Feb. release.)

Or you can call transcode_uastc_to_bc7() or transcode_uastc_to_astc(), then unpack those blocks yourself (ASTC will always be equal or higher quality than BC7 because UASTC is a pure subset of LDR 4x4 ASTC):

https://github.com/BinomialLLC/basis_universal/blob/master/transcoder/basisu_transcoder_uastc.h

There's an optional RDO post processor in there too that you can call on arrays of UASTC blocks, but it's pretty basic right now. See uastc_rdo().

The advantage of UASTC is that you can transcode it at run-time to basically any texture format. There are very high quality transcoders to BC1-5, ETC1/2, BC7, etc. It even supports PVRTC1. The disadvantage is a slight drop in quality vs. best BC7/ASTC, but not much, and slower encoding. We even throw in a free RDO encoder (as a simple post processor) for UASTC.

Tuesday, September 15, 2020

LZHAM and crunch are now Public Domain software

As of 9/15/2020, acting as the full legal owner of the LZHAM and crunch data compression libraries, I (acting as an individual) have placed these libraries into the Public Domain. For jurisdictions that don't recognize releasing Public Domain software, there are unlicense-style fallback clauses.


Thanks to Cowles & Thompson, a law firm in Dallas, TX for making this Public Domain release possible.

Wednesday, August 26, 2020

LZHAM and "crunch" IP will be placed into the Public Domain on 9/15/2020

As the owner of the "LZHAM" and "crunch" free open source software IP, I have decided to place these two works into the Public Domain in the United States, expressly waiving copyright protection. Once this is done this software will no longer be my or anyone's IP (i.e. it will NOT BE INTELLECTUAL PROPERTY, OR ANYONE'S PROPERTY). The upload placing these works into the Public Domain will occur on 9/15/2020 around noon EST.

This public domain declaration and anti-copyright waiver (somewhat derived from the unlicense and CC0) will be distributed along with the software:

THIS SOFTWARE IS IN THE PUBLIC DOMAIN

THIS IS FREE AND UNENCUMBERED SOFTWARE EXPLICITLY AND OVERTLY RELEASED AND CONTRIBUTED TO THE PUBLIC DOMAIN, PERMANENTLY, IRREVOCABLY AND UNCONDITIONALLY WAIVING ANY AND ALL CLAIM OF COPYRIGHT, IN PERPETUITY ON SEPTEMBER 15, 2020.

1. FALLBACK CLAUSES

THIS SOFTWARE MAY BE FREELY USED, DERIVED FROM, EXECUTED, LINKED WITH, MODIFIED AND DISTRIBUTED FOR ANY PURPOSE, COMMERCIAL OR NON-COMMERCIAL, BY ANYONE, FOR ANY REASON, WITH NO ATTRIBUTION, IN PERPETUITY.

THE AUTHOR OR AUTHORS OF THIS WORK HEREBY OVERTLY, FULLY, PERMANENTLY, IRREVOCABLY AND UNCONDITIONALLY FORFEITS AND WAIVES ALL CLAIM OF COPYRIGHT (ECONOMIC AND MORAL), ANY AND ALL RIGHTS OF INTEGRITY, AND ANY AND ALL RIGHTS OF ATTRIBUTION. ANYONE IS FREE TO COPY, MODIFY, ENHANCE, OPTIMIZE, PUBLISH, USE, COMPILE, DECOMPILE, ASSEMBLE, DISASSEMBLE, DOWNLOAD, UPLOAD, TRANSMIT, RECEIVE, SELL, FORK, DERIVE FROM, LINK, LINK TO, CALL, REFERENCE, WRAP, THUNK, ENCODE, ENCRYPT, TRANSFORM, STORE, RETRIEVE, DISTORT, DESTROY, RENAME, DELETE, BROADCAST, OR DISTRIBUTE THIS SOFTWARE, EITHER IN SOURCE CODE FORM, IN A TRANSLATED FORM, AS A LIBRARY, AS TEXT, IN PRINT, OR AS A COMPILED BINARY OR EXECUTABLE PROGRAM, OR IN DIGITAL FORM, OR IN ANALOG FORM, OR IN PHYSICAL FORM, OR IN ANY OTHER REPRESENTATION, FOR ANY PURPOSE, COMMERCIAL OR NON-COMMERCIAL, AND BY ANY MEANS, WITH NO ATTRIBUTION, IN PERPETUITY.

2. ANTI-COPYRIGHT WAIVER AND STATEMENT OF INTENT

IN JURISDICTIONS THAT RECOGNIZE COPYRIGHT LAWS, THE AUTHOR OR AUTHORS OF THIS SOFTWARE OVERTLY, FULLY, PERMANENTLY, IRREVOCABLY AND UNCONDITIONALLY DEDICATE, FORFEIT, AND WAIVE ANY AND ALL COPYRIGHT INTEREST IN THE SOFTWARE TO THE PUBLIC DOMAIN. WE MAKE THIS DEDICATION AND WAIVER FOR THE BENEFIT OF THE PUBLIC AT LARGE AND TO THE DETRIMENT OF OUR HEIRS AND SUCCESSORS. WE INTEND THIS DEDICATION AND WAIVER TO BE AN OVERT ACT OF RELINQUISHMENT IN PERPETUITY OF ALL PRESENT AND FUTURE RIGHTS TO THIS SOFTWARE UNDER COPYRIGHT LAW. WE INTEND THIS SOFTWARE TO BE FREELY USED, COMPILED, EXECUTED, MODIFIED, PUBLISHED, DERIVED FROM, OR DISTRIBUTED BY ANYONE, FOR ANY COMMERCIAL OR NON-COMMERCIAL USE, WITH NO ATTRIBUTION, IN PERPETUITY.

3. NO WARRANTY CLAUSE

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHOR OR AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE, OR DERIVING FROM THE SOFTWARE, OR LINKING WITH THE SOFTWARE, OR CALLING THE SOFTWARE, OR EXECUTING THE SOFTWARE, OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 

4. FINAL ANTI-COPYRIGHT AND INTENT FALLBACK CLAUSE

SHOULD ANY PART OF THIS PUBLIC DOMAIN DECLARATION, OR THE FALLBACK CLAUSES, OR THE ANTI-COPYRIGHT WAIVER FOR ANY REASON BE JUDGED LEGALLY INVALID OR INEFFECTIVE UNDER APPLICABLE LAW, THEN THE PUBLIC DOMAIN DECLARATION, THE FALLBACK CLAUSES, AND ANTI-COPYRIGHT WAIVER SHALL BE PRESERVED TO THE MAXIMUM EXTENT PERMITTED BY LAW TAKING INTO ACCOUNT THE ABOVE STATEMENT OF INTENT.

Thursday, April 16, 2020

Yet another BC1 encoder benchmark

stb_dxt v1.09, icbc, rgbcx v1.12, original crunch, and Unity's optimized variant of crunch. Both 4 and 3 color blocks can be used, but transparent texels are not utilized to get black/dark texels in this benchmark. The benchmark runs across a diverse assortment of 100 textures (not just images).



Same benchmark except this time with 3-color transparent texels used for black or dark texels in rgbcx (purple samples):


Here's an update, now with nvdxt.exe (black sample) and ispc_texcomp (brown sample). Note that the nvdxt.exe time is approximate because I had to spawn nvdxt.exe and it loads a .png and saves a .dds file. I did spawn it twice, once without timing it, then immediately again timing it.


nvdxt.exe command line:

nvdxt.exe -nomipmap -quality_highest -rms_threshold 50 -file image.png -output nvcompressed.dds -dxt1c -weight 1.0 1.0 1.0


Thursday, April 9, 2020

BC1 encoding initial endpoint determination benchmark

Benchmark of BC1 encoders using different methods to determine the initial endpoints: 

stb_dxt.h PCA: 35.754 dB, .551 us/block
rgbcx.h PCA: 35.794 dB, .651 us/block
rgbcx.h PCA+inset: 35.925 dB, .640 us/block
rgbcx.h 2D LS+inset+opt round: 35.920 dB, .541 us/block
rgbcx.h bounds+inset+XY covar: 35.836 dB, .472 us/block

This is across 100 textures, so even small avg. improvements are significant. Amazingly, the inset method (a few lines of code) buys rgbcx.h PCA .131 dB! All encoders should be doing this. You *must* pay attention to every little detail in these texture encoders.

Quality is performance in competitive texture block encoding, so even small boosts in quality allow us to dial down the # of total orders to check for the same average quality. This leads to a more competitive encoder.

Methods:

- bounds+inset+XY covar method is Castano's/van Waveren's.
All encoders should be applying the "inset" method described in this paper, because from a quantization perspective it makes perfect sense.

- 2D LS is Humus's method, ported to mostly integer math, with added inset+optimal rounding to 565: 

- stb_dxt.h and rgbcx.h PCA is 3D integer PCA (3x3 covar+4 power iters, pick 2 colors along principal axis).

- PCA+inset+optimal rounding does PCA, picks 2 colors, then lerps between the 2 colors at 1/16 and 15/16 to inset them, then optimal rounds to 565.
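
The inset step from the list above really is just a few lines. Here's a rough float sketch: both endpoints get pulled 1/16 of the way toward each other, then quantized to 565. The names are hypothetical and this is not the rgbcx.h code. Note the quantizer here is plain round-to-nearest, standing in for the optimal rounding step, which isn't shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float r, g, b; };   // color components in [0,255]

inline Vec3 lerp3(const Vec3& a, const Vec3& b, float t)
{
    return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t, a.b + (b.b - a.b) * t };
}

// Inset: replace (lo, hi) with the points 1/16 and 15/16 of the way from lo to hi.
inline void inset_endpoints(Vec3& lo, Vec3& hi)
{
    const Vec3 new_lo = lerp3(lo, hi, 1.0f / 16.0f);
    const Vec3 new_hi = lerp3(lo, hi, 15.0f / 16.0f);
    lo = new_lo;
    hi = new_hi;
}

// Plain round-to-nearest quantization to packed 5:6:5.
// (A stand-in for the optimal rounding step, which is NOT implemented here.)
inline uint16_t to_565_nearest(const Vec3& c)
{
    assert(c.r >= 0.0f && c.r <= 255.0f);
    assert(c.g >= 0.0f && c.g <= 255.0f);
    assert(c.b >= 0.0f && c.b <= 255.0f);

    const uint32_t r = (uint32_t)std::lround(c.r * 31.0f / 255.0f);
    const uint32_t g = (uint32_t)std::lround(c.g * 63.0f / 255.0f);
    const uint32_t b = (uint32_t)std::lround(c.b * 31.0f / 255.0f);
    return (uint16_t)((r << 11) | (g << 5) | b);
}
```

In use, the two colors picked along the principal axis would go through inset_endpoints() before being quantized.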

Wednesday, April 8, 2020

AMD GPU BC1 decoding lookup tables

Here are the lookup tables you can use to determine how AMD GPUs decode BC1 textures: https://pastebin.com/raw/LSgn0ent

These tables were gathered straight from a Radeon RX 580 by using a small D3D9 app that rendered a textured BC1 quad with point sampling and did a CPU readback. I used this same D3D9 app on an NVidia 1080 and the pixels I read back exactly matched what the NV BC1 formulas on the web predicted, so I'm confident in the approach.

For selectors 0 and 1, the 5->8 and 6->8 endpoint conversion just uses bitshifts/OR's (same as ideal BC1). For 4-color selector 2, use the tables. For selector 3, just invert the low/high endpoints. (I've verified you can do this.) For 3-color selector 2, use the tables.

To access the tables, use [color0_component*32+color1_component], or *64 for 6-bits:
Block Compression (Direct3D 10) - Win32 apps, docs.microsoft.com
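
A small sketch of the two pieces described above: the ideal shift/OR endpoint expansion used for selectors 0 and 1, and the table index computation. The function names are made up for illustration. The tables themselves aren't reproduced here, so this only shows how an index into them would be formed.

```cpp
#include <cassert>
#include <cstdint>

// Ideal BC1 5->8 and 6->8 bit endpoint expansion (bit replication via shift + OR).
inline uint8_t expand5(uint32_t c)
{
    assert(c < 32);
    return (uint8_t)((c << 3) | (c >> 2));
}

inline uint8_t expand6(uint32_t c)
{
    assert(c < 64);
    return (uint8_t)((c << 2) | (c >> 4));
}

// Index into a 5-bit (red/blue) table: 32 entries per color0 component value.
inline uint32_t table_index5(uint32_t color0_component, uint32_t color1_component)
{
    assert(color0_component < 32 && color1_component < 32);
    return color0_component * 32 + color1_component;
}

// Index into a 6-bit (green) table: 64 entries per color0 component value.
inline uint32_t table_index6(uint32_t color0_component, uint32_t color1_component)
{
    assert(color0_component < 64 && color1_component < 64);
    return color0_component * 64 + color1_component;
}
```

The 5-bit tables would therefore hold 1024 entries each and the 6-bit tables 4096.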

Converting the tables to formulas sounds like an interesting puzzle.

Example showing exactly how to use the tables to decode AMD BC1: