RDO BC1-5 are done and already checked into the bc7enc_rdo repo. The test app in this repo currently only supports RDO BC1 and BC4, but I'll add BC3/5 very soon (they're just trivial variations of BC1/4). I'm hoping the KTX2 guys will add this encoder to their repo, so I don't have to create yet another command line tool that supports mipmaps, reading/writing DDS/KTX, etc. RDO BC1-5 are implemented as post-processors, so they are compatible with any other non-RDO BC1-5 encoder.
For my first RDO BC7 encoder, I've modified bc7enc's BC7 encoder (which purposely only supports 4 modes: 1/5/6/7) to support optional per-mode error weights, and 6-bit endpoint components with fixed 0/1 p-bits in mode 6. These two simple changes immediately reduce LZ compressed file sizes by around 5-10% with Deflate, with no perf. impact. I may support doing something like this for the other modes. I also implemented Castano's optimal endpoint rounding method, because why not.
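To illustrate the per-mode error weight idea, here's a minimal sketch, assuming made-up weights and a toy trial struct (not bc7enc's actual API): each candidate mode's trial MSE gets scaled by a per-mode weight before the winner is picked, which lets you penalize modes whose output tends to LZ-compress poorly.

```cpp
// Hypothetical sketch of per-mode error weights (not bc7enc's actual code).
// Each candidate mode's trial MSE is scaled by a per-mode weight before the
// winner is chosen; a weight > 1.0 penalizes that mode.
#include <cstdio>

struct mode_trial { int mode; double mse; };

static int pick_best_mode(const mode_trial* trials, int num_trials, const double mode_err_weights[8])
{
    double best_cost = 1e30;
    int best_mode = -1;
    for (int i = 0; i < num_trials; i++)
    {
        const double cost = trials[i].mse * mode_err_weights[trials[i].mode];
        if (cost < best_cost) { best_cost = cost; best_mode = trials[i].mode; }
    }
    return best_mode;
}

int main()
{
    // Pretend one block was already trial-encoded with bc7enc's four modes (1/5/6/7).
    const mode_trial trials[4] = { {1, 130.0}, {5, 150.0}, {6, 100.0}, {7, 90.0} };
    const double weights[8] = { 1, 1, 1, 1, 1, 1, 1, 1.5 }; // example: discourage mode 7
    printf("best mode: %d\n", pick_best_mode(trials, 4, weights));
    return 0;
}
```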
The next step is creating a post-processor that accepts an array of encoded BC7 blocks and modifies them for higher quality per compressed bit by increasing LZ matches. The post-processor function will support all the modes, although I'm testing primarily with bc7enc at the moment. Merging selector bits with previously encoded blocks is the simplest approach, and I just got it working for any mode (sketched below).
I'm using the usual Lagrangian multiplier method (minimizing J = D + lambda*R, where D = MSE and R = predicted bits). Here's a good but dense article on rate distortion methods and theory: Rate-distortion methods for image and video compression, by Ortega and Ramchandran (1998). I first read this years ago while working on Halo Wars 1's texture compression system, which was like crunch's .CRN mode but simpler. None of this stuff is new, and the image and video folks have been doing it for decades.
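Here's a minimal sketch of how the selector-merging pass could plug into J = D + lambda*R, assuming caller-supplied stand-ins for block decoding, selector copying, and the rate estimate (the real versions depend on the BC7 mode layouts and on how the LZ bitrate is modeled, so this isn't the actual bc7enc_rdo code):

```cpp
// Hypothetical sketch of a selector-merge RDO pass (not the bc7enc_rdo code).
// For each block, trial-copy the selector bits from earlier blocks inside the
// search window and keep whichever candidate minimizes J = D + lambda * R.
#include <cstdint>
#include <functional>
#include <vector>

struct bc7_block { uint8_t bytes[16]; };

// Stand-ins the caller must supply: compute a block's MSE vs. the source pixels,
// copy the selector (index) bit field between two compatible blocks, and estimate
// the block's coded size in bits given the preceding window of blocks.
using compute_mse_fn   = std::function<double(const bc7_block&, uint32_t block_index)>;
using merge_sel_fn     = std::function<bool(bc7_block& dst, const bc7_block& src)>;
using estimate_bits_fn = std::function<double(const bc7_block&, uint32_t block_index)>;

void selector_merge_rdo(std::vector<bc7_block>& blocks, double lambda, uint32_t window_size,
                        const compute_mse_fn& mse, const merge_sel_fn& merge_selectors,
                        const estimate_bits_fn& est_bits)
{
    for (uint32_t i = 0; i < blocks.size(); i++)
    {
        double best_j = mse(blocks[i], i) + lambda * est_bits(blocks[i], i);
        bc7_block best = blocks[i];

        const uint32_t first = (i > window_size) ? (i - window_size) : 0;
        for (uint32_t j = first; j < i; j++)
        {
            bc7_block trial = blocks[i];
            if (!merge_selectors(trial, blocks[j])) // modes may be incompatible
                continue;

            // Reusing selector bits that already appear in the window should lower
            // the estimated rate; the distortion D goes up, and lambda trades them off.
            const double jcost = mse(trial, i) + lambda * est_bits(trial, i);
            if (jcost < best_j) { best_j = jcost; best = trial; }
        }
        blocks[i] = best;
    }
}
```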
I first implemented the Lagrangian multiplier method in 2017, as a post-process on top of crunch's BC1 RDO mode, which we sent to a few companies. The Lagrangian multiplier method itself is easy, but estimating the LZ bitrate and especially handling smooth blocks is tricky. The current smooth block method I'm using computes the maximum standard deviation of any component in each block, and from that scalar it computes a per-block MSE error scale. This artificially amplifies the computed errors on smooth blocks, which is a hack, but it does seem to work. It hurts R-D performance, but something must be done or smooth blocks turn to shit.
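Here's a minimal sketch of that smooth block heuristic, assuming a simple linear ramp from the max MSE scale down to 1.0 as the block's max per-component std dev approaches the threshold (the actual curve in the encoder may differ; the default constants just mirror the BC1 settings quoted below):

```cpp
// Hypothetical sketch of smooth block MSE scaling. The linear ramp is an assumption;
// only the overall idea (amplify the error of low-variance blocks before the
// Lagrangian calculation) comes from the description above.
#include <algorithm>
#include <cmath>
#include <cstdint>

// Max per-component standard deviation over a 4x4 RGBA block.
static double max_component_std_dev(const uint8_t pixels[16 * 4], uint32_t num_comps = 4)
{
    double max_sd = 0.0;
    for (uint32_t c = 0; c < num_comps; c++)
    {
        double sum = 0.0, sum_sq = 0.0;
        for (uint32_t i = 0; i < 16; i++)
        {
            const double v = pixels[i * 4 + c];
            sum += v; sum_sq += v * v;
        }
        const double mean = sum / 16.0;
        const double var = std::max(0.0, sum_sq / 16.0 - mean * mean);
        max_sd = std::max(max_sd, std::sqrt(var));
    }
    return max_sd;
}

// Blocks whose max std dev is below max_std_dev count as "smooth" and get their MSE
// multiplied by a scale that ramps up to max_mse_scale as the block gets smoother.
static double smooth_block_mse_scale(const uint8_t pixels[16 * 4],
                                     double max_std_dev = 18.0, double max_mse_scale = 10.2)
{
    const double sd = max_component_std_dev(pixels);
    if (sd >= max_std_dev)
        return 1.0;
    const double t = sd / max_std_dev;                // 0 = totally flat block
    return max_mse_scale + t * (1.0 - max_mse_scale); // lerp(max_mse_scale, 1.0, t)
}
```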
Some RDO BC1 and BC7 examples on kodim23, which has lots of smooth blocks:
RDO BC1: 8KB dictionary, lambda=1.0, max smooth block MSE scale=10.2, max std dev=18.0, linear metrics
38.763 RGB dB, 2.72 bits/texel (Deflate, miniz max compression)
RDO BC7 modes 1+6: 8KB dictionary, lambda=1.0, max smooth block MSE scale=19.2, max std dev=18.0, -u4, linear metrics
42.659 RGB dB, 5.05 bits/texel (Deflate, miniz max compression)
Mode 1: 1920 blocks
Mode 6: 22656 blocks
RDO BC7 modes 1+6: 8KB dictionary, lambda=2.0, max smooth block MSE scale=20.4, max std dev=18.0, -u4, linear metrics
40.876 RGB dB, 4.41 bits/texel (Deflate, miniz max compression)
Mode 1: 1920 blocks
Mode 6: 22656 blocks
To get an idea of how bad things can get if you don't do anything to handle smooth blocks, here's BC7 modes 1+6 with lambda=1.0 and no smooth block error scaling (max MSE scale=1.0):
38.469 RGB dB, 3.49 bits/texel
I'm showing kodim23 because how you handle smooth blocks in this method is paramount. 92% of kodim23's blocks are treated as smooth blocks (because their max component standard deviation is <= 18). This means that most of the MSE errors being computed and plugged into the Lagrangian calculation are being artificially scaled up. There must be a better way, but at least it's simple. (By comparison, crunch's clusterization-based method didn't do anything special for smooth blocks - it just worked.)
I'm still tuning how smooth blocks are handled. Being too conservative with smooth blocks can cause very noticeable block artifacts at higher lambdas. Being too liberal with smooth blocks causes the R-D efficiency (quality per LZ bit) to go down.
Here are some R-D curves from my early BC1 RDO results. rgbcx.h's RDO BC1 is clearly beating crunch's 13-year-old RDO BC1 implementation, achieving higher quality per LZ compressed bit. crunch is based on endpoint/selector clusterization + refinement with no direct awareness of what LZ is going to do with the output data, so this isn't surprising.
(These images are way too big, but I'm too tired to resize them.)
For some historical background, the crunch library (which also supports RDO BC1-5) has always computed and displayed LZMA statistics on the compressed texture's output data. (As a side note, there's no point using Unity's crunch repo for RDO BC1-5 - they didn't optimize RDO, just .CRN.) The entire goal of crunch, from the beginning, was RDO BC1-5; it achieved this indirectly by causing many blocks, especially nearby ones, to use the same endpoint/selector bits, increasing the ratio of LZ matches vs. literals. I remember being very excited by RDO texture encoders in 2009, because I realized how useful they would be to video game developers. For a fun but practical Windows demo of crunch's RDO encoder I wrote years ago (written using managed C++ of all things), check out ddsexport.
I suspect everybody will switch to RDO texture encoders at some point. Selector RDO with optional endpoint refinement is very easy to do on all the GPU texture formats, even PVRTC1.
I recently went back and updated the UASTC RDO encoder to use the same basic options (lambda and smooth block settings) as my RDO BC1-7 encoders. The original UASTC RDO encoder controlled quality vs. bitrate in a different way. These changes will be in basisu v1.13, which should be released on github hopefully by next week (once we get approval from the company we're working with).