Monday, February 8, 2021

bc7enc_rdo encoding examples

Compress kodim08.png to kodim08.dds (with no mips) using two BC7 modes (1+6):

Highest Quality Mode (uses Modes 1+6)

This mode is like ispc_texcomp or bc7e's BC7 compressor. bc7enc_rdo currently only uses modes 1/6 on opaque blocks, and modes 5/6/7 on alpha blocks. 

bc7enc.exe -o -u4 kodim08.png

...
BC7 mode histogram:
0: 0
1: 8703
2: 0
3: 0
4: 0
5: 0
6: 15873
7: 0

Pre-RDO output data size: 393216, LZ (Deflate) compressed file size: 378097, 7.69 bits/texel
Processing time: 0.113000 secs
...
Output data size: 393216, LZ (Deflate) compressed file size: 378097, 7.69 bits/texel
Wrote DDS file kodim08.dds
Luma Max error: 13 RMSE: 1.279412 PSNR 45.991 dB, PSNR per bits/texel: 59787.033452
RGB Max error: 37 RMSE: 2.065000 PSNR 41.832 dB, PSNR per bits/texel: 54381.448887
RGBA Max error: 37 RMSE: 1.805041 PSNR 43.001 dB, PSNR per bits/texel: 55900.685976

So the output .DDS file compressed to 7.69 bits/texel using miniz (stock non-optimal parsing Deflate, so a few percent worse vs. zopfli or 7za's Deflate). The RGB PSNR was 41.8 and the RGBA PSNR was 43 dB. It used mode 1 around half as much as mode 6.

Notice the pre-RDO compressed size is equal to the output's compressed size (7.69 bits/texel). There was no RDO, or anything in particular done to reduce the encoded output data's entropy. The output is mostly Huffman compressed because Deflate can't find many 3+ byte matches, so the output is quite close to 8 bits/texel. It's basically noise to Deflate or most other LZ's.
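To make the bits/texel figure concrete, here's a minimal Python sketch of how it can be derived from a Deflate-compressed size. It uses Python's zlib rather than miniz, so real sizes will differ a little from the tool's output. The function names are made up for this example, and it assumes only BC7's fixed 16 bytes per 4x4 block.

```python
import os
import zlib

BC7_BYTES_PER_BLOCK = 16   # BC7 always packs a 4x4 texel block into 16 bytes
TEXELS_PER_BLOCK = 16      # 4x4 texels

def bits_per_texel(compressed_bytes, raw_bc7_bytes):
    """Compressed bits spent per texel for a buffer of raw BC7 blocks."""
    texels = (raw_bc7_bytes // BC7_BYTES_PER_BLOCK) * TEXELS_PER_BLOCK
    return compressed_bytes * 8.0 / texels

def deflate_bits_per_texel(bc7_data, level=9):
    """Deflate the raw block data with zlib and report bits/texel."""
    return bits_per_texel(len(zlib.compress(bc7_data, level)), len(bc7_data))

# Sizes taken from the log above: 393216 raw bytes, 378097 after Deflate.
print(round(bits_per_texel(378097, 393216), 2))  # 7.69

# Noise-like data stays at roughly 8 bits/texel; redundant data collapses.
print(deflate_bits_per_texel(os.urandom(4096)) > 7.9)   # True
print(deflate_bits_per_texel(bytes(4096)) < 0.5)        # True
```

For real measurements, swap in whichever LZ compressor and settings you actually ship with.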

Reduced Entropy Mode (-e option)


This mode is as fast as before. It only causes the encoder to weight modes, p-bits, etc. differently so the output data is naturally more compressible by entropy/LZ coders:

bc7enc -o -u4 -zc2048 kodim08.png -e

BC7 mode histogram:
0: 0
1: 3385
2: 0
3: 0
4: 0
5: 0
6: 21191
7: 0
Pre-RDO output data size: 393216, LZ (Deflate) compressed file size: 352693, 7.18 bits/texel
Processing time: 0.116000 secs
Output data size: 393216, LZ (Deflate) compressed file size: 352693, 7.18 bits/texel
Wrote DDS file kodim08.dds
Luma  Max error:  18 RMSE: 1.368621 PSNR 45.405 dB, PSNR per bits/texel: 63277.507753
RGB   Max error:  48 RMSE: 2.456375 PSNR 40.325 dB, PSNR per bits/texel: 56197.596592
RGBA  Max error:  48 RMSE: 2.152539 PSNR 41.472 dB, PSNR per bits/texel: 57795.900335

The RGB PSNR dropped by 1.5 dB (from 41.8 dB to 40.3 dB - so less signal and more distortion), however the compressibility went up. The output is now 7.18 bits/texel instead of the previous 7.69! Notice also that the "PSNR per bits/texel" value (the compressibility index I use to monitor the encoder's effectiveness) for RGB is now 56197 vs. the previous 54381.
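As a sanity check on those numbers, here's a small Python sketch that recomputes PSNR from RMSE (assuming 8-bit samples, so a peak of 255) and the compressibility index as PSNR divided by bits/texel times 10,000. The function names are illustrative; the inputs are the RMSE and compressed sizes printed in the two logs above.

```python
import math

def psnr_from_rmse(rmse, peak=255.0):
    """PSNR in dB for 8-bit data, given the root mean squared error."""
    if rmse == 0:
        return float("inf")
    return 20.0 * math.log10(peak / rmse)

def compressibility_index(psnr_db, bits_per_texel, scale=10000.0):
    """PSNR per bits/texel; the scale is only there for readability."""
    return psnr_db / bits_per_texel * scale

# Bits/texel from the Deflate sizes in the logs (393216 texels each).
bpt_highest_quality = 378097 * 8 / 393216
bpt_reduced_entropy = 352693 * 8 / 393216

psnr_hq = psnr_from_rmse(2.065000)   # RGB RMSE, highest quality mode
psnr_re = psnr_from_rmse(2.456375)   # RGB RMSE, reduced entropy mode

print(round(psnr_hq, 2), round(psnr_re, 2))
print(round(compressibility_index(psnr_hq, bpt_highest_quality)))
print(round(compressibility_index(psnr_re, bpt_reduced_entropy)))
```

The results should land within rounding distance of the tool's 41.832 dB / 54381 and 40.325 dB / 56197.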


Rate Distortion Optimization with the Entropy Reduction Transform (-e -z#)


Now let's enable all the tools the encoder has to reduce the encoded output data's entropy. This mode is slower, but it's trivially threadable and you can scale down the amount of total compute by reducing the window size using "-zc#":

bc7enc -o -u4 -zc2048 kodim08.png -e -z.5

BC7 mode histogram:
0: 0
1: 4028
2: 0
3: 0
4: 0
5: 0
6: 20548
7: 0
Pre-RDO output data size: 393216, LZ (Deflate) compressed file size: 354192, 7.21 bits/texel
rdo_total_threads: 40
Using an automatically computed smooth block error scale of 19.375000
lambda: 0.500000
Lookback window size: 2048
Max allowed RMS increase ratio: 10.000000
Max smooth block std dev: 18.000000
Smooth block max MSE scale: 19.375000
Total modified blocks: 21589 87.85%
Total RDO postprocess time: 2.765000 secs
Processing time: 2.846000 secs
Output data size: 393216, LZ (Deflate) compressed file size: 316364, 6.44 bits/texel
Wrote DDS file kodim08.dds
Luma  Max error:  41 RMSE: 2.749131 PSNR 39.347 dB, PSNR per bits/texel: 61131.435585
RGB   Max error:  48 RMSE: 3.286210 PSNR 37.797 dB, PSNR per bits/texel: 58723.280910
RGBA  Max error:  48 RMSE: 2.861897 PSNR 38.998 dB, PSNR per bits/texel: 60588.948928

First, I set the window size the compressor uses to insert byte sequences from previously encoded blocks into each output block to 2KB to increase compression, using "-zc2048". The default is only 256 bytes, which is way faster (.42 seconds vs. 2.92 on my system).

Notice the RGB PSNR has dropped to 37.8 dB, however the compressed file is now only 6.44 bits/texel. The compressibility index (PSNR per bits/texel) is 58723. This is significantly higher than the previous two encodes, so the encoder has been able to squeeze more signal into the output bits (once they are LZ compressed).

The -z option directly sets lambda, which controls the rate distortion tradeoff. The higher this value, the more likely the encoder is to substitute a block with a previous block's bytes (either entirely or partially), which increases distortion but reduces entropy.

RDO compression using MSE as the internal error metric is difficult on smooth or flat regions. The RDO encoder tries to automatically scale up the computed MSE's of smooth blocks (using a simple linear function of each block's color channel maximum standard deviation), but the settings are conservative. You'll notice a message like this printed when you use -z:

Using an automatically computed smooth block error scale of 19.375000

By default the command line tool tries to compute a max smooth block factor based off the supplied lambda setting. There is no single calculation/set of settings that works perfectly on all input textures, but the formula in the code works OK for most textures at low-ish lambdas. (For an example of a difficult texture the current formulas/settings don't handle so well, try encoding kodim03 at lambdas 1-3.) I tried to tune smooth block handling so that at lambdas at or near 1 it looks OK on textures with smooth gradients, skies, etc.

You can use the -zb# option to manually set a max smooth block scale factor to a higher value. -zb30-100 works well. You'll need to experiment. -zb1.0 disables all smooth block handling, so only MSE is plugged into the lambda calculation.
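Here's a rough Python sketch of the smooth block idea: take the largest per-channel standard deviation in a block and map it linearly to an MSE scale factor. The exact curve bc7enc_rdo uses may differ; this version is an assumption that simply interpolates from the max scale (flat block) down to 1.0 (at or above the max std dev). The defaults are the values printed in the log above, and the function names are hypothetical.

```python
import statistics

def max_channel_std_dev(block_pixels):
    """block_pixels: list of 16 (r, g, b, a) tuples for one 4x4 block."""
    num_channels = len(block_pixels[0])
    return max(statistics.pstdev([px[c] for px in block_pixels])
               for c in range(num_channels))

def smooth_block_mse_scale(std_dev, max_std_dev=18.0, max_scale=19.375):
    """Assumed linear ramp: max_scale for flat blocks, 1.0 for busy blocks."""
    t = min(max(std_dev / max_std_dev, 0.0), 1.0)
    return max_scale + (1.0 - max_scale) * t

flat_block = [(10, 20, 30, 255)] * 16
print(smooth_block_mse_scale(max_channel_std_dev(flat_block)))  # 19.375
print(smooth_block_mse_scale(9.0))                              # 10.1875
print(smooth_block_mse_scale(40.0))                             # 1.0
print(smooth_block_mse_scale(0.0, max_scale=1.0))               # 1.0
```

With a max scale of 1.0 every block gets a factor of 1.0, which lines up with -zb1.0 turning smooth block handling off.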

Graphing length of introduced matches in the BC7 ERT

I'm starting to graph what's going on with this awesome little lossy BC7 block data transform (in bc7enc_rdo). Let's look at some match length histograms:


The window size was only 128 bytes (8 BC7 blocks). 3 bytes is the minimum Deflate match length. 16 byte matches replicate entire BC7 blocks. Not sure why there's a noticeable peak at 10 bytes.

Entire block replacements are super valuable at these lambdas. The ERT in bc7enc_rdo weights matches of any sort way more than literals. If some nearby previous block is good enough it makes perfect sense to use it.

One thing I think could easily be added to the transform: If there's a match at the end of the previous block, try to continue/extend it by weighting the bytes following the copied bytes in the window a little more cheaply (to coax the transform towards extending the match).
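For anyone wanting to produce a similar histogram, here's a minimal Python sketch. It is not the instrumentation used for the graphs above: it just scans already-encoded block data and, for each 16-byte block, greedily records byte matches of 3 or more bytes against the preceding window (128 bytes by default). The greedy parse and the function names are assumptions made for illustration.

```python
from collections import Counter

def longest_match(window_bytes, target):
    """Longest prefix of target that appears somewhere in window_bytes."""
    best = 0
    for start in range(len(window_bytes)):
        n = 0
        while (n < len(target) and start + n < len(window_bytes)
               and window_bytes[start + n] == target[n]):
            n += 1
        best = max(best, n)
    return best

def match_length_histogram(data, window=128, block_size=16, min_match=3):
    """Histogram of match lengths found per block vs. the previous window."""
    hist = Counter()
    for block_start in range(0, len(data) - block_size + 1, block_size):
        window_bytes = data[max(0, block_start - window):block_start]
        pos, end = block_start, block_start + block_size
        while pos < end:
            n = longest_match(window_bytes, data[pos:end])
            if n >= min_match:
                hist[n] += 1      # record a match, skip past it
                pos += n
            else:
                pos += 1          # treat this byte as a literal
    return hist

block_a = bytes(range(16))
block_b = bytes(range(100, 116))
# The third block repeats the first one exactly: one full 16-byte match.
print(dict(match_length_histogram(block_a + block_b + block_a)))  # {16: 1}
```

Matches are never allowed to cross a block boundary here, so 16 is the longest length that can show up.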

RDO texture encoding notes

A few things I've learned about RDO texture encoders:

- If you've spent a lot of time working on lowest distortion based texture encoders, your instincts will probably lead you astray once you start working on rate distortion encoders. Distortion can paradoxically increase on a single test even when the rate distortion behavior has improved overall.

- Always plot your results in 2D (rate vs. distortion) - don't focus so much on distortion. 

As a quick check of compressor efficiency, compute and display PSNR/bits_per_texel * scale, or SSIM/bits_per_texel * scale (where scale is like 10,000 or something - it's just for readability). 

Compute accurate bits_per_texel by actually compressing your output using a real LZ compressor with correct settings. The higher this value, the more efficient the compressor. Use the actual LZ compressor you're shipping the data with.

- Make sure your PSNR, RMSE, MSE, SSIM, etc. calculations are correct and accurate. ALWAYS compare against an independent 3rd party implementation that is known to be correct/trusted. Write your input and output to .PNG/.TGA/.BMP or whatever and use an external 3rd party image comparison tool as a sanity check.

Otherwise you've possibly messed it up and are in the weeds. 

One option is ImageMagick.
Here's how to calculate PSNR, and here's some sample code.

- RDO texture encoding+Deflate is basically all about increasing matches above all else. Even adding a single match to a block can be a huge win in a rate distortion sense.

- It's not necessary to worry about how blocks are packed, which modes are supported, or byte alignment. Just focus on byte matches and literals/match estimates. 

- Avoid copying around bits. That increases the overall block entropy. Always copy full bytes. 

- For more gains you can copy bytes from one offset in a block to another offset. This is way slower to encode but does compress better. I removed this option from bc7enc_rdo because it was so much slower.

- You don't need a huge window to get large gains. Even 64-512 byte windows are fine. 

- You don't need an accurate LZ simulator to make a workable high quality encoder. 

Although, I needed one to figure all this out. 

- Use an already working RDO encoder as a baseline (even a shitty one). Plot its average R-D curve across a range of settings/images. Go from there.

- By default, a high quality texture encoding will consist of mostly literals. Just focus on inserting a single match into each block from one of the previously encoded blocks. Use the Lagrangian multiplier method (j=MSE*smooth_block_scale+bits*lambda) to pick the best one.

- Use Matt Mahoney's "fv" tool to visualize the entropy of your encoded output data:
http://www.mattmahoney.net/dc/fv.cpp

- You can copy a full block (which is like VQ) or partial byte sequences from one block to another. It's possible that a match can partially cross endpoints and selectors. Just decode the block, calculate MSE, estimate bits, and then evaluate the Lagrangian formula.

- Plot rate distortion curves (PSNR or SSIM vs. bits/texel) for various lambdas and encoder settings. Focus on increasing the PSNR per bit (move the curve up and left).

- You must do something about smooth/flat blocks. Their MSE's are too low relative to the visual impact they have when they get distorted. One solution is to compute the max std dev. of any component and use a linear function of that to scale block/trial MSE.

- Before developing anything more complex than the technique used in bc7enc_rdo (the byte-wise ERT), get this technique working and tuned first. You'll be surprised how challenging it can be to actually improve it.

- Nobody will trust or listen to you when you claim your encoder is better in some way, even if you show them graphs. There are just too many ways to either mess up or bias a benchmark. You need a trusted 3rd party to independently benchmark and validate your encoder vs. other encoders.

The people at Unity have been filling this role recently. (Which makes sense because they integrate a lot of texture encoders into Unity.)
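To tie a few of the notes above together, here's a toy Python sketch of the "insert a single match per block" idea using the Lagrangian cost j = MSE*smooth_block_scale + bits*lambda. It is not the code in bc7enc_rdo. The bit costs (8 bits per literal byte, a flat 20 bits per match) are made-up estimates, bytes are only copied from the same offset in a previous block, and the mse_of callback stands in for "decode the trial block and compute its MSE against the source pixels".

```python
LITERAL_BITS = 8.0   # assumed cost of coding one byte as a literal
MATCH_BITS = 20.0    # assumed flat cost of one LZ match (length + distance)
MIN_MATCH = 3        # Deflate's minimum match length

def estimate_bits(num_literals, num_matches):
    return num_literals * LITERAL_BITS + num_matches * MATCH_BITS

def best_trial(block, prev_blocks, mse_of, lam, smooth_block_scale=1.0):
    """Return (block_bytes, j) minimizing j over single-match trial blocks."""
    n = len(block)
    best_block = bytes(block)
    best_j = mse_of(best_block) * smooth_block_scale + estimate_bits(n, 0) * lam
    for prev in prev_blocks:
        for start in range(n - MIN_MATCH + 1):
            for length in range(MIN_MATCH, n - start + 1):
                # Copy whole bytes (never bits) from the previous block.
                trial = (block[:start] + prev[start:start + length]
                         + block[start + length:])
                bits = estimate_bits(n - length, 1)
                j = mse_of(trial) * smooth_block_scale + bits * lam
                if j < best_j:
                    best_j, best_block = j, trial
    return best_block, best_j

# Toy "decoder": pretend each block byte is a pixel value.
reference = bytes(range(16))
def toy_mse(trial):
    return sum((a - b) ** 2 for a, b in zip(trial, reference)) / len(reference)

prev = bytes(range(15)) + bytes([16])   # differs from reference in one byte
chosen, j = best_trial(reference, [prev], toy_mse, lam=1.0)
print(chosen == prev, j)   # True 20.0625
```

In the toy run, replacing the entire block with the previous one wins: it costs a little MSE but drops the estimated rate from 128 bits to 20. At lambda 0 the original block is kept unchanged.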

bc7enc_rdo repo updated

I've updated it with my latest RDO BC1-7 and reduced entropy BC7 encoders:

 https://github.com/richgel999/bc7enc_rdo


Sunday, February 7, 2021

More RDO BC7 encoding - new algorithm

I sat down and implemented another RDO BC7 algorithm, using what I learned from the previous one. Amazingly it's beating the way more complex one, except perhaps at really high quality levels (really low lambdas). Very surprising! The source is here, and the post-processing function (the entropy reduction transform in function bc7enc_reduce_entropy()) is here.

The latest bc7enc_rdo repo is here.

I expected it to perform worse, yet it's blowing the more complex one away. The new algorithm is compatible with all the BC7 modes, too. The previous one was mostly hardwired for the main modes (mostly 1/6).


The new algorithm is much stronger:

RDO BC7 new algorithm - lambda 1.0, 4KB window size 
bc7enc -o -u4 -zc4096 J:\dev\test_images\xmen_1024.png -e -E -z1.0

37.15 dB, 3.97 bits/texel (Deflate)



RDO BC7 new algorithm - lambda 3.0, 4KB window size 
bc7enc -o -u4 -zc4096 J:\dev\test_images\xmen_1024.png -e -E -z3.0

32.071 dB, 3.12 bits/texel (Deflate)



The new algorithm degrades way more gracefully:

lambda=4.0
30.812 dB, 2.94 bits/texel


lambda=5.0
29.883 dB, 2.69 bits/texel (Deflate)


lambda=5.0, window size 8KB
29.826 dB, 2.59 bits/texel (Deflate)




Low bitrate RDO BC7 with lzham_devel

RDO BC7+Deflate could also be described as "BC7 encoding with Deflate in-loop".

Using the lzham_codec_devel repo (which is now perfectly stable, I just haven't updated the readme kinda on purpose), this mode 1+6 RDO BC7 .DDS file compressed to 2.87 bits/texel. LZMA gets 2.74 bits/texel. 

Around 10% of the blocks use mode 1, the rest mode 6. I need to add a LZMA/LZHAM model to bc7enc_rdo, which should be fairly easy (add len2 matches, add rep model, larger dictionary - and then let the optimal parsers in lzham/lzma figure it out).

Commands:

bc7enc -zc32768 -u4 -o xmen_1024.png -z6.0

lzhamtest_x64.exe -x16 -h4 -e -o c xmen_1024.dds 1.lzham

There are some issues with this encoding, but it's great progress.



More RDO BC7 progress

I've optimized bc7enc_rdo's RDO BC7 encoder a bunch over the past few days. I've also added multithreading via an OpenMP parallel for, which really helps.

RDO BC7+Deflate (4KB replacement window size)

33.551 dB RGB PSNR, 3.75 bits/texel


One could argue that at these low PSNR's you should just use BC1, but about 10% of the blocks in this RDO BC7 encoding use mode 1 (2 subsets). BC1 will be more blocky even at a similar PSNR.

31.319 dB, 3.25 bits/texel: