Sunday, February 7, 2021

Low bitrate RDO BC7 with lzham_devel

RDO BC7+Deflate could also be described as "BC7 encoding with Deflate in-loop".

Using the lzham_codec_devel repo (which is now perfectly stable, I just haven't updated the readme kinda on purpose), this mode 1+6 RDO BC7 .DDS file compressed to 2.87 bits/texel. LZMA gets 2.74 bits/texel. 
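For reference, the bits/texel figures quoted here are just the compressed file size spread over the texture's texels. A minimal sketch of that arithmetic, assuming a hypothetical helper named `bits_per_texel` (not part of bc7enc_rdo or lzham) and ignoring any distinction between header and payload bytes:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: compressed size in bits divided by the texel count.
// Uncompressed BC7 is 16 bytes per 4x4 block, i.e. 8 bits/texel.
inline double bits_per_texel(uint64_t compressed_bytes,
                             uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    const double total_texels = static_cast<double>(width) * height;
    return (static_cast<double>(compressed_bytes) * 8.0) / total_texels;
}
```

An uncompressed 1024x1024 BC7 texture is 1,048,576 bytes, which this helper reports as 8.0 bits/texel.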

Around 10% of the blocks use mode 1, the rest mode 6. I need to add an LZMA/LZHAM model to bc7enc_rdo, which should be fairly easy (add len2 matches, add rep model, larger dictionary - and then let the optimal parsers in lzham/lzma figure it out).

Commands:

bc7enc -zc32768 -u4 -o xmen_1024.png -z6.0

lzhamtest_x64.exe -x16 -h4 -e -o c xmen_1024.dds 1.lzham

There are some issues with this encoding, but it's great progress.



More RDO BC7 progress

I've optimized bc7enc_rdo's RDO BC7 encoder a bunch over the past few days. I've also added multithreading via an OpenMP parallel for, which really helps.

RDO BC7+Deflate (4KB replacement window size)

33.551 RGB dB PSNR, 3.75 bits/texel


One could argue that at these low PSNRs you should just use BC1, but about 10% of the blocks in this RDO BC7 encoding use mode 1 (2 subsets). BC1 will be more blocky even at a similar PSNR.

31.319 dB, 3.25 bits/texel:



BC7 RDO rate distortion curves

I've been tuning the fixed Deflate model in bc7enc_rdo. In this test I varied the # of literal bits from 8 to 14. Higher values push the system to prefer matches vs. literals.
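To see why a higher literal cost pushes the encoder toward matches, here is a minimal sketch of a fixed-cost LZ model. It is not bc7enc_rdo's actual model: the struct name, the function name, and the flat per-match cost are assumptions for illustration (real Deflate match costs vary with length and distance).

```cpp
#include <cassert>

// Hypothetical fixed-cost model. literal_bits is the knob being varied
// (8 to 14 in the test above); match_bits is an assumed flat cost per match.
struct FixedLzModel
{
    float literal_bits;
    float match_bits;
};

// Estimated bits for one 16-byte BC7 block that the parser coded as
// num_literals literal bytes plus num_matches matches.
inline float estimate_block_bits(const FixedLzModel& model,
                                 int num_literals, int num_matches)
{
    assert(num_literals >= 0 && num_matches >= 0);
    return num_literals * model.literal_bits + num_matches * model.match_bits;
}
```

Under this model, replacing 4 literals with one match saves `4 * literal_bits - match_bits` bits, so the estimated payoff of finding a match grows as `literal_bits` goes up.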

The orange line was yesterday's encoder, all other lines are for today's encoder. Today's encoder has several improvements, such as lazy parsing and mode 6 endpoint match trails. 


(I know this graph is going to be difficult to read on blogger - Google updated it and now images suck. You used to be able to click on images and get a full-res view.)

Lagrangian multiplier based RDO encoding early outs

Some minor observations about Lagrangian multiplier based RDO (with BC7 RDO+Deflate or LZ4):

We're optimizing to find the lowest t (sometimes called j), given many hundreds/thousands of ways of encoding a BC7 block:

float t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;

For each trial block, we compute its MSE and estimate its LZ bits using a simple Deflate/LZ4-like model.

If we already have a potential solution for a block (the best found so far), then given the trial block's MSE and the current best_t we can compute the maximum number of bits a new trial encoding could take and still be an improvement. If that threshold is ridiculous (like negative, or just impossible to achieve with Deflate on a 128-bit block input), we can immediately throw out that trial block:

float threshold_trial_lz_bits = (best_t - trial_mse * smooth_block_error_scale) / lambda;

Same for MSE: if we already have a solution, we can compute the MSE threshold where it's impossible for a trial to be an improvement:

float threshold_trial_mse = (best_t - trial_lz_bits * lambda) / smooth_block_error_scale;

This seems less valuable because running the LZ simulator to compute trial_lz_bits is likely more expensive than computing a trial block's MSE. We could plug in a lowest possible estimate for trial_lz_bits, and use that to get a threshold MSE. Another interesting thing about this: a trial's MSE is very likely to always be >= the MSE of the best encoding found so far for the block.

Using simple formulas like this results in large perf. improvements (~2x).
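The bit-budget early out can be sketched roughly like this. It's a minimal sketch, not bc7enc_rdo's actual code: the function names and the `min_possible_lz_bits` parameter (an assumed lower bound on what the LZ model can output for one block) are hypothetical.

```cpp
#include <cassert>

// The Lagrangian cost, same formula as above.
inline float trial_cost(float trial_mse, float smooth_block_error_scale,
                        float trial_lz_bits, float lambda)
{
    return trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;
}

// Hypothetical helper: returns true when a trial can't possibly beat best_t,
// so the (expensive) LZ bit estimate doesn't need to be run for it.
inline bool can_skip_trial(float best_t, float trial_mse,
                           float smooth_block_error_scale, float lambda,
                           float min_possible_lz_bits)
{
    assert(lambda > 0.0f);

    // Most bits the trial could spend and still improve on best_t.
    const float threshold_trial_lz_bits =
        (best_t - trial_mse * smooth_block_error_scale) / lambda;

    // A negative or unachievably small budget means the trial is hopeless.
    return threshold_trial_lz_bits < min_possible_lz_bits;
}
```

The check only needs the trial's MSE, so it can run before the LZ simulator is invoked for that trial.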

Saturday, February 6, 2021

BC7 DDS file entropy visualization

RDO GPU texture encoders increase the number/density of LZ matches in the encoded output texture. Here's a file entropy visualization of kodim18.dds. The left image was non-RDO encoded, the right image was encoded with lambda=4.0, max backwards scan=2048 bytes.

Non-RDO:



RDO:

Non-RDO, one byte matches removed:



RDO, one byte matches removed:



fv docs:
"The output is fv.bmp with the given size in pixels, which visually
displays where matching substrings of various lengths and offsets are
found.  A pixel at x, y is (black, red, green, blue) if the last matching
substring of length (1, 2, 4, 8) at x occurred y bytes ago.  x and y
are scaled so that the image dimensions match the file length.
The y axis is scaled log base 10."
Tool source:

The two types of RDO BC7 encoders

There are two main high-level categories of RDO BC7 encoders:
1. The first type is optimized for highest PSNR per LZ compressed bit, but is significantly slower than ispc_texcomp/bc7e.

2. The second type is optimized for highest PSNR per LZ compressed bit per encoding time. These encoders run at the same speed as ispc_texcomp/bc7e, or almost as fast. Some may even be faster than non-RDO encoders because they entirely ignore less useful modes (like mode 0).

To optimize for PSNR per LZ compressed bit, you can create the usual rate distortion graph (bitrate on X, quality on Y), then choose the encoder with the highest PSNR at specific bitrates (the highest/leftmost curve) that meets your encoder performance needs.

Other thoughts:
- When comparing category 2 encoders, encoding time is nearly everything.

- Category 2 encoders don't need to win against category 1 encoders. They compete against non-RDO encoders. Given two encoders, one category 2 RDO and the other non-RDO, if all other things are equal the RDO encoder will win.

- Modifying bc7e to place it into category #2 will be easy.

- Category 1 is PSNR/bitrate (where bitrate is in LZ bits/texel). Or SSIM/bitrate, but I've found SSIM to be nearly useless for texture encoding.

- Category 2 is (PSNR/bitrate)/encode_time (where encode_time is in seconds).
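The two figures of merit can be written down directly. A minimal sketch, where the struct and function names are hypothetical and the inputs are whatever you measured for an encoder on a given test image:

```cpp
#include <cassert>

// Hypothetical record of one encoder's measured result on one image.
struct EncoderResult
{
    float psnr_db;            // quality
    float lz_bits_per_texel;  // bitrate after LZ compression
    float encode_time_secs;   // wall clock encode time
};

// Category 1: PSNR per LZ compressed bit.
inline float category1_score(const EncoderResult& r)
{
    assert(r.lz_bits_per_texel > 0.0f);
    return r.psnr_db / r.lz_bits_per_texel;
}

// Category 2: PSNR per LZ compressed bit per second of encode time.
inline float category2_score(const EncoderResult& r)
{
    assert(r.encode_time_secs > 0.0f);
    return category1_score(r) / r.encode_time_secs;
}
```

A slow encoder can win on the category 1 score and still lose badly on the category 2 score, which is why encoding time dominates category 2 comparisons.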

Simple and fast ways to reduce BC7's output entropy (and increase LZ matches)

It's relatively easy to reduce the output entropy of BC7 by around 5-10% without slowing down encoding, and in some cases while speeding it up. I'll be adding this stuff to the bc7e ispc encoder soon. I've been testing these tricks in bc7enc_rdo:

- Weight the mode errors: For example weight mode 1 and 6's errors way lower than the other modes. This shifts the encoder to use modes 1 and 6 more often, which reduces the output data's entropy. This requires the other modes to make a truly significant difference in reducing distortion before the encoder switches to using them.

- Biased p-bits: When deciding which p-bits to use (0 vs. 1), weight the error from using p-bit 1 slightly lower (or the opposite). This will cause the encoder to favor one of the p-bits more than the other, reducing the block output data's entropy.

- Partition pattern weighting: Weight the error from using the lower frequency partitions [0,15] or [0,33] slightly lower vs. the other patterns. This reduces the output entropy of the first or second byte of BC7 modes with partitions.

- Quantize mode 6's endpoints and force its p-bits to [0,1]: Mode 6 uses 7-bit endpoint components. Use 6 bits instead, with fixed [0,1] p-bits. You'll need to do this in combination with reducing mode 6's error weight, or a multi-mode encoder won't use mode 6 as much.

- Don't use mode 4/5 component rotations, or the index flag. 

In practice these options aren't particularly useful, and just increase the output entropy. The component rotation feature can also cause odd looking color artifacts.

- Don't use modes 0, 2, 3, and possibly 4: These modes are less useful, at least on albedo/specular/etc. maps, sRGB content, and photos/images. Almost all BC7 encoders, including ispc_texcomp's, can't even handle mode 0 correctly anyway.

Mode 4 is useful on decorrelated alpha. If your content doesn't have much of that, just always use mode 5.
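The first two tricks (mode error weighting and biased p-bits) can be sketched like this. It's a minimal sketch under stated assumptions, not bc7enc_rdo's code: the weight values, the function names, and the convention of marking disabled modes with a negative error are all made up for illustration.

```cpp
#include <cassert>

// Hypothetical per-mode error weights. A lower weight favors the mode;
// modes 1 and 6 get the lowest weights here. Values are illustrative only.
static const float g_mode_weights[8] = {
    1.0f, 0.75f, 1.0f, 1.0f, 1.0f, 1.0f, 0.75f, 1.0f
};

// raw_err[m] is the error mode m achieved on the block, or a negative value
// if the mode was disabled/not tried (e.g. modes 0, 2, 3).
// Returns the mode with the lowest weighted error, or -1 if none was tried.
inline int pick_mode(const float raw_err[8])
{
    int best_mode = -1;
    float best_weighted_err = 0.0f;
    for (int m = 0; m < 8; ++m)
    {
        if (raw_err[m] < 0.0f)
            continue;
        const float weighted_err = raw_err[m] * g_mode_weights[m];
        if (best_mode < 0 || weighted_err < best_weighted_err)
        {
            best_mode = m;
            best_weighted_err = weighted_err;
        }
    }
    return best_mode;
}

// Biased p-bit choice: scaling p-bit 1's error by a weight slightly below 1
// makes the encoder pick p-bit 1 unless p-bit 0 is clearly better.
inline int pick_pbit(float err_pbit0, float err_pbit1, float pbit1_weight)
{
    assert(pbit1_weight > 0.0f);
    return (err_pbit1 * pbit1_weight < err_pbit0) ? 1 : 0;
}
```

With these weights, another mode has to beat mode 1 or 6 by more than 25% in raw error before it gets selected, which is the "truly significant difference" requirement described in the first bullet.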