bc7enc_rdo is now a library that's utilized by the command line tool, which is far simpler now. This makes it trivial to call multiple times to generate large .CSV files.
If you can only choose one set of settings for bc7enc_rdo, choose "-zn -U -u6". (I've set the default BC7 encoding level to 6, not sure that's checked in yet.) I'll be making bc7e.ispc the new default on my next checkin - it's clearly better.
All other settings were the tool's defaults (linear metrics, window size=128 bytes).
The command line tool now detects extremely smooth blocks and encodes them with a significantly higher MSE scale factor. It computes a per-block mask image, filters it, then supplies an array of per-block MSE scale factors to the ERT. -zu disables this.
The end result is far less noticeable artifacts in regions containing very smooth blocks (think gradients), at some cost in rate-distortion performance.
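As a rough sketch of the idea (this is hypothetical code, not the actual tool's implementation, and all thresholds/scales here are made up): measure each 4x4 block's smoothness via per-channel variance, then map low-variance blocks to a much higher MSE scale so the ERT is far more conservative about perturbing them.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: max per-channel variance of a 4x4 RGBA block
// (64 bytes, RGB channels only).
static float block_max_variance(const uint8_t* pixels)
{
    float max_var = 0.0f;
    for (int c = 0; c < 3; c++)
    {
        float sum = 0.0f, sum2 = 0.0f;
        for (int i = 0; i < 16; i++)
        {
            float v = pixels[i * 4 + c];
            sum += v; sum2 += v * v;
        }
        float mean = sum / 16.0f;
        float var = sum2 / 16.0f - mean * mean;
        if (var > max_var) max_var = var;
    }
    return max_var;
}

// Map variance to an MSE scale: smooth blocks (variance below a threshold)
// ramp up towards max_smooth_scale; busy blocks stay at 1.0. The constants
// are illustrative guesses, not the tool's tuned values.
static float block_mse_scale(const uint8_t* pixels,
                             float smooth_var_thresh = 35.0f,
                             float max_smooth_scale = 20.0f)
{
    float var = block_max_variance(pixels);
    if (var >= smooth_var_thresh) return 1.0f;
    float t = 1.0f - var / smooth_var_thresh; // 0 = busy, 1 = perfectly flat
    return 1.0f + t * (max_smooth_scale - 1.0f);
}
```

The resulting per-block scale array is what would get handed to the ERT, which then multiplies each block's MSE by its scale when computing t.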
It now fully supports RDO BC1-7:
I've also been cleaning up the tool and tuning all of the defaults. Note that if you build with MSVC you get OpenMP, which results in significantly faster compression. Currently the Linux/OSX builds don't get OpenMP.
I decided to unify all of the RDO BC1-7 encoders so they use a single universal entropy reduction transform function in ert.cpp/.h. I previously checked in specialized RDO encoders for arrays of BC1 and BC4 blocks, which may perform better, but they were a lot more code to maintain, so I removed them.
Just got it working for BC1. Took about 15 minutes of copying & pasting the BC7 ERT, then modifying it to decode BC1 instead of BC7 blocks and have it ignore the decoded alpha. The ERT function is like 250 lines of code, and for BC1 it would be easily vectorizable (way easier than BC7 because decoding BC1 is easy).
This implementation differs from the BC7 ERT in one simple way: The bytes copied from previously encoded blocks are allowed to be moved around within the current block. This is slower to encode, but gives the encoder more freedom. I'm going to ship both options (move vs. nomove).
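A minimal sketch of the two substitution strategies (hypothetical code, not ert.cpp): in "nomove", bytes copied from the window of previously encoded blocks must land at the same offset within the current block; in "move", the copied run may be placed at any offset, which multiplies the number of trials but gives the encoder more freedom.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

const size_t BLOCK_SIZE = 8; // BC1 blocks are 8 bytes

// Enumerate candidate trial blocks by pasting a run of `run_len` bytes taken
// from the window of previously encoded blocks into a copy of the current
// block. With allow_move=false the destination offset is pinned to the
// source's offset within its block; with allow_move=true every legal
// destination offset is tried.
std::vector<std::vector<uint8_t>> enumerate_trials(
    const std::vector<uint8_t>& window, const uint8_t* cur_block,
    size_t run_len, bool allow_move)
{
    std::vector<std::vector<uint8_t>> trials;
    if ((window.size() < run_len) || (run_len > BLOCK_SIZE)) return trials;
    for (size_t src = 0; src + run_len <= window.size(); src++)
    {
        size_t dst_first = allow_move ? 0 : (src % BLOCK_SIZE);
        size_t dst_last = allow_move ? (BLOCK_SIZE - run_len) : dst_first;
        for (size_t dst = dst_first; dst <= dst_last; dst++)
        {
            if (dst + run_len > BLOCK_SIZE) continue;
            std::vector<uint8_t> t(cur_block, cur_block + BLOCK_SIZE);
            for (size_t i = 0; i < run_len; i++) t[dst + i] = window[src + i];
            trials.push_back(t);
        }
    }
    return trials;
}
```

Each trial would then be decoded and scored with t = mse * scale + bits * lambda; "move" simply enumerates more candidates per source position, which is why it's slower to encode.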
Here's a 2.02 bits/texel (Deflate) encode (lambda=1.0), 34.426 dB RGB PSNR. Normal BC1 (rgbcx.cpp level 18) is 3.00 bits/texel, 35.742 dB. Normal BC1 level 2 (should be very close to stb_dxt) gets 3.01 bits/texel and 35.086 dB, so if you're willing to lose a little quality you can get large savings.
I'll have this checked in tomorrow after more benchmarking and smooth block tuning.
I've been thinking about a simple/elegant universal rate distortion optimizing transform for GPU texture data for the past year, since working on UASTC and BC1 RDO. It's nice to see this working so well on two different GPU texture formats. ETC1-2, PVRTC1, LDR/HDR ASTC, and BC6H are coming.
Compressing kodim03.png to kodim03.dds (with no mips) using two BC7 modes (1+6):
So the output .DDS file compressed to 7.69 bits/texel using miniz (stock non-optimal parsing Deflate, so a few percent worse vs. zopfli or 7za's Deflate). The RGB PSNR was 41.8 dB and the RGBA PSNR was 43 dB. It used mode 1 around half as much as mode 6.
Notice the pre-RDO compressed size equals the output's compressed size (7.69 bits/texel). No RDO, or anything else in particular, was done to reduce the encoded output data's entropy. The output is mostly Huffman compressed because Deflate can't find many 3+ byte matches, so the output lands quite close to 8 bits/texel. It's basically noise to Deflate and most other LZs.
Using the lzham_codec_devel repo (which is now perfectly stable, I just haven't updated the readme kinda on purpose), this mode 1+6 RDO BC7 .DDS file compressed to 2.87 bits/texel. LZMA gets 2.74 bits/texel.
Around 10% of the blocks use mode 1, the rest mode 6. I need to add a LZMA/LZHAM model to bc7enc_rdo, which should be fairly easy (add len2 matches, add rep model, larger dictionary - and then let the optimal parsers in lzham/lzma figure it out).
bc7enc -zc32768 -u4 -o xmen_1024.png -z6.0
lzhamtest_x64.exe -x16 -h4 -e -o c xmen_1024.dds 1.lzham
There are some issues with this encoding, but it's great progress.
I've optimized the RDO BC7 encoder in bc7enc_rdo a bunch over the past few days. I've also added multithreading via an OpenMP parallel for, which really helps.
RDO BC7+Deflate (4KB replacement window size)
33.551 RGB dB PSNR, 3.75 bits/texel
I've been tuning the fixed Deflate model in bc7enc_rdo. In this test I varied the # of literal bits from 8 to 14. Higher values push the system to prefer matches vs. literals.
The orange line was yesterday's encoder, all other lines are for today's encoder. Today's encoder has several improvements, such as lazy parsing and mode 6 endpoint match trails.
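A hypothetical sketch of a fixed Deflate-like cost model in the spirit described above (none of this is the actual bc7enc_rdo model): literals cost a fixed number of bits (the knob varied from 8 to 14), while a match costs a base price plus roughly log2 of its distance. Raising the literal cost pushes the parser towards matches.

```cpp
#include <cmath>
#include <cstddef>

struct lz_cost_model
{
    float literal_bits;    // tuning knob: 8..14 in the experiment above
    float match_base_bits; // assumed fixed cost to code a match length

    float literal_cost(size_t num_literals) const
    {
        return num_literals * literal_bits;
    }
    // Farther matches cost more bits, roughly log2 of the distance.
    float match_cost(size_t match_len, size_t match_dist) const
    {
        (void)match_len;
        return match_base_bits + log2f((float)match_dist + 1.0f);
    }
    // Price of coding `len` bytes: literal run vs. match, whichever is cheaper.
    float cheapest(size_t len, size_t match_dist) const
    {
        return fminf(literal_cost(len), match_cost(len, match_dist));
    }
};
```

With 8-bit literals, a 2-byte run is cheaper to code as literals than as a far-away match; at 14-bit literals the same match wins, so the parse shifts towards matches exactly as described.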
Some minor observations about Lagrangian multiplier based RDO (with BC7 RDO+Deflate or LZ4):
We're optimizing to find lowest t (sometimes called j), given many hundreds/thousands of ways of encoding a BC7 block:
float t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;
For each trial block, we compute its MSE and estimate its LZ bits using a simple Deflate/LZ4-like model.
If we already have a potential solution for a block (the best found so far), then given the trial block's MSE and the current best_t we can compute the maximum number of bits a new trial encoding could take and still be an improvement. If that threshold is ridiculous (negative, or just impossible to achieve with Deflate on a 128-bit block input), we can immediately throw out the trial block:
threshold_trial_lz_bits = (best_t - trial_mse * smooth_block_error_scale ) / lambda
Same for MSE: if we already have a solution, we can compute the MSE threshold where it's impossible for a trial to be an improvement:
threshold_trial_mse = (best_t - (trial_lz_bits * lambda)) / smooth_block_error_scale
This seems less valuable because running the LZ simulator to compute trial_lz_bits is likely more expensive than computing a trial block's MSE. We could plug in a lowest possible estimate for trial_lz_bits, and use that as a threshold MSE. Another interesting property: a trial's MSE is very likely to be >= the MSE of the best encoding found so far for the block.
Using simple formulas like this results in large perf. improvements (~2x).
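The formulas above can be collected into a small early-out helper (a sketch under my own naming, not the actual bc7enc_rdo code): before paying for the LZ simulator, check whether even a hypothetical 0-bit encoding of the trial could beat best_t.

```cpp
struct rdo_params
{
    float lambda;                   // rate/distortion tradeoff
    float smooth_block_error_scale; // per-block MSE scale
};

// t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda
inline float compute_t(const rdo_params& p, float trial_mse, float trial_lz_bits)
{
    return trial_mse * p.smooth_block_error_scale + trial_lz_bits * p.lambda;
}

// Max LZ bits a trial with this MSE may cost and still beat best_t.
inline float threshold_trial_lz_bits(const rdo_params& p, float best_t,
                                     float trial_mse)
{
    return (best_t - trial_mse * p.smooth_block_error_scale) / p.lambda;
}

// Max MSE a trial with this bit cost may have and still beat best_t.
inline float threshold_trial_mse(const rdo_params& p, float best_t,
                                 float trial_lz_bits)
{
    return (best_t - trial_lz_bits * p.lambda) / p.smooth_block_error_scale;
}

// Early-out: if even a 0-bit encoding of this trial couldn't beat best_t,
// skip the (expensive) LZ simulator entirely.
inline bool trial_worth_simulating(const rdo_params& p, float best_t,
                                   float trial_mse)
{
    return threshold_trial_lz_bits(p, best_t, trial_mse) > 0.0f;
}
```

In practice you'd also reject trials whose bit threshold is below whatever Deflate could plausibly achieve on a 128-bit block, which tightens the cutoff further.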
"The output is fv.bmp with the given size in pixels, which visually displays where matching substrings of various lengths and offsets are found. A pixel at x, y is (black, red, green, blue) if the last matching substring of length (1, 2, 4, 8) at x occurred y bytes ago. x and y are scaled so that the image dimensions match the file length. The y axis is scaled log base 10."

Tool source:
It's relatively easy to reduce the output entropy of BC7 by around 5-10%, without slowing down encoding or even speeding it up. I'll be adding this stuff to the bc7e ispc encoder soon. I've been testing these tricks in bc7enc_rdo:
- Weight the mode errors: For example weight mode 1 and 6's errors way lower than the other modes. This shifts the encoder to use modes 1 and 6 more often, which reduces the output data's entropy. This requires the other modes to make a truly significant difference in reducing distortion before the encoder switches to using them.
- Biased p-bits: When deciding which p-bits to use (0 vs. 1), weight the error from using p-bit 1 slightly lower (or the opposite). This will cause the encoder to favor one of the p-bits more than the other, reducing the block output data's entropy.
- Partition pattern weighting: Weight the error from using the lower frequency partitions [0,15] or [0,33] slightly lower vs. the other patterns. This reduces the output entropy of the first or second byte of BC7 modes with partitions.
- Quantize mode 6's endpoints and force its p-bits to [0,1]: Mode 6 uses 7-bit endpoint components. Use 6-bits instead, with fixed [0,1] p-bits. You'll need to do this in combination with reducing mode 6's error weight, or a multi-mode encoder won't use mode 6 as much.
- Don't use mode 4/5 component rotations, or the index flag.
In practice these options aren't particularly useful, and just increase the output entropy. The component rotation feature can also cause odd looking color artifacts.
- Don't use mode 0,2,3, possibly 4: These modes are less useful, at least on albedo/specular/etc. maps, sRGB content, and photos/images. Almost all BC7 encoders, including ispc_texcomp's, can't even handle mode 0 correctly anyway.
Mode 4 is useful on decorrelated alpha. If your content doesn't have much of that, just always use mode 5.
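The first trick (mode error weighting) can be sketched like this (hypothetical code with made-up weights, not bc7e or bc7enc_rdo internals): scale each mode's trial error before comparing, so the encoder sticks with modes 1 and 6 unless another mode wins by a large margin.

```cpp
// Per-mode error weights (indices = BC7 modes 0..7). Modes 1 and 6 are
// heavily favored (low weight); other modes must cut the raw error a lot
// before the encoder will switch to them. These weights are illustrative.
static const float g_mode_err_weights[8] = {
    1.5f, 0.7f, 1.5f, 1.3f, 1.2f, 1.0f, 0.7f, 1.5f
};

// Pick the allowed mode with the lowest *weighted* error.
int pick_mode(const float trial_mode_err[8], const bool mode_allowed[8])
{
    int best_mode = -1;
    float best_weighted = 0.0f;
    for (int m = 0; m < 8; m++)
    {
        if (!mode_allowed[m]) continue;
        float w = trial_mode_err[m] * g_mode_err_weights[m];
        if ((best_mode < 0) || (w < best_weighted))
        {
            best_mode = m;
            best_weighted = w;
        }
    }
    return best_mode;
}
```

The same shape works for the p-bit and partition biases: multiply the error of the disfavored choice by a weight slightly above 1.0 before comparing, which nudges the output distribution towards a smaller set of symbols and lowers its entropy.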