Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Sunday, November 28, 2021
Lena is Retired
Wednesday, February 17, 2021
Average rate-distortion curves for bc7enc_rdo
bc7enc_rdo is now a library used by the command line tool, which is far simpler as a result. This makes it trivial to call the encoder multiple times to generate large .CSV files.
If you can only choose one set of settings for bc7enc_rdo, choose "-zn -U -u6". (I've set the default BC7 encoding level to 6, not sure that's checked in yet.) I'll be making bc7e.ispc the new default on my next checkin - it's clearly better.
All other settings were the tool's defaults (linear metrics, window size=128 bytes).
Saturday, February 13, 2021
First RDO LDR ASTC 6x6 encodings
Left=Non-RDO, 37.3 dB, 2.933 bits/texel (Deflate)
First-ever RDO ASTC encodings
I used astcenc to generate a .astc file, loaded it into memory, then used the code in ert.cpp/.h with a custom callback that decodes ASTC blocks. All the magic is in the ERT. Here's a match injection histogram - this works: 1477,466,284,382,265,398,199,109,110,87,82,105,193,3843
Friday, February 12, 2021
bc7e.ispc integrated into bc7enc_rdo
4.53 bits/texel (Deflate), 37.696 dB PSNR
BC7 mode histogram:
0: 3753
1: 15475
2: 1029
3: 6803
4: 985
5: 2173
6: 35318
7: 0
Updated bc7enc_rdo with improved smooth block handling
The command line tool now detects extremely smooth blocks and encodes them with a significantly higher MSE scale factor. It computes a per-block mask image, filters it, then supplies an array of per-block MSE scale factors to the ERT. -zu disables this.
The end result is far less noticeable artifacts in regions containing very smooth blocks (think gradients). This does hurt rate-distortion performance.
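As a sketch, the smooth-block handling above amounts to measuring each block's smoothness (e.g. the max per-component standard deviation) and mapping that to an MSE scale factor fed to the ERT. The thresholds and the linear mapping below are illustrative guesses, not the tool's actual constants:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Max per-component standard deviation over a 4x4 RGBA block (16 texels).
static float max_component_std_dev(const uint8_t pixels[16][4])
{
    float max_sd = 0.0f;
    for (int c = 0; c < 4; c++)
    {
        float sum = 0.0f, sum2 = 0.0f;
        for (int i = 0; i < 16; i++)
        {
            float v = pixels[i][c];
            sum += v; sum2 += v * v;
        }
        float mean = sum / 16.0f;
        max_sd = std::max(max_sd, std::sqrt(std::max(0.0f, sum2 / 16.0f - mean * mean)));
    }
    return max_sd;
}

// Map smoothness to an MSE scale: very smooth blocks get a large scale, so
// the ERT is far less willing to distort them. All constants are made up.
static float smooth_block_mse_scale(float max_sd)
{
    const float kLowSd = 3.5f, kHighSd = 40.0f;      // hypothetical thresholds
    const float kMaxScale = 30.0f, kMinScale = 1.0f; // hypothetical scales
    float t = (max_sd - kLowSd) / (kHighSd - kLowSd);
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
    return kMaxScale + t * (kMinScale - kMaxScale);
}
```

The resulting per-block scales can be filtered (as the tool does with its mask image) before being handed to the ERT.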
Thursday, February 11, 2021
Dirac video codec authors on Rate-Distortion Optimization
http://dirac.sourceforge.net/documentation/algorithm/algorithm/rdo.htm
"Perceptual fudge factors are therefore necessary in RDO in all types of coders." "There may be no common measure of distortion. For example: quantising a high-frequency subband is less visually objectionable than quantising a low-frequency subband, in general. So there is no direct comparison with the significance of the distortion produced in one subband with that produced in another. This can be overcome by perceptual weighting."
So if you take two RDO texture encoders, benchmark them, and look at just their PSNRs, you are possibly fooling yourself and others. One encoder with higher PSNR (and better R-D performance) may visually look worse than the other. It's part art, not all science.
With bc7enc_rdo, I wanted to open source *something* usable for most textures with out of the box settings, even though I knew that its smooth block handling needed work. Textures with skies like kodim03 are challenging to compress without manually increasing the smooth block factor. kodim23 is less challenging because its background has some noise.
Releasing something open source with decent performance that works OK on most textures is more important than perfection.
Wednesday, February 10, 2021
Weighted/biased BC7 encoding for reduced output data entropy (with no slowdowns)
Mode 1+6: 45.295 dB, 7.41 bits/texel (Deflate), .109 secs
BC7 mode histogram:
6: 15840
Mode 1+6 reduced entropy mode: 43.479 RGB PSNR, 6.77 bits/texel (Deflate), .107 secs
Command: "bc7enc kodim23.png -e"
Tuesday, February 9, 2021
bc7enc_rdo now supports RDO for all BC1-7 texture formats
It now fully supports RDO BC1-7:
https://github.com/richgel999/bc7enc_rdo
I've also been cleaning up the tool and tuning all of the defaults. Note that if you build with MSVC you get OpenMP, which results in significantly faster compression. Currently the Linux/OSX builds don't get OpenMP.
I decided to unify all RDO BC1-7 encoders so they use a single universal entropy reduction transform function in ert.cpp/.h. I have specialized RDO encoders for arrays of BC1 and BC4 blocks (which I checked into the repo previously), which may perform better, but they were a lot more code to maintain. I removed them.
Monday, February 8, 2021
Entropy Reduction Transform on BC1 texture data
Just got it working for BC1. Took about 15 minutes of copying & pasting the BC7 ERT, then modifying it to decode BC1 instead of BC7 blocks and have it ignore the decoded alpha. The ERT function is like 250 lines of code, and for BC1 it would be easily vectorizable (way easier than BC7 because decoding BC1 is easy).
This implementation differs from the BC7 ERT in one simple way: The bytes copied from previously encoded blocks are allowed to be moved around within the current block. This is slower to encode, but gives the encoder more freedom. I'm going to ship both options (move vs. nomove).
Here's a 2.02 bits/texel (Deflate) encode (lambda=1.0), 34.426 RGB dB. Normal BC1 (rgbcx.cpp level 18) is 3.00 bits/texel 35.742 dB. Normal BC1 level 2 (should be very close to stb_dxt) gets 3.01 bits/texel and 35.086 dB, so if you're willing to lose a little quality you can get large savings.
I'll have this checked in tomorrow after more benchmarking and smooth block tuning.
I've been thinking about a simple/elegant universal rate distortion optimizing transform for GPU texture data for the past year, since working on UASTC and BC1 RDO. It's nice to see this working so well on two different GPU texture formats. ETC1-2, PVRTC1, LDR/HDR ASTC, and BC6H are coming.
1.77 bits/texel, 32.891 dB (-L18 -b -z4.0 -zb17.0 -zc2048):
bc7enc_rdo encoding examples
Compress kodim03.png to kodim03.dds (with no mips) using two BC7 modes (1+6):
Highest Quality Mode (uses Modes 1+6)
So the output .DDS file compressed to 7.69 bits/texel using miniz (stock non-optimal parsing Deflate, so a few percent worse vs. zopfli or 7za's Deflate). The RGB PSNR was 41.8 dB and the RGBA PSNR was 43 dB. It used mode 1 around half as much as mode 6.
Notice the pre-RDO compressed size is equal to the output's compressed size (7.69 bits/texel). There was no RDO, or anything in particular done to reduce the encoded output data's entropy. The output is mostly Huffman compressed because Deflate can't find many 3+ byte matches, so the output is quite close to 8 bits/texel. It's basically noise to Deflate or most other LZs.
Reduced Entropy Mode (-e option)
Rate Distortion Optimization with the Entropy Reduction Transform (-e -z#)
Graphing length of introduced matches in the BC7 ERT
The window size was only 128 bytes (8 BC7 blocks). 3 bytes is the minimum Deflate match length. 16 byte matches replicate entire BC7 blocks. Not sure why there's a noticeable peak at 10 bytes.
Entire block replacements are super valuable at these lambdas. The ERT in bc7enc_rdo weights matches of any sort way more than literals. If some nearby previous block is good enough it makes perfect sense to use it.
RDO texture encoding notes
- You don't need a huge window to get large gains. Even 64-512 byte windows are fine.
- By default, a high quality texture encoding will consist of mostly literals. Just focus on inserting a single match into each block from one of the previously encoded blocks. Use the Lagrangian multiplier method (j = MSE * smooth_block_scale + bits * lambda) to pick the best one.
- You can copy a full block (which is like VQ) or partial byte sequences from one block to another. It's possible for a match to partially cross endpoints and selectors. Just decode the block, calculate the MSE, estimate the bits, and then evaluate the Lagrangian formula.
- Plot rate distortion curves (PSNR or SSIM vs. bits/texel) for various lambdas and encoder settings. Focus on increasing the PSNR per bit (move the curve up and left).
- You must do something about smooth/flat blocks. Their MSE's are too low relative to the visual impact they have when they get distorted. One solution is to compute the max std dev. of any component and use a linear function of that to scale block/trial MSE.
- Before developing anything more complex than the technique used in bc7enc_rdo (the byte-wise ERT), get this technique working and tuned first. You'll be surprised how challenging it can be to actually improve it.
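The Lagrangian selection step from the notes above boils down to a few lines. This is a schematic sketch; the Trial struct and the bit estimates are placeholders for whatever the encoder actually produces:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// One candidate encoding of a block: its reconstruction error and its
// estimated LZ-compressed size (from a Deflate/LZ4-like bit model).
struct Trial { float mse; float lz_bits; };

// Pick the trial minimizing j = MSE * smooth_scale + bits * lambda.
// Returns the index of the winning trial.
static size_t pick_best_trial(const std::vector<Trial> &trials,
                              float smooth_block_scale, float lambda)
{
    size_t best = 0;
    float best_j = std::numeric_limits<float>::max();
    for (size_t i = 0; i < trials.size(); i++)
    {
        float j = trials[i].mse * smooth_block_scale + trials[i].lz_bits * lambda;
        if (j < best_j) { best_j = j; best = i; }
    }
    return best;
}
```

Raising lambda shifts the winner toward cheaper (more compressible) trials; raising the smooth block scale shifts it back toward lower distortion.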
bc7enc_rdo repo updated
I've updated it with my latest RDO BC1-7 and reduced entropy BC7 encoders:
https://github.com/richgel999/bc7enc_rdo
Sunday, February 7, 2021
More RDO BC7 encoding - new algorithm
The new algorithm is much stronger:
Low bitrate RDO BC7 with lzham_devel
Using the lzham_codec_devel repo (which is now perfectly stable; I just haven't updated the readme, kind of on purpose), this mode 1+6 RDO BC7 .DDS file compressed to 2.87 bits/texel. LZMA gets 2.74 bits/texel.
Around 10% of the blocks use mode 1, the rest mode 6. I need to add a LZMA/LZHAM model to bc7enc_rdo, which should be fairly easy (add len2 matches, add rep model, larger dictionary - and then let the optimal parsers in lzham/lzma figure it out).
Commands:
bc7enc -zc32768 -u4 -o xmen_1024.png -z6.0
lzhamtest_x64.exe -x16 -h4 -e -o c xmen_1024.dds 1.lzham
There are some issues with this encoding, but it's great progress.
More RDO BC7 progress
I've optimized bc7enc_rdo's RDO BC7 encoder a bunch over the past few days. I've also added multithreading via an OpenMP parallel for, which really helps.
RDO BC7+Deflate (4KB replacement window size)
33.551 RGB dB PSNR, 3.75 bits/texel
One could argue that at these low PSNRs you should just use BC1, but about 10% of the blocks in this RDO BC7 encoding use mode 1 (2 subsets). BC1 will be more blocky even at a similar PSNR.
BC7 RDO rate distortion curves
I've been tuning the fixed Deflate model in bc7enc_rdo. In this test I varied the # of literal bits from 8 to 14. Higher values push the system to prefer matches vs. literals.
The orange line was yesterday's encoder; all other lines are today's encoder, which has several improvements, such as lazy parsing and mode 6 endpoint match trials.
Lagrangian multiplier based RDO encoding early outs
Some minor observations about Lagrangian multiplier based RDO (with BC7 RDO+Deflate or LZ4):
We're optimizing to find lowest t (sometimes called j), given many hundreds/thousands of ways of encoding a BC7 block:
float t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;
For each trial block, we compute its MSE and estimate its LZ bits using a simple Deflate/LZ4-like model.
If we already have a potential solution for a block (the best found so far), given the trial block's MSE and the current best_t we can compute how many bits (maximum) a new trial encoding would take to be an improvement. If the number of computed threshold bits is ridiculous (like negative, or just impossible to achieve with Deflate on a 128-bit block input), we can immediately throw out that trial block:
threshold_trial_lz_bits = (best_t - trial_mse * smooth_block_error_scale) / lambda
Same for MSE: if we already have a solution, we can compute the MSE threshold where it's impossible for a trial to be an improvement:
threshold_trial_mse = (best_t - (trial_lz_bits * lambda)) / smooth_block_error_scale
This seems less valuable because running the LZ simulator to compute trial_lz_bits is likely more expensive than computing a trial block's MSE. We could plug in a lowest possible estimate for trial_lz_bits and use that as a threshold MSE. Another interesting thing: trials are very likely to always have an MSE >= that of the best encoding found so far for a block.
Using simple formulas like this results in large perf. improvements (~2x).
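The two early-outs transcribe directly from the formulas above. A minimal sketch, with the post's variable names (the example numbers used to exercise it are arbitrary):

```cpp
#include <cmath>

// Given the best j found so far, the maximum LZ bit count a trial with this
// MSE could have and still win. If the result is negative (or below the
// floor achievable by Deflate on a 128-bit block), skip the trial without
// ever running the LZ simulator.
static float threshold_trial_lz_bits(float best_t, float trial_mse,
                                     float smooth_block_error_scale, float lambda)
{
    return (best_t - trial_mse * smooth_block_error_scale) / lambda;
}

// The symmetric early-out on MSE, given a (lower-bound) bit estimate.
static float threshold_trial_mse(float best_t, float trial_lz_bits,
                                 float smooth_block_error_scale, float lambda)
{
    return (best_t - trial_lz_bits * lambda) / smooth_block_error_scale;
}
```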
Saturday, February 6, 2021
BC7 DDS file entropy visualization
Non-RDO:
"The output is fv.bmp with the given size in pixels, which visually displays where matching substrings of various lengths and offsets are found. A pixel at x, y is (black, red, green, blue) if the last matching substring of length (1, 2, 4, 8) at x occurred y bytes ago. x and y are scaled so that the image dimensions match the file length. The y axis is scaled log base 10."
Tool source:
The two types of RDO BC7 encoders
1. The first type is optimized for highest PSNR per LZ compressed bit, but they are significantly slower vs. ispc_texcomp/bc7e.
2. The second type is optimized for highest PSNR per LZ compressed bit per unit of encoding time. They are the same speed as ispc_texcomp/bc7e, or almost as fast. Some may even be faster than non-RDO encoders because they entirely ignore less useful modes (like mode 0).
To optimize for PSNR per LZ compressed bit, you can create the usual rate distortion graph (bitrate on X, quality on Y), then choose the encoder with the highest PSNR at specific bitrates (the highest/leftmost curve) that meets your encoder performance needs.
- Modifying bc7e to place it into category #2 will be easy.
Simple and fast ways to reduce BC7's output entropy (and increase LZ matches)
It's relatively easy to reduce the output entropy of BC7 by around 5-10%, without slowing down encoding or even speeding it up. I'll be adding this stuff to the bc7e ispc encoder soon. I've been testing these tricks in bc7enc_rdo:
- Weight the mode errors: For example weight mode 1 and 6's errors way lower than the other modes. This shifts the encoder to use modes 1 and 6 more often, which reduces the output data's entropy. This requires the other modes to make a truly significant difference in reducing distortion before the encoder switches to using them.
- Biased p-bits: When deciding which p-bits to use (0 vs. 1), weight the error from using p-bit 1 slightly lower (or the opposite). This will cause the encoder to favor one of the p-bits more than the other, reducing the block output data's entropy.
- Partition pattern weighting: Weight the error from using the lower frequency partitions [0,15] or [0,33] slightly lower vs. the other patterns. This reduces the output entropy of the first or second byte of BC7 modes with partitions.
- Quantize mode 6's endpoints and force its p-bits to [0,1]: Mode 6 uses 7-bit endpoint components. Use 6-bits instead, with fixed [0,1] p-bits. You'll need to do this in combination with reducing mode 6's error weight, or a multi-mode encoder won't use mode 6 as much.
- Don't use mode 4/5 component rotations, or the index flag: In practice these options aren't particularly useful, and just increase the output entropy. The component rotation feature can also cause odd looking color artifacts.
- Don't use modes 0, 2, 3, and possibly 4: These modes are less useful, at least on albedo/specular/etc. maps, sRGB content, and photos/images. Almost all BC7 encoders, including ispc_texcomp's, can't even handle mode 0 correctly anyway. Mode 4 is useful on decorrelated alpha; if your content doesn't have much of that, just always use mode 5.
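As a sketch of the mode-error weighting trick from the list above: scale each mode's raw error before comparing, with weights below 1.0 favoring modes 1 and 6. The weights here are illustrative, not bc7enc_rdo's actual values:

```cpp
#include <cassert>

// Pick a BC7 mode by comparing weighted errors. Weights < 1.0 bias the
// encoder toward that mode; other modes must then win by a large enough
// margin to be chosen, which concentrates the mode histogram and reduces
// the output data's entropy. The 0.7 weights are hypothetical.
static int pick_mode(const float raw_err[8])
{
    const float weight[8] = { 1.0f, 0.7f, 1.0f, 1.0f, 1.0f, 1.0f, 0.7f, 1.0f };
    int best_mode = 0;
    float best = raw_err[0] * weight[0];
    for (int m = 1; m < 8; m++)
    {
        float e = raw_err[m] * weight[m];
        if (e < best) { best = e; best_mode = m; }
    }
    return best_mode;
}
```

With these weights, a mode like 3 has to beat mode 6's raw error by roughly 30% before the encoder will actually switch to it.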
More RDO BC7 encoder progress
With BC7 RDO encoding, you really need an LZ simulator of some sort. Or you need decent approximations. Once you can simulate how many bits a block compresses to, you can then have the encoder try replacing byte aligned sequences within each block (with sequences that appear in previous blocks). This is the key magic that makes this method work so well. You need to "talk" to the LZ compressor in the primary language it understands: 2+ or 3+ length byte matches.
For example, with mode 6 the selectors are 4 bits per texel and are aligned at the end of the block, so each byte holds the selectors for 2 texels. If your p-bits are always [0,1] (mine are in RDO mode), then it's easy to substitute various byte regions from previously encoded mode 6 blocks and see what LZ does.
This is pretty awesome because it allows the encoder to escape being forced to always use an entire previous block's selectors, greatly reducing block artifacts.
In one experiment, around 40% of the blocks that got selector byte substitutions from previous blocks are from plugging in 3 or 4 byte matches and evaluating the Lagrangian.
40% is ridiculously high - which means this technique works well. It'll work with BC1 too. The downside (as usual) is encoding performance.
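The substitution described above amounts to enumerating byte-aligned (offset, length) copies from a previously encoded block and scoring each candidate with the Lagrangian. A minimal sketch; the real ERT restricts which byte ranges it touches (selectors, endpoints) and plugs each trial into its LZ bit model, both omitted here:

```cpp
#include <cstdint>
#include <cstring>

// Generate one trial block: copy `len` bytes at byte offset `ofs` from a
// previously encoded 16-byte BC7 block into a copy of the current block.
// The caller then decodes the trial, computes its MSE, estimates its LZ
// bits, and evaluates j = MSE * scale + bits * lambda (not shown).
static void make_substitution_trial(const uint8_t cur[16], const uint8_t prev[16],
                                    int ofs, int len, uint8_t trial[16])
{
    std::memcpy(trial, cur, 16);
    std::memcpy(trial + ofs, prev + ofs, len);
}

// Enumerate every Deflate-visible candidate within one 16-byte block:
// all (ofs, len) pairs with len >= 3 (the minimum Deflate match length).
template <typename F>
static void for_each_trial(F f)
{
    for (int len = 3; len <= 16; len++)
        for (int ofs = 0; ofs + len <= 16; ofs++)
            f(ofs, len);
}
```

len = 16 is the full-block (VQ-like) replacement; the 3 and 4 byte copies are the short matches that, per the experiment above, account for around 40% of the accepted substitutions.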
- RDO BC7 mode 1+6, lambda 10.0, 8KB max search distance, match replacements taken from up to 2 previous blocks
- RDO BC7 mode 1+6, lambda 12.0, 8KB max search distance, match replacements taken from up to 2 previous blocks