Sunday, May 13, 2018

Graphing BC7 with random codec options

I used a different set of data (and a different random seed) to compute this scattergraph vs. my previous post. This time, I've highlighted all random solutions that enable mode 1:


Modes 1 and 6 are the key modes for opaque textures. Mode 6 is used across the entire frontier, and mode 1 is used across all of the frontier above a threshold of about ~49.4 dB PSNR. (This is why I open sourced bc7enc16.)

Here's the same graph but I've marked in red two key solutions. The left one is mode 1 only, and the upper right one is modes 1+6. There are several variants of each, I just choose the highest quality ones to mark. Mode 1+6 is at a perfect spot: very high quality and just at the BC7 "wall" where quality plateaus.

Mode 1 only is also at a great spot. It's fast and has excellent quality with no visible block artifacts (unlike exclusively mode 6 which will be full of them).


Here's a graph showing all solutions that use mode 6 in brown, with the rest in blue:


Finally, here's a graph generated using perceptual colorspace metrics. To speed up the test I reduced the test texture so it had 1/16th as many blocks (randomly chosen from the original test texture). One thing that appears in this graph is that, with perceptual metrics and Luma PSNR, mode 6 by itself is more effective than with RGB metrics.


Saturday, May 12, 2018

BC7 opaque encoding sweetspots

By running our non-RDO codec in an automated test thousands of times, I've identified 3-4 major encoding time vs. quality sweetspots. These regions are somewhat specific to our codec and how it's written and optimized, but I suspect rough regions like this will exist in all correctly written BC7 codecs (either CPU or GPU). Excluding ispc_texcomp, most codecs have key pbit handling flaws preventing them from exploiting some of these regions at all.

It's commonly believed that BC7 has this incredibly large search space (which it does) and is slow to encode, but only a few mode combinations and encoder search settings are actually valuable (i.e. highest quality for a given amount of encoding time). The valuable codec settings appear to fall into these four major regions (three if you ignore the last impractically slow region):

Note the max practical PSNR for this test corpus is ~51.39 dB RGB avg. PSNR. For reference, BC1 non-perceptual is around 10 dB lower than BC7 on average, so even real time BC7 is massively better than BC1's quality.

Region 1: Real time


The first sweetspot is the "extremely fast" or real time region. Amazingly, it's possible to produce results in this region with minimal or no obvious block artifacts.

In this region, you only use mode 6 (which will have noticeable block artifacts), or you combine modes 0 and 6 but you limit the # of mode 0 partitions examined by your partition estimator to only a small amount, like 1-4. Only a single partition (the best one returned by the estimator) is evaluated.

You can increase the # of partitions examined in mode 0 to improve quality a little more, which will reduce or eliminate block artifacts. You must handle pbits correctly (as I've detailed in previous posts) to exploit this mode combination, otherwise mode 0 will be useless due to massive banding.

Encoding time: .5-3.5 secs, average quality: 48.2-49.09 dB RGB PSNR

Region 2: Fast


The second region uses modes 1, 5, 6, and possibly 3. The max # of partitions examined by the estimator ranges from 1-34, and the encoder uses the strongest pbit handling it supports (trying all combinations of pbits using correct rounding). The encoder can also try varying the selectors a bit to improve the endpoints. In mode 5 the encoder tries all component rotations. In this region there will be no visible block artifacts if the # of partitions examined in mode 1 is high enough (not sure of the threshold yet, probably at least 16).

Encoding time: 5.5-16.4 secs, average quality: 49.8-50.8 dB RGB PSNR

Region 3: Basic


Most of the third major region uses modes 1, 3 and 6. Mode 2 can be added for slightly more quality. The # of partitions examined by the estimator ranges from 42-55, and the # of partitions evaluated ranges from 1-9. This region uses exhaustive pbit evaluation with correct rounding, and the evaluator tries several different ways of varying the selectors. There are no visible block artifacts in this region.

Encoding time: 21-54.0 secs, average quality: 50.98-51.24 dB RGB PSNR

Region 4: Slow to Exhaustive


Beyond this you can enable more modes, more partitions, etc. to very slightly increase quality. I have breakdowns on some interesting configurations here, but the massive increase in encoding time just isn't worth the tiny imperceptible quality gains.

Encoding time: 132.8-401.97 secs, average quality: 51.36-51.39 dB RGB PSNR


Graphing our BC7 encoder's quality vs. encoding time for opaque textures

I non-RDO encoded to BC7 a 4k test texture containing random blocks chosen from a few thousand input textures 5000 times, using random codec settings for each trial encode. I recorded all the results to a CSV file.

The various stages of our codec can be configured in various ways. It's not immediately obvious or intuitive what settings are actually valuable, so we must run tests like this. Here are the various settings:

- Which modes to try
- Max # of partitions to examine for each mode (max of 16 or 64)
- Max # of partitions to evaluate after estimation, for each mode
- Various endpoint refinement settings ("uber" setting, iterative endpoint refinement trials)
- pbit refinement mode

Here's the resulting graph time vs. quality (RGB avg. PSNR) graph:


I examined this graph to help come up with strong codec profiles. Some key takeaways after examining the Pareto frontier of this graph (the uppermost points on the convex hull from left to right):
  • Ironically BC7 mode 0 (the mostly ignored 3 subset mode) is at its highest value in some of our fastest codec settings. Our 2 fastest settings use just mode 6. After this, we add mode 0 but just examine the first 1 or 4 partitions in our partition estimator. This combo is strong! I intuitively knew that mode 0 had unique value if the pbits were handled correctly. 3 subsets with 3-bit indices is a powerful combination. (If our partition estimator was faster we could afford to check more partitions in this mode before another set of codec settings becomes better.)
  • Mode 0 is only valuable if you limit the # of partitions examined during estimation to the first 1 or 4 (and just evaluate the best). In this case, when combined with 6 it's a uniquely powerful combination. With every other practical set of encoder settings (anything below 219 secs), mode 0 is of no value. 
  • Mode 4 is totally useless for opaque textures (for all settings below 401 secs of compute). Note we currently always evaluate all the rotations and index settings for mode 4 when it's enabled. This is possibly a minor weakness in our encoder, so I'm going to fix this and regenerate the graph.
  • Mode 6 is always Pareto optimal, i.e. it's always enabled in every optimal setting of the codec across the entire frontier. It doesn't have subsets, but it does have large 777.1 endpoints and 4-bit selectors. Mode 1 is also very strong, and is used across the entire frontier beyond the very fastest settings.
  • There's a very steep quality barrier at around 25-30 secs of compute. Beyond this inflection point only minor quality gains (of around .2-.4 dB) can be made - but only with large increases in encoding CPU time. Every BC7 codec I've seen hits a wall like this, sooner or later.
  • The sweetspot for our codec, at the beginning of this steep wall, is around 21 seconds of compute (~42x slower than mode 6 only). This config uses modes 1,3,6, limits the max # of partitions examined by the estimator to 42 for 1/3, and only evaluates the single best partition for modes 1/3. This mode also enables a more exhaustive pbit refinement mode, and a deeper endpoint refinement mode. Note we use the same estimated partition in mode 1/3 (they're both 2 subset modes), which is probably why this combo stands out. 
  • I'm not getting enough data samples on the frontier as I would like. Most samples have no value. I either need a more efficient way of computing random parameters, or I need to just use more samples.
  • It's interesting to explore the non-Pareto optimal solutions. Mode 5 only with pbit searching is a really bad performer (fast but very low quality).

Friday, May 11, 2018

Basis non-RDO BC7 benchmark

This graph shows the performance of ispc_texcomp at each of its supported opaque profiles (from ultrafast to slow) vs. Basis's non-RDO BC7 encoder at various settings. I haven't decided on the settings to use for each of our profiles yet, which is why I'm generating graphs like this.


For reference, BC1 non-perceptual gets ~36.92 dB on this test set, so even just mode 6 BC7 (the first data points on the bottom left) is superior to BC1's quality.

Basis non-RDO BC7 also has a "ultra" profile which is ~.49 dB better than ispc_texcomp's slow profile (its highest quality mode), but you pay a steep price in encoding time (959 secs vs. ~350 secs for ispc_texcomp's slow profile). In this mode Basis is even better than the reference BC7 encoder in NVidia's NVTT (but we're ridiculously faster):

Basis BC7 ultra:   953.2 secs   47.255552 dB
Basis BC7 slow:    365.5 secs   47.162389 dB
NVTT:              28061.9 secs 47.141580 dB
ispc_texcomp slow: 353.6 secs   46.769749 dB


The leftmost bottom samples are mode 6 only (ispc_texcomp's ultrafast profile). We are 19% faster at slightly higher quality here.

Basis non-RDO BC7 supports a powerful perceptual mode which I'll benchmark tomorrow. In this mode we kinda wreck ispc_texcomp at the same Luma PSNR (but to be fair ispc_texcomp doesn't support a perceptual mode at all, which is a serious deficiency).

BC7 mode 0-only encoding examples

BC7 mode 0 (3 subsets, 444.1 endpoints with a unique p-bit per endpoint, 3-bit indices, 16 partitions) is probably the most difficult mode to handle correctly. If you don't do the pbits right, the results look terrible (very banded). All BC7 encoder I've examined (from Intel, NVidia, Volition, and Microsoft) have weak mode 0 encoders (and most drop the ball with pbits).

To put things into perspective, mode 0 is better than BC1 by approximately 5-6 dB RGB PSNR on average, when done correctly. All-mode BC7 is 10-12 dB better on average than BC1 for opaque textures.

Here are some example mode-0 only encodings created with the ispc vectorized non-RDO BC7 encoder in Basis. Notice there's no visible banding (there shouldn't be in a properly written encoder).











Tuesday, May 8, 2018

One last non-RDO BC7 benchmark: ispc_texcomp slow vs. my encoder in perceptual mode

"ISPC" is Intel's Fast ISPC Texture Compressor. (Both of our encoders use ispc.)

In perceptual mode, you basically trade off around 1 dB of RGB PSNR for a gain of 2.6 dB Luma PSNR, relative to ispc_texcomp. Our per-mode encoding time is actually slower in perceptual mode, but we don't use modes 4 and 5 in perceptual mode yet which helps compensate for the slowdown.

Perceptual mode (total encode CPU time, average RGB PSNR, average Luma PSNR):
ISPC: 353.245527 46.769749 48.568988
Ours: 216.838825 45.782654 51.185091 

ISPC mode histogram:
367473 370942 26227 633692 26789 116571 318478 0
Ours mode histogram:
47882 409997 18025 185524 0 0 1198744 0

RGB mode:
ISPC: 352.133776 46.769749 48.568988
Ours: 227.692341 47.029635 48.903907 

ISPC mode histogram:
367473 370942 26227 633692 26789 116571 318478 0
Ours mode histogram:
33264 411398 21192 186552 25437 68297 1114032 0

I'm writing some API's to expose this encoder in the Basis DLL/SO/dylib. It'll be exposed just like Intel's encoder (you call it with an array of blocks and you handle the multithreading).

Friday, April 27, 2018

BC7 showdown #2: Basis ispc vs. NVidia Texture Tools vs. ispc_texcomp slow

Got my BC7 encoder test app linking with NVTT. The BC7 encoder in NVTT is the granddaddy of them all, from what I understand. It's painfully slow but very high quality. I called it using the blob of code below. It supports weighted metrics, which is great.

Anyhow, here are the test results using linear RGB metrics (non-perceptual), comparing NVTT and ispc_texcomp vs. Basis non-RDO BC7 ispc. The test corpus was kodim01-24 plus a number of other images/textures I commonly test with. I turned up the Basis BC7 encoder options to the highest currently supported.

Basis BC7:         365.5 secs   47.162389 dB
NVTT:              28061.9 secs 47.141580 dB
ispc_texcomp slow: 353.6 secs   46.769749 dB

This was a multithreaded test (using OpenMP) on a dual Xeon workstation supporting AVX.

Here's the code snippet for calling NVTT's AVPCL encoder directly to pack BC7 blocks (bypassing the rest of NVTT because I don't want to pack entire textures, just blocks):

#include "../3rdparty/nvidia-texture-tools/src/bc7/avpcl.h"

void nvtt_bc7_compress(void *pBlock, const uint8_t *pPixels, bool perceptual)
{
AVPCL::mode_rgb = false;
AVPCL::flag_premult = false; //(alphaMode == AlphaMode_Premultiplied);
AVPCL::flag_nonuniform = false;
AVPCL::flag_nonuniform_ati = perceptual;

// Convert NVTT's tile struct to AVPCL's.
AVPCL::Tile avpclTile(4, 4);
memset(avpclTile.data, 0, sizeof(avpclTile.data));

for (uint y = 0; y < 4; ++y) 
{
for (uint x = 0; x < 4; ++x) 
{
nv::Vector4 &p = avpclTile.data[y][x];

p.x = pPixels[0];
p.y = pPixels[1];
p.z = pPixels[2];
p.w = pPixels[3];

pPixels += 4;

avpclTile.importance_map[y][x] = 1.0f; //weights[4*y+x];
}
}

AVPCL::compress(avpclTile, (char *)pBlock);
}

Thursday, April 26, 2018

New BC7 encoder open sourced

bc7m16.c/.h is one of the strongest CPU BC7 encoders available at the moment for opaque textures:

https://github.com/richgel999/bc7enc16

In perceptual mode using luma PSNR on opaque textures it beats ispc_texcomp, without vectorization and only using 2 modes.

Using linear RGB metrics it's around 20% slower at same PSNR (compared to ispc_texcomp's fast profile). I don't suggest you use linear RGB metrics because the results visually look worse vs. perceptual metrics, as we learned years ago with BC1 encoding.

This encoder was basically the prototype for my new non-RDO ispc BC7 encoder I wrote for Basis, which supports all the modes and is even faster.

Wednesday, April 25, 2018

A tale of multiple BC7 encoders

To get going with BC7, I found an open source block decompressor and first created a BC7 struct (using bitfields) to create a single mode 6 block. I filled in the fields for a simple mode 6 block, called the decompressor, and examined the output. (Turns out this decompressor has some bugs which I've reported to the author.)

For my first encoder I started in C++ and wrote a straightforward but feature complete encoder. It supports exhaustive partition scanning or it can estimate a single partition to evaluate which is around 12x faster. It's also pretty generic, so you can call the endpoint optimizer with any number of pixels (not just 16). About the only thing good you can say about this encoder is that it's high quality and easy to follow. Its performance is dreadful (even worse then DirectXTex, believe it or not) unless you force it to evaluate only a single partition per mode in which case it's ~2.6x faster than DirectXTex.

I then wrote a new, simpler C++ encoder, then quickly ported this to C. For this simpler encoder, I focused on just the modes that mattered most to performance vs. quality. The endpoint optimizer was less generic and more focused, and I removed some less useful features in perceptual mode. I then added in additional modes to this encoder to bring its quality up vs. Intel's ispc_texcomp. After this, I took the C encoder and got it building with iscp, then profiled and optimized this over a period of a week or so.

1. Ridiculously slow C++ encoder, evaluate all partitions:
46286.7 secs, 51.56 Luma PSNR

2. Ridiculously slow C++ encoder, evaluate the single best partition only:
4615.5 secs, 51.35 Luma PSNR

3. Faster C encoder:
458.7 secs, 50.77 Luma PSNR

4. Vectorized encoder - C encoder ported to ispc then further optimized and enhanced:
76.4 secs, 51.16 Luma PSNR

Notice each time I got roughly a 6-10x performance increase at similar quality. The iscp version is currently 6x faster than the C version. To get it there I had to examine the code generated for all the inner loops and tweak the iscp code for decent code generation. Without doing this it was only around 2x faster.

I used the previous encoder as a reference design during each step of this optimization process, which was invaluable for checking for regressions.

Some things about BC7 that are commonly believed but aren't really true, and other things I've learned along the way (along with some light ranting but I mean well):

- "BC7 has this massive search space - it takes forever to encode!" Actually, with a properly written encoder it's not that big of a deal. You can very quickly estimate which partition(s) to examine (which is an extremely vectorizable operation), you can drop half the modes for alpha textures, mode 7 is useless with opaque textures, the 3 subset modes can be dropped unless you want absolute max quality, and many fast endpoint optimization methods from BC1 are usable with BC7. BC1-style cluster fit isn't a good match for BC7 (because it has 3 and 4 bit indices), but you can easily approximate it.

For a fast opaque-only encoder, my statistics show that supporting just modes 1 and 6, or 0/1/6, results in reasonably high quality without noticeable block artifacts.

- CPU encoders are too slow, so let's use the GPU!
Actually, a properly vectorized CPU encoder can be extremely fast. There is nothing special about BC7 that requires compute shaders if you do it correctly. Using gradient descent or calling rand() in your encoder's inner loop are red flags that you are going off the rails. The search space is massive but most of it is useless.

You need the right algorithms, and strong heuristics matter in GPU texture compression. Write a BC1 encoder that can compete against squish first, then tackle BC7.

- Graphics dev BC7 folklore: "3 subset modes are useless!"
No, they aren't. iscp_texcomp and my encoder make heavy use of these modes. The perf reduction for checking the 3 subset modes is minimal but the PSNR gain from using them is high:

Across my 31 texture corpus, all modes (but 7) gets 46.72 dB, disabling the two 3 subset modes I get 45.96 dB. This is a significant difference. Some timings: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled).

Of the publicly available BC7 encoders I've tested or looked at so far none of them have strong or properly working mode 0 encoders. Mode 0 is particularly tough to handle correctly because of its 4-bit/component endpoints with pbits. Most encoders mess up their pbit handling so mode 0 appears much weaker than it really is. 

- Excluding iscp_texcomp, I've been amazed at how weak the open source CPU/GPU BC7 encoders are. Every single encoder I've looked at except for iscp_texcomp has pbit handling issues or bugs holding them back. DirectXTex, Volition, NVTT - all use pbit voting/parity without properly rounding the component value, which is incorrect and introduces significant error. DirectXTex had outright pbit bugs causing it to return lower quality textures when 3 subset modes were enabled.

Note to encoder authors: You can't estimate error for a trial solution without factoring in the pbits, or your encoder will make very bad choices that actually increase the solution error. Those LSB's are valuable, and you can't flip them without adjusting the components too.

- None of the available encoders support perceptual colorspace metrics (such as computing weighted error in YCbCr space like etc2comp does), which is hugely valuable in BC7. You can get up to 2 dB gain from switching to perceptual metrics, which means your encoder doesn't have to work nearly as hard to compete against a non-perceptual encoder.

Before you write a BC1 or BC7 encoder, you should probably write a simple JPEG compressor first to learn all about chroma subsampling, etc.

- SSIM and PSNR are highly correlated for basic 4x4 block compression. I'll blog about this soon. I test with both, but honestly it's just not valuable to do so because they are so correlated and SSIM is more expensive to compute.

- Whoever wrote iscp_texcomp at Intel should be given a big bonus because it's overall very good. Without it the BC7/BC6H situation would be pretty bad.

Comparing ispc_texcomp alpha performance vs. my encoder

Most papers and encoders focus on opaque performance with BC7, but alpha textures are very important too. BC7's alpha support is somewhat weaker than opaque, especially with alpha signals that are uncorrelated vs. RGB. It only has a single 2 subset mode that can handle alpha, with limited color precision (555.1 with 2-bit indices).

This test exercises each codec in a different way than my previous opaque-only tests. Modes 0-3 are useless with transparent blocks.

I've finished the alpha path in my new non-RDO BC7 encoder. Results on a 4k test texture containing random 4x4 blocks (both in RGB and A) picked from thousands of textures in my corpus:

My encoder - higher quality settings:
Time: 42.4 secs
RGB: 46.058 RGB Avg. PSNR
A: 44.521 PSNR

My encoder - lower quality settings:
Time: 21.9 secs
RGB: 45.986
A: 44.539

My encoder - faster settings:
Time: 12.8 secs
RGB: 45.850
A: 43.949

ispc_texcomp slow:
Time: 155.4 secs
RGB: 45.826
A: 44.216

ispc_texcomp basic: 
Time: 45.6 secs
RGB: 45.820
A: 44.188

ispc_texcomp fast:
Time: 23.3 secs
RGB: 45.647
A: 44.307

This was with linear colorspace metrics, and benchmarking was on a single thread. The RGB/A stats are PSNR (higher is better). (In case you're wondering, SSIM and PSNR are highly correlated with block compression, so I usually use PSNR until I start doing whole-texture stuff with RDO.)

I'm now ~2x faster at higher quality vs. the fastest CPU BC7 encoder I'm aware of, and there are several easy optimizations left. And this is before enabling perceptual metrics.

BC7 mode utilization comparison of three encoders

I've been doing some benchmarking today to see where I stand with raw (non-RDO) BC7 encoding. Depending on the profile, I'm up to 2.26x faster at higher average quality using linear colorspace metrics vs. ispc_texcomp.

BC7 mode utilization, ispc_texcomp's basic profile, with my encoder set to a roughly similar profile:


Here's ispc_texcomp's slow profile, my encoder at a higher quality profile, and DirectXTex with BC_FLAGS_USE_3SUBSETS (with the pbit bug fixed so this is a fair comparison).


After modifying DirectXTex to try mode 6 first it get slightly faster and uses mode 6 much more often:


This was a multithreaded test. The timings are the overall amount of CPU time utilized only for encoding across all threads.

My encoder favors mode 6 on grayscale inputs (one large texture is grayscale), and it's always the first mode that's checked for opaque blocks so on simple blocks mode 6 gets favored. Mode 6 has very good endpoint precision (7777.1) and large 4-bit indices, so even on complex blocks it's fairly good.

GPU texture compression error metrics

While working on an encoder I conduct a lot of experiments (probably thousands over time) to improve it. To check for regressions or silly bugs, you must use some sort of error metrics otherwise it'll be impossible to make forward progress just by examining the output with your eye.

Here's what I commonly use:

Total time: 0.124466, encode-only CPU time: 3.288052
Compressed size: 360041, 7.325053 bpp

BC7 Mode histogram:
6 22 2 0 3809 2754 17936 0

RGB Total   Error: Max:  53, Mean: 5.220, MSE: 21.670, RMSE: 4.655, PSNR: 34.772, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

RGB Average Error: Max:  53, Mean: 1.740, MSE: 7.223, RMSE: 2.688, PSNR: 39.543, SSIM: 0.985854, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Luma        Error: Max:  35, Mean: 1.123, MSE: 3.048, RMSE: 1.746, PSNR: 43.290, SSIM: 0.993128, CRCA: 0x351EFAEF CRCB: 0xA29288A7

Red         Error: Max:  52, Mean: 1.766, MSE: 7.544, RMSE: 2.747, PSNR: 39.355, SSIM: 0.987541, CRCA: 0xC673D5EA CRCB: 0x8623AA59

Green       Error: Max:  40, Mean: 1.553, MSE: 5.462, RMSE: 2.337, PSNR: 40.758, SSIM: 0.989642, CRCA: 0x2BF4294F CRCB: 0x418E5EE1

Blue        Error: Max:  53, Mean: 1.900, MSE: 8.664, RMSE: 2.944, PSNR: 38.753, SSIM: 0.980377, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Alpha       Error: Max:  31, Mean: 1.039, MSE: 3.308, RMSE: 1.819, PSNR: 42.935, SSIM: 0.976243, CRCA: 0x524049DA CRCB: 0x7492A642

I report some timings (two timings because it may be a threaded encode), the compressed size of the data (using LZMA) to get a feel for the encoder's output entropy, the mode histogram on those modes that support this concept, and then a bunch of metrics. 

I report a bunch of variations of PSNR (RGB Total and RGB Average) because there really are no rock solid standards for this and every author or tool seems to do this in a different way. I sometimes also use RGBA PSNR to compare my results against other tools.

Sometimes, an encoder bug will subtly break just a few blocks. That's where the Max errors are really handy.

Luma error is what I pay attention to during perceptual encoding tests, and RGB Average for non-perceptual tests. SSIM and PSNR are highly correlated in my experience for block compression work, which I'll show in a future blog post.

I also use PSNR per second or SSIM per second metrics (or some variation).

The CRC's are there to detect when the output (or input) data is exactly the same across runs.

I have a tool which can compare two encoders across a large corpus of test textures, which is really handy for finding regressions.

I also test with large textures containing random blocks gathered from thousands of test textures. For example (blogger.com has resampled it, here's the original 4K PNG):


These corpus textures can be generated using crunch in the corpus_gen mode.

I've found that testing like this is a reliable way of gauging the overall strength of an encoder vs. another. Graphics devs will instantly say "but wait why aren't you looking at minimizing block artifacts by looking at adjacent blocks!" That turns block compression into a global optimization problem (like PVRTC), and it's already hard enough as it is to do in a practical amount of CPU time. Also with BC7, the error is already pretty low (45-60 dB), and BC1-style block artifacts in a properly written BC7 encoder at high quality settings are uncommon.

Monday, April 23, 2018

LZHAM decompressor vectorization

The is just a sketch of an idea. It might not pan out, there are too many details to know until I try it. LZHAM is an old codec now, but maybe I could stretch it a bit further before moving on to a new design.

It should be possible to decode multiple segments of each LZHAM block simultaneously with vector instructions. Porting the decoder to C, then vectorizing the decoder loop with SPMD (using iscp) shouldn't be that hard. The compressed stream will need some hints so it knows where each lane needs to start decoding from.

So for 8 parallel decodes, at the beginning of the block you memcpy() the arithmetic and Huffman statistics into 8 copies, run the SPMD decoder on say 1-2K bytes per lane, then sync the statistics after 8 1-2K blocks are processed. Then repeat the process.

Lane 0 will be able to process match copies from any previously decompressed bytes, but lane 1 can't access lane 0's decoded bytes, and lane 2 can't access lane 1/0's, etc. That'll definitely lower ratio. However, it should be possible for the compressor to simulate lane 0's decompressor and predict which bytes are available in lane 1 at each point in the compressed data stream.

The various if() statements in the decoder's inner loop won't be particularly coherent, which would impact lane efficiency. But even a 2-3x improvement in decoding speed would be pretty nice.

DirectXTex BC7 3 subset wierdness

I brought Microsoft's DirectXTex project (latest code) into my test project, to see how it fairs vs. ispc_texcomp and my encoder. Unfortunately, it appears broken. Across 31 test images (kodim and others):

DirectXTex BC7:

No flags (0): 9972.6 secs 44.41 dB

BC_FLAGS_USE_3SUBSETS: 13534.6 secs 44.25 dB

This is wrong. Quality should go up with 3 subset modes enabled, not down. I'm tempted to go figure out what's wrong in there myself, but it's a lot of code.

By comparison, my ispc encoder gets 477.1 secs 46.72 dB (using high quality settings). ispc_texcomp is in the same ballpark. With 3 subset modes disabled, I get 45.96 dB (as expected - the 3 subset modes are useful!).

I verified that the flag is doing something. Here's the BC7 mode histogram for kodim01 with the flags parameter to DirectX::D3DXEncodeBC7() set to 0:

0 17968 0 1752 0 0 4856 0

With 3 subsets enabled:

1435 16647 26 1675 0 0 4793 0

The source looks nice and readable, and as a library it was dead-simple to get it building and linked against my stuff. But it doesn't appear to be a production-ready encoder, it's more like a sample.

I'm calling it from multiple threads using OpenMP (it's too slow to benchmark otherwise). It makes my 20 core Xeon workstation crawl for a while, it's that slow.

Also, there is no need to disable 3 subset modes in a properly written encoder. Some timings with my encoder: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled). This was on a 20 core Xeon workstation/40 threads/31 images (kodim+others). 

The extra cost of a 3 subset mode isn't a big deal (3 endpoint optimizations) once you've estimated which partition(s) to examine more deeply. Partition estimation is fast and simple with SIMD, and a nice property of 3 subset modes is that the # of pixels fed to the endpoint optimizer per subset is rather low (enabling 3-subset specific optimizations). If your pbit handling is correct these modes are quite valuable.


Sunday, April 22, 2018

Proper pbit computation in the BC7 texture format

The BC7 GPU texture format supports the clever concept of endpoint "pbits", where the LSB's of RGB(A) endpoints are forced to the same value so only 1 bit (vs. 3 or 4) needs to be coded. BC7's use of pbits saves precious bits which can be used for other things which decrease error more. Some modes support a unique pbit per endpoint, and some only have 1 pbit for each endpoint pair.

I'm covering this because the majority of available BC7 encoders mess this important detail up. (I now kinda doubt any available BC7 encoder handles this correctly in all modes.) The overall difference across a 31 texture corpus (including the kodim images) is .26 dB RGB PSNR, which is quite a lot considering the CPU/GPU cost of doing this correctly vs. incorrectly is the same. (The improvement is even greater if you try all pbit combinations with proper rounding: .4 dB.)

ispc_texcomp handles this correctly for sure in most if not all modes, while the DirectXTex CPU, Volition GPU, and NVidia Texture Tool encoders don't as far as I can tell (they use pbit voting without proper rounding - the worst). The difference to doing this correctly in some modes is pretty significant - by at least ~.6 dB in mode 0!

Not doing this properly means your encoder will run slower because it will have to work harder (scanning more of the search space) to keep PSNR up vs. the competition. The amount of compute involved in lifting a BC7 encoder "up" by .26 dB across an entire test corpus is actually quite large, because there's a very steep quality vs. compute "wall" in BC7.

Here are some of the ways p-bits can be computed. The RGB PSNR's were for a single encoded image (kodim18), purposely limited to mode 0 (with 4 bit components+unique per-endpoint pbits) to emphasize the importance of doing this correctly:
  • 40.217 dB: pbit search+compensated rounding: Compute properly rounded component endpoints compensating for the chosen pbit, try all pbit combinations. This is an encoder option in my new BC7 encoder. Encoding error evaluation cost: 2X or 4X (expensive!)
  • 39.663 dB: Round to middle of component bin+pbit search: Compute rounded endpoints (with a scale factor not including the pbit), then evaluate the full error of all 2 or 4 pbit possibilities. This is what I initially started doing, because it's trivial to code. In mode 0, you scale by 2^4, round, then iterate through all the pbits and test the error of each combination. Error evaluation cost: 2X or 4X
  • 39.431 dB: Compensated rounding (ispc_texcomp method): Proper quantization interval rounding, factoring in the shift introduced when the pbit is 1 vs. 0. The key idea: If an endpoint scaled and rounded to full-precision (with a scale factoring in the pbit) has an LSB which differs from the pbit actually encoded, you must properly round the output component value to compensate for the different LSB or you will introduce more error than necessary. So basically, if the LSB you want is different from what's encoded, you need to correctly round the encoded component index to compensate for this difference. You must also factor in the way the hardware scales up the encoded component+pbit to 8-bits. Error evaluation cost: 1X
  • 39.151 dB: Voting/parity (DirectXTex/Volition): Count how many endpoint components in the scaled colors (with a scale factor including the pbit) sharing each pbit have set LSB's. If half or more do then set the encoded pbit to 1, otherwise 0. pbit voting doesn't round the encoded components so it introduces a lot of error. Error evaluation cost: 1X
  • 38.712 dB: Always 0 or 0,1
  • 37.878: Always 0 or 0,0
I tested a few different ways of breaking ties when computing pbits by voting and just reported the best one. At least on this image biasing the high endpoint towards 1 helps a little:

Shared  Unique
> >    39.053
> >=   39.151
>= >   38.996
>= >=  38.432
>=  > >=   39.151

This stuff is surprisingly tricky to get right, so here's a mode 0 example to illustrate what's going on. Each component can be coded to 16 possible values with one pbit selecting between two different ramps. So factoring in the pbit we have 32 possible representable 8-bit values. Here are the resulting 8-bit values (scaled using the shift+or method BC7 uses - not pure arithmetic scaling by 255/31 which is slightly different):

pbit 0:
0
16
33
49
66
82
99
115
132
148
165
181
198
214
231
247

pbit 1:
8
24
41
57
74
90
107
123
140
156
173
189
206
222
239
255

Let's say the encoder wants to encode an endpoint value of 9/255 (using 8-bit precision) in mode 0 (4-bit components+pbit). The pbit voting encoders will compute a quantized/scaled component value of 1/31 (from a range of [0,31] - not [0,15] because we're factoring in the pbit). The LSB is 1 and the encoded component index (the top 4 MSB's) is 0, and if more than half of the other component LSB's are also 1 we're ok. In the good case we're coding a value of 8/255, which is closer to 9/255 than 24/255.

If instead a pbit of 0 is chosen, we're now encoding a value of 0/255 (because the encoded component index of 0 wasn't compensated), when we should have chosen the closer value of 16/255 (i.e. a component index of 1). Choosing the wrong LSB and not compensating the index has resulted in increased error.

There's an additional bit of complexity to all this: The hardware scales the mode 0 index+pbit up to 8-bits by shifting the index+pbit left by 3 bits for mode 0, then it takes the upper 3 MSB's of this and logically or's them into the lower 3 bits to fill in. This isn't quite the same as scaling by 255/31. So proper pbit determination code needs to factor this in, too. Here are the ramps computed using arithmetic scaling+rounding (notice they slightly differ from the previous ramps computed using shifting+or'ing):

0
16
33
49
66
82
99
115
132
148
165
181
197
214
230
247

8
25
41
58
74
90
107
123
140
156
173
189
206
222
239
255

I worked out the formulas involved on a piece of paper: 

How to compute [0,1] values from mode 0 bins+pbits (using arithmetic scaling, NOT hardware scaling):
pbit 0: value=bin*2/31
pbit 1: value=(bin*2+1)/31

How to compute mode 0 bins from [0,1] values with proper compensation/rounding (rearranging the equations+rounding) for each pbit index:
pbit 0: bin=floor(value*31/2+.5)
pbit 1: bin=floor((value*31-1)/2+.5)

Here's the clever code in ispc_texcomp that handles this correctly for modes with unique p-bits (modes 0,3,6,7). I bolded the bin calculations, which are slightly optimized forms of the previous set of equations. 

I believe there's actually a bug in here for mode 7 - I don't see it scaling the component values up to 8-bit bytes for this mode. It has a special case in there to handle mode 0, and modes 3/6 don't need scaling because they have 7-bit components, but mode 7 has 5-bit components. I didn't check the rest of the code to see if it actually handles mode 7 elsewhere, but it's possible ispc_texcomp's handling of mode 7 is actually broken due to this bug. Mode 7 isn't valuable when encoding opaque textures, but is pretty valuable for alpha textures because it's the only alpha mode that supports partitions.

///////////////////////////
// endpoint quantization

inline int unpack_to_byte(int v, uniform const int bits)
{
    assert(bits >= 4);
    int vv = v << (8 - bits);
    return vv + shift_right(vv, bits);
}

void ep_quant0367(int qep[], float ep[], uniform int mode, uniform int channels)
{
    uniform int bits = 7;
    if (mode == 0) bits = 4;
    if (mode == 7) bits = 5;

    uniform int levels = 1 << bits;
    uniform int levels2 = levels * 2 - 1;

    for (uniform int i = 0; i < 2; i++)
    {
        int qep_b[8];

        for (uniform int b = 0; b < 2; b++)
            for (uniform int p = 0; p < 4; p++)
            {
                int v = (int)((ep[i * 4 + p] / 255f*levels2 - b) / 2 + 0.5) * 2 + b;
                qep_b[b * 4 + p] = clamp(v, b, levels2 - 1 + b);
            }

        float ep_b[8];
        for (uniform int j = 0; j < 8; j++)
            ep_b[j] = qep_b[j];

        if (mode == 0)
            for (uniform int j = 0; j < 8; j++)
                ep_b[j] = unpack_to_byte(qep_b[j], 5);

        float err0 = 0f;
        float err1 = 0f;
        for (uniform int p = 0; p < channels; p++)
        {
            err0 += sq(ep[i * 4 + p] - ep_b[0 + p]);
            err1 += sq(ep[i * 4 + p] - ep_b[4 + p]);
        }

        for (uniform int p = 0; p < 4; p++)
            qep[i * 4 + p] = (err0 < err1) ? qep_b[0 + p] : qep_b[4 + p];
    }
}

Saturday, April 21, 2018

RDO BC7 encoder planning

I'm building my first RDO BC7/BC6H encoders for our product (Basis), and I'm trying to decide which BC7 modes to implement first. I've decided on modes 1+6 for opaque textures. For alpha, I've boiled down the options to modes 1+5+6, 1+4+6, or maybe 1+4+6+7.

For alpha I know I'll need either 4 and/or 5 because they are the only modes with uncorrelated alpha support. Mode 1 isn't optional because it supports 2 subsets, greatly reducing ugly block artifacts. The 3 subset modes aren't used much in practice and aren't necessary. Mode 6 is optional but boosts quality on simple blocks due to its awesome 7777.1 bit endpoint precision and beefy 4-bit selectors. The bare minimum modes for a decent opaque/alpha encoder seems to be 1+4 or 1+5, but adding 6 and 7 will improve quality especially with alpha textures.

Others have suggested that I do a RDO BC7 encoder using only a single subset mode (6 seems perfect), but the result would have BC1-style block artifacts. I've already built a 2 subset RDO encoder for ETC1 in Basis and it isn't that much harder to support 2 subsets vs. 1.

RDO GPU texture encoders try to optimize the encoded output data so that, after the output is LZ compressed (and the output pretty much always is!), you get as much quality per compressed bit as you can. The output bit patterns matter a lot. If you can reuse a bit pattern that's occurred in a nearby block, without impacting quality too much, an LZ codec can exploit this to issue a match vs. literals (saving bits and improving quality per compressed output bit). There's more to it than that, because you also want the ability to control the output quality vs. compressed size tradeoff in an artist-friendly way.

My RDO BC1 encoder in Basis uses Lagrangian optimization, and this is simplified because in BC1 the bit patterns nicely line up to byte boundaries which modern LZ codecs like. The endpoints and selectors aren't munged together, and a large chunk of the output entropy is in the selector data.

Anyhow, here's some of the data I used to make these decisions about BC7. I have a lot more data than this which I may post later.

image corpus was kodim+7 others
For comparison ispc_basic (using 7 modes) gets 46.54 dB
Encoder was set to extremely high quality (better than ispc_slow)

Single mode:

mode enctime   dB               
1    83        45.8 
3    91.2      43.4 
6    12.1      42.9 
2    12.9      42.7 
0    19.6      42.34
4    38.56     41.48
5    14.4      38.45 

Key mode combinations:
                 
1+6  86.37     46.27
1+5  11.161    45.79
1+6  21.9      45.7 
1+4  137.6     45.38

This data definitely isn't ideal, because I used my ispc encoder which was heavily tuned for performance. So modes 1 and 3 tried all 64 partitions in this test, while modes 0 and 2 only tried the best predicted partition. But it matches up with earlier data I generated from a more brute force encoder.

Note that the mode 4 and 5 encoders used all rotation/index selector options, which skews the output a bit. I doubt I'll be supporting these mode 4/5 options. Two 1+6 entries are for the encoder set to very high vs. much lower quality levels.

Some earlier encoding error data for just kodim18, using an earlier C++ version of my encoder:

Best of any mode error: 12988.968550

Best of modes 0 and 1: 13186.750320
Best of modes 1 and 6: 13440.390471
Best of modes 1 and 2: 13521.855790
Best of modes 1 and 4: 13565.064541
Best of modes 1 and 3: 13589.143902
Best of modes 1 and 5: 13592.709517
Best of modes 1 and 7: 13604.182592
Best of modes 1 and 1: 13605.033774
Best of modes 0 and 6: 14099.723969
Best of modes 0 and 3: 15046.616630
Best of modes 2 and 6: 15383.155463
Best of modes 0 and 4: 15563.136445
Best of modes 0 and 5: 15665.245609
Best of modes 0 and 2: 15892.424359
Best of modes 3 and 6: 15977.876955

Best of modes 0 and 1 and 6: 13037.149688
Best of modes 0 and 1 and 4: 13160.175075
Best of modes 0 and 1 and 2: 13168.570462
Best of modes 0 and 1 and 3: 13171.807469
Best of modes 0 and 1 and 5: 13176.588633
Best of modes 0 and 1 and 7: 13186.504616
Best of modes 0 and 0 and 1: 13186.750320
Best of modes 0 and 1 and 1: 13186.750320
Best of modes 1 and 2 and 6: 13360.493404
Best of modes 1 and 4 and 6: 13400.774007
Best of modes 1 and 5 and 6: 13429.100640
Best of modes 1 and 3 and 6: 13433.822985
Best of modes 1 and 6 and 7: 13439.954762
Best of modes 1 and 1 and 6: 13440.390471
Best of modes 1 and 6 and 6: 13440.390471
Best of modes 1 and 2 and 4: 13489.904966
Best of modes 1 and 2 and 3: 13508.004738
Best of modes 1 and 2 and 5: 13511.406144
Best of modes 1 and 2 and 2: 13521.855790
Best of modes 1 and 1 and 2: 13521.855790

The 1+6 combination is pretty powerful. Only a quarter dB separates it from ispc_basic on this corpus. Also, mode 1's bit packing is nicely aligned to bytes (more or less) - perfect for RDO encoding!

byte bits
0    2 6  mode+partition
1    6 2  endpoints
2    4 4  endpoints
3    2 6  endpoints
4    6 2  endpoints
5    4 4  endpoints
6    2 6  endpoints
7    6 2  endpoints
8    4 4  endpoints
9    2 6  endpoints
10   2 6  pbits+selectors
11   8    selectors
12   8    selectors
13   8    selectors 
14   8    selectors
15   8    selectors

Note that most of mode 1's output is endpoint data, not selector data, so partition/pbit biasing, endpoint quantization and bit pattern optimization will be critical.

BC7 showdown: ispc_texcomp vs. my ispc encoder

This benchmark compares the Fast ISPC Texture Compressor's BC7 encoder vs. my new ispc vectorized encoder. I've just barely begun to profile and optimize it, but it's already looking really strong. To create this encoder, I studied all other available BC7 encoders and leveraged all the things I learned while creating crunch's BC1 high-quality encoder and Basic's new ETC1 and universal format encoders. This is a non-RDO encoding test, i.e. what matters here is how much quality each encoder can achieve per unit time.

It was conducted on a 20 core Xeon workstation across 31 test images (first 24 are the kodim images). Both use AVX instructions. The quality metric is RGB average PSNR, perceptual mode disabled. The PSNR's below are averages across the entire set. The timings are the total amount of CPU time used only calling the encoder functions (across all threads). OpenMP was used for threading, and each encoder was called with 64 blocks per function call.

I'm currently focusing on ispc_texcomp's basic and slow profiles:

ispc_texcomp:
basic profile: 100.5 secs, 46.54 dB, .4631 dB/sec
slow profile: 355.29 secs, 46.77 dB, .1316 dB/sec

My encoder:
uber 0: 56.7 secs, 46.49 dB, .8199 dB/sec
uber 1: 86.4 secs, 46.72 dB, .5407 dB/sec
uber 2: 129.1 secs, 46.79 dB, .3624 dB/sec
uber 2 (2 refinement passes, 16 max 1,3 partitions): 161.9 secs, 46.84 dB, .2893 dB/sec
uber 3 (2 refinement passes, 32 max 1,3 partitions, pbit search): 215.2 secs, 46.91 dB, .2180 dB/sec
uber 4 (2 refinement passes, 64 max 1,3 partitions, pbit search): : 292.5 secs, 46.96 dB, .1605 dB/sec

The dB/sec. values are a simple measure of encoder efficiency. ispc_texcomp's slow profile at .1315 dB/sec. is working very hard for very little quality per unit time compared to its basic profile. The efficiency of both encoders decreases as the quality is improved, but ispc_texcomp falls off very rapidly above basic and mine falls off later. I believe a whole texture encoder like etc2comp's can more efficiently get through the quality barrier here.

What this boils down to: If you use ispc_texcomp, definitely avoid the slow profile (the tiny gain in quality isn't worth it). And it's definitely possible to compete against ispc_texcomp using plain RGB metrics.

Friday, April 20, 2018

ispc_texcomp BC7 issues

Been studying ispc_texcomp today to better understand why it's so slow compared to my encoder (currently by a factor of 2x at max quality). We do many of the same things, so why is it slower? Overall, there are many clever/smart things in there (overall it's surprisingly well done!), but it's being held back by weak vectorization, some missing optimizations, and it only supports linear RGB metrics.

Here's what I've found so far:

- The inner loops are bogged down with gathers and scatters. Definitely not good. There's even a set of helper functions at the top with a comment of "(perf warning expected)". (Umm - the compiler perf warnings are there for a reason!) For an example, check out block_quant().

The innermost loops should not have gathers, period.

- The partition estimation code is using full PCA. I've found this to be unnecessary in my testing - just using Waveren's bounding box approximation works well and is trivially vectorizable. After all, we're not trying to compute the actual output, just a reasonable approximation.

So ispc_texcomp goes into overkill mode and computes PCA while estimating the partition. At least the way it computes each subset's PCA is smart: it first computes the overall block's statistics/covariance, then it subtracts out the statistics of each partition's active (masked) pixels to compute each subset's individual covar.

Also, it's only computing what looks like an upper bound on the error from the block statistics, not an approximation of the actual error. The approximation of the actual error (factoring in quantization to 4/8/16 selectors) is extremely fast to compute with SIMD code, so it's not clear to me what's better yet.

Overall, the author seems to be favoring cleverness vs. exploiting the properties of fast but simple SIMD code.

- It uses squish-style iterative refinement after choosing the partition: Basically, it computes the PCA, comes up with some initial selectors, uses least squares to optimize the endpoints, then it computes new selectors and tries all over again a few times. In my tests, the PSNR gain from this method is too low (fraction of a dB) to justify the repeated LS computation and selector selection costs. Just do it once and then optionally try other things. It's more effective to just vary the selectors in simple ways (simplified cluster fit) in each trial.

- There's no support for perceptual colorspace metrics in there. This indirectly impacts performance (against other codecs that do support perceptual metrics) because it's stuck competing against RGB PSNR, and getting RGB PSNR up in BC7 is VERY computational intensive. You basically hit a steep quality wall, then it takes massively more compute to get it up above that wall even by a fraction of a dB.

If it supported perceptual metrics (where error in R,G,B is allowed to become a little unbalanced by approx. .25 - 1.5 dB, favoring G and R) it wouldn't have to try as hard because it would gain ~1.5 dB or more instantly before hitting the wall.

- First, the good news: The selector quantizer (see block_quant()) is using a clever algorithm: It dots the desired color by the subset's axis, converts that to a scaled int by rounding, clamps that to between [1,num_selectors-1], then it computes full squared euclidean error between the desired color and the subset's interpolated colors (s-1) and s. It only has to compute the full distance to 2 colors vs. all of them, which is cool.

I've compared this method vs. full distance to all colors and the results are super close (~1/1000th of a dB) on many images (but not all - I've seen .1 dB RGB PSNR loss on some images).

Now the bad news: The implementation is just bad. First, it recomputes the subset axis for every pixel (even though there are only 2 or 3 of them in BC7). And it uses multiple gather's to fetch the endpoints! This is done for all 16 pixels in the block - ouch! There's also a per-pixel divide in there.

Also, with good SIMD computing full distance to all subset colors isn't that expensive, at least for 4 and maybe 8 color blocks. I've implemented optimized forms of full search vs. ispc_texcomp's method. At least with AVX, all the fetches into the weighted_colors[] array (one for each lane) just slow the method down. Brute force leads to simpler code once vectorized and seems to slightly win out overall for 4 and 8 color blocks. With 16 color blocks the smarter method wins.

- After iterative refinement it doesn't have any more ways of improving quality. Trying to vary the selectors in keys ways (say by incrementing the lowest values and decrementing the highest values - to exploit extrapolation) and then LS optimizing the results helps a lot (.3-.5 dB) and is very fast if you SIMD optimize the trial solution evaluator function, yet it doesn't do that.

- Its mode 0 encoder suffers from a lot of quantization error - which is indicative of some weaknesses in its endpoint selection:

ispc_texcomp mode 0 only:

My encoder mode 0 only (no dithering - just stronger endpoint selection):


- ispc_texcomp is weak with grayscale images, by around .6-1.2 dB in my testing. Granted, once you're over ~60dB it doesn't matter much.

The "slow" profile is solidly in the quality "wall" region I described earlier. The basic and faster profiles are in much healthier regions.