Wednesday, April 25, 2018

A tale of multiple BC7 encoders

To get going with BC7, I found an open source block decompressor and first created a BC7 struct (using bitfields) to create a single mode 6 block. I filled in the fields for a simple mode 6 block, called the decompressor, and examined the output. (Turns out this decompressor has some bugs which I've reported to the author.)
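
As a rough illustration of that first step, here is a minimal sketch that packs a single mode 6 block. It uses explicit shifts rather than a bitfield struct, since bitfield layout is compiler-dependent. The names `Bc7Block`, `put_bits`, and `pack_mode6` are hypothetical and are not taken from my encoder or from any decompressor. It assumes the standard mode 6 layout: 7 mode bits, eight 7-bit endpoint components, two pbits, then 63 bits of indices (the first index is stored with only 3 bits).

```cpp
#include <cassert>
#include <cstdint>

// A 128-bit BC7 block held as two 64-bit halves (lo holds bits 0..63).
struct Bc7Block { uint64_t lo = 0, hi = 0; };

// Write the low `count` bits of `value` at bit position `pos` (LSB first) and advance pos.
static void put_bits(Bc7Block& blk, int& pos, uint32_t value, int count) {
    for (int i = 0; i < count; ++i, ++pos) {
        const uint64_t bit = (value >> i) & 1u;
        if (pos < 64) blk.lo |= bit << pos;
        else          blk.hi |= bit << (pos - 64);
    }
}

// ep[e][c]: 7-bit components for endpoint e (0/1) and channel c (R,G,B,A).
// pbit[e]: per-endpoint pbit. idx[16]: 4-bit indices; idx[0] must be < 8
// because its MSB is implicit (a real encoder swaps endpoints to guarantee this).
Bc7Block pack_mode6(const uint8_t ep[2][4], const uint8_t pbit[2], const uint8_t idx[16]) {
    assert(idx[0] < 8);
    Bc7Block blk;
    int pos = 0;
    put_bits(blk, pos, 1u << 6, 7);            // mode 6: six zero bits, then a one
    for (int c = 0; c < 4; ++c)                // order: R0 R1 G0 G1 B0 B1 A0 A1
        for (int e = 0; e < 2; ++e)
            put_bits(blk, pos, ep[e][c], 7);
    put_bits(blk, pos, pbit[0], 1);
    put_bits(blk, pos, pbit[1], 1);
    put_bits(blk, pos, idx[0], 3);             // anchor index: only 3 bits stored
    for (int i = 1; i < 16; ++i)
        put_bits(blk, pos, idx[i], 4);
    assert(pos == 128);
    return blk;
}
```

Filling both endpoints with 127, both pbits with 1, and all indices with 0 should decode to a solid opaque white block, which makes a handy first sanity check against a decompressor.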

For my first encoder I started in C++ and wrote a straightforward but feature complete encoder. It supports exhaustive partition scanning or it can estimate a single partition to evaluate which is around 12x faster. It's also pretty generic, so you can call the endpoint optimizer with any number of pixels (not just 16). About the only thing good you can say about this encoder is that it's high quality and easy to follow. Its performance is dreadful (even worse than DirectXTex, believe it or not) unless you force it to evaluate only a single partition per mode, in which case it's ~2.6x faster than DirectXTex.

I then wrote a new, simpler C++ encoder, then quickly ported this to C. For this simpler encoder, I focused on just the modes that mattered most to performance vs. quality. The endpoint optimizer was less generic and more focused, and I removed some less useful features in perceptual mode. I then added in additional modes to this encoder to bring its quality up vs. Intel's ispc_texcomp. After this, I took the C encoder and got it building with ispc, then profiled and optimized this over a period of a week or so.

1. Ridiculously slow C++ encoder, evaluate all partitions:
46286.7 secs, 51.56 Luma PSNR

2. Ridiculously slow C++ encoder, evaluate the single best partition only:
4615.5 secs, 51.35 Luma PSNR

3. Faster C encoder:
458.7 secs, 50.77 Luma PSNR

4. Vectorized encoder - C encoder ported to ispc then further optimized and enhanced:
76.4 secs, 51.16 Luma PSNR

Notice each time I got roughly a 6-10x performance increase at similar quality. The ispc version is currently 6x faster than the C version. To get it there I had to examine the code generated for all the inner loops and tweak the ispc code for decent code generation. Without doing this it was only around 2x faster.

I used the previous encoder as a reference design during each step of this optimization process, which was invaluable for checking for regressions.

Some things about BC7 that are commonly believed but aren't really true, and other things I've learned along the way (along with some light ranting but I mean well):

- "BC7 has this massive search space - it takes forever to encode!" Actually, with a properly written encoder it's not that big of a deal. You can very quickly estimate which partition(s) to examine (which is an extremely vectorizable operation), you can drop half the modes for alpha textures, mode 7 is useless with opaque textures, the 3 subset modes can be dropped unless you want absolute max quality, and many fast endpoint optimization methods from BC1 are usable with BC7. BC1-style cluster fit isn't a good match for BC7 (because it has 3 and 4 bit indices), but you can easily approximate it.

For a fast opaque-only encoder, my statistics show that supporting just modes 1 and 6, or 0/1/6, results in reasonably high quality without noticeable block artifacts.

- CPU encoders are too slow, so let's use the GPU!
Actually, a properly vectorized CPU encoder can be extremely fast. There is nothing special about BC7 that requires compute shaders if you do it correctly. Using gradient descent or calling rand() in your encoder's inner loop are red flags that you are going off the rails. The search space is massive but most of it is useless.

You need the right algorithms, and strong heuristics matter in GPU texture compression. Write a BC1 encoder that can compete against squish first, then tackle BC7.

- Graphics dev BC7 folklore: "3 subset modes are useless!"
No, they aren't. ispc_texcomp and my encoder make heavy use of these modes. The perf reduction for checking the 3 subset modes is minimal but the PSNR gain from using them is high:

Across my 31 texture corpus, enabling all modes (except 7) gets 46.72 dB, while disabling the two 3 subset modes gets 45.96 dB. This is a significant difference. Some timings: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled).

Of the publicly available BC7 encoders I've tested or looked at so far none of them have strong or properly working mode 0 encoders. Mode 0 is particularly tough to handle correctly because of its 4-bit/component endpoints with pbits. Most encoders mess up their pbit handling so mode 0 appears much weaker than it really is. 

- Excluding ispc_texcomp, I've been amazed at how weak the open source CPU/GPU BC7 encoders are. Every single encoder I've looked at except for ispc_texcomp has pbit handling issues or bugs holding them back. DirectXTex, Volition, NVTT - all use pbit voting/parity without properly rounding the component value, which is incorrect and introduces significant error. DirectXTex had outright pbit bugs causing it to return lower quality textures when 3 subset modes were enabled.

Note to encoder authors: You can't estimate error for a trial solution without factoring in the pbits, or your encoder will make very bad choices that actually increase the solution error. Those LSB's are valuable, and you can't flip them without adjusting the components too.

- None of the available encoders support perceptual colorspace metrics (such as computing weighted error in YCbCr space like etc2comp does), which is hugely valuable in BC7. You can get up to 2 dB gain from switching to perceptual metrics, which means your encoder doesn't have to work nearly as hard to compete against a non-perceptual encoder.

Before you write a BC1 or BC7 encoder, you should probably write a simple JPEG compressor first to learn all about chroma subsampling, etc.
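
To make the perceptual metric idea a little more concrete, here is a minimal sketch of a weighted YCbCr-style squared error between two colors. Everything specific in it is an assumption for illustration: the function name, the BT.709 luma coefficients, the unscaled chroma differences, and especially the weights (1.0 for luma, 0.25 for chroma). It is not the weighting used by etc2comp or by my encoder.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb8 { uint8_t r, g, b; };

// Weighted squared error in a YCbCr-like space (hypothetical weights).
// The RGB -> YCbCr transform is linear, so it can be applied directly
// to the per-channel differences.
inline float weighted_ycbcr_sq_error(Rgb8 a, Rgb8 b,
                                     float w_luma = 1.0f, float w_chroma = 0.25f) {
    assert(w_luma >= 0.0f && w_chroma >= 0.0f);
    const float dr = float(a.r) - float(b.r);
    const float dg = float(a.g) - float(b.g);
    const float db = float(a.b) - float(b.b);
    const float dy  = 0.2126f * dr + 0.7152f * dg + 0.0722f * db; // luma difference (BT.709, assumed)
    const float dcb = db - dy;                                    // unscaled chroma differences
    const float dcr = dr - dy;
    return w_luma * dy * dy + w_chroma * (dcb * dcb + dcr * dcr);
}
```

With weights like these, an error of 10 levels in green costs far more than the same error in blue, so the encoder spends its effort where it is most visible.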

- SSIM and PSNR are highly correlated for basic 4x4 block compression. I'll blog about this soon. I test with both, but honestly it's just not valuable to do so because they are so correlated and SSIM is more expensive to compute.

- Whoever wrote ispc_texcomp at Intel should be given a big bonus because it's overall very good. Without it the BC7/BC6H situation would be pretty bad.

Comparing ispc_texcomp alpha performance vs. my encoder

Most papers and encoders focus on opaque performance with BC7, but alpha textures are very important too. BC7's alpha support is somewhat weaker than opaque, especially with alpha signals that are uncorrelated vs. RGB. It only has a single 2 subset mode that can handle alpha, with limited color precision (555.1 with 2-bit indices).

This test exercises each codec in a different way than my previous opaque-only tests. Modes 0-3 are useless with transparent blocks.

I've finished the alpha path in my new non-RDO BC7 encoder. Results on a 4k test texture containing random 4x4 blocks (both in RGB and A) picked from thousands of textures in my corpus:

My encoder - higher quality settings:
Time: 42.4 secs
RGB: 46.058 RGB Avg. PSNR
A: 44.521 PSNR

My encoder - lower quality settings:
Time: 21.9 secs
RGB: 45.986
A: 44.539

My encoder - faster settings:
Time: 12.8 secs
RGB: 45.850
A: 43.949

ispc_texcomp slow:
Time: 155.4 secs
RGB: 45.826
A: 44.216

ispc_texcomp basic: 
Time: 45.6 secs
RGB: 45.820
A: 44.188

ispc_texcomp fast:
Time: 23.3 secs
RGB: 45.647
A: 44.307

This was with linear colorspace metrics, and benchmarking was on a single thread. The RGB/A stats are PSNR (higher is better). (In case you're wondering, SSIM and PSNR are highly correlated with block compression, so I usually use PSNR until I start doing whole-texture stuff with RDO.)

I'm now ~2x faster at higher quality vs. the fastest CPU BC7 encoder I'm aware of, and there are several easy optimizations left. And this is before enabling perceptual metrics.

BC7 mode utilization comparison of three encoders

I've been doing some benchmarking today to see where I stand with raw (non-RDO) BC7 encoding. Depending on the profile, I'm up to 2.26x faster at higher average quality using linear colorspace metrics vs. ispc_texcomp.

BC7 mode utilization, ispc_texcomp's basic profile, with my encoder set to a roughly similar profile:


Here's ispc_texcomp's slow profile, my encoder at a higher quality profile, and DirectXTex with BC_FLAGS_USE_3SUBSETS (with the pbit bug fixed so this is a fair comparison).


After modifying DirectXTex to try mode 6 first it gets slightly faster and uses mode 6 much more often:


This was a multithreaded test. The timings are the overall amount of CPU time utilized only for encoding across all threads.

My encoder favors mode 6 on grayscale inputs (one large texture is grayscale), and it's always the first mode that's checked for opaque blocks so on simple blocks mode 6 gets favored. Mode 6 has very good endpoint precision (7777.1) and large 4-bit indices, so even on complex blocks it's fairly good.

GPU texture compression error metrics

While working on an encoder I conduct a lot of experiments (probably thousands over time) to improve it. To check for regressions or silly bugs, you need some sort of error metrics; otherwise it's impossible to make forward progress just by examining the output by eye.

Here's what I commonly use:

Total time: 0.124466, encode-only CPU time: 3.288052
Compressed size: 360041, 7.325053 bpp

BC7 Mode histogram:
6 22 2 0 3809 2754 17936 0

RGB Total   Error: Max:  53, Mean: 5.220, MSE: 21.670, RMSE: 4.655, PSNR: 34.772, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

RGB Average Error: Max:  53, Mean: 1.740, MSE: 7.223, RMSE: 2.688, PSNR: 39.543, SSIM: 0.985854, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Luma        Error: Max:  35, Mean: 1.123, MSE: 3.048, RMSE: 1.746, PSNR: 43.290, SSIM: 0.993128, CRCA: 0x351EFAEF CRCB: 0xA29288A7

Red         Error: Max:  52, Mean: 1.766, MSE: 7.544, RMSE: 2.747, PSNR: 39.355, SSIM: 0.987541, CRCA: 0xC673D5EA CRCB: 0x8623AA59

Green       Error: Max:  40, Mean: 1.553, MSE: 5.462, RMSE: 2.337, PSNR: 40.758, SSIM: 0.989642, CRCA: 0x2BF4294F CRCB: 0x418E5EE1

Blue        Error: Max:  53, Mean: 1.900, MSE: 8.664, RMSE: 2.944, PSNR: 38.753, SSIM: 0.980377, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Alpha       Error: Max:  31, Mean: 1.039, MSE: 3.308, RMSE: 1.819, PSNR: 42.935, SSIM: 0.976243, CRCA: 0x524049DA CRCB: 0x7492A642

I report some timings (two timings because it may be a threaded encode), the compressed size of the data (using LZMA) to get a feel for the encoder's output entropy, the mode histogram on those modes that support this concept, and then a bunch of metrics. 

I report a bunch of variations of PSNR (RGB Total and RGB Average) because there really are no rock solid standards for this and every author or tool seems to do this in a different way. I sometimes also use RGBA PSNR to compare my results against other tools.
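
Here is a minimal sketch of how the two RGB variants relate, assuming interleaved 8-bit RGB buffers. The function names are hypothetical. The definitions are inferred from the numbers printed above (the RGB Total MSE of 21.670 is 3x the RGB Average MSE of 7.223), so treat them as my reading of the report rather than a standard: "Total" sums the squared error of all three channels per pixel, "Average" also divides by the channel count.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// PSNR in dB for 8-bit samples. The 100 dB cap for identical data is an arbitrary choice.
inline double psnr_from_mse(double mse) {
    if (mse <= 0.0) return 100.0;
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}

// Squared error summed over R, G and B for interleaved RGB8 buffers.
inline double rgb_sq_error(const uint8_t* a, const uint8_t* b, std::size_t num_pixels) {
    double sum = 0.0;
    for (std::size_t i = 0; i < num_pixels * 3; ++i) {
        const double d = double(a[i]) - double(b[i]);
        sum += d * d;
    }
    return sum;
}

// "RGB Total": summed error divided by the number of pixels.
inline double rgb_total_mse(const uint8_t* a, const uint8_t* b, std::size_t num_pixels) {
    assert(num_pixels > 0);
    return rgb_sq_error(a, b, num_pixels) / double(num_pixels);
}

// "RGB Average": summed error divided by the number of samples (3 per pixel).
inline double rgb_average_mse(const uint8_t* a, const uint8_t* b, std::size_t num_pixels) {
    assert(num_pixels > 0);
    return rgb_sq_error(a, b, num_pixels) / double(num_pixels * 3);
}
```

Plugging the reported MSE values into `psnr_from_mse` reproduces the reported PSNRs to within rounding (7.223 gives roughly 39.54 dB, 21.670 gives roughly 34.77 dB), which is why two tools can disagree by almost 5 dB on the same image just by picking a different divisor.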

Sometimes, an encoder bug will subtly break just a few blocks. That's where the Max errors are really handy.

Luma error is what I pay attention to during perceptual encoding tests, and RGB Average for non-perceptual tests. SSIM and PSNR are highly correlated in my experience for block compression work, which I'll show in a future blog post.

I also use PSNR per second or SSIM per second metrics (or some variation).

The CRC's are there to detect when the output (or input) data is exactly the same across runs.

I have a tool which can compare two encoders across a large corpus of test textures, which is really handy for finding regressions.

I also test with large textures containing random blocks gathered from thousands of test textures. For example (blogger.com has resampled it, here's the original 4K PNG):


These corpus textures can be generated using crunch in the corpus_gen mode.

I've found that testing like this is a reliable way of gauging the overall strength of an encoder vs. another. Graphics devs will instantly say "but wait why aren't you looking at minimizing block artifacts by looking at adjacent blocks!" That turns block compression into a global optimization problem (like PVRTC), and it's already hard enough as it is to do in a practical amount of CPU time. Also with BC7, the error is already pretty low (45-60 dB), and BC1-style block artifacts in a properly written BC7 encoder at high quality settings are uncommon.

Monday, April 23, 2018

LZHAM decompressor vectorization

This is just a sketch of an idea. It might not pan out; there are too many details to know until I try it. LZHAM is an old codec now, but maybe I could stretch it a bit further before moving on to a new design.

It should be possible to decode multiple segments of each LZHAM block simultaneously with vector instructions. Porting the decoder to C, then vectorizing the decoder loop with SPMD (using ispc) shouldn't be that hard. The compressed stream will need some hints so it knows where each lane needs to start decoding from.

So for 8 parallel decodes, at the beginning of the block you memcpy() the arithmetic and Huffman statistics into 8 copies, run the SPMD decoder on say 1-2K bytes per lane, then sync the statistics after 8 1-2K blocks are processed. Then repeat the process.

Lane 0 will be able to process match copies from any previously decompressed bytes, but lane 1 can't access lane 0's decoded bytes, and lane 2 can't access lane 1/0's, etc. That'll definitely lower ratio. However, it should be possible for the compressor to simulate lane 0's decompressor and predict which bytes are available in lane 1 at each point in the compressed data stream.

The various if() statements in the decoder's inner loop won't be particularly coherent, which would impact lane efficiency. But even a 2-3x improvement in decoding speed would be pretty nice.

DirectXTex BC7 3 subset weirdness

I brought Microsoft's DirectXTex project (latest code) into my test project, to see how it fares vs. ispc_texcomp and my encoder. Unfortunately, it appears broken. Across 31 test images (kodim and others):

DirectXTex BC7:

No flags (0): 9972.6 secs 44.41 dB

BC_FLAGS_USE_3SUBSETS: 13534.6 secs 44.25 dB

This is wrong. Quality should go up with 3 subset modes enabled, not down. I'm tempted to go figure out what's wrong in there myself, but it's a lot of code.

By comparison, my ispc encoder gets 477.1 secs 46.72 dB (using high quality settings). ispc_texcomp is in the same ballpark. With 3 subset modes disabled, I get 45.96 dB (as expected - the 3 subset modes are useful!).

I verified that the flag is doing something. Here's the BC7 mode histogram for kodim01 with the flags parameter to DirectX::D3DXEncodeBC7() set to 0:

0 17968 0 1752 0 0 4856 0

With 3 subsets enabled:

1435 16647 26 1675 0 0 4793 0

The source looks nice and readable, and as a library it was dead-simple to get it building and linked against my stuff. But it doesn't appear to be a production-ready encoder, it's more like a sample.

I'm calling it from multiple threads using OpenMP (it's too slow to benchmark otherwise). It makes my 20 core Xeon workstation crawl for a while, it's that slow.

Also, there is no need to disable 3 subset modes in a properly written encoder. Some timings with my encoder: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled). This was on a 20 core Xeon workstation/40 threads/31 images (kodim+others). 

The extra cost of a 3 subset mode isn't a big deal (3 endpoint optimizations) once you've estimated which partition(s) to examine more deeply. Partition estimation is fast and simple with SIMD, and a nice property of 3 subset modes is that the # of pixels fed to the endpoint optimizer per subset is rather low (enabling 3-subset specific optimizations). If your pbit handling is correct these modes are quite valuable.


Sunday, April 22, 2018

Proper pbit computation in the BC7 texture format

The BC7 GPU texture format supports the clever concept of endpoint "pbits", where the LSB's of RGB(A) endpoints are forced to the same value so only 1 bit (vs. 3 or 4) needs to be coded. BC7's use of pbits saves precious bits which can be used for other things which decrease error more. Some modes support a unique pbit per endpoint, and some only have 1 pbit for each endpoint pair.

I'm covering this because the majority of available BC7 encoders mess this important detail up. (I now kinda doubt any available BC7 encoder handles this correctly in all modes.) The overall difference across a 31 texture corpus (including the kodim images) is .26 dB RGB PSNR, which is quite a lot considering the CPU/GPU cost of doing this correctly vs. incorrectly is the same. (The improvement is even greater if you try all pbit combinations with proper rounding: .4 dB.)

ispc_texcomp handles this correctly for sure in most if not all modes, while the DirectXTex CPU, Volition GPU, and NVidia Texture Tool encoders don't as far as I can tell (they use pbit voting without proper rounding - the worst). The difference to doing this correctly in some modes is pretty significant - by at least ~.6 dB in mode 0!

Not doing this properly means your encoder will run slower because it will have to work harder (scanning more of the search space) to keep PSNR up vs. the competition. The amount of compute involved in lifting a BC7 encoder "up" by .26 dB across an entire test corpus is actually quite large, because there's a very steep quality vs. compute "wall" in BC7.

Here are some of the ways p-bits can be computed. The RGB PSNR's were for a single encoded image (kodim18), purposely limited to mode 0 (with 4 bit components+unique per-endpoint pbits) to emphasize the importance of doing this correctly:
  • 40.217 dB: pbit search+compensated rounding: Compute properly rounded component endpoints compensating for the chosen pbit, try all pbit combinations. This is an encoder option in my new BC7 encoder. Encoding error evaluation cost: 2X or 4X (expensive!)
  • 39.663 dB: Round to middle of component bin+pbit search: Compute rounded endpoints (with a scale factor not including the pbit), then evaluate the full error of all 2 or 4 pbit possibilities. This is what I initially started doing, because it's trivial to code. In mode 0, you scale by 2^4, round, then iterate through all the pbits and test the error of each combination. Error evaluation cost: 2X or 4X
  • 39.431 dB: Compensated rounding (ispc_texcomp method): Proper quantization interval rounding, factoring in the shift introduced when the pbit is 1 vs. 0. The key idea: If an endpoint scaled and rounded to full-precision (with a scale factoring in the pbit) has an LSB which differs from the pbit actually encoded, you must properly round the output component value to compensate for the different LSB or you will introduce more error than necessary. So basically, if the LSB you want is different from what's encoded, you need to correctly round the encoded component index to compensate for this difference. You must also factor in the way the hardware scales up the encoded component+pbit to 8-bits. Error evaluation cost: 1X
  • 39.151 dB: Voting/parity (DirectXTex/Volition): Count how many endpoint components in the scaled colors (with a scale factor including the pbit) sharing each pbit have set LSB's. If half or more do then set the encoded pbit to 1, otherwise 0. pbit voting doesn't round the encoded components so it introduces a lot of error. Error evaluation cost: 1X
  • 38.712 dB: Always 0 or 0,1
  • 37.878 dB: Always 0 or 0,0
I tested a few different ways of breaking ties when computing pbits by voting and just reported the best one. At least on this image biasing the high endpoint towards 1 helps a little:

Shared  Unique
> >    39.053
> >=   39.151
>= >   38.996
>= >=  38.432
>=  > >=   39.151

This stuff is surprisingly tricky to get right, so here's a mode 0 example to illustrate what's going on. Each component can be coded to 16 possible values with one pbit selecting between two different ramps. So factoring in the pbit we have 32 possible representable 8-bit values. Here are the resulting 8-bit values (scaled using the shift+or method BC7 uses - not pure arithmetic scaling by 255/31 which is slightly different):

pbit 0:
0
16
33
49
66
82
99
115
132
148
165
181
198
214
231
247

pbit 1:
8
24
41
57
74
90
107
123
140
156
173
189
206
222
239
255

Let's say the encoder wants to encode an endpoint value of 9/255 (using 8-bit precision) in mode 0 (4-bit components+pbit). The pbit voting encoders will compute a quantized/scaled component value of 1/31 (from a range of [0,31] - not [0,15] because we're factoring in the pbit). The LSB is 1 and the encoded component index (the top 4 MSB's) is 0, and if more than half of the other component LSB's are also 1 we're ok. In the good case we're coding a value of 8/255, which is closer to 9/255 than 24/255.

If instead a pbit of 0 is chosen, we're now encoding a value of 0/255 (because the encoded component index of 0 wasn't compensated), when we should have chosen the closer value of 16/255 (i.e. a component index of 1). Choosing the wrong LSB and not compensating the index has resulted in increased error.
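
The 9/255 example can be checked with a few lines of code. This is a minimal sketch with hypothetical function names. The "compensated" version uses a brute-force search over the 16 bins purely for clarity, which is not how a fast encoder would do it, and the "uncompensated" version only mimics the index selection of a voting-style encoder (quantize to 5 bits, keep the top 4).

```cpp
#include <cassert>
#include <cstdlib>

// Expand a mode 0 component (4-bit bin + pbit) to 8 bits using shift+or.
inline int mode0_dequant(int bin, int pbit) {
    assert(bin >= 0 && bin < 16 && (pbit == 0 || pbit == 1));
    const int v8 = ((bin << 1) | pbit) << 3;
    return v8 | (v8 >> 5);
}

// Voting-style index: round to [0,31], then keep the top 4 bits as the bin,
// regardless of which pbit ends up being encoded.
inline int quant_uncompensated(int value8) {
    const int v5 = (value8 * 31 + 127) / 255;
    return v5 >> 1;
}

// Compensated index: given the pbit that will actually be encoded,
// pick the bin whose decoded value is closest to the target.
inline int quant_compensated(int value8, int forced_pbit) {
    int best_bin = 0, best_err = 256;
    for (int bin = 0; bin < 16; ++bin) {
        const int err = std::abs(mode0_dequant(bin, forced_pbit) - value8);
        if (err < best_err) { best_err = err; best_bin = bin; }
    }
    return best_bin;
}
```

For a target of 9 with the pbit forced to 0, the uncompensated path decodes to 0 (error 9) while the compensated path picks bin 1 and decodes to 16 (error 7). With a pbit of 1 both paths decode to 8.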

There's an additional bit of complexity to all this: The hardware scales the mode 0 index+pbit up to 8-bits by shifting the index+pbit left by 3 bits for mode 0, then it takes the upper 3 MSB's of this and logically or's them into the lower 3 bits to fill in. This isn't quite the same as scaling by 255/31. So proper pbit determination code needs to factor this in, too. Here are the ramps computed using arithmetic scaling+rounding, pbit 0 first and then pbit 1 (notice they slightly differ from the previous ramps computed using shifting+or'ing):

0
16
33
49
66
82
99
115
132
148
165
181
197
214
230
247

8
25
41
58
74
90
107
123
140
156
173
189
206
222
239
255
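
Both sets of ramps above can be regenerated with two small functions. This is a minimal sketch with hypothetical names: one function follows the shift+or expansion described in the text, the other does plain arithmetic scaling by 255/31 with rounding.

```cpp
#include <cassert>

// Shift+or expansion: move the 5-bit (bin, pbit) value to the top of the byte,
// then replicate its top 3 bits into the low 3 bits.
inline int mode0_expand_hw(int bin, int pbit) {
    assert(bin >= 0 && bin < 16 && (pbit == 0 || pbit == 1));
    const int v8 = ((bin << 1) | pbit) << 3;
    return v8 | (v8 >> 5);
}

// Arithmetic expansion: integer form of round(v5 * 255 / 31).
inline int mode0_expand_arith(int bin, int pbit) {
    assert(bin >= 0 && bin < 16 && (pbit == 0 || pbit == 1));
    const int v5 = (bin << 1) | pbit;
    return (2 * v5 * 255 + 31) / 62;
}
```

The two agree on most entries but not all. For example, bin 12 with pbit 0 gives 198 with shift+or and 197 arithmetically, and bin 1 with pbit 1 gives 24 vs. 25.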

I worked out the formulas involved on a piece of paper: 

How to compute [0,1] values from mode 0 bins+pbits (using arithmetic scaling, NOT hardware scaling):
pbit 0: value=bin*2/31
pbit 1: value=(bin*2+1)/31

How to compute mode 0 bins from [0,1] values with proper compensation/rounding (rearranging the equations+rounding) for each pbit index:
pbit 0: bin=floor(value*31/2+.5)
pbit 1: bin=floor((value*31-1)/2+.5)
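
Written out as code, the two pairs of equations look like this. It's a minimal sketch with hypothetical names, using arithmetic scaling only (not the hardware shift+or expansion). The clamp to [0,15] is my addition: without it, a value of 1.0 with pbit 0 would round up to bin 16, which doesn't exist.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// [0,1] value represented by a mode 0 bin and pbit (arithmetic scaling).
inline double mode0_value(int bin, int pbit) {
    assert(bin >= 0 && bin < 16 && (pbit == 0 || pbit == 1));
    return (bin * 2 + pbit) / 31.0;
}

// Nearest mode 0 bin for a [0,1] value, given the pbit that will be encoded.
inline int mode0_bin(double value, int pbit) {
    assert(pbit == 0 || pbit == 1);
    const int bin = (int)std::floor((value * 31.0 - pbit) / 2.0 + 0.5);
    return std::min(std::max(bin, 0), 15);   // keep the result in the 4-bit range
}
```

For the earlier 9/255 example, `mode0_bin(9 / 255.0, 0)` gives bin 1 and `mode0_bin(9 / 255.0, 1)` gives bin 0, so the chosen bin depends on the pbit, which is exactly the compensation the voting encoders skip.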

Here's the clever code in ispc_texcomp that handles this correctly for modes with unique p-bits (modes 0,3,6,7). The bin calculations (the line that computes v in the innermost loop) are slightly optimized forms of the previous set of equations.

I believe there's actually a bug in here for mode 7 - I don't see it scaling the component values up to 8-bit bytes for this mode. It has a special case in there to handle mode 0, and modes 3/6 don't need scaling because they have 7-bit components, but mode 7 has 5-bit components. I didn't check the rest of the code to see if it actually handles mode 7 elsewhere, but it's possible ispc_texcomp's handling of mode 7 is actually broken due to this bug. Mode 7 isn't valuable when encoding opaque textures, but is pretty valuable for alpha textures because it's the only alpha mode that supports partitions.

///////////////////////////
// endpoint quantization

inline int unpack_to_byte(int v, uniform const int bits)
{
    assert(bits >= 4);
    int vv = v << (8 - bits);
    return vv + shift_right(vv, bits);
}

void ep_quant0367(int qep[], float ep[], uniform int mode, uniform int channels)
{
    uniform int bits = 7;
    if (mode == 0) bits = 4;
    if (mode == 7) bits = 5;

    uniform int levels = 1 << bits;
    uniform int levels2 = levels * 2 - 1;

    for (uniform int i = 0; i < 2; i++)
    {
        int qep_b[8];

        for (uniform int b = 0; b < 2; b++)
            for (uniform int p = 0; p < 4; p++)
            {
                int v = (int)((ep[i * 4 + p] / 255f*levels2 - b) / 2 + 0.5) * 2 + b;
                qep_b[b * 4 + p] = clamp(v, b, levels2 - 1 + b);
            }

        float ep_b[8];
        for (uniform int j = 0; j < 8; j++)
            ep_b[j] = qep_b[j];

        if (mode == 0)
            for (uniform int j = 0; j < 8; j++)
                ep_b[j] = unpack_to_byte(qep_b[j], 5);

        float err0 = 0f;
        float err1 = 0f;
        for (uniform int p = 0; p < channels; p++)
        {
            err0 += sq(ep[i * 4 + p] - ep_b[0 + p]);
            err1 += sq(ep[i * 4 + p] - ep_b[4 + p]);
        }

        for (uniform int p = 0; p < 4; p++)
            qep[i * 4 + p] = (err0 < err1) ? qep_b[0 + p] : qep_b[4 + p];
    }
}