Friday, April 27, 2018

BC7 showdown #2: Basis ispc vs. NVidia Texture Tools vs. ispc_texcomp slow

Got my BC7 encoder test app linking with NVTT. The BC7 encoder in NVTT is the granddaddy of them all, from what I understand. It's painfully slow but very high quality. I called it using the blob of code below. It supports weighted metrics, which is great.

Anyhow, here are the test results using linear RGB metrics (non-perceptual), comparing NVTT and ispc_texcomp vs. Basis non-RDO BC7 ispc. The test corpus was kodim01-24 plus a number of other images/textures I commonly test with. I turned up the Basis BC7 encoder options to the highest currently supported.

Basis BC7:         365.5 secs   47.162389 dB
NVTT:              28061.9 secs 47.141580 dB
ispc_texcomp slow: 353.6 secs   46.769749 dB

This was a multithreaded test (using OpenMP) on a dual Xeon workstation supporting AVX.

Here's the code snippet for calling NVTT's AVPCL encoder directly to pack BC7 blocks (bypassing the rest of NVTT because I don't want to pack entire textures, just blocks):

#include "../3rdparty/nvidia-texture-tools/src/bc7/avpcl.h"

void nvtt_bc7_compress(void *pBlock, const uint8_t *pPixels, bool perceptual)
AVPCL::mode_rgb = false;
AVPCL::flag_premult = false; //(alphaMode == AlphaMode_Premultiplied);
AVPCL::flag_nonuniform = false;
AVPCL::flag_nonuniform_ati = perceptual;

// Convert NVTT's tile struct to AVPCL's.
AVPCL::Tile avpclTile(4, 4);
memset(, 0, sizeof(;

for (uint y = 0; y < 4; ++y) 
for (uint x = 0; x < 4; ++x) 
nv::Vector4 &p =[y][x];

p.x = pPixels[0];
p.y = pPixels[1];
p.z = pPixels[2];
p.w = pPixels[3];

pPixels += 4;

avpclTile.importance_map[y][x] = 1.0f; //weights[4*y+x];

AVPCL::compress(avpclTile, (char *)pBlock);

Thursday, April 26, 2018

New BC7 encoder open sourced

bc7m16.c/.h is one of the strongest CPU BC7 encoders available at the moment for opaque textures:

In perceptual mode using luma PSNR on opaque textures it beats ispc_texcomp, without vectorization and only using 2 modes.

Using linear RGB metrics it's around 20% slower at same PSNR (compared to ispc_texcomp's fast profile). I don't suggest you use linear RGB metrics because the results visually look worse vs. perceptual metrics, as we learned years ago with BC1 encoding.

This encoder was basically the prototype for my new non-RDO ispc BC7 encoder I wrote for Basis, which supports all the modes and is even faster.

Wednesday, April 25, 2018

A tale of multiple BC7 encoders

To get going with BC7, I found an open source block decompressor and first created a BC7 struct (using bitfields) to create a single mode 6 block. I filled in the fields for a simple mode 6 block, called the decompressor, and examined the output. (Turns out this decompressor has some bugs which I've reported to the author.)

For my first encoder I started in C++ and wrote a straightforward but feature complete encoder. It supports exhaustive partition scanning or it can estimate a single partition to evaluate which is around 12x faster. It's also pretty generic, so you can call the endpoint optimizer with any number of pixels (not just 16). About the only thing good you can say about this encoder is that it's high quality and easy to follow. Its performance is dreadful (even worse then DirectXTex, believe it or not) unless you force it to evaluate only a single partition per mode in which case it's ~2.6x faster than DirectXTex.

I then wrote a new, simpler C++ encoder, then quickly ported this to C. For this simpler encoder, I focused on just the modes that mattered most to performance vs. quality. The endpoint optimizer was less generic and more focused, and I removed some less useful features in perceptual mode. I then added in additional modes to this encoder to bring its quality up vs. Intel's ispc_texcomp. After this, I took the C encoder and got it building with iscp, then profiled and optimized this over a period of a week or so.

1. Ridiculously slow C++ encoder, evaluate all partitions:
46286.7 secs, 51.56 Luma PSNR

2. Ridiculously slow C++ encoder, evaluate the single best partition only:
4615.5 secs, 51.35 Luma PSNR

3. Faster C encoder:
458.7 secs, 50.77 Luma PSNR

4. Vectorized encoder - C encoder ported to ispc then further optimized and enhanced:
76.4 secs, 51.16 Luma PSNR

Notice each time I got roughly a 6-10x performance increase at similar quality. The iscp version is currently 6x faster than the C version. To get it there I had to examine the code generated for all the inner loops and tweak the iscp code for decent code generation. Without doing this it was only around 2x faster.

I used the previous encoder as a reference design during each step of this optimization process, which was invaluable for checking for regressions.

Some things about BC7 that are commonly believed but aren't really true, and other things I've learned along the way (along with some light ranting but I mean well):

- "BC7 has this massive search space - it takes forever to encode!" Actually, with a properly written encoder it's not that big of a deal. You can very quickly estimate which partition(s) to examine (which is an extremely vectorizable operation), you can drop half the modes for alpha textures, mode 7 is useless with opaque textures, the 3 subset modes can be dropped unless you want absolute max quality, and many fast endpoint optimization methods from BC1 are usable with BC7. BC1-style cluster fit isn't a good match for BC7 (because it has 3 and 4 bit indices), but you can easily approximate it.

For a fast opaque-only encoder, my statistics show that supporting just modes 1 and 6, or 0/1/6, results in reasonably high quality without noticeable block artifacts.

- CPU encoders are too slow, so let's use the GPU!
Actually, a properly vectorized CPU encoder can be extremely fast. There is nothing special about BC7 that requires compute shaders if you do it correctly. Using gradient descent or calling rand() in your encoder's inner loop are red flags that you are going off the rails. The search space is massive but most of it is useless.

You need the right algorithms, and strong heuristics matter in GPU texture compression. Write a BC1 encoder that can compete against squish first, then tackle BC7.

- Graphics dev BC7 folklore: "3 subset modes are useless!"
No, they aren't. iscp_texcomp and my encoder make heavy use of these modes. The perf reduction for checking the 3 subset modes is minimal but the PSNR gain from using them is high:

Across my 31 texture corpus, all modes (but 7) gets 46.72 dB, disabling the two 3 subset modes I get 45.96 dB. This is a significant difference. Some timings: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled).

Of the publicly available BC7 encoders I've tested or looked at so far none of them have strong or properly working mode 0 encoders. Mode 0 is particularly tough to handle correctly because of its 4-bit/component endpoints with pbits. Most encoders mess up their pbit handling so mode 0 appears much weaker than it really is. 

- Excluding iscp_texcomp, I've been amazed at how weak the open source CPU/GPU BC7 encoders are. Every single encoder I've looked at except for iscp_texcomp has pbit handling issues or bugs holding them back. DirectXTex, Volition, NVTT - all use pbit voting/parity without properly rounding the component value, which is incorrect and introduces significant error. DirectXTex had outright pbit bugs causing it to return lower quality textures when 3 subset modes were enabled.

Note to encoder authors: You can't estimate error for a trial solution without factoring in the pbits, or your encoder will make very bad choices that actually increase the solution error. Those LSB's are valuable, and you can't flip them without adjusting the components too.

- None of the available encoders support perceptual colorspace metrics (such as computing weighted error in YCbCr space like etc2comp does), which is hugely valuable in BC7. You can get up to 2 dB gain from switching to perceptual metrics, which means your encoder doesn't have to work nearly as hard to compete against a non-perceptual encoder.

Before you write a BC1 or BC7 encoder, you should probably write a simple JPEG compressor first to learn all about chroma subsampling, etc.

- SSIM and PSNR are highly correlated for basic 4x4 block compression. I'll blog about this soon. I test with both, but honestly it's just not valuable to do so because they are so correlated and SSIM is more expensive to compute.

- Whoever wrote iscp_texcomp at Intel should be given a big bonus because it's overall very good. Without it the BC7/BC6H situation would be pretty bad.

Comparing ispc_texcomp alpha performance vs. my encoder

Most papers and encoders focus on opaque performance with BC7, but alpha textures are very important too. BC7's alpha support is somewhat weaker than opaque, especially with alpha signals that are uncorrelated vs. RGB. It only has a single 2 subset mode that can handle alpha, with limited color precision (555.1 with 2-bit indices).

This test exercises each codec in a different way than my previous opaque-only tests. Modes 0-3 are useless with transparent blocks.

I've finished the alpha path in my new non-RDO BC7 encoder. Results on a 4k test texture containing random 4x4 blocks (both in RGB and A) picked from thousands of textures in my corpus:

My encoder - higher quality settings:
Time: 42.4 secs
RGB: 46.058 RGB Avg. PSNR
A: 44.521 PSNR

My encoder - lower quality settings:
Time: 21.9 secs
RGB: 45.986
A: 44.539

My encoder - faster settings:
Time: 12.8 secs
RGB: 45.850
A: 43.949

ispc_texcomp slow:
Time: 155.4 secs
RGB: 45.826
A: 44.216

ispc_texcomp basic: 
Time: 45.6 secs
RGB: 45.820
A: 44.188

ispc_texcomp fast:
Time: 23.3 secs
RGB: 45.647
A: 44.307

This was with linear colorspace metrics, and benchmarking was on a single thread. The RGB/A stats are PSNR (higher is better). (In case you're wondering, SSIM and PSNR are highly correlated with block compression, so I usually use PSNR until I start doing whole-texture stuff with RDO.)

I'm now ~2x faster at higher quality vs. the fastest CPU BC7 encoder I'm aware of, and there are several easy optimizations left. And this is before enabling perceptual metrics.

BC7 mode utilization comparison of three encoders

I've been doing some benchmarking today to see where I stand with raw (non-RDO) BC7 encoding. Depending on the profile, I'm up to 2.26x faster at higher average quality using linear colorspace metrics vs. ispc_texcomp.

BC7 mode utilization, ispc_texcomp's basic profile, with my encoder set to a roughly similar profile:

Here's ispc_texcomp's slow profile, my encoder at a higher quality profile, and DirectXTex with BC_FLAGS_USE_3SUBSETS (with the pbit bug fixed so this is a fair comparison).

After modifying DirectXTex to try mode 6 first it get slightly faster and uses mode 6 much more often:

This was a multithreaded test. The timings are the overall amount of CPU time utilized only for encoding across all threads.

My encoder favors mode 6 on grayscale inputs (one large texture is grayscale), and it's always the first mode that's checked for opaque blocks so on simple blocks mode 6 gets favored. Mode 6 has very good endpoint precision (7777.1) and large 4-bit indices, so even on complex blocks it's fairly good.

GPU texture compression error metrics

While working on an encoder I conduct a lot of experiments (probably thousands over time) to improve it. To check for regressions or silly bugs, you must use some sort of error metrics otherwise it'll be impossible to make forward progress just by examining the output with your eye.

Here's what I commonly use:

Total time: 0.124466, encode-only CPU time: 3.288052
Compressed size: 360041, 7.325053 bpp

BC7 Mode histogram:
6 22 2 0 3809 2754 17936 0

RGB Total   Error: Max:  53, Mean: 5.220, MSE: 21.670, RMSE: 4.655, PSNR: 34.772, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

RGB Average Error: Max:  53, Mean: 1.740, MSE: 7.223, RMSE: 2.688, PSNR: 39.543, SSIM: 0.985854, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Luma        Error: Max:  35, Mean: 1.123, MSE: 3.048, RMSE: 1.746, PSNR: 43.290, SSIM: 0.993128, CRCA: 0x351EFAEF CRCB: 0xA29288A7

Red         Error: Max:  52, Mean: 1.766, MSE: 7.544, RMSE: 2.747, PSNR: 39.355, SSIM: 0.987541, CRCA: 0xC673D5EA CRCB: 0x8623AA59

Green       Error: Max:  40, Mean: 1.553, MSE: 5.462, RMSE: 2.337, PSNR: 40.758, SSIM: 0.989642, CRCA: 0x2BF4294F CRCB: 0x418E5EE1

Blue        Error: Max:  53, Mean: 1.900, MSE: 8.664, RMSE: 2.944, PSNR: 38.753, SSIM: 0.980377, CRCA: 0x76C0BC5D CRCB: 0x9A5C47D1

Alpha       Error: Max:  31, Mean: 1.039, MSE: 3.308, RMSE: 1.819, PSNR: 42.935, SSIM: 0.976243, CRCA: 0x524049DA CRCB: 0x7492A642

I report some timings (two timings because it may be a threaded encode), the compressed size of the data (using LZMA) to get a feel for the encoder's output entropy, the mode histogram on those modes that support this concept, and then a bunch of metrics. 

I report a bunch of variations of PSNR (RGB Total and RGB Average) because there really are no rock solid standards for this and every author or tool seems to do this in a different way. I sometimes also use RGBA PSNR to compare my results against other tools.

Sometimes, an encoder bug will subtly break just a few blocks. That's where the Max errors are really handy.

Luma error is what I pay attention to during perceptual encoding tests, and RGB Average for non-perceptual tests. SSIM and PSNR are highly correlated in my experience for block compression work, which I'll show in a future blog post.

I also use PSNR per second or SSIM per second metrics (or some variation).

The CRC's are there to detect when the output (or input) data is exactly the same across runs.

I have a tool which can compare two encoders across a large corpus of test textures, which is really handy for finding regressions.

I also test with large textures containing random blocks gathered from thousands of test textures. For example ( has resampled it, here's the original 4K PNG):

These corpus textures can be generated using crunch in the corpus_gen mode.

I've found that testing like this is a reliable way of gauging the overall strength of an encoder vs. another. Graphics devs will instantly say "but wait why aren't you looking at minimizing block artifacts by looking at adjacent blocks!" That turns block compression into a global optimization problem (like PVRTC), and it's already hard enough as it is to do in a practical amount of CPU time. Also with BC7, the error is already pretty low (45-60 dB), and BC1-style block artifacts in a properly written BC7 encoder at high quality settings are uncommon.

Monday, April 23, 2018

LZHAM decompressor vectorization

The is just a sketch of an idea. It might not pan out, there are too many details to know until I try it. LZHAM is an old codec now, but maybe I could stretch it a bit further before moving on to a new design.

It should be possible to decode multiple segments of each LZHAM block simultaneously with vector instructions. Porting the decoder to C, then vectorizing the decoder loop with SPMD (using iscp) shouldn't be that hard. The compressed stream will need some hints so it knows where each lane needs to start decoding from.

So for 8 parallel decodes, at the beginning of the block you memcpy() the arithmetic and Huffman statistics into 8 copies, run the SPMD decoder on say 1-2K bytes per lane, then sync the statistics after 8 1-2K blocks are processed. Then repeat the process.

Lane 0 will be able to process match copies from any previously decompressed bytes, but lane 1 can't access lane 0's decoded bytes, and lane 2 can't access lane 1/0's, etc. That'll definitely lower ratio. However, it should be possible for the compressor to simulate lane 0's decompressor and predict which bytes are available in lane 1 at each point in the compressed data stream.

The various if() statements in the decoder's inner loop won't be particularly coherent, which would impact lane efficiency. But even a 2-3x improvement in decoding speed would be pretty nice.

DirectXTex BC7 3 subset wierdness

I brought Microsoft's DirectXTex project (latest code) into my test project, to see how it fairs vs. ispc_texcomp and my encoder. Unfortunately, it appears broken. Across 31 test images (kodim and others):

DirectXTex BC7:

No flags (0): 9972.6 secs 44.41 dB

BC_FLAGS_USE_3SUBSETS: 13534.6 secs 44.25 dB

This is wrong. Quality should go up with 3 subset modes enabled, not down. I'm tempted to go figure out what's wrong in there myself, but it's a lot of code.

By comparison, my ispc encoder gets 477.1 secs 46.72 dB (using high quality settings). ispc_texcomp is in the same ballpark. With 3 subset modes disabled, I get 45.96 dB (as expected - the 3 subset modes are useful!).

I verified that the flag is doing something. Here's the BC7 mode histogram for kodim01 with the flags parameter to DirectX::D3DXEncodeBC7() set to 0:

0 17968 0 1752 0 0 4856 0

With 3 subsets enabled:

1435 16647 26 1675 0 0 4793 0

The source looks nice and readable, and as a library it was dead-simple to get it building and linked against my stuff. But it doesn't appear to be a production-ready encoder, it's more like a sample.

I'm calling it from multiple threads using OpenMP (it's too slow to benchmark otherwise). It makes my 20 core Xeon workstation crawl for a while, it's that slow.

Also, there is no need to disable 3 subset modes in a properly written encoder. Some timings with my encoder: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled). This was on a 20 core Xeon workstation/40 threads/31 images (kodim+others). 

The extra cost of a 3 subset mode isn't a big deal (3 endpoint optimizations) once you've estimated which partition(s) to examine more deeply. Partition estimation is fast and simple with SIMD, and a nice property of 3 subset modes is that the # of pixels fed to the endpoint optimizer per subset is rather low (enabling 3-subset specific optimizations). If your pbit handling is correct these modes are quite valuable.

Sunday, April 22, 2018

Proper pbit computation in the BC7 texture format

The BC7 GPU texture format supports the clever concept of endpoint "pbits", where the LSB's of RGB(A) endpoints are forced to the same value so only 1 bit (vs. 3 or 4) needs to be coded. BC7's use of pbits saves precious bits which can be used for other things which decrease error more. Some modes support a unique pbit per endpoint, and some only have 1 pbit for each endpoint pair.

I'm covering this because the majority of available BC7 encoders mess this important detail up. (I now kinda doubt any available BC7 encoder handles this correctly in all modes.) The overall difference across a 31 texture corpus (including the kodim images) is .26 dB RGB PSNR, which is quite a lot considering the CPU/GPU cost of doing this correctly vs. incorrectly is the same. (The improvement is even greater if you try all pbit combinations with proper rounding: .4 dB.)

ispc_texcomp handles this correctly for sure in most if not all modes, while the DirectXTex CPU, Volition GPU, and NVidia Texture Tool encoders don't as far as I can tell (they use pbit voting without proper rounding - the worst). The difference to doing this correctly in some modes is pretty significant - by at least ~.6 dB in mode 0!

Not doing this properly means your encoder will run slower because it will have to work harder (scanning more of the search space) to keep PSNR up vs. the competition. The amount of compute involved in lifting a BC7 encoder "up" by .26 dB across an entire test corpus is actually quite large, because there's a very steep quality vs. compute "wall" in BC7.

Here are some of the ways p-bits can be computed. The RGB PSNR's were for a single encoded image (kodim18), purposely limited to mode 0 (with 4 bit components+unique per-endpoint pbits) to emphasize the importance of doing this correctly:
  • 40.217 dB: pbit search+compensated rounding: Compute properly rounded component endpoints compensating for the chosen pbit, try all pbit combinations. This is an encoder option in my new BC7 encoder. Encoding error evaluation cost: 2X or 4X (expensive!)
  • 39.663 dB: Round to middle of component bin+pbit search: Compute rounded endpoints (with a scale factor not including the pbit), then evaluate the full error of all 2 or 4 pbit possibilities. This is what I initially started doing, because it's trivial to code. In mode 0, you scale by 2^4, round, then iterate through all the pbits and test the error of each combination. Error evaluation cost: 2X or 4X
  • 39.431 dB: Compensated rounding (ispc_texcomp method): Proper quantization interval rounding, factoring in the shift introduced when the pbit is 1 vs. 0. The key idea: If an endpoint scaled and rounded to full-precision (with a scale factoring in the pbit) has an LSB which differs from the pbit actually encoded, you must properly round the output component value to compensate for the different LSB or you will introduce more error than necessary. So basically, if the LSB you want is different from what's encoded, you need to correctly round the encoded component index to compensate for this difference. You must also factor in the way the hardware scales up the encoded component+pbit to 8-bits. Error evaluation cost: 1X
  • 39.151 dB: Voting/parity (DirectXTex/Volition): Count how many endpoint components in the scaled colors (with a scale factor including the pbit) sharing each pbit have set LSB's. If half or more do then set the encoded pbit to 1, otherwise 0. pbit voting doesn't round the encoded components so it introduces a lot of error. Error evaluation cost: 1X
  • 38.712 dB: Always 0 or 0,1
  • 37.878: Always 0 or 0,0
I tested a few different ways of breaking ties when computing pbits by voting and just reported the best one. At least on this image biasing the high endpoint towards 1 helps a little:

Shared  Unique
> >    39.053
> >=   39.151
>= >   38.996
>= >=  38.432
>=  > >=   39.151

This stuff is surprisingly tricky to get right, so here's a mode 0 example to illustrate what's going on. Each component can be coded to 16 possible values with one pbit selecting between two different ramps. So factoring in the pbit we have 32 possible representable 8-bit values. Here are the resulting 8-bit values (scaled using the shift+or method BC7 uses - not pure arithmetic scaling by 255/31 which is slightly different):

pbit 0:

pbit 1:

Let's say the encoder wants to encode an endpoint value of 9/255 (using 8-bit precision) in mode 0 (4-bit components+pbit). The pbit voting encoders will compute a quantized/scaled component value of 1/31 (from a range of [0,31] - not [0,15] because we're factoring in the pbit). The LSB is 1 and the encoded component index (the top 4 MSB's) is 0, and if more than half of the other component LSB's are also 1 we're ok. In the good case we're coding a value of 8/255, which is closer to 9/255 than 24/255.

If instead a pbit of 0 is chosen, we're now encoding a value of 0/255 (because the encoded component index of 0 wasn't compensated), when we should have chosen the closer value of 16/255 (i.e. a component index of 1). Choosing the wrong LSB and not compensating the index has resulted in increased error.

There's an additional bit of complexity to all this: The hardware scales the mode 0 index+pbit up to 8-bits by shifting the index+pbit left by 3 bits for mode 0, then it takes the upper 3 MSB's of this and logically or's them into the lower 3 bits to fill in. This isn't quite the same as scaling by 255/31. So proper pbit determination code needs to factor this in, too. Here are the ramps computed using arithmetic scaling+rounding (notice they slightly differ from the previous ramps computed using shifting+or'ing):



I worked out the formulas involved on a piece of paper: 

How to compute [0,1] values from mode 0 bins+pbits (using arithmetic scaling, NOT hardware scaling):
pbit 0: value=bin*2/31
pbit 1: value=(bin*2+1)/31

How to compute mode 0 bins from [0,1] values with proper compensation/rounding (rearranging the equations+rounding) for each pbit index:
pbit 0: bin=floor(value*31/2+.5)
pbit 1: bin=floor((value*31-1)/2+.5)

Here's the clever code in ispc_texcomp that handles this correctly for modes with unique p-bits (modes 0,3,6,7). I bolded the bin calculations, which are slightly optimized forms of the previous set of equations. 

I believe there's actually a bug in here for mode 7 - I don't see it scaling the component values up to 8-bit bytes for this mode. It has a special case in there to handle mode 0, and modes 3/6 don't need scaling because they have 7-bit components, but mode 7 has 5-bit components. I didn't check the rest of the code to see if it actually handles mode 7 elsewhere, but it's possible ispc_texcomp's handling of mode 7 is actually broken due to this bug. Mode 7 isn't valuable when encoding opaque textures, but is pretty valuable for alpha textures because it's the only alpha mode that supports partitions.

// endpoint quantization

inline int unpack_to_byte(int v, uniform const int bits)
    assert(bits >= 4);
    int vv = v << (8 - bits);
    return vv + shift_right(vv, bits);

void ep_quant0367(int qep[], float ep[], uniform int mode, uniform int channels)
    uniform int bits = 7;
    if (mode == 0) bits = 4;
    if (mode == 7) bits = 5;

    uniform int levels = 1 << bits;
    uniform int levels2 = levels * 2 - 1;

    for (uniform int i = 0; i < 2; i++)
        int qep_b[8];

        for (uniform int b = 0; b < 2; b++)
            for (uniform int p = 0; p < 4; p++)
                int v = (int)((ep[i * 4 + p] / 255f*levels2 - b) / 2 + 0.5) * 2 + b;
                qep_b[b * 4 + p] = clamp(v, b, levels2 - 1 + b);

        float ep_b[8];
        for (uniform int j = 0; j < 8; j++)
            ep_b[j] = qep_b[j];

        if (mode == 0)
            for (uniform int j = 0; j < 8; j++)
                ep_b[j] = unpack_to_byte(qep_b[j], 5);

        float err0 = 0f;
        float err1 = 0f;
        for (uniform int p = 0; p < channels; p++)
            err0 += sq(ep[i * 4 + p] - ep_b[0 + p]);
            err1 += sq(ep[i * 4 + p] - ep_b[4 + p]);

        for (uniform int p = 0; p < 4; p++)
            qep[i * 4 + p] = (err0 < err1) ? qep_b[0 + p] : qep_b[4 + p];

Saturday, April 21, 2018

RDO BC7 encoder planning

I'm building my first RDO BC7/BC6H encoders for our product (Basis), and I'm trying to decide which BC7 modes to implement first. I've decided on modes 1+6 for opaque textures. For alpha, I've boiled down the options to modes 1+5+6, 1+4+6, or maybe 1+4+6+7.

For alpha I know I'll need either 4 and/or 5 because they are the only modes with uncorrelated alpha support. Mode 1 isn't optional because it supports 2 subsets, greatly reducing ugly block artifacts. The 3 subset modes aren't used much in practice and aren't necessary. Mode 6 is optional but boosts quality on simple blocks due to its awesome 7777.1 bit endpoint precision and beefy 4-bit selectors. The bare minimum modes for a decent opaque/alpha encoder seems to be 1+4 or 1+5, but adding 6 and 7 will improve quality especially with alpha textures.

Others have suggested that I do a RDO BC7 encoder using only a single subset mode (6 seems perfect), but the result would have BC1-style block artifacts. I've already built a 2 subset RDO encoder for ETC1 in Basis and it isn't that much harder to support 2 subsets vs. 1.

RDO GPU texture encoders try to optimize the encoded output data so that, after the output is LZ compressed (and the output pretty much always is!), you get as much quality per compressed bit as you can. The output bit patterns matter a lot. If you can reuse a bit pattern that's occurred in a nearby block, without impacting quality too much, an LZ codec can exploit this to issue a match vs. literals (saving bits and improving quality per compressed output bit). There's more to it than that, because you also want the ability to control the output quality vs. compressed size tradeoff in an artist-friendly way.

My RDO BC1 encoder in Basis uses Lagrangian optimization, and this is simplified because in BC1 the bit patterns nicely line up to byte boundaries which modern LZ codecs like. The endpoints and selectors aren't munged together, and a large chunk of the output entropy is in the selector data.

Anyhow, here's some of the data I used to make these decisions about BC7. I have a lot more data than this which I may post later.

image corpus was kodim+7 others
For comparison ispc_basic (using 7 modes) gets 46.54 dB
Encoder was set to extremely high quality (better than ispc_slow)

Single mode:

mode enctime   dB               
1    83        45.8 
3    91.2      43.4 
6    12.1      42.9 
2    12.9      42.7 
0    19.6      42.34
4    38.56     41.48
5    14.4      38.45 

Key mode combinations:
1+6  86.37     46.27
1+5  11.161    45.79
1+6  21.9      45.7 
1+4  137.6     45.38

This data definitely isn't ideal, because I used my ispc encoder which was heavily tuned for performance. So modes 1 and 3 tried all 64 partitions in this test, while modes 0 and 2 only tried the best predicted partition. But it matches up with earlier data I generated from a more brute force encoder.

Note that the mode 4 and 5 encoders used all rotation/index selector options, which skews the output a bit. I doubt I'll be supporting these mode 4/5 options. Two 1+6 entries are for the encoder set to very high vs. much lower quality levels.

Some earlier encoding error data for just kodim18, using an earlier C++ version of my encoder:

Best of any mode error: 12988.968550

Best of modes 0 and 1: 13186.750320
Best of modes 1 and 6: 13440.390471
Best of modes 1 and 2: 13521.855790
Best of modes 1 and 4: 13565.064541
Best of modes 1 and 3: 13589.143902
Best of modes 1 and 5: 13592.709517
Best of modes 1 and 7: 13604.182592
Best of modes 1 and 1: 13605.033774
Best of modes 0 and 6: 14099.723969
Best of modes 0 and 3: 15046.616630
Best of modes 2 and 6: 15383.155463
Best of modes 0 and 4: 15563.136445
Best of modes 0 and 5: 15665.245609
Best of modes 0 and 2: 15892.424359
Best of modes 3 and 6: 15977.876955

Best of modes 0 and 1 and 6: 13037.149688
Best of modes 0 and 1 and 4: 13160.175075
Best of modes 0 and 1 and 2: 13168.570462
Best of modes 0 and 1 and 3: 13171.807469
Best of modes 0 and 1 and 5: 13176.588633
Best of modes 0 and 1 and 7: 13186.504616
Best of modes 0 and 0 and 1: 13186.750320
Best of modes 0 and 1 and 1: 13186.750320
Best of modes 1 and 2 and 6: 13360.493404
Best of modes 1 and 4 and 6: 13400.774007
Best of modes 1 and 5 and 6: 13429.100640
Best of modes 1 and 3 and 6: 13433.822985
Best of modes 1 and 6 and 7: 13439.954762
Best of modes 1 and 1 and 6: 13440.390471
Best of modes 1 and 6 and 6: 13440.390471
Best of modes 1 and 2 and 4: 13489.904966
Best of modes 1 and 2 and 3: 13508.004738
Best of modes 1 and 2 and 5: 13511.406144
Best of modes 1 and 2 and 2: 13521.855790
Best of modes 1 and 1 and 2: 13521.855790

The 1+6 combination is pretty powerful. Only a quarter dB separates it from ispc_basic on this corpus. Also, mode 1's bit packing is nicely aligned to bytes (more or less) - perfect for RDO encoding!

byte bits
0    2 6  mode+partition
1    6 2  endpoints
2    4 4  endpoints
3    2 6  endpoints
4    6 2  endpoints
5    4 4  endpoints
6    2 6  endpoints
7    6 2  endpoints
8    4 4  endpoints
9    2 6  endpoints
10   2 6  pbits+selectors
11   8    selectors
12   8    selectors
13   8    selectors 
14   8    selectors
15   8    selectors

Note that most of mode 1's output is endpoint data, not selector data, so partition/pbit biasing, endpoint quantization and bit pattern optimization will be critical.

BC7 showdown: ispc_texcomp vs. my ispc encoder

This benchmark compares the Fast ISPC Texture Compressor's BC7 encoder vs. my new ispc vectorized encoder. I've just barely begun to profile and optimize it, but it's already looking really strong. To create this encoder, I studied all other available BC7 encoders and leveraged all the things I learned while creating crunch's BC1 high-quality encoder and Basic's new ETC1 and universal format encoders. This is a non-RDO encoding test, i.e. what matters here is how much quality each encoder can achieve per unit time.

It was conducted on a 20 core Xeon workstation across 31 test images (first 24 are the kodim images). Both use AVX instructions. The quality metric is RGB average PSNR, perceptual mode disabled. The PSNR's below are averages across the entire set. The timings are the total amount of CPU time used only calling the encoder functions (across all threads). OpenMP was used for threading, and each encoder was called with 64 blocks per function call.

I'm currently focusing on ispc_texcomp's basic and slow profiles:

basic profile: 100.5 secs, 46.54 dB, .4631 dB/sec
slow profile: 355.29 secs, 46.77 dB, .1316 dB/sec

My encoder:
uber 0: 56.7 secs, 46.49 dB, .8199 dB/sec
uber 1: 86.4 secs, 46.72 dB, .5407 dB/sec
uber 2: 129.1 secs, 46.79 dB, .3624 dB/sec
uber 2 (2 refinement passes, 16 max 1,3 partitions): 161.9 secs, 46.84 dB, .2893 dB/sec
uber 3 (2 refinement passes, 32 max 1,3 partitions, pbit search): 215.2 secs, 46.91 dB, .2180 dB/sec
uber 4 (2 refinement passes, 64 max 1,3 partitions, pbit search): : 292.5 secs, 46.96 dB, .1605 dB/sec

The dB/sec. values are a simple measure of encoder efficiency. ispc_texcomp's slow profile at .1315 dB/sec. is working very hard for very little quality per unit time compared to its basic profile. The efficiency of both encoders decreases as the quality is improved, but ispc_texcomp falls off very rapidly above basic and mine falls off later. I believe a whole texture encoder like etc2comp's can more efficiently get through the quality barrier here.

What this boils down to: If you use ispc_texcomp, definitely avoid the slow profile (the tiny gain in quality isn't worth it). And it's definitely possible to compete against ispc_texcomp using plain RGB metrics.

Friday, April 20, 2018

ispc_texcomp BC7 issues

Been studying ispc_texcomp today to better understand why it's so slow compared to my encoder (currently by a factor of 2x at max quality). We do many of the same things, so why is it slower? Overall, there are many clever/smart things in there (overall it's surprisingly well done!), but it's being held back by weak vectorization, some missing optimizations, and it only supports linear RGB metrics.

Here's what I've found so far:

- The inner loops are bogged down with gathers and scatters. Definitely not good. There's even a set of helper functions at the top with a comment of "(perf warning expected)". (Umm - the compiler perf warnings are there for a reason!) For an example, check out block_quant().

The innermost loops should not have gathers, period.

- The partition estimation code is using full PCA. I've found this to be unnecessary in my testing - just using Waveren's bounding box approximation works well and is trivially vectorizable. After all, we're not trying to compute the actual output, just a reasonable approximation.

So ispc_texcomp goes into overkill mode and computes PCA while estimating the partition. At least the way it computes each subset's PCA is smart: it first computes the overall block's statistics/covariance, then it subtracts out the statistics of each partition's active (masked) pixels to compute each subset's individual covar.

Also, it's only computing what looks like an upper bound on the error from the block statistics, not an approximation of the actual error. The approximation of the actual error (factoring in quantization to 4/8/16 selectors) is extremely fast to compute with SIMD code, so it's not clear to me what's better yet.

Overall, the author seems to be favoring cleverness vs. exploiting the properties of fast but simple SIMD code.

- It uses squish-style iterative refinement after choosing the partition: Basically, it computes the PCA, comes up with some initial selectors, uses least squares to optimize the endpoints, then it computes new selectors and tries all over again a few times. In my tests, the PSNR gain from this method is too low (fraction of a dB) to justify the repeated LS computation and selector selection costs. Just do it once and then optionally try other things. It's more effective to just vary the selectors in simple ways (simplified cluster fit) in each trial.

- There's no support for perceptual colorspace metrics in there. This indirectly impacts performance (against other codecs that do support perceptual metrics) because it's stuck competing against RGB PSNR, and getting RGB PSNR up in BC7 is VERY computational intensive. You basically hit a steep quality wall, then it takes massively more compute to get it up above that wall even by a fraction of a dB.

If it supported perceptual metrics (where error in R,G,B is allowed to become a little unbalanced by approx. .25 - 1.5 dB, favoring G and R) it wouldn't have to try as hard because it would gain ~1.5 dB or more instantly before hitting the wall.

- First, the good news: The selector quantizer (see block_quant()) is using a clever algorithm: It dots the desired color by the subset's axis, converts that to a scaled int by rounding, clamps that to between [1,num_selectors-1], then it computes full squared euclidean error between the desired color and the subset's interpolated colors (s-1) and s. It only has to compute the full distance to 2 colors vs. all of them, which is cool.

I've compared this method vs. full distance to all colors and the results are super close (~1/1000th of a dB) on many images (but not all - I've seen .1 dB RGB PSNR loss on some images).

Now the bad news: The implementation is just bad. First, it recomputes the subset axis for every pixel (even though there are only 2 or 3 of them in BC7). And it uses multiple gather's to fetch the endpoints! This is done for all 16 pixels in the block - ouch! There's also a per-pixel divide in there.

Also, with good SIMD computing full distance to all subset colors isn't that expensive, at least for 4 and maybe 8 color blocks. I've implemented optimized forms of full search vs. ispc_texcomp's method. At least with AVX, all the fetches into the weighted_colors[] array (one for each lane) just slow the method down. Brute force leads to simpler code once vectorized and seems to slightly win out overall for 4 and 8 color blocks. With 16 color blocks the smarter method wins.

- After iterative refinement it doesn't have any more ways of improving quality. Trying to vary the selectors in keys ways (say by incrementing the lowest values and decrementing the highest values - to exploit extrapolation) and then LS optimizing the results helps a lot (.3-.5 dB) and is very fast if you SIMD optimize the trial solution evaluator function, yet it doesn't do that.

- Its mode 0 encoder suffers from a lot of quantization error - which is indicative of some weaknesses in its endpoint selection:

ispc_texcomp mode 0 only:

My encoder mode 0 only (no dithering - just stronger endpoint selection):

- ispc_texcomp is weak with grayscale images, by around .6-1.2 dB in my testing. Granted, once you're over ~60dB it doesn't matter much.

The "slow" profile is solidly in the quality "wall" region I described earlier. The basic and faster profiles are in much healthier regions.

A few Intel SPMD Compiler (ispc) C porting tips

I took notes as I was porting my new BC7 encoder from C to ispc. First, be sure to read and re-read the user guideperformance guide, and FAQ. This compiler tech kicks ass and I hope Intel keeps throwing resources at it. My initial port of 3k lines of C and initial stabs at vectorization was only ~2x faster, but after tuning the inner loops perf. shot up to over 5x vs. regular C code (on AVX). All without having to use a single ugly intrinsic instruction.

I'm new to ispc so hopefully I haven't made any mistakes below, but here's what I learned during the process:

If you're not starting from scratch, port your code to plain C with minimal dependencies and test that first. Make some simple helper functions like clamp(), min(), etc. that look like the ispc standard lib's, so when you do get to ispc you can easily switch them over to use the stdlib's (which is very important for performance).

Then port your C code to ispc, but put "uniform" on ALL variables and pointers. Test that and make sure it still works. In my experience so far you should have minimal problems at this stage assuming you put uniforms everywhere. Now you can dive into vectorizing the thing. I would first figure out how things should be laid out in memory and go from there. You may be able to just vectorize the hotspots, or you may have to vectorize the entire thing (like I did which was hours of messing around with uniform/varying keywords).

While developing and recompiling over and over again, change your --target option to only target one specific instruction set temporarily: --target=avx (or SSE2, etc.). There's little use targeting a bunch of different instruction sets (SSE, SSE2, AVX, AVX2, etc.) while developing new code, at this point all you care about is getting it working correctly.

The mental model is like shaders but for the CPU. Conceptually, the entire program gang executes each instruction, but the results can be masked off on a per-lane basis. If you are comfortable with shaders you will get this model immediately. Just beware there's a lot of variability in the CPU cost of operations, and optimal code sequences can be dramatically faster than slower ones. Study the generated assembly of your hotspots in the debugger and experiment. CPU SIMD instruction sets seem more brittle than ones for GPU's (why?).

A single pointer deref can hide a super expensive gather or scatter. Don't ignore the compiler warnings. These warnings are critical and can help you understand what the compiler is actually doing with your code. Examine every gather and scatter and understand why the compiler is doing them. If these operations are in your hotspots/inner loops then rethink how your data is stored in memory. (I can't emphasize this enough - scatters and gathers kill perf. unless you are lucky enough to have a Xeon Phi.)

varying and uniform take on godlike properties in ispc. You must master them. A "varying struct" means the struct is internally expanded to contain X values for each member (one each for the size of the gang). sizeof(uniform struct) != sizeof(varying struct). While porting I had to check, recheck, and check again all uniform and varying keywords everywhere in my code.

You need to master pointers with ispc, which are definitely tricky at first. The pointee is uniform by default, but the pointer itself is varying by default which isn't always what you want. "varying struct *uniform ptr" is a uniform pointer to a varying struct (read it right to left). In most cases, I wanted varying struct's and uniform pointers to them.

Find all memset/memmove/memcpy's and examine them extremely closely. In many cases, they won't work as expected after vectorization. Check all sizeof()'s too. The compiler won't always give you warnings when you do something obviously dumb. In most cases I just replaced them with hand-rolled loops to copy/initialize the values, because once you switch to varying types all bets are off if a memset() will do the expected thing.

Sometimes, code sequences in vectorized code just don't work right. I had some code that inserted an element into a sorted list, that wouldn't work right until I rearranged it. Maybe it was something silly I did, but it pays to litter your code with assert()'s until you get things working.

assert()'s aren't automatically disabled in release, you must use "--opt=disable-assertions" to turn them off. assert()'s in vectorized code can be quite slow. The compiler should probably warn you about assert()'s when optimizations are enabled.

print("%", var); is how you print things (not "%u" or "%f" etc.). Double parentheses around the value means the lane was masked out. If using Visual Studio I wouldn't fully trust the locals window when debugging - use print().

Once you start vectorizing, either the compiler is going to crash, or the compiler is going to generate function prologs that immediately crash. Both events are unfortunately going to happen until it's more mature. For the func. prolog crashes, in most if not all cases this was due to a mismatch between the varying/uniform attributes of the passed in pointers to functions that didn't cause compiler errors or warnings. Check and double check your varying and uniform attributes on your pointers. Fix your function parameters until the crash goes away. These were quite painful early on. To help track them down, #if 0 out large sections of code until it works, then slowly bring code in until it fails.

The latest version of ispc (1.9.2) supports limited debugging with Visual Studio. Examining struct's with bool's doesn't seem to work, the locals window is very iffy but more or less works. Single stepping works. Profiling works but seems a little iffy.

If you start to really fight the compiler on a store somewhere, you've probably got something wrong with your varying/uniform keywords. Rethink your data and how your code manipulates it.

If you're just starting a port and are new to ispc, and you wind up with a "varying varying" pointer then it's ok to be paranoid. It's probably not really what you want.

Experienced obvious codegen issues with uniform shifts and logical or's of uint16 values. Once I casted them to uint32's the problems went away. Be wary of integer shifts, which I had issues with in a few spots.

Some very general/hand-wavy recommendations with vectorized code: Prefer SP math over DP. Prefer FP math over integer math. Prefer 32-bit integer math over 64-bit. Prefer signed vs. unsigned integers. Prefer FP math vs. looking stuff up from tables if using the tables requires gathering. Avoid uint64's. Prefer 32-bit int math intermediates vs. 8-bit. Prefer simpler algorithms that load from constant array entries in a table (so all lanes lookup at the same location in the table), vs. more complex algorithms that require table lookups with unique per-lane indices.

Study stdlib.ispc. Prefer stdlib's clamp() vs. custom functions, and prefer stdlib vs. your own stuff for min, max, etc. The compiler apparently will not divine that what you are doing is just a clamp, you should use the stdlib functions to get good SIMD code.

Use uniform's as much as you possibly can. Prefer to make loop iterators uniform by default. Make loop iterators uniform by default when you start iterating at 0, even if the high loop limit is varying.

Use cif() etc. on conditionals which will strongly be taken or not taken by the entire gang. Compilation can get noticeably slower as you switch to cif().

A few min's or max's and some boolean/bit twiddling ops can be much faster than the equivalent multiple if() statements. Study the SSE2 etc. instruction sets because there are some powerful things in there. Prefer building code out of helpers like select() from the stdlib for performance.

Things that usually make perfect sense in CPU code, like early outs, may actually just hurt you with SIMD code. If your early out checks have to check all lanes, and it's an uncommon early out, consider just removing or rethinking them.

Tuesday, April 17, 2018

BC7 encoding using weighted YCbCr colorspace metrics

I've written my second BC7 block encoder. My first was written in a straightforward way to gain experience with the format. My second was more focused on competing against the Fast ISPC Texture Compressor, but without using any SIMD, and was over 30x faster than my first attempt.

The BC7 encoders I've studied seem to be hyper focused on RGB PSNR metrics, which is just the wrong metric for many types of textures. Encoding authors that treat input textures as opaque arrays of 4x4 vectors are at a disadvantage in this domain. RGB PSNR tends to spread the error equally between the channels, which isn't what we want on sRGB textures. Instead, it's desirable to tradeoff a small amount of additional R/B error for less G error. This is what perceptual codecs like JPEG do: they transform the input into YCbCr space, then downsample and quantize the hell out of the CbCr coefficients because preserving chroma is a waste of bits.

Many other BC1 block compression codecs support weighted RGB metrics because in BC1 not doing so visually looks worse on sRGB photos/albedo textures/etc. Encoders using perceptual metrics look better on color gradients and with highly saturated blocks. Heavy usage of perceptual metrics dates back to at least NVidia's original nvdxt compressor, and it wasn't possible for crunch to compete against nvdxt without supporting perceptual metrics. The squish library recommends using perceptual metrics by default, because BC1 without perceptual metrics looks worse.

Anyhow, etc2comp by John Brooks takes things a step further and supports computing error metrics in weighted YCbCr space. Compared to vanilla RGB weighted metrics, this looks better in my experience writing Basis (especially with ETC1). I'm currently using weights (128,64,16).

Here's the REC 709 luma PSNR of 31 test textures encoded with ispc_texcomp (slow/highest quality - uses 7 modes) and my non-SIMD encoder in perceptual mode using just 4 modes:

The overall average PSNR for ispc_texcomp was 48.57, mine was 50.4. Even with ispc_texcomp's massive mode and SIMD advantages it does worse on this metric. ispc_texcomp doesn't support optimizing for perceptual metrics, which puts it at a huge disadvantage on many texture types.

I re-encoded the textures with linear metrics. My encoder used 6 modes: 0, 1, 3, 4, 5, and 6 (including all component rotations and the index flag).

ispc_texcomp's average PSNR was 46.77, mine was 46.50. My encoder can easily bridge this ~.25 dB gap (by using more modes and trying more partitions), but at a time penalty.

Note that ispc_texcomp in its best/slowest profile is pretty slow, and is much easier to compete against without SIMD code. It's just trying way too hard. It's faster in its lower quality "basic" profile, but it still doesn't support perceptual metrics so it'll continue to fight up a very steep hill.

For benchmarking, I ran each encoder in a single thread, and called ispc_texcomp with 64 blocks at a time.

Other findings: ispc_texcomp has a very weak mode 0 encoder, and it's weaker than it should be on grayscale textures. I'll blog examples soon.

Wednesday, April 4, 2018

Imaginary GPU formats

Every once in a while I wonder about alternative GPU texture format encodings. (Why not? It's fun.) There must be a sweet spot somewhere along the continuum between BC1 and BC7. Something that is more complex than BC1 but simpler than BC7. (I somewhat dislike ASTC, mostly because of its insanely complex encoding format.)

Here's one idea for an 128-bit per 4x4 block format (8 bits/texel) that mashes together ETC1+BC7. One thing I learned from ETC1 is that a lot of bits can be saved by forcing each subset's principle axis to always lie along the intensity direction. With a strong encoder, this constraint isn't as bad as one would think.

The format only has two modes: opaque and transparent. The opaque mode has 3 subsets, and the transparent mode has 2 subsets for RGB and 1 subset for alpha. Each color has 1 shared pbit, and each mode has 16 partitions for colors.

The color encoding is "RGB PBit IntensityTable". The intensity tables could be borrowed from ETC1 and expanded to 8 entries. For the transparent blocks, two 8-bit alpha values are specified (like BC4), and by borrowing degeneracy breaking from BC7 we can shave one bit from the alpha selectors. "CompRot" is a BC7-style component rotation, so any of the channels can be encoded into alpha.

Some things I like about this format: equal precision for all components, and there are only two simple modes. The opaque mode is powerful but simple: always 3 subsets, with color and selector precision better than BC1 and even better than BC7's 3 subset modes. The transparent mode is more powerful than BC3 for RGB (better color precision, and 2 subsets), but weaker for alpha (2 bit selectors vs. 3).

The main downside is that each subset's endpoints are constrained to lie along the intensity axis. I've seen commercial games ship with normal maps encoded into ETC1 and DXT1 so I know this isn't a total deal breaker.

Opaque block:
ModeBit    1 
Partition  4
Color0     777 1 3 
Color1     777 1 3 
Color2     777 1 3

Color selectors;
3 3 3 3
3 3 3 3
3 3 3 3
3 3 3 3

Total bits: 128

Transparent block:
ModeBit    1 
Partition  4
Color0     666 3 
Color1     666 3 
AlphaLoHi  8 8
CompRot    2

Color selectors:
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2

Alpha selectors:
1 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2

Total bits: 128

A strong encoder would adaptively choose between opaque blocks and transparent blocks using various component rotations, to minimize overall error. Transparent blocks can be used even on all-opaque textures.

I have no idea if this format is useful. On a rainy day I'll make a simple encoder and compare it against BC1 and BC7.