Sunday, October 2, 2016

Visualization of random ETC2 Planar Mode blocks

For fun I've been poking around at the planar mode in ETC2. From this presentation:


They're intended for use on smoothly varying blocks. I'm intrigued by planar mode because the colors are stored at high precision (6:7:6 RGB) and there are no selectors like in the other modes. But what do planar blocks really look like? Here's an image of random planar blocks, sorted by standard deviation, pixel (box filter) upsampled by 400%:


Just the green channel:


This image was computed by poking random bytes into ETC2 blocks. If a block passes the planar ETC2 mode check, it gets decoded and stored into a temporary image. Once the temporary image is full, its blocks are sorted by standard deviation.
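For the curious, here's roughly what that experiment looks like. This is a simplified sketch, not the actual tool: it skips the bit-level planar mode check and instead draws random 6:7:6-style components directly, it handles only the green channel, and the names (block16, expand, random_planar_blocks) are invented for the example:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

struct block16
{
    uint8_t g[16]; // decoded green values
    double stddev; // standard deviation of the 16 values
};

// Expand a 6- or 7-bit component to 8 bits by bit replication.
static uint8_t expand(uint32_t v, int bits)
{
    return (uint8_t)((v << (8 - bits)) | (v >> (2 * bits - 8)));
}

static int clamp255(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// Generate n random planar blocks (green channel only) and sort them by
// standard deviation, mimicking the experiment described above.
static std::vector<block16> random_planar_blocks(int n, uint32_t seed)
{
    std::mt19937 rng(seed);
    std::vector<block16> blocks(n);

    for (block16 &b : blocks)
    {
        // Random 7-bit green components for the origin, H and V colors.
        int go = expand(rng() & 127, 7);
        int gh = expand(rng() & 127, 7);
        int gv = expand(rng() & 127, 7);

        double sum = 0, sum2 = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
            {
                // Same interpolation as the planar unpack code below, with clamping.
                int p = clamp255((4 * go + x * (gh - go) + y * (gv - go) + 2) >> 2);
                b.g[y * 4 + x] = (uint8_t)p;
                sum += p;
                sum2 += (double)p * p;
            }
        b.stddev = std::sqrt(std::max(0.0, sum2 / 16 - (sum / 16) * (sum / 16)));
    }

    std::sort(blocks.begin(), blocks.end(),
        [](const block16 &a, const block16 &b) { return a.stddev < b.stddev; });
    return blocks;
}
```

The real experiment instead stuffs random bytes into actual 64-bit ETC2 blocks and rejects any that don't decode as planar.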

Hey - most of these random planar blocks are not actually smoothly varying! I'm unsure if this is useful in practice, but it's interesting.

Looking at how planar blocks are actually unpacked: the H and V vectors are interpreted relative to the "origin" color, so they're effectively signed, and the unpacking code clamps each component to [0,255]. This is where the high frequency patterns come from.

// ro, go, bo - origin color, unpacked from 6:7:6
// rv, gv, bv and rh, gh, bh - vertical and horizontal colors, unpacked from 6:7:6
// unpack planar block's pixels - there are no selectors in this mode,
// and each interpolated component is clamped to [0,255]
for (int y = 0; y < 4; y++)
{
    for (int x = 0; x < 4; x++)
    {
        pDst->set(
            math::clamp<int>((4 * ro + x * (rh - ro) + y * (rv - ro) + 2) >> 2, 0, 255),
            math::clamp<int>((4 * go + x * (gh - go) + y * (gv - go) + 2) >> 2, 0, 255),
            math::clamp<int>((4 * bo + x * (bh - bo) + y * (bv - bo) + 2) >> 2, 0, 255),
            255);
        pDst++;
    }
}

Saturday, October 1, 2016

ETC1 encoder performance on kodim18 at various quality/effort levels

I'm trying to get a handle on how the available ETC1 compressors perform, using their public APIs, at their various quality or effort levels. This is only for a single image (kodim18 - my usual for quick tests like this).

First, here's the performance of etc2comp on kodim18, in ETC1-only mode, multithreading enabled, RGB Avg PSNR metrics, on effort levels [0,100] in increments of 5:


Efforts between roughly 40-65 seem to be the sweet spot. Effort=100 is obviously wasteful.

Here's another graph (can you tell I'm practicing my Excel graphing skills!), this time comparing the time and quality of various ETC1 (and now ETC2 - for etc2comp) compressors at different encoder quality/effort settings:



Important Notes:
  • To be fair to Intel's ETC1 encoder (function CompressBlocksETC1()), which is not multithreaded, I added another bar labeled "ispc_etc1 MT" which has the total CPU time divided by 20 (to roughly match the speedup I'm seeing using 40 threads in the other natively multithreaded encoders). 
  • basislib is now using a variant of cluster fit (see previous posts). basislib_1 is lowest quality, basislib_3 is highest. Notice that basislib_3 is only ~2X slower than Intel's SIMD code, but basislib doesn't use any SIMD at all.
  • basislib and etc2comp both use 40 threads
  • etcpak's timings are currently single threaded, because I'm still using a single threaded entrypoint inside the code (BlockData::Process()). It's on my TODO list to fix this. IMHO, unless you need a real-time ETC1 encoder it trades off too much quality. However, if you do need a real-time encoder and don't mind the loss in quality, it's your best bet. If the author added a few optional SIMD-optimized cluster fit trials it would probably kick ass.
To get an idea of how efficient etc2comp currently is at scanning the ETC1 search space for kodim18, let's see what effort level (and how much CPU time) etc2comp needs to approximately match two other encoders' quality levels:
  • basislib Cluster Fit (64 trials out of 165): .115 secs 35.917 dB
Quality matched or exceeded first at etc2comp effort 70: 1.095 secs 35.953 dB
  • Intel ISPC: 1.03 secs 35.969 dB
Quality matched or exceeded first at etc2comp effort 80: 1.83 secs 35.992 dB

perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim18.png 512x768
--- basislib Quality: 4
basislib pack time: 0.115
basislib ETC image Error: Max:  56, Mean: 2.865, MSE: 16.648, RMSE: 4.080, PSNR: 35.917, SSIM: 0.965767

--- etc2comp effort: 0
etc2comp time: 0.051663
etc2comp Error: Max:  64, Mean: 3.186, MSE: 21.123, RMSE: 4.596, PSNR: 34.883, SSIM: 0.958817
--- etc2comp effort: 5
etc2comp time: 0.083509
etc2comp Error: Max:  64, Mean: 3.129, MSE: 19.658, RMSE: 4.434, PSNR: 35.196, SSIM: 0.959313
--- etc2comp effort: 10
etc2comp time: 0.106361
etc2comp Error: Max:  64, Mean: 3.092, MSE: 19.052, RMSE: 4.365, PSNR: 35.331, SSIM: 0.959794
--- etc2comp effort: 15
etc2comp time: 0.133278
etc2comp Error: Max:  64, Mean: 3.063, MSE: 18.661, RMSE: 4.320, PSNR: 35.421, SSIM: 0.960250
--- etc2comp effort: 20
etc2comp time: 0.193460
etc2comp Error: Max:  64, Mean: 3.042, MSE: 18.416, RMSE: 4.291, PSNR: 35.479, SSIM: 0.960595
--- etc2comp effort: 25
etc2comp time: 0.162790
etc2comp Error: Max:  64, Mean: 3.027, MSE: 18.256, RMSE: 4.273, PSNR: 35.517, SSIM: 0.960869
--- etc2comp effort: 30
etc2comp time: 0.182370
etc2comp Error: Max:  64, Mean: 3.012, MSE: 18.108, RMSE: 4.255, PSNR: 35.552, SSIM: 0.961207
--- etc2comp effort: 35
etc2comp time: 0.196609
etc2comp Error: Max:  64, Mean: 2.998, MSE: 17.980, RMSE: 4.240, PSNR: 35.583, SSIM: 0.961578
--- etc2comp effort: 40
etc2comp time: 0.217227
etc2comp Error: Max:  64, Mean: 2.987, MSE: 17.888, RMSE: 4.229, PSNR: 35.605, SSIM: 0.961854
--- etc2comp effort: 45
etc2comp time: 0.248881
etc2comp Error: Max:  64, Mean: 2.970, MSE: 17.771, RMSE: 4.216, PSNR: 35.634, SSIM: 0.962461
--- etc2comp effort: 50
etc2comp time: 0.361306
etc2comp Error: Max:  59, Mean: 2.916, MSE: 17.175, RMSE: 4.144, PSNR: 35.782, SSIM: 0.963669
--- etc2comp effort: 55
etc2comp time: 0.379762
etc2comp Error: Max:  59, Mean: 2.902, MSE: 17.091, RMSE: 4.134, PSNR: 35.803, SSIM: 0.964149
--- etc2comp effort: 60
etc2comp time: 0.522357
etc2comp Error: Max:  59, Mean: 2.882, MSE: 16.840, RMSE: 4.104, PSNR: 35.867, SSIM: 0.964800
--- etc2comp effort: 65
etc2comp time: 0.560707
etc2comp Error: Max:  59, Mean: 2.878, MSE: 16.818, RMSE: 4.101, PSNR: 35.873, SSIM: 0.964974
--- etc2comp effort: 70
etc2comp time: 1.095014
etc2comp Error: Max:  59, Mean: 2.857, MSE: 16.512, RMSE: 4.063, PSNR: 35.953, SSIM: 0.965366
--- etc2comp effort: 75
etc2comp time: 1.166479
etc2comp Error: Max:  59, Mean: 2.852, MSE: 16.490, RMSE: 4.061, PSNR: 35.959, SSIM: 0.965534
--- etc2comp effort: 80
etc2comp time: 1.829960
etc2comp Error: Max:  59, Mean: 2.842, MSE: 16.362, RMSE: 4.045, PSNR: 35.992, SSIM: 0.965769
--- etc2comp effort: 85
etc2comp time: 1.904691
etc2comp Error: Max:  59, Mean: 2.836, MSE: 16.329, RMSE: 4.041, PSNR: 36.001, SSIM: 0.966037
--- etc2comp effort: 90
etc2comp time: 2.709250
etc2comp Error: Max:  59, Mean: 2.829, MSE: 16.255, RMSE: 4.032, PSNR: 36.021, SSIM: 0.966277
--- etc2comp effort: 95
etc2comp time: 2.802099
etc2comp Error: Max:  59, Mean: 2.827, MSE: 16.251, RMSE: 4.031, PSNR: 36.022, SSIM: 0.966315
--- etc2comp effort: 100
etc2comp time: 3.619217
etc2comp Error: Max:  59, Mean: 2.825, MSE: 16.216, RMSE: 4.027, PSNR: 36.031, SSIM: 0.966349

--- etcpak time: 0.006
etcpak Error: Max:  90, Mean: 3.464, MSE: 25.640, RMSE: 5.064, PSNR: 34.042, SSIM: 0.950396

--- ispc_etc time: 1.033881
ispc_etc1 Error: Max:  56, Mean: 2.866, MSE: 16.450, RMSE: 4.056, PSNR: 35.969, SSIM: 0.965412


ETC2 enabled:

perceptual: 0 etc2: 1 rec709: 1
Source filename: kodak\kodim18.png 512x768
--- basislib Quality: 1
basislib pack time: 0.036
basislib ETC image Error: Max:  73, Mean: 3.168, MSE: 20.910, RMSE: 4.573, PSNR: 34.927, SSIM: 0.959441
--- basislib Quality: 2
basislib pack time: 0.054
basislib ETC image Error: Max:  56, Mean: 2.945, MSE: 17.772, RMSE: 4.216, PSNR: 35.634, SSIM: 0.964359
--- basislib Quality: 3
basislib pack time: 0.107
basislib ETC image Error: Max:  56, Mean: 2.865, MSE: 16.648, RMSE: 4.080, PSNR: 35.917, SSIM: 0.965767
--- etc2comp effort: 0
etc2comp time: 0.086306
etc2comp Error: Max:  64, Mean: 3.158, MSE: 20.810, RMSE: 4.562, PSNR: 34.948, SSIM: 0.959069
--- etc2comp effort: 5
etc2comp time: 0.179816
etc2comp Error: Max:  59, Mean: 3.087, MSE: 19.044, RMSE: 4.364, PSNR: 35.333, SSIM: 0.959701
--- etc2comp effort: 10
etc2comp time: 0.247190
etc2comp Error: Max:  59, Mean: 3.046, MSE: 18.396, RMSE: 4.289, PSNR: 35.484, SSIM: 0.960256
--- etc2comp effort: 15
etc2comp time: 0.287620
etc2comp Error: Max:  59, Mean: 3.013, MSE: 17.977, RMSE: 4.240, PSNR: 35.584, SSIM: 0.960751
--- etc2comp effort: 20
etc2comp time: 0.339065
etc2comp Error: Max:  59, Mean: 2.990, MSE: 17.709, RMSE: 4.208, PSNR: 35.649, SSIM: 0.961162
--- etc2comp effort: 25
etc2comp time: 0.383907
etc2comp Error: Max:  59, Mean: 2.971, MSE: 17.515, RMSE: 4.185, PSNR: 35.697, SSIM: 0.961507
--- etc2comp effort: 30
etc2comp time: 0.432019
etc2comp Error: Max:  59, Mean: 2.954, MSE: 17.352, RMSE: 4.166, PSNR: 35.737, SSIM: 0.961876
--- etc2comp effort: 35
etc2comp time: 0.480186
etc2comp Error: Max:  59, Mean: 2.938, MSE: 17.210, RMSE: 4.149, PSNR: 35.773, SSIM: 0.962279
--- etc2comp effort: 40
etc2comp time: 0.516155
etc2comp Error: Max:  59, Mean: 2.925, MSE: 17.107, RMSE: 4.136, PSNR: 35.799, SSIM: 0.962590
--- etc2comp effort: 45
etc2comp time: 0.565827
etc2comp Error: Max:  59, Mean: 2.911, MSE: 17.009, RMSE: 4.124, PSNR: 35.824, SSIM: 0.963044
--- etc2comp effort: 50
etc2comp time: 1.124057
etc2comp Error: Max:  59, Mean: 2.892, MSE: 16.852, RMSE: 4.105, PSNR: 35.864, SSIM: 0.963703
--- etc2comp effort: 55
etc2comp time: 1.192462
etc2comp Error: Max:  59, Mean: 2.880, MSE: 16.772, RMSE: 4.095, PSNR: 35.885, SSIM: 0.964164
--- etc2comp effort: 60
etc2comp time: 1.713074
etc2comp Error: Max:  59, Mean: 2.851, MSE: 16.424, RMSE: 4.053, PSNR: 35.976, SSIM: 0.964913
--- etc2comp effort: 65
etc2comp time: 1.828673
etc2comp Error: Max:  59, Mean: 2.846, MSE: 16.398, RMSE: 4.049, PSNR: 35.983, SSIM: 0.965099
--- etc2comp effort: 70
etc2comp time: 2.461853
etc2comp Error: Max:  59, Mean: 2.836, MSE: 16.274, RMSE: 4.034, PSNR: 36.016, SSIM: 0.965358
--- etc2comp effort: 75
etc2comp time: 2.608303
etc2comp Error: Max:  59, Mean: 2.831, MSE: 16.247, RMSE: 4.031, PSNR: 36.023, SSIM: 0.965534
--- etc2comp effort: 80
etc2comp time: 3.383624
etc2comp Error: Max:  59, Mean: 2.820, MSE: 16.156, RMSE: 4.019, PSNR: 36.047, SSIM: 0.965855
--- etc2comp effort: 85
etc2comp time: 3.719689
etc2comp Error: Max:  59, Mean: 2.814, MSE: 16.125, RMSE: 4.016, PSNR: 36.056, SSIM: 0.966079
--- etc2comp effort: 90
etc2comp time: 4.675509
etc2comp Error: Max:  59, Mean: 2.808, MSE: 16.072, RMSE: 4.009, PSNR: 36.070, SSIM: 0.966264
--- etc2comp effort: 95
etc2comp time: 4.619700
etc2comp Error: Max:  59, Mean: 2.806, MSE: 16.068, RMSE: 4.008, PSNR: 36.071, SSIM: 0.966293
--- etc2comp effort: 100
etc2comp time: 4.771136
etc2comp Error: Max:  59, Mean: 2.805, MSE: 16.064, RMSE: 4.008, PSNR: 36.072, SSIM: 0.966309
--- etcpak time: 0.045
etcpak Error: Max:  90, Mean: 3.458, MSE: 25.593, RMSE: 5.059, PSNR: 34.050, SSIM: 0.950551
--- ispc_etc time: 1.034064
ispc_etc1 Error: Max:  56, Mean: 2.866, MSE: 16.450, RMSE: 4.056, PSNR: 35.969, SSIM: 0.965412

More ETC1 cluster fit data

Ignore the SSIM stats; this is a corpus test image built from random 4x4 blocks chosen from many other images.

This was on a 20 core Xeon workstation, with multithreading enabled in basislib and etc2comp (40 threads each). The API I'm using in etcpak is not threaded (it's already sorta ridiculously fast), and Intel's isn't multithreaded either (as far as I can tell from breakpointing inside it). etcpak and Intel make heavy use of SIMD operations, while etc2comp and basislib do not.

Cluster fit (new algorithm):

J:\dev\basislib1\bin>texexp test_0.tga
perceptual: 0 etc2: 0 rec709: 1
Source filename: test_0.tga 4096x4096
--- basislib Quality: 4
basislib pack time: 4.211
basislib ETC image Error: Max: 255, Mean: 3.277, MSE: 40.847, RMSE: 6.391, PSNR: 32.019, SSIM: 0.999618

--- etc2comp effort: 100
etc2comp time: 154.562167
etc2comp Error: Max: 255, Mean: 3.286, MSE: 42.777, RMSE: 6.540, PSNR: 31.819, SSIM: 0.999612

--- etcpak time: 0.258
etcpak Error: Max: 255, Mean: 4.282, MSE: 69.927, RMSE: 8.362, PSNR: 29.684, SSIM: 0.998407

--- ispc_etc time: 43.694160
ispc_etc1 Error: Max: 255, Mean: 3.249, MSE: 39.912, RMSE: 6.318, PSNR: 32.120, SSIM: 0.999655

Previous algorithm (rg_etc1's lattice scan+refinement):

J:\dev\basislib1\bin>texexp test_0.tga
perceptual: 0 etc2: 0 rec709: 1
Source filename: test_0.tga 4096x4096

--- basislib Quality: 4
basislib pack time: 240.638
basislib ETC image Error: Max: 255, Mean: 3.203, MSE: 39.336, RMSE: 6.272, PSNR: 32.183, SSIM: 0.999680

--- etc2comp effort: 100
etc2comp time: 151.739932
etc2comp Error: Max: 255, Mean: 3.286, MSE: 42.777, RMSE: 6.540, PSNR: 31.819, SSIM: 0.999612

--- etcpak time: 0.243
etcpak Error: Max: 255, Mean: 4.282, MSE: 69.927, RMSE: 8.362, PSNR: 29.684, SSIM: 0.998407

--- ispc_etc time: 46.807596
ispc_etc1 Error: Max: 255, Mean: 3.249, MSE: 39.912, RMSE: 6.318, PSNR: 32.120, SSIM: 0.999655

ETC1 block flip estimation

ETC1's block format includes a "flip" bit, which describes how the two subblocks are oriented within each 4x4 pixel block. When this bit is set, the subblocks are two horizontal 4x2 strips, like this:


ABCD
EFGH

IJKL
MNOP

Otherwise, they're two vertical 2x4 strips, like this:

AB CD
EF GH
IJ KL
MN OP

(In BC7 terminology, ETC1 always uses 2 subsets per block, with only 2 possible partitions.)

Anyhow, a high quality encoder typically tries both subblock orientations and chooses the one which results in the lowest overall error. Interestingly, it's possible to predict with a reasonable degree of certainty which subblock orientation is best. The advantage to this method is obvious: higher encoding throughput, with (hopefully) only a small loss in quality.

The method used by etc2comp computes each possible subblock's average color, then sums the squared "gray distance" from each pixel to its subblock's average color. (To compute the gray distance from color A to color B, project A onto the grayscale line passing through B, then compute the squared RGB distance from A to the projected point.) etc2comp then tries first the flip orientation with the lowest overall gray distance.

Success rates on a handful of images, with RGB avg. PSNR and SSIM stats (using accurate vs. estimated flips):

kodim01: 68% (35.904 vs. 35.607, .975236 vs. .973744)
kodim03: 65% (39.949 vs. 38.801, .964774 vs. .963467)
kodim18: 69% (35.917 vs. 35.661, .965767 vs. .964385)
kodim20: 63% (38.849 vs. 38.589, .975558 vs. .974669)

ETC1 block flip bit visualizations on kodim03:

Image:



basislib best flips:

Flip estimates, using etc2comp's gray line estimate algorithm:


etc2comp's flips (in ETC1 mode):


etcpak's flips (in ETC1 mode):


Example code: 4x4 Haar transform on ETC1 selectors

Someone asked me for the code that implements the selector frequency domain filtering experiment I did in one of my previous posts. Here you go. The linear algebra and utility code is all very similar to crunch's. The Haar stuff was copied straight from Wikipedia.

If you understand how coefficient quantization works in JPEG, then this example should be pretty easy to follow. Let me know if you have any questions or want to see the 8x8 version, but it's basically the same thing with larger matrices (and the selectors within 2x2 ETC1 blocks).

Notes:
  • gpu_image: A compressed GPU texture (in this case, ETC1).
  • image_u8: A 32bpp ARGB raster image.
  • etc_block: A simple helper class to get/set ETC1/2 selectors, subblock colors, codeword table indices, etc.
static void selector_haar_transform_test(const gpu_image &g)
{
    image_u8 g_unpacked;
    g.unpack_image(g_unpacked);

    image_utils::write_to_file("g_orig.png", g_unpacked);

    gpu_image g_temp(g);

#define sqrt2 (1.4142135623730951f)

    // 4x4 Haar transform matrix
    matrix44F haar_matrix(
        .5f * 1, .5f * 1, .5f * 1, .5f * 1,
        .5f * 1, .5f * 1, .5f * -1, .5f * -1,
        .5f * sqrt2, .5f * -sqrt2, 0, 0,
        0, 0, .5f * sqrt2, .5f * -sqrt2);

    // The Haar matrix is orthogonal, so its inverse is its transpose.
    matrix44F inv_haar_matrix(haar_matrix.get_transposed());

    for (uint block_y = 0; block_y < g_temp.get_blocks_y(); block_y++)
    {
        for (uint block_x = 0; block_x < g_temp.get_blocks_x(); block_x++)
        {
            etc_block &blk = g_temp.get_element_of_type<etc_block>(block_x, block_y);

            // Gather the block's 16 selectors into a 4x4 matrix.
            matrix44F s;
            for (uint y = 0; y < 4; y++)
                for (uint x = 0; x < 4; x++)
                    s(x, y) = (float)blk.get_selector(x, y) + .5f;

            // Forward 2D Haar transform.
            matrix44F h(haar_matrix * s * inv_haar_matrix);

            // Quantize the coefficients, more coarsely at higher frequencies.
            for (uint i = 0; i < 4; i++)
            {
                for (uint j = 0; j < 4; j++)
                {
                    float q = h(i, j);

                    const float scale = 4.0f;
                    int iq = math::float_to_int_nearest(q * scale);

                    int quant = (i + j + 1);

                    if (quant < 1)
                        quant = 1;

                    iq /= quant;
                    iq *= quant;

                    h(i, j) = iq / scale;
                }
            }

            // Inverse 2D Haar transform, then write back the quantized selectors.
            matrix44F ih(inv_haar_matrix * h * haar_matrix);

            for (uint y = 0; y < 4; y++)
                for (uint x = 0; x < 4; x++)
                    //blk.set_selector(x, y, math::clamp<int>((int)ih(x, y) + .5f, 0, 3));
                    blk.set_selector(x, y, math::clamp<int>((int)ih(x, y), 0, 3));
        }
    }

    write_etc1_vis_images(g_temp, "g_temp_");

    g_temp.unpack_image(g_unpacked);

    image_utils::write_to_file("g.png", g_unpacked);
    exit(0);
}

Thursday, September 29, 2016

libsquish's DXT1 "Cluster Fit" method applied to ETC1

(This post is probably of interest to like a dozen people in the world, so it's kinda hairy.)

libsquish (a popular DXT encoding library) internally uses a total ordering based method to find high-quality DXT endpoints. This method can also be applied to ETC1 encoding, using the equations in rg_etc1's optimizer's remarks to solve for optimal subblock colors given each possible selector distribution in the total ordering and the current best intensity index and subblock color.

I don't actually compute the total ordering; instead I iterate over all selector distributions present in the total ordering, because the actual per-pixel selector values don't matter to the solver. So basically, the new optimizer first tries the subblock's average color, then computes and tries a series of "correction" factors (relative to the subblock's average color), which depend on the current intensity table index and the current best subblock color found so far (to account for clamping). A hash table is also used to prevent the optimizer from evaluating a trial solution more than once.
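As an aside, it's easy to check how many such distributions exist: splitting 8 subblock pixels among the 4 selector values gives C(8+3,3) = 165, which is where the 165s elsewhere in this post come from. A quick brute-force sanity check (my own, not part of basislib):

```cpp
// Count the ways to distribute 8 subblock pixels among the 4 selector values.
static int count_selector_distributions()
{
    int count = 0;
    for (int s0 = 0; s0 <= 8; s0++)
        for (int s1 = 0; s1 <= 8 - s0; s1++)
            for (int s2 = 0; s2 <= 8 - s0 - s1; s2++)
                count++; // s3 = 8 - s0 - s1 - s2 is implied
    return count; // 165 == C(11,3)
}
```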

Single threaded results:

perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512

--- basislib Quality: 4
basislib time: 5.644
basislib ETC image Error: Max:  70, Mean: 1.952, MSE: 8.220, RMSE: 2.867, PSNR: 38.982, SSIM: 0.964853

--- etc2comp effort: 100
etc2comp time: 75.792851
etc2comp Error: Max:  75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339

--- etcpak time: 0.006
etcpak Error: Max:  80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697

--- ispc_etc time: 1.021655
ispc_etc1 Error: Max:  75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916



After enabling multithreading (40 threads) in those encoders that support it:

J:\dev\basislib1\bin>texexp kodak\kodim03.png
perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512

--- basislib Quality: 4
basislib pack time: 0.266
basislib ETC image Error: Max:  70, Mean: 1.952, MSE: 8.220, RMSE: 2.867, PSNR: 38.982, SSIM: 0.964853

--- etc2comp effort: 100
etc2comp time: 3.608819
etc2comp Error: Max:  75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339

--- etcpak time: 0.006
etcpak Error: Max:  80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697

--- ispc_etc time: 1.054324

ispc_etc1 Error: Max:  75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916

Intel is doing some kind of amazing SIMD dark magic in there. The ETC1 cluster fit method is around 10-27x faster than rg_etc1 (which uses my previous method, a hybrid of a 3D neighborhood search with iterative base color refinement) and etc2comp (effort 100) in ETC1 mode. RGB Avg. PSNR is usually within ~.1 dB of Intel.

I'm so tempted to update rg_etc1 with this algorithm, if only I had the time.

Update 10/1: Okay, I computed a histogram of the "winning" subblock average color correction factors applied over a few hundred test textures. I then selected the top 64 correction factors (out of 165) and don't bother trying the rest. Here's a graph showing the usage histogram of the selector distributions across all the test textures, sorted by frequency (the most successful selector distributions are to the right).


This works well in my testing, giving another 2.25-5x speedup (depending on the number of factors you choose to apply):

J:\dev\basislib1\bin>texexp  kodak\kodim03.png
perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512

--- basislib Quality: 4
basislib pack time: 0.118
basislib ETC image Error: Max:  70, Mean: 1.960, MSE: 8.276, RMSE: 2.877, PSNR: 38.953, SSIM: 0.964881

--- etc2comp effort: 100
etc2comp time: 3.621138
etc2comp Error: Max:  75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339

--- etcpak time: 0.006
etcpak Error: Max:  80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697

--- ispc_etc time: 1.038211
ispc_etc1 Error: Max:  75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916
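For illustration, the winnowing step from the update above can be sketched like this. This is a hypothetical reconstruction, not basislib's code; pick_top_distributions and the wins data are invented for the example, and in practice the win counts come from encoding a few hundred textures:

```cpp
#include <algorithm>
#include <vector>

// Given one "winning distribution index" per encoded subblock, build a usage
// histogram over the 165 distributions and keep only the top n most frequent.
static std::vector<int> pick_top_distributions(const std::vector<int> &wins, int n)
{
    const int kNumDists = 165;
    std::vector<int> hist(kNumDists, 0);
    for (int w : wins)
        hist[w]++;

    std::vector<int> order(kNumDists);
    for (int i = 0; i < kNumDists; i++)
        order[i] = i;

    // Sort distribution indices by descending usage count.
    std::sort(order.begin(), order.end(),
        [&](int a, int b) { return hist[a] > hist[b]; });

    order.resize(n);
    return order;
}
```

The surviving indices become the sorted correction-factor table the optimizer scans first.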

How It Works


rg_etc1 tries refining the current best subblock color found so far as it scans "around" the ETC1 444 or 555 color 3D lattice surrounding the subblock's average color. This refinement approach takes as input the current set of 2-bit selectors, the current best intensity table index, and the current best subblock color (only to account for clamping). Given this information, you can compute a RGB "correction" factor which is subtracted from the subblock's average color to compute a potentially better (lower error) subblock color.

Here's the basic math, from the comments:

// Now we have the input block, the avg. color of the input pixels, a set of trial selector indices, and the block color+intensity index.
// Now, for each component, attempt to refine the current solution by solving a simple linear equation. For example, for 4 pixels:
// The goal is:
// pixel0 - (block_color+inten_table[selector0]) + pixel1 - (block_color+inten_table[selector1]) + pixel2 - (block_color+inten_table[selector2]) + pixel3 - (block_color+inten_table[selector3]) = 0
// Rearranging this:
// (pixel0 + pixel1 + pixel2 + pixel3) - (block_color+inten_table[selector0]) - (block_color+inten_table[selector1]) - (block_color+inten_table[selector2]) - (block_color+inten_table[selector3]) = 0
// (pixel0 + pixel1 + pixel2 + pixel3) - block_color - inten_table[selector0] - block_color-inten_table[selector1] - block_color-inten_table[selector2] - block_color-inten_table[selector3] = 0
// (pixel0 + pixel1 + pixel2 + pixel3) - 4*block_color - inten_table[selector0] - inten_table[selector1] - inten_table[selector2] - inten_table[selector3] = 0
// (pixel0 + pixel1 + pixel2 + pixel3) - 4*block_color - (inten_table[selector0] + inten_table[selector1] + inten_table[selector2] + inten_table[selector3]) = 0
// (pixel0 + pixel1 + pixel2 + pixel3)/4 - block_color - (inten_table[selector0] + inten_table[selector1] + inten_table[selector2] + inten_table[selector3])/4 = 0
// block_color = (pixel0 + pixel1 + pixel2 + pixel3)/4 - (inten_table[selector0] + inten_table[selector1] + inten_table[selector2] + inten_table[selector3])/4
// So what this means:
// optimal_block_color = avg_input - avg_inten_delta
// So the optimal block color can be computed by taking the average block color and subtracting the current average of the intensity delta.
// Unfortunately, optimal_block_color must then be quantized to 555 or 444 so it's not always possible to improve matters using this formula.
// Also, the above formula is for unclamped intensity deltas. The actual implementation takes into account clamping.
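To make optimal_block_color = avg_input - avg_inten_delta concrete, here's a minimal single-channel sketch. This is my own illustration: refine_block_color is an invented name, the 555/444 quantization and the clamping handled by the real code are omitted, and {-8, -2, 2, 8} is ETC1's first intensity table:

```cpp
// Solve for the best block color for one channel, given fixed selectors and a
// fixed intensity table: optimal_block_color = avg_input - avg_inten_delta.
// 'selectors' holds one 2-bit selector per subblock pixel.
static int refine_block_color(const int pixels[8], const int selectors[8],
                              const int inten_table[4])
{
    int pixel_sum = 0, delta_sum = 0;
    for (int i = 0; i < 8; i++)
    {
        pixel_sum += pixels[i];
        delta_sum += inten_table[selectors[i]];
    }
    // Rounded division by 8 (assumes the sum stays non-negative).
    return (pixel_sum - delta_sum + 4) >> 3;
}
```

If the pixels were generated exactly as block_color + inten_table[selector], this recovers the original block color.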

To implement cluster fit for ETC1, you can iterate over the total ordering of all the selectors for each of the 8 subblock pixels, much like squish does in DXT1. However, doing this is unnecessary, because all that ultimately matters in the refinement equation is the computed avg_inten_delta, which just depends on the selector distribution (and not what each pixel's selector actually is).

Here's my current optimizer's compute() function. It first tries the subblock's average color (at coordinates m_br, m_bg, m_bb), to establish a baseline (minimally useful) solution, then it iterates over the precomputed (and sorted) selector distribution table and attempts applying (usually) a few dozen or so avg. color "correction" factors from this table. The table is sorted, so the entries with the highest probability of applying the best correction appear first (as mentioned above).

Note that evaluate_solution() uses a hash table to avoid trying the same solution more than once. The sorted table of total ordering selector distributions is at the bottom of this post.

// total_perms_to_try: 1-64 seems good enough (out of 165)
void etc1_optimizer::compute_internal_cluster_fit_fast(uint total_perms_to_try)
{
    if ((!m_best_solution.m_valid) || ((m_br != m_best_solution.m_coords.m_unscaled_color.r) || (m_bg != m_best_solution.m_coords.m_unscaled_color.g) || (m_bb != m_best_solution.m_coords.m_unscaled_color.b)))
    {
        evaluate_solution(etc1_solution_coordinates(m_br, m_bg, m_bb, 0, m_pParams->m_use_color4), m_trial_solution, &m_best_solution);
    }

    if ((m_best_solution.m_error == 0) || (!m_best_solution.m_valid))
        return;
            
    for (uint i = 0; i < total_perms_to_try; i++)
    {
        int delta_sum_r = 0, delta_sum_g = 0, delta_sum_b = 0;

        const int *pInten_table = g_etc1_inten_tables[m_best_solution.m_coords.m_inten_table];
        const color_quad_u8 base_color(m_best_solution.m_coords.get_scaled_color());

        const uint8 *pNum_selectors = g_cluster_fit_order_tab[i].m_v;

        for (uint q = 0; q < 4; q++)
        {
            const int yd_temp = pInten_table[q];

            delta_sum_r += pNum_selectors[q] * (math::clamp<int>(base_color.r + yd_temp, 0, 255) - base_color.r);
            delta_sum_g += pNum_selectors[q] * (math::clamp<int>(base_color.g + yd_temp, 0, 255) - base_color.g);
            delta_sum_b += pNum_selectors[q] * (math::clamp<int>(base_color.b + yd_temp, 0, 255) - base_color.b);
        }

        if ((!delta_sum_r) && (!delta_sum_g) && (!delta_sum_b))
            continue;

        const float avg_delta_r_f = static_cast<float>(delta_sum_r) / 8;
        const float avg_delta_g_f = static_cast<float>(delta_sum_g) / 8;
        const float avg_delta_b_f = static_cast<float>(delta_sum_b) / 8;

        const int br1 = math::clamp<int>(static_cast<uint>((m_avg_color[0] - avg_delta_r_f) * m_limit / 255.0f + .5f), 0, m_limit);
        const int bg1 = math::clamp<int>(static_cast<uint>((m_avg_color[1] - avg_delta_g_f) * m_limit / 255.0f + .5f), 0, m_limit);
        const int bb1 = math::clamp<int>(static_cast<uint>((m_avg_color[2] - avg_delta_b_f) * m_limit / 255.0f + .5f), 0, m_limit);

#if BASISLIB_DEBUG_ETC_ENCODER_DEEPER
        printf("Cluster fit trial %u, avg_delta %f %f %f\n", i, avg_delta_r_f, avg_delta_g_f, avg_delta_b_f);
#endif

        evaluate_solution(etc1_solution_coordinates(br1, bg1, bb1, 0, m_pParams->m_use_color4), m_trial_solution, &m_best_solution);

        if (m_best_solution.m_error == 0)
            break;
    }
}

const uint BASISLIB_CLUSTER_FIT_ORDER_TABLE_SIZE = 165;

static const struct { uint8 m_v[4]; } g_cluster_fit_order_tab[BASISLIB_CLUSTER_FIT_ORDER_TABLE_SIZE] =
{
    { 0, 0, 0, 8 }, { 0, 5, 2, 1 }, { 0, 6, 1, 1 }, { 0, 7, 0, 1 }, { 0, 7, 1, 0 },
    { 0, 0, 8, 0 }, { 0, 0, 3, 5 }, { 0, 1, 7, 0 }, { 0, 0, 4, 4 }, { 0, 0, 2, 6 },
    { 0, 0, 7, 1 }, { 0, 0, 1, 7 }, { 0, 0, 5, 3 }, { 1, 6, 0, 1 }, { 0, 0, 6, 2 },
    { 0, 2, 6, 0 }, { 2, 4, 2, 0 }, { 0, 3, 5, 0 }, { 3, 3, 1, 1 }, { 4, 2, 0, 2 },
    { 1, 5, 2, 0 }, { 0, 5, 3, 0 }, { 0, 6, 2, 0 }, { 2, 4, 1, 1 }, { 5, 1, 0, 2 },
    { 6, 1, 1, 0 }, { 3, 3, 0, 2 }, { 6, 0, 0, 2 }, { 0, 8, 0, 0 }, { 6, 1, 0, 1 },
    { 0, 1, 6, 1 }, { 1, 6, 1, 0 }, { 4, 1, 3, 0 }, { 0, 2, 5, 1 }, { 5, 0, 3, 0 },
    { 5, 3, 0, 0 }, { 0, 1, 5, 2 }, { 0, 3, 4, 1 }, { 2, 5, 1, 0 }, { 1, 7, 0, 0 },
    { 0, 1, 4, 3 }, { 6, 0, 2, 0 }, { 0, 4, 4, 0 }, { 2, 6, 0, 0 }, { 0, 2, 4, 2 },
    { 0, 5, 1, 2 }, { 0, 6, 0, 2 }, { 3, 5, 0, 0 }, { 0, 4, 3, 1 }, { 3, 4, 1, 0 },
    { 4, 3, 1, 0 }, { 1, 5, 0, 2 }, { 0, 3, 3, 2 }, { 1, 4, 1, 2 }, { 0, 4, 2, 2 },
    { 2, 3, 3, 0 }, { 4, 4, 0, 0 }, { 1, 2, 4, 1 }, { 0, 5, 0, 3 }, { 0, 1, 3, 4 },
    { 1, 5, 1, 1 }, { 1, 4, 2, 1 }, { 1, 3, 2, 2 }, { 5, 2, 1, 0 }, { 1, 3, 3, 1 },
    { 0, 1, 2, 5 }, { 1, 1, 5, 1 }, { 0, 3, 2, 3 }, { 2, 5, 0, 1 }, { 3, 2, 2, 1 },
    { 2, 3, 0, 3 }, { 1, 4, 3, 0 }, { 2, 2, 1, 3 }, { 6, 2, 0, 0 }, { 1, 0, 6, 1 },
    { 3, 3, 2, 0 }, { 7, 1, 0, 0 }, { 3, 1, 4, 0 }, { 0, 2, 3, 3 }, { 0, 4, 1, 3 },
    { 0, 4, 0, 4 }, { 0, 1, 0, 7 }, { 2, 0, 5, 1 }, { 2, 0, 4, 2 }, { 3, 0, 2, 3 },
    { 2, 2, 4, 0 }, { 2, 2, 3, 1 }, { 4, 0, 3, 1 }, { 3, 2, 3, 0 }, { 2, 3, 2, 1 },
    { 1, 3, 4, 0 }, { 7, 0, 1, 0 }, { 3, 0, 4, 1 }, { 1, 0, 5, 2 }, { 8, 0, 0, 0 },
    { 3, 0, 1, 4 }, { 4, 1, 1, 2 }, { 4, 0, 2, 2 }, { 1, 2, 5, 0 }, { 4, 2, 1, 1 },
    { 3, 4, 0, 1 }, { 2, 0, 3, 3 }, { 5, 0, 1, 2 }, { 5, 0, 0, 3 }, { 2, 4, 0, 2 },
    { 2, 1, 4, 1 }, { 4, 0, 1, 3 }, { 2, 1, 5, 0 }, { 4, 2, 2, 0 }, { 4, 0, 4, 0 },
    { 1, 0, 4, 3 }, { 1, 4, 0, 3 }, { 3, 0, 3, 2 }, { 4, 3, 0, 1 }, { 0, 1, 1, 6 },
    { 1, 3, 1, 3 }, { 0, 2, 2, 4 }, { 2, 0, 2, 4 }, { 5, 1, 1, 1 }, { 3, 0, 5, 0 },
    { 2, 3, 1, 2 }, { 3, 0, 0, 5 }, { 0, 3, 1, 4 }, { 5, 0, 2, 1 }, { 2, 1, 3, 2 },
    { 2, 0, 6, 0 }, { 3, 1, 3, 1 }, { 5, 1, 2, 0 }, { 1, 0, 3, 4 }, { 1, 1, 6, 0 },
    { 4, 0, 0, 4 }, { 2, 0, 1, 5 }, { 0, 3, 0, 5 }, { 1, 3, 0, 4 }, { 4, 1, 2, 1 },
    { 1, 2, 3, 2 }, { 3, 1, 0, 4 }, { 5, 2, 0, 1 }, { 1, 2, 2, 3 }, { 3, 2, 1, 2 },
    { 2, 2, 2, 2 }, { 6, 0, 1, 1 }, { 1, 2, 1, 4 }, { 1, 1, 4, 2 }, { 3, 2, 0, 3 },
    { 1, 2, 0, 5 }, { 1, 0, 7, 0 }, { 3, 1, 2, 2 }, { 1, 0, 2, 5 }, { 2, 0, 0, 6 },
    { 2, 1, 1, 4 }, { 2, 2, 0, 4 }, { 1, 1, 3, 3 }, { 7, 0, 0, 1 }, { 1, 0, 0, 7 },
    { 2, 1, 2, 3 }, { 4, 1, 0, 3 }, { 3, 1, 1, 3 }, { 1, 1, 2, 4 }, { 2, 1, 0, 5 },
    { 1, 0, 1, 6 }, { 0, 2, 1, 5 }, { 0, 2, 0, 6 }, { 1, 1, 1, 5 }, { 1, 1, 0, 6 } 
};

Wednesday, September 28, 2016

LodePNG

So far this is a nice looking library, and I've heard it reliably handles 16-bit/component .PNGs:

http://lodev.org/lodepng/

An interesting ETC1/2 encoding test vector

Here's the 4x4 test vector image (zoomed in 32X for ease of visibility), provided to me by John Brooks and Victor Reynolds at Blue Shift:

Red pixel: 255,0,0
Blue pixel: 0,0,255

Seems simple enough, right? Here's what happens with the various encoders (in non-perceptual mode, if the encoder supports that flag), using up-to-date versions from early last week, and non-perceptual RGB avg. metrics for both PSNR and SSIM:

etcpak (PSNR: 15.612, SSIM: 0.265737):

Red pixel: 93,60,93
Blue pixel: 51,18,51

etc2comp ETC1 (PSNR: 17.471, SSIM: 0.372446):


Red pixel: 111,60,60
Blue pixel: 60,60,111

Intel ISPC (PSNR: 24.968, SSIM: 0.587142):


Red pixel: 234,47,47
Blue pixel: 47,47,234

basislib_etc1 from yesterday (PSNR: 19.987, SSIM: 0.511227):

Red pixel: 149,47,47
Blue pixel: 47,47,149

etc2comp ETC2 (PSNR: 19.779, SSIM: 0.517508):



Red pixel: 255, 0, 0
Blue pixel: 64,64,98

This is an example of a well-tuned ETC1 encoder (Intel's) holding its own vs. etc2comp in ETC2 mode.

Want a little challenge? Try to figure out how Intel's encoder produced the best output.

John Brooks, the lead on etc2comp, told me that BSI is working with that test image because it's a known low-quality encoding pattern for etc2comp. It wasn't in their test corpus, so the PSNRs of 17 and 19 should improve with future etc2comp iterations.

I've improved basislib's handling of this test vector, but the results now need an optimization pass. I've prototyped a version of squish's total ordering method in ETC1, by applying the equations in the remarks in rg_etc1.cpp's code. Amazingly, it competed against rg_etc1's current algorithm on quality on my very first try of the new method, but it's slower.