Thursday, September 29, 2016

libsquish's DXT1 "Cluster Fit" method applied to ETC1

(This post is probably of interest to like a dozen people in the world, so it's kinda hairy.)

libsquish (a popular DXT encoding library) internally uses a total ordering based method to find high-quality DXT endpoints. This method can also be applied to ETC1 encoding, using the equations in rg_etc1's optimizer's remarks to solve for optimal subblock colors given each possible selector distribution in the total ordering and the current best intensity index and subblock color.

I don't actually compute the total ordering. Instead, I iterate over all selector distributions present in the total ordering, because the actual per-pixel selector values don't matter to the solver. So basically, the new optimizer first tries the subblock's average color, then it computes and tries a series of "correction" factors (relative to the subblock's average color), which depend on the current intensity table index and the current best subblock color found so far (to account for clamping). A hash table is also used to prevent the optimizer from evaluating a trial solution more than once.

Single threaded results:

perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512

--- basislib Quality: 4
basislib time: 5.644
basislib ETC image Error: Max:  70, Mean: 1.952, MSE: 8.220, RMSE: 2.867, PSNR: 38.982, SSIM: 0.964853

--- etc2comp effort: 100
etc2comp time: 75.792851
etc2comp Error: Max:  75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339

--- etcpak time: 0.006
etcpak Error: Max:  80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697

--- ispc_etc time: 1.021655
ispc_etc1 Error: Max:  75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916

After enabling multithreading (40 threads) in those encoders that support it:

J:\dev\basislib1\bin>texexp kodak\kodim03.png
perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512

--- basislib Quality: 4
basislib pack time: 0.266
basislib ETC image Error: Max:  70, Mean: 1.952, MSE: 8.220, RMSE: 2.867, PSNR: 38.982, SSIM: 0.964853

--- etc2comp effort: 100
etc2comp time: 3.608819
etc2comp Error: Max:  75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339

--- etcpak time: 0.006
etcpak Error: Max:  80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697

--- ispc_etc time: 1.054324
ispc_etc1 Error: Max:  75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916

Intel is doing some kind of amazing SIMD dark magic in there. The ETC1 cluster fit method is around 10-27x faster than rg_etc1 (which uses my previous method, a hybrid of a 3D neighborhood search with iterative base color refinement) and etc2comp (effort 100) in ETC1 mode. RGB Avg. PSNR is usually within ~.1 dB of Intel.

I'm so tempted to update rg_etc1 with this algorithm, if only I had the time.

Update 10/1: Okay, I computed a histogram of the "winning" subblock average color correction factors applied over a few hundred test textures. I then selected the top 64 correction factors (out of 165), and don't bother trying the rest. Here's a graph showing the usage histogram of the selector distributions across all the test textures, sorted by frequency (the most successful selector distributions are to the right).

This works well in my testing, for another 2.25 - 5x speedup (depending on the # of factors you choose to apply):

J:\dev\basislib1\bin>texexp  kodak\kodim03.png
perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512

--- basislib Quality: 4
basislib pack time: 0.118
basislib ETC image Error: Max:  70, Mean: 1.960, MSE: 8.276, RMSE: 2.877, PSNR: 38.953, SSIM: 0.964881

--- etc2comp effort: 100
etc2comp time: 3.621138
etc2comp Error: Max:  75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339

--- etcpak time: 0.006
etcpak Error: Max:  80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697

--- ispc_etc time: 1.038211
ispc_etc1 Error: Max:  75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916
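
The pruning step from the 10/1 update can be sketched in a few lines of C++. This is only an illustration, not basislib's code: top_entries_by_wins and win_counts (one "win" counter per selector distribution, gathered over a test corpus) are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the indices of the `keep` entries with the most wins, most frequent first.
// win_counts[i] is assumed to hold how often entry i produced the best solution.
std::vector<std::size_t> top_entries_by_wins(const std::vector<unsigned>& win_counts, std::size_t keep)
{
    assert(keep <= win_counts.size());

    std::vector<std::size_t> order(win_counts.size());
    std::iota(order.begin(), order.end(), std::size_t(0));

    // Stable sort so entries with equal counts keep their original relative order.
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t a, std::size_t b) { return win_counts[a] > win_counts[b]; });

    order.resize(keep);
    return order;
}
```

Emitting the surviving entries in this order is what lets the optimizer stop after the first N table entries and still hit the most useful corrections.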

How It Works

rg_etc1 tries refining the current best subblock color found so far as it scans "around" the ETC1 444 or 555 color 3D lattice surrounding the subblock's average color. This refinement approach takes as input the current set of 2-bit selectors, the current best intensity table index, and the current best subblock color (only to account for clamping). Given this information, you can compute an RGB "correction" factor which is subtracted from the subblock's average color to compute a potentially better (lower error) subblock color.

Here's the basic math, from the comments:

// Now we have the input block, the avg. color of the input pixels, a set of trial selector indices, and the block color+intensity index.
// Now, for each component, attempt to refine the current solution by solving a simple linear equation. For example, for 4 pixels:
// The goal is:
// pixel0 - (block_color+inten_table[selector0]) + pixel1 - (block_color+inten_table[selector1]) + pixel2 - (block_color+inten_table[selector2]) + pixel3 - (block_color+inten_table[selector3]) = 0
// Rearranging this:
// (pixel0 + pixel1 + pixel2 + pixel3) - (block_color+inten_table[selector0]) - (block_color+inten_table[selector1]) - (block_color+inten_table[selector2]) - (block_color+inten_table[selector3]) = 0
// (pixel0 + pixel1 + pixel2 + pixel3) - block_color - inten_table[selector0] - block_color-inten_table[selector1] - block_color-inten_table[selector2] - block_color-inten_table[selector3] = 0
// (pixel0 + pixel1 + pixel2 + pixel3) - 4*block_color - inten_table[selector0] - inten_table[selector1] - inten_table[selector2] - inten_table[selector3] = 0
// (pixel0 + pixel1 + pixel2 + pixel3) - 4*block_color - (inten_table[selector0] + inten_table[selector1] + inten_table[selector2] + inten_table[selector3]) = 0
// (pixel0 + pixel1 + pixel2 + pixel3)/4 - block_color - (inten_table[selector0] + inten_table[selector1] + inten_table[selector2] + inten_table[selector3])/4 = 0
// block_color = (pixel0 + pixel1 + pixel2 + pixel3)/4 - (inten_table[selector0] + inten_table[selector1] + inten_table[selector2] + inten_table[selector3])/4
// So what this means:
// optimal_block_color = avg_input - avg_inten_delta
// So the optimal block color can be computed by taking the average block color and subtracting the current average of the intensity delta.
// Unfortunately, optimal_block_color must then be quantized to 555 or 444 so it's not always possible to improve matters using this formula.
// Also, the above formula is for unclamped intensity deltas. The actual implementation takes into account clamping.
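
To make the formula concrete, here's a minimal single-channel sketch of the refinement step, including the clamping and 555/444 quantization mentioned in the remarks. The helper names (clamp_int, refine_channel) are made up for this example, and kIntenTables is assumed to hold the standard ETC1 intensity modifiers sorted in ascending order.

```cpp
#include <cassert>
#include <cmath>

// Standard ETC1 intensity modifier tables, sorted ascending within each table (assumed layout).
static const int kIntenTables[8][4] = {
    {-8, -2, 2, 8},     {-17, -5, 5, 17},   {-29, -9, 9, 29},     {-42, -13, 13, 42},
    {-60, -18, 18, 60}, {-80, -24, 24, 80}, {-106, -33, 33, 106}, {-183, -47, 47, 183}};

static int clamp_int(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

// Hypothetical helper: refines one channel of a subblock color.
//   avg_input   - average of this channel over the subblock's pixels (0..255)
//   base        - current best block color for this channel, scaled to 0..255
//   counts[q]   - number of pixels using selector q (must sum to num_pixels)
//   inten_table - intensity table index (0..7)
//   limit       - 31 for 555 colors, 15 for 444 colors
// Returns the quantized (0..limit) candidate block color.
int refine_channel(float avg_input, int base, const int counts[4], int num_pixels, int inten_table, int limit)
{
    assert(num_pixels > 0 && inten_table >= 0 && inten_table < 8);

    // Sum the intensity deltas actually applied after clamping to 0..255.
    int delta_sum = 0;
    for (int q = 0; q < 4; q++)
    {
        const int clamped = clamp_int(base + kIntenTables[inten_table][q], 0, 255);
        delta_sum += counts[q] * (clamped - base);
    }

    // optimal_block_color = avg_input - avg_inten_delta, then quantize with rounding.
    const float avg_delta = static_cast<float>(delta_sum) / num_pixels;
    const float scaled = (avg_input - avg_delta) * limit / 255.0f;
    return clamp_int(static_cast<int>(std::floor(scaled + 0.5f)), 0, limit);
}
```

For example, with all 8 pixels on the brightest selector of table 0 (delta +8), a base of 100 and an input average of 108, the unquantized optimum is exactly 100, which quantizes to 12 in 555 space.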

To implement cluster fit for ETC1, you can iterate over the total ordering of all the selectors for each of the 8 subblock pixels, much like squish does in DXT1. However, doing this is unnecessary, because all that ultimately matters in the refinement equation is the computed avg_inten_delta, which just depends on the selector distribution (and not what each pixel's selector actually is).
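
As a sanity check on the table size, here's a small sketch that enumerates every way of spreading a subblock's pixels across the 4 selectors. The function name is hypothetical, and the output is in plain lexicographic order rather than the frequency-sorted order used by the table in this post.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Enumerates every (n0, n1, n2, n3) with n0 + n1 + n2 + n3 == num_pixels,
// where nq is the number of pixels assigned to selector q.
std::vector<std::array<int, 4> > enumerate_selector_distributions(int num_pixels)
{
    assert(num_pixels >= 0);

    std::vector<std::array<int, 4> > result;
    for (int a = 0; a <= num_pixels; a++)
        for (int b = 0; a + b <= num_pixels; b++)
            for (int c = 0; a + b + c <= num_pixels; c++)
            {
                // The last count is whatever is left over.
                const std::array<int, 4> d = {{a, b, c, num_pixels - a - b - c}};
                result.push_back(d);
            }
    return result;
}
```

For the 8 pixels of an ETC1 subblock this yields 165 distributions, which is where the "out of 165" figure comes from.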

Here's my current optimizer's compute() function. It first tries the subblock's average color (at coordinates m_br, m_bg, m_bb), to establish a baseline (minimally useful) solution, then it iterates over the precomputed (and sorted) selector distribution table and attempts applying (usually) a few dozen or so avg. color "correction" factors from this table. The table is sorted, so the entries with the highest probability of applying the best correction appear first (as mentioned above).

Note that evaluate_solution() uses a hash table to avoid trying the same solution more than once. The sorted table of total ordering selector distributions is at the bottom of this post.
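
The "don't try the same solution twice" idea could look something like the sketch below. This is not the actual basislib code: trial_cache is a hypothetical class, and keying only on the unscaled 5-bit color coordinates is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-in for the optimizer's "already tried" hash table.
class trial_cache
{
public:
    // Returns true the first time an unscaled (r, g, b) color is seen, false on repeats.
    bool mark_new(int r, int g, int b)
    {
        assert(r >= 0 && r < 32 && g >= 0 && g < 32 && b >= 0 && b < 32);

        // Pack the three 5-bit coordinates into a single key.
        const std::uint32_t key = static_cast<std::uint32_t>(r) |
                                  (static_cast<std::uint32_t>(g) << 5) |
                                  (static_cast<std::uint32_t>(b) << 10);

        // insert() reports whether the key was actually added.
        return m_seen.insert(key).second;
    }

private:
    std::unordered_set<std::uint32_t> m_seen;
};
```

A caller would skip the (expensive) error evaluation whenever mark_new() returns false. Since many different correction factors quantize to the same 555 or 444 color, this check can fire often.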

// total_perms_to_try: 1-64 seems good enough (out of 165)
void etc1_optimizer::compute_internal_cluster_fit_fast(uint total_perms_to_try)
{
    // Establish a baseline by trying the subblock's average color, unless it's already the best solution.
    if ((!m_best_solution.m_valid) || ((m_br != m_best_solution.m_coords.m_unscaled_color.r) || (m_bg != m_best_solution.m_coords.m_unscaled_color.g) || (m_bb != m_best_solution.m_coords.m_unscaled_color.b)))
    {
        evaluate_solution(etc1_solution_coordinates(m_br, m_bg, m_bb, 0, m_pParams->m_use_color4), m_trial_solution, &m_best_solution);
    }

    // Nothing left to do if the baseline is already perfect, or if there's no valid solution to refine.
    if ((m_best_solution.m_error == 0) || (!m_best_solution.m_valid))
        return;

    for (uint i = 0; i < total_perms_to_try; i++)
    {
        int delta_sum_r = 0, delta_sum_g = 0, delta_sum_b = 0;

        const int *pInten_table = g_etc1_inten_tables[m_best_solution.m_coords.m_inten_table];
        const color_quad_u8 base_color(m_best_solution.m_coords.get_scaled_color());

        // Number of pixels assigned to each of the 4 selectors in this distribution.
        const uint8 *pNum_selectors = g_cluster_fit_order_tab[i].m_v;

        for (uint q = 0; q < 4; q++)
        {
            const int yd_temp = pInten_table[q];

            // Accumulate the intensity deltas actually applied after clamping to [0,255].
            delta_sum_r += pNum_selectors[q] * (math::clamp<int>(base_color.r + yd_temp, 0, 255) - base_color.r);
            delta_sum_g += pNum_selectors[q] * (math::clamp<int>(base_color.g + yd_temp, 0, 255) - base_color.g);
            delta_sum_b += pNum_selectors[q] * (math::clamp<int>(base_color.b + yd_temp, 0, 255) - base_color.b);
        }

        // A zero correction would just retry the average color.
        if ((!delta_sum_r) && (!delta_sum_g) && (!delta_sum_b))
            continue;

        // Average delta over the subblock's 8 pixels.
        const float avg_delta_r_f = static_cast<float>(delta_sum_r) / 8;
        const float avg_delta_g_f = static_cast<float>(delta_sum_g) / 8;
        const float avg_delta_b_f = static_cast<float>(delta_sum_b) / 8;

        // optimal_block_color = avg_input - avg_inten_delta, quantized to 555 or 444.
        const int br1 = math::clamp<int>(static_cast<uint>((m_avg_color[0] - avg_delta_r_f) * m_limit / 255.0f + .5f), 0, m_limit);
        const int bg1 = math::clamp<int>(static_cast<uint>((m_avg_color[1] - avg_delta_g_f) * m_limit / 255.0f + .5f), 0, m_limit);
        const int bb1 = math::clamp<int>(static_cast<uint>((m_avg_color[2] - avg_delta_b_f) * m_limit / 255.0f + .5f), 0, m_limit);

#if 0
        // Debug output, disabled by default.
        printf("Second refinement trial %u, avg_delta %f %f %f\n", i, avg_delta_r_f, avg_delta_g_f, avg_delta_b_f);
#endif

        evaluate_solution(etc1_solution_coordinates(br1, bg1, bb1, 0, m_pParams->m_use_color4), m_trial_solution, &m_best_solution);

        // Can't do better than zero error.
        if (m_best_solution.m_error == 0)
            break;
    }
}

static const struct { uint8 m_v[4]; } g_cluster_fit_order_tab[BASISLIB_CLUSTER_FIT_ORDER_TABLE_SIZE] =
{
    { 0, 0, 0, 8 }, { 0, 5, 2, 1 }, { 0, 6, 1, 1 }, { 0, 7, 0, 1 }, { 0, 7, 1, 0 },
    { 0, 0, 8, 0 }, { 0, 0, 3, 5 }, { 0, 1, 7, 0 }, { 0, 0, 4, 4 }, { 0, 0, 2, 6 },
    { 0, 0, 7, 1 }, { 0, 0, 1, 7 }, { 0, 0, 5, 3 }, { 1, 6, 0, 1 }, { 0, 0, 6, 2 },
    { 0, 2, 6, 0 }, { 2, 4, 2, 0 }, { 0, 3, 5, 0 }, { 3, 3, 1, 1 }, { 4, 2, 0, 2 },
    { 1, 5, 2, 0 }, { 0, 5, 3, 0 }, { 0, 6, 2, 0 }, { 2, 4, 1, 1 }, { 5, 1, 0, 2 },
    { 6, 1, 1, 0 }, { 3, 3, 0, 2 }, { 6, 0, 0, 2 }, { 0, 8, 0, 0 }, { 6, 1, 0, 1 },
    { 0, 1, 6, 1 }, { 1, 6, 1, 0 }, { 4, 1, 3, 0 }, { 0, 2, 5, 1 }, { 5, 0, 3, 0 },
    { 5, 3, 0, 0 }, { 0, 1, 5, 2 }, { 0, 3, 4, 1 }, { 2, 5, 1, 0 }, { 1, 7, 0, 0 },
    { 0, 1, 4, 3 }, { 6, 0, 2, 0 }, { 0, 4, 4, 0 }, { 2, 6, 0, 0 }, { 0, 2, 4, 2 },
    { 0, 5, 1, 2 }, { 0, 6, 0, 2 }, { 3, 5, 0, 0 }, { 0, 4, 3, 1 }, { 3, 4, 1, 0 },
    { 4, 3, 1, 0 }, { 1, 5, 0, 2 }, { 0, 3, 3, 2 }, { 1, 4, 1, 2 }, { 0, 4, 2, 2 },
    { 2, 3, 3, 0 }, { 4, 4, 0, 0 }, { 1, 2, 4, 1 }, { 0, 5, 0, 3 }, { 0, 1, 3, 4 },
    { 1, 5, 1, 1 }, { 1, 4, 2, 1 }, { 1, 3, 2, 2 }, { 5, 2, 1, 0 }, { 1, 3, 3, 1 },
    { 0, 1, 2, 5 }, { 1, 1, 5, 1 }, { 0, 3, 2, 3 }, { 2, 5, 0, 1 }, { 3, 2, 2, 1 },
    { 2, 3, 0, 3 }, { 1, 4, 3, 0 }, { 2, 2, 1, 3 }, { 6, 2, 0, 0 }, { 1, 0, 6, 1 },
    { 3, 3, 2, 0 }, { 7, 1, 0, 0 }, { 3, 1, 4, 0 }, { 0, 2, 3, 3 }, { 0, 4, 1, 3 },
    { 0, 4, 0, 4 }, { 0, 1, 0, 7 }, { 2, 0, 5, 1 }, { 2, 0, 4, 2 }, { 3, 0, 2, 3 },
    { 2, 2, 4, 0 }, { 2, 2, 3, 1 }, { 4, 0, 3, 1 }, { 3, 2, 3, 0 }, { 2, 3, 2, 1 },
    { 1, 3, 4, 0 }, { 7, 0, 1, 0 }, { 3, 0, 4, 1 }, { 1, 0, 5, 2 }, { 8, 0, 0, 0 },
    { 3, 0, 1, 4 }, { 4, 1, 1, 2 }, { 4, 0, 2, 2 }, { 1, 2, 5, 0 }, { 4, 2, 1, 1 },
    { 3, 4, 0, 1 }, { 2, 0, 3, 3 }, { 5, 0, 1, 2 }, { 5, 0, 0, 3 }, { 2, 4, 0, 2 },
    { 2, 1, 4, 1 }, { 4, 0, 1, 3 }, { 2, 1, 5, 0 }, { 4, 2, 2, 0 }, { 4, 0, 4, 0 },
    { 1, 0, 4, 3 }, { 1, 4, 0, 3 }, { 3, 0, 3, 2 }, { 4, 3, 0, 1 }, { 0, 1, 1, 6 },
    { 1, 3, 1, 3 }, { 0, 2, 2, 4 }, { 2, 0, 2, 4 }, { 5, 1, 1, 1 }, { 3, 0, 5, 0 },
    { 2, 3, 1, 2 }, { 3, 0, 0, 5 }, { 0, 3, 1, 4 }, { 5, 0, 2, 1 }, { 2, 1, 3, 2 },
    { 2, 0, 6, 0 }, { 3, 1, 3, 1 }, { 5, 1, 2, 0 }, { 1, 0, 3, 4 }, { 1, 1, 6, 0 },
    { 4, 0, 0, 4 }, { 2, 0, 1, 5 }, { 0, 3, 0, 5 }, { 1, 3, 0, 4 }, { 4, 1, 2, 1 },
    { 1, 2, 3, 2 }, { 3, 1, 0, 4 }, { 5, 2, 0, 1 }, { 1, 2, 2, 3 }, { 3, 2, 1, 2 },
    { 2, 2, 2, 2 }, { 6, 0, 1, 1 }, { 1, 2, 1, 4 }, { 1, 1, 4, 2 }, { 3, 2, 0, 3 },
    { 1, 2, 0, 5 }, { 1, 0, 7, 0 }, { 3, 1, 2, 2 }, { 1, 0, 2, 5 }, { 2, 0, 0, 6 },
    { 2, 1, 1, 4 }, { 2, 2, 0, 4 }, { 1, 1, 3, 3 }, { 7, 0, 0, 1 }, { 1, 0, 0, 7 },
    { 2, 1, 2, 3 }, { 4, 1, 0, 3 }, { 3, 1, 1, 3 }, { 1, 1, 2, 4 }, { 2, 1, 0, 5 },
    { 1, 0, 1, 6 }, { 0, 2, 1, 5 }, { 0, 2, 0, 6 }, { 1, 1, 1, 5 }, { 1, 1, 0, 6 }
};

Wednesday, September 28, 2016


So far this is a nice looking library, and I've heard it reliably handles 16-bit/component .PNG's :

An interesting ETC1/2 encoding test vector

Here's the 4x4 test vector image (zoomed in 32X for ease of visibility), provided to me by John Brooks and Victor Reynolds at Blue Shift:

Red pixel: 255,0,0
Blue pixel: 0,0,255

Seems simple enough, right? Here's what happens with the various encoders (in non-perceptual mode if the encoder supports the flag), using up to date versions from early last week, and non-perceptual RGB avg. metrics for both PSNR and SSIM:

etcpak (PSNR: 15.612, SSIM: 0.265737):

Red pixel: 93,60,93
Blue pixel: 51,18,51

etc2comp ETC1 (PSNR: 17.471, SSIM: 0.372446):

Red pixel: 111,60,60
Blue pixel: 60,60,111

Intel ISPC (PSNR: 24.968, SSIM: 0.587142):

Red pixel: 234,47,47
Blue pixel: 47,47,234

basislib_etc1 from yesterday (PSNR: 19.987, SSIM: 0.511227):

Red pixel: 149,47,47
Blue pixel: 47,47,149

etc2comp ETC2 (PSNR: 19.779, SSIM: 0.517508):

Red pixel: 255, 0, 0
Blue pixel: 64,64,98

This is an example of a well-tuned ETC1 encoder (Intel's) holding its own vs. etc2comp in ETC2 mode.

Want a little challenge? Try to figure out how Intel's encoder produced the best output.

John Brooks, the lead on etc2comp, told me that BSI is working with that test image because it's a known low-quality encoding pattern for etc2comp. It wasn't in their test corpus, so the PSNR of 17 & 19 should improve with future etc2comp iterations.

I've improved basislib's handling of this test vector, but the results now need an optimization pass. I've prototyped a version of squish's total ordering method in ETC1, by applying the equations in the remarks in rg_etc1.cpp's code. Amazingly, it competed against rg_etc1's current algorithm for quality on my first try of the new method, but it's slower.

Tuesday, September 27, 2016

How to use crunch's GPU block encoder test vector generator

The -corpus_gen option selects a different mode of operation from crunch's usual texture file conversion role. It causes the tool to crawl through a directory and load every .PNG file there. It then randomly selects a percentage of the 4x4 pixel blocks from each image and appends the results into one or more 4096x4096 output images. These output images can then be used as test vectors to compare different block encoders.

crunch -corpus_gen -deep .035 -width 4096 -height 4096 -in J:\dev\test_images\*.png

You can specify multiple -in arguments, and -in @file.txt loads a textual listing file of files/directories to load or scan.

The -corpus_test option can be used to compare the different DXT encoders supported by crunch, using images generated using -corpus_gen.

Here's a very zoomed in example from the test vector generator:

Notice how the blocks are sorted, using the sum of the R, G, and B standard deviations as the key.
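
A sort key like this is cheap to compute per block. Here's a rough sketch of one way to do it. The names (rgb8, block_sort_key) are invented for this example, and using the population standard deviation is an assumption, since crunch's exact formula isn't shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct rgb8 { std::uint8_t r, g, b; };

// Returns the sum of the R, G, and B (population) standard deviations of a 4x4 block.
float block_sort_key(const rgb8 pixels[16])
{
    assert(pixels != nullptr);

    float sum[3] = {0.0f, 0.0f, 0.0f};
    float sum_sq[3] = {0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 16; i++)
    {
        const float c[3] = {static_cast<float>(pixels[i].r), static_cast<float>(pixels[i].g), static_cast<float>(pixels[i].b)};
        for (int k = 0; k < 3; k++)
        {
            sum[k] += c[k];
            sum_sq[k] += c[k] * c[k];
        }
    }

    float key = 0.0f;
    for (int k = 0; k < 3; k++)
    {
        const float mean = sum[k] / 16.0f;
        float variance = sum_sq[k] / 16.0f - mean * mean;
        if (variance < 0.0f)
            variance = 0.0f; // guard against tiny negative values from rounding
        key += std::sqrt(variance);
    }
    return key;
}
```

Flat blocks get a key of 0 and end up at one end of the sorted output, while noisy or high-contrast blocks land at the other end.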

Sunday, September 25, 2016

More on SSIM

This paper is referenced in the SSIM article on Wikipedia:

"A comprehensive assessment of the structural similarity index"
"In this paper, it is shown, both empirically and analytically, that the index is directly related to the conventional, and often unreliable, mean squared error. In the first evaluation, the two metrics are statistically compared with one another. Then, in the second, a pair of functions that algebraically connects the two is derived. These results suggest a much closer relationship between the structural similarity index and mean squared error."
"This research, however, appears to be the first to directly consider the statistical relationships between the two methods. As well, this work develops a pair of mathematical functions that directly link the two. Given these findings, one is left to question whether the structural similarity index is ready for widespread adoption."
Interesting! I get the feeling there's more to SSIM than meets the eye. Unfortunately, this paper is behind a paywall. Another quote from the paper:
"These findings suggest a reasonably significant level of correlation between the SSIM and MSE. Values range from r = 0.6364 to r = 1.0000, with an average of r = 0.9116 and a variance of 0.007. An average this large, along with a small variance, suggests that most of the correlations are decidedly significant. Clearly, when ordering coded images, the SSIM and MSE often choose similar arrangements. Results such as this are likely a sign of a deeper relationship between the two methods."
Hmm, okay. So MSE and SSIM are highly correlated. The paper even has simple algorithms to convert between MSE<->SSIM. Perhaps I could use these algorithms to help optimize my SSIM code. (Just joking.) From the conclusion:
"Collectively, these findings suggest that the performance of the SSIM is perhaps much closer to that of the MSE than some might claim. Consequently, one is left to question the legitimacy of many of the applications of the SSIM."
Got it. Here's another interesting paper, this one not behind a paywall:

"Mystery behind similarity measures MSE and SSIM"

Some quotes:
"We see that it is based on the same sample moments and correlation coefficient as MSE. So this is the first observation/property or mystery revealed about MSE and SSIM: both measures are composed of the same parameters which are only combined in a different way."
"So the third observation for SSIM is its instability around zero point (0,0) and the fourth one – it can be used only for data of the same sign. The authors of SSIM solve these problems by introducing small constants and restricting the usage to non-negative data only, respectively."
"The fifth observation for Dice measure and thus for SSIM too is that it depends on the absolute values of input parameters. First, it is insensitive at all if one of the parameters is equal 0. Secondly, its sensitivity is decreasing by the increase of absolute parameter values."
Hmm, none of that sounds great to me. They go on to introduce their own metric they call CMSC, and claim "all proposed measures are free of drawbacks of MSE and SSIM and thus are more suitable as objective similarity/quality measures not only for the images but any signals."

John Brooks at Blue Shift experimented with using SSIM in his new ETC1/2 encoder, etc2comp. In a conversation about SSIM, he said that:
"It [SSIM] becomes insensitive in high-contrast areas. SSIM is all about matching contrast & structure. But Block Truncation Coding by its nature is increasing contrast because it posterizes color transitions to 4 selector values. This made the encoder freak out and try to reduce contrast to compensate, making the encoding look crappy. I think it might be the right tool for high-level jobs, but was a poor tool for driving low-level encoder behavior."
"BTC trades 16 shades for 4 which means sharper transitions and more contrast when measured against the original. It also usually means less structure than the original due to posterizing 16-to-4. But neither artifact can be controlled by the encoder as they are a result of the encoding, so it's very hard to navigate the encoding search space when SSIM is so outside its design parameters."
Sounds pretty reasonable to me. I'm going to be doing some testing using an ETC1 encoder optimized for SSIM very soon. Let's see what happens.

Image error metrics

While developing and refining crunch I used a matrix of statistics like this:

RGB Total   Error: Max:  73, Mean: 17.404, MSE: 176.834, RMSE: 13.298, PSNR: 25.655, SSIM: 0.000000
RGB Average Error: Max:  73, Mean: 5.801, MSE: 58.945, RMSE: 7.678, PSNR: 30.426, SSIM: 0.907993
Luma        Error: Max:  64, Mean: 4.640, MSE: 37.593, RMSE: 6.131, PSNR: 32.380, SSIM: 0.945000
Red         Error: Max:  69, Mean: 5.387, MSE: 52.239, RMSE: 7.228, PSNR: 30.951, SSIM: 0.921643
Green       Error: Max:  70, Mean: 5.052, MSE: 48.298, RMSE: 6.950, PSNR: 31.291, SSIM: 0.934051
Blue        Error: Max:  73, Mean: 6.966, MSE: 76.296, RMSE: 8.735, PSNR: 29.306, SSIM: 0.868285

I computed these stats from a PNG image uploaded by @dougallj showing the progress he's been making on his experimental ETC1 encoder with kodim18, originally from here:

The code that computes this stuff is actually used by the DXT1 front-end to determine how the 8x8 "macroblocks" should be tiled.

The per-channel stuff is useful for debugging, and for tuning the encoder's perceptual RGB weights (which are only used when the compressor is in perceptual mode). Per-channel stats are also useful when trying to get a rough idea what weights a closed source block encoder uses.

Here's a useful PCA paper I found while writing HW1's renderer

I used this technique in a real-time GPU DXT1 encoder I wrote around 10 years ago:

"Candid Covariance-Free Incremental Principal Component Analysis"

With this approach you can compute a decent-enough PCA in a few lines of shader code.
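
To give a feel for how small this is, here's a CPU-side sketch of the CCIPCA update for just the first principal axis of a set of RGB samples. It is not the shader from HW1: the names are hypothetical, the mean is precomputed in a separate pass instead of being tracked incrementally, and the paper's amnesic parameter is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct vec3 { float x, y, z; };

static float dot3(const vec3& a, const vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Estimates the first principal axis of `count` samples with a single CCIPCA pass.
// Returns a unit vector, or (0,0,0) if every sample equals the mean.
vec3 ccipca_principal_axis(const vec3* samples, std::size_t count)
{
    assert(samples != nullptr && count > 0);

    // CCIPCA expects zero-mean input, so compute the mean first.
    vec3 mean = {0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < count; i++)
    {
        mean.x += samples[i].x; mean.y += samples[i].y; mean.z += samples[i].z;
    }
    const float inv_count = 1.0f / static_cast<float>(count);
    mean.x *= inv_count; mean.y *= inv_count; mean.z *= inv_count;

    vec3 v = {0.0f, 0.0f, 0.0f};
    std::size_t n = 0;
    for (std::size_t i = 0; i < count; i++)
    {
        const vec3 u = {samples[i].x - mean.x, samples[i].y - mean.y, samples[i].z - mean.z};
        const float v_len = std::sqrt(dot3(v, v));
        if (v_len == 0.0f)
        {
            // No estimate yet: start from this sample.
            v = u;
            n = 1;
            continue;
        }

        // v(n) = ((n-1)/n) v(n-1) + (1/n) u (u . v(n-1) / |v(n-1)|)
        n++;
        const float w = dot3(u, v) / v_len;
        const float keep = static_cast<float>(n - 1) / static_cast<float>(n);
        const float add = 1.0f / static_cast<float>(n);
        v.x = keep * v.x + add * w * u.x;
        v.y = keep * v.y + add * w * u.y;
        v.z = keep * v.z + add * w * u.z;
    }

    const float len = std::sqrt(dot3(v, v));
    if (len > 0.0f)
    {
        v.x /= len; v.y /= len; v.z /= len;
    }
    return v;
}
```

There's no covariance matrix and no eigen solver anywhere in that loop, which is why this kind of update maps so well to shader code.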

HW1 used this encoder to compress all of the GPU splatted terrain textures into a GPU texture cache. One of my coworkers, Colt McAnlis, designed and wrote the game's amazing terrain texture caching system.


Alright, I'm implementing SSIM. There are like 30 different implementations on the web, and most either rely on huge dependencies like OpenCV or have crappy licenses. So which one do I compare mine to? The situation with SSIM seems worse than with PSNR. There are just so many variations on how to compute this thing.

I'm choosing this implementation for comparison purposes, because I already have the fundamental image processing primitives handy:

On Multi-Scale SSIM: I've been given conflicting information on whether or not this is actually useful to me. Let's first try regular SSIM.
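
For reference, the core SSIM formula is small once you strip away the windowing. The sketch below treats the whole input as one window, which is a simplification: it is not the implementation compared in this post, and real implementations typically compute this per pixel over a small (often Gaussian-weighted) window and average the results. The function name is made up, and the constants are the usual K1 = 0.01 and K2 = 0.03 for 8-bit data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// SSIM between two 8-bit signals, using global statistics over a single window.
double ssim_single_window(const std::uint8_t* x, const std::uint8_t* y, std::size_t count)
{
    assert(x != nullptr && y != nullptr && count > 0);

    // Stabilizing constants: C1 = (K1*L)^2, C2 = (K2*L)^2, with L = 255.
    const double C1 = (0.01 * 255.0) * (0.01 * 255.0);
    const double C2 = (0.03 * 255.0) * (0.03 * 255.0);

    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < count; i++)
    {
        const double a = x[i], b = y[i];
        sx += a; sy += b;
        sxx += a * a; syy += b * b; sxy += a * b;
    }

    const double n = static_cast<double>(count);
    const double mx = sx / n, my = sy / n;
    const double var_x = sxx / n - mx * mx;  // population variance of x
    const double var_y = syy / n - my * my;  // population variance of y
    const double cov = sxy / n - mx * my;    // population covariance

    return ((2.0 * mx * my + C1) * (2.0 * cov + C2)) /
           ((mx * mx + my * my + C1) * (var_x + var_y + C2));
}
```

Even in this stripped-down form you can see where implementations diverge: window shape and size, biased vs. unbiased variance, and how the per-window values get pooled.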

For testing, I compared my implementation, using my own float image processing code, vs. the code above that uses doubles and OpenCV. To generate some distorted test images, I loaded kodim18 into Paint Shop Pro X8 and saved to various JPEG quality levels from 1-99. I then ran the two tools and graphed the results in Excel:

The X axis represents the various quality levels, from highest to lowest quality. The 12 PSP JPEG quality levels tested are 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 99. Y axis is SSIM.

Thanks to John Brooks at Blue Shift for feedback on this post.

Friday, September 23, 2016

About the HW1 codebase having "too many globals"

First off, this project was a death march. What Paul Bettner (formerly Ensemble, now at Playful Corp) publicly said years ago is true: Ensemble Studios was addicted to crunching. I lived, breathed, and slept that codebase. We had demos every 4-8 weeks or something. This time in my life was beyond intense. I totally understand why Microsoft shut us down, because we really needed to be put out of our collective misery.

I was more or less addicted to crunch at Ensemble. I remember working so much, and being so consumed with work on this game, that the muscles in my neck would basically "lock up". Working on all those demo milestones was a 3 year adventure. That team was so amazing, and we all got along so well. I could never do it again like that unless lives depended on it.

Anyhow, the engine/tools team on that project built a low-level, very 360-specific "game OS" in C++ for the simulation team. Why did we build a whole new engine from the ground up? Because the Age3 engine just completely melted down after Billy Khan and I ported it to 360. (That was 4 months of the most painful, mind numbing full-time coding, porting and debugging I've ever done.)

The Age3 360 port ran at ~7 FPS, on a single thread, and took 3-5 minutes to load. After I got the net code working on 360 (no easy task, because Age3 used the Win32 window message-based Winsock API's), we played a few brutally slow multiplayer games on the 360. It was pretty bad.

Of course, we could have spent months trying to optimize and thread this engine to get it above 30Hz. But Billy and I had just rolled off Age3, where we spent months working on optimizing and tuning the engine to run well on PC's. I also had a bunch of new 360-specific rendering features I wanted to implement, and doing this in the old PC-centric codebase would have been a nightmare.

The HW1 engine consisted of many global managers, very heavy use of synchronous/asynchronous cross-thread messaging, and lightweight platform-specific wrappers built on top of the Win32 and D3D API's. The renderer, animation, sound, streaming, decompression, networking, and overlapped I/O systems were heavily multithreaded. (Overlapped I/O actually worked properly on Xbox 360's OS.) We used 360-specific D3D9 extensions that allowed us to compose command buffers from multiple threads, and we carefully managed all GPU physical memory ourselves just like a driver would. There are lots of other cool things we did on HW1 that I'll cover here on rainy days.

The original idea for using message passing for most of our parallelism in our next engine was from Bill Jackson, now CCO at Boss Fight Entertainment in Dallas. I implemented it and refined the idea before I really understood how useful it was. It was inspired by message passing and concurrency in Erlang. It worked well and was really fun to use, but was hard to debug. Something like 5,000 intra and inter thread messages were involved in loading a map in the background while Scaleform UI was playing back on its own core. We also had a simple job system, but most of our concurrency was implemented using message passing. (See this article on a similar Message Passing system by Nicholas Vining.)

We tried to follow our expression of the Unix philosophy on this game: Lots of little objects, tools, and services interacting in an ecosystem. Entire "game OS" services were designed to only send/receive and process messages on particular 360 CPU cores.

My manager and I created this powerful, highly abstracted virtual file I/O system with streaming support. The entire game (except the 360 executable) could quickly load over the network using TCP/IP, or off the hard drive or DVD using package files. Hot reloading was supported over the network, so artists could watch their textures, models, animations, terrain, and lights change in real-time. We had the entire company (artists, designers, programmers) using this system.

Something like singletons made no sense for the managers. These services were abstracting away one specific global piece of hardware or global C API, so why bother. I've been told the C-based Halo codebases "followed not strictly the same philosophy, but of the same mind".

This codebase was very advanced for its time. It made the next series of codebases I learned and enhanced feel 5-10 years behind the times. I don't talk about it because this entire period of time in my life was so intense.

Wednesday, September 21, 2016

ETC1/2 vs. DXT1 texture compression benchmark

I'm using the same testing tool, dataset and methodology explained in my ETC1/2 benchmark. In this benchmark, I've added in my vanilla (non-RDO/CRN) DXT1 block encoder (really, its DXT1 endpoint optimizer class), which is derived from crunch's.

In 2009 my DXT1 encoder was as good as or better than all available DXT1 compressors that I tested it against, such as squish, ATI Compressonator, NVidia's original and old NVDXT library, and D3DX's. Not sure how much change has occurred in DXT1 compression since that time. I can also throw in other DXT1 encoders if there's interest.

RGB error metrics:

Here's just ETC2 vs. DXT1:

This is fascinating!

Next up: BC7.

Tuesday, September 20, 2016

Let's try DXT1 vs. ETC1/2 benchmarks

John Brooks at Blue Shift brought up this idea earlier. I think it's a great idea! I love good old DXT1 (or "BC1" as some call it). Let's see how ETC2 in particular compares against my old favorite.

Monday, September 19, 2016

Important note about PSNR

Yes, I know PSNR (and RMSE, etc.) is not an ideal quality metric for image and video compression. Keep in mind there is a large diversity of data stored as textures in modern games and applications: Albedo maps, specular maps, gloss maps, normal maps, light maps, various engine-specific multichannel control maps, 2D sprites, transparency (alpha) maps, satellite photos, cubemaps, etc. And let's not even talk about how anisotropic filtering, shading, normal mapping, shadowing, etc. impacts perceived quality once these textures are mapped onto 3D meshes.

RGB and Luma PSNR are simple and, in my experience writing and tuning crunch, reliable enough for practical usage. I'm not writing an image or video compressor, I'm writing a texture compressor.

How to compute PSNR (from an old Berkeley course)

This was part of Berkeley's CS294 Fall '97 courseware on "Multimedia Systems and Applications", but it got moved and disappeared. It was a useful little page so I'm duplicating it here for reference purposes:

Image Quality Computation


Signal-to-noise (SNR) measures are estimates of the quality of a reconstructed image compared with an original image. The basic idea is to compute a single number that reflects the quality of the reconstructed image. Reconstructed images with higher metrics are judged better. In fact, traditional SNR measures do not equate with human subjective perception. Several research groups are working on perceptual measures, but for now we will use the signal-to-noise measures because they are easier to compute. Just remember that higher measures do not always mean better quality.

The actual metric we will compute is the peak signal-to-reconstructed image measure which is called PSNR. Assume we are given a source image f(i,j) that contains N by N pixels and a reconstructed image F(i,j) where F is reconstructed by decoding the encoded version of f(i,j). Error metrics are computed on the luminance signal only so the pixel values f(i,j) range between black (0) and white (255).

First you compute the mean squared error (MSE) of the reconstructed image as follows:

MSE = Σ [f(i,j) - F(i,j)]^2 / N^2

The summation is over all pixels. The root mean squared error (RMSE) is the square root of MSE. Some formulations use N rather than N^2 in the denominator for MSE.

PSNR in decibels (dB) is computed by using:

PSNR = 20 log10(255 / RMSE)

Typical PSNR values range between 20 and 40. They are usually reported to two decimal points (e.g., 25.47). The actual value is not meaningful, but the comparison between two values for different reconstructed images gives one measure of quality. The MPEG committee used an informal threshold of 0.5 dB PSNR to decide whether to incorporate a coding optimization because they believed that an improvement of that magnitude would be visible.

Some definitions of PSNR use 255^2/MSE rather than 255/RMSE. Either formulation will work because we are interested in the relative comparison, not the absolute values. For our assignments we will use the definition given above.
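
As a quick illustration of the definitions above, here's a minimal sketch for a single 8-bit channel. The function names are made up for this example, and returning infinity for identical images is just one possible convention (some tools report a large capped value instead).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Mean squared error between two 8-bit signals of `count` samples each.
double compute_mse(const std::uint8_t* original, const std::uint8_t* reconstructed, std::size_t count)
{
    assert(original != nullptr && reconstructed != nullptr && count > 0);

    double sum_sq = 0.0;
    for (std::size_t i = 0; i < count; i++)
    {
        const double diff = static_cast<double>(original[i]) - static_cast<double>(reconstructed[i]);
        sum_sq += diff * diff;
    }
    return sum_sq / static_cast<double>(count);
}

// PSNR in dB for 8-bit data: 20 * log10(255 / RMSE).
double psnr_from_mse(double mse)
{
    assert(mse >= 0.0);
    if (mse == 0.0)
        return std::numeric_limits<double>::infinity(); // identical images
    return 20.0 * std::log10(255.0 / std::sqrt(mse));
}
```

Plugging in the MSE of 8.220 from the basislib kodim03 results earlier on this page gives about 38.98 dB, which matches the PSNR column in that log.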

The other important technique for displaying errors is to construct an error image which shows the pixel-by-pixel errors. The simplest computation of this image is to create an image by taking the difference between the reconstructed and original pixels. These images are hard to see because zero difference is black and most errors are small numbers which are shades of black. The typical construction of the error image multiplies the difference by a constant to increase the visible difference and translates the entire image to a gray level. The computation is:

E(i,j) = 2 [f(i,j) - F(i,j)] + 128

You can adjust the constant (2) or the translation (128) to change the image. Some people use white (255) to signify no error and difference from white as an error which means that darker pixels are bigger errors.
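A minimal sketch of the error image computation described above, assuming images stored as lists of rows of 0-255 values. The function name, the sign convention (original minus reconstructed), and the clamp to [0, 255] are assumptions of this example, not something the text specifies.

```python
def error_image(original, reconstructed, gain=2, offset=128):
    """Scale the per-pixel difference by `gain`, shift it by `offset`, clamp to [0, 255]."""
    return [
        [max(0, min(255, gain * (f - F) + offset)) for f, F in zip(row_f, row_F)]
        for row_f, row_F in zip(original, reconstructed)
    ]
```

With the default gain and offset, a pixel with no error maps to mid gray (128), and any difference of 64 or more saturates to black or white.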



ETC1 and ETC1/2 Texture Compressor Benchmark

(This is a "sticky" blog post. I'll keep this page up to date as interesting or important events happen. Examples: When a new practical ETC encoder gets released, or when ETC codecs are significantly updated.)

The main purpose behind this particular benchmark is to conduct a deep survey of every known practical ETC1/2 encoder, so I can be sure basislib's ETC1 and universal encoders are very high quality. I want to closely understand where this space is at, and where it's going. This is exactly what I did while writing crunch. I need a very high quality, stable, and scalable ETC1/2 block parameter optimizer that works with potentially many thousands of input pixels. rg_etc1's internal ETC1 optimizer is the only thing I have right now that solves this problem.

I figured this data would be very useful to other developers, so here's a highest achievable quality benchmark of the following four practical ETC1/2 compressors:

  • etc2comp: A full-featured ETC1/2 encoder developed by engineers at Blue Shift and sponsored by Google. Supports both RGB and perceptual error metrics.
  • etcpak: Extremely fast, ETC1 and partial ETC2 (planar blocks only), RGB error metrics only
  • Intel ISPC Texture Compressor: A very fast ETC1 compressor, RGB error metrics only
  • basislib ETC1: An updated version of my open source ETC1 block encoder, rg_etc1. Supports both RGB and perceptual error metrics (unlike rg_etc1).

The test files were ~1,500 .PNG textures from the larger test corpus I used to tune crunch. Each texture was compressed using each encoder, then unpacked using rg_etc1 modified to support the 3 new ETC2 block types (planar, T, and H).

Benchmarking like this is surprisingly tricky. The APIs to all the encoders are different, most are not well documented, and even exactly how you compute PSNR isn't super well defined (there are multiple definitions, each with slightly different equations). Please see the "developer feedback" notes below.

I've sanity checked these results by writing .KTX files, converting them to .PNG using Mali's GPU Texture Compression Tool (which thankfully worked, because the .KTX format is iffy when it comes to interchange), then computing PSNRs using ImageMagick's "compare" tool. Thanks to John Brooks at Blue Shift for helping me verify the data for etc2comp, and helping me track down and fix the effort=100.0 issue in the first release of this benchmark.

I also have performance statistics, which I'll cover in a future post. The perf. data I have for etcpak isn't usable for accurate timing right now, because the etcpak code I'm calling is only single threaded and includes some I/O.

This first graph compares all four compressors in ETC1 mode, using RGB (average) PSNR.

Error Metric: Avg. RGB


The next graph enables ETC2 support in the encoders that support it, currently just etc2comp and etcpak:


etc2comp in ETC2 mode really shines at the lower quality levels. Below approximately 32 dB, it appears the minimum expected quality improvement from ETC2 is significant. Above ~32 dB, the minimum expected improvement drops down a bit, closer to ETC1's quality level. (Which seems to make sense, as ETC2 was designed to better handle blocks that ETC1 is weak at.)

etcpak doesn't support T and H blocks, so it suffers a lot here. This is why it's very important to pay attention to benchmarks like this one, because quality (even in ETC2-capable or aware compressors) can vary widely between libraries.

Error Metric: Perceptual


Developer Feedback

  • ISPC: I had to copy ispc.exe into your project directory for it to build in my VS2015 solution. That brought down the "out of the box" experience of getting your stuff into my solution. On the upside, your API was dead simple to figure out and was very "pure" - as it should be. (However, you should rename "stride" to "stride_in_bytes". I've seen at least one programmer get it wrong and I had to help them.)
  • etcpak: Can you add a single API to do compression with multithreading, like etc2comp? And have it return a double of how much time it takes to actually execute, excluding file I/O stuff. Your codec is so fast that I/O times will seriously skew the statistics.
  • etc2comp: Hey, ETC1 is still extremely important. Intel, basislib, and rg_etc1 all have higher ETC1 quality than etc2comp. Also, could you add some defines like this to etc.h so developers know how to correctly call the public Etc::Encode() API:



  • 9/20: I fixed etc2comp's "effort" setting, added Intel's compressor, and removed the perceptual graphs (for now) to speed things up.
  • 9/20: Changed title and purpose of this post to a sticky benchmark page. I'm now moving into the public texture compression benchmarking space - why not? It's fun!

Saturday, September 17, 2016

Let's evaluate the current state of ETC1/2 compression libraries

For regular block encoders (not RDO or crunch-style systems), I think what I need to do is to plot this like I would a lossless Pareto Frontier, with the Y axis being some measure of quality and the X axis being encoding speed across a wide range of test textures. Perhaps I can normalize the quality metric achieved by each encoder at its various settings vs. the highest achievable quality, for each image.
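Here's a rough sketch of that idea, assuming each measurement is a hypothetical (name, encode seconds, quality) tuple where lower time and higher quality are better. The function names and data layout are made up for illustration and aren't taken from any of the encoders discussed here.

```python
def pareto_frontier(points):
    """Keep the (name, seconds, quality) entries that no other entry beats.

    An entry is dropped when another entry is at least as fast and at least as
    good, and strictly better on one of the two axes.
    """
    frontier = []
    for name, t, q in points:
        dominated = any(
            (t2 <= t and q2 >= q) and (t2 < t or q2 > q)
            for _, t2, q2 in points
        )
        if not dominated:
            frontier.append((name, t, q))
    return sorted(frontier, key=lambda p: p[1])

def normalize_quality(psnr_by_encoder):
    """Scale each encoder's PSNR by the best PSNR achieved on the same image."""
    best = max(psnr_by_encoder.values())
    return {name: value / best for name, value in psnr_by_encoder.items()}
```

Normalizing per image first and then averaging keeps easy images (where every encoder scores high) from drowning out the hard ones.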

As far as I can tell so far, nobody's beating the quality/performance of etcpak, at its performance point. It's going to be fascinating to compare etcpak vs. ETC2Comp. Let's see how these two compare for pure ETC1 encoding, which is available across a huge range of devices. I'll compare against crnlib's ETC1 block encoder in multithreading mode, which was released before either etcpak or ETC2Comp.

On 30Hz console games

That framerate feels incredibly low to me now. I've worked on 60Hz and 30Hz console titles, and the optimization efforts required felt very different. Keeping a smooth, hypnotic 60Hz was sometimes extremely tricky. Now, with VR, 30Hz seems so incredibly antiquated.

Quick etcpak quality test

etcpak is a useful and really fast ETC1 (and some of 2) texture compressor. There is no such thing as a free lunch however, and there are some tradeoffs involved here. Quick example:

Original (kodim03):

crnlib in ETC1 uber mode (8.067 seconds):

RGB: Error: Max:  88, Mean: 2.086, MSE: 9.770, RMSE: 3.126, PSNR: 38.232, SSIM: 0.982703
Y: Error: Max:  34, Mean: 1.304, MSE: 3.750, RMSE: 1.936, PSNR: 42.391, SSIM: 0.982703

etcpak, ETC1 mode only (.006 seconds):

RGB: Error: Max:  80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.980072
Y: Error: Max:  49, Mean: 1.494, MSE: 4.996, RMSE: 2.235, PSNR: 41.144, SSIM: 0.980072

Note I've integrated etcpak directly into my project, and used the BlockData class directly. This thing is *fast*, even without threading!

crnlib has several lower quality settings that are much faster (and still higher quality than etcpak), but nowhere near the speed of etcpak. I've not been focused on pure speed, but on quality and unique features like RDO and intermediate formats like .CRN.

I think the primary value of etcpak is its high performance and relatively compact code size (especially for an ETC2-aware compressor). On many textures/images it'll look perfectly fine. Next up is ETC2Comp, limited to ETC1 mode.

ETC1 with 3D/4D random-restart hill climbing

For fun, I implemented a full ETC1 block encoder using random-restart hill climbing, to see how it behaves compared to my current custom optimizer (the one in rg_etc1). This method works surprisingly well and is quite simple. (Note I'm switching to luma PSNR, because I've been using perceptually weighted color distance. My previous posts used average RGB PSNR.)

The number of attempts per block is fixed. The first 4D hill climb always starts at the subblock's average color, with an intensity table index of 3. The second 4D hill climb starts at a random color/intensity. (In differential mode, the 2nd subblock's hill climb position is constrained to lie near the first one, otherwise we can't code it.) Eventually, it switches from 4D to 3D hill climbing, by randomly climbing only within the best found intensity plane.
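To make the procedure concrete, here's a simplified sketch of random-restart hill climbing over (R5, G5, B5, intensity index) for a single subblock. It is not the encoder described above: it uses plain squared RGB error rather than a perceptual metric, and it skips both the differential-mode constraint and the later switch from 4D to 3D climbing. All function names are made up for this example.

```python
import random

# ETC1 intensity modifier tables, one row per 3-bit table index.
ETC1_MODIFIERS = [
    (-8, -2, 2, 8), (-17, -5, 5, 17), (-29, -9, 9, 29), (-42, -13, 13, 42),
    (-60, -18, 18, 60), (-80, -24, 24, 80), (-106, -33, 33, 106), (-183, -47, 47, 183),
]
LIMITS = (31, 31, 31, 7)  # maximum value along each of the 4 search axes

def expand5(c):
    """Expand a 5-bit component to 8 bits by bit replication."""
    return (c << 3) | (c >> 2)

def subblock_error(pixels, pos):
    """Squared RGB error using the best selector per pixel; pos = (r5, g5, b5, table)."""
    base = [expand5(c) for c in pos[:3]]
    colors = [tuple(max(0, min(255, v + m)) for v in base)
              for m in ETC1_MODIFIERS[pos[3]]]
    return sum(min(sum((p - c) ** 2 for p, c in zip(px, col)) for col in colors)
               for px in pixels)

def hill_climb(pixels, start):
    """Move to the best +/-1 neighbor along any axis until no neighbor improves."""
    pos, err = start, subblock_error(pixels, start)
    while True:
        best_pos, best_err = pos, err
        for axis in range(4):
            for step in (-1, 1):
                v = pos[axis] + step
                if 0 <= v <= LIMITS[axis]:
                    cand = pos[:axis] + (v,) + pos[axis + 1:]
                    e = subblock_error(pixels, cand)
                    if e < best_err:
                        best_pos, best_err = cand, e
        if best_pos == pos:
            return pos, err
        pos, err = best_pos, best_err

def random_restart(pixels, climbs, seed=0):
    """First climb starts at the average color with table 3, the rest start at random."""
    rng = random.Random(seed)
    avg = tuple(round(sum(p[i] for p in pixels) / len(pixels) * 31 / 255)
                for i in range(3))
    best = hill_climb(pixels, avg + (3,))
    for _ in range(climbs - 1):
        start = tuple(rng.randint(0, limit) for limit in LIMITS)
        cand = hill_climb(pixels, start)
        if cand[1] < best[1]:
            best = cand
    return best
```

Because the first climb is deterministic and later climbs only replace the result when they improve on it, adding climbs can never make the reported error worse in this sketch.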

The nearly-best ETC1 encoding (using rg_etc1 - not hill climbing) was 38.053 dB:
Y: Error: Max:  38, Mean: 2.181, MSE: 10.181, RMSE: 3.191, PSNR: 38.053, SSIM: 0.983632

- 1 hill climb, 33.998 dB:

Y: Error: Max:  35, Mean: 3.943, MSE: 25.901, RMSE: 5.089, PSNR: 33.998, SSIM: 0.933979

- 2 hill climbs, 37.808 dB:

Y: Error: Max:  33, Mean: 2.281, MSE: 10.770, RMSE: 3.282, PSNR: 37.808, SSIM: 0.980324

- 4 hill climbs, 37.818 dB:

Y: Error: Max:  33, Mean: 2.280, MSE: 10.748, RMSE: 3.278, PSNR: 37.818, SSIM: 0.980280

- 16 hill climbs, 37.919 dB:

Y: Error: Max:  38, Mean: 2.241, MSE: 10.499, RMSE: 3.240, PSNR: 37.919, SSIM: 0.981631

That 2nd random 4D hill climb helps a lot. Quality quickly plateaus however, at least on this image, and subsequent climbs don't add much. Very interestingly to me, even just 4 climbs nearly matches the quality of my hand-tuned ETC1 optimizer.

Friday, September 16, 2016

Visualizing ETC1 block encoding error as a 4D function

Given a particular 4x4 pixel block, what does the error of all possible ETC1 5:5:5 base color+3-bit intensity encodings look like? The resulting 4D visualization could inspire better optimization algorithms.

To compute these images, I created an ETC1 block in differential mode (5:5:5 base color with a 3:3:3 delta), set the base color to R,G,B, the diff color to (0,0,0), and set both subblock intensity table values to the same index from 0-7. I then encoded the source pixels (by finding the optimal selectors for each pixel), decoded them, and computed the overall block error (as perceptually R,G,B weighted color distance).
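A minimal sketch of the per-point error evaluation described above: for one 5:5:5 base color and one intensity table index, pick the best of the four block colors for every pixel and sum the weighted squared distances. The perceptual weights I actually used aren't listed here, so the `weights` parameter (defaulting to equal weights) is an assumption, as are the function names.

```python
# ETC1 intensity modifier tables, one row per 3-bit table index.
ETC1_MODIFIERS = [
    (-8, -2, 2, 8), (-17, -5, 5, 17), (-29, -9, 9, 29), (-42, -13, 13, 42),
    (-60, -18, 18, 60), (-80, -24, 24, 80), (-106, -33, 33, 106), (-183, -47, 47, 183),
]

def expand5(c):
    """Expand a 5-bit component to 8 bits by bit replication."""
    return (c << 3) | (c >> 2)

def clamp255(x):
    return max(0, min(255, x))

def block_colors(base555, table_index):
    """The four decodable colors for a 5:5:5 base color and an intensity table."""
    r, g, b = (expand5(c) for c in base555)
    return [(clamp255(r + m), clamp255(g + m), clamp255(b + m))
            for m in ETC1_MODIFIERS[table_index]]

def block_error(pixels, base555, table_index, weights=(1, 1, 1)):
    """Sum over pixels of the weighted squared distance to the nearest block color."""
    colors = block_colors(base555, table_index)
    total = 0
    for p in pixels:
        total += min(
            sum(w * (pc - cc) ** 2 for w, pc, cc in zip(weights, p, c))
            for c in colors
        )
    return total
```

Evaluating `block_error` for all 32x32x32 base colors and all 8 table indices gives the 4D function being visualized. The modifiers are listed in ascending order rather than in ETC1's selector bit order, which doesn't matter when only the minimum error is needed.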

These visualizations are linear, where the brightest value (255) is max error, black is 0 error. The blocks used to compute each visualization are here too:

Finding the "best" block color+intensity table index to use in a subblock is basically a 4D search through functions like the ones above. Hill climbing optimization seems useful, except for those pesky local minima. For fun, I've already tried random-restart hill climbing, and it works, but there's got to be a better way.

rg_etc1 starts at the block's average color and scans outwards along the RGB axes, trying to find better colors. It always tries all 8 intensity tables every time it tries a candidate color (which in retrospect seems wildly inefficient, but hey I wrote it over a weekend years ago). It also has several refinement steps. One of them factors in the selectors of the best color found so far, in an attempt to improve the current block color. rg_etc1 ran circles around Mali's reference encoder, from what I remember, which was my goal.

ETC1 texture format visualizations

I've been thinking about how to improve my ETC1 block encoder's quality. What little curiosities lie inside this seemingly simple format?

Hmm: Out of all possible ETC1 subblock colors in 5:5:5 differential mode, how many involve clamping R, G, and/or B to 0 or 255? Turns out, 72% (189704 out of 262144) of the possibilities involve clamping one or more components. That's much more often than I thought!
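That count can be reproduced with a short brute-force loop over every 5:5:5 base color and every intensity table. This is a sketch written for this page (the function names are made up), using the standard ETC1 modifier tables and 5-bit to 8-bit bit replication.

```python
# ETC1 intensity modifier tables, one row per 3-bit table index.
ETC1_MODIFIERS = [
    (-8, -2, 2, 8), (-17, -5, 5, 17), (-29, -9, 9, 29), (-42, -13, 13, 42),
    (-60, -18, 18, 60), (-80, -24, 24, 80), (-106, -33, 33, 106), (-183, -47, 47, 183),
]

def expand5(c):
    """Expand a 5-bit component to 8 bits by bit replication."""
    return (c << 3) | (c >> 2)

def count_clamped():
    """Count base color/table combinations where any block color needs clamping."""
    clamped = total = 0
    for table in ETC1_MODIFIERS:
        # Per 5-bit component value: does any modifier push it outside [0, 255]?
        needs_clamp = [
            any(not 0 <= expand5(c) + m <= 255 for m in table) for c in range(32)
        ]
        for r in range(32):
            for g in range(32):
                for b in range(32):
                    total += 1
                    if needs_clamp[r] or needs_clamp[g] or needs_clamp[b]:
                        clamped += 1
    return clamped, total

print(count_clamped())  # (189704, 262144)
```

The last table (modifiers up to +/-183) clamps for every base color, and even the first table clamps whenever a component is 0 or 31, which is why the total ends up so high.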

Here's a bitmap visualizing when the clamping occurs on any of the 4 block colors encoded by each 5:5:5 base color/3-bit intensity table combination. White pixels signify that one or more color components had to be clamped, and black signifies no clamping:

The basic assumption that each ETC1 subblock color lies nicely spread out along a single colorspace line isn't accurate, due to [0,255] clamping. So any optimization techniques written with this assumption in mind could be missing better solutions. Also, this impacts converting ETC1 to other formats like DXT1, because both endpoints of each colorspace line in DXT1 are separately encoded. Is this really a big deal? I dunno, but it's good to know.

Anyhow, here's a visualization of all possible subcolors. First, there are 4 images, one for each subblock color [0,3]. The 2-bit ETC1 selectors basically select a color from one of these images.

Within an image, there are 8 rows, one for each of the ETC1 intensity tables. Within a row, there are 32 small "tiles" for blue, and within each little 32x32 tile is red (X) and green (Y).

FasTC library

This library, which supports a bunch of common (ETC1, DXT, PVRTC, etc.) formats (not all for encoding yet though) looks great:

Thursday, September 15, 2016

Google's new ETC2 codec looks awesome

I've worked with many of the authors of this at one time or another:

Building a blazing fast ETC2 compressor


(I can't believe the Mali encoder was only single threaded!)

Wednesday, September 14, 2016

ETC1 principal axis optimization

One potential (probably minor) optimization to ETC1 encoding: determine the principal axis of the entire texture, rotate the texture's RGB pixels (by treating them as 3D vectors) so this axis is aligned along the grayscale axis, then compress the texture as usual. The pixel shader can undo the rotation using a trivial handful of instructions.

ETC1 uses colorspace lines constrained to be parallel to the grayscale axis, which this optimization exploits.
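Here's a rough, untuned sketch of the math involved, assuming the principal axis is taken as the dominant eigenvector of the RGB covariance matrix (found with power iteration) and the rotation is built with Rodrigues' formula. It ignores the practical problem that rotated colors can land outside [0,255], and every function name is made up for this example.

```python
import math

def principal_axis(pixels, iterations=50):
    """Dominant eigenvector of the RGB covariance matrix via power iteration."""
    n = len(pixels)
    mean = [sum(p[i] for p in pixels) / n for i in range(3)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pixels) / n
            for j in range(3)] for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate input, e.g. a flat-color texture
        v = [x / norm for x in w]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rotation_to_gray(axis):
    """3x3 rotation matrix taking the unit vector `axis` onto the grayscale axis."""
    s = 1 / math.sqrt(3)
    gray = [s, s, s]
    c = sum(a * g for a, g in zip(axis, gray))
    if c < 0:  # the axis sign is arbitrary, so use the direction nearer to gray
        axis, c = [-a for a in axis], -c
    v = [axis[1] * gray[2] - axis[2] * gray[1],
         axis[2] * gray[0] - axis[0] * gray[2],
         axis[0] * gray[1] - axis[1] * gray[0]]
    k = [[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    return [[(1 if i == j else 0) + k[i][j] + k2[i][j] / (1 + c) for j in range(3)]
            for i in range(3)]

def apply(matrix, color):
    """Multiply a 3x3 matrix by an RGB vector."""
    return [sum(matrix[i][j] * color[j] for j in range(3)) for i in range(3)]
```

Since the matrix is a pure rotation, its transpose is its inverse, so undoing it in a shader would amount to three dot products.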


etcpak is a very fast, but low quality ETC1 (and a little bit of ETC2) compressor:

It's the fastest open source ETC1 encoder that I'm aware of.

Notice the lack of any PSNR/MSE/SSIM statistics anywhere (that I can see). Also, the developer doesn't seem to get that the other tools/libraries he compares his stuff against were optimized for quality, not raw speed. In particular, rg_etc1 (and crunch's ETC1 support) was tuned to compete against the reference encoder along both the quality and perf. axes.

Anyhow, there are some interesting things to learn from etcpak:

  • Best quality doesn't always matter. It obviously depends on your use case. If you have 10 gigs of textures to compress then iteration speed can be very important.
  • The value spectrum spans from highest quality/slow encode (to ship final assets) to crap quality/fast as hell encode (favoring iteration speed). 
  • Visually, the ETC1/2 formats are nicely forgiving. Even a low quality ETC1 encoder produces decent enough looking output for many use cases.

Sunday, September 11, 2016

Idea for next texture compression experiment

Right now, I've got a GPU texture in a simple ETC1 subset that is easily converted to most other GPU formats:

Base color: 15-bits, 5:5:5 RGB
Intensity table index: 3-bits
Selectors: 2-bits/texel

Most importantly, this is a "single subset" encoding, using BC7 terminology. BC7 supports between 1 and 3 subsets per block. A subset is just a colorspace line represented by two R,G,B endpoint colors.

This format is easily converted to DXT1 using a table lookup. It's also the "base" of the universal GPU texture format I've been thinking about, because it's the data needed for DXT1 support. The next step is to experiment with attempting to refine this base data to better take advantage of the full ETC1 specification. So let's try adding two subsets to each block, with two partitions (again using BC7 terminology), top/bottom or left/right, which are supported by both ETC1 and BC7.

For example, we can code this base color, then delta code the 2 subset colors relative to this base. We'll also add a couple more intensity indices, which can be delta coded against the base index. Another bit can indicate which ETC1 block color encoding "mode" should be used (individual 4:4:4 4:4:4 or differential 5:5:5 3:3:3) to represent the subset colors in the output block.

In DXT1 mode, we can ignore this extra delta coded data and just convert the basic (single subset) base format. In ETC1/BC7/ASTC modes, we can use the extra information to support 2 subsets and 2 partitions.

Currently, the idea is to share the same selector indices between the single subset (DXT1) and two subset (BC7/ASTC/full ETC1) encodings. This will constrain how well this idea works, but I think it's worth trying out.

To add more quality to the 2 subset mode, we can delta code (maybe with some fancy per-pixel prediction) another array of selectors in some way. We can also add support for more partitions (derived from BC7's or ASTC's), too.