Monday, June 11, 2018

Lookup table based real-time PVRTC encoding

I've found a table-based method of improving the output from a real-time PVRTC encoder. Fast real-time encoders first find the RGB(A) bounds of each 4x4 block to determine the block endpoints, then they evaluate the interpolated endpoints at each pixel to determine the modulation values which minimize the encoded error. This works okay, but the results are barely acceptable in practice due to banding artifacts on smooth features.

One way to improve the output of this process is to precompute, for all [0,255] 8-bit component values, the best PVRTC low/high endpoints to use to encode that value assuming the modulation values in the 7x7 pixel region are either all-1 or 2 (or all 0, 1, 2, or 3):

// Tables containing the 5-bit/5-bit L/H endpoints to use for each 8-bit value      
static uint g_pvrtc_opt55_e1[256];
static uint g_pvrtc_opt55_e2[256];

// Tables containing the 5-bit/4-bit L/H endpoints to use for each 8-bit value     
static uint g_pvrtc_opt54_e1[256];
static uint g_pvrtc_opt54_e2[256];

const int T = 120;

for (uint c = 0; c < 256; c++)
{
    uint best_err1 = UINT_MAX;
    uint best_l1 = 0, best_h1 = 0;
    uint best_err2 = UINT_MAX;
    uint best_l2 = 0, best_h2 = 0;

    for (uint l = 0; l < 32; l++)
    {
        const int lv = (l << 3) | (l >> 2);

        for (uint h = 0; h < 32; h++)
        {
            const int hv = (h << 3) | (h >> 2);

            if (lv > hv)
                continue;

            int delta = hv - lv;
            // Avoid endpoints that are too far apart to reduce artifacts
            if (delta > T)
                continue;

            uint e1 = (lv * 5 + hv * 3) / 8;

            int diff1 = math::iabs(c - e1);
            if (diff1 < best_err1)
            {
                best_err1 = diff1;
                best_l1 = l;
                best_h1 = h;
            }

            uint e2 = (lv * 3 + hv * 5) / 8;
            int diff2 = math::iabs(c - e2);
            if (diff2 < best_err2)
            {
                best_err2 = diff2;
                best_l2 = l;
                best_h2 = h;
            }
        }
    }

    g_pvrtc_opt55_e1[c] = best_l1 | (best_h1 << 8);
    g_pvrtc_opt55_e2[c] = best_l2 | (best_h2 << 8);
}

// 5-bit/4-bit loop is similar

Now that you have these tables, you can loop through all the 4x4 pixel blocks in the PVRTC texture and compute the 7x7 average RGB color surrounding each block (it's 7x7 pixels because you want the average of all colors influenced by each block's endpoint accounting for bilinear endpoint interpolation). You can look up the optimal endpoints to use for each component, set the block's endpoints to those trial endpoints, find the best modulation values for the impacted 7x7 pixels, and see if the error is reduced or not. The overall error is reduced on smooth blocks very often. You can try this process several times for each block using different precomputed tables.

For even more quality, you can also use precomputed tables for modulation values 0 and 3. You can also use two dimensional tables [256][256] that have the optimal endpoints to use for two colors, then quantize each 7x7 pixel area to 2 colors (using a few Lloyd algorithm iterations) and try those endpoints too. 2D tables result in higher quality high contrast transitions.

Here's some psuedocode showing how to use the tables for a single modulation value (you can apply this process multiple times for the other tables):

// Compute average color of all pixels influenced by this endpoint
vec4F c_avg(0);

for (int y = 0; y < 7; y++)
{
const uint py = wrap_or_clamp_y(by * 4 + y - 1);
for (uint x = 0; x < 7; x++)
{
const uint px = wrap_or_clamp_x(bx * 4 + x - 1);

const color_quad_u8 &c = orig_img(px, py);

c_avg[0] += c[0];
c_avg[1] += c[1];
c_avg[2] += c[2];
c_avg[3] += c[3];
}
}

// Save the 3x3 block neighborhood surrounding the current block
for (int y = -1; y <= 1; y++)
{
    for (int x = -1; x <= 1; x++)
    {
        const uint block_x = wrap_or_clamp_block_x(bx + x);
        const uint block_y = wrap_or_clamp_block_y(by + y);
        cur_blocks[x + 1][y + 1] = m_blocks(block_x, block_y);
    }
}

// Compute the rounded 8-bit average color
// c_avg is the average color of the 7x7 pixels around the block
c_avg += vec4F(.5f);
color_quad_u8 color_avg((int)c_avg[0], (int)c_avg[1], (int)c_avg[2], (int)c_avg[3]);

// Lookup the optimal PVRTC endpoints to use given this average color,
// assuming the modulation values will be all-1
color_quad_u8 l0(0), h0(0);
l0[0] = g_pvrtc_opt55_e1[color_avg[0]] & 0xFF;
h0[0] = g_pvrtc_opt55_e1[color_avg[0]] >> 8;

l0[1] = g_pvrtc_opt55_e1[color_avg[1]] & 0xFF;
h0[1] = g_pvrtc_opt55_e1[color_avg[1]] >> 8;

l0[2] = g_pvrtc_opt54_e1[color_avg[2]] & 0xFF;
h0[2] = g_pvrtc_opt54_e1[color_avg[2]] >> 8;

// Set the block's endpoints and evaluate the error of the 7x7 neighborhood (also choosing new modulation values!)
m_blocks(bx, by).set_opaque_endpoint_raw(0, l0);
m_blocks(bx, by).set_opaque_endpoint_raw(1, h0);

uint64 e1_err = remap_pixels_influenced_by_endpoint(bx, by, orig_img, perceptual, alpha_is_significant);
if (e1_err > current_best_err)
{
    // Error got worse, so restore the blocks
    for (int y = -1; y <= 1; y++)
    {
        for (int x = -1; x <= 1; x++)
        {
            const uint block_x = wrap_or_clamp_block_x(bx + x);
            const uint block_y = wrap_or_clamp_block_y(by + y);

            m_blocks(block_x, block_y) = cur_blocks[x + 1][y + 1];
        }
    }
}

Here's an example for kodim03 (cropped to 1k square due to PVRTC limitations). This image only uses 2 precomputed tables for modulation values 1 and 2 (because it's real-time):

Original:


Before table-based optimization:
RGB Average Error: Max:  86, Mean: 1.156, MSE: 9.024, RMSE: 3.004, PSNR: 38.577


Endpoint and modulation data:





After:
RGB Average Error: Max:  79, Mean: 0.971, MSE: 6.694, RMSE: 2.587, PSNR: 39.874



Endpoint and modulation data:





The 2D table version looks better on high contrast transitions, but needs more memory. Using 4 1D tables followed by a single 2D lookup results in the best quality.

The lookup table example code above assumes the high endpoints will usually be >= than the low endpoints. Whatever algorithm you use to create the endpoints in the first pass needs to be compatible with your lookup tables, or you'll loose quality.

You can apply this algorithm in multiple passes for higher quality. 2-3 passes seems sufficient.

For comparison, here's a grayscale ramp encoded using PVRTexTool (best quality), vs. this algorithm using 3 passes:

Original:



PVRTexTool:

Lookup-based algorithm:



No comments:

Post a Comment