Richard Geldreich's Blog: June 2018

Sunday, June 17, 2018

PVRTC encoding examples

This is "testpat.png", which I got somewhere on the web. It's a surprisingly tricky image to encode to PVRTC. The gradients, various patterns, the transitions between these regions and even the constant-color areas are hard to handle in PVRTC. (Sorry, there is lena in there. I will change this to something else eventually.)

Note my encoder used clamp addressing for both encoding and decoding but PVRTexTool used wrap (not that it matters with this image). Here's the .pvr file for testpat.

Original

BC1: 47.991 Y PSNR

PVRTexTool "Best Quality": 41.943 Y PSNR

Experimental encoder (bounding box, precomputed tables, 1x1 block LS): 44.914 Y PSNR

Here's delorean (resampled to .25 original size):

Original

BC1: 43.293 Y PSNR, .997308 Y SSIM

PVRTexTool "Best Quality": 40.440 Y PSNR, .996007 Y SSIM

Experimental encoder: 42.891 Y PSNR, .997021 Y SSIM

Interestingly, on delorean you can see that PVRTC's handling of smooth gradients is clearly superior vs. BC1 with a strong encoder.

Here's xmen_1024:

Original

BC1: 37.757 Y PSNR, .984543 Y SSIM

BC1 (AMD Compressonator quality=1): 37.306 Y PSNR, .978997 Y SSIM

PVRTexTool "Best Quality": 36.762 Y PSNR, .976023 Y SSIM

Experimental encoder: 37.314 Y PSNR, .9812 Y SSIM

"Y" is REC 709 Luma, SSIM was computed using OpenCV. The images marked "BC1" were compressed using crunch (uber quality, perceptual mode), which is a bit better than AMD Compressonator's output.

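For reference, here is a minimal sketch of a luma-only PSNR measurement like the "Y PSNR" figures above. It assumes floating-point Rec. 709 luma weights and an 8-bit peak of 255. The exact rounding used to produce the numbers in this post isn't stated, so treat the function names and details here as illustrative, not as the tool that generated them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

struct Rgb8 { uint8_t r, g, b; };

// Rec. 709 luma from 8-bit RGB (floating point, no rounding; an assumption).
static double luma709(const Rgb8& c)
{
    return 0.2126 * c.r + 0.7152 * c.g + 0.0722 * c.b;
}

// PSNR in dB between two equally sized images, measured on luma only.
static double luma_psnr(const std::vector<Rgb8>& a, const std::vector<Rgb8>& b)
{
    assert(a.size() == b.size() && !a.empty());

    double sum_sq = 0.0;
    for (size_t i = 0; i < a.size(); i++)
    {
        const double d = luma709(a[i]) - luma709(b[i]);
        sum_sq += d * d;
    }

    const double mse = sum_sq / (double)a.size();
    if (mse == 0.0)
        return std::numeric_limits<double>::infinity(); // identical images

    return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```
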
Tuesday, June 12, 2018

Real-time PVRTC encoding for a universal GPU texture format system

Here's one way to support PVRTC in a universal GPU texture format system that transcodes from a block based format like ETC1S.

First, study this PVRTC code:
https://bitbucket.org/jthlim/pvrtccompressor/src/default/PvrTcEncoder.cpp

Unfortunately, this library has several key bugs, but its core texture encoding approach is sound for real-time use.

Don't use its decompressor: it's not bit-accurate vs. the GPU and doesn't unpack alpha properly. Use this "official" decoder as a reference instead:

https://github.com/google/swiftshader/blob/master/third_party/PowerVR_SDK/Tools/PVRTDecompress.h

Function EncodeRgb4Bpp() has two passes:

1. The first pass computes RGB(A) bounding boxes for each 4x4 block:

    for(int y = 0; y < blocks; ++y)
    {
        for(int x = 0; x < blocks; ++x)
        {
            ColorRgbBoundingBox cbb;
            CalculateBoundingBox(cbb, bitmap, x, y);
            PvrTcPacket* packet = packets + GetMortonNumber(x, y);
            packet->usePunchthroughAlpha = 0;
            packet->SetColorA(cbb.min);
            packet->SetColorB(cbb.max);
        }
    }
    

Most importantly, SetColorA() must floor and SetColorB() must ceil. Note that the alpha version of the code in this library (function EncodeRgba4Bpp()) is very wrong: it assumes alpha 7=255, which is incorrect (it's actually (7*2)*255/15 or 238).
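
As a rough illustration of what "floor" and "ceil" mean here, the sketch below rounds an 8-bit component down or up to the nearest representable 5-bit or 4-bit PVRTC value, assuming the usual bit-replication expansion. The function names are hypothetical (they are not from Lim's library). The last helper reproduces the alpha arithmetic mentioned above: a 3-bit alpha of 7 becomes the 4-bit value 14, which expands to 238.

```cpp
#include <cassert>

// Expand a 5-bit or 4-bit component to 8 bits by bit replication.
static int expand5(int q) { return (q << 3) | (q >> 2); }
static int expand4(int q) { return (q << 4) | q; }

// Largest 5-bit code whose expanded value is <= v (round down).
static int floor_to_5(int v)
{
    assert(v >= 0 && v <= 255);
    int q = v >> 3;
    while (q > 0 && expand5(q) > v) q--;
    return q;
}

// Smallest 5-bit code whose expanded value is >= v (round up).
static int ceil_to_5(int v)
{
    assert(v >= 0 && v <= 255);
    int q = v >> 3;
    while (q < 31 && expand5(q) < v) q++;
    return q;
}

// Same idea for 4-bit components.
static int floor_to_4(int v)
{
    assert(v >= 0 && v <= 255);
    int q = v >> 4;
    while (q > 0 && expand4(q) > v) q--;
    return q;
}

static int ceil_to_4(int v)
{
    assert(v >= 0 && v <= 255);
    int q = v >> 4;
    while (q < 15 && expand4(q) < v) q++;
    return q;
}

// 3-bit alpha: append a zero bit to get 4 bits, then expand to 8 bits.
static int expand_alpha3(int a3) { return expand4(a3 << 1); }
```

Rounding the low endpoint down and the high endpoint up keeps every pixel of the block inside the encoded color range.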

This pass can be done while decoding ETC1S blocks during transcoding. The endpoint/modulation values need to be saved to a temporary buffer.

It's possible to swap the low and high endpoints and get an encoding that results in less error (I believe because the endpoint encoding precision of blue isn't symmetrical - it's 4/5 not 5/5), but you have to encode the image twice so it doesn't seem worth the trouble.

2. Now that the per-block endpoints are computed, you can compute the per-pixel modulation values. This function is quite optimizable without requiring vector code (which doesn't work on the Web yet):

    for(int y = 0; y < blocks; ++y)
    {
        for(int x = 0; x < blocks; ++x)
        {
            const unsigned char (*factor)[4] = PvrTcPacket::BILINEAR_FACTORS;
            const ColorRgba<unsigned char>* data = bitmap.GetData() + y * 4 * size + x * 4;

            uint32_t modulationData = 0;

            for(int py = 0; py < 4; ++py)
            {
                const int yOffset = (py < 2) ? -1 : 0;
                const int y0 = (y + yOffset) & blockMask;
                const int y1 = (y0+1) & blockMask;

                for(int px = 0; px < 4; ++px)
                {
                    const int xOffset = (px < 2) ? -1 : 0;
                    const int x0 = (x + xOffset) & blockMask;
                    const int x1 = (x0+1) & blockMask;

                    const PvrTcPacket* p0 = packets + GetMortonNumber(x0, y0);
                    const PvrTcPacket* p1 = packets + GetMortonNumber(x1, y0);
                    const PvrTcPacket* p2 = packets + GetMortonNumber(x0, y1);
                    const PvrTcPacket* p3 = packets + GetMortonNumber(x1, y1);

                    ColorRgb<int> ca = p0->GetColorRgbA() * (*factor)[0] +
                                       p1->GetColorRgbA() * (*factor)[1] +
                                       p2->GetColorRgbA() * (*factor)[2] +
                                       p3->GetColorRgbA() * (*factor)[3];

                    ColorRgb<int> cb = p0->GetColorRgbB() * (*factor)[0] +
                                       p1->GetColorRgbB() * (*factor)[1] +
                                       p2->GetColorRgbB() * (*factor)[2] +
                                       p3->GetColorRgbB() * (*factor)[3];

                    const ColorRgb<unsigned char>& pixel = data[py*size + px];
                    ColorRgb<int> d = cb - ca;
                    ColorRgb<int> p{pixel.r*16, pixel.g*16, pixel.b*16};
                    ColorRgb<int> v = p - ca;

                    // PVRTC uses weightings of 0, 3/8, 5/8 and 1
                    // The boundaries for these are 3/16, 1/2 (=8/16), 13/16
                    int projection = (v % d) * 16;
                    int lengthSquared = d % d;
                    if(projection > 3*lengthSquared) modulationData++;
                    if(projection > 8*lengthSquared) modulationData++;
                    if(projection > 13*lengthSquared) modulationData++;

                    modulationData = BitUtility::RotateRight(modulationData, 2);

                    factor++;
                }
            }

            PvrTcPacket* packet = packets + GetMortonNumber(x, y);
            packet->modulationData = modulationData;
        }
    }

The code above interpolates the endpoints in full RGB(A) space, which isn't necessary. You can sum each channel into a single value (like Luma, but just R+G+B), interpolate that instead (much faster in scalar code), then decide which modulation values to use in 1D space. Also, you can unroll the innermost px/py loops using macros or whatever.
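
Here's a minimal sketch of the 1D decision, assuming the interpolated low/high endpoint sums and the pixel sum (each R+G+B) have already been computed in the same scale. It uses the same 3/16, 8/16 and 13/16 thresholds as the full RGB code. The function name and parameters are hypothetical.

```cpp
#include <cassert>

// Pick a 2-bit PVRTC modulation value using a single scalar per color.
//   a_sum, b_sum: interpolated low/high endpoint R+G+B sums
//   p_sum:        R+G+B sum of the source pixel, in the same scale
static int select_modulation_1d(int a_sum, int b_sum, int p_sum)
{
    assert(a_sum >= 0 && b_sum >= 0 && p_sum >= 0);

    const long long d = (long long)b_sum - a_sum;
    const long long v = (long long)p_sum - a_sum;

    // 1D version of the dot products: (v . d) * 16 and (d . d).
    // 64-bit math avoids overflow when the sums are pre-scaled by 16.
    const long long projection = 16 * v * d;
    const long long length_sq = d * d;

    int m = 0;
    if (projection > 3 * length_sq) m++;
    if (projection > 8 * length_sq) m++;
    if (projection > 13 * length_sq) m++;
    return m;
}
```

Because the sign of d is carried through the product, this also behaves sensibly when the interpolated "low" sum happens to be larger than the "high" sum.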

Encoding from ETC1S simplifies things somewhat because, for each block, you can precompute the R+G+B values to use for each of the 4 possible input selectors.
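
A sketch of that precomputation is below. It assumes the standard ETC1 intensity modifier tables and that selectors are indexed in linear order from darkest to brightest (not the raw ETC1 bit encoding). The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// ETC1 intensity modifier tables, ordered from darkest to brightest.
static const int g_etc1_inten_tables[8][4] = {
    { -8, -2, 2, 8 },     { -17, -5, 5, 17 },   { -29, -9, 9, 29 },
    { -42, -13, 13, 42 }, { -60, -18, 18, 60 }, { -80, -24, 24, 80 },
    { -106, -33, 33, 106 }, { -183, -47, 47, 183 }
};

static int clamp255(int v) { return std::min(255, std::max(0, v)); }

// For one ETC1S block (5:5:5 base color plus an intensity table index),
// compute the R+G+B sum of each of the 4 colors a selector can pick.
static void etc1s_block_sums(int r5, int g5, int b5, int inten_table, int sums[4])
{
    assert(inten_table >= 0 && inten_table < 8);

    // Expand the 5-bit base color to 8 bits by bit replication
    const int r = (r5 << 3) | (r5 >> 2);
    const int g = (g5 << 3) | (g5 >> 2);
    const int b = (b5 << 3) | (b5 >> 2);

    for (int s = 0; s < 4; s++)
    {
        const int delta = g_etc1_inten_tables[inten_table][s];
        sums[s] = clamp255(r + delta) + clamp255(g + delta) + clamp255(b + delta);
    }
}
```

With these four sums in hand, each pixel's value in the modulation pass is just a table lookup on its selector.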

That's basically it. If you combine this post with my previous one, you've got a nice real-time PVRTC encoder usable in WebAssembly/asm.js (i.e. it doesn't need vector ops to be fast). Quality is surprisingly good for a real-time encoder, especially if you add the optional 3rd pass described in my other post.

Opaque is tougher to handle, but the basic concepts are the same.

The encoder in this library doesn't support punch-through alpha, which is quite valuable and easy to encode in my testing.

Monday, June 11, 2018

Lookup table based real-time PVRTC encoding

I've found a table-based method of improving the output from a real-time PVRTC encoder. Fast real-time encoders first find the RGB(A) bounds of each 4x4 block to determine the block endpoints, then they evaluate the interpolated endpoints at each pixel to determine the modulation values which minimize the encoded error. This works okay, but the results are barely acceptable in practice due to banding artifacts on smooth features.

One way to improve the output of this process is to precompute, for all [0,255] 8-bit component values, the best PVRTC low/high endpoints to use to encode that value, assuming the modulation values in the 7x7 pixel region are all 1 or all 2 (or, with more tables, all 0, 1, 2, or 3):

    // Tables containing the 5-bit/5-bit L/H endpoints to use for each 8-bit value
    static uint g_pvrtc_opt55_e1[256];
    static uint g_pvrtc_opt55_e2[256];

    // Tables containing the 5-bit/4-bit L/H endpoints to use for each 8-bit value
    static uint g_pvrtc_opt54_e1[256];
    static uint g_pvrtc_opt54_e2[256];

    const int T = 120;

    for (uint c = 0; c < 256; c++)
    {
        uint best_err1 = UINT_MAX;
        uint best_l1 = 0, best_h1 = 0;
        uint best_err2 = UINT_MAX;
        uint best_l2 = 0, best_h2 = 0;

        for (uint l = 0; l < 32; l++)
        {
            // Expand the 5-bit low endpoint to 8 bits
            const int lv = (l << 3) | (l >> 2);

            for (uint h = 0; h < 32; h++)
            {
                // Expand the 5-bit high endpoint to 8 bits
                const int hv = (h << 3) | (h >> 2);

                if (lv > hv)
                    continue;

                int delta = hv - lv;
                // Avoid endpoints that are too far apart to reduce artifacts
                if (delta > T)
                    continue;

                // Modulation value 1: 5/8 low + 3/8 high
                const int e1 = (lv * 5 + hv * 3) / 8;

                // Do the subtraction in signed math before taking the absolute value
                const uint diff1 = (uint)math::iabs((int)c - e1);
                if (diff1 < best_err1)
                {
                    best_err1 = diff1;
                    best_l1 = l;
                    best_h1 = h;
                }

                // Modulation value 2: 3/8 low + 5/8 high
                const int e2 = (lv * 3 + hv * 5) / 8;
                const uint diff2 = (uint)math::iabs((int)c - e2);
                if (diff2 < best_err2)
                {
                    best_err2 = diff2;
                    best_l2 = l;
                    best_h2 = h;
                }
            }
        }

        // Pack the low endpoint into bits 0-7 and the high endpoint into bits 8+
        g_pvrtc_opt55_e1[c] = best_l1 | (best_h1 << 8);
        g_pvrtc_opt55_e2[c] = best_l2 | (best_h2 << 8);
    }

    // 5-bit/4-bit loop is similar

Now that you have these tables, you can loop through all the 4x4 pixel blocks in the PVRTC texture and compute the 7x7 average RGB color surrounding each block (it's 7x7 pixels because you want the average of all colors influenced by each block's endpoint accounting for bilinear endpoint interpolation). You can look up the optimal endpoints to use for each component, set the block's endpoints to those trial endpoints, find the best modulation values for the impacted 7x7 pixels, and see if the error is reduced or not. The overall error is reduced on smooth blocks very often. You can try this process several times for each block using different precomputed tables.

For even more quality, you can also use precomputed tables for modulation values 0 and 3. You can also use two dimensional tables [256][256] that have the optimal endpoints to use for two colors, then quantize each 7x7 pixel area to 2 colors (using a few Lloyd algorithm iterations) and try those endpoints too. 2D tables result in higher quality high contrast transitions.
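
To make the 2-color quantization step concrete, here's a small sketch of reducing a set of pixels (for example the 49 pixels of a 7x7 area) to two representative colors with a few Lloyd iterations. The seeding strategy (darkest and brightest pixel by R+G+B) and all names are assumptions made for illustration. The two resulting colors would then be rounded to 8 bits and used to index the [256][256] tables per component.

```cpp
#include <cassert>
#include <vector>

struct Color3 { float r, g, b; };

static float dist_sq(const Color3& a, const Color3& b)
{
    const float dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

// Quantize a set of pixels to two colors using Lloyd's algorithm.
static void quantize_to_2_colors(const std::vector<Color3>& pixels, int iterations,
                                 Color3& c0, Color3& c1)
{
    assert(!pixels.empty());

    // Seed with the darkest and brightest pixels (by R+G+B)
    c0 = c1 = pixels[0];
    for (const Color3& p : pixels)
    {
        if (p.r + p.g + p.b < c0.r + c0.g + c0.b) c0 = p;
        if (p.r + p.g + p.b > c1.r + c1.g + c1.b) c1 = p;
    }

    for (int iter = 0; iter < iterations; iter++)
    {
        Color3 sum0 = { 0, 0, 0 }, sum1 = { 0, 0, 0 };
        int n0 = 0, n1 = 0;

        // Assign each pixel to its nearest representative color
        for (const Color3& p : pixels)
        {
            if (dist_sq(p, c0) <= dist_sq(p, c1))
            {
                sum0.r += p.r; sum0.g += p.g; sum0.b += p.b; n0++;
            }
            else
            {
                sum1.r += p.r; sum1.g += p.g; sum1.b += p.b; n1++;
            }
        }

        // Move each representative to the centroid of its cluster
        if (n0) c0 = { sum0.r / n0, sum0.g / n0, sum0.b / n0 };
        if (n1) c1 = { sum1.r / n1, sum1.g / n1, sum1.b / n1 };
    }
}
```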

Here's some pseudocode showing how to use the tables for a single modulation value (you can apply this process multiple times for the other tables):

    // Compute average color of all pixels influenced by this endpoint
    vec4F c_avg(0);

    for (int y = 0; y < 7; y++)
    {
        const uint py = wrap_or_clamp_y(by * 4 + y - 1);

        for (int x = 0; x < 7; x++)
        {
            const uint px = wrap_or_clamp_x(bx * 4 + x - 1);

            const color_quad_u8 &c = orig_img(px, py);

            c_avg[0] += c[0];
            c_avg[1] += c[1];
            c_avg[2] += c[2];
            c_avg[3] += c[3];
        }
    }

    // Save the 3x3 block neighborhood surrounding the current block
    for (int y = -1; y <= 1; y++)
    {
        for (int x = -1; x <= 1; x++)
        {
            const uint block_x = wrap_or_clamp_block_x(bx + x);
            const uint block_y = wrap_or_clamp_block_y(by + y);

            cur_blocks[x + 1][y + 1] = m_blocks(block_x, block_y);
        }
    }

    // Compute the rounded 8-bit average color:
    // divide the sums by the 49 pixels in the 7x7 region, then round
    c_avg *= (1.0f / 49.0f);
    c_avg += vec4F(.5f);

    // c_avg is now the average color of the 7x7 pixels around the block
    color_quad_u8 color_avg((int)c_avg[0], (int)c_avg[1], (int)c_avg[2], (int)c_avg[3]);

    // Lookup the optimal PVRTC endpoints to use given this average color,
    // assuming the modulation values will be all-1
    color_quad_u8 l0(0), h0(0);

    l0[0] = g_pvrtc_opt55_e1[color_avg[0]] & 0xFF;
    h0[0] = g_pvrtc_opt55_e1[color_avg[0]] >> 8;

    l0[1] = g_pvrtc_opt55_e1[color_avg[1]] & 0xFF;
    h0[1] = g_pvrtc_opt55_e1[color_avg[1]] >> 8;

    l0[2] = g_pvrtc_opt54_e1[color_avg[2]] & 0xFF;
    h0[2] = g_pvrtc_opt54_e1[color_avg[2]] >> 8;

    // Set the block's endpoints and evaluate the error of the 7x7 neighborhood (also choosing new modulation values!)
    m_blocks(bx, by).set_opaque_endpoint_raw(0, l0);
    m_blocks(bx, by).set_opaque_endpoint_raw(1, h0);

    uint64 e1_err = remap_pixels_influenced_by_endpoint(bx, by, orig_img, perceptual, alpha_is_significant);

    if (e1_err > current_best_err)
    {
        // Error got worse, so restore the blocks
        for (int y = -1; y <= 1; y++)
        {
            for (int x = -1; x <= 1; x++)
            {
                const uint block_x = wrap_or_clamp_block_x(bx + x);
                const uint block_y = wrap_or_clamp_block_y(by + y);

                m_blocks(block_x, block_y) = cur_blocks[x + 1][y + 1];
            }
        }
    }

Here's an example for kodim03 (cropped to 1k square due to PVRTC limitations). This image only uses 2 precomputed tables for modulation values 1 and 2 (because it's real-time):

Original:

Before table-based optimization:

RGB Average Error: Max: 86, Mean: 1.156, MSE: 9.024, RMSE: 3.004, PSNR: 38.577

Endpoint and modulation data:

After:

RGB Average Error: Max: 79, Mean: 0.971, MSE: 6.694, RMSE: 2.587, PSNR: 39.874

Endpoint and modulation data:

The 2D table version looks better on high contrast transitions, but needs more memory. Using 4 1D tables followed by a single 2D lookup results in the best quality.

The lookup table example code above assumes the high endpoints will usually be >= the low endpoints. Whatever algorithm you use to create the endpoints in the first pass needs to be compatible with your lookup tables, or you'll lose quality.

You can apply this algorithm in multiple passes for higher quality. 2-3 passes seems sufficient.

For comparison, here's a grayscale ramp encoded using PVRTexTool (best quality), vs. this algorithm using 3 passes:

Original:

PVRTexTool:

Lookup-based algorithm:

Friday, June 8, 2018

ETC1S texture format encoding and how it's transcoded to BC1

I developed the ETC1S encoding method back in late 2016, and we talked about it publicly in our CppCon '16 presentation. It's good to see that this encoding is working well in crunch too (better bitrate for near equal error). There are kodim statistics on Alexander's checkin notes:

https://github.com/Unity-Technologies/crunch/commit/660322d3a611782202202ac00109fbd1a10d7602

I described the format details and asked Alexander to support ETC1S so we could add universal support to crunch.

Anyhow, ETC1S is great because it enables simplified transcoding to BC1 using a couple of small lookup tables (one for the 5-bit DXT1 components, and the other for the 6-bit component). You can precompute the best DXT1 component low/high endpoints to use for each possibility of used ETC1S selectors (or low/high selector "ranges") and ways of remapping the ETC1S selectors to DXT1 selectors. The method I came up with supports a strong subset of these possible mappings (6 low/high selector ranges and 10 selector remappings).

So the basic idea behind this transcoder design is that we'll figure out the near-optimal DXT1 low/high endpoints to use for an ETC1S block, then just translate the ETC1S selectors through a remapping table. We don't need to do any expensive R,G,B vector calculations here, just simple math on endpoint components and selectors. To find the best endpoints, we need the ETC1S base color (5,5,5), intensity table index (3 bits), and the used selector range (because ETC1/ETC1S heavily depends on endpoint extrapolation to reduce overall error, so for example sometimes the encoder will only use a single selector in the "middle" of the intensity range).

First, here are the most commonly used selector ranges, which are the ones the transcoder supports:
{ 0, 3 },
{ 1, 3 },
{ 0, 2 },
{ 1, 2 },
{ 2, 3 },
{ 0, 1 },

And here are the selector remapping tables:
{ 0, 0, 1, 1 },
{ 0, 0, 1, 2 },
{ 0, 0, 1, 3 },
{ 0, 0, 2, 3 },
{ 0, 1, 1, 1 },
{ 0, 1, 2, 2 },
{ 0, 1, 2, 3 },
{ 0, 2, 3, 3 },
{ 1, 2, 2, 2 },
{ 1, 2, 3, 3 },

So what does this stuff mean? In the first table, the first entry is { 0, 3 }. This index is used for blocks that use all 4 selectors. The 2nd one is for blocks that only use selectors 1-3, etc. We could support all possible ways that the 4 selectors could be used, but you reach a point of diminishing returns.

The second table is used to translate ETC1S selectors to DXT1 selectors. Again, we could support all possible ways of remapping selectors, but only a few are needed in practice.
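
As a rough sketch of how these two tables get used, the code below scans a block's 16 selectors to find the selector range index, and pushes selectors through one of the 10 remapping rows. It assumes selectors are stored one per byte and that values in both tables are in linear order from the low to the high color (not the raw DXT1 bit encoding). The function and array names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

static const uint8_t g_selector_ranges[6][2] = {
    { 0, 3 }, { 1, 3 }, { 0, 2 }, { 1, 2 }, { 2, 3 }, { 0, 1 }
};

static const uint8_t g_selector_mappings[10][4] = {
    { 0, 0, 1, 1 }, { 0, 0, 1, 2 }, { 0, 0, 1, 3 }, { 0, 0, 2, 3 }, { 0, 1, 1, 1 },
    { 0, 1, 2, 2 }, { 0, 1, 2, 3 }, { 0, 2, 3, 3 }, { 1, 2, 2, 2 }, { 1, 2, 3, 3 }
};

// Returns the selector range table index (0-5), or -1 if the block
// uses a single selector (a solid color block, handled separately).
static int find_selector_range_index(const uint8_t selectors[16])
{
    int lo = 3, hi = 0;
    for (int i = 0; i < 16; i++)
    {
        assert(selectors[i] < 4);
        lo = std::min<int>(lo, selectors[i]);
        hi = std::max<int>(hi, selectors[i]);
    }

    if (lo == hi)
        return -1;

    for (int i = 0; i < 6; i++)
        if (g_selector_ranges[i][0] == lo && g_selector_ranges[i][1] == hi)
            return i;

    return -1; // not reached: all 6 lo < hi pairs are in the table
}

// Translate ETC1S selectors to DXT1 selectors using one remapping row.
static void remap_selectors(const uint8_t etc1s[16], int mapping, uint8_t dxt1[16])
{
    assert(mapping >= 0 && mapping < 10);
    for (int i = 0; i < 16; i++)
        dxt1[i] = g_selector_mappings[mapping][etc1s[i]];
}
```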

So to translate an ETC1S block to BC1/DXT1:

- Scan the ETC1S selectors (which range from 0-3) to identify their low/high range, and map this to the best entry in the first table. This is the selector range table index, from 0-5.
(For crunch/basis this is precomputed for each selector codebook entry, so we don't need to do it for each block.)

- Now we have a selector range (0-5), three ETC1S base color components (5-bits each) and an ETC1S intensity table index (3-bits). For each selector_range/basecolor/inten_table possibility we have a set of 10 precomputed table entries, one for each supported way of remapping the selectors from ETC1S->DXT1 (6*32*8*10=15360 total entries in each of the 5-bit and 6-bit tables).

- Each table entry has DXT1 low/high endpoint values (either 5 or 6 bits) and an error value. But this is only for a single component, and we can only select one way to remap the selectors, because there's only a single selector for each pixel. The best way of remapping the selectors for R may not be the best way for G or B, so we scan all 10 supported remappings, sum up the R+G+B error of each, and use the remapping that minimizes the overall error.

In code:

    // Get the best selector range table entry to use for the ETC1S block:
    const uint selector_range_table = g_etc1_to_dxt1_selector_range_index[low_selector][high_selector];

    // Now get pointers to the precomputed tables for each component:
    // Table layout: [8 inten tables][32 base colors][RANGES][MAPPINGS]
    const etc1_to_dxt1_56_solution *pTable_r = &g_etc1_to_dxt_5[(inten_table * 32 + base_color.r) * (NUM_ETC1_TO_DXT1_SELECTOR_RANGES * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS) + selector_range_table * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS];
    const etc1_to_dxt1_56_solution *pTable_g = &g_etc1_to_dxt_6[(inten_table * 32 + base_color.g) * (NUM_ETC1_TO_DXT1_SELECTOR_RANGES * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS) + selector_range_table * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS];
    const etc1_to_dxt1_56_solution *pTable_b = &g_etc1_to_dxt_5[(inten_table * 32 + base_color.b) * (NUM_ETC1_TO_DXT1_SELECTOR_RANGES * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS) + selector_range_table * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS];

    // Scan to find the best remapping table (from 10) to use:
    uint best_err = UINT_MAX;
    uint best_mapping = 0;

    CRND_ASSERT(NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS == 10);

    #define DO_ITER(m) { \
        uint total_err = pTable_r[m].m_err + pTable_g[m].m_err + pTable_b[m].m_err; \
        if (total_err < best_err) { best_err = total_err; best_mapping = m; } }

    DO_ITER(0); DO_ITER(1); DO_ITER(2); DO_ITER(3); DO_ITER(4);
    DO_ITER(5); DO_ITER(6); DO_ITER(7); DO_ITER(8); DO_ITER(9);

    #undef DO_ITER

    // Now create the DXT1 endpoints
    uint l = dxt1_block::pack_unscaled_color(pTable_r[best_mapping].m_lo, pTable_g[best_mapping].m_lo, pTable_b[best_mapping].m_lo);
    uint h = dxt1_block::pack_unscaled_color(pTable_r[best_mapping].m_hi, pTable_g[best_mapping].m_hi, pTable_b[best_mapping].m_hi);

    // pSelectors_xlat is used to translate the ETC1S selectors to DXT1 selectors
    const uint8 *pSelectors_xlat = &g_etc1_to_dxt1_selector_mappings1[best_mapping][0];

    if (l < h)
    {
        // Swap the endpoints and switch to the second (swapped) set of selector mappings
        std::swap(l, h);
        pSelectors_xlat = &g_etc1_to_dxt1_selector_mappings2[best_mapping][0];
    }

    pDst_block->set_low_color(static_cast<uint16>(l));
    pDst_block->set_high_color(static_cast<uint16>(h));

    // Now use pSelectors_xlat[] to translate the selectors and we're done

If the block only uses a single selector, it's a fixed color block and you can use a separate set of precomputed tables (like stb_dxt uses) to convert it to the optimal DXT1 color.
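
For illustration, here's a simplified sketch of building such a solid color table for a 5-bit component. It assumes the ideal (2*c0 + c1)/3 interpolant and plain bit-replication expansion. Real tables (such as the ones in stb_dxt) account for more details, so this is only meant to show the idea. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

struct SingleColorEntry { uint8_t lo, hi, err; };

static int expand5(int q) { return (q << 3) | (q >> 2); }

// For each 8-bit value, find the pair of 5-bit endpoints whose
// 2/3-1/3 interpolated color comes closest to that value.
static void build_single_color_table_5(SingleColorEntry table[256])
{
    assert(table != nullptr);

    for (int c = 0; c < 256; c++)
    {
        int best_err = 256;

        for (int lo = 0; lo < 32; lo++)
        {
            for (int hi = 0; hi < 32; hi++)
            {
                // Ideal interpolation (an assumption; hardware may round differently)
                const int interp = (2 * expand5(lo) + expand5(hi)) / 3;
                const int err = std::abs(interp - c);

                if (err < best_err)
                {
                    best_err = err;
                    table[c] = { (uint8_t)lo, (uint8_t)hi, (uint8_t)err };
                }
            }
        }
    }
}
```

A 6-bit table for green would be built the same way with 64 codes per endpoint and a 6-bit expansion.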

So that's it. It's a fast and simple process to convert ETC1S->DXT1. The results look very good, and the quality difference between ETC1S and the transcoded BC1 is within a fraction of a dB. You can also use this process to convert ETC1S->BC7, etc.

Once you understand this process, almost everything else falls into place for the universal format. ETC1S->BC1 and ETC1S->PVRTC are the key transcoders, and all other formats use these basic ideas.

There are surely other "base" formats we could choose. I chose ETC1S because I already had a strong encoder for this format and because it's transcodable to BC1.

You can see the actual code here, in function convert_etc1_to_dxt1().

It's possible to add BC7-style pbits to ETC1S (1 or 3) to improve quality. Transcoders can decide to use these pbits, or not.

How to improve crunch's codebook generators

While writing Basis ETC I sat down and started to study the codebook generation process I used on crunch. crunch would create candidate representational vectors (for endpoints or selectors), clusterize these candidates (using top-down clusterization), assign blocks to the closest codebook entry, and then go and compute the best DXT1 endpoint or selectors to use for each cluster. That's basically it. Figuring out how to do this well on DXT1 took a lot of experimentation, so I didn't have the energy to go and improve it.

Here are the visualizations:

After studying the clusterizations visualized as massive PNG files I saw a lot of nonsensical things. The algorithm worked, but sometimes clusters would be surprisingly large (in 6D for endpoints or 16D space for selectors), leading to unrelated blocks being lumped into the same cluster.

To fix this, I started using Lloyd's algorithm at a higher level, so the codebook could be refined over several iterations:

1. Create candidate codebook (like crunch)
2. Reassign each input block to the best codebook entry (by trying them all and computing the error of each), creating a new clusterization.
3. Compute new codebook entries (by optimizing the endpoints or selecting the best selectors to use for each cluster factoring in the block endpoints).
4. Repeat steps 2-3 X times. Each iteration will lower the overall error.

You also need to insert steps to identify redundant codebook entries and delete them. If the codebook becomes too small, you can find the cluster with the worst error and split it into two or more clusters.
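
The refinement loop can be sketched on a toy problem. The code below uses scalar inputs and a scalar codebook as stand-ins for real endpoint or selector vectors, uses the cluster mean as the "optimized" entry, and treats entries that attract no inputs as redundant. All of that is an assumption made to keep the example small (it is not how crunch or Basis represent things), and cluster splitting is not shown.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Index of the codebook entry closest to v.
static size_t nearest_entry(const std::vector<float>& codebook, float v)
{
    size_t best = 0;
    for (size_t i = 1; i < codebook.size(); i++)
        if (std::fabs(v - codebook[i]) < std::fabs(v - codebook[best]))
            best = i;
    return best;
}

// Refine a codebook with Lloyd iterations. Returns the total squared error.
static double refine_codebook(const std::vector<float>& inputs,
                              std::vector<float>& codebook, int iterations)
{
    assert(!inputs.empty() && !codebook.empty());

    for (int iter = 0; iter < iterations; iter++)
    {
        std::vector<double> sum(codebook.size(), 0.0);
        std::vector<int> count(codebook.size(), 0);

        // Step 2: reassign each input to the best codebook entry
        for (float v : inputs)
        {
            const size_t best = nearest_entry(codebook, v);
            sum[best] += v;
            count[best]++;
        }

        // Step 3: compute new entries, dropping entries nothing was assigned to
        std::vector<float> next;
        for (size_t i = 0; i < codebook.size(); i++)
            if (count[i] > 0)
                next.push_back((float)(sum[i] / count[i]));

        codebook = next;
    }

    // Measure the final error against the refined codebook
    double total_err = 0.0;
    for (float v : inputs)
    {
        const double d = v - codebook[nearest_entry(codebook, v)];
        total_err += d * d;
    }
    return total_err;
}
```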

Also, whenever you decide to use a different endpoint or selector to code a block, you've changed the clusterization used and you should recompute the codebook (factoring in the actual clusterization). Optimizations like selector RDO change the final clusterization.

Monday, June 4, 2018

Better PVRTC encoding

I'm taking a quick break from RDO BC7. I've been working on it for too long and I need to mix things up.

I've been experimenting with high-quality PVRTC encoding for several years, off and on. I've finally found an algorithm that is simple and fast, that in most cases beats PVRTexTool's approach. (PVRTexTool is the "standard" high-quality production encoder for PVRTC. To my knowledge it's the best available.) In the cases I can find where PVRTexTool does better, the quality delta is low (<.5 dB).

I know PVRTC is doomed long term (ASTC is far better), but it's still pervasive on iOS devices.

Useful references:
http://roartindon.blogspot.com/2014/08/pvr-texture-compression-exploration.html
http://jcgt.org/published/0003/04/07/paper-lowres.pdf

It's a three phase algorithm:
1. Compute endpoints using van Waveren's approximation: For each 4x4 block compute the RGB(A) bounds of that block. Set the low endpoint to the floor() of the bounds (with correct rounding to 554), and set the high endpoint to the ceil() of the bounds (again with correct rounding to 555).

An alternative is to use Intensity Dilation (see the link to the paper), which may lead to better results. But this is far simpler and it's what Lim successfully uses in his real-time encoder.

One trick you can use for slightly higher quality is to try a pass with the low/high endpoints inverted (use the high bounds for the first endpoint with ceil(), and the low bounds for the second endpoint with floor()). Choose the ordering that minimizes the overall error. Once the endpoint order is set in place in PVRTC it can be difficult for this algorithm to change it (because all blocks influence all other blocks directly/indirectly).

2. Now go and select the optimal modulation values for each pixel using these endpoints (factoring in the PVRTC endpoint interpolation, of course).

The results at this point are usually a little better than PVRTexTool in "Lower" quality, at least visually. The results so far should be equivalent or slightly better than Lim's encoder (depending on how much you approximate the modulation value search).

Interestingly, the results up to this point are acceptable for some use cases already. The output is too banded and high contrast areas will be smeared out, but the distortion introduced up to this point is predictable and stable.

3. For each block in raster order: Now use 1x1 block least squares optimization (using normal equations) separately on each component to solve for the best low/high endpoints to use for each block. A single block impacts 7x7 pixels (or 3x3 blocks) in PVRTC 4bpp mode.

The surrounding endpoints, modulation values, and output pixels are constants, and the only unknowns are the endpoints, so this is fairly straightforward. This is just like how it's done in BC1 and BC7 encoders, except we're dealing with larger matrices (7x7 instead of 4x4) and we need to carefully factor in the endpoint interpolation.

For solving, the equation is Ax=b, where A is a 49x2 matrix (7x7 pixels=49), x is a 2x1 matrix (the low and high endpoint values we're solving for), and b is 49x1 matrix containing the desired output values (which are the desired RGB output pixel values minus the interpolated and weighted contribution from the surrounding constant endpoints). The A matrix contains the per-pixel modulation weights multiplied by the amount the endpoint influences the result (factoring in endpoint interpolation).
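
Because there are only two unknowns, the normal equations (A^T A)x = A^T b reduce to a 2x2 system that can be solved directly. Here's a minimal sketch for one color component. It assumes each row of A has already been built as (w*(1-m), w*m), where w is the block's bilinear influence on that pixel and m is the pixel's modulation weight (0, 3/8, 5/8 or 1). The function name and parameter layout are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Solve min ||A x - b||^2 for x = (lo, hi) using the normal equations.
//   a_lo[i], a_hi[i]: the two columns of A for pixel i
//   b[i]: desired value minus the contribution of the surrounding endpoints
// Returns false if the system is degenerate.
static bool solve_endpoints_ls(const std::vector<float>& a_lo,
                               const std::vector<float>& a_hi,
                               const std::vector<float>& b,
                               float& lo, float& hi)
{
    assert(a_lo.size() == a_hi.size() && a_lo.size() == b.size());

    // Accumulate A^T A (symmetric 2x2) and A^T b (2x1)
    double s00 = 0, s01 = 0, s11 = 0, t0 = 0, t1 = 0;
    for (size_t i = 0; i < b.size(); i++)
    {
        s00 += (double)a_lo[i] * a_lo[i];
        s01 += (double)a_lo[i] * a_hi[i];
        s11 += (double)a_hi[i] * a_hi[i];
        t0 += (double)a_lo[i] * b[i];
        t1 += (double)a_hi[i] * b[i];
    }

    // Solve the 2x2 system with Cramer's rule
    const double det = s00 * s11 - s01 * s01;
    if (std::fabs(det) < 1e-9)
        return false;

    lo = (float)((t0 * s11 - t1 * s01) / det);
    hi = (float)((t1 * s00 - t0 * s01) / det);
    return true;
}
```

The system is degenerate when, for example, every influenced pixel uses the same modulation value, in which case the existing endpoints can simply be kept.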

After you've done 1x1 least squares on each component, the results are rounded to 554/555. Then you find the optimal modulation values for the affected 7x7 block of pixels, and only accept the results if the overall error has been reduced.

You can "twiddle" the modulation values in various ways before doing the least squares calculations, just like BC1/BC7 encoders do. I've tried incrementing the lowest modulation value and/or decrementing the higher modulation value, and seeing if the results are any better. This works well.

Step 3 can be repeated multiple times to improve quality further. 3-5 refinement iterations seem to be enough. You can vary the block processing order for slightly higher quality.

There are definitely many other improvements, but this is the basic idea. Each step is simple, and all steps are vectorizable and threadable.

PVRTexTool uses 2x2 SVD, as far as I know, but this seems unnecessary, and seems to lead to noticeable stipple-like artifacts being introduced in many cases. (Check out the car door below.) Also, PVRTexTool's handling of gradients seems questionable (perhaps endpoint rounding issues?).

Quick example encodings:

Original:

PVRTexTool 4.19.0, "Very High Quality":

RGB Average Error: PSNR: 36.245, SSIM: 0.979442

Luma Error: PSNR: 36.841, SSIM: 0.984710

New encoder (using perceptual colorspace metrics, so it's trying to optimize for lower luma error):

RGB Average Error: PSNR: 36.728, SSIM: 0.976032

Luma Error: PSNR: 37.827, SSIM: 0.990251

Original:

PVRTexTool Very High:
RGB Average Error: PSNR: 41.809, SSIM: 0.993144
Luma Error: PSNR: 41.943, SSIM: 0.993875

New encoder (perceptual mode):
RGB Average Error: PSNR: 41.730, SSIM: 0.991800
Luma Error: PSNR: 43.419, SSIM: 0.997416

Original:

PVRTexTool high quality:
RGB Average Error: PSNR: 27.640, SSIM: 0.954125
Luma Error: PSNR: 30.292, SSIM: 0.964433

New encoder (RGB metrics):
RGB Average Error: PSNR: 29.523, SSIM: 0.957067
Luma Error: PSNR: 32.702, SSIM: 0.974145