Tuesday, June 12, 2018

Real-time PVRTC encoding for a universal GPU texture format system

Here's one way to support PVRTC in a universal GPU texture format system that transcodes from a block-based format like ETC1S.

First, study this PVRTC code:
https://bitbucket.org/jthlim/pvrtccompressor/src/default/PvrTcEncoder.cpp

Unfortunately, this library has several key bugs, but its core texture encoding approach is sound for real-time use.

Don't use its decompressor: it's not bit-accurate vs. the GPU and doesn't unpack alpha properly. Use this "official" decoder as a reference instead:

https://github.com/google/swiftshader/blob/master/third_party/PowerVR_SDK/Tools/PVRTDecompress.h

Function EncodeRgb4Bpp() has two passes:

1. The first pass computes RGB(A) bounding boxes for each 4x4 block: 

    for(int y = 0; y < blocks; ++y) {
        for(int x = 0; x < blocks; ++x) {
            ColorRgbBoundingBox cbb;
            CalculateBoundingBox(cbb, bitmap, x, y);
            PvrTcPacket* packet = packets + GetMortonNumber(x, y);
            packet->usePunchthroughAlpha = 0;
            packet->SetColorA(cbb.min);
            packet->SetColorB(cbb.max);
        }
    }
Most importantly, SetColorA() must floor the bounding box minimum and SetColorB() must ceil the maximum. Note that the alpha version of the code in this library (function EncodeRgba4Bpp()) is very wrong: it assumes an endpoint alpha of 7 decodes to 255, which is incorrect (it's actually (7*2)*255/15 = 238).
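To see where the 238 comes from, here's a minimal sketch (my own, assuming the usual bit-replication expansion of PVRTC's 3-bit endpoint alpha; the function name is hypothetical):

    // Expand a 3-bit PVRTC endpoint alpha field to 8 bits.
    // The 3-bit field becomes a 4-bit value with a zero low bit, and that
    // 4-bit value is then bit-replicated to 8 bits.
    static inline unsigned char ExpandAlpha3To8(unsigned int a3)
    {
        unsigned int a4 = a3 << 1;              // 3 bits -> 4 bits (LSB is zero)
        return (unsigned char)((a4 << 4) | a4); // replicate: a4 * 17 == (a4 * 255) / 15
    }

ExpandAlpha3To8(7) returns 238, not 255.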

This pass can be done while decoding ETC1S blocks during transcoding. The endpoint/modulation values need to be saved to a temporary buffer.
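For example, a scratch buffer along these lines (a sketch with hypothetical names, not the actual transcoder's data structures) holds everything pass 2 needs:

    // Per-block state saved while decoding ETC1S, so pass 2 doesn't have to
    // re-decode the ETC1S data.
    struct BlockScratch
    {
        ColorRgba<unsigned char> selectorColors[4]; // decoded ETC1S color for each 2-bit selector
        uint32_t selectors;                         // 16 x 2-bit selectors for the 4x4 block
    };

    std::vector<BlockScratch> scratch(blocks * blocks); // linear order; Morton order is only needed for the PVRTC packets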

It's possible to swap the low and high endpoints and get an encoding that results in less error (I believe because the endpoint encoding precision of blue isn't symmetrical: it's 4 bits for one endpoint and 5 for the other, not 5/5), but you'd have to encode the image twice, so it doesn't seem worth the trouble.

2. Now that the per-block endpoints are computed, you can compute the per-pixel modulation values. This function is quite optimizable without requiring vector code (which doesn't work on the Web yet):

for(int y = 0; y < blocks; ++y) {
    for(int x = 0; x < blocks; ++x) {
        const unsigned char (*factor)[4] = PvrTcPacket::BILINEAR_FACTORS;
        const ColorRgba<unsigned char>* data = bitmap.GetData() + y * 4 * size + x * 4;
        uint32_t modulationData = 0;

        for(int py = 0; py < 4; ++py) {
            const int yOffset = (py < 2) ? -1 : 0;
            const int y0 = (y + yOffset) & blockMask;
            const int y1 = (y0+1) & blockMask;

            for(int px = 0; px < 4; ++px) {
                const int xOffset = (px < 2) ? -1 : 0;
                const int x0 = (x + xOffset) & blockMask;
                const int x1 = (x0+1) & blockMask;

                const PvrTcPacket* p0 = packets + GetMortonNumber(x0, y0);
                const PvrTcPacket* p1 = packets + GetMortonNumber(x1, y0);
                const PvrTcPacket* p2 = packets + GetMortonNumber(x0, y1);
                const PvrTcPacket* p3 = packets + GetMortonNumber(x1, y1);

                ColorRgb<int> ca = p0->GetColorRgbA() * (*factor)[0] +
                                   p1->GetColorRgbA() * (*factor)[1] +
                                   p2->GetColorRgbA() * (*factor)[2] +
                                   p3->GetColorRgbA() * (*factor)[3];

                ColorRgb<int> cb = p0->GetColorRgbB() * (*factor)[0] +
                                   p1->GetColorRgbB() * (*factor)[1] +
                                   p2->GetColorRgbB() * (*factor)[2] +
                                   p3->GetColorRgbB() * (*factor)[3];

                const ColorRgb<unsigned char>& pixel = data[py*size + px];
                ColorRgb<int> d = cb - ca;
                ColorRgb<int> p{pixel.r*16, pixel.g*16, pixel.b*16};
                ColorRgb<int> v = p - ca;

                // PVRTC uses weightings of 0, 3/8, 5/8 and 1
                // The boundaries for these are 3/16, 1/2 (=8/16), 13/16
                int projection = (v % d) * 16;
                int lengthSquared = d % d;
                if(projection > 3*lengthSquared) modulationData++;
                if(projection > 8*lengthSquared) modulationData++;
                if(projection > 13*lengthSquared) modulationData++;
                modulationData = BitUtility::RotateRight(modulationData, 2);

                factor++;
            }
        }

        PvrTcPacket* packet = packets + GetMortonNumber(x, y);
        packet->modulationData = modulationData;
    }
}

The code above interpolates the endpoints in full RGB(A) space, which isn't necessary. You can sum each channel into a single value (like Luma, but just R+G+B), interpolate that instead (much faster in scalar code), then decide which modulation values to use in 1D space. Also, you can unroll the innermost px/py loops using macros or whatever.
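For example, the per-pixel math collapses to something like this (a sketch of the 1D idea with hypothetical variable names, not drop-in code for the library above). Here caSum/cbSum are the bilinearly interpolated endpoint R+G+B sums for the pixel (at the same 16x scale as ca/cb above), and pixelSum is the source pixel's (r + g + b) * 16:

    int d = cbSum - caSum;
    int v = pixelSum - caSum;

    // Make the axis direction non-negative so the comparisons below don't flip.
    if(d < 0) { d = -d; v = -v; }

    // Same 3/16, 8/16 and 13/16 boundaries as the RGB version, just in 1D.
    // (If d is 0 the block is flat and any modulation value decodes the same.)
    int projection = v * 16;
    uint32_t mod = 0;
    if(projection > 3*d) mod++;
    if(projection > 8*d) mod++;
    if(projection > 13*d) mod++;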

Encoding from ETC1S simplifies things somewhat because, for each block, you can precompute the R+G+B values to use for each of the 4 possible input selectors.
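For instance, something like this (a sketch with hypothetical names) runs once per ETC1S block, after which the per-pixel work is just a table lookup:

    // Precompute the R+G+B sum (at the 16x scale used above) for each of the
    // 4 possible ETC1S selector values of this block.
    int selectorSums[4];
    for(int s = 0; s < 4; ++s) {
        const ColorRgba<unsigned char>& c = blockColors[s]; // decoded ETC1S color for selector s
        selectorSums[s] = (c.r + c.g + c.b) * 16;
    }
    // Per pixel: pixelSum = selectorSums[that pixel's selector];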

That's basically it. If you combine this post with my previous one, you've got a nice real-time PVRTC encoder usable in WebAssembly/asm.js (i.e. it doesn't need vector ops to be fast). Quality is surprisingly good for a real-time encoder, especially if you add the optional 3rd pass described in my other post.

Alpha is tougher to handle, but the basic concepts are the same.

The encoder in this library doesn't support punch-through alpha, which in my testing is quite valuable and easy to encode.
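Here's a rough sketch of the idea (based on my reading of the PVRTC format, not on this library; ComputeModulation() and pixelAlpha[] are hypothetical): set the packet's punch-through flag and force modulation value 2 for any mostly-transparent texel, which decodes as a 50/50 color blend with alpha forced to zero.

    bool anyTransparent = false;
    uint32_t modulationData = 0;
    for(int i = 15; i >= 0; --i) {            // pack 2 bits per texel, texel 0 in the low bits
        uint32_t mod = ComputeModulation(i);  // 0..3, as in pass 2 above (hypothetical helper)
        if(pixelAlpha[i] < 16) {              // arbitrary transparency threshold
            mod = 2;                          // punch-through: alpha forced to 0
            anyTransparent = true;
        }
        modulationData = (modulationData << 2) | mod;
    }
    packet->usePunchthroughAlpha = anyTransparent ? 1 : 0;
    packet->modulationData = modulationData;

If I recall the format correctly, setting the flag also changes the block's modulation weights from 0, 3/8, 5/8, 1 to 0, 4/8, 4/8, 1, so the opaque texels in that block lose a little precision.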
