Friday, June 8, 2018

How to improve crunch's codebook generators

While writing Basis ETC I sat down and studied the codebook generation process used in crunch. crunch would create candidate representational vectors (for endpoints or selectors), clusterize these candidates (using top-down clusterization), assign blocks to the closest codebook entry, and then compute the best DXT1 endpoints or selectors to use for each cluster. That's basically it. Figuring out how to do this well on DXT1 took a lot of experimentation, so I didn't have the energy to go and improve it.

Here are the visualizations:



After studying the clusterizations visualized as massive PNG files, I saw a lot of nonsensical things. The algorithm worked, but sometimes clusters would be surprisingly large (in 6D endpoint space or 16D selector space), leading to unrelated blocks being lumped into the same cluster.

To fix this, I started using Lloyd's algorithm at a higher level, so the codebook could be refined over several iterations:

1. Create candidate codebook (like crunch)
2. Reassign each input block to the best codebook entry (by trying them all and computing the error of each), creating a new clusterization.
3. Compute new codebook entries (by optimizing the endpoints or selecting the best selectors to use for each cluster factoring in the block endpoints).
4. Repeat steps 2-3 X times. Each iteration will lower the overall error.

You also need to insert steps to identify redundant codebook entries and delete them. If the codebook becomes too small, you can find the cluster with the worst error and split it into two or more clusters.
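The reassign/recompute loop (steps 2-3) can be sketched as follows. This is a toy illustration, not crunch's actual code; the merge/split handling is omitted, and `compute_entry_error` and `optimize_entry_for_cluster` are hypothetical stand-ins for the real per-format error and cluster-optimization routines:

```python
# Lloyd-style refinement of a block codebook: alternate between assigning
# each block to its lowest-error codebook entry and recomputing each entry
# from the blocks assigned to it. Each iteration lowers (or holds) the error.

def refine_codebook(blocks, codebook, iterations,
                    compute_entry_error, optimize_entry_for_cluster):
    for _ in range(iterations):
        # Step 2: reassign every block to the best codebook entry.
        clusters = [[] for _ in codebook]
        for block in blocks:
            best = min(range(len(codebook)),
                       key=lambda i: compute_entry_error(block, codebook[i]))
            clusters[best].append(block)

        # Step 3: recompute each entry from its cluster (keep old entry if empty).
        codebook = [optimize_entry_for_cluster(cluster) if cluster else entry
                    for cluster, entry in zip(clusters, codebook)]
    return codebook
```

With scalar "blocks", squared error, and mean-based entries this degenerates into classic k-means, which makes a convenient sanity check.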

Also, whenever you decide to use a different endpoint or selector to code a block, you've changed the clusterization used and you should recompute the codebook (factoring in the actual clusterization). Optimizations like selector RDO change the final clusterization.

Monday, June 4, 2018

Better PVRTC encoding

I'm taking a quick break from RDO BC7. I've been working on it for too long and I need to mix things up.

I've been experimenting with high-quality PVRTC encoding for several years, off and on. I've finally found an algorithm that is simple and fast, that in most cases beats PVRTexTool's approach. (PVRTexTool is the "standard" high-quality production encoder for PVRTC. To my knowledge it's the best available.) In the cases I can find where PVRTexTool does better, the quality delta is low (<.5 dB).

I know PVRTC is doomed long term (ASTC is far better), but it's still pervasive on iOS devices.

Useful references:
http://roartindon.blogspot.com/2014/08/pvr-texture-compression-exploration.html
http://jcgt.org/published/0003/04/07/paper-lowres.pdf

It's a three-phase algorithm:
1. Compute endpoints using van Waveren's approximation: For each 4x4 block compute the RGB(A) bounds of that block. Set the low endpoint to the floor() of the low bound (with correct rounding to 554), and set the high endpoint to the ceil() of the high bound (again with correct rounding to 555).

An alternative is to use Intensity Dilation (see the link to the paper), which may lead to better results. But this is far simpler and it's what Lim successfully uses in his real-time encoder.

One trick you can use for slightly higher quality is to try a pass with the low/high endpoints inverted (use the high bounds for the first endpoint with ceil(), and the low bounds for the second endpoint with floor()). Choose the ordering that minimizes the overall error. Once the endpoint order is set in place in PVRTC it can be difficult for this algorithm to change it (because all blocks influence all other blocks directly/indirectly).
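Step 1 can be sketched like this, under simplifying assumptions: each block is 16 (r, g, b) tuples with 8-bit components, and the quantize helpers round down (low endpoint, RGB 554) or up (high endpoint, RGB 555) then expand back to 8 bits for error evaluation. The exact expand formula here is an illustrative choice, not necessarily what a production encoder uses:

```python
import math

def quantize_floor(v, bits):
    # Round an 8-bit component down to `bits` bits, then expand back to 8 bits.
    levels = (1 << bits) - 1
    return (math.floor(v * levels / 255.0) * 255) // levels

def quantize_ceil(v, bits):
    # Same, but rounding up (for the high endpoint).
    levels = (1 << bits) - 1
    return (math.ceil(v * levels / 255.0) * 255) // levels

def block_endpoints(pixels):
    """pixels: 16 (r, g, b) tuples. Returns (low, high) endpoint colors."""
    lo = [min(p[c] for p in pixels) for c in range(3)]
    hi = [max(p[c] for p in pixels) for c in range(3)]
    low  = tuple(quantize_floor(v, b) for v, b in zip(lo, (5, 5, 4)))  # 554
    high = tuple(quantize_ceil(v, b)  for v, b in zip(hi, (5, 5, 5)))  # 555
    return low, high
```

Trying the pass a second time with the endpoints swapped (as described above) just means calling the same helpers with the roles of the bounds exchanged and keeping whichever ordering gives lower overall error.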

2. Now go and select the optimal modulation values for each pixel using these endpoints (factoring in the PVRTC endpoint interpolation, of course).
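For a single pixel, step 2 amounts to trying each modulation value and keeping the one with lowest squared error. A sketch, assuming the two endpoint colors have already been bilinearly interpolated to this pixel's position, using PVRTC's standard 4bpp modulation weights (0, 3/8, 5/8, 8/8) and ignoring punch-through alpha mode:

```python
WEIGHTS = (0, 3, 5, 8)  # PVRTC 4bpp modulation weights, in eighths

def best_modulation(pixel, low, high):
    """pixel/low/high: (r, g, b) tuples. Returns (index, squared_error)."""
    best_i, best_err = 0, float("inf")
    for i, w in enumerate(WEIGHTS):
        err = 0.0
        for c in range(3):
            # Blend the interpolated endpoints with this modulation weight.
            v = (low[c] * (8 - w) + high[c] * w) / 8.0
            err += (pixel[c] - v) ** 2
        if err < best_err:
            best_i, best_err = i, err
    return best_i, best_err
```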

The results at this point are usually a little better than PVRTexTool in "Lower" quality, at least visually. The results so far should be equivalent or slightly better than Lim's encoder (depending on how much you approximate the modulation value search).

Interestingly, the results up to this point are acceptable for some use cases already. The output is too banded, and high-contrast areas will be smeared out, but the distortion introduced up to this point is predictable and stable.

3. For each block in raster order, use 1x1 block least squares optimization (via the normal equations) separately on each component to solve for the best low/high endpoints for that block. A single block's endpoints impact 7x7 pixels (or 3x3 blocks) in PVRTC 4bpp mode.

The surrounding endpoints, modulation values, and output pixels are constants, and the only unknowns are the endpoints, so this is fairly straightforward. This is just like how it's done in BC1 and BC7 encoders, except we're dealing with a larger region (7x7 pixels instead of 4x4) and we need to carefully factor in the endpoint interpolation.

For solving, the equation is Ax=b, where A is a 49x2 matrix (7x7 pixels = 49 rows), x is a 2x1 matrix (the low and high endpoint values we're solving for), and b is a 49x1 matrix containing the desired output values (the desired RGB output pixel values minus the interpolated, weighted contribution from the surrounding constant endpoints). The A matrix contains the per-pixel modulation weights multiplied by the amount each endpoint influences the result (factoring in endpoint interpolation).

After you've done 1x1 least squares on each component, the results are rounded to 554/555. Then you find the optimal modulation values for the affected 7x7 region of pixels, and only accept the results if the overall error has been reduced.
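The per-component normal-equations solve can be sketched like this, assuming the 49x2 weight matrix A and the 49-entry target vector b have already been assembled as described above (numpy is used purely for the linear algebra):

```python
import numpy as np

def solve_endpoints(A, b):
    """Solve min ||Ax - b||^2 for x = (low, high) via the normal equations."""
    AtA = A.T @ A  # 2x2
    Atb = A.T @ b  # 2x1
    # Guard against a singular system (e.g. a flat block where every pixel
    # uses the same modulation weight): fall back to a least-squares solve.
    if abs(np.linalg.det(AtA)) < 1e-9:
        return np.linalg.lstsq(A, b, rcond=None)[0]
    return np.linalg.solve(AtA, Atb)
```

Since AtA is only 2x2, the solve is trivially cheap; the cost of this phase is dominated by building A and b and by re-scoring the affected 7x7 region afterwards.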

You can "twiddle" the modulation values in various ways before doing the least squares calculations, just like BC1/BC7 encoders do. I've tried incrementing the lowest modulation value and/or decrementing the higher modulation value, and seeing if the results are any better. This works well.

Step 3 can be repeated multiple times to improve quality more. 3-5 refinement iterations seems to be enough. You can vary the block processing order for slightly higher quality.

There are definitely many other improvements, but this is the basic idea. Each step is simple, and all steps are vectorizable and threadable.

PVRTexTool uses 2x2 SVD, as far as I know, but this seems unnecessary, and seems to lead to noticeable stipple-like artifacts being introduced in many cases. (Check out the car door below.) Also, PVRTexTool's handling of gradients seems questionable (perhaps endpoint rounding issues?).

Quick example encodings:

Original:


PVRTexTool 4.19.0, "Very High Quality":
RGB Average Error: PSNR: 36.245, SSIM: 0.979442
Luma Error: PSNR: 36.841, SSIM: 0.984710


New encoder (using perceptual colorspace metrics, so it's trying to optimize for lower luma error):
RGB Average Error: PSNR: 36.728, SSIM: 0.976032
Luma Error: PSNR: 37.827, SSIM: 0.990251


Original:


PVRTexTool Very High:
RGB Average Error: PSNR: 41.809, SSIM: 0.993144
Luma Error: PSNR: 41.943, SSIM: 0.993875


New encoder (perceptual mode):
RGB Average Error: PSNR: 41.730, SSIM: 0.991800
Luma Error: PSNR: 43.419, SSIM: 0.997416


Original:

PVRTexTool high quality:
RGB Average Error: PSNR: 27.640, SSIM: 0.954125
Luma Error: PSNR: 30.292, SSIM: 0.964433


New encoder (RGB metrics):
RGB Average Error: PSNR: 29.523, SSIM: 0.957067
Luma Error: PSNR: 32.702, SSIM: 0.974145


Wednesday, May 30, 2018

xmen_1024 encoded to .basis at various bitrates

Here's xmen_1024 compressed at various bitrates to .basis. I show the output of two transcodes: ETC1 (which is the highest quality format in baseline .basis) and PVRTC (with clamp addressing).

Original:

Optimal ETC1 is 38.896 Y PSNR

Q 16: .717 bits/texel, ETC1 28.473 Y PSNR

ETC1:
PVRTC:

Q 64: .905 bits/texel, ETC1: 30.361 Y PSNR

ETC1:

PVRTC:

Q 128: 1.064 bits/texel, ETC1: 32.026 Y PSNR

ETC1:

PVRTC:

Q 192: 1.208 bits/texel, ETC1: 30.379 Y PSNR

ETC1:

PVRTC:

Q 255: 1.362 bits/texel, ETC1: 34.630 Y PSNR

ETC1:
PVRTC:

Basis universal GPU texture format examples

The .basis format is a lossy texture compression format roughly comparable to JPEG in size, but designed specifically for GPU texture data. The format's main feature is that it can be efficiently transcoded to any other GPU texture format. We've written transcoders for BC1-5, ETC1, PVRTC, and BC7 so far; ASTC and ETC2 are coming. Transcoding complexity is similar to crunch's (my older open source texture compression tech for BC1-5). This is the first system to support a universal GPU texture format that is usable on the Web.

Excluding PVRTC, there are no complex pixel-level operations needed during transcoding. The transcoder's inner loop works at the block level (or 2x2 macroblock level) and involves simple operations (Huffman decoding, table lookups, endpoint/selector translation using small precomputed lookup tables). The PVRTC transcoder requires two passes and a temporary buffer to hold block endpoints. The PVRTC transcoder in this system is faster and simpler than any real-time PVRTC encoder I'm aware of.

I resized the kodim images to 512x512 (using a gamma-correct windowed sinc filter) so they can be transcoded to PVRTC, which only supports power-of-2 texture dimensions. Resizing these textures to 512x512 actually makes them more difficult to compress, because the details are spatially denser and artifacts stand out more.

The current single-threaded transcode times (for kodim01) on my 3.3GHz Xeon were:

ETC1 transcode time: 1.199494 ms
DXT1 transcode time: 2.198336 ms
BC7 transcode time: 2.801654 ms
DXT5A transcode time: 2.361919 ms
PVRTC1_4 transcode time: 2.756762 ms

These timings will get better as I optimize the transcoders. crunch's transcoding speed is roughly similar to .basis ETC1's. The transcoder is usable on the Web by cross-compiling it to JavaScript or WebAssembly.

For transparency support we internally store two texture slices, one for RGB and another for alpha. For ETC1 with alpha, the user can either transcode to a single texture twice as wide or tall, or use two separate textures. We support BC3-5 directly. For BC7, we currently only support opaque mode 6, but mode 4 or 5 support is coming. For PVRTC we only support opaque 4bpp, but we know how to add alpha. PVRTC 2bpp opaque/alpha is also doable. ETC2 alpha block support will be easy.

These images are at quality level 255, around 1.5-2 bits/texel. The biggest quality constraint right now is the ETC1S format that these "baseline" .basis files use internally. Our plan is to add some optional extra texture data to the files to upgrade the quality for BC7/ASTC.

Some notable properties of this system:
  • This system is intended for Web use. It's a delicate balance between transcode times, quality, GPU format support, and encoder complexity. The transcode step must be much faster than just using JPEG (or WebP, etc.) followed by a real-time GPU texture encoder, or it's not valuable. This system is basically an existence proof that it's possible to build a universal GPU texture compression system. Long term, much higher quality solutions are possible.
  • This format trades off quality to gain access to all GPUs and APIs without having to store/distribute multiple files or encode multiple times. The quality isn't that great, but it's very usable on photos, satellite photography, and rasterized map images. For some games it may not be acceptable, which is fine. The largest users of crunch aren't games at all.
  • The internal baseline format uses a subset of ETC1 (ETC1S), so transcoding to ETC1 is fastest. ETC1 is highest quality, followed by BC7, BC1, then PVRTC. The difference between ETC1 and BC1 is .1-.4 dB Y PSNR (slightly better for BC7). 
  • This system isn't full ETC1 quality because it disables 2x4/4x2 subblocks. We lose a few dB vs. optimal ETC1 due to this limitation, but we gain the ability to easily transcode to any other 4x4 block-based format. In a rate-distortion sense (using PSNR or SSIM), full ETC1 support rarely makes sense in our testing anyway (i.e. there are almost always better uses of those bits than supporting flips or 4:4:4 individual-mode colors).
  • Using ETC1S as an internal format allowed us to reuse our existing ETC1 encoder. It also allows others to easily build their own universal format encoders by re-purposing their existing ETC1 solutions. There's a large amount of value in a system that will work on any GPU or API, and we can improve the quality over time by extending it.
  • The PVRTC endpoint interpolation actually smooths out the ETC1 artifacts in a "nice" looking way. The PVRTC artifacts are definitely worse than any other format. The .basis->PVRTC transcoder favors speed and reliable behavior, not PSNR/SSIM. There are noticeable banding and low-pass artifacts. (Honestly, PVRTC is an unforgiving format and I'm surprised it looks as good as it does!) It should be possible to add dithering or smarter endpoint selection, but that would substantially slow transcoding down.
  • We have a customer using .basis ETC1 with normal maps on mobile, so I know they are usable. I doubt PVRTC would work well with normal maps, but the other formats should be usable.
  • Grayscale conversion is easy: we just convert the G channel. For DXT5/BC3 or BC7 we call the transcoder twice (once for RGB then again for alpha).
  • Newer iOS devices support ETC1 in hardware, but WebGL on iOS doesn't expose this format on these devices so we must support PVRTC. We weren't originally going to support PVRTC, but we had to due to this issue.
  • The PVRTC transcoder and decoder assume wrap addressing is enabled, for compatibility with PVRTexTool. This can be disabled (and you can use clamp addressing when fetching from the shader). This can sometimes cause issues at the very edges of the texture (see the bottom of the xmen image, or the bottom of the hramp at the end).
  • Look at these images on your phone or tablet before making any quality judgments. On an iPhone even really low .basis quality levels can look surprisingly good. Vector quantization + GPU block compression artifacts are just different from JPEG's artifacts.
(Comparison images followed here: for each test texture, the original image plus its BC7, DXT1, DXT5A, ETC1, and PVRTC transcodes.)