Wednesday, September 7, 2016

ETC1->DXT1 encoding table error visualization

Here's are two visualizations of the overall DXT1 encoding error due to using this table, assuming each selector is used equally (which is not always true). This is the lookup table referred to in my previous post.

Each small 32x32 pixel tile in this image visualizes a R,G slice of the 3D lattice, there are 32 tiles for B (left to right), and there are 8 rows overall. The first row of tiles is for ETC intensity table 0, the second 1, etc.

First visualization, where the max error in each individual tile is scaled to white:


Second visualization, visualizing max overall encoding error relative to all tiles:


Hmm - the last row (representing ETC1 intensity table 7) is approximated the worst in DXT1.


Monday, September 5, 2016

More thoughts on a universal GPU texture interchange format

Just some random thoughts:

I still think the idea of a universal GPU texture compression standard is fascinating and useful. Something that can be efficiently transcoded to 2 or more major vendor formats, without sacrificing too much along the quality or compression ratio axes. Developers could just encode to this standard interchange format and ship to a large range of devices without worrying about whether GPU Y supports arcane texture format Z. (This isn't my idea, it's from Won Chun at RAD.)

Imagine, for example, a format that can be efficiently transcoded to ASTC, with an alternate mode in the transcoder that outputs BC7 as a fallback. Interestingly, imagine if this GPU texture interchange format looked a bit better (and/or transcoded more quickly) when transcoded into one of the GPU formats verses the other. This situation seems very possible in some of the designs of a universal format I've been thinking about.

Now imagine, in a few years time, a large set of universal GPU textures gets used and stored by developers, and distributed into the wild on the web. Graphics or rendering code samples even start getting distributed using this interchange format. A situation like this would apply pressure to the other GPU vendor with the inferior format to either dump their format or create a newer format more compatible with efficient transcoding.

To put it simply, a universal format could help fix this mess of GPU texture formats we have today.

Visualizing ETC1 texture compression

The ETC1 format consists of two block colors, two intensity table selectors, two mode bits ("diff" and "flip"), and 16 2-bit selectors. Here are some simple visualizations of what this encoded data looks like.

The original image (kodim14):


The ETC1 encoded image (using rg_etc1 in slow mode - modified to use perceptual colorspace metrics):


Error: Max:  63, Mean: 2.896, MSE: 19.284, RMSE: 4.391, PSNR: 35.279

Here's the selector image (the 2-bit selectors have been scaled up to 0-255):


Subblock 0's color, expanded to 8,8,8:



Subblock 0's intensity, scaled from 0-7 to 0-255:



Subblock 1's color, expanded to 8,8,8:


Subblock 1's intensity, scaled from 0-7 to 0-255:


The "diff" mode bits (white=differential mode, black=individual mode):


The "flip" mode bits (white=flipped):


Saturday, September 3, 2016

ETC1 block color clusterization experiment

Intro


ETC1 is a well thought out, elegant little GPU format. In my experience a few years ago writing a production quality block ETC1 encoder, I found it to be far less fiddly than DXT1. Both use 64-bits to represent a 4x4 texel block, or 4-bits per texel.

I've been very curious how hard it would be to add ETC1/2 support to crunch. Also, many people have asked about ETC1 support, which is guaranteed to be available on OpenGL ES 2.0 compatible Android devices. crunch currently only supports the DXT1/5/N (3DC) texture formats. crunch's higher level classes are highly specific to the DXT formats, so adding a new format is not trivial.

One of the trickier (and key) problems in adding a new GPU format to crunch is figuring out how to group blocks (using some form of cluster analysis) so they can share the same endpoints. GPU formats like DXT1 and ETC1 are riddled with block artifacts, and bad groupings can greatly amplify them. crunch for DXT has a endpoint clusterization algorithm that was refined over many tens of thousands of real-life game textures and satellite photography. I've just begun experimenting with ETC1, and so far I'm very impressed with how well behaved and versatile it is.

Note this experiment was conducted in a new data compression codebase I've been building, which is much larger than crunch's.

ETC1 Texture Compression


Unlike DXT1, which only supports 3 or 4 unique block colors, the ETC1 format supports up to 8 unique block colors. It divides up the block into either two 4x2 or 2x4 pixel "subblocks". A single "flip" bit controls whether or not the subblocks are oriented horizontally or vertically. Each subblock has 4 colors, for 8 total.

The 4 subblock colors are created by taking the subblock's base color and adding to it 4 grayscale colors from an intensity table. Each subblock has 3 bits which selects which intensity table to apply. The intensity tables are constant and part of the spec.

To encode the two block colors, ETC1 supports two modes: an "individual" mode, where each color is encoded to 4:4:4, or a "differential" mode, where the first color is 5:5:5 and the second color is a two's complement encoded 3:3:3 delta relative to the base color. The delta is applied before the base color is scaled to 8-bits.

From an encoding perspective, individual mode is most useful when the two subblocks have wildly different colors (favoring color diversity vs. encoding precision), and delta mode is most useful when encoding precision is more useful than diversity.

Each pixel is represented using 2-bit selectors, just like DXT1. Except in ETC1, the color selected depends on which subblock the pixel is within.

So that's ETC1 in a nutshell. In practice, from what I remember its quality is a little lower than DXT1, but not by much. Its artifacts look more pleasant to me than DXT1's (obviously subjective). Each ETC1 block is represented by 2 colorspace lines that are always parallel to the grayscale axis. By comparison, with DXT1, there's only a single line, but it can be in any direction, and perhaps that gives it a slight advantage.

ETC1 Endpoint Clusterization


The goal here is to figure out how to reduce the total number of unique endpoints (or block colors and intensity table indices) in an ETC1 encoded image without murdering the quality. This is just an early experiment, so let's try simplifying the ETC1 format itself to keep things simple. This experiment always use differential block color mode, with the delta color set to (0,0,0). So each subblock is represented using the same 5:5:5 color, and the same intensity table. The flip bit is always false. Obviously, this is going to lower quality, but let's see what happens. Note this simplified format is still 100% compatible with existing ETC1 decoders, we're just limiting ourselves to only using a simpler subset.

Here's the original image (kodim18 - because I remember this image being a pain to handle well in crunch for DXT1):


Here's the image encoded using high quality ETC1 compression (using rg_etc1, slow mode, perceptual colorspace metrics):


Delta:

Grayscale delta histogram:


Error: Max:  56, Mean: 2.827, MSE: 16.106, RMSE: 4.013, PSNR: 36.061

So the ETC1 encoding that takes advantage of all ETC1 features is 36.061 dB.

Here's the encoding using just diff mode, no flipping, with a (0,0,0) delta color:


Delta:


Grayscale delta histogram:


Max:  74, Mean: 3.638, MSE: 27.869, RMSE: 5.279, PSNR: 33.680

So we've lost 2.38 dB by limiting ourselves to this simpler subset of ETC1. The reduction in quality is obviously visible, but by no means fatal for the purposes of this quick experiment. 

In this experiment, each ETC1 block only contains 4 unique colors (or a single colorspace line, with "low" and "high" endpoints and 2 intermediate colors). Here's a visualization of the "low" and "high" endpoints in this image:



Now let's clusterize these block color endpoints, using 6D tree structured VQ (vector quantization) to perform the clusterization. The output of this step consists of a series of clusters, and each cluster contains one or more block indices. The idea is, blocks with similar endpoint vectors will be placed into the same cluster. This is a similar process used by crunch for DXT1. It's much like generating a RGB color palette from an array of image colors, except we're dealing with 6D vectors instead of 3D color vectors, and instead of using the output palette directly all we really care about is how the input vectors are grouped.

Here's a visualization of the cluster endpoint centroid vectors after generating 32 clusters:



Once we have the image organized into block clusters containing similar endpoints, use an internal helper class within rg_etc1 to find the near-optimal 5:5:5 endpoint and intensity table to represent all the pixels within each cluster. We can now create a ETC1-compatible texture by processing each block cluster and selecting the optimal selectors to use for each pixel.

Let's see what this texture looks like, and the PSNR, after limiting the number of unique endpoints.

ETC1 (subset) with 64 unique endpoints:



Error: Max: 110, Mean: 5.865, MSE: 70.233, RMSE: 8.380, PSNR: 29.665


ETC1 (subset) 256 unique endpoints:



Error: Max:  93, Mean: 4.624, MSE: 45.889, RMSE: 6.774, PSNR: 31.514


ETC1 (subset) 512 unique endpoints:



Error: Max:  87, Mean: 4.225, MSE: 38.411, RMSE: 6.198, PSNR: 32.286

ETC1 (subset) 1024 unique endpoints:



Error: Max:  87, Mean: 3.911, MSE: 32.967, RMSE: 5.742, PSNR: 32.950

ETC1 (subset) 4096 unique endpoints:



Error: Max:  87, Mean: 3.642, MSE: 28.037, RMSE: 5.295, PSNR: 33.654


Next Steps


This experiment shows one way to clusterize the endpoint optimization process in a limited subset of the ETC1 format. This first step must be mastered before crunch for ETC1 can be written.

The clusterization step outlined here isn't aware of flipping, or that each block can have 2 block colors, and we haven't even looked at the selectors yet. A production encoder will need to support more features of the ETC1 format. Note that crunch for DXT1 doesn't support 3 color blocks and works just fine, so it's possible we don't need to support every encoding feature.

Some next steps:

- Figure out how to best clusterize the full format. Expand the format subset to include two block colors, flipping, and both encodings.
Is 6D clusterization good enough - or is 12D needed?
- Selector clusterization 
- ETC1 specific refinement stages: refine endpoints based off the clusterized endpoints, then refine the clusterized endpoints based off the clusterized selectors, possibly repeat.
- crunch-style tiling ("macroblocking") will most likely be needed to get bitrate down to JPEG+real-time encoding competitive levels.
- ETC2 support

(Currently, I'm conducting these experiments in my spare time, in between VR and optimization contracts. If you're really interested in accelerating development of crunch for a specific GPU format please contact info@binomial.info.)

Monday, August 29, 2016

Friday, August 5, 2016

Brotli levels 0-10 vs. Oodle Kraken

For codec version info, compiler settings, etc. see this previous post.

This graph demonstrates that varying Brotli's compression level from 0 to 10 noticeably impacts its decompression throughput. (Level 11 is just too slow to complete the benchmark overnight.) As I expected, at none of these settings is it able to compete against Kraken.




Interestingly, it appears that at Brotli's lowest settings (0 and 1) it outputs compressed data that is extremely (and surprisingly) slow to decode. (I've highlighted these settings in yellow and green below.) I'm not sure if this is intentional or not, but with this kind of large slowdown I would avoid these Brotli settings (and use something like zlib or LZ4 instead if you need that much throughput).



Level Compressed Size
 0    2144016081
 1    2020173184
 2    1963448673
 3    1945877537
 4    1905601392
 5    1829657573
 6    1803865722
 7    1772564848
 8    1756332118
 9    1746959367
10    1671777094

Original 5374152762

Thursday, August 4, 2016

Few notes about the previous post

This rant is mostly directed at the commenters that claimed I hobbled the open source codecs (including my own!) by not selecting the "proper" settings:

Please look closely at the red dots. Those represent Kraken. Now, this is a log10/log2 graph (log10 on the throughput axis.) Kraken's decompressor is almost one order of magnitude faster than Brotli's. Specifically, it's around 5-8x faster, just from eyeing the graph. No amount of tweaking Brotli's settings is going to speed it up this much. Sorry everyone. I've benchmarked Brotli at settings 0-10 (11 is just too slow) overnight and I'll post them tomorrow, just to be sure.

There is only a single executable file. The codecs are statically linked into this executable. All open source codecs were compiled with Visual Studio 2015 with optimizations enabled. They all use the same exact compiler settings. I'll update the previous post tomorrow with the specific settings.

I'm not releasing my data corpus. Neither does Squeeze Chart. This is to prevent codec authors from tweaking their algorithms to perform well on a specific corpus while neglecting general purpose performance. It's just a large mix of data I found over time that was useful for developing and testing LZHAM. I didn't develop this corpus with any specific goals in mind, and it just happens to be useful as a compressor benchmark. (The reasoning goes: If it was good enough to tune LZHAM, it should be good enough for newer codecs.)