Just some random thoughts:
I still think the idea of a universal GPU texture compression standard is fascinating and useful. Something that can be efficiently transcoded to 2 or more major vendor formats, without sacrificing too much along the quality or compression ratio axes. Developers could just encode to this standard interchange format and ship to a large range of devices without worrying about whether GPU Y supports arcane texture format Z. (This isn't my idea, it's from Won Chun at RAD.)
Imagine, for example, a format that can be efficiently transcoded to ASTC, with an alternate mode in the transcoder that outputs BC7 as a fallback. Interestingly, imagine if this GPU texture interchange format looked a bit better (and/or transcoded more quickly) when transcoded into one of the GPU formats versus the other. This situation seems very possible in some of the designs of a universal format I've been thinking about.
Now imagine, in a few years' time, a large set of universal GPU textures gets used and stored by developers, and distributed into the wild on the web. Graphics or rendering code samples even start getting distributed using this interchange format. A situation like this would apply pressure to the GPU vendor with the inferior format to either dump their format or create a newer format more compatible with efficient transcoding.
To put it simply, a universal format could help fix this mess of GPU texture formats we have today.
Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Monday, September 5, 2016
Visualizing ETC1 texture compression
The ETC1 format consists of two block colors, two intensity table selectors, two mode bits ("diff" and "flip"), and 16 2-bit selectors. Here are some simple visualizations of what this encoded data looks like.
The original image (kodim14):
The ETC1 encoded image (using rg_etc1 in slow mode - modified to use perceptual colorspace metrics):
Here's the selector image (the 2-bit selectors have been scaled up to 0-255):
Subblock 0's intensity, scaled from 0-7 to 0-255:
Subblock 1's color, expanded to 8,8,8:
Subblock 1's intensity, scaled from 0-7 to 0-255:
The "flip" mode bits (white=flipped):
Saturday, September 3, 2016
ETC1 block color clusterization experiment
Intro
ETC1 is a well thought out, elegant little GPU format. In my experience a few years ago writing a production quality ETC1 block encoder, I found it to be far less fiddly than DXT1. Both use 64 bits to represent a 4x4 texel block, or 4 bits per texel.
I've been very curious how hard it would be to add ETC1/2 support to crunch. Also, many people have asked about ETC1 support, which is guaranteed to be available on OpenGL ES 2.0 compatible Android devices. crunch currently only supports the DXT1/5/N (3DC) texture formats. crunch's higher level classes are highly specific to the DXT formats, so adding a new format is not trivial.
One of the trickier (and key) problems in adding a new GPU format to crunch is figuring out how to group blocks (using some form of cluster analysis) so they can share the same endpoints. GPU formats like DXT1 and ETC1 are riddled with block artifacts, and bad groupings can greatly amplify them. crunch for DXT has an endpoint clusterization algorithm that was refined over many tens of thousands of real-life game textures and satellite photography. I've just begun experimenting with ETC1, and so far I'm very impressed with how well behaved and versatile it is.
Note this experiment was conducted in a new data compression codebase I've been building, which is much larger than crunch's.
ETC1 Texture Compression
Unlike DXT1, which only supports 3 or 4 unique block colors, the ETC1 format supports up to 8 unique block colors. It divides up the block into either two 4x2 or 2x4 pixel "subblocks". A single "flip" bit controls whether or not the subblocks are oriented horizontally or vertically. Each subblock has 4 colors, for 8 total.
The 4 subblock colors are created by taking the subblock's base color and adding to it 4 grayscale colors from an intensity table. Each subblock has 3 bits which select which intensity table to apply. The intensity tables are constant and part of the spec.
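Here's a minimal sketch of that step. The table values are the ETC1 modifier tables; I've listed each one sorted from most negative to most positive, which is not the order the raw 2-bit pixel indices use in the bitstream. The function names (`clamp255`, `subblock_colors`) are mine, purely for illustration.

```python
# ETC1 intensity (modifier) tables, each sorted from darkest to brightest.
ETC1_INTENSITY_TABLES = [
    (-8, -2, 2, 8),
    (-17, -5, 5, 17),
    (-29, -9, 9, 29),
    (-42, -13, 13, 42),
    (-60, -18, 18, 60),
    (-80, -24, 24, 80),
    (-106, -33, 33, 106),
    (-183, -47, 47, 183),
]


def clamp255(value):
    """Clamp an integer to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, value))


def subblock_colors(base_rgb, table_index):
    """Return the 4 colors of one subblock, darkest first.

    base_rgb is the subblock's base color already expanded to 8,8,8.
    The same grayscale modifier is added to R, G and B, then clamped.
    """
    table = ETC1_INTENSITY_TABLES[table_index]
    return [tuple(clamp255(c + m) for c in base_rgb) for m in table]
```

Because the same modifier is added to all three channels, the four colors sit on a line parallel to the grayscale axis (until clamping kicks in).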
To encode the two block colors, ETC1 supports two modes: an "individual" mode, where each color is encoded to 4:4:4, or a "differential" mode, where the first color is 5:5:5 and the second color is a two's complement encoded 3:3:3 delta relative to the base color. The delta is applied before the base color is scaled to 8 bits.
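A rough sketch of how the two modes expand to 8,8,8 colors, assuming the usual bit-replication scaling (4 bits to 8, 5 bits to 8). The function names are hypothetical, and the inputs are already-unpacked component tuples, not raw block bits.

```python
def expand4(c):
    """Scale a 4-bit component to 8 bits by replicating its bits."""
    return (c << 4) | c


def expand5(c):
    """Scale a 5-bit component to 8 bits by replicating its top bits."""
    return (c << 3) | (c >> 2)


def signed3(d):
    """Interpret a 3-bit field as a two's complement value in -4..3."""
    return d - 8 if d >= 4 else d


def decode_individual(color0_444, color1_444):
    """Individual mode: two independent 4:4:4 colors."""
    return (tuple(expand4(c) for c in color0_444),
            tuple(expand4(c) for c in color1_444))


def decode_differential(base_555, delta_333):
    """Differential mode: 5:5:5 base plus a signed 3:3:3 delta.

    The delta is added in 5-bit space, before scaling to 8 bits.
    """
    second = tuple(b + signed3(d) for b, d in zip(base_555, delta_333))
    if any(c < 0 or c > 31 for c in second):
        # A valid ETC1 block never leaves the 5-bit range here.
        raise ValueError("delta takes a component outside 0..31")
    return (tuple(expand5(c) for c in base_555),
            tuple(expand5(c) for c in second))
```

For example, `decode_differential((31, 16, 0), (7, 4, 3))` applies deltas of (-1, -4, +3) to the base color.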
From an encoding perspective, individual mode is most useful when the two subblocks have wildly different colors (favoring color diversity vs. encoding precision), and delta mode is most useful when encoding precision is more useful than diversity.
Each pixel is represented using 2-bit selectors, just like DXT1. Except in ETC1, the color selected depends on which subblock the pixel is within.
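To make the "depends on which subblock" part concrete, here's a small sketch. It assumes the spec's convention that flip=0 means two 2-wide, 4-tall subblocks side by side and flip=1 means two 4-wide, 2-tall subblocks stacked vertically. `subblock_of` and `decode_pixel` are illustrative names, and the selectors are assumed to index whatever order the caller built each 4-color palette in.

```python
def subblock_of(x, y, flip):
    """Return 0 or 1: which subblock pixel (x, y) of a 4x4 block is in."""
    if flip:
        return 0 if y < 2 else 1  # top half / bottom half
    return 0 if x < 2 else 1      # left half / right half


def decode_pixel(x, y, flip, palettes, selectors):
    """Look up one pixel's color.

    palettes:  [colors of subblock 0, colors of subblock 1], 4 colors each.
    selectors: 4x4 grid indexed as selectors[y][x], values 0..3.
    """
    return palettes[subblock_of(x, y, flip)][selectors[y][x]]
```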
So that's ETC1 in a nutshell. In practice, from what I remember its quality is a little lower than DXT1, but not by much. Its artifacts look more pleasant to me than DXT1's (obviously subjective). Each ETC1 block is represented by 2 colorspace lines that are always parallel to the grayscale axis. By comparison, with DXT1, there's only a single line, but it can be in any direction, and perhaps that gives it a slight advantage.
ETC1 Endpoint Clusterization
The goal here is to figure out how to reduce the total number of unique endpoints (or block colors and intensity table indices) in an ETC1 encoded image without murdering the quality. This is just an early experiment, so let's try simplifying the ETC1 format itself to keep things simple. This experiment always uses differential block color mode, with the delta color set to (0,0,0). So each subblock is represented using the same 5:5:5 color, and the same intensity table. The flip bit is always false. Obviously, this is going to lower quality, but let's see what happens. Note this simplified format is still 100% compatible with existing ETC1 decoders; we're just limiting ourselves to a simpler subset.
Here's the original image (kodim18 - because I remember this image being a pain to handle well in crunch for DXT1):
Here's the image encoded using high quality ETC1 compression (using rg_etc1, slow mode, perceptual colorspace metrics):
Delta:
Error: Max: 56, Mean: 2.827, MSE: 16.106, RMSE: 4.013, PSNR: 36.061
So the ETC1 encoding that takes advantage of all ETC1 features is 36.061 dB.
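For reference, the numbers on the error line relate to each other roughly like this. This is a sketch under my own assumptions (errors measured per 8-bit channel value, "Mean" being the mean absolute error, peak value 255), not the actual tool that printed them.

```python
import math


def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for 8-bit data; infinite when the images are identical."""
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def error_stats(original, encoded):
    """Compare two equal-length flat sequences of 8-bit channel values."""
    diffs = [abs(a - b) for a, b in zip(original, encoded)]
    mse = sum(d * d for d in diffs) / len(diffs)
    return {
        "max": max(diffs),
        "mean": sum(diffs) / len(diffs),
        "mse": mse,
        "rmse": math.sqrt(mse),
        "psnr": psnr_from_mse(mse),
    }
```

Plugging in the MSE above, `psnr_from_mse(16.106)` comes out at roughly 36.06 dB, and the square root of 16.106 is roughly 4.013, matching the RMSE.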
Here's the encoding using just diff mode, no flipping, with a (0,0,0) delta color:
Delta:
So we've lost 2.38 dB by limiting ourselves to this simpler subset of ETC1. The reduction in quality is obviously visible, but by no means fatal for the purposes of this quick experiment.
In this experiment, each ETC1 block only contains 4 unique colors (or a single colorspace line, with "low" and "high" endpoints and 2 intermediate colors). Here's a visualization of the "low" and "high" endpoints in this image:
Now let's clusterize these block color endpoints, using 6D tree structured VQ (vector quantization) to perform the clusterization. The output of this step consists of a series of clusters, and each cluster contains one or more block indices. The idea is, blocks with similar endpoint vectors will be placed into the same cluster. This is similar to the process used by crunch for DXT1. It's much like generating an RGB color palette from an array of image colors, except we're dealing with 6D vectors instead of 3D color vectors, and instead of using the output palette directly all we really care about is how the input vectors are grouped.
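To give a feel for the idea, here's a toy tree structured VQ in plain Python. It is not crunch's (or my new codebase's) clusterizer, just a sketch: each 6D vector is assumed to be a block's (low R, low G, low B, high R, high G, high B), the cluster with the largest squared error is repeatedly split in two, and each split is refined with a few 2-means iterations. All names are made up for illustration.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors, members):
    """Mean vector of the given member indices."""
    dims = len(vectors[members[0]])
    return [sum(vectors[i][d] for i in members) / len(members)
            for d in range(dims)]


def cluster_error(vectors, members):
    """Total squared distance of a cluster's vectors to its centroid."""
    c = centroid(vectors, members)
    return sum(dist2(vectors[i], c) for i in members)


def split(vectors, members, iterations=8):
    """Split one cluster in two using a few 2-means iterations."""
    c = centroid(vectors, members)
    # Seed with two far-apart members so the split is deterministic.
    seed_a = max(members, key=lambda i: dist2(vectors[i], c))
    seed_b = max(members, key=lambda i: dist2(vectors[i], vectors[seed_a]))
    centers = [list(vectors[seed_a]), list(vectors[seed_b])]
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for i in members:
            d0 = dist2(vectors[i], centers[0])
            d1 = dist2(vectors[i], centers[1])
            halves[0 if d0 <= d1 else 1].append(i)
        if not halves[0] or not halves[1]:
            break
        centers = [centroid(vectors, halves[0]), centroid(vectors, halves[1])]
    return halves


def tsvq_clusterize(vectors, max_clusters):
    """Return a list of clusters, each a list of block indices."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < max_clusters:
        errors = [cluster_error(vectors, m) for m in clusters]
        worst = max(range(len(clusters)), key=lambda k: errors[k])
        if errors[worst] == 0:
            break  # every cluster already holds identical vectors
        a, b = split(vectors, clusters[worst])
        if not a or not b:
            break
        clusters[worst:worst + 1] = [a, b]
    return clusters
```

Note the return value is the grouping (block indices per cluster), not a palette, which is the part that matters for the next step.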
Here's a visualization of the cluster endpoint centroid vectors after generating 32 clusters:
Once we have the image organized into block clusters containing similar endpoints, we use an internal helper class within rg_etc1 to find the near-optimal 5:5:5 endpoint and intensity table to represent all the pixels within each cluster. We can now create an ETC1-compatible texture by processing each block cluster and selecting the optimal selectors to use for each pixel.
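As a rough illustration of the last part (picking selectors and an intensity table once a 5:5:5 color is known), here's a brute-force stand-in. It is not the rg_etc1 helper class, it uses plain RGB squared error instead of perceptual metrics, and the selectors index a palette sorted darkest to brightest, not the raw bitstream order. The names are hypothetical.

```python
ETC1_INTENSITY_TABLES = [
    (-8, -2, 2, 8), (-17, -5, 5, 17), (-29, -9, 9, 29), (-42, -13, 13, 42),
    (-60, -18, 18, 60), (-80, -24, 24, 80), (-106, -33, 33, 106),
    (-183, -47, 47, 183),
]


def expand5(c):
    """Scale a 5-bit component to 8 bits by bit replication."""
    return (c << 3) | (c >> 2)


def palette(base_555, table_index):
    """The 4 colors for a 5:5:5 base color and an intensity table."""
    base = [expand5(c) for c in base_555]
    return [tuple(max(0, min(255, c + m)) for c in base)
            for m in ETC1_INTENSITY_TABLES[table_index]]


def dist2(a, b):
    """Squared RGB distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_selectors(pixels, colors):
    """Pick the closest of the 4 colors per pixel.

    Returns (selectors, total squared error).
    """
    selectors, total = [], 0
    for p in pixels:
        errs = [dist2(p, c) for c in colors]
        s = min(range(4), key=lambda k: errs[k])
        selectors.append(s)
        total += errs[s]
    return selectors, total


def best_intensity_table(pixels, base_555):
    """Try all 8 tables; return (table index, total squared error)."""
    best = None
    for t in range(8):
        _, err = best_selectors(pixels, palette(base_555, t))
        if best is None or err < best[1]:
            best = (t, err)
    return best
```

For a cluster, `pixels` would be every pixel of every block in that cluster; each block then gets its own selectors against the shared palette.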
Let's see what this texture looks like, and the PSNR, after limiting the number of unique endpoints.
ETC1 (subset) with 64 unique endpoints:
Error: Max: 110, Mean: 5.865, MSE: 70.233, RMSE: 8.380, PSNR: 29.665
ETC1 (subset) 256 unique endpoints:
Error: Max: 93, Mean: 4.624, MSE: 45.889, RMSE: 6.774, PSNR: 31.514
ETC1 (subset) 512 unique endpoints:
Error: Max: 87, Mean: 4.225, MSE: 38.411, RMSE: 6.198, PSNR: 32.286
ETC1 (subset) 1024 unique endpoints:
Error: Max: 87, Mean: 3.911, MSE: 32.967, RMSE: 5.742, PSNR: 32.950
ETC1 (subset) 4096 unique endpoints:
Error: Max: 87, Mean: 3.642, MSE: 28.037, RMSE: 5.295, PSNR: 33.654
Next Steps
This experiment shows one way to clusterize the endpoint optimization process in a limited subset of the ETC1 format. This first step must be mastered before crunch for ETC1 can be written.
The clusterization step outlined here isn't aware of flipping, or that each block can have 2 block colors, and we haven't even looked at the selectors yet. A production encoder will need to support more features of the ETC1 format. Note that crunch for DXT1 doesn't support 3 color blocks and works just fine, so it's possible we don't need to support every encoding feature.
Some next steps:
- Figure out how to best clusterize the full format. Expand the format subset to include two block colors, flipping, and both encodings. Is 6D clusterization good enough, or is 12D needed?
- Selector clusterization
- ETC1 specific refinement stages: refine endpoints based off the clusterized endpoints, then refine the clusterized endpoints based off the clusterized selectors, possibly repeat.
- crunch-style tiling ("macroblocking") will most likely be needed to get bitrate down to JPEG+real-time encoding competitive levels.
- ETC2 support
(Currently, I'm conducting these experiments in my spare time, in between VR and optimization contracts. If you're really interested in accelerating development of crunch for a specific GPU format please contact info@binomial.info.)
Monday, August 29, 2016
Good article: Why software patents are evil
I have been attacked (at a time in my life when the last thing I needed was more stress!) by a patent holder before, so hey I hate software patents:
http://www.infoworld.com/article/2619609/open-source-software/why-software-patents-are-evil.html
Friday, August 5, 2016
Brotli levels 0-10 vs. Oodle Kraken
For codec version info, compiler settings, etc. see this previous post.
This graph demonstrates that varying Brotli's compression level from 0 to 10 noticeably impacts its decompression throughput. (Level 11 is just too slow to complete the benchmark overnight.) As I expected, at none of these settings is it able to compete against Kraken.
Interestingly, it appears that at Brotli's lowest settings (0 and 1) it outputs compressed data that is extremely (and surprisingly) slow to decode. (I've highlighted these settings in yellow and green below.) I'm not sure if this is intentional or not, but with this kind of large slowdown I would avoid these Brotli settings (and use something like zlib or LZ4 instead if you need that much throughput).
Level Compressed Size
0 2144016081
1 2020173184
2 1963448673
3 1945877537
4 1905601392
5 1829657573
6 1803865722
7 1772564848
8 1756332118
9 1746959367
10 1671777094
Original 5374152762
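If you'd rather look at ratios than raw totals, the table converts like this. A quick sketch, assuming the sizes are byte totals over the whole corpus and that "ratio" means original size divided by compressed size.

```python
ORIGINAL_SIZE = 5374152762

# Compressed totals per Brotli level, copied from the table above.
BROTLI_SIZES = {
    0: 2144016081,
    1: 2020173184,
    2: 1963448673,
    3: 1945877537,
    4: 1905601392,
    5: 1829657573,
    6: 1803865722,
    7: 1772564848,
    8: 1756332118,
    9: 1746959367,
    10: 1671777094,
}


def ratio(compressed_size, original_size=ORIGINAL_SIZE):
    """Compression ratio: how many times smaller the output is."""
    return original_size / compressed_size


for level, size in BROTLI_SIZES.items():
    print(f"level {level:2d}: {size:>11d}  ratio {ratio(size):.3f}")
```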
Thursday, August 4, 2016
Few notes about the previous post
This rant is mostly directed at the commenters that claimed I hobbled the open source codecs (including my own!) by not selecting the "proper" settings:
Please look closely at the red dots. Those represent Kraken. Now, this is a log10/log2 graph (log10 on the throughput axis.) Kraken's decompressor is almost one order of magnitude faster than Brotli's. Specifically, it's around 5-8x faster, just from eyeing the graph. No amount of tweaking Brotli's settings is going to speed it up this much. Sorry everyone. I've benchmarked Brotli at settings 0-10 (11 is just too slow) overnight and I'll post them tomorrow, just to be sure.
There is only a single executable file. The codecs are statically linked into this executable. All open source codecs were compiled with Visual Studio 2015 with optimizations enabled. They all use the same exact compiler settings. I'll update the previous post tomorrow with the specific settings.
I'm not releasing my data corpus. Neither does Squeeze Chart. This is to prevent codec authors from tweaking their algorithms to perform well on a specific corpus while neglecting general purpose performance. It's just a large mix of data I found over time that was useful for developing and testing LZHAM. I didn't develop this corpus with any specific goals in mind, and it just happens to be useful as a compressor benchmark. (The reasoning goes: If it was good enough to tune LZHAM, it should be good enough for newer codecs.)
Wednesday, August 3, 2016
RAD's ground breaking lossless compression product benchmarked
Intro
Progress in the practical lossless compression field has been painfully stagnant in recent years. The state of the art is now rapidly changing, with several new open source codecs announced in recent times (such as Brotli and Zstd) offering high ratios and fast decompression. Recently, RAD Game Tools released several new codecs as part of its Oodle data compression product.
My first impression after benchmarking these new codecs was "what the hell, this can't be right", and after running the benchmarks again and double checking everything my attitude changed to "this is the new state of the art, and open source codecs have a lot of catching up to do".
This post uses the same benchmarking tool and data corpus that I used in this one.
Updated Aug. 5th: Changed compiler optimization settings from /O2 to /OX and disabled exceptions, re-ran benchmark and regenerated graphs, added codec versions and benchmarking machine info.
Codec Settings
All benchmarking was done under Win 10 x64, 64-bit executable, in a single process with no multithreading.
- lz4 (v1.74): level 8, LZ4_compressHC2_limitedOutput() for compression, LZ4_decompress_safe() for decompression (Note: LZ4_decompress_fast() is faster, but it's inherently unsafe. I personally would never use the dangerous fast() version in a project.)
- lzham (lzham_codec_devel on github): level "uber", dict size 2^26
- brotli (latest on github as of Aug. 1, 2016): level 10, dict size 2^24
- bitknit (standalone library provided by RAD): BitKnit_Level_VeryHigh
- Zstd (v0.8.0): ZSTD_MAX_CLEVEL
- Oodle library version: v2.3.0
- Kraken: OodleLZ_CompressionLevel_Optimal2
- Mermaid: OodleLZ_CompressionLevel_Optimal2
- Selkie: OodleLZ_CompressionLevel_Optimal2
- zlib (v1.2.8): level 9
Data corpus used: LZHAM corpus, only files 1KB or larger, 22,827 total files
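For anyone who wants to try a similar single-threaded measurement on their own data, here's a tiny sketch using Python's built-in zlib at level 9. This is not my benchmarking tool (which is a statically linked native executable), and the function name and return fields are made up; it only shows the shape of a ratio vs. decompression throughput measurement.

```python
import time
import zlib


def benchmark_zlib(data, level=9, repeats=5):
    """Compress once, then time decompression several times.

    Reports the best (fastest) decompression run, single-threaded.
    """
    compressed = zlib.compress(data, level)
    best = float("inf")
    restored = b""
    for _ in range(repeats):
        start = time.perf_counter()
        restored = zlib.decompress(compressed)
        best = min(best, time.perf_counter() - start)
    if restored != data:
        raise RuntimeError("round trip failed")
    return {
        "compressed_size": len(compressed),
        "ratio": len(data) / len(compressed),
        # Guard against a zero reading from the timer on tiny inputs.
        "decompress_mb_per_sec": len(data) / max(best, 1e-9) / 1e6,
    }
```

Calling `benchmark_zlib(open(path, "rb").read())` per file and summing the sizes would give totals comparable in spirit to the ones below, though the absolute throughput numbers from Python won't match a native build.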
Benchmarking machine:
Compiler and optimization settings:
Visual Studio 2015 Community Edition, 14.0.25424.00 Update 3
Totals
Sorted by highest to lowest ratio:
brotli 1671777094
lzham 1676729104
kraken 1685750158
zstd 1686207733
bitknit 1707850562
mermaid 1834845751
zlib 1963751711
selkie 1989554820
lz4 2131656949
Ratio vs. Decompression Throughput - Overview
On the far left there's LZHAM (dark gray), which at this point is looking pretty slow. (For a time, it was the decompression speed leader of the high ratio codecs, being 2-3x faster than LZMA.) Moving roughly left to right, there's Brotli (brown), zlib (light blue), Zstd (dark green), BitKnit (dark blue), Kraken (red), then a cluster of very fast codecs (LZ4 - yellow, Selkie - purple, Mermaid - light green, and even a sprinkling of Kraken - red).
Notes:
- Kraken is just amazingly strong. It has a very high ratio with ground breaking decompression performance. There is nothing else like it in the open source world. Kraken's decompressor runs circles around the other high-ratio codecs (LZHAM, Brotli, Zstd) and is even faster than zlib!
- Mermaid and Selkie combine the best of both worlds, being as fast as or faster than LZ4 to decompress, but with compression ratios competitive with or better than zlib's!
Ratio vs. Decompression Throughput - High decompression throughput (LZ4, Mermaid, Selkie)
* LZ4 note: LZ4's decompression performance depends on whether or not the data was compressed with the HC or non-HC version of the compressor. I used the HC version for this post, which appears to output data which decompresses a bit faster. I'm guessing it's because there's less compressed data to process in HC streams.
Ratio vs. Decompression Throughput - High ratio codecs
Ratio vs. Compression Throughput - All codecs
For the first time on my blog, here's a ratio vs. compression throughput scatter graph.
- LZHAM's compressor in single threaded mode is very slow. (Compression throughput was never a priority of LZHAM.) Brotli's compressor is also a slowpoke.
- Interestingly, most of the other compressors cluster closely together in the 5-10 MB/sec region.
- zlib and lz4 are both very fast. lz4 isn't a surprise, but I'm a little surprised by how much zlib stands apart from the others.
- There's definitely room here for compression speed improvements in the other codecs.
Conclusion
The open source world should be inspired by the amazing progress RAD has made here. If you're working on a product that needs lossless compression, RAD's Oodle product offers a massive amount of value. There's nothing else out there like it. I can't stress enough how big of a deal RAD's new lossless codecs are. Just their existence clearly demonstrates that there is still room for large improvements in this field.
Thanks to RAD Game Tools for providing me with a drop of their Oodle Data Compression library for me to benchmark, and to Charles Bloom for providing feedback on the codec settings and benchmark approach.