One possible (probably minor) optimization to ETC1 encoding: determine the principal axis of the entire texture, rotate the texture's RGB pixels (by treating them as 3D vectors) so this axis is aligned with the grayscale axis, then compress the texture as usual. The pixel shader can undo the rotation using a handful of instructions.
ETC1 uses colorspace lines constrained to be parallel to the grayscale axis, which this optimization exploits.
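A rough sketch of the idea in Python (standard library only). It estimates the principal axis with power iteration and builds the rotation onto the grayscale axis with Rodrigues' formula. All function names here are hypothetical, and a real implementation would also need a scale/bias step, since rotated values can fall outside 0-255.

```python
import math

GRAY = (1 / math.sqrt(3),) * 3  # unit vector along the grayscale axis

def principal_axis(pixels, iterations=32):
    """Dominant eigenvector of the RGB covariance matrix, via power iteration."""
    n = len(pixels)
    mean = [sum(p[i] for p in pixels) / n for i in range(3)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pixels) / n
            for j in range(3)] for i in range(3)]
    v = list(GRAY)  # assumption: the start vector is not orthogonal to the axis
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        length = math.sqrt(sum(x * x for x in w))
        if length == 0.0:
            break  # flat image: keep the previous estimate
        v = [x / length for x in w]
    return v

def rotation_to_gray(axis):
    """3x3 rotation matrix taking the unit vector `axis` onto GRAY (Rodrigues)."""
    c = sum(a * g for a, g in zip(axis, GRAY))
    if c < 0.0:  # an axis has no sign; pick the direction closer to gray
        axis, c = [-a for a in axis], -c
    vx, vy, vz = (axis[1] * GRAY[2] - axis[2] * GRAY[1],
                  axis[2] * GRAY[0] - axis[0] * GRAY[2],
                  axis[0] * GRAY[1] - axis[1] * GRAY[0])
    k = [[0.0, -vz, vy], [vz, 0.0, -vx], [-vy, vx, 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    return [[(1.0 if i == j else 0.0) + k[i][j] + k2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def apply(matrix, p):
    return [sum(matrix[i][j] * p[j] for j in range(3)) for i in range(3)]
```

The matrix is a pure rotation, so its transpose is its inverse. That transpose is what the pixel shader would apply to undo the rotation after sampling.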
Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Wednesday, September 14, 2016
etcpak
etcpak is a very fast, but low quality ETC1 (and a little bit of ETC2) compressor:
https://bitbucket.org/wolfpld/etcpak/wiki/Home
It's the fastest open source ETC1 encoder that I'm aware of.
Notice the lack of any PSNR/MSE/SSIM statistics anywhere (that I can see). Also, the developer doesn't seem to get that the other tools/libraries he compares his stuff against were optimized for quality, not raw speed. In particular, rg_etc1 (and crunch's ETC1 support) was tuned to compete against the reference encoder along both the quality and perf. axes.
Anyhow, there are some interesting things to learn from etcpak:
- Best quality doesn't always matter. It obviously depends on your use case. If you have 10 gigs of textures to compress then iteration speed can be very important.
- The value spectrum spans from highest quality/slow encode (to ship final assets) to crap quality/fast as hell encode (favoring iteration speed).
- Visually, the ETC1/2 formats are nicely forgiving. Even a low quality ETC1 encoder produces decent enough looking output for many use cases.
Sunday, September 11, 2016
Idea for next texture compression experiment
Right now, I've got a GPU texture in a simple ETC1 subset that is easily converted to most other GPU formats:
Base color: 15-bits, 5:5:5 RGB
Intensity table index: 3-bits
Selectors: 2-bits/texel
Most importantly, this is a "single subset" encoding, using BC7 terminology. BC7 supports between 1-3 subsets per block. A subset is just a colorspace line represented by two R,G,B endpoint colors.
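To make the size of this base format concrete, here is a minimal sketch that packs one block into an integer: 15 bits of color, 3 bits of intensity table index, and 16 two-bit selectors, for 50 bits total. The bit order is an assumption made up for illustration. It is neither the ETC1 block layout nor any actual file format.

```python
def pack_base_block(base555, table_index, selectors):
    """Pack one single-subset block into a 50-bit integer (hypothetical layout)."""
    assert len(selectors) == 16
    r, g, b = base555
    value = (r << 10) | (g << 5) | b          # bits 0-14: 5:5:5 base color
    value |= table_index << 15                # bits 15-17: intensity table index
    for i, s in enumerate(selectors):
        value |= s << (18 + 2 * i)            # bits 18-49: 2 bits per texel
    return value

def unpack_base_block(value):
    """Inverse of pack_base_block."""
    base555 = ((value >> 10) & 31, (value >> 5) & 31, value & 31)
    table_index = (value >> 15) & 7
    selectors = [(value >> (18 + 2 * i)) & 3 for i in range(16)]
    return base555, table_index, selectors
```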
This format is easily converted to DXT1 using a table lookup. It's also the "base" of the universal GPU texture format I've been thinking about, because it's the data needed for DXT1 support. The next step is to experiment with attempting to refine this base data to better take advantage of the full ETC1 specification. So let's try adding two subsets to each block, with two partitions (again using BC7 terminology), top/bottom or left/right, which are supported by both ETC1 and BC7.
For example, we can code this base color, then delta code the 2 subset colors relative to this base. We'll also add a couple more intensity indices, which can be delta coded against the base index. Another bit can indicate which ETC1 block color encoding "mode" should be used (individual 4:4:4 4:4:4 or differential 5:5:5 3:3:3) to represent the subset colors in the output block.
In DXT1 mode, we can ignore this extra delta coded data and just convert the basic (single subset) base format. In ETC1/BC7/ASTC modes, we can use the extra information to support 2 subsets and 2 partitions.
Currently, the idea is to share the same selector indices between the single subset (DXT1) and two subset (BC7/ASTC/full ETC1) encodings. This will constrain how well this idea works, but I think it's worth trying out.
To add more quality to the 2 subset mode, we can delta code (maybe with some fancy per-pixel prediction) another array of selectors in some way. We can also add support for more partitions (derived from BC7's or ASTC's), too.
Saturday, September 10, 2016
Hierarchical clustering
One of the key algorithms in crunch is determining how to group together block endpoints into clusters. Crunch uses a bottom up clustering approach at the 8x8 pixel (or 2x2 DXTn block) "macroblock" level, then it switches to top down. The top down method is extremely sensitive to the vectors chosen to represent each block during the clusterization step. The algorithm crunch uses to compute representative vectors (used only during clusterization) was refined and tweaked over time. Badly chosen representative vectors cause the clustering step to produce crappy clusters (i.e. nasty artifacts).
Anyhow, an alternative approach would be entirely bottom up. I think this method could require less tweaking. Some reading:
https://en.wikipedia.org/wiki/Hierarchical_clustering
https://onlinecourses.science.psu.edu/stat505/node/143
Also Google "agglomerative hierarchical clustering". Here's a Youtube video describing it.
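For reference, a minimal sketch of agglomerative (bottom up) clustering with centroid linkage. This is the naive O(n^3) textbook version, not anything crunch does. The function names are made up, and a practical version would need a smarter nearest-pair search.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerative(vectors, target_clusters):
    """Merge the two closest clusters until only target_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]   # every vector starts alone
    cents = [list(v) for v in vectors]              # one centroid per cluster
    while len(clusters) > max(1, target_clusters):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sq_dist(cents[a], cents[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        na, nb = len(clusters[a]), len(clusters[b])
        # The merged centroid is the size-weighted mean of the two centroids.
        cents[a] = [(x * na + y * nb) / (na + nb)
                    for x, y in zip(cents[a], cents[b])]
        clusters[a] += clusters[b]
        del clusters[b]
        del cents[b]
    return clusters
```

For example, `agglomerative([(0,), (1,), (10,), (11,), (50,)], 3)` groups the vectors by index as `[[0, 1], [2, 3], [4]]`.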
Friday, September 9, 2016
Few more random thoughts on a "universal" GPU texture format
In my experiments, a simple but usable subset of ETC1 can be easily converted to DXT1, BC7, and ATC. And after studying the standard, it very much looks like the full ETC1 format can be converted into BC7 with very little loss. (And when I say "converted", I mean using very little CPU, just basically some table lookup operations over the endpoint and selector entries.)
ASTC seems to be (at first glance) around as powerful as BC7, so converting the full ETC1 format to ASTC with very little loss should be possible. (Unfortunately ASTC is so dense and complex that I don't have time to determine this for sure yet.)
So I'm pretty confident now that a universal format could be compatible with ASTC, BC7, DXT1, ETC1, and ATC. The only other major format that I can't fit into this scheme easily is my old nemesis, PVRTC.
Obviously this format won't look as good compared to a dedicated, single format encoder's output. So what? There are many valuable use cases that don't require super high quality levels. This scheme purposely trades off a drop in quality for gains in interchange.
Additionally, with a crunch-style encoding method, only the endpoint (and possibly the selector) codebook entries (of which there are usually only hundreds, possibly up to a few thousand in a single texture) would need to be converted to the target format. So the GPU format conversion step doesn't actually need to be insanely fast.
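A toy sketch of why the conversion cost scales with the codebook size and not the block count. Here `transcode_blocks` and `rgb555_to_rgb565` are hypothetical names, and the 5:5:5 to 5:6:5 conversion is just a stand-in for whatever per-entry work a real target format would need.

```python
def transcode_blocks(endpoint_codebook, block_endpoint_indices, convert_endpoint):
    """Convert each codebook entry once, then let every block reuse the result."""
    converted = [convert_endpoint(e) for e in endpoint_codebook]
    return [converted[i] for i in block_endpoint_indices]

def rgb555_to_rgb565(color):
    """Stand-in converter: widen green from 5 to 6 bits by bit replication."""
    r, g, b = color
    return (r, (g << 1) | (g >> 4), b)
```

With a codebook of a few hundred entries and tens of thousands of blocks, the converter runs a few hundred times and everything else is an index lookup.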
Another idea is to just unify ASTC and BC7, two very high quality formats. The drop in quality due to unification would be relatively much less significant with this combination. (But how valuable is this combo?)
Some memories
I remember a few years ago at one company, I was explaining and showing one of my early graphics API tracing/replaying demos (on a really cool 1st person game made by some company in Europe) to a couple "senior" engineers there. I described my plan and showed them the demo.
Both of them said it wasn't interesting, and implied I should stop now and not show what I was working on to the public.
Thanks to these two engineers, I knew for sure I had something valuable! And it turned out, this tool (and tools like it) was very useful and valuable to developers. I later showed this tool to the public and received amazingly positive feedback.
I had learned from many previous experiences that, at this particular company, resistance to new ideas was usually a sign. The harder they resisted, the more useful and interesting the technology probably was. The company had horribly stagnated, and the engineers there were, as a group, optimizing for yearly stack ranking slots (and their bonuses) and not for the actual needs of the company.
Wednesday, September 7, 2016
ETC1->DXT1 encoding table error visualization
Here are two visualizations of the overall DXT1 encoding error due to using this table, assuming each selector is used equally (which is not always true). This is the lookup table referred to in my previous post.
Each small 32x32 pixel tile in this image visualizes an R,G slice of the 3D lattice. There are 32 tiles for B (left to right), and there are 8 rows overall. The first row of tiles is for ETC intensity table 0, the second 1, etc.
First visualization, where the max error in each individual tile is scaled to white:
Second visualization, visualizing max overall encoding error relative to all tiles:
Hmm - the last row (representing ETC1 intensity table 7) is approximated the worst in DXT1.
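To illustrate how one entry of such a table could be computed, here is a simplified single-channel sketch. It is not the actual table, which covers the full R,G,B lattice. It brute-forces the pair of 5-bit DXT1 endpoints whose four interpolated values best match the four ETC1 values, weighting each selector equally. The intensity modifiers are the ETC1 values as I recall them from the spec, listed low to high. The integer division in the DXT1 interpolation is an approximation (hardware rounding varies), and DXT1's 6-bit green is ignored.

```python
ETC1_INTENSITY = [
    (-8, -2, 2, 8), (-17, -5, 5, 17), (-29, -9, 9, 29), (-42, -13, 13, 42),
    (-60, -18, 18, 60), (-80, -24, 24, 80), (-106, -33, 33, 106),
    (-183, -47, 47, 183),
]

def expand5(c):
    """Scale a 5-bit value to 8 bits by bit replication."""
    return (c << 3) | (c >> 2)

def etc1_values(base5, table):
    base = expand5(base5)
    return [min(255, max(0, base + m)) for m in ETC1_INTENSITY[table]]

def dxt1_values(lo5, hi5):
    lo, hi = expand5(lo5), expand5(hi5)
    return [lo, (2 * lo + hi) // 3, (lo + 2 * hi) // 3, hi]

def best_dxt1_endpoints(base5, table):
    """Return (squared error, lo5, hi5) for one channel of one table entry."""
    target = etc1_values(base5, table)
    best = None
    for lo5 in range(32):
        for hi5 in range(lo5, 32):
            cand = dxt1_values(lo5, hi5)
            err = sum((t - c) ** 2 for t, c in zip(target, cand))
            if best is None or err < best[0]:
                best = (err, lo5, hi5)
    return best
```

In this model, intensity table 7 spans such a wide range that the ETC1 values clamp at 0 and 255 for most base colors, which makes them harder to match with evenly spaced DXT1 colors.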
Monday, September 5, 2016
More thoughts on a universal GPU texture interchange format
Just some random thoughts:
I still think the idea of a universal GPU texture compression standard is fascinating and useful. Something that can be efficiently transcoded to 2 or more major vendor formats, without sacrificing too much along the quality or compression ratio axes. Developers could just encode to this standard interchange format and ship to a large range of devices without worrying about whether GPU Y supports arcane texture format Z. (This isn't my idea, it's from Won Chun at RAD.)
Imagine, for example, a format that can be efficiently transcoded to ASTC, with an alternate mode in the transcoder that outputs BC7 as a fallback. Interestingly, imagine if this GPU texture interchange format looked a bit better (and/or transcoded more quickly) when transcoded into one of the GPU formats versus the other. This situation seems very possible in some of the designs of a universal format I've been thinking about.
Now imagine, in a few years' time, a large set of universal GPU textures gets used and stored by developers, and distributed into the wild on the web. Graphics or rendering code samples even start getting distributed using this interchange format. A situation like this would apply pressure to the other GPU vendor with the inferior format to either dump their format or create a newer format more compatible with efficient transcoding.
To put it simply, a universal format could help fix this mess of GPU texture formats we have today.
Visualizing ETC1 texture compression
The ETC1 format consists of two block colors, two intensity table selectors, two mode bits ("diff" and "flip"), and 16 2-bit selectors. Here are some simple visualizations of what this encoded data looks like.
The original image (kodim14):
The ETC1 encoded image (using rg_etc1 in slow mode - modified to use perceptual colorspace metrics):
Here's the selector image (the 2-bit selectors have been scaled up to 0-255):
Subblock 0's intensity, scaled from 0-7 to 0-255:
Subblock 1's color, expanded to 8,8,8:
Subblock 1's intensity, scaled from 0-7 to 0-255:
The "flip" mode bits (white=flipped):
Saturday, September 3, 2016
ETC1 block color clusterization experiment
Intro
ETC1 is a well thought out, elegant little GPU format. In my experience a few years ago writing a production quality block ETC1 encoder, I found it to be far less fiddly than DXT1. Both use 64-bits to represent a 4x4 texel block, or 4-bits per texel.
I've been very curious how hard it would be to add ETC1/2 support to crunch. Also, many people have asked about ETC1 support, which is guaranteed to be available on OpenGL ES 2.0 compatible Android devices. crunch currently only supports the DXT1/5/N (3DC) texture formats. crunch's higher level classes are highly specific to the DXT formats, so adding a new format is not trivial.
One of the trickier (and key) problems in adding a new GPU format to crunch is figuring out how to group blocks (using some form of cluster analysis) so they can share the same endpoints. GPU formats like DXT1 and ETC1 are riddled with block artifacts, and bad groupings can greatly amplify them. crunch for DXT has an endpoint clusterization algorithm that was refined over many tens of thousands of real-life game textures and satellite photography. I've just begun experimenting with ETC1, and so far I'm very impressed with how well behaved and versatile it is.
Note this experiment was conducted in a new data compression codebase I've been building, which is much larger than crunch's.
ETC1 Texture Compression
Unlike DXT1, which only supports 3 or 4 unique block colors, the ETC1 format supports up to 8 unique block colors. It divides up the block into either two 4x2 or 2x4 pixel "subblocks". A single "flip" bit controls whether or not the subblocks are oriented horizontally or vertically. Each subblock has 4 colors, for 8 total.
The 4 subblock colors are created by taking the subblock's base color and adding to it 4 grayscale colors from an intensity table. Each subblock has 3 bits which selects which intensity table to apply. The intensity tables are constant and part of the spec.
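As a small illustration, here is how the 4 colors of a subblock with a 5:5:5 base color can be derived. The modifier values are the ETC1 intensity tables as I recall them from the spec, listed from darkest to brightest. The order in which the 2-bit selectors index them in an actual ETC1 block is different and is not modeled here.

```python
ETC1_INTENSITY = [
    (-8, -2, 2, 8), (-17, -5, 5, 17), (-29, -9, 9, 29), (-42, -13, 13, 42),
    (-60, -18, 18, 60), (-80, -24, 24, 80), (-106, -33, 33, 106),
    (-183, -47, 47, 183),
]

def expand5(c):
    """Scale a 5-bit component to 8 bits by bit replication."""
    return (c << 3) | (c >> 2)

def subblock_colors(base555, table_index):
    """Return the 4 RGB colors of a subblock, darkest first."""
    base = [expand5(c) for c in base555]
    # The same grayscale modifier is added to R, G and B, then clamped.
    return [tuple(min(255, max(0, ch + m)) for ch in base)
            for m in ETC1_INTENSITY[table_index]]
```

For example, `subblock_colors((16, 16, 16), 3)` gives four grays: 90, 119, 145 and 174.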
To encode the two block colors, ETC1 supports two modes: an "individual" mode, where each color is encoded to 4:4:4, or a "differential" mode, where the first color is 5:5:5 and the second color is a two's complement encoded 3:3:3 delta relative to the base color. The delta is applied before the base color is scaled to 8-bits.
From an encoding perspective, individual mode is most useful when the two subblocks have wildly different colors (favoring color diversity vs. encoding precision), and delta mode is most useful when encoding precision is more useful than diversity.
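A sketch of the two block color encodings. The mode choice here is a simple "does the delta fit in 3 bits" test, which is my simplification. A real encoder like rg_etc1 would evaluate the actual error of both modes. The function names are hypothetical.

```python
def quantize(c8, bits):
    """Round an 8-bit component to the nearest value with the given bit count."""
    levels = (1 << bits) - 1
    return (c8 * levels + 127) // 255

def expand4(c):
    return (c << 4) | c

def expand5(c):
    return (c << 3) | (c >> 2)

def encode_block_colors(color_a, color_b):
    """Pick differential mode when the 5-bit delta fits in [-4, 3]."""
    a5 = [quantize(c, 5) for c in color_a]
    b5 = [quantize(c, 5) for c in color_b]
    delta = [b - a for a, b in zip(a5, b5)]
    if all(-4 <= d <= 3 for d in delta):
        return ("differential", tuple(a5), tuple(delta))
    return ("individual",
            tuple(quantize(c, 4) for c in color_a),
            tuple(quantize(c, 4) for c in color_b))

def decode_block_colors(mode, first, second):
    """Return both block colors expanded to 8,8,8."""
    if mode == "differential":
        # The delta is added in 5-bit space, before scaling to 8 bits.
        second5 = [f + d for f, d in zip(first, second)]
        return (tuple(expand5(c) for c in first),
                tuple(expand5(c) for c in second5))
    return (tuple(expand4(c) for c in first),
            tuple(expand4(c) for c in second))
```

Two similar grays such as (128,128,128) and (140,140,140) fit in differential mode with 5-bit precision, while black and white fall back to individual mode.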
Each pixel is represented using 2-bit selectors, just like DXT1. Except in ETC1, the color selected depends on which subblock the pixel is within.
So that's ETC1 in a nutshell. In practice, from what I remember its quality is a little lower than DXT1, but not by much. Its artifacts look more pleasant to me than DXT1's (obviously subjective). Each ETC1 block is represented by 2 colorspace lines that are always parallel to the grayscale axis. By comparison, with DXT1, there's only a single line, but it can be in any direction, and perhaps that gives it a slight advantage.
ETC1 Endpoint Clusterization
The goal here is to figure out how to reduce the total number of unique endpoints (or block colors and intensity table indices) in an ETC1 encoded image without murdering the quality. This is just an early experiment, so let's try simplifying the ETC1 format itself to keep things simple. This experiment always uses differential block color mode, with the delta color set to (0,0,0). So each subblock is represented using the same 5:5:5 color, and the same intensity table. The flip bit is always false. Obviously, this is going to lower quality, but let's see what happens. Note this simplified format is still 100% compatible with existing ETC1 decoders. We're just limiting ourselves to a simpler subset.
Here's the original image (kodim18 - because I remember this image being a pain to handle well in crunch for DXT1):
Here's the image encoded using high quality ETC1 compression (using rg_etc1, slow mode, perceptual colorspace metrics):
Delta:
Error: Max: 56, Mean: 2.827, MSE: 16.106, RMSE: 4.013, PSNR: 36.061
So the ETC1 encoding that takes advantage of all ETC1 features is 36.061 dB.
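For reference, the RMSE and PSNR figures in the error lines in this post follow directly from the MSE, assuming the usual 8-bit peak value of 255:

```python
import math

def rmse_from_mse(mse):
    return math.sqrt(mse)

def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)

print(round(rmse_from_mse(16.106), 3), round(psnr_from_mse(16.106), 3))
```

An MSE of 16.106 works out to roughly 36.06 dB, matching the error line above.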
Here's the encoding using just diff mode, no flipping, with a (0,0,0) delta color:
Delta:
So we've lost 2.38 dB by limiting ourselves to this simpler subset of ETC1. The reduction in quality is obviously visible, but by no means fatal for the purposes of this quick experiment.
In this experiment, each ETC1 block only contains 4 unique colors (or a single colorspace line, with "low" and "high" endpoints and 2 intermediate colors). Here's a visualization of the "low" and "high" endpoints in this image:
Now let's clusterize these block color endpoints, using 6D tree structured VQ (vector quantization) to perform the clusterization. The output of this step consists of a series of clusters, and each cluster contains one or more block indices. The idea is, blocks with similar endpoint vectors will be placed into the same cluster. This is a similar process used by crunch for DXT1. It's much like generating a RGB color palette from an array of image colors, except we're dealing with 6D vectors instead of 3D color vectors, and instead of using the output palette directly all we really care about is how the input vectors are grouped.
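A much simplified sketch of top down, tree structured clusterization. It repeatedly splits the cluster with the largest squared error into two children using a few 2-means iterations. This is not crunch's or rg_etc1's code. The seeding strategy and function names are my own assumptions. Each input vector would be a block's 6D (low R,G,B, high R,G,B) endpoint vector.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def total_error(vectors):
    c = centroid(vectors)
    return sum(sq_dist(v, c) for v in vectors)

def split_two_means(indices, vectors, iterations=8):
    """Split one cluster (a list of vector indices) into two children."""
    members = [vectors[i] for i in indices]
    c = centroid(members)
    seed_a = max(members, key=lambda v: sq_dist(v, c))       # far from centroid
    seed_b = max(members, key=lambda v: sq_dist(v, seed_a))  # far from seed_a
    cents = [list(seed_a), list(seed_b)]
    left, right = [], []
    for _ in range(iterations):
        new_left = [i for i in indices
                    if sq_dist(vectors[i], cents[0]) <= sq_dist(vectors[i], cents[1])]
        in_left = set(new_left)
        new_right = [i for i in indices if i not in in_left]
        if not new_left or not new_right:
            break  # degenerate split: keep the last valid partition
        left, right = new_left, new_right
        cents = [centroid([vectors[i] for i in left]),
                 centroid([vectors[i] for i in right])]
    return left, right

def tsvq_clusterize(vectors, max_clusters):
    """Return a list of clusters, each a list of indices into vectors."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < max_clusters:
        errors = [total_error([vectors[i] for i in c]) for c in clusters]
        worst = max(range(len(clusters)), key=lambda k: errors[k])
        if errors[worst] == 0.0:
            break  # every cluster is already a single point
        left, right = split_two_means(clusters[worst], vectors)
        if not left or not right:
            break
        clusters[worst:worst + 1] = [left, right]
    return clusters
```

Only the grouping (which block indices land in which cluster) is used afterwards. The centroids themselves are thrown away.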
Here's a visualization of the cluster endpoint centroid vectors after generating 32 clusters:
Once we have the image organized into block clusters containing similar endpoints, we use an internal helper class within rg_etc1 to find the near-optimal 5:5:5 endpoint and intensity table to represent all the pixels within each cluster. We can now create an ETC1-compatible texture by processing each block cluster and selecting the optimal selectors to use for each pixel.
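The selector step is the easy part. A minimal sketch, using plain squared RGB error in place of the perceptual metric, with a made-up function name:

```python
def best_selectors(pixels, block_colors):
    """For each pixel, pick the index of the closest block color."""
    selectors = []
    total_error = 0
    for p in pixels:
        errors = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in block_colors]
        s = min(range(len(errors)), key=errors.__getitem__)
        selectors.append(s)
        total_error += errors[s]
    return selectors, total_error
```

Here `block_colors` would be the 4 colors derived from the cluster's shared 5:5:5 color and intensity table, and `pixels` the 16 texels of one block.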
Let's see what this texture looks like, and the PSNR, after limiting the number of unique endpoints.
ETC1 (subset) with 64 unique endpoints:
Error: Max: 110, Mean: 5.865, MSE: 70.233, RMSE: 8.380, PSNR: 29.665
ETC1 (subset) 256 unique endpoints:
Error: Max: 93, Mean: 4.624, MSE: 45.889, RMSE: 6.774, PSNR: 31.514
ETC1 (subset) 512 unique endpoints:
Error: Max: 87, Mean: 4.225, MSE: 38.411, RMSE: 6.198, PSNR: 32.286
ETC1 (subset) 1024 unique endpoints:
Error: Max: 87, Mean: 3.911, MSE: 32.967, RMSE: 5.742, PSNR: 32.950
ETC1 (subset) 4096 unique endpoints:
Error: Max: 87, Mean: 3.642, MSE: 28.037, RMSE: 5.295, PSNR: 33.654
Next Steps
This experiment shows one way to clusterize the endpoint optimization process in a limited subset of the ETC1 format. This first step must be mastered before crunch for ETC1 can be written.
The clusterization step outlined here isn't aware of flipping, or that each block can have 2 block colors, and we haven't even looked at the selectors yet. A production encoder will need to support more features of the ETC1 format. Note that crunch for DXT1 doesn't support 3 color blocks and works just fine, so it's possible we don't need to support every encoding feature.
Some next steps:
- Figure out how to best clusterize the full format. Expand the format subset to include two block colors, flipping, and both encodings. Is 6D clusterization good enough, or is 12D needed?
- Selector clusterization
- ETC1 specific refinement stages: refine endpoints based off the clusterized endpoints, then refine the clusterized endpoints based off the clusterized selectors, possibly repeat.
- crunch-style tiling ("macroblocking") will most likely be needed to get bitrate down to JPEG+real-time encoding competitive levels.
- ETC2 support
(Currently, I'm conducting these experiments in my spare time, in between VR and optimization contracts. If you're really interested in accelerating development of crunch for a specific GPU format please contact info@binomial.info.)
Monday, August 29, 2016
Good article: Why software patents are evil
I have been attacked (at a time in my life when the last thing I needed was more stress!) by a patent holder before, so hey I hate software patents:
http://www.infoworld.com/article/2619609/open-source-software/why-software-patents-are-evil.html
Friday, August 5, 2016
Brotli levels 0-10 vs. Oodle Kraken
For codec version info, compiler settings, etc. see this previous post.
This graph demonstrates that varying Brotli's compression level from 0 to 10 noticeably impacts its decompression throughput. (Level 11 is just too slow to complete the benchmark overnight.) As I expected, at none of these settings is it able to compete against Kraken.
Interestingly, it appears that at Brotli's lowest settings (0 and 1) it outputs compressed data that is extremely (and surprisingly) slow to decode. (I've highlighted these settings in yellow and green below.) I'm not sure if this is intentional or not, but with this kind of large slowdown I would avoid these Brotli settings (and use something like zlib or LZ4 instead if you need that much throughput).
Level Compressed Size
0 2144016081
1 2020173184
2 1963448673
3 1945877537
4 1905601392
5 1829657573
6 1803865722
7 1772564848
8 1756332118
9 1746959367
10 1671777094
Original 5374152762
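To turn the sizes above into compression ratios (original size divided by compressed size), a quick sketch:

```python
ORIGINAL_SIZE = 5374152762

BROTLI_SIZES = {
    0: 2144016081, 1: 2020173184, 2: 1963448673, 3: 1945877537,
    4: 1905601392, 5: 1829657573, 6: 1803865722, 7: 1772564848,
    8: 1756332118, 9: 1746959367, 10: 1671777094,
}

def ratio(compressed_size, original_size=ORIGINAL_SIZE):
    return original_size / compressed_size

for level, size in sorted(BROTLI_SIZES.items()):
    print(f"level {level:2d}: ratio {ratio(size):.3f}")
```

The ratio climbs from roughly 2.51 at level 0 to roughly 3.21 at level 10.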
Thursday, August 4, 2016
Few notes about the previous post
This rant is mostly directed at the commenters that claimed I hobbled the open source codecs (including my own!) by not selecting the "proper" settings:
Please look closely at the red dots. Those represent Kraken. Now, this is a log10/log2 graph (log10 on the throughput axis.) Kraken's decompressor is almost one order of magnitude faster than Brotli's. Specifically, it's around 5-8x faster, just from eyeing the graph. No amount of tweaking Brotli's settings is going to speed it up this much. Sorry everyone. I've benchmarked Brotli at settings 0-10 (11 is just too slow) overnight and I'll post them tomorrow, just to be sure.
There is only a single executable file. The codecs are statically linked into this executable. All open source codecs were compiled with Visual Studio 2015 with optimizations enabled. They all use the same exact compiler settings. I'll update the previous post tomorrow with the specific settings.
I'm not releasing my data corpus. Neither does Squeeze Chart. This is to prevent codec authors from tweaking their algorithms to perform well on a specific corpus while neglecting general purpose performance. It's just a large mix of data I found over time that was useful for developing and testing LZHAM. I didn't develop this corpus with any specific goals in mind, and it just happens to be useful as a compressor benchmark. (The reasoning goes: If it was good enough to tune LZHAM, it should be good enough for newer codecs.)
Wednesday, August 3, 2016
RAD's ground breaking lossless compression product benchmarked
Intro
Progress in the practical lossless compression field has been painfully stagnant in recent years. The state of the art is now rapidly changing, with several new open source codecs announced in recent times (such as Brotli and Zstd) offering high ratios and fast decompression. Recently, RAD Game Tools released several new codecs as part of its Oodle data compression product.
My first impression after benchmarking these new codecs was "what the hell, this can't be right", and after running the benchmarks again and double checking everything my attitude changed to "this is the new state of the art, and open source codecs have a lot of catching up to do".
This post uses the same benchmarking tool and data corpus that I used in this one.
Updated Aug. 5th: Changed compiler optimization settings from /O2 to /OX and disabled exceptions, re-ran benchmark and regenerated graphs, added codec versions and benchmarking machine info.
Codec Settings
All benchmarking was done under Win 10 x64, 64-bit executable, in a single process with no multithreading.
- lz4 (v1.74): level 8, LZ4_compressHC2_limitedOutput() for compression, LZ4_decompress_safe() for decompression (Note: LZ4_decompress_fast() is faster, but it's inherently unsafe. I personally would never use the dangerous fast() version in a project.)
- lzham (lzham_codec_devel on github): level "uber", dict size 2^26
- brotli (latest on github as of Aug. 1, 2016): level 10, dict size 2^24
- bitknit (standalone library provided by RAD): BitKnit_Level_VeryHigh
- Zstd (v0.8.0): ZSTD_MAX_CLEVEL
- Oodle library version: v2.3.0
- Kraken: OodleLZ_CompressionLevel_Optimal2
- Mermaid: OodleLZ_CompressionLevel_Optimal2
- Selkie: OodleLZ_CompressionLevel_Optimal2
- zlib (v1.2.8): level 9
Data corpus used: LZHAM corpus, only files 1KB or larger, 22,827 total files
Benchmarking machine:
Compiler and optimization settings:
Visual Studio 2015 Community Edition, 14.0.25424.00 Update 3
Totals
Sorted by highest to lowest ratio:
brotli 1671777094
lzham 1676729104
kraken 1685750158
zstd 1686207733
bitknit 1707850562
mermaid 1834845751
zlib 1963751711
selkie 1989554820
lz4 2131656949
Ratio vs. Decompression Throughput - Overview
On the far left there's LZHAM (dark gray), which at this point is looking pretty slow. (For a time, it was the decompression speed leader of the high ratio codecs, being 2-3x faster than LZMA.) Moving roughly left to right, there's Brotli (brown), zlib (light blue), Zstd (dark green), BitKnit (dark blue), Kraken (red), then a cluster of very fast codecs (LZ4 - yellow, Selkie - purple, Mermaid - light green, and even a sprinkling of Kraken - red).
Notes:
- Kraken is just amazingly strong. It has a very high ratio with groundbreaking decompression performance. There is nothing else like it in the open source world. Kraken's decompressor runs circles around the other high-ratio codecs (LZHAM, Brotli, Zstd) and is even faster than zlib!
- Mermaid and Selkie combine the best of both worlds, being as fast as or faster than LZ4 to decompress, but with compression ratios competitive with or better than zlib's!
Ratio vs. Decompression Throughput - High decompression throughput (LZ4, Mermaid, Selkie)
* LZ4 note: LZ4's decompression performance depends on whether the data was compressed with the HC or non-HC version of the compressor. I used the HC version for this post, which appears to output data that decompresses a bit faster. I'm guessing it's because there's less compressed data to process in HC streams.
Ratio vs. Decompression Throughput - High ratio codecs
Ratio vs. Compression Throughput - All codecs
For the first time on my blog, here's a ratio vs. compression throughput scatter graph.
- LZHAM's compressor in single threaded mode is very slow. (Compression throughput was never a priority of LZHAM.) Brotli's compressor is also a slowpoke.
- Interestingly, most of the other compressors cluster closely together in the 5-10 MB/sec region.
- zlib and lz4 are both very fast. lz4 isn't a surprise, but I'm a little surprised by how much zlib stands apart from the others.
- There's definitely room here for compression speed improvements in the other codecs.
Conclusion
The open source world should be inspired by the amazing progress RAD has made here. If you're working on a product that needs lossless compression, RAD's Oodle product offers a massive amount of value. There's nothing else out there like it. I can't stress enough how big a deal RAD's new lossless codecs are. Just their existence clearly demonstrates that there is still room for large improvements in this field.
Thanks to RAD Game Tools for providing me with a drop of their Oodle Data Compression library to benchmark, and to Charles Bloom for providing feedback on the codec settings and benchmark approach.
lz4hc vs. lz4 performance on the LZHAM test corpus
Both use LZ4_decompress_safe(). lz4hc uses LZ4_compressHC2_limitedOutput(), lz4 uses LZ4_compress_limitedOutput().
22,827 total files, all files >= 1KB.
total 5374152762
lz4hc 2199213331
lz4 2575990728
Sunday, July 31, 2016
New lossless compression benchmarks on the way
I've been benchmarking several new lossless codecs from RAD Game Tools: Kraken, Selkie, and Mermaid. (How does RAD think up these odd but cool sounding names?) Stay tuned!
Thursday, July 21, 2016
enet networking library
I switched over all the low-level networking in a VR app I've been working on to enet today. It's a UDP-based networking library that supports optional reliable and in-order packet delivery, packet fragmentation (so very large packets can be sent over UDP), and multiple channels.
The API was super easy to use, the code is written in C, and the thing just works. Compiling it was as easy as dropping the .c files into the project and hitting Build.
I love libraries like this.
Sunday, July 3, 2016
Welcome to "The Hunger Games"
Pretty much required reading if you're going to work (and stand out!) at a self-organizing company:
The Hunger Games
Sunday, June 5, 2016
What a company-wide "reorg" looks like in a flat, manager-less company
Working at a bunch of companies over the years has given me a lot of interesting perspective. I really enjoy trying to describe how processes in top-down companies can be done in non-hierarchical ("no boss" or self-organizing) companies. Let's try to describe, say to a hierarchical company employee, what a company-wide "reorg" could look like in a no-manager company.
I first heard the word "reorg" in relation to how Microsoft periodically reorganizes its corporate structure to "better align the company to new corporate-level goals and strategies". (That's a joke.) Issuing a company-wide reorg in a hierarchical company is very much an executive-level decision. It's a top-down directive that the company intentionally follows, like a military maneuver. It just happens, you know about it as an employee, and you must go with the flow.
But what does a deep reorg actually look like in a non-hierarchical, manager-less company? The CEO can't just come cruising in totally reorg'ing the place. (Remember, the CEO is not your boss in companies like this!) Such a traumatic "mass adjustment of resources" is just not in the culture. (Small-scale "horse trades" occur all the time in manager-less companies. I'm talking about a deep, planned reorganization that impacts a large chunk of the company.)
Well, here's one way you can reorg a manager-less company. This approach assumes the company actor(s) attempting to pull the reorg off have the power to form new teams and make internal/external hires.
First, you need to form a small team around some new product or technology. Do it just below the radar (internally). It needs to show promise and be a rising-star type project. You should work to get as much strategic press exposure about this new team's work as possible.
Next, you start internally recruiting and externally hiring for that new team. You optimize the external hiring process to streamline it, to accept some candidates as contractors (who you may eventually hire) and some as immediate full-time employees. For the internal recruits, you only hire those internal developers who are the most passionate about the new project's goals or its technology. Hiring on the new team must be done carefully, because it's ultimately part of a greater company-wide sorting and reorganization process.
If the new project becomes large enough, it creates a rift of sorts in the organization. The new team gets more power and size over time. An entire ecosystem of other friendly teams can form around the new team. The company self-organizes itself into a market of teams around the new project, and a block of other "deprecated teams" who may not be aligned with the reorg's goals.
These deprecated teams can be reduced in size by letting go of internal developers over time. Anyone the company doesn't want long-term can be quickly moved onto a deprecated team. To minimize shock to the deprecated team's product (which may need to remain live), the team can fall back to possibly cheaper external contractors as it internally shrinks. Ultimately, the product can be put on long-term life support with minimal internal cost.
Now, if you are a developer in a company like this, and you want to survive the reorg, you should be asking yourself right about now "am I on a deprecated team?". If you are, you'd better learn the company's new religion quickly or you may be pushed out. (Or, you need to visibly work on background projects that support post-reorg goals or needs.)
If you are a senior team member in this scenario, and you want your team to not become deprecated, you need to quickly figure out how to transition your product into the "new era" so it remains relevant.
Thursday, May 19, 2016
We Need to Collectively Renegotiate
I'm sitting here watching season 2 of Halt and Catch Fire. This season wipes the slate mostly clean and starts over at an early 80's garage-style software service startup in Texas. At first, I pushed back at the idea of a real-time online gaming service using early 80's Commodore-era computer, disk storage and modem technology. Then I realized, everything they are showing here was more or less technologically feasible, or at worst was at the very edge of that era's hardware/software technology.
While watching this I had another interesting realization. Lots of my previous posts are really my way of telling every full-time software engineer I can reach to basically "wake up".
Let's mentally model the current employment situation as a 2D simulation. See all those little dots? Those are the full-time software developers working at corporations. Let's hit fast forward. Wow, that's weird! All these super valuable programmers keep going to and from the same bland corporate company nodes to work every morning. Their working conditions sometimes really suck and they are generally underpaid. These corporations have even been known to illegally cooperate with each other (i.e. conspire) to keep compensation to a minimum.
We've been interacting with lots of clients, some very well known in their fields, and most paint a similar picture: Their view is that too many engineers are "locked up" inside these corporations. It's actually very hard to find good software developers. There is room in the system for more software consultants, little consulting companies with amazing programmers like Blue Shift.
So here's my idea:
Now let's try upping the communication, empathy, independent organization and trust levels across all these agents in the simulation and see what happens. A bunch of smaller companies pop up and start offering their services to a potentially huge array of clients. They can negotiate for the best pay and conditions possible in this changed economy.
To pull this off in the real world, what we need to do is start talking, trusting, and cooperating with each other much more, especially across teams and companies. We all have a common interest here that totally transcends pretty much any corporate NDA. Collectively, we as software engineers have way too much power and value in the system to be working as atomized individuals competing with each other for scraps.
We can leave these corporations to form our own consulting or product companies. This will force the market to reorganize itself. Do this and working conditions and compensation levels can be organically pressured upwards. We actually have the power to do this if we would just organize and communicate more effectively.
Personally I believe even just a small number of programmers doing this can have a surprising economic and perhaps even a cultural impact.
In practice, doing this isn't that hard. I've started three companies so far, in between working at various companies. The first one did very early deferred shading research for Microsoft, the second one created crunch, and the third (Binomial) is consulting oriented.
To start: While still employed, work on building a community of other engineers at various companies. Up your visibility by making sure your code and work is easily found online, attend every event you can, give presentations, teach and help people, and be as public as possible. Save up 6 months or whatever of finances, find some friends and make the leap.
And if you fail? No big deal, just sign up for another full-time gig for a while. One that likely pays more, because this collective renegotiation strategy results in higher average wages, and because by changing companies you might even get a raise for being more experienced now!
To find clients, tap into your network and offer your services. This will free up amazing teams to work across companies, instead of them being locked up inside a few corporate fortresses.
Another fallback strategy if your new company fails is to get acqui-hired by a larger company, skipping the ridiculous interview process many companies use. Just develop a cool piece of technology that you think one or more companies would be interested in.
(So, think I'm crazy? I have a bunch of detailed mental models here, built up over time by working at several large strategically placed software companies in several states. This can work. We just need to organize better and teach each other how to do it.)