Sunday, January 18, 2015

LZHAM v1.0 vs. LZMA relative decompression rate on 262 Unity asset bundle files

Going to be honest here, graphing this stuff in a way that makes sense and is useful is tricky and time consuming. I really like the way Rad does it with their lossless data compression library, Oodle. I'm going to try making graphs like theirs.

Anyhow, here's a quick graph showing the relative speedup of LZHAM vs. LZMA on 262 uncompressed Unity asset bundle files (160MB total) from a single game. The bundle data consists of a mix of MP3 audio, Unity meshes, anims, PVRTC4 texture data, and lots of small misc. serialized Unity asset files.

The bundles are on the X axis (sorted by decompression speedup, from slowest to fastest), and the relative speedup is on the Y axis (higher is better for LZHAM). The blue line is the relative speedup (or slowdown on the first 5 files, each between 54-16612 bytes - these files are either very small or compress at pretty high ratios). A blue line value of 1.0 means the same decompression rate as LZMA.

The red line represents each bundle's compression ratio relative to LZMA's. Above 1.0 means LZMA was better (smaller compressed file), and below 1.0 means LZHAM was better. LZHAM tracks LZMA's ratio pretty well, and equals or beats it in many cases. (I did put LZHAM's compressor into its slowest/best parsing mode to generate this data.)

I make no claims this is the best way to visualize a codec's perf relative to another, I'm just experimenting and trying to gain insight into LZHAM's actual performance relative to LZMA.

This was on a Core i7, x86 build, LZHAM options: -h12 -x (fastest/least frequent Hufftable updating, up to 4 solutions per node parsing).


Saturday, January 17, 2015

Good lossless codec API design

Lossless codec design and implementation seems to be somewhat of a black art. I've seen many potentially good lossless codecs come out with almost useless interfaces (or none at all). Here are some attributes I've seen of good codecs:

- If you want others to use your codec in their apps, don't just provide a single command line executable with an awkward as hell command line interface. Support static libraries and SO's/DLL's, otherwise it's not useful to a large number of potential customers no matter how cool your codec is.

- Minimum number of source files, preferably all-C. If you use C++, don't rely on a ton of 3rd party crap like boost, etc. It just needs to compile out of the box.

Related: Programmers are generally a lazy bunch and hate mucking around with build systems, make files, etc. Make it easy for others to build your stuff, or just copy & paste into your project. Programmers will gladly sacrifice some things (such as raw perf, features, format compatibility, etc. - see stb_image) if it's fast and trivial to plop your code into their project. Don't rely on a ton of weird macros that must be configured by a custom build system.

- Even if the codec is C++, provide the interface in pure C so the codec can be trivially interfaced to other languages.

- You must support Linux, because some customers only run Linux and they will refuse to use Wine to run your codec (even if it works fine).

- Must provide heap alloc callbacks, so the caller can redirect all allocations to their own system.

- Support a "compile as ANSI C" mode, so it's easy to get your codec minimally working on new platforms. The user can fill in the platform specific stuff (atomics, threading, etc.) later, if needed.

Related: If you use threads, support basic pthreads and don't use funky stuff like pthread spinlocks (because platforms like OSX don't support them). Basic pthreads is portable across many platforms (even Win32 with a library like pthreads-win32, but just natively support Win32 too because it's trivial).

- Don't assume you can go allocate a single huge 256MB+ block on the heap. On mobile platforms this isn't a hot idea. Allocate smaller blocks, or ideally just 1 block and manage the heap yourself, or don't use heaps.

- Streaming support, to minimize memory consumption on small devices. Very important in the mobile world.

- Expose a brutally simple API for memory to memory compression.

- Support a zlib-compatible API. It's a standard, everybody knows it, and it just works. If you support this API, it becomes almost trivial to plop your codec into other existing programs. This allows you to also leverage the existing body of zlib docs/knowledge.

- Support in-place memory to memory decompression, if you can, for use in very memory constrained environments.

- Single threaded performance is still important: Codecs which depend on using tons of cores (for either comp or decomp) to be practical aren't useful on many mobile devices.

- In many practical use cases, the user doesn't give a hoot about compression performance at all. They are compressing once and distributing the resulting compressed data many times, and only decompressing in their app. So expose optional parameters to allow the user to tune your codec's internal models to their data, like LZMA does. Don't worry about the extra time needed to compress, we have the cloud and 40+ core boxes.

- Provide a "reinit()" API for your codec, so the user can reuse all those expensive heap allocations you've made on the first init on subsequent blocks.

- Deal gracefully with already compressed, or incompressible data. Don't expand it, except by a tiny amount, and don't slow to a crawl. Related: don't fall over on very compressible data, or data containing repeated strings, etc.

- Communicate the intended use cases and assumptions up front:
Is it a super fast but low ratio codec that massively trades off ratio for speed?
Is it a symmetrical codec, i.e. is compression throughput roughly equal to decompression?
Is it an asymmetric codec, where (typically) compression time is longer than decompression time?
Is the codec useful on tiny or small blocks, or is it intended to be used on large solid blocks of data?
Does your codec require a zillion cores or massive amounts of RAM to be practical at all?

- Test and tune your codec on mobile and console devices. You'll be surprised at the dismally low performance available vs. even mid-range x86 devices. These are the platforms that benefit greatly from data compression systems, so by ignoring this you're locking out a lot of potential customers of your codec. The world is not just x86.

On some CPUs, stuff like int divides, variable length int shifts, and L2 cache misses are surprisingly expensive. On some platforms, load-hit-stores can crush the performance of seemingly decent looking code.

- Beware relying on floating point math in a lossless codec. Different compilers can do different things when optimizing FP math expressions, possibly resulting in compressed outputs which are compiler dependent.

- Test your codec to death on a wide variety of data, then test it again. Random failures are the kiss of death. If your codec is designed for game data then download a bunch of games on Steam, unpack the game data (using typically user provided unpack/modding tools) then add the data to your test corpus.

- Carefully communicate up front about forwards/backwards compatibility.

- Make sure your codec can be built using Emscripten for JavaScript compatibility. Or just provide a native JavaScript decoder.

- Make sure your compressor supports a 100% deterministic mode, so with the same source data and compressor settings you always get the exact same compressed data every time. This allows integrating your codec into build systems that intelligently check for file modifications.

- "Fuzz" test your codec by randomly flipping bits of compressed data, inserting/removing bits, etc. and make sure your decompressor will never crash or overwrite memory. If your decompressor can crash, make sure you document this so the caller can check the integrity of the compressed data before decompression. Consider providing two decompressors, one that is optimized for raw performance (but can crash), and another hardened decompressor that can't possibly crash on corrupted inputs.

Related: Try to design your bitstream format so the decompressor can detect corruption as quickly as possible. I've seen codecs fall into the "let's output 4GB+ of all-0's" mode on trivially corrupted inputs.

- Your decompressor shouldn't try to read beyond the end of valid input streams, even by a single byte. (In other words, when your decompressor in streaming mode says it needs more bytes to make forward progress, it better mean it.) Otherwise 100% zlib compatibility is out, and trying to read beyond the end makes it harder to use your decompressor on non-seekable input streams. (This can be a tricky requirement to implement in some decoder designs, which is why it's here.)

- Gracefully handle error conditions, especially out of memory conditions. Return a real error code, don't just terminate. Have enough error status codes so developers can easily debug why something went wrong in the field.

- Don't just focus on textual data. The Large Text Compression Benchmark is cool and all, but many customers don't have much text (if any) to distribute or store.

Off topic, but related: The pure Context Mixing based codecs are technically interesting, but every time I try them on real-life game data they are just insanely slow, use massive amounts of RAM, and aren't competitive at all ratio-wise against good old LZ based codecs like LZMA. I'm not claiming CM algorithms can't be improved, but I think the CM devs focus way too much on text (or bog standard formats like JPEG) and not enough on handling arbitrary "wild" binary data.

- Allow the user to optionally disable adler32/crc32/etc. checks during decompression, so they can do it themselves (or not). Computing checksums can be surprisingly expensive.


Also, think about your codec's strengths and weaknesses, and how it will be used in practice. It's doubtful that one codec will be good for all real-world use cases. Some example use cases I've seen from the video game world:

- If a game is displaying a static loading screen, the codec probably has access to almost the entire machine's CPU(s) and possibly a good chunk of temporary memory. The decompressor must be able to keep up with the data provider's (DVD/Blu-ray/network) rate, otherwise it'll be the bottleneck. As long as the codec's consumption rate is greater than or equal to the provider's data rate, it can use up a ton of CPU (because it won't be the pipeline's bottleneck). A high ratio, heavy CPU, potentially threaded codec is excellent in this case.

- If a game is streaming assets in the background during gameplay, the codec probably doesn't have a lot of CPU available. The decompressor should be optimized for low memory consumption, high performance, low CPU cache overhead, etc. It's fine if the ratio is lower than the best achievable, because streaming systems are tolerant of high latencies.

- Network packet compression: Typically requires a symmetrical, low startup overhead codec that can do a good enough job on small to tiny binary packets. Codecs that support static data models tuned specifically to a particular game's data packets are especially useful here.

- Content updates: Codecs which support delta compression (patching) are especially useful here. New assets generally resemble old ones.


Finally, in many games I've worked on or seen, the vast majority of distributed data falls into a few big buckets: Audio, textures, meshes, animations, executable, compiled shaders, video. The rest of the game's data (scripts, protodata, misc serialized objects) forms a long tail (lots of tiny files and a small percent of the total). It can pay off to support optimizations for these specific data types.

More LZHAM v1.0 progress

I'm currently seeing overall decompression speedups around 1.8x - 3.8x faster vs. previous LZHAM releases on Unity asset bundle files. On relatively incompressible files (like MP3's), it's around 2-2.3x faster, and 30-40% faster on enwik9. This is on a Core i7, I'll have statistics on iOS devices early next week.

I'm removing some experimental stuff in LZHAM that adds little to no real value:

- Got rid of all the Polar coding stuff and some misc. leftover params (like the cacheline modeling stuff)

- No more literal prediction, or delta literal predictions, for slightly faster decoding, lower memory usage, and faster initialization of the decompressor.

- Reduced is_match contexts from 768 to 12. The loss in ratio was a fraction of a percent (if any), but the decompressor can be initialized more quickly and the inner loop is slightly simplified because the prev/prev_prev decoded chars don't need to be tracked any more.

- Just 2 main Huffman tables for literals/delta literals now, instead of 128 tables (!) like the previous releases. The tiny improvement in ratio (if any on many files) just didn't justify all the extra complexity. The decompressor's performance is now more stable (i.e. not so dependent on the data being compressed) and I don't need to worry about optimizing the initialization of a zillion Huff tables during the decoder's init.

I'm adding several optional, but extremely useful comp/decomp params:

// Controls tradeoff between ratio and decompression throughput. 0=default, or [1,LZHAM_MAX_TABLE_UPDATE_RATE], higher=faster but lower ratio.
lzham_uint32 m_table_update_rate;

m_table_update_rate is a higher level/simpler way of controlling these 2 optional params:

// Advanced settings - set to 0 if you don't care.
// def=64, typical range 12-128, controls the max interval between table updates, higher=longer interval between updates (faster decode/lower ratio)
lzham_uint32 m_table_max_update_interval;

// def=16, 8 or higher, scaled by 8, controls the slowing of the table update frequency, higher=more rapid slowing (faster decode/lower ratio), 8=no slowing at all.
lzham_uint32 m_table_update_interval_slow_rate;

These parameters allow the user to tune the scheduling of the Huffman table updates. The out of the box defaults now cause much less frequent table updating than previous releases. The overall ratio change from the slowest (more frequent) to fastest setting is around 1%. The speed difference during decompression from the slowest to fastest setting is around 2-3x.

Next up: going to generate some CSV files to make some nice graphs, then the iOS and (eventually) Android ports.

Friday, January 16, 2015

LZHAM v1.0 progress

Ported to OSX, and exposed several new compression/decompression parameters to allow the user to configure some of the codec's inner workings: literal/delta_literal bitmasks (or number of literal/delta literal bits - less versatile but simpler), the Huff table max interval between updates, and the rate at which the Huff table update interval slows between updates. These settings are absolutely critical to the decompressor's performance, memory, and CPU cache utilization.

The very early results are promising: 25-30% faster decoding (Core i7) and much less memory usage (still determining) by just tuning the settings (less tables/slower updating), with relatively little impact on compression ratio. The ratio reduction is only a fraction of 1% on the few files I've tested. (Disclaimer: I've only just got this working. These results do make sense -- it takes a bunch of CPU to update the Huff decode tables.)

Also, by reducing the # of Huff tables the decompressor shouldn't bog down nearly so much on mostly incompressible data. The user can currently select between 1-64 tables (separately for literals and delta literals, for up to 128 total tables). The codec supports prediction orders between 0-2, with 2 programmable predictor bitmasks for literals/delta_literals. (I'm not really sure exposing separate masks for literals vs. delta literals is useful, but after my experience optimizing LZMA's options with Unity asset data I'm now leaning toward just exposing all sorts of stuff and letting the caller figure it out.)

I'm also going to expose dictionary position related bitmasks to feed into the various predictions, just like LZMA, because they are valuable on real-life game data.

Annoyingly, when I lower the compression ratio decompression can get dramatically faster. I believe this has to do with a different mix of the decode loop exercised by the lower ratio bitstream, but I'm not really sure yet (and I don't remember if I figured out why 3+ years ago). I'll be writing a guide on how to tune the various settings to speed up LZHAM's decompressor.

On the downside, the user has more knobs to turn to make max use of the codec.


Wednesday, January 14, 2015

More LZHAM notes

It's been a while since I've made any major changes to LZHAM (except for minor cmake related stuff). This was a codec I wrote over a few nights and weekends while I was also working my day job. I eventually had to let active dev on LZHAM go to sleep because I got "sidetracked" shipping Portal 2. The codec's alpha has already been successfully deployed in several products, such as Planetside 2 and Titanfall, which isn't bad for a few nights of R&D and implementation work.

I covered what I was thinking of doing with LZHAM in this blog post. I have more interest in improving it again. For the types of products I'm now working on, what matters a lot is the title's retention rate, from first starting the product to the customer actually getting into real gameplay. Slow downloads or updates, loading screens, etc. equals lost users. Lost users=lower monetization. We actually measure the retention rate of every aspect of this in the field. So things like background downloading, streaming, proper organization of asset data into Unity asset bundles, and of course good data compression matter massively to us.

Anyhow, some ideas for LZHAM decompression startup and throughput improvements which I can do pretty quickly:

- After much testing on our game data, I now realize I underestimated how useful the various LZMA settings are. Right now LZHAM always uses the upper 3 MSB's of the prev. two literals for literal/delta literal contexts. Allow the user to control all of this: which prev. literal(s) (if any), say up to 8 bytes back, and which bits from those literals, separately for each type of prediction (literals/delta literals).

- In my quest to get LZHAM's ratio up to be similar to LZMA's I made several tradeoffs which can greatly impact decompression perf, especially on incompressible data. Right now the codec must always init and manage 64*2 Huffman tables. Allow the user to reduce or even increase the # of tables.

- LZHAM was designed for "solid" compression, where you give the codec dozens to hundreds of MB's containing many assets, and you don't restart/reinit the codec in between assets. It's like a slow to start drag racer. So it can suck on small files.

I'm not honestly 100% sure what to do about this yet that won't kill decompression perf. The way LZHAM updates Huffman tables seems like an albatross here. Amortized over many MB's it typically works fine, but on small files they can't be updated (adapted) quickly enough. Less tables are probably good here.

I could just integrate something like miniz into the codec, and try using it on each internal compressor block and using whatever is better. But that seems horrible.

- The Huffman table update frequency needs to be better tuned. If I can't think of anything smarter, allow the user to control the update schedule.

Note if you are very serious about fast, high ratio compression and decompression, Rad's Oodle product is very good. Given what I know about it, it's the best (fastest, highest compression, and most scalable/portable) production class lossless codec I know of.

Friday, January 9, 2015

Improved Unity asset bundle file compression

Download Times Matter


Sean Cooper, Ryan Inselmann and I have been building a custom lossless archiver designed specifically for Unity asset bundle files. The archiver itself uses several well-known techniques in the hardcore archiving/game repacking world. This post is mostly about how we've begun to tune the LZMA settings used by the archiver to be more effective on Unity asset bundle data. We'll cover the actual archiver in a later post.

LZMA has several knobs you can turn to potentially improve compression:
http://stackoverflow.com/questions/3057171/lzma-compression-settings-details

The LZHAM codec (my faster to decode alternative to LZMA) isn't easily tunable in the same way as LZMA yet, which is a flaw. I totally regret this because tuning these options works:

Total asset bundle asset data (iOS): 197,514,230
Unity's built-in compression: 91,692,327
Our archiver, un-tuned LZMA: 76,049,530
Our archiver, tuned LZMA (48 option trials): 75,231,764

Our archiver, tuned LZMA (225 option trials): 74,154,817

LZ codecs with untunable models/settings are much less interesting to me now. (Yet one more thing to work on in LZHAM.)

Here are the optimal settings we've found for each of our Unity asset classes on iOS. So for example, textures are compressed using different LZMA options (3, 3, 3) vs. animation clips (1, 2, 2).

Best LZMA settings after trying all 225 options. (In case of ties the compressor just selects the lowest lc,lp,pb settings - it's a triply nested for() loop around calling LZMA, and the first best result wins.)

Class (id): lc lp pb

GameObject (1): 8 0 0
Light (108): 2 2 2
Animation (111): 8 2 2
MonoScript (115): 2 0 0
LineRenderer (120): 0 0 0
SphereCollider (135): 0 0 0
SkinnedMeshRenderer (137): 8 4 2
AssetBundle (142): 0 2 2
WindZone (182): 0 0 0
ParticleSystem (198): 0 2 3
ParticleSystemRenderer (199): 8 3 3
Camera (20): 0 2 2
Material (21): 3 0 1
SpriteRenderer (212): 0 0 0
MeshRenderer (23): 8 4 2
Texture2D (28): 8 2 3
MeshFilter (33): 8 4 1
Transform (4): 6 2 2
Mesh (43): 0 0 1
MeshCollider (64): 1 4 1
BoxCollider (65): 7 4 2
AnimationClip (74): 0 2 2
AudioSource (82): 0 0 0
AudioClip (83): 8 0 0
Avatar (90): 2 0 2
AnimatorController (91): 2 0 2
? (95): 7 4 1
TrailRenderer (96): 0 0 0

Tuesday, January 6, 2015

Utils/tools to help transition to OSX from Windows

I've developed software under Kubuntu and Windows for years now. I found adapting to Kubuntu from a Windows world amazingly easy (ignoring the random Linux driver installation headaches). After a few days of mucking around with keyboard shortcuts and relearning Bash, I was all set.

These days, the center of gravity in the mobile development world is OSX due to iOS's market share, so it's time to bite the bullet and dive into the Apple world.

I'll admit, transitioning to OSX has been mind numbingly painful at times. (Apple, why do you persist in using such a wonky keyboard layout? Where's my ctrl+left??! No alt+tab?? wtf? ARGH.) I mentioned this to Matt Pritchard, a long time OSX/iOS developer, and he passed along a list of OSX utilities and tools that can help make the transition easier for long-time Windows developers.