Thursday, October 29, 2015

Compression is Compilation

This post's title was inspired by the title of Paul Khuong's blog post "Refactoring With LZ77: Compression Is Compilation (?)". I doubt this idea is new; it's just the most useful viewpoint (or mental model) I've found while working on LZHAM's successor. I've explained it to several developers and haven't been laughed out of the room yet.

I think the usual description of LZ coding is a very mechanical (and boring) one that dwells too much on details like match finding and entropy coding.

To create LZHAM I had to rethink the entire way I mentally modeled LZ77 codecs, moving from a compression-centric viewpoint to a compiler+VM-centric one. Before, while working on Deflate-compatible codecs optimized for specialized processors like the X360's, I focused intensely on lower-level things like Huffman coding, match finding, simplistic parsing algorithms (greedy vs. lazy matching), and CPU-specific decompressor optimizations.

My new viewpoint is that an LZ compressor is a compiler. The output data from an LZ compressor is an executable program consisting of very efficiently encoded instructions for a sort of virtual CPU. The decompressor is a type of virtual machine that decodes and executes the instructions encoded in the compressed data, and the VM's output is the decompressed data. The compiler/compressor's goal is to produce a program that, when executed, precisely reconstructs the original source data.
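
To make this concrete, here's a minimal sketch (mine, not any particular codec's) of a decompressor-as-VM with a hypothetical two-instruction set - "emit literal" and "copy match". The read_* methods stand in for whatever entropy decoding scheme the codec uses:

#include <cstdint>
#include <vector>

// Hypothetical instruction stream reader; in a real codec these reads go
// through an entropy decoder (raw bits, Huffman, arithmetic, ANS, etc.).
struct InstructionStream
{
   virtual ~InstructionStream() {}
   virtual bool at_end() = 0;
   virtual bool read_is_match() = 0;     // literal or match?
   virtual uint8_t read_literal() = 0;   // next literal byte
   virtual uint32_t read_length() = 0;   // match length
   virtual uint32_t read_distance() = 0; // match distance (bytes back)
};

// The "VM": executes the compressed program; its output is the original data.
std::vector<uint8_t> decompress(InstructionStream &in)
{
   std::vector<uint8_t> out;
   while (!in.at_end())
   {
      if (!in.read_is_match())
         out.push_back(in.read_literal());  // "emit byte" instruction
      else
      {
         uint32_t len = in.read_length();   // "copy" instruction: copy len
         size_t src = out.size() - in.read_distance(); // bytes from dist back,
         for (uint32_t i = 0; i < len; i++) // byte by byte, since the copy
            out.push_back(out[src + i]);    // may overlap itself
      }
   }
   return out;
}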

A modern LZ compressor like LZMA or LZHAM has a backend "instruction selection" optimizer: "Typically, instruction selection is implemented with a backwards dynamic programming algorithm which computes the "optimal" tiling for each point starting from the end of the program and based from there." If you understand this, then you'll easily understand LZ "optimal parsing", which is more or less the same thing. (Charles Bloom and Yann Collet have some good posts on optimal parsing, and there are several decent thread replies on the topic at encode.ru.)
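
Here's a minimal sketch of that backwards DP with a made-up constant-bit-cost model. Real optimal parsers estimate costs from the adaptive entropy coder's current statistics and the VM's state, which is one reason they're only "near-optimal":

#include <algorithm>
#include <cstdint>
#include <vector>

struct Match { uint32_t len; uint32_t dist; };

// Toy cost model in bits; a real parser queries the entropy coder and state.
static uint32_t literal_cost() { return 9; }
static uint32_t match_cost(const Match &m) { (void)m; return 24; }

// cost[i] = cheapest number of bits to encode input bytes [i, n).
// Walk backwards, "tiling" each position with either a literal or a match.
std::vector<uint32_t> optimal_parse(size_t n, const std::vector<std::vector<Match> > &matches_at)
{
   std::vector<uint32_t> cost(n + 1, UINT32_MAX);
   cost[n] = 0;
   for (size_t i = n; i-- > 0; )
   {
      cost[i] = cost[i + 1] + literal_cost(); // option 1: code a literal
      for (const Match &m : matches_at[i])    // option 2: code any match here
         if (i + m.len <= n)
            cost[i] = std::min(cost[i], cost[i + m.len] + match_cost(m));
   }
   return cost; // cost[0] is the best total; follow the argmins to emit code
}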

The endless LZ77-based variants have a wide variety of "CPU" instruction sets, registers, and global state, and they use a wide variety of entropy coding backends (raw bits, arithmetic, Huffman, or newfangled things like ANS/FSE) to encode instructions. ROLZ and LZP are examples of significant LZ77 variants that change up the instruction set and/or the VM's "registers".

Also, in modern LZ77 implementations there is still a large amount of redundancy in how the instructions are selected and coded; the vanilla LZ codespace is just too flexible. Charles Bloom is one of the trailblazers in publishing observations about the inefficiencies and opportunities in LZ compression. His LAM ("literal after match") exclusion concept is one example.
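
To illustrate the kind of redundancy he's talking about: the first literal coded after a match can't equal the byte the match "predicted" (the byte at the match distance), because otherwise the parser would simply have extended the match. A vanilla coder wastes codespace on that impossible value. Here's a sketch of how a decoder can exploit this; LZMA does something along these lines, conditioning the literal coder on the predicted byte:

#include <cstdint>
#include <vector>

// After a match with distance `dist` ends at output position `pos`, the byte
// a longer match would have produced is out[pos - dist]. The actual literal
// is guaranteed to differ (assuming the encoder never truncates a match it
// could have extended), so use the predicted byte as coding context, or
// exclude it from the literal alphabet entirely.
uint8_t lam_literal_context(const std::vector<uint8_t> &out, size_t pos, uint32_t dist)
{
   uint8_t predicted = out[pos - dist];
   // Simplest scheme: select one of 256 literal models by `predicted`;
   // LZMA instead codes the literal bitwise, mixing in predicted's bits.
   return predicted;
}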

Some of these variants are designed to trade off compression ratio (i.e. instruction encoding efficiency) for extremely high compression and decompression (i.e. VM execution) throughput, like LZ4 or Snappy. Others very carefully balance compression ratio vs. decompression throughput (with slow, usually offline, compressors), like LZMA, LZHAM, and brotli.

I think the only reason LZHAM is useful at all is that it combines an instruction set optimizer (near-optimal parsing using dynamic programming) and a carefully chosen instruction set and state "registers" (basically LZMA's) with a fast entropy coding scheme that blends Huffman and bitwise arithmetic coding, trading a tiny amount of ratio for higher symbol decoding throughput (i.e. VM execution speed) vs. LZMA. LZHAM shows one way to compete against a well-entrenched codec: choose one important axis to improve upon (say, decompression performance), but keep the same (or better) compression ratio as your competitor. If all other things are equal, the new codec is "better". (Side note: another way to compete against an existing codec is to clone its API and high-level behavior and semantics as closely as you can. This is what miniz and LZHAM do vs. zlib - these codecs even try to clone zlib's obscure flush semantics.)

At an even higher level, data compression (not just LZ77) can also be viewed as a type of machine learning followed by entropy coding. LZ77 is an intensely practical/efficient compromise that constrains the implementation in just the right way: byte-level prediction/matching, with a high average number of bytes output per decoded LZ instruction during decompression.

LZ codecs like zlib, LZ4, LZMA, brotli, LZHAM, Tornado, and RAD Game Tools' Oodle product all strike some of the most interesting, and currently best (or very nearly the best), compromises between ratio, various important measures of throughput, memory consumption, code size, optimization level, bug-fix support, platform support, library update frequency, etc.

Benchmarks like Squash and the Large Text Compression Benchmark show the performance of a variety of codecs (not just LZ-based ones), executed on different types of CPUs with different categories and sizes of source data.

Game developers need to pitch in and contribute samples of their shipping data to the public domain, so LZ codec authors can make implementation and optimization decisions tuned to their unique datasets. We need a "Game Data Compression Benchmark" site.

Tuesday, October 27, 2015

Quick EXE patching experiment: bsdiff/xdelta vs. lzham

This is just a couple of data points, so I'm not going to read too much into it. I'm just trying to get a feel for how far away LZHAM's current support for patch generation is from other well-known tools.

Unity46.exe -> Unity52.exe:

Unity46.exe: 30476872
Unity52.exe: 41259480
lzham_codec_devel (1): 8737015 
lzham_codec_devel (2): 8605360
bsdiff: 12320793
xdelta: 14233171

Unity510.exe -> Unity512.exe:

Unity510.exe: 52680480
Unity512.exe: 52740568
lzham_codec_devel (1): 5042088
lzham_codec_devel (2): 4736124
bsdiff: 3810343
xdelta: 11945751 

bsdiff: Windows version, based off bsdiff 4.1
xdelta: -9 (max compression)
lzham 1 settings: -m4 -x8 -d27
lzham 2 settings: -o -e -b -h4 -m4 -x16 -d27

Note about the lzham settings: configuration 2 is ridiculously slow (and, because of -x16, required rebuilding the compressor DLL), but it shows the highest ratios lzham is currently capable of.

Quick and dirty conclusions:

- xdelta isn't doing well here.

- When the old/new files resemble each other (Unity510.exe vs. Unity512.exe), bsdiff beats LZHAM by a significant margin.

- On the other hand, when the files aren't so similar (Unity46.exe vs. Unity52.exe), the tables are turned and LZHAM beats bsdiff. Hmm.

Monday, October 26, 2015

Patching related papers/links

I've been reading up on any available patching (delta compression) related literature:

Papers:

A Linear Time Constant Space Differencing Algorithm

Differential Compression of Executable Code (behind paywall)

Engineering a Differencing and Compression Data Format

Matching with Mismatches and Assorted Applications

Naive Differences of Executable Code

ZDelta: An efficient delta compression tool

File System Support for Delta Compression

Algorithms for Delta Compression and Remote File Synchronization

In-Place Reconstruction of Delta Compressed Files

Delta Algorithms: An Empirical Analysis

Lossless Compression Handbook - Chapter on LZ77 Delta Compressors

Compressing Differences of Executable Code 

Programs/links:

vcdiff (RFC 3284)

bsdiff

minibsdiff - bsdiff in library form

xdelta

jdiff

Microsoft's binary patching tools (mspatcha, mspatchc)

Microsoft Delta Compression API

LZX Delta Compression

edelta

Generic Diff Format Specification

Compression Related Projects - With links to various patching tools

zdelta - Based off zlib

rdelta

diffball delta compression framework

libxdiff

Google: Software Updates: Courgette

On Binary Delta Algorithms

File patching using data compression with flushing

This neat idea is from Alexander Suvorov at Unity. A similar approach is described in "Data Compression: The Complete Reference" (section 8.14.2, "File Differencing: VCDIFF"), but not in a way that's easily adapted to existing codecs. (Note: any errors below are mine, not Alexander's.)

This simple file patching (or differential compression) technique is compatible with any deterministic lossless codec that supports flushing during compression, like zlib and LZHAM. The codec doesn't need to explicitly support patch file generation, just flushing. (See the LZHAM_FULL_FLUSH etc. enums in lzham.h.)

Let's say you want to transmit a new version of a file (let's call it file B) to a remote client, and the remote client already has the first version (file A). Here's how you can create a patch file which the remote client can use to recover file B from file A.

Assumptions: Both the local and remote clients have the same codec and use the same codec settings, the compressor is deterministic, and flushing doesn't reset the compressor's internal state (it just compresses all input provided so far). For LZ, the dictionary size must be at least filesize(A)+filesize(B) bytes.

1. The remote client compresses file A and flushes the compressor, resulting in a blob of compressed data (let's call this "CA").

Of course, it's probable the remote client already has the compressed version of file A, in which case you skip this step.

2. The local client also compresses file A and flushes the compressor, resulting in a blob of compressed data which is the same as CA in step 1. Then it immediately compresses file B (using the same compressor instance), resulting in another blob of compressed data (CB). 

The flush doesn't reset the compressor's internal state; it just causes the compressor to output as much compressed data as possible, aligned to a byte boundary. So when it compresses file B, it'll take advantage of matches against file A.

Said another way: you compress a total of filesize(A)+filesize(B) bytes using one instance of the compressor, flushing between A and B. The compressor's internal state is not reset in between files (for LZ, this means the sliding dictionary is not reset during the flush), so the flush cleanly separates the compressed data needed to decompress file A from the compressed data needed to decompress file B.

3. Key step: Only transmit CB to the remote client (i.e. only the compressed data occurring immediately after the flush). The local client doesn't need to transmit CA because the remote client has already computed it in step 1.

4. Now the remote client combines compressed blobs CA and CB into a single compressed blob and decompresses the entire thing. It ignores the first filesize(A) decompressed bytes, resulting in file B. That's it.
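
Here's a hedged sketch of the local client's side (steps 2 and 3) using zlib, since its flush semantics are widely understood. Error handling is omitted. One zlib-specific detail: use Z_SYNC_FLUSH, which byte-aligns the output without resetting the compressor's state; zlib's Z_FULL_FLUSH resets the window, which would defeat the whole point here:

#include <cstring>
#include <vector>
#include <zlib.h>

// Feed `data` to the deflater, appending all output to `out`, ending with
// the given flush mode.
static void deflate_all(z_stream &zs, const std::vector<uint8_t> &data,
                        std::vector<uint8_t> &out, int flush)
{
   zs.next_in = (Bytef *)data.data();
   zs.avail_in = (uInt)data.size();
   uint8_t buf[16384];
   do
   {
      zs.next_out = buf;
      zs.avail_out = sizeof(buf);
      deflate(&zs, flush);
      out.insert(out.end(), buf, buf + (sizeof(buf) - zs.avail_out));
   } while ((zs.avail_out == 0) || (zs.avail_in != 0));
}

// Local client: compress A, sync flush, compress B. The patch is only the
// compressed data emitted after the flush point (CB).
std::vector<uint8_t> make_patch(const std::vector<uint8_t> &fileA,
                                const std::vector<uint8_t> &fileB)
{
   z_stream zs;
   std::memset(&zs, 0, sizeof(zs));
   deflateInit(&zs, Z_BEST_COMPRESSION);

   std::vector<uint8_t> compressed;
   deflate_all(zs, fileA, compressed, Z_SYNC_FLUSH); // CA: byte-aligned, state kept
   size_t patch_start = compressed.size();
   deflate_all(zs, fileB, compressed, Z_FINISH);     // CB: matches against A

   deflateEnd(&zs);
   return std::vector<uint8_t>(compressed.begin() + patch_start, compressed.end());
}

The remote client (steps 1 and 4) recomputes CA identically, concatenates CA and the received CB, inflates the whole stream, and discards the first filesize(A) bytes of output. (Caveat: deflate's 32KB window is far too small to find matches across large files, so this scheme only really pays off with large-dictionary codecs like LZHAM.)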

This idea has several downsides: the remote client must compress file A (costing CPU and/or storage), the dictionary size must be large enough to encompass both files (costing RAM), and the remote client must decompress both files. These requirements could be too expensive on mobile with large files.

Patch file generation using LZHAM is simpler because it explicitly supports seed dictionaries: you set the seed dictionary to file A and compress file B, resulting in a compressed patch file that recovers B from A. This can still require too much RAM for mobile, because you need a dictionary of at least filesize(A) bytes.

Wednesday, June 10, 2015

A simple way to reduce memory pressure on Unity iOS (and maybe Android) titles

Here's how to control the tradeoff between collection frequency and Mono memory headroom in Unity titles. I've tested this on iOS, but it should work on any platform that uses LibGC to manage the garbage collected heap. (It would be nice if Unity exposed this in a simple way...)

The garbage collector used by Unity (LibGC - the Boehm collector) is quite greedy (its OS memory footprint only increases over time), and it allocates more RAM than strictly needed to reduce the frequency of collections. By default LibGC uses a setting that grows the heap by approx. 33% more RAM than needed at any time:

GC_API GC_word GC_free_space_divisor;

/* We try to make sure that we allocate at     */
/* least N/GC_free_space_divisor bytes between */
/* collections, where N is the heap size plus  */
/* a rough estimate of the root set size.      */
/* Initially, GC_free_space_divisor = 3.       */
/* Increasing its value will use less space    */
/* but more collection time.  Decreasing it    */
/* will appreciably decrease collection time   */
/* at the expense of space.                    */
/* GC_free_space_divisor = 1 will effectively  */
/* disable collections.                        */                                            

The LibGC internal function GC_collect_or_expand() bumps up the # of blocks to get from the OS by a factor controlled by this global variable, so if we increase GC_free_space_divisor, LibGC should use less memory. Luckily, you don't need Unity source code to change this LibGC variable because it's global. (Note: at least on iOS. On other platforms with dynamic libraries, changing this sucker may be trickier - but doable.)

In our game (Dungeon Boss), without changing this global, our Mono reserved heap size is 74.5MB in the first level. Setting this global to 16 at the dawn of time (from our main script's Start() method) reduces this to 61.1MB, a savings of ~13.4MB of precious RAM. The collection frequency increases, because the Mono heap has less headroom to grow between collections, but the game is still quite playable.

Bumping up this divisor to 64 saves ~23MB of RAM, but collections occur several times per second (obviously unplayable).

To change this global in Unity iOS projects, you'll need to add a .C/CPP (or .M/.MM) file to your project with this helper (if the file is compiled as C++ or Objective-C++, declare both the extern and the helper with extern "C" so the symbols aren't name-mangled and DllImport can find them):

typedef unsigned long GC_word;
extern GC_word GC_free_space_divisor;

void bfutils_SetLibGCFreeSpaceDivisor(int divisor)
{
   GC_free_space_divisor = divisor;
}

While you're doing this, you can also add these two externs so you can directly monitor LibGC's memory usage (bypassing Unity's Profiler class, which I think only works in Development builds):

extern "C" size_t GC_get_heap_size();
extern "C" size_t GC_get_free_bytes();

Now in your .CS code somewhere you can call this method:

[DllImport("__Internal")] private static extern void bfutils_SetLibGCFreeSpaceDivisor(int divisor);

public static void ConfigureLibGC()
{
// The default divisor is 3. Anything higher saves memory, but causes more frequent collections.
bfutils_SetLibGCFreeSpaceDivisor(16);
}

Just remember, if you change this setting LibGC will collect more frequently. But if your code uses aggressive object pooling (like it should) this shouldn't be a big deal. 

Also note that this global only impacts the amount of heap headroom that LibGC tries to keep around. It won't help if you spike your Mono heap's size by temporarily allocating very large blocks. Large temp allocations should be avoided on the Mono heap, because once LibGC gets its tentacles on some OS memory it doesn't ever let it go.


Monday, June 1, 2015

iOS Memory Pressure Hell: Apple's 64-bit requirement+Unity's IL2CPP tech

Apple's 64-bit requirement, combined with the newness of Unity's promising IL2CPP technology (which you must use to meet that requirement), is costing us a serious amount of memory pain:


IL2CPP uses around 30-39MB more RAM for our 32-bit build. We've been struggling with memory on iOS for a while (as documented here), and I honestly don't know where to find 30MB more RAM right now, which means our title is now broken on 512MB devices (like the iPhone 4s and iPad Mini). Crap.

Some details:

Apple is requiring app updates to support 64-bit executables by June 1st. We've known this was coming, and have been preparing to ship a 64-bit build of our first product for several months. The only way to create a 64-bit iOS build is to use Unity's new IL2CPP technology, which compiles C# code to C++:

The Future of Scripting in Unity
An Introduction to IL2CPP Internals
Unity 4.6.2 iOS 64-bit Support

The additional memory utilized by IL2CPP builds vs. the previous Mono-AOT approach is ~10MB for our code with an empty scene, and >30MB with everything running.

We've also tested the latest patch, Unity 4.6.5p4 (which is supposed to use less memory on iOS) with the same results.

Thursday, May 21, 2015

Lessons learned while fixing memory leaks in our first Unity title

After several man-months of tool making, instrumenting and compiling our own custom Mono DLL, and crawling through 5k-30k node heap allocation graphs in gephi, our first Unity title (Dungeon Boss for iOS/Android) is no longer leaking significant amounts of Mono heap memory. Last year our uptime on 512MB iOS devices was 15-20 minutes; now it's hours.

It can be very easy to construct complex systems in C# which have degenerate (continually increasing) memory behavior over time, even though everything else seems fine. We label such systems as "leaking", although they don't actually leak memory in the C/C++ sense. All it takes is a single accidental strong reference somewhere to mistakenly "leak" huge amounts of objects over time. It can be a daunting task (even for the original authors) to discover how to fix a large system written in a garbage collected language so it doesn't leak.

Here's a brain dump of what we learned during the painful process of figuring this out:

- Monitor your Unity app's memory usage as early as possible during development. Mobile devices have some pretty harsh memory restrictions (see here to get an idea for iOS), so be sure to test on real devices early and often.

Be prepared for some serious pain if you only develop and test in the Unity editor for months on end.

- On iOS your app will receive low memory warnings when the system comes under memory pressure. (Note that iOS can be very chatty about issuing these warnings.) It can be helpful to log these warnings to your game's server (along with the amount of used client memory), to help do post-mortem analysis of why your app is randomly dying in the field.

- Our (unofficial) low-end iOS devices are the iPhone 4/4s and iPad Mini 1st gen (512MB devices). If our total allocated memory (according to XCode's Memory Monitor) exceeds approx. 200MB for sustained periods of time, the app will eventually be ruthlessly terminated by the kernel. Ideally, don't use more than 150-170MB on these devices.

- In Unity, the Mono (C#) heap is managed by the Boehm garbage collector. This is basically a C/C++-style heap with a garbage collector bolted on top of it.

Allocating memory is not cheap in this system. The version of Mono that Unity uses is pretty dated, so if you've been using the Microsoft .NET runtime for C# development then consider yourself horribly spoiled.

Treat the C# heap like a very precious resource, and study what C/C++ programmers do to avoid constantly allocating/freeing blocks, such as using custom object pools. Avoid hitting the heap by preferring C# structs over classes, avoiding boxing, using StringBuilder when messing with strings, etc.

- In complex systems written in C# the careful use of weak references (or containers of weak references) can be extremely useful to avoid creating endless chains of strong object references. We had to switch several systems from strong to weak references in key places to make them stable, and discovering which systems to change can be very tricky.

- Determine up front the exact lifetime of your objects, and exactly when objects should no longer be referenced in your system. Don't just assume the garbage collector will automagically take care of things for you.

- The Boehm collector's OS memory footprint only stabilizes or increases over time, as far as we can tell. This means you should be very careful about allocating large temporary buffers or objects on the Mono heap. Doing so could unnecessarily bump up Mono's memory footprint, which will decrease the amount of memory "headroom" your app has on iOS. Basically, once Mono grabs OS memory it greedily holds onto it until the end of time, and this memory can't be used for other things such as textures, the Unity C/C++ heap, etc.

- Be very careful about using Unity's WWW class to download large archives or bundles, because this class may store the downloaded data in the Mono heap. This was actually a serious problem for us: we download compressed Unity asset bundles during the user's first game session, and this class was increasing our app's Mono memory footprint by 30-40MB. That seriously reduced our app's memory headroom during the user's first session (which, in a free-to-play game, is pretty critical to get right).

- The Boehm collector grows its OS memory allocation so it has enough internal heap headroom to avoid collecting too frequently. You must take this headroom into account when budgeting your C# memory; i.e. if your budget calls for 25MB of C# memory, then the actual amount of memory consumed at the OS level will be significantly larger (approximately 40-50MB in our experience).

- It's possible to force the Boehm collector used by Unity to never allocate more than a set amount of OS memory (see here) by calling GC_set_max_heap_size() very early during app initialization. Note that if you do this and your C# heap leaks your app will eventually just abort once the heap is full.

It may be possible to call this API over time to carefully bump up your app's Mono heap size as needed, but we haven't tried this yet.
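
Here's what that might look like, following the same pattern as the bfutils_ helper in the June 10th post above. The helper name and the 48MB figure are made up for illustration; GC_set_max_heap_size() itself is a standard Boehm GC entry point:

// In a .cpp file added to the Xcode project (drop the extern "C" in plain C):
typedef unsigned long GC_word;
extern "C" void GC_set_max_heap_size(GC_word n);

extern "C" void bfutils_CapMonoHeapSize(void)
{
   // Hypothetical hard cap of 48MB on the entire Mono/LibGC heap. Once the
   // heap is full, allocations that can't be satisfied will abort the app,
   // so budget well above your measured steady-state heap size.
   GC_set_max_heap_size(48 * 1024 * 1024);
}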

- If your app leaks, and you can't figure out how to fix all the leaks, then an alternative solution that may be temporarily acceptable is to relentlessly free up as much memory as possible by optimizing assets, switching from PVRTC 4bpp to 2bpp, lowering sound and music bitrates, etc. This will give your app the memory headroom it needs to run for a reasonable period of time before the OS kills it.

If the user can play 20 levels per hour, and you leak 1MB per level, then you'll need to find 20MB of memory somewhere to run for one hour, etc. It can be far simpler to optimize some textures than to track down memory leaks in large C# codebases.

- Design your code to avoid creating tons of temporary objects that trigger frequent collections. One of our menu dialogs was accidentally triggering a collection every 2-4 frames on iOS, which was crushing our performance.

- We used the Daikon Forge UI library. This library has several very serious memory leaks. We'll try to submit these fixes back to the author, but I think the product is now more or less dead (so email me if you would like the fixes).

- Add some debug statistics to your game, along with the usual FPS display, and make sure this stuff works on your target devices:

Current total OS memory allocated (see here for iOS; a sketch using the Mach task_info API follows this list)

Total Mono heap used and reserved (You can retrieve this from Unity's Profiler class. Note this class returns all 0's in non-dev builds.)

Total number of GC's so far, number of frames since the last GC, or the average # of frames and seconds between GC's (you can infer when a GC occurs by monitoring the Mono heap's used size every Update() - when it decreases since the last Update(), you can assume a GC has occurred sometime recently)
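
Here's the Mach sketch promised above. The bfutils_ helper name is made up; expose it to C# with DllImport("__Internal"), like the divisor helper in the June 10th post:

#include <mach/mach.h>

// Returns the app's current resident (physical) memory usage in bytes, or 0
// on failure. This is roughly comparable to what XCode's Memory Monitor shows.
extern "C" unsigned long long bfutils_GetResidentMemoryBytes(void)
{
   mach_task_basic_info info;
   mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;
   if (task_info(mach_task_self(), MACH_TASK_BASIC_INFO,
                 (task_info_t)&info, &count) != KERN_SUCCESS)
      return 0;
   return info.resident_size;
}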

- From a developer's perspective the iOS memory API and tool situation is a ridiculous mess:
http://gamesfromwithin.com/i-want-my-memory-apple
http://liam.flookes.com/wp/2012/05/03/finding-ios-memory/

While monitoring our app's memory consumption on iOS, we had to gather and reconcile statistics from XCode's Memory Monitor, from Instruments, from the Mach kernel APIs, and from Unity. It can be very difficult to make sense of all this crap.

At the end of the day, we trusted Unity's statistics the most because we understood exactly how these statistics were being computed.

- Instrument your game's key classes to track the # of live objects present in the system at any one time, and display this information somewhere easily visible to developers when running on device. Increment a global static counter in your object's constructor, and decrement it in your C# destructor (finalizer) - this method is automatically called when your object's memory is actually reclaimed by the GC.

- On iOS, don't be shy about using PVRTC 2bpp textures. They look surprisingly good vs. 4bpp, and this format can save you a large amount of memory. We wound up using 2bpp on all textures except for effects and UI sprites.

- The built-in Unity memory profiler works pretty well on iOS over USB. It's not that useful for tracking down gnarly C# scripting leaks, but it can be invaluable for tracking down asset problems.

- Here's our app's current memory usage on iOS from late last week. Most of this data is from Unity's built-in iOS memory profiler.

- Remember that leaks in C# code can propagate downstream and cause apparent leaks on the Unity C/C++ heap or asset leaks.

- It can be helpful to mentally model the mono heap as a complex directed graph, where the nodes are individual allocations and the edges are strong references. Anything directly or indirectly referenced from a static root (either global, or on the stack, etc.) won't be collected. In a large system with many leaks, try not to waste time fixing references to leaf nodes in this graph. Attack the problem as high up near the roots as you possibly can.

On the other hand, if you are under a large amount of time pressure to get a fix in right now, it can be easier to just fix the worst leaks (in terms of # of bytes leaked per level or whatever) by selectively null'ing out key references to leafier parts of the graph you know shouldn't be growing between levels. We wrote custom tools to help us determine the worst offenders to spend time on, sorted by which function the allocation occurred in. Fixing these leaks can buy you enough time to properly fix the problem.