Monday, November 2, 2015

A few observations about Unity

I've only been at Unity Technologies for a month, and obviously I have much to learn. Here are a few things I've picked up so far:

Developers with high programmer empathy and low egos tend to be successful here.

Interesting company attributes: distributed, open-source-style development model; team oriented; highly cooperative. Automated testing, good processes, build tools, etc. are very highly valued. "Elegant code" is a serious concept at Unity.

Hiring: Flexible. Unity has local offices almost everywhere, giving it great freedom in hiring.

Saturday, October 31, 2015

15 Reasons why developers won't use your awesome codec

Getting other developers to use your code in their products is surprisingly difficult.  A lot of this applies to open source development in general:

1. "Nobody's ever been fired for choosing IBM."
The codec is perceived as "too new" (i.e. less than 5-10 years old), and the developer is simply afraid to adopt it.

2. Inertia: The developer already has a favorite codec they believe works for every scenario, and it's already hooked up, so they don't want to invest in switching to something else.

3. Politics: The developer irrationally refuses to even look at your codec because you work for the same company as they do.

4. Perceived lack of support: If the codec fails in the field, who's gonna help us debug it?

(I believe this is one reason why RAD's compression products are perceived to offer enough value to be worth purchasing.)

5. Linus hates C++: The codec is written in C++, so it's obviously crap.

6. Bad packaging: It's not wrapped up into a trivial to compile and use library, with a dead simple C-style API.

7. Too exotic: The developer just doesn't understand it.

8. Untested on mobile: It's been designed and tested on a desktop CPU, with unknown mobile performance (if it compiles/works at all).

9. Memory usage: It's perceived to use too much RAM.

10. Executable size: The codec's executable size is perceived to be too large.

11. Lacks some feature(s): The codec doesn't support streaming, or the decompressor doesn't support in-place decompression.

12. Performance: The thing is #1 on the compression ratio charts, but it's just stupendously slow.

13. Robustness: Your codec failed us once and so we no longer trust it.

14. Licensing issues: GPL, LGPL, etc. is the kiss of death.

15. Patent FUD: A patent troll has mentioned that your codec may infringe one of their patents, so the world is now afraid to use it. Whether your codec actually infringes doesn't matter.

Thursday, October 29, 2015

Grab bag list of LZHAM improvements

Putting this somewhere so I don't forget it:

Patching specific:

- Allow the decompressor to refer to a static dictionary anywhere in memory (i.e. no memcpy() needed into the decompressor's dictionary buffer)

- bsdiff looks so simple. Why does it work so well when patching very similar files?

- For patching, investigate allowing the compressor to emit "delta rep" matches, i.e. rep matches with optional plus/minus distance deltas.

- Use Matt Mahoney's FV tool on patch file scenarios.

- Add a visualization mode to LZHAM's compressor, like FV.

General improvements:

- Symbol price accuracy: the parsing jobs use approximate symbol statistics that are locked at the beginning of each block. Examine how inaccurate these statistics really are.

- Try to SIMD the Huffman table update routines.

- The codec is too focused on the totally general case, which is streaming. Many useful scenarios do not involve streaming at all.

- Add a "small file" optimization mode to LZHAM's compressor, which can be used as a hint to the compressor that it's OK to try encoding the first block in multiple ways.

- Add a decompressor variant that doesn't support streaming at all, to reduce the # of decompressor basic blocks (idea from John Brooks).

- Add a compile time option that reduces the size of the decompressor as much as possible, even if it sacrifices perf.

- The big Huffman tables (like for literals, delta literals, match len/dist categories) are all initialized to symbol frequencies of 1's, which is wasteful for things like text files. Implement some way for the compressor to have control over this, like escape codes to jam small regions of the symbol table frequencies to 1's, or perhaps configuration bits at the start of the stream.

- LZHAM's Huffman table decoding fastbits (symbol decoding acceleration) table is too large on very small streams (an observation due to Charles Bloom). The decoder should start with small tables and grow them over time.

- From John Brooks: Combine Deflate's idea of storing compressed Huffman table codelengths in the data stream with LZHAM's current approach of rebuilding the tables over time. At the start of streams, use compressed codelengths, then switch to dynamic.

- Add a configuration bit (global or per block?) to completely disable rep matches, which I believe will help a small amount on text files. Have the compressor try this optimization out.

- Steal the idea of global configuration settings from LZMA that tweak some of its prediction models, so the user can call LZHAM multiple times with different settings and choose the smallest results.

- There are many compromise decisions in LZHAM. For example, the decompressor's state machine transition table can't be varied, the bitwise arithmetic coder's adaptation rate can't be varied, and the Huffman table update interval can only be set by the user, not optimized by the compressor. Allowing the compressor to optimize along these axes could result in gains.

- Further profile and improve LZHAM's small block perf (reduce startup cost, increase throughput near start of streams).
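The Deflate-style codelength idea in the list above relies on the fact that a decoder can rebuild a canonical Huffman table purely from transmitted code lengths. A minimal sketch of that construction (toy code following RFC 1951's algorithm, not LZHAM's actual tables):

```python
# Build canonical Huffman codes from code lengths, RFC 1951 style.
# A stream could start from compressed code lengths like this, then
# switch to adaptively rebuilt tables once enough symbols are seen.

def canonical_codes(lengths):
    """Return {symbol: (code, length)} for nonzero code lengths."""
    max_len = max(lengths)
    bl_count = [0] * (max_len + 1)          # how many codes of each length
    for l in lengths:
        if l:
            bl_count[l] += 1
    next_code = [0] * (max_len + 1)         # smallest code for each length
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for sym, l in enumerate(lengths):       # assign in symbol order
        if l:
            codes[sym] = (next_code[l], l)
            next_code[l] += 1
    return codes

# The worked example from RFC 1951 section 3.2.2:
codes = canonical_codes([3, 3, 3, 3, 3, 2, 4, 4])
assert codes[5] == (0, 2)     # symbol F -> code 00
assert codes[0] == (2, 3)     # symbol A -> code 010
assert codes[7] == (15, 4)    # symbol H -> code 1111
```

Real decoders would layer a "fastbits" acceleration table on top of this, which is exactly the piece the grow-over-time idea above targets.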

Platform specific:

- Get Emscripten compilation of the decompressor working, for JavaScript support

- Deeply examine and optimize the generated assembly of the decompressor on ARM

Traumatic:

- Dump arithmetic and Huffman coding entirely and switch to something like rANS. I think I'll need to do this to ultimately compete against Brotli.

Very long term pie in the sky stuff:

I have a frighteningly complex ROLZ branch of LZHAM alpha somewhere that works. Re-evaluate it.

Compression is Compilation

This post's title was inspired by the title of Paul Khuong's blog post "Refactoring With LZ77: Compression Is Compilation (?)". I doubt this idea is new; it's just the best viewpoint (or mental model) I've found for working on LZHAM's successor. I've explained it to several developers and haven't been laughed out of the room yet.

The usual description of LZ coding is a very mechanical (and boring) one that dwells too much on details like match finding and entropy coding.

To create LZHAM I had to rethink the entire way I thought about (or mentally modeled) LZ77 codecs, basically from a compression centric viewpoint to a compiler+VM centric viewpoint. Before, while working on Deflate compatible codecs optimized for specialized processors like X360's, I focused intensely on lower level things like Huffman coding, match finding, simplistic parsing algorithms (greedy vs. lazy matching), and CPU specific decompressor optimizations.

My new viewpoint is that a LZ compressor is a compiler. The output data from a LZ compressor is an executable program consisting of very efficiently encoded instructions for a sort of virtual CPU. The decompressor is a type of virtual machine that decodes and executes the instructions encoded in the compressed data in order to precisely decode the original source data. The VM's output is the decompressed data. The compiler/compressor's goal is to produce a program that when executed results in the source data.
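The compiler+VM view above can be made concrete with a toy two-instruction "LZ virtual machine". This is a sketch with a made-up instruction set (LIT and COPY), not LZHAM's actual format:

```python
# A toy LZ "VM": the compressed stream is a program of instructions,
# and the decompressor executes them to reproduce the source data.

def lz_vm(program):
    """Execute a list of LZ 'instructions' and return the decoded bytes."""
    out = bytearray()
    for ins in program:
        if ins[0] == "LIT":            # LIT <byte>: emit one literal byte
            out.append(ins[1])
        else:                          # COPY <dist> <len>: copy from history
            _, dist, length = ins
            for _ in range(length):    # byte-at-a-time so overlap works
                out.append(out[-dist])
    return bytes(out)

# Program: two literals, then a self-overlapping copy (dist < len).
program = [("LIT", ord("a")), ("LIT", ord("b")), ("COPY", 2, 6)]
assert lz_vm(program) == b"abababab"
```

Real codecs differ mainly in how richly this instruction set is extended (rep matches, state-dependent contexts) and how cheaply each instruction is entropy coded.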

A modern LZ compressor like LZMA or LZHAM has a backend "Instruction selection" optimizer: "Typically, instruction selection is implemented with a backwards dynamic programming algorithm which computes the "optimal" tiling for each point starting from the end of the program and based from there." If you understand this then you will easily understand LZ "Optimal Parsing", which is more or less the same thing. (Charles Bloom and Yann Collet have some good posts on optimal parsing. There are also several decent thread replies on optimal parsing at encode.ru.)
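The dynamic-programming view of optimal parsing can be sketched in a few lines. This toy uses made-up flat bit costs and a naive match finder; real optimizers price symbols with adaptive statistics and handle rep matches, but the shortest-path structure is the same:

```python
# Toy "optimal parse": DP over positions, choosing at each position between
# a literal or any available match prefix, minimizing total coded bits.

LITERAL_BITS = 9    # assumed flat cost of coding one literal
MATCH_BITS = 24     # assumed flat cost of coding one (length, dist) match
MIN_MATCH = 3

def find_matches(data, pos):
    """Naive O(n^2) match finder: all (length, distance) pairs at pos."""
    out = []
    for start in range(max(0, pos - 1024), pos):
        length = 0
        while (pos + length < len(data)
               and data[start + length] == data[pos + length]):
            length += 1
        if length >= MIN_MATCH:
            out.append((length, pos - start))
    return out

def optimal_parse(data):
    n = len(data)
    cost = [float("inf")] * (n + 1)   # cost[i] = cheapest bits for data[:i]
    step = [None] * (n + 1)           # back-pointers to recover the parse
    cost[0] = 0
    for i in range(n):
        if cost[i] + LITERAL_BITS < cost[i + 1]:
            cost[i + 1] = cost[i] + LITERAL_BITS
            step[i + 1] = ("lit", 1)
        for length, dist in find_matches(data, i):
            for l in range(MIN_MATCH, length + 1):  # try every match prefix
                if cost[i] + MATCH_BITS < cost[i + l]:
                    cost[i + l] = cost[i] + MATCH_BITS
                    step[i + l] = ("match", l, dist)
    parse, i = [], n
    while i > 0:                      # walk back-pointers
        s = step[i]
        parse.append(s)
        i -= s[1]
    return parse[::-1], cost[n]

parse, bits = optimal_parse(b"abcabcabcabc")
assert parse == [("lit", 1), ("lit", 1), ("lit", 1), ("match", 9, 3)]
assert bits == 3 * 9 + 24
```

Swap the flat costs for the coder's current (approximate) symbol prices and this becomes the backwards-DP tiling described in the quote above.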

The endless LZ77-based variants have a wide variety of "CPU" instruction sets, registers, and global state. They use a wide variety of entropy coding backends (raw bits, arithmetic, Huffman, or newfangled things like ANS/FSE) to encode instructions. Examples of significant LZ77 variants that change up the instruction set and/or VM "registers" are ROLZ and LZP.

Also, in modern LZ77 implementations there is still a large amount of redundancy in how the instructions are selected and coded. The vanilla LZ codespace is just too flexible. Charles Bloom is one of the trailblazers in publishing observations about the inefficiencies and opportunities in LZ compression. His LAM ("Literals After Mismatches") exclusion concept is one example.

Some of these varieties are designed to tradeoff compression ratio (i.e. instruction encoding efficiency) for extremely high compression and decompression (i.e. VM execution) throughput, like LZ4 or Snappy. And some varieties very carefully focus on balancing compression ratio vs. decompression throughput (with slow, usually offline, compressors) like LZMA, LZHAM, and brotli.

I think LZHAM is viable at all because it has an instruction set optimizer (near-optimal parsing using dynamic programming), a carefully chosen instruction set and state "registers" (basically LZMA's), and a fast entropy coding scheme that blends Huffman and bitwise arithmetic coding in a way that trades a tiny amount of ratio for higher symbol decoding throughput (i.e. VM execution speed) vs. LZMA.

LZHAM shows one way to compete against another, well entrenched codec: choose one important axis to improve upon (say decompression performance), but keep the same (or better) compression ratio as your competitor. If all other things are equal, the new codec is "better". (Side note: Another way to compete against a codec is to clone its API and high-level behavior and semantics as closely as you can. This is what miniz and LZHAM do vs. zlib. These codecs even try to clone zlib's obscure flush semantics.)

At an even higher level, data compression (not just LZ77) can also be viewed as a type of machine learning followed by entropy coding. LZ77 is an intensely practical/efficient compromise that constrains the implementation in just the right way: byte level prediction/matching, with a high average number of bytes output per decoded LZ instruction during decompression.

LZ codecs like zlib, LZ4, LZMA, brotli, LZHAM, Tornado, and RAD Game Tools' Oodle product all strike some of the most interesting, and currently best (or very nearly best), compromises between ratio, various important measures of throughput, memory consumption, code size, optimization level, bug support, platform support, library update frequency, etc.

Benchmarks like Squash and the Large Text Compression Benchmark show the performance of a variety of codecs (not just LZ based ones), executed on different types of CPU's with different categories and sizes of source data.

Game developers need to pitch in and contribute samples of their shipping data to the public domain, so LZ codec authors can make implementation and optimization decisions tuned to their unique datasets. We need a "Game Data Compression Benchmark" site.

Tuesday, October 27, 2015

Quick EXE patching experiment: bsdiff/xdelta vs. lzham

These are just a couple of data points, so I'm not going to read much into them. I'm just trying to get a feel for how far away LZHAM's current support for patch generation is from other well known tools.

Unity46.exe -> Unity52.exe:

Unity46.exe: 30476872
Unity52.exe: 41259480
lzham_codec_devel (1): 8737015 
lzham_codec_devel (2): 8605360
bsdiff: 12320793
xdelta: 14233171

Unity510.exe -> Unity512.exe:

Unity510.exe: 52680480
Unity512.exe: 52740568
lzham_codec_devel (1): 5042088
lzham_codec_devel (2): 4736124
bsdiff: 3810343
xdelta: 11945751 

bsdiff: Windows version, based off bsdiff 4.1
xdelta: -9 (max compression)
lzham 1 settings: -m4 -x8 -d27
lzham 2 settings: -o -e -b -h4 -m4 -x16 -d27

Note about lzham settings: 2 is ridiculously slow (and because of -x16 required rebuilding the compressor DLL), but it shows the highest ratios lzham is capable of at the moment.

Quick and dirty conclusions:
xdelta isn't doing well here.

When the old/new files resemble each other (Unity510.exe vs. Unity512.exe), bsdiff beats LZHAM by a significant margin.

On the other hand, when the files aren't so similar (Unity46.exe vs. Unity52.exe), the tables are turned and LZHAM beats bsdiff. Hmm.

Monday, October 26, 2015

Patching related papers/links

I've been reading up on any available patching (delta compression) related literature:

Papers:

A Linear Time Constant Space Differencing Algorithm

Differential Compression of Executable Code (behind paywall)

Engineering a Differencing and Compression Data Format

Matching with Mismatches and Assorted Applications

Naive Differences of Executable Code

ZDelta: An efficient delta compression tool

File System Support for Delta Compression

Algorithms for Delta Compression and Remote File Synchronization

In-Place Reconstruction of Delta Compressed Files

Delta Algorithms: An Empirical Analysis

Lossless Compression Handbook - Chapter on LZ77 Delta Compressors

Compressing Differences of Executable Code 

Programs/links:

vcdiff (RFC 3284)

bsdiff

minibsdiff - bsdiff in library form

xdelta

jdiff

Microsoft's binary patching tools (mspatcha, mspatchc)

Microsoft Delta Compression API

LZX Delta Compression

edelta

Generic Diff Format Specification

Compression Related Projects - With links to various patching tools

zdelta - Based off zlib

rdelta

diffball delta compression framework

libxdiff

Google: Software Updates: Courgette

On Binary Delta Algorithms

File patching using data compression with flushing

This neat idea is from Alexander Suvorov at Unity. There's a similar approach described in "Data Compression: The Complete Reference", section 8.14.2 File Differencing: VCDIFF, but it's not described in a way that is easily adapted to existing codecs. (Note any errors below are due to me, not Alexander.)

This simple file patching (or differential compression) technique is compatible with any deterministic lossless codec that supports flushing during compression, like zlib and LZHAM. It doesn't require the codec to explicitly support patch file generation, just flushing during compression. (See the LZHAM_FULL_FLUSH etc. enums in lzham.h.)

Let's say you want to transmit a new version of a file (let's call it file B) to a remote client, and the remote client already has the first version (file A). Here's how you can create a patch file which the remote client can use to recover file B from file A.

Assumptions: Both the local and remote clients have the same codec, use the same codec settings, the compressor is deterministic, and flushing doesn't reset the compressor's internal state (it just compresses all input provided so far). For LZ, the dictionary size is at least filesize(A)+filesize(B) bytes.

1. The remote client compresses file A and flushes the compressor, resulting in a blob of compressed data (let's call this "CA").

Of course, it's probable the remote client already has the compressed version of file A, in which case you skip this step.

2. The local client also compresses file A and flushes the compressor, resulting in a blob of compressed data which is the same as CA in step 1. Then it immediately compresses file B (using the same compressor instance), resulting in another blob of compressed data (CB). 

The flush doesn't reset the compressor's internal state, it just causes it to output as much compressed data as possible aligned to a byte boundary, so when it compresses file B it'll take advantage of matches against file A.

Said another way: you compress a total of filesize(A)+filesize(B) bytes using one instance of the compressor (the compressor's internal state is not reset between files; for LZ this means the sliding dictionary is not reset during the flush), but flush between A and B. The flush cleanly separates the compressed data needed to decompress file A from the compressed data needed to decompress file B.

3. Key step: Only transmit CB to the remote client (i.e. only the compressed data occurring immediately after the flush). The local client doesn't need to transmit CA because the remote client has already computed it in step 1.

4. Now the remote client combines compressed blobs CA and CB into a single compressed blob and decompresses the entire thing. It ignores the first filesize(A) decompressed bytes, resulting in file B. That's it.
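The four steps above can be demonstrated with zlib from the Python standard library. One caveat: with zlib the right flush is Z_SYNC_FLUSH (which byte-aligns the output without resetting the compressor's state), not Z_FULL_FLUSH (which resets the dictionary). The file contents here are made up, and both files must fit inside deflate's 32 KB window for the cross-file matches to be found:

```python
import os
import zlib

A = os.urandom(15000)                               # "file A": incompressible old version
B = A[:7000] + b"<<< patched bytes >>>" + A[7000:]  # "file B": A plus a small insertion

# Step 1 (remote client): compress A and flush to a byte boundary -> blob CA.
remote = zlib.compressobj(9)
CA = remote.compress(A) + remote.flush(zlib.Z_SYNC_FLUSH)

# Step 2 (local client): deterministically reproduce the same CA, then keep
# compressing B with the same compressor instance -> blob CB.
local = zlib.compressobj(9)
assert local.compress(A) + local.flush(zlib.Z_SYNC_FLUSH) == CA
CB = local.compress(B) + local.flush(zlib.Z_FINISH)

# Step 3: only CB (the patch) is transmitted. It's tiny, because B is coded
# almost entirely as matches against A, which is still in the window.
assert len(CB) < 2048 < len(B)

# Step 4 (remote client): concatenate CA+CB, decompress the whole stream,
# and discard the first filesize(A) bytes to recover file B.
recovered = zlib.decompress(CA + CB)[len(A):]
assert recovered == B
```

The 32 KB window is exactly the "dictionary size must cover filesize(A)+filesize(B)" assumption above; a codec like LZHAM with a large dictionary lifts that limit.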

Several downsides to this idea: The remote client must compress file A (costing CPU and/or storage), the dictionary size must be large enough to encompass both files (costing RAM), and the remote client must decompress both files. These requirements could be too expensive for mobile with large files.

Patch file generation using LZHAM is simpler because it explicitly supports seed dictionaries. You set the seed dictionary to file A, compress file B, resulting in a compressed patch file that recovers B from A. This still can require too much RAM for mobile, because you need a dictionary of at least filesize(A) bytes.

Wednesday, June 10, 2015

A simple way to reduce memory pressure on Unity iOS (and maybe Android) titles

Here's how to control the tradeoff between garbage collection frequency and Mono memory headroom in Unity titles. I've tested this on iOS but it should work on any platform that uses LibGC to manage the garbage collected heap. (It would be nice if Unity exposed this in a simple way.)

The garbage collector used by Unity (LibGC - the Boehm collector) is quite greedy (its OS memory footprint only increases over time), and it allocates more RAM than strictly needed to reduce the frequency of collections. By default LibGC uses a setting that grows the heap by approx. 33% more RAM than needed at any time:

GC_API GC_word GC_free_space_divisor;

/* We try to make sure that we allocate at     */
/* least N/GC_free_space_divisor bytes between */
/* collections, where N is the heap size plus  */
/* a rough estimate of the root set size.      */
/* Initially, GC_free_space_divisor = 3.       */
/* Increasing its value will use less space    */
/* but more collection time.  Decreasing it    */
/* will appreciably decrease collection time   */
/* at the expense of space.                    */
/* GC_free_space_divisor = 1 will effectively  */
/* disable collections.                        */                                            

The LibGC internal function GC_collect_or_expand() bumps up the # of blocks to get from the OS by a factor controlled by this global variable. So if we increase this global (GC_free_space_divisor) LibGC should use less memory. Luckily, you don't need Unity source code to change this LibGC variable because it's global. (Note: At least on iOS. On other platforms with dynamic libraries changing this sucker may be trickier - but doable.)

In our game (Dungeon Boss), without changing this global, our Mono reserved heap size is 74.5MB in the first level. Setting this global to 16 at the dawn of time (from our main script's Start() method) reduces this to 61.1MB, for a savings of ~13.3MB of precious RAM. The collection frequency is increased, because the Mono heap will have less headroom to grow between collections, but it's still quite playable. 

Bumping up this divisor to 64 saves ~23MB of RAM, but collections occur several times per second (obviously unplayable).
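As a back-of-envelope sanity check (my own rough model, not LibGC's exact algorithm): if the collector keeps roughly heap_size/divisor bytes free between collections, the reserved heap settles around live * divisor / (divisor - 1):

```python
# Rough model: reserved = live + reserved/divisor
#           => reserved = live * divisor / (divisor - 1)
# This is an approximation for intuition only; LibGC's actual growth
# policy (GC_collect_or_expand) is more involved.

def approx_reserved_mb(live_mb, divisor):
    """Estimate OS-level heap reservation for a given live size and divisor."""
    return live_mb * divisor / (divisor - 1)

# Default divisor of 3: ~50% headroom over live data.
assert approx_reserved_mb(50, 3) == 75.0
# Divisor of 16: headroom drops to ~6.7%, at the cost of more frequent GCs.
assert round(approx_reserved_mb(50, 16), 1) == 53.3
```

The model roughly matches the direction of the numbers above (74.5 MB shrinking toward 61.1 MB when raising the divisor from 3 to 16), though the real collector's behavior is lumpier than this.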

To change this global in Unity iOS projects, you'll need to add a .C/CPP (or .M/.MM) file into your project with this helper:

typedef unsigned long GC_word;

#ifdef __cplusplus
extern "C" {   // keep C linkage for LibGC's symbol and for C#'s DllImport
#endif
extern GC_word GC_free_space_divisor;

void bfutils_SetLibGCFreeSpaceDivisor(int divisor)
{
   GC_free_space_divisor = divisor;
}
#ifdef __cplusplus
}
#endif

While you're doing this, you can also add these two externs so you can directly monitor LibGC's memory usage (directly bypassing Unity's Profiler class, which I think only works in Development builds):

extern "C" size_t GC_get_heap_size();
extern "C" size_t GC_get_free_bytes();

Now in your .CS code somewhere you can call this method:

// Requires: using System.Runtime.InteropServices;
[DllImport("__Internal")] private static extern void bfutils_SetLibGCFreeSpaceDivisor(int divisor);

public static void ConfigureLibGC()
{
    // The default divisor is 3. Anything higher saves memory, but causes more frequent collections.
    bfutils_SetLibGCFreeSpaceDivisor(16);
}

Just remember, if you change this setting LibGC will collect more frequently. But if your code uses aggressive object pooling (like it should) this shouldn't be a big deal. 

Also note that this global only impacts the amount of Mono heap memory headroom that LibGC tries to keep around. This global won't help you if you spike up your Mono heap's size by temporarily allocating very large blocks on the Mono heap. Large temp allocations should be avoided on the Mono heap because once LibGC gets its tentacles on some OS memory it doesn't ever let it go.


Monday, June 1, 2015

iOS Memory Pressure Hell: Apple's 64-bit requirement+Unity's IL2CPP tech

Apple's 64-bit requirement combined with the newness of Unity's promising new IL2CPP technology (which you must use to meet Apple's 64-bit requirement) is costing us a serious amount of memory pain:


IL2CPP uses around 30-39MB more RAM for our 32-bit build. We've been struggling with memory on iOS for a while (as documented here). I honestly don't know where to find 30MB more RAM right now, which means our title on 512MB devices (like the iPhone 4s and iPad Mini) is now broken. Crap.

Some details:

Apple is requiring app updates to support 64-bit executables by June 1st. We've known and have been preparing to ship a 64-bit build of our first product for several months. The only way to create a 64-bit iOS build is to use Unity's new IL2CPP technology, which compiles C# code to C++:

The Future of Scripting in Unity
An Introduction to IL2CPP Internals
Unity 4.6.2 iOS 64-bit Support

The additional memory utilized by IL2CPP builds vs. the previous Mono-AOT approach is ~10MB for our code with an empty scene, and >30MB with everything running.

We've also tested the latest patch, Unity 4.6.5p4 (which is supposed to use less memory on iOS) with the same results.

Thursday, May 21, 2015

Lessons learned while fixing memory leaks in our first Unity title

After several man-months of tool making, instrumenting and compiling our own custom Mono DLL, and crawling through 5k-30k node heap allocation graphs in Gephi, our first Unity title (Dungeon Boss for iOS/Android) is no longer leaking significant amounts of Mono heap memory. Last year, our uptime on 512MB iOS devices was 15-20 minutes; now it's hours.

It can be very easy to construct complex systems in C# which have degenerate (continually increasing) memory behavior over time, even though everything else seems fine. We label such systems as "leaking", although they don't actually leak memory in the C/C++ sense. All it takes is a single accidental strong reference somewhere to mistakenly "leak" huge amounts of objects over time. It can be a daunting task (even for the original authors) to discover how to fix a large system written in a garbage collected language so it doesn't leak.

Here's a brain dump of what we learned during the painful process of figuring this out:

- Monitor your Unity app's memory usage as early as possible during development. Mobile devices have some pretty harsh memory restrictions (see here to get an idea for iOS), so be sure to test on real devices early and often.

Be prepared for some serious pain if you only develop and test in the Unity editor for months on end.

- On iOS your app will receive low memory warnings when the system comes under memory pressure. (Note that iOS can be very chatty about issuing these warnings.) It can be helpful to log these warnings to your game's server (along with the amount of used client memory), to help do post-mortem analysis of why your app is randomly dying in the field.

- Our (unofficial) low end iOS devices are the iPhone 4/4s and iPad Mini 1st gen (512MB devices). If our total allocated memory (according to XCode's Memory Monitor) exceeds approx. 200MB for sustained periods of time, the app is eventually ruthlessly terminated by the kernel. Ideally, don't use more than 150-170MB on these devices.

- In Unity, the Mono (C#) heap is managed by the Boehm garbage collector. This is basically a C/C++-style heap with a garbage collector bolted on top of it.

Allocating memory is not cheap in this system. The version of Mono that Unity uses is pretty dated, so if you've been using the Microsoft .NET runtime for C# development then consider yourself horribly spoiled.

Treat the C# heap like a very precious resource, and study what C/C++ programmers do to avoid constantly allocating/freeing blocks, such as using custom object pools. Avoid the heap by preferring C# structs vs. classes, avoid boxing, use StringBuilder when manipulating strings, etc.

- In complex systems written in C# the careful use of weak references (or containers of weak references) can be extremely useful to avoid creating endless chains of strong object references. We had to switch several systems from strong to weak references in key places to make them stable, and discovering which systems to change can be very tricky.

- Determine up front the exact lifetime of your objects, and exactly when objects should no longer be referenced in your system. Don't just assume the garbage collector will automagically take care of things for you.

- The Boehm collector's OS memory footprint only stabilizes or increases over time, as far as we can tell. This means you should be very careful about allocating large temporary buffers or objects on the Mono heap. Doing so could unnecessarily bump up Mono's memory footprint, which will decrease the amount of memory "headroom" your app will have on iOS. Basically, once Mono grabs OS memory it greedily holds onto it until the end of time, and this memory can't be used for other things such as textures, the Unity C/C++ heap, etc.

- Be very careful about using Unity's WWW class to download large archives or bundles, because this class may store the downloaded data in the mono heap. This is actually a serious problem for us, because we download compressed Unity asset bundles during the user's first game session and this class was causing our app's mono memory footprint to be increased by 30-40MB. This seriously reduced our app's memory headroom during the user's first session (which in a free to play game is pretty critical to get right).

- The Boehm collector grows its OS memory allocation so it has enough internal heap headroom to avoid collecting too frequently. You must factor this headroom into account when budgeting your C# memory, i.e. if your budget calls for 25MB of C# memory then the actual amount of memory consumed at the OS level will be significantly larger (approximately 40-50MB in our experience).

- It's possible to force the Boehm collector used by Unity to never allocate more than a set amount of OS memory (see here) by calling GC_set_max_heap_size() very early during app initialization. Note that if you do this and your C# heap leaks your app will eventually just abort once the heap is full.

It may be possible to call this API over time to carefully bump up your app's Mono heap size as needed, but we haven't tried this yet.

- If your app leaks, and you can't figure out how to fix all the leaks, then an alternative solution that may be temporarily acceptable is to relentlessly free up as much memory as possible by optimizing assets, switching from PVRTC 4bpp to 2bpp, lowering sound and music bitrates, etc. This will give your app the memory headroom it needs to run for a reasonable period of time before the OS kills it.

If the user can play 20 levels per hour, and you leak 1MB per level, then you'll need to find 20MB of memory somewhere to run for one hour, etc. It can be far simpler to optimize some textures than to track down memory leaks in large C# codebases.

- Design your code to avoid creating tons of temporary objects that trigger frequent collections. One of our menu dialogs was accidentally triggering a collection every 2-4 frames on iOS, which was crushing our performance.

- We used the Daikon Forge UI library. This library has several very serious memory leaks. We'll try to submit these fixes back to the author, but I think the product is now more or less dead (so email me if you would like the fixes).

- Add some debug statistics to your game, along with the usual FPS display, and make sure this stuff works on your target devices:

Current total OS memory allocated (see here for iOS)

Total Mono heap used and reserved (You can retrieve this from Unity's Profiler class. Note this class returns all 0's in non-dev builds.)

Total number of GC's so far, number of frames since last GC, or average # of frames and seconds between GC's (you can infer when a GC occurs by monitoring the Mono heap's used size every Update() - when it decreases since the last Update() you can assume a GC has occurred sometime recently)

- From a developer's perspective the iOS memory API and tool situation is a ridiculous mess:
http://gamesfromwithin.com/i-want-my-memory-apple
http://liam.flookes.com/wp/2012/05/03/finding-ios-memory/

While monitoring our app's memory consumption on iOS, we had to observe and make sense of statistics from XCode's Memory Monitor, from Instruments, from the Mach kernel API's, and from Unity. It can be very difficult to make sense of all this crap.

At the end of the day, we trusted Unity's statistics the most because we understood exactly how these statistics were being computed.

- Instrument your game's key classes to track the # of live objects present in the system at any one time, and display this information somewhere easily visible to developers when running on device. Increment a global static counter in your object's constructor, and decrement it in your C# finalizer (destructor syntax); this method is automatically called when your object's memory is actually reclaimed by the GC.

- On iOS, don't be shy about using PVRTC 2bpp textures. They look surprisingly good vs. 4bpp, and this format can save you a large amount of memory. We wound up using 2bpp on all textures except for effects and UI sprites.

- The built-in Unity memory profiler works pretty well on iOS over USB. It's not that useful for tracking down gnarly C# scripting leaks, but it can be invaluable for tracking down asset problems.

- Here's our app's current memory usage on iOS from late last week. Most of this data is from Unity's built-in iOS memory profiler.

- Remember that leaks in C# code can propagate downstream and cause apparent leaks on the Unity C/C++ heap or asset leaks.

- It can be helpful to mentally model the mono heap as a complex directed graph, where the nodes are individual allocations and the edges are strong references. Anything directly or indirectly referenced from a static root (either global, or on the stack, etc.) won't be collected. In a large system with many leaks, try not to waste time fixing references to leaf nodes in this graph. Attack the problem as high up near the roots as you possibly can.

On the other hand, if you are under a large amount of time pressure to get a fix in right now, it can be easier to just fix the worst leaks (in terms of # of bytes leaked per level or whatever) by selectively null'ing out key references to leafier parts of the graph you know shouldn't be growing between levels. We wrote custom tools to help us determine the worst offenders to spend time on, sorted by which function the allocation occurred in. Fixing these leaks can buy you enough time to properly fix the problem.
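The "heap as directed graph" model above can be sketched as a toy mark phase: nodes are allocations, edges are strong references, and anything reachable from a root survives. Nulling one reference near a root frees an entire subtree at once. (Illustrative only; names are made up, and the real collector works on raw memory, not dicts.)

```python
# Toy reachability over the "heap as directed graph" model: refs maps each
# node to the nodes it strongly references; roots are statics/stack slots.

def reachable(roots, refs):
    """Return the set of nodes reachable from roots via strong references."""
    live, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        if node not in live:
            live.add(node)
            stack.extend(refs.get(node, ()))
    return live

refs = {"root": ["cache"], "cache": ["levelA", "levelB"], "levelA": ["tex1"]}
assert reachable(["root"], refs) == {"root", "cache", "levelA", "levelB", "tex1"}

# "Fixing the leak" high up, near the root, frees the whole subtree at once:
refs["root"] = []
assert reachable(["root"], refs) == {"root"}
```

This is why attacking accidental strong references near the roots pays off so much more than chasing individual leaf allocations.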

Wednesday, May 20, 2015

The great github migration

I've just migrated most of my projects from the sinking ship that is Google Code, but the wikis and readme.md files aren't set up yet:

https://github.com/richgel999/rg-etc1
https://github.com/richgel999/fxdis-d3d1x
https://github.com/richgel999/crunch
https://github.com/richgel999/picojpeg
https://github.com/richgel999/imageresampler
https://github.com/richgel999/miniz
https://github.com/richgel999/jpeg-compressor

The wiki data was auto-migrated into the "wiki" branch.

I haven't migrated LZHAM alpha (used by Planetside 2) yet. Still deciding if I should just archive it somewhere instead because it's a dead branch (LZHAM now lives here, and the much faster compressing development branch is here).

Friday, May 15, 2015

LZHAM decompressor optimization ideas

John Brooks (CEO of Blue Shift Inc.) and I were discussing LZHAM's current decompression performance vs. Google's Brotli codec and we came up with these ideas:

- Basic block optimization:
The current decompressor's inner loop is a general purpose coroutine based design, so a single implementation can handle both streaming and non-streaming scenarios. This hurts performance, because the coroutine structure consists of a huge switch statement, which causes the compiler to spill locals from registers to memory (and reload them) a lot.

I'm going to add an alternative non-coroutine version of the inner loop that won't support streaming to optimize the very common memory to memory scenario. (LZHAM already has a few non-streaming optimizations in there, but it still uses a huge switch statement that breaks up the inner loop into tons of basic blocks.)
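As a toy illustration of why the coroutine structure hurts (this is not LZHAM's actual code), compare a resumable-switch loop, where every "local" must live in a state struct that survives across returns, against the plain loop a non-streaming path can use:

```cpp
// Coroutine-style structure (protothreads / Duff's-device pattern): the
// switch jumps back into the middle of the loop on resume, so all state
// must be kept in memory rather than in registers.
struct toy_state { int stage = 0; int i = 0; int acc = 0; };

// Returns true at each yield point ("need more input"), false when done.
bool toy_decode_step(toy_state& s, const int* buf, int n)
{
    switch (s.stage)
    {
    case 0:
        for (s.i = 0; s.i < n; ++s.i)
        {
            s.acc += buf[s.i];   // pretend this consumed some input
            s.stage = 1;
            return true;         // yield back to the caller
    case 1:;                     // ...and resume here on the next call
        }
    }
    return false;
}

// Non-streaming version: one tight loop, locals stay in registers, no
// giant dispatch switch.
int toy_decode_all(const int* buf, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += buf[i];
    return acc;
}
```

Both produce the same result; the second shape is what the memory-to-memory fast path can use.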

- LZHAM doesn't have any SIMD optimizations in its Huffman routines. I've been hesitant to use SIMD code anywhere in LZHAM because it complicates testing, but some of the Huffman routines should be easy to optimize with SIMD code.

- Finally, LZHAM's small block performance suffers vs. LZMA, Brotli, or zlib because it must compute Huffman tables on the fly at a fairly high frequency near the beginning of streams. There's a simple solution to this, which is to use precomputed Huffman table codelengths at the start of streams, then switch to dynamically updated Huffman tables once it makes sense to do so.

I already have the code to do this stuff (from the original LZHAM prototype), but it would require breaking v1.0 format compatibility. (And I'm not going to break format compatibility - if/when I do the new thing will have a different name.)


Wednesday, May 13, 2015

Dungeon Boss's current memory footprint on iOS

A snapshot of our current memory footprint on iOS, using Unity's built-in memory profiling tool (thanks to Sean Cooper):

Category    MB
Mono        59
Unity       53
GfxDriver   34.4
Textures    29.4
Animations  23.8
Meshes      16.8
FMOD        13
Profiler    12.8
Audio       11.1


The Mono heap's allocator is a greedy SOB: in practice, only around 40-50% of its allocated memory contains persistent C# objects. We are going to try tweaking or modifying it soon to be less greedy, because we need the memory.

Also, the Unity heap is relatively huge now, so we're going to poke around in there and see what's going on.

Sunday, May 10, 2015

Industrial strength Mono memory analysis tools for large scale Unity games

We've been investing a bunch of time into creating a set of Mono (C#) memory leak and performance analysis tools, which we've been using to help us ship our first title (Dungeon Boss). Here's a high-level description of how the tools work and what we can currently do with them:

First, we have a custom instrumented version of mono.dll that captures and records a full transaction log of all mono and lower level libgc (Boehm collector) heap activity. It records to the log almost all internal mallocs/frees, mono allocs, and which mono allocs are collected during each GC. A full backtrace is stored with each log record.

We also record full RAM dumps at each GC, as well as the location of all static roots, thread stacks, etc. (The full RAM dumps may seem excessive, but they are extremely compressible with LZ4 and are key to understanding the relationships between allocations.)

We've also instrumented our game to record custom events to the log file: At menu, level start/end, encounter start, start of Update(), etc. 

A typical workflow is to run the game in a Windows standalone build using our instrumented version of mono.dll, which generates a large binary transaction log. We then post process this log using a C# tool named "HeapBoss", which spews out a huge .JSON file and a number of binary heap RAM snapshots. We then explore and continue processing all this data using an interactive C++ command line tool named "HeapMan".

Here's a list of things we can currently do once we have a heap transaction log and GC RAM snapshots:

- Log exploration and statistical analysis:
Dump the log command index of all GC's, the ID's of all custom events, dump all allocs or GC frees of a specific type, etc.

- Blame list construction: We can replay the transaction log up to a specific GC, then recreate the state of the mono heap at that particular GC. We can then construct a "blame list" of those C# functions which are responsible for each allocation. We use a manually created leaf function name exclusion list (consisting of about 50 regex patterns) to exclude the deepest functions from each backtrace which are too low level (or internal) to be interesting or useful for blame list construction. 

This method is useful for getting a picture of the top consumers of Mono heap memory at a specific spot in the game. We output this list to a .CSV file.
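A rough sketch of the blame-list idea, with invented types standing in for the real transaction-log records: walk each live allocation's backtrace from the leaf upward and charge the bytes to the first frame that isn't matched by the exclusion regex list.

```cpp
#include <algorithm>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record reconstructed from the heap transaction log.
struct allocation
{
    size_t bytes;
    std::vector<std::string> backtrace; // leaf-most frame first
};

std::map<std::string, size_t> build_blame_list(
    const std::vector<allocation>& live_allocs,
    const std::vector<std::regex>& exclusions)
{
    std::map<std::string, size_t> blame;
    for (const auto& a : live_allocs)
    {
        for (const auto& frame : a.backtrace)
        {
            bool excluded = std::any_of(exclusions.begin(), exclusions.end(),
                [&](const std::regex& r) { return std::regex_search(frame, r); });
            if (!excluded)
            {
                blame[frame] += a.bytes; // this function gets the blame
                break;
            }
        }
    }
    return blame;
}
```

The exclusion list is what keeps low-level internal helpers from soaking up all the blame.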

- Growth over time ("leak") analysis: We replay the transaction log up to spot A, create a heap state object, then continue playing up to spot B and create another heap state object. We then construct a blame list for each state object, diff them, then record the results to a .CSV file.

This allows us to see which functions have grown or shrunk their allocations over time. We've found many leaks this way. (Of course, they aren't leaks in the C/C++ sense. In C#, it's easy to construct systems which have degenerate memory behavior over time.)
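The diff step itself is simple; a sketch, assuming blame lists keyed by function name (names invented):

```cpp
#include <map>
#include <string>

// Subtract spot A's per-function byte totals from spot B's. Positive
// deltas are functions whose live allocations grew between the two
// snapshots -- the "leak" suspects.
std::map<std::string, long long> diff_blame_lists(
    const std::map<std::string, long long>& spot_a,
    const std::map<std::string, long long>& spot_b)
{
    std::map<std::string, long long> delta = spot_b; // start from spot B's totals...
    for (const auto& entry : spot_a)
        delta[entry.first] -= entry.second;          // ...and subtract spot A's
    return delta;
}
```

Sorting the result by delta gives the worst offenders to investigate first.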

- Various queries: Find all heap objects of type X. Find all objects on the heap which potentially have a pointer to a specific object (or address). Examine an object's memory at a specific GC and determine which objects it may potentially point to. Find the allocation that includes address X, if any. Find the vtable at address X, etc.

Note our tools don't know where the pointers are really located in a type, so when we examine the raw memory of an object instance it's possible to mistake some random bits for a valid object pointer. We do our best to exclude pointers which aren't aligned, or don't point to the beginning of another allocated object, etc. In practice this approach is surprisingly reliable.

- Root analysis to help find unintended leaks: Given an object, recursively find all objects which reference it, then all the objects which refer to those objects, etc., until you discover all the roots which directly or indirectly refer to the object. Output the results as a .DOT file and import it into Gephi for visualization and deeper analysis. (graphviz is another option, but it doesn't scale to large graphs as well as Gephi.)
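The reverse-reference walk can be sketched like this, using toy integer object IDs (the real tool works on snapshot addresses and emits .DOT):

```cpp
#include <map>
#include <queue>
#include <set>
#include <vector>

// Starting from a leaked object, repeatedly follow reverse-reference
// edges ("who points at me?") until the set stops growing. The visited
// set is what gets dumped to a .DOT file for Gephi.
std::set<int> collect_referrers(
    const std::map<int, std::vector<int>>& referrers, // object -> objects referencing it
    int leaked_obj)
{
    std::set<int> visited = { leaked_obj };
    std::queue<int> pending;
    pending.push(leaked_obj);
    while (!pending.empty())
    {
        int cur = pending.front();
        pending.pop();
        auto it = referrers.find(cur);
        if (it == referrers.end())
            continue;                       // nothing refers to cur: it's a root
        for (int ref : it->second)
            if (visited.insert(ref).second) // newly discovered referrer
                pending.push(ref);
    }
    return visited;
}
```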

- Heap transaction videos: Create a PNG image visualizing the active heap allocations and synchronize this video with the game. Helps to better understand how the title is utilizing/abusing heap memory at different spots during gameplay.

What we'll be doing next:

- "Churn" analysis: Create a histogram of the blame function for each GC free, sort by the # of frees or # of bytes freed to identify those functions which are creating the most garbage over time.

- Automatically identify the parts of the heap graph that always grow over time. Right now, we do this manually by first doing a growth analysis, then finding which types are responsible for the growth, then finding the instance(s) which grow over time.


Friday, May 8, 2015

Garbage collected systems must have good tools

Hey game engine developers: Please don't release garbage collected systems without solid tools.

We need the ability to take full heap snapshots, visualize allocation graphs, do shortest path analysis from roots to individual objects, etc. If you don't provide tools like this then it can be extremely painful to ship reliable large scale software that uses garbage collection.



Wednesday, April 29, 2015

LZHAM v1.0 is back on the Pareto frontier for decompression speed

LZHAM v1.0 is doing pretty good on this benchmark:

http://mattmahoney.net/dc/text.html
lzham 1.0 -d29 -x 25,002,070  202,237,199  191,600 s  202,428,799  1096  6.6 7800 LZ77 70

Thanks to Michael Crogan for giving me the heads up.

Wednesday, March 11, 2015

Graphing Heap Memory Allocations

We've instrumented the version of Mono that Unity v4.6 uses so it can create a full heap transaction log, as well as full heap memory snapshots after all GC's. With this data we can build complete graphs of all objects on the Mono heap, either going "down" starting from the static roots, or "up" from a single object. Note this heap visualization method is relevant to C/C++ too, because at its core this version of Mono actually uses the Boehm Collector for C/C++.

To build this graph, we recorded a heap transaction log and post-GC memory snapshots of our product running a single dungeon and exiting to the game's menu. I then played back this transaction log with our analysis tool ("HeapBoss") and stopped playing back the log right after the dungeon was exited. I had the tool start building the graph at a single ScriptableUXObject object on the heap (instances of this type are leaking after each dungeon). The tool then found all references to this object recursively, up to any roots and exported the graph to a .dot file. Gephi is used to visualize the graph, using the Yifan Hu algorithm to optimize the layout.

Using this visualization method we quickly determined why these objects are leaking, which was easy once we could visualize the chain of references in various containers back up to the UXManager.







Saturday, February 21, 2015

A Telemetry-style visualization of LZHAM v1.1's multithreaded compressor

I'm a big fan of Rad's Telemetry profiler, which I used at Valve for various Linux profiling tasks, but it's unfortunately out of reach to independent developers. So I'm just making my own mini version of Telemetry, using imgui for the UI. For most uses Telemetry itself is actually overkill anyway. I may open source this if I can find the time to polish it more.

I've optimized LZHAM v1.1's compressor for more efficient multithreading vs. v1.0, but it still doesn't scale as well as I was hoping. Average utilization on a Core i7 Gulftown (6 cores) is only around 50-75%. Previously, the main thread would wait for all parsing jobs to complete before coding; v1.1 can now overlap parsing and coding. I also optimized the match finder so it initializes more quickly at the beginning of each block, and the finder jobs are a little faster.

Here are the results on LZHAM v1.1 (currently on github here), limited to 6 total threads, after a few hours of messing around learning imgui. Not all the time in compress_block_internal() has been marked up for profiling, but it's a start:


Thread 0 is the main thread, while threads 1-5 are controlled by the task pool. The flow of time is from left to right, and this view visualizes approx. 616.4ms.

At the beginning of a block, the compressor allocated 5 total threads for match finding and 2 total threads for parsing.

The major operations:

- Thread 0 (the caller's thread) controls everything. compress_block_internal() is where each 512KB block is compressed. find_all_matches() prepares the data structures used by the parallel match finder, kicks off a bunch of find_matches_mt() tasks to find matches for the entire block, then it finds all the nearest len2 matches for the entire block. After this is done it begins parsing (which can be done in parallel) and coding (which must be done serially).

The main thread proceeds to process the 512KB input block in groups of (typically) 3KB "parse chunks". Several 3KB chunks can be parsed in parallel within a single group, but all work within a group must finish before proceeding to the next group.

- Coding is always done by thread 0.

Thread 0 waits for each chunk in a group to be parsed before it codes the results. Coding must proceed sequentially, which is why it's on thread 0. The first chunk in a group is always parsed on thread 0, while the other chunks in the group may be parsed using jobs on the thread pool. The parse group size is dynamically increased as the match finders finish up in order to keep the thread pool's utilization high.
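The group structure can be sketched like this (heavily simplified; "parsing" is stubbed out as a sum, and std::async stands in for LZHAM's task pool):

```cpp
#include <algorithm>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for parsing one chunk.
static int parse_chunk(const std::vector<int>& chunk)
{
    return std::accumulate(chunk.begin(), chunk.end(), 0);
}

// Within each group, the first chunk is parsed on the calling thread and
// the rest on worker tasks; the caller then "codes" every chunk's result
// serially and in order.
std::vector<int> process_block(const std::vector<std::vector<int>>& chunks,
                               size_t group_size)
{
    std::vector<int> coded;
    for (size_t base = 0; base < chunks.size(); base += group_size)
    {
        size_t end = std::min(base + group_size, chunks.size());

        std::vector<std::future<int>> jobs; // chunks 2..N of the group -> task pool
        for (size_t i = base + 1; i < end; ++i)
            jobs.push_back(std::async(std::launch::async, parse_chunk,
                                      std::cref(chunks[i])));

        coded.push_back(parse_chunk(chunks[base])); // first chunk on this thread
        for (auto& j : jobs)
            coded.push_back(j.get());               // coding stays serial & in order
    }
    return coded;
}
```

The group boundary is the synchronization point: nothing from group N+1 is coded until every chunk in group N has been parsed.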

- Threads 1-5 are controlled by the thread pool. In LZHAM v1.1, there are only two task types: match finding, and parsing. The parsers consume the match finder's output, and the main thread consumes the parser's output. Coding is downstream of everything.

The parsers may sometimes spin and then sleep if they get too far ahead of the finders, which happens more than I would like (match finding is usually the bottleneck). This can cause the main thread to stall as it tries to code each chunk in a group.

Here's a zoom in of the above graph, showing the parsers having to wait in various places:


Some of the issues I can see with LZHAM v1.1's multithreading:
- find_all_matches() is single threaded, and the world stops until this function finishes, limiting parallelization.
- find_len2_matches() should be a task.
- It's not visualized here, but the adler32() call (done once per block) can also be a task. Currently, it's done once on thread 0 after the finder tasks are kicked off.
- The finder tasks should take roughly the same amount of time to execute, but it's clear that the finder job on thread 4 took much longer than the others.

Here's an example of a very unbalanced situation on thread 3. These long finder tasks need to be split up into much smaller tasks.


- There's some serial work in stop_encoding(), which takes around 5ms. This function interleaves the arithmetic and Huffman codes into a single output bitstream.

v1.1 is basically ready, I just need to push it to the main github repo. I'm going to refine the profiler and then use it to tune LZHAM v1.1's multithreading more.

Sunday, February 8, 2015

More LZHAM small block optimizations

Improved tiny block performance (i.e. calling lzham_decompress_memory() on an 88 byte string) by 3.3x, by profiling the decompression of 400k blocks and focusing on the hotspots.

The top bottleneck was related to initializing the default Huffman tables, as I expected. At decomp startup, all the symbols have a frequency of 1, and the min/max code sizes are either equal (if the table's num_syms==pow2) or only differ by 1 (for non-pow2 tables). So this case was easy to optimize throughout the Huffman decoder table construction code.
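The special case can be sketched as follows (not LZHAM's actual table-construction code): when every symbol frequency is 1, the canonical code lengths can be written down directly, with no sorting or tree building.

```cpp
#include <cstdint>
#include <vector>

// For num_syms symbols of equal frequency: if num_syms is a power of two,
// every code is log2(num_syms) bits; otherwise the lengths differ by
// exactly one bit, with (2^ceil(log2 n) - n) codes getting the shorter
// length so the Kraft inequality is met exactly. (Which symbols get the
// shorter codes is a canonical-ordering detail.)
std::vector<uint8_t> uniform_freq_code_sizes(uint32_t num_syms)
{
    uint32_t bits = 0;
    while ((1u << bits) < num_syms)
        ++bits;                                     // bits = ceil(log2(num_syms))

    uint32_t num_shorter = (1u << bits) - num_syms; // codes one bit shorter

    std::vector<uint8_t> sizes(num_syms, (uint8_t)bits);
    for (uint32_t i = 0; i < num_shorter; ++i)
        sizes[i] = (uint8_t)(bits - 1);
    return sizes;
}
```

For example, 5 symbols yields lengths {2,2,2,3,3}: 3*(1/4) + 2*(1/8) = 1, and min/max differ by exactly one bit, matching the pow2/non-pow2 observation above.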

The next major bottleneck was a set of calls to malloc()/free() (up to a total of 34K worth, excluding the dictionary when using unbuffered decompression). I fixed this by adding a malloc_context parameter to any object or container that allocated/freed memory (which was a big pain), then allowing the user to optionally specify a fixed-size memory arena when they create the malloc context. The allocator functions in lzham_mem.cpp then try to allocate from this arena, which is treated as a simple stack. Only the decompressor uses an arena, because its allocation patterns are very simple.
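A minimal sketch of the arena-as-a-stack idea, with invented names (the real allocator falls back to malloc() when the arena is exhausted; here that's just a nullptr return):

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-size arena treated as a simple stack: alloc bumps an offset, and
// only the most recent allocation can actually be freed.
struct stack_arena
{
    uint8_t* m_base;
    size_t m_size;
    size_t m_offs = 0;

    stack_arena(uint8_t* base, size_t size) : m_base(base), m_size(size) {}

    static size_t align16(size_t n) { return (n + 15) & ~size_t(15); }

    void* alloc(size_t n)
    {
        n = align16(n);
        if (m_offs + n > m_size)
            return nullptr;          // exhausted: caller falls back to malloc()
        void* p = m_base + m_offs;
        m_offs += n;
        return p;
    }

    void free_top(void* p, size_t n) // LIFO frees only
    {
        n = align16(n);
        if ((uint8_t*)p + n == m_base + m_offs)
            m_offs -= n;             // pop; non-top frees are ignored
    }
};
```

This only works because the decompressor's allocation pattern is close to strictly nested, which is exactly why the compressor doesn't use one.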

I won't be pushing these changes up until I've done a lot more testing. I should probably make a branch.

Saturday, February 7, 2015

Upcoming LZHAM decompressor optimizations

All of these changes are on github:

Over the previous week I've increased the decompressor's average speed by roughly 1.5%-2%, by enabling Huffman decoding acceleration tables on the distance LSB symbol table.

I'm currently testing several changes which decrease the time it takes for the decompressor to initialize, and increase avg. throughput on tiny (<10KB) streams:

  • The initial set of Huffman tables are very simple because the symbol frequencies are all 1's. The decompressor now quickly detects this case and bypasses the more general canonical Huffman codesize computation step.
  • The decompressor bypasses the construction of Huffman decoding acceleration tables near the beginning of the stream. It computes the approximate cost of constructing the accel tables and decoding the next X symbols using those tables, versus just decoding the next X symbols without them, and chooses whichever is predicted to be cheaper. The Huffman decode acceleration tables can be rather big (up to 2048 entries), so this is only a win at the beginning of streams (when the tables are rapidly updated).
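The cost comparison might look something like this sketch, with invented relative-cost constants (the real decompressor's estimates differ):

```cpp
#include <cstdint>

// Only build the acceleration table when its construction cost is
// predicted to pay for itself over the symbols decoded before the table
// is next rebuilt.
bool should_build_accel_tables(uint32_t num_table_entries, uint32_t syms_until_rebuild)
{
    const double kBuildCostPerEntry = 1.0; // hypothetical cost units
    const double kSlowDecodePerSym  = 6.0; // bit-at-a-time decode
    const double kFastDecodePerSym  = 1.5; // accelerated table decode

    double cost_with_tables = num_table_entries * kBuildCostPerEntry +
                              syms_until_rebuild * kFastDecodePerSym;
    double cost_without     = syms_until_rebuild * kSlowDecodePerSym;
    return cost_with_tables < cost_without;
}
```

Early in the stream the tables are rebuilt after only a few symbols, so construction never amortizes; later, the rebuild interval grows and the tables win.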

x86 timing results on a 64 byte test file (file to file decompression, 64MB dictionary):

Before:
Total time including init, I/O, decompression: .146ms
Decompression only time: .040ms

After:
Total time including init, I/O, decompression: .120ms
Decompression only time: .020ms

x86 timing results on a 3087 byte test file:

Before:
Total: .269ms
Decomp only: .170ms

After:
Total: .230ms
Decomp only: .132ms

These optimizations should significantly reduce the time it takes to reinit() and restart decompression (hopefully around 30-50%). I'm going to profile the decompressor over the time it takes to startup and decode the first few KB next.