Wednesday, May 13, 2015

Dungeon Boss's current memory footprint on iOS

A snapshot of our current memory footprint on iOS, using Unity's built-in memory profiling tool (thanks to Sean Cooper):

Mono       59
Unity      53
GfxDriver  34.4
Textures   29.4
Animations 23.8
Meshes     16.8
FMOD       13
Profiler   12.8
Audio      11.1


Mono heap's allocator is a greedy SOB, and in practice only around 40-50% of its allocated memory contains persistent C# objects. We are going to try tweaking or modifying it soon to be less greedy, because we need the memory.

Also, the Unity heap is relatively huge now so we're going to poke around in there and see what's going in.

Sunday, May 10, 2015

Industrial strength Mono memory analysis tools for large scale Unity games

We've been investing a bunch of time into creating a set of Mono (C#) memory leak and performance analysis tools, which we've been using to help us ship our first title (Dungeon Boss). Here's a high-level description of how the tools work and what we can currently do with them:

First, we have a custom instrumented version of mono.dll that captures and records a full transaction log of all mono and lower level libgc (Boehm collector) heap activity. It records to the log almost all internal malloc's/free's, mono allocs, and which mono allocs are collected during each GC. A full backtrace is stored with each log record.

We also record full RAM dumps at each GC, as well as the location of all static roots, thread stacks, etc. (The full RAM dumps may seem excessive, but they are extremely compressible with LZ4 and are key to understanding the relationships between allocations.)

We've also instrumented our game to record custom events to the log file: At menu, level start/end, encounter start, start of Update(), etc. 

A typical workflow is to run the game in a Windows standalone build using our instrumented version of mono.dll, which generates a large binary transaction log. We then post process this log using a C# tool named "HeapBoss", which spews out a huge .JSON file and a number of binary heap RAM snapshots. We then explore and continue processing all this data using an interactive C++ command line tool named "HeapMan".

Here's a list of things we can currently do once we have a heap transaction log and GC RAM snapshots:

- Log exploration and statistical analysis:
Dump the log command index of all GC's, the ID's of all custom events, dump all allocs or GC frees of a specific type, etc.

- Blame list construction: We can replay the transaction log up to a specific GC, then recreate the state of the mono heap at that particular GC. We can then construct a "blame list" of those C# functions which are responsible for each allocation. We use a manually created leaf function name exclusion list (consisting of about 50 regex patterns) to exclude the deepest functions from each backtrace which are too low level (or internal) to be interesting or useful for blame list construction. 

This method is useful for getting a picture of the top consumers of Mono heap memory at a specific spot in the game. We output this list to a .CSV file.

- Growth over time ("leak") analysis: We replay the transaction log up to spot A, create a heap state object, then continue playing up to spot B and create another heap state object. We then construct a blame list for each state object, diff them, then record the results to a .CSV file.

This allows us to see which functions have grown or shrunk their allocations over time. We've found many leaks this way. (Of course, they aren't leaks in the C/C++ sense. In C#, it's easy to construct systems which have degenerate memory behavior over time.)

- Various queries: Find all heap objects of type X. Find all objects on the heap which potentially have a pointer to a specific object (or address). Examine an object's memory at a specific GC and determine which objects it may potentially point to. Find the allocation that includes address X, if any. Find the vtable at address X, etc.

Note our tools don't know where the pointers are really located in a type, so when we examine the raw memory of an object instance it's possible to mistake some random bits for a valid object pointer. We do our best to exclude pointers which aren't aligned, or don't point to the beginning of another allocated object, etc. In practice this approach is surprisingly reliable.

- Root analysis to help find unintended leaks: Given an object, recursively find all objects which reference it, then find all the objects which refer to those objects, etc. until you discover all the roots which directly or indirectly refer to the specific objects. Output the results as a .DOT file and import into gephi for visualization and deeper analysis. (graphviz is another option, but it doesn't scale to large graphs as well as gephi.)

- Heap transaction videos: Create a PNG image visualizing the active heap allocations and synchronize this video with the game. Helps to better understand how the title is utilizing/abusing heap memory at different spots during gameplay.

What we'll be doing next:

- "Churn" analysis: Create a histogram of the blame function for each GC free, sort by the # of frees or # of bytes freed to identify those functions which are creating the most garbage over time.

- Automatically identify the parts of the heap graph that always grow over time. Right now, we do this manually by first doing a growth analysis, then finding which types are responsible for the growth, then finding the instance(s) which grow over time.


Friday, May 8, 2015

Garbage collected systems must have good tools

Hey game engine developers: Please don't release garbage collected systems without solid tools.

We need the ability to take full heap snapshots, visualize allocation graphs, do shortest path analysis from roots to individual objects, etc. If you don't provide tools like this then it can be extremely painful to ship reliable large scale software that uses garbage collection.



Wednesday, April 29, 2015

LZHAM v1.0 is back on the Pareto frontier for decompression speed

LZHAM v1.0 is doing pretty good on this benchmark:

http://mattmahoney.net/dc/text.html
lzham 1.0 -d29 -x 25,002,070  202,237,199  191,600 s  202,428,799  1096  6.6 7800 LZ77 70

Thanks to Michael Crogan for giving me the heads up.

Wednesday, March 11, 2015

Graphing Heap Memory Allocations

We've instrumented the version of Mono that Unity v4.6 uses so it can create a full heap transaction log, as well as full heap memory snapshots after all GC's. With this data we can build complete graphs of all objects on the Mono heap, either going "down" starting from the static roots, or "up" from a single object. Note this heap visualization method is relevant to C/C++ too, because at its core this version of Mono actually uses the Boehm Collector for C/C++.

To build this graph, we recorded a heap transaction log and post-GC memory snapshots of our product running a single dungeon and exiting to the game's menu. I then played back this transaction log with our analysis tool ("HeapBoss") and stopped playing back the log right after the dungeon was exited. I had the tool start building the graph at a single ScriptableUXObject object on the heap (instances of this type are leaking after each dungeon). The tool then found all references to this object recursively, up to any roots and exported the graph to a .dot file. Gephi is used to visualize the graph, using the Yifan Hu algorithm to optimize the layout.

Using this visualization method we quickly determined why these objects are leaking, which was easy once we could visualize the chain of references in various containers back up to the UXManager.







Saturday, February 21, 2015

A Telemetry-style visualization of LZHAM v1.1's multithreaded compressor

I'm a big fan of Rad's Telemetry profiler, which I used at Valve for various Linux profiling tasks, but it's unfortunately out of reach to independent developers. So I'm just making my own mini version of Telemetry, using imgui for the UI. For most uses Telemetry itself is actually overkill anyway. I may open source this if I can find the time to polish it more.

I've optimized LZHAM v1.1's compressor for more efficient multithreading vs. v1.0, but it's still doesn't scale quite as much as I was hoping. Average utilization on Core i7 Gulftown (6 cores) is only around 50-75%. Previously, the main thread would wait for all parsing jobs to complete before coding, v1.1 can now overlap parsing and coding. I also optimized the match finder so it initializes more quickly at the beginning of each block, and the finder jobs are a little faster.

Here are the results on LZHAM v1.1 (currently on github here), limited to 6 total threads, after a few hours of messing around learning imgui. Not all the time in compress_block_internal() has been marked up for profiling, but it's a start:


Thread 0 is the main thread, while threads 1-5 are controlled by the task pool. The flow of time is from left to right, and this view visualizes approx. 616.4ms.

At the beginning of a block, the compressor allocated 5 total threads for match finding and 2 total threads for parsing.

The major operations:

- Thread 0 (the caller's thread) controls everything. compress_block_internal() is where each 512KB block is compressed. find_all_matches() prepares the data structures used by the parallel match finder, kicks off a bunch of find_matches_mt() tasks to find matches for the entire block, then it finds all the nearest len2 matches for the entire block. After this is done it begins parsing (which can be done in parallel) and coding (which must be done serially).

The main thread proceeds to process the 512KB input block in groups of (typically) 3KB "parse chunks". Several 3KB chunks can be parsed in parallel within a single group, but all work within a group must finish before proceeding to the next group.

- Coding is always done by thread 0.

Thread 0 waits for each chunk in a group to be parsed before it codes the results. Coding must proceed sequentially, which is why it's on thread 0. The first chunk in a group is always parsed on thread 0, while the other chunks in the group may be parsed using jobs on the thread pool. The parse group size is dynamically increased as the match finders finish up in order to keep the thread pool's utilization high.

- Threads 1-5 are controlled by the thread pool. In LZHAM v1.1, there are only two task types: match finding, and parsing. The parsers consume the match finder's output, and the main thread consumes the parser's output. Coding is downstream of everything.

The match finders may sometimes spin and then sleep if they get too far ahead of the finders, which happens more than I would like (match finding is usually the bottleneck). This can cause the main thread to stall as it tries to code each chunk in a group.

Here's a zoom in of the above graph, showing the parsers having to wait in various places:


Some of the issues I can see with LZHAM v1.1's multithreading:
- find_all_matches() is single threaded, and the world stops until this function finishes, limiting parallelization.
- find_len2_matches() should be a task.
- It's not visualized here, but the adler32() call (done once per block) can also be a task. Currently, it's done once on thread 0 after the finder tasks are kicked off.
- The finder tasks should take roughly the same amount of time to execute, but it's clear that the finder job on thread 4 took much longer than the others.

Here's an example of a very unbalanced situation on thread 3. These long finder tasks need to be split up into much smaller tasks.


- There's some serial work in stop_encoding(), which takes around 5ms. This function interleaves the arithmetic and Huffman codes into a single output bitstream.

v1.1 is basically ready, I just need to push it to the main github repo. I'm going to refine the profiler and then use it to tune LZHAM v1.1's multithreading more.

Sunday, February 8, 2015

More LZHAM small block optimizations

Improved tiny block performance (i.e. calling lzham_decompress_memory() on a 88 byte string) by 3.3x by profiling the decompression of 400k blocks and focusing on the hotspots.

The top bottleneck was related to initializing the default Huffman tables, as I expected. At decomp startup, all the symbols have a frequency of 1, and the min/max code sizes are either equal (if the table's num_syms==pow2) or only differ by 1 (for non-pow2 tables). So this case was easy to optimize throughout the Huffman decoder table construction code.

The next major bottleneck were some calls to malloc()/free() (up to a total of 34K worth, excluding the dictionary when using unbuffered decompression). I fixed this by adding a malloc_context parameter to any object or container that allocated/freed memory (which was a big pain), then allowing the user to optionally specify a fixed-size memory arena when they create the malloc context. The allocator functions in lzham_mem.cpp then try to allocate from this arena, which just treats it as a simple stack. Only the decompressor uses an arena, because its allocation patterns are very simple.

I won't be pushing these changes up until a lot more testing. I should probably make a branch.