Wednesday, May 20, 2015
I've just migrated most of my projects from the sinking ship that is Google Code, but the wikis and readme.md files aren't set up yet:
https://github.com/richgel999/rg-etc1
https://github.com/richgel999/fxdis-d3d1x
https://github.com/richgel999/crunch
https://github.com/richgel999/picojpeg
https://github.com/richgel999/imageresampler
https://github.com/richgel999/miniz
https://github.com/richgel999/jpeg-compressor
The wiki data was auto-migrated into the "wiki" branch.
I haven't migrated LZHAM alpha (used by PlanetSide 2) yet. I'm still deciding whether I should just archive it somewhere instead, because it's a dead branch (LZHAM now lives here, and the much faster compressing development branch is here).
Friday, May 15, 2015
LZHAM decompressor optimization ideas
John Brooks (CEO of Blue Shift Inc.) and I were discussing LZHAM's current decompression performance vs. Google's Brotli codec and we came up with these ideas:
- Basic block optimization:
The current decompressor's inner loop is a general-purpose coroutine-based design, so a single implementation can handle both streaming and non-streaming scenarios. This hurts performance because the coroutine structure consists of a huge switch statement, which forces the compiler to spill locals from registers to memory (and reload them) a lot.
I'm going to add an alternative, non-coroutine version of the inner loop that won't support streaming, to optimize the very common memory-to-memory scenario (see the first sketch after this list). (LZHAM already has a few non-streaming optimizations in there, but it still uses a huge switch statement that breaks the inner loop up into tons of basic blocks.)
- LZHAM doesn't have any SIMD optimizations in its Huffman routines. I've been hesitant to use SIMD code anywhere in LZHAM because it complicates testing, but some of the Huffman routines should be easy to optimize with SIMD code (see the second sketch after this list).
- Finally, LZHAM's small block performance suffers vs. LZMA, Brotli, or zlib because it must compute Huffman tables on the fly at a fairly high frequency near the beginning of streams. There's a simple solution to this, which is to use precomputed Huffman table codelengths at the start of streams, then switch to dynamically updated Huffman tables once it makes sense to do so.
I already have the code to do this stuff (from the original LZHAM prototype), but it would require breaking v1.0 format compatibility. (And I'm not going to break format compatibility - if/when I do, the new thing will have a different name.)
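To make the first idea concrete, here's a rough sketch of the coroutine-style loop vs. a straight-line memory-to-memory loop. All of the names and the "decode" logic below are purely illustrative stand-ins, not LZHAM's actual internals:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical decoder state; not LZHAM's real structures.
struct decoder_state {
    int     resume_point = 0;   // which "case" the coroutine resumes at
    size_t  dst_ofs      = 0;
    uint8_t pending      = 0;   // stand-in for partially decoded symbol state
    // ... bit buffer, match length/distance, Huffman tables, etc.
};

// Coroutine-style loop: every step passes through the switch so the function
// can return ("suspend") whenever it runs out of input. The big switch breaks
// the loop into many basic blocks, so hot locals keep getting spilled to memory.
bool decode_streaming(decoder_state& s, const uint8_t*& src, const uint8_t* src_end,
                      uint8_t* dst, size_t dst_size)
{
    for (;;) {
        switch (s.resume_point) {
        case 0: // fetch the next symbol (stand-in for the real literal/match decode)
            if (src == src_end) return false;   // suspend: caller must supply more input
            s.pending = *src++;
            s.resume_point = 1;
            break;
        case 1: // emit the decoded output
            dst[s.dst_ofs++] = s.pending;
            if (s.dst_ofs == dst_size) return true;   // finished
            s.resume_point = 0;
            break;
        }
    }
}

// Non-streaming variant: the whole compressed block is in memory, so there is
// no suspend/resume. One tight loop, far fewer basic blocks, and the hot
// locals can stay in registers.
bool decode_memory_to_memory(const uint8_t* src, const uint8_t* src_end,
                             uint8_t* dst, size_t dst_size)
{
    size_t dst_ofs = 0;
    while (dst_ofs < dst_size && src != src_end)
        dst[dst_ofs++] = *src++;   // stand-in for the real symbol decode + match copy
    return dst_ofs == dst_size;
}
```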
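And here's the kind of low-hanging SIMD fruit I mean for the Huffman routines: building the code-length histogram used during table construction, 16 symbols per SSE2 pass. This is an illustrative sketch (it assumes a max code length of 16), not LZHAM's actual routine:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cstring>

// Sketch: count how many symbols use each code length (1..16).
static void count_code_lengths_sse2(const uint8_t* code_lens, uint32_t num_syms,
                                    uint32_t counts[17])
{
    memset(counts, 0, 17 * sizeof(uint32_t));
    const __m128i zero = _mm_setzero_si128();

    uint32_t i = 0;
    for (; i + 16 <= num_syms; i += 16) {
        const __m128i v = _mm_loadu_si128((const __m128i*)(code_lens + i));
        for (uint32_t len = 1; len <= 16; ++len) {
            // Lanes equal to 'len' become 0xFF; summing them with SAD against
            // zero gives two 16-bit partial sums of 255 * (number of matches).
            const __m128i eq  = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)len));
            const __m128i sad = _mm_sad_epu8(eq, zero);
            counts[len] += ((uint32_t)_mm_cvtsi128_si32(sad) +
                            (uint32_t)_mm_extract_epi16(sad, 4)) / 255;
        }
    }
    // Scalar tail for the remaining symbols; zero lengths mean "symbol unused".
    for (; i < num_syms; ++i)
        if (code_lens[i])
            counts[code_lens[i]]++;
}
```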
Wednesday, May 13, 2015
Dungeon Boss's current memory footprint on iOS
A snapshot of our current memory footprint on iOS, using Unity's built-in memory profiling tool (thanks to Sean Cooper):
Mono: 59 MB
Unity: 53 MB
GfxDriver: 34.4 MB
Textures: 29.4 MB
Animations: 23.8 MB
Meshes: 16.8 MB
FMOD: 13 MB
Profiler: 12.8 MB
Audio: 11.1 MB
The Mono heap's allocator is a greedy SOB; in practice only around 40-50% of its allocated memory contains persistent C# objects. We're going to try tweaking or modifying it soon to be less greedy, because we need the memory.
Also, the Unity heap is relatively huge now, so we're going to poke around in there and see what's going on.
Sunday, May 10, 2015
Industrial strength Mono memory analysis tools for large scale Unity games
We've been investing a bunch of time into creating a set of Mono (C#) memory leak and performance analysis tools, which we've been using to help us ship our first title (Dungeon Boss). Here's a high-level description of how the tools work and what we can currently do with them:
First, we have a custom instrumented version of mono.dll that captures and records a full transaction log of all Mono and lower-level libgc (Boehm collector) heap activity. It logs almost all internal mallocs/frees, Mono allocations, and which Mono allocations are collected during each GC. A full backtrace is stored with each log record.
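To give a rough idea of what one of these records might contain, here's an illustrative layout (not the tool's actual on-disk format):

```cpp
#include <cstdint>

// Illustrative transaction-log record; not the real format.
enum class heap_event : uint8_t {
    libgc_malloc, libgc_free, mono_alloc, mono_gc_begin,
    mono_gc_collected, mono_gc_end, custom_event
};

struct log_record {
    heap_event event;
    uint64_t   timestamp;      // e.g. QueryPerformanceCounter ticks
    uint64_t   address;        // allocation address (0 if not applicable)
    uint32_t   size;           // allocation size in bytes
    uint32_t   type_id;        // Mono class/vtable id for managed allocs
    uint16_t   backtrace_len;  // number of return addresses that follow
    // followed in the stream by backtrace_len 64-bit return addresses
};
```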
We also record full RAM dumps at each GC, as well as the location of all static roots, thread stacks, etc. (The full RAM dumps may seem excessive, but they are extremely compressible with LZ4 and are key to understanding the relationships between allocations.)
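For example, each snapshot can be run through LZ4's simple one-shot API before it's written out. A minimal sketch (not the tool's actual code; framing and error handling are simplified):

```cpp
#include <lz4.h>      // https://github.com/lz4/lz4
#include <vector>
#include <cstdio>

// Compress one heap snapshot with LZ4 and append it to the log file.
static bool write_snapshot(FILE* log, const char* heap, int heap_size)
{
    std::vector<char> dst(LZ4_compressBound(heap_size));
    const int packed = LZ4_compress_default(heap, dst.data(), heap_size, (int)dst.size());
    if (packed <= 0)
        return false;
    // Store the original and compressed sizes so the analysis tool can decode it.
    fwrite(&heap_size, sizeof(heap_size), 1, log);
    fwrite(&packed, sizeof(packed), 1, log);
    fwrite(dst.data(), 1, (size_t)packed, log);
    return true;
}
```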
We've also instrumented our game to record custom events to the log file: at the menu, level start/end, encounter start, start of Update(), etc.
A typical workflow is to run the game in a Windows standalone build using our instrumented version of mono.dll, which generates a large binary transaction log. We then post process this log using a C# tool named "HeapBoss", which spews out a huge .JSON file and a number of binary heap RAM snapshots. We then explore and continue processing all this data using an interactive C++ command line tool named "HeapMan".
Here's a list of things we can currently do once we have a heap transaction log and GC RAM snapshots:
- Log exploration and statistical analysis:
Dump the log command index of all GCs, the IDs of all custom events, dump all allocs or GC frees of a specific type, etc.
- Blame list construction: We can replay the transaction log up to a specific GC, then recreate the state of the Mono heap at that particular GC. We can then construct a "blame list" of the C# functions responsible for each allocation. We use a manually created leaf-function-name exclusion list (consisting of about 50 regex patterns) to exclude the deepest functions from each backtrace which are too low-level (or internal) to be interesting or useful for blame list construction (see the first sketch after this list).
This method is useful for getting a picture of the top consumers of Mono heap memory at a specific spot in the game. We output this list to a .CSV file.
- Growth over time ("leak") analysis: We replay the transaction log up to spot A and create a heap state object, then continue replaying up to spot B and create another heap state object. We then construct a blame list for each state object, diff them, and record the results to a .CSV file.
This allows us to see which functions have grown or shrunk their allocations over time. We've found many leaks this way. (Of course, they aren't leaks in the C/C++ sense. In C#, it's easy to construct systems which have degenerate memory behavior over time.)
- Various queries: Find all heap objects of type X. Find all objects on the heap which potentially have a pointer to a specific object (or address). Examine an object's memory at a specific GC and determine which objects it may potentially point to. Find the allocation that includes address X, if any. Find the vtable at address X, etc.
Note our tools don't know where the pointers are really located within a type, so when we examine the raw memory of an object instance it's possible to mistake some random bits for a valid object pointer. We do our best to exclude pointers which aren't aligned, or don't point to the beginning of another allocated object, etc. In practice this approach is surprisingly reliable (see the second sketch after this list).
- Root analysis to help find unintended leaks: Given an object, recursively find all objects which reference it, then all the objects which refer to those objects, and so on, until you discover all the roots which directly or indirectly refer to the specific object. Output the results as a .DOT file and import it into Gephi for visualization and deeper analysis. (Graphviz is another option, but it doesn't scale to large graphs as well as Gephi.)
- Heap transaction videos: Create PNG images visualizing the active heap allocations over time and synchronize the resulting video with the game. This helps us better understand how the title is utilizing/abusing heap memory at different spots during gameplay.
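Here's a minimal sketch of the blame-function selection: walk each allocation's backtrace from the leaf outward and pick the first frame whose name doesn't match an exclusion pattern. The patterns and names below are made up for illustration, not our actual exclusion list:

```cpp
#include <regex>
#include <string>
#include <vector>

// Illustrative exclusion patterns; the real list has ~50 of these.
static const std::vector<std::regex> k_exclusions = {
    std::regex("^System\\..*"),
    std::regex("^UnityEngine\\..*"),
    std::regex(".*::Alloc.*"),
};

// Walk the backtrace from the leaf (index 0) outward and return the first
// frame that isn't excluded; that frame gets "blamed" for the allocation.
std::string find_blame_function(const std::vector<std::string>& backtrace)
{
    for (const std::string& frame : backtrace) {
        bool excluded = false;
        for (const std::regex& re : k_exclusions)
            if (std::regex_match(frame, re)) { excluded = true; break; }
        if (!excluded)
            return frame;
    }
    return backtrace.empty() ? "<unknown>" : backtrace.back();
}
```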
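And a sketch of the conservative pointer scan behind the queries: treat each pointer-aligned word of an object's snapshot memory as a candidate reference, and keep it only if it lands exactly on the start of a live allocation. Again, names are illustrative only:

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Live allocations at the GC being inspected: start address -> size.
using alloc_map = std::map<uint64_t, uint32_t>;

// Scan an object's raw memory (from the RAM snapshot) for plausible object
// references. Random bits can still masquerade as pointers, but requiring
// alignment and an exact allocation-start hit filters out most false positives.
std::vector<uint64_t> find_candidate_refs(const uint8_t* obj_mem, uint32_t obj_size,
                                          const alloc_map& live_allocs)
{
    std::vector<uint64_t> refs;
    for (uint32_t ofs = 0; ofs + sizeof(uint64_t) <= obj_size; ofs += sizeof(uint64_t)) {
        uint64_t candidate;
        memcpy(&candidate, obj_mem + ofs, sizeof(candidate));
        if (candidate & 7)                       // not pointer aligned
            continue;
        if (live_allocs.find(candidate) != live_allocs.end())   // allocation start?
            refs.push_back(candidate);
    }
    return refs;
}
```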
What we'll be doing next:
- "Churn" analysis: Create a histogram of the blame function for each GC free, sort by the # of frees or # of bytes freed to identify those functions which are creating the most garbage over time.
- Automatically identify the parts of the heap graph that always grow over time. Right now, we do this manually by first doing a growth analysis, then finding which types are responsible for the growth, then finding the instance(s) which grow over time.
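The planned churn histogram is straightforward once the blame function is known for each freed allocation; a quick illustrative sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct churn_stats { uint64_t num_frees = 0; uint64_t bytes_freed = 0; };

// Accumulate one GC-free record into the per-blame-function histogram.
void record_free(std::map<std::string, churn_stats>& hist,
                 const std::string& blame_func, uint32_t size)
{
    churn_stats& s = hist[blame_func];
    s.num_frees++;
    s.bytes_freed += size;
}

// Sort the histogram by bytes freed to find the top garbage producers.
std::vector<std::pair<std::string, churn_stats>>
top_garbage_producers(const std::map<std::string, churn_stats>& hist)
{
    std::vector<std::pair<std::string, churn_stats>> v(hist.begin(), hist.end());
    std::sort(v.begin(), v.end(), [](const auto& a, const auto& b) {
        return a.second.bytes_freed > b.second.bytes_freed;
    });
    return v;
}
```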
Friday, May 8, 2015
Garbage collected systems must have good tools
Hey game engine developers: Please don't release garbage collected systems without solid tools.
We need the ability to take full heap snapshots, visualize allocation graphs, do shortest path analysis from roots to individual objects, etc. If you don't provide tools like this then it can be extremely painful to ship reliable large scale software that uses garbage collection.
Wednesday, April 29, 2015
LZHAM v1.0 is back on the Pareto frontier for decompression speed
LZHAM v1.0 is doing pretty well on this benchmark:
http://mattmahoney.net/dc/text.html
The relevant table entry:
lzham 1.0 -d29 -x | 25,002,070 | 202,237,199 | 191,600 s | 202,428,799 | 1096 | 6.6 | 7800 | LZ77 | 70
Thanks to Michael Crogan for giving me the heads up.
Wednesday, March 11, 2015
Graphing Heap Memory Allocations
We've instrumented the version of Mono that Unity v4.6 uses so it can create a full heap transaction log, as well as full heap memory snapshots after all GCs. With this data we can build complete graphs of all objects on the Mono heap, either going "down" starting from the static roots, or "up" from a single object. Note this heap visualization method is relevant to C/C++ too, because at its core this version of Mono actually uses the Boehm collector for C/C++.
To build this graph, we recorded a heap transaction log and post-GC memory snapshots of our product running a single dungeon and exiting to the game's menu. I then played back this transaction log with our analysis tool ("HeapBoss") and stopped the playback right after the dungeon was exited. I had the tool start building the graph at a single ScriptableUXObject object on the heap (instances of this type are leaking after each dungeon). The tool then found all references to this object recursively, up to any roots, and exported the graph to a .dot file. Gephi is used to visualize the graph, with the Yifan Hu algorithm to optimize the layout.
Using this visualization method we quickly determined why these objects are leaking, which was easy once we could visualize the chain of references in various containers back up to the UXManager.
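Something like the following could drive the "up" traversal and the .dot export. This isn't our tool's actual code, just a sketch of the approach; it assumes the referrer sets have already been recovered from the RAM snapshots (e.g. via the pointer scan described in the previous post):

```cpp
#include <cstdio>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// referrers[X] = set of object addresses that contain a pointer to X,
// recovered from the post-GC RAM snapshot.
using referrer_map = std::map<uint64_t, std::set<uint64_t>>;

// Walk "up" from a leaked object, collecting every object that directly or
// indirectly references it, and emit the edges as a Graphviz .dot file that
// Gephi can import.
void export_up_graph(const referrer_map& referrers, uint64_t leaked_obj, const char* path)
{
    FILE* f = fopen(path, "w");
    if (!f) return;
    fprintf(f, "digraph refs {\n");

    std::set<uint64_t> visited;
    std::vector<uint64_t> stack{ leaked_obj };
    while (!stack.empty()) {
        const uint64_t obj = stack.back();
        stack.pop_back();
        if (!visited.insert(obj).second)
            continue;                          // already expanded this object
        auto it = referrers.find(obj);
        if (it == referrers.end())
            continue;                          // nothing refers to it
        for (uint64_t parent : it->second) {
            fprintf(f, "  \"%llx\" -> \"%llx\";\n",
                    (unsigned long long)parent, (unsigned long long)obj);
            stack.push_back(parent);
        }
    }
    fprintf(f, "}\n");
    fclose(f);
}
```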