Hey game engine developers: Please don't release garbage collected systems without solid tools.
We need the ability to take full heap snapshots, visualize allocation graphs, do shortest path analysis from roots to individual objects, etc. If you don't provide tools like this, it can be extremely painful to ship reliable large-scale software that uses garbage collection.
Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Friday, May 8, 2015
Wednesday, April 29, 2015
LZHAM v1.0 is back on the Pareto frontier for decompression speed
LZHAM v1.0 is doing pretty well on this benchmark:
http://mattmahoney.net/dc/text.html
lzham 1.0 -d29 -x  25,002,070  202,237,199  191,600 s  202,428,799  1096  6.6  7800  LZ77  70

Thanks to Michael Crogan for giving me the heads up.
Wednesday, March 11, 2015
Graphing Heap Memory Allocations
We've instrumented the version of Mono that Unity v4.6 uses so it can create a full heap transaction log, as well as full heap memory snapshots after each GC. With this data we can build complete graphs of all objects on the Mono heap, either going "down" starting from the static roots, or "up" from a single object. Note that this heap visualization method is relevant to C/C++ too, because at its core this version of Mono uses the Boehm collector, which also works as a C/C++ garbage collector.
To build this graph, we recorded a heap transaction log and post-GC memory snapshots of our product running a single dungeon and exiting to the game's menu. I then played this transaction log back with our analysis tool ("HeapBoss"), stopping right after the dungeon was exited. I had the tool start building the graph at a single ScriptableUXObject object on the heap (instances of this type leak after each dungeon). The tool then recursively found all references to this object, up to the roots, and exported the graph to a .dot file. Gephi is used to visualize the graph, with the Yifan Hu algorithm to optimize the layout.
Using this visualization method we quickly determined why these objects were leaking, which was easy once we could visualize the chain of references in various containers back up to the UXManager.
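The core of the "up" walk is just a reverse-reference traversal over the snapshot, plus a .dot export. Here's a minimal sketch using a hypothetical snapshot model (HeapBoss itself isn't public, so all the names here are made up):

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical snapshot model: each object has an address and the list of
// addresses it points to (extracted from the post-GC heap snapshot).
struct heap_object {
    uint64_t addr;
    std::string type_name;
    std::vector<uint64_t> refs; // outgoing pointers found in the snapshot
};

// Walk "up" from a leaked object: build a reverse (referrer) index, then
// repeatedly find every object referencing anything already visited, until
// the walk dead-ends at the roots. Returns the (src -> dst) edges.
static std::vector<std::pair<uint64_t, uint64_t> >
collect_referrer_edges(const std::vector<heap_object>& snapshot, uint64_t leaked_addr)
{
    std::map<uint64_t, std::vector<uint64_t> > referrers;
    for (size_t i = 0; i < snapshot.size(); i++)
        for (size_t j = 0; j < snapshot[i].refs.size(); j++)
            referrers[snapshot[i].refs[j]].push_back(snapshot[i].addr);

    std::vector<std::pair<uint64_t, uint64_t> > edges;
    std::set<uint64_t> visited;
    std::vector<uint64_t> stack(1, leaked_addr);
    visited.insert(leaked_addr);
    while (!stack.empty()) {
        uint64_t cur = stack.back();
        stack.pop_back();
        std::map<uint64_t, std::vector<uint64_t> >::const_iterator it = referrers.find(cur);
        if (it == referrers.end())
            continue; // nothing references cur: it's held directly by a root
        for (size_t i = 0; i < it->second.size(); i++) {
            edges.push_back(std::make_pair(it->second[i], cur));
            if (visited.insert(it->second[i]).second)
                stack.push_back(it->second[i]);
        }
    }
    return edges;
}

// The edge list can then be dumped as a .dot digraph for Gephi:
static void write_dot(const std::vector<std::pair<uint64_t, uint64_t> >& edges, FILE* f)
{
    fprintf(f, "digraph heap {\n");
    for (size_t i = 0; i < edges.size(); i++)
        fprintf(f, "  \"%llx\" -> \"%llx\";\n",
                (unsigned long long)edges[i].first,
                (unsigned long long)edges[i].second);
    fprintf(f, "}\n");
}
```

The visited set is what keeps cyclic container references (very common in UI object graphs) from looping forever.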
Saturday, February 21, 2015
A Telemetry-style visualization of LZHAM v1.1's multithreaded compressor
I'm a big fan of Rad's Telemetry profiler, which I used at Valve for various Linux profiling tasks, but it's unfortunately out of reach for independent developers. So I'm just making my own mini version of Telemetry, using imgui for the UI. For most uses Telemetry itself is actually overkill anyway. I may open source this if I can find the time to polish it more.
I've optimized LZHAM v1.1's compressor for more efficient multithreading vs. v1.0, but it still doesn't scale quite as well as I was hoping. Average utilization on Core i7 Gulftown (6 cores) is only around 50-75%. Previously, the main thread would wait for all parsing jobs to complete before coding; v1.1 can now overlap parsing and coding. I also optimized the match finder so it initializes more quickly at the beginning of each block, and the finder jobs are a little faster.
Here are the results on LZHAM v1.1 (currently on github here), limited to 6 total threads, after a few hours of messing around learning imgui. Not all the time in compress_block_internal() has been marked up for profiling, but it's a start:
Thread 0 is the main thread, while threads 1-5 are controlled by the task pool. The flow of time is from left to right, and this view visualizes approx. 616.4ms.
At the beginning of a block, the compressor allocated 5 total threads for match finding and 2 total threads for parsing.
The major operations:
- Thread 0 (the caller's thread) controls everything. compress_block_internal() is where each 512KB block is compressed. find_all_matches() prepares the data structures used by the parallel match finder, kicks off a bunch of find_matches_mt() tasks to find matches for the entire block, then it finds all the nearest len2 matches for the entire block. After this is done it begins parsing (which can be done in parallel) and coding (which must be done serially).
The main thread proceeds to process the 512KB input block in groups of (typically) 3KB "parse chunks". Several 3KB chunks can be parsed in parallel within a single group, but all work within a group must finish before proceeding to the next group.
- Coding is always done by thread 0.
Thread 0 waits for each chunk in a group to be parsed before it codes the results. Coding must proceed sequentially, which is why it's on thread 0. The first chunk in a group is always parsed on thread 0, while the other chunks in the group may be parsed using jobs on the thread pool. The parse group size is dynamically increased as the match finders finish up in order to keep the thread pool's utilization high.
- Threads 1-5 are controlled by the thread pool. In LZHAM v1.1, there are only two task types: match finding, and parsing. The parsers consume the match finder's output, and the main thread consumes the parser's output. Coding is downstream of everything.
The parsers may sometimes spin and then sleep if they get too far ahead of the match finders, which happens more than I would like (match finding is usually the bottleneck). This can cause the main thread to stall as it tries to code each chunk in a group.
Here's a zoom in of the above graph, showing the parsers having to wait in various places:
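The group-of-chunks pattern described above can be sketched like this (hypothetical stand-ins using std::async in place of LZHAM's task pool; the real parse and coding steps are of course far more involved):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Within a group, chunks are parsed in parallel (first chunk on the calling
// thread); results are then coded strictly in order on the calling thread.
struct chunk_result { std::vector<int> decisions; };

static chunk_result parse_chunk(const unsigned char* p, size_t n) {
    (void)p;
    chunk_result r; r.decisions.assign(n, 0); // stand-in for the real parse
    return r;
}

static void code_chunk(const chunk_result& r, std::vector<int>& out) {
    // stand-in for the (inherently serial) arithmetic/Huffman coding step
    out.insert(out.end(), r.decisions.begin(), r.decisions.end());
}

static void compress_block(const unsigned char* block, size_t block_size,
                           size_t chunk_size, size_t group_size,
                           std::vector<int>& coded)
{
    size_t ofs = 0;
    while (ofs < block_size) {
        // Kick off parse tasks for every chunk in the group except the first.
        std::vector<std::future<chunk_result> > tasks;
        std::vector<size_t> sizes;
        size_t group_ofs = ofs;
        for (size_t i = 0; (i < group_size) && (group_ofs < block_size); i++) {
            size_t n = std::min(chunk_size, block_size - group_ofs);
            if (i > 0)
                tasks.push_back(std::async(std::launch::async, parse_chunk,
                                           block + group_ofs, n));
            sizes.push_back(n);
            group_ofs += n;
        }
        // The first chunk of the group is parsed right here, then coding begins.
        code_chunk(parse_chunk(block + ofs, sizes[0]), coded);
        for (size_t i = 1; i < sizes.size(); i++)
            code_chunk(tasks[i - 1].get(), coded); // wait, then code in order
        ofs = group_ofs;
    }
}
```

The key constraint this preserves is that code_chunk() always runs on the caller's thread, in stream order, while the parse work for later chunks overlaps it.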
Some of the issues I can see with LZHAM v1.1's multithreading:
- find_all_matches() is single threaded, and the world stops until this function finishes, limiting parallelization.
- find_len2_matches() should be a task.
- It's not visualized here, but the adler32() call (done once per block) can also be a task. Currently, it's done once on thread 0 after the finder tasks are kicked off.
- The finder tasks should take roughly the same amount of time to execute, but it's clear that the finder job on thread 4 took much longer than the others.
Here's an example of a very unbalanced situation on thread 3. These long finder tasks need to be split up into much smaller tasks.
- There's some serial work in stop_encoding(), which takes around 5ms. This function interleaves the arithmetic and Huffman codes into a single output bitstream.
v1.1 is basically ready, I just need to push it to the main github repo. I'm going to refine the profiler and then use it to tune LZHAM v1.1's multithreading more.
Sunday, February 8, 2015
More LZHAM small block optimizations
Improved tiny block performance (i.e. calling lzham_decompress_memory() on an 88 byte string) by 3.3x, by profiling the decompression of 400K blocks and focusing on the hotspots.
The top bottleneck was related to initializing the default Huffman tables, as I expected. At decomp startup, all the symbols have a frequency of 1, and the min/max code sizes are either equal (if the table's num_syms is a power of 2) or differ by only 1 (for non-pow2 tables). So this case was easy to optimize throughout the Huffman decoder table construction code.
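When every symbol frequency is 1, the canonical code lengths can be written down directly instead of running the general Huffman construction. A sketch of the special case (illustrative only, not the actual LZHAM table construction code):

```cpp
#include <vector>

// With num_syms equal-weight symbols: if num_syms is a power of 2, all codes
// are log2(num_syms) bits. Otherwise, with k = ceil(log2(num_syms)), exactly
// (2^k - num_syms) codes are k-1 bits and the rest are k bits, so the
// min/max code sizes differ by at most 1.
static std::vector<unsigned> flat_huffman_code_sizes(unsigned num_syms)
{
    unsigned k = 0;
    while ((1u << k) < num_syms)
        k++;
    unsigned num_short = (1u << k) - num_syms; // codes of length k-1
    std::vector<unsigned> sizes(num_syms, k);
    for (unsigned i = 0; i < num_short; i++)
        sizes[i] = k - 1;
    return sizes;
}
```

For example, a 256-symbol table gets all 8-bit codes, while a 5-symbol table gets lengths {2, 2, 2, 3, 3} - matching the "equal or differ by 1" observation above, and trivially satisfying the Kraft inequality.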
The next major bottleneck was a set of calls to malloc()/free() (up to a total of 34K worth, excluding the dictionary when using unbuffered decompression). I fixed this by adding a malloc_context parameter to any object or container that allocated/freed memory (which was a big pain), then allowing the user to optionally specify a fixed-size memory arena when they create the malloc context. The allocator functions in lzham_mem.cpp then try to allocate from this arena, which is treated as a simple stack. Only the decompressor uses an arena, because its allocation patterns are very simple.
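The arena-as-a-stack idea is simple enough to sketch. This is a hypothetical minimal version, not LZHAM's actual lzham_mem.cpp:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Fixed-size stack arena with a malloc() fallback. A free only pops the
// stack when it exactly undoes the most recent arena allocation, which is
// enough for very simple (mostly LIFO) patterns like the decompressor's.
struct malloc_context {
    uint8_t* arena;     // user-provided fixed-size buffer (may be NULL)
    size_t arena_size;
    size_t ofs;         // current top of the arena stack
};

static size_t align_up(size_t n) { return (n + 15) & ~(size_t)15; }

static void* ctx_alloc(malloc_context* ctx, size_t size)
{
    size = align_up(size);
    if (ctx->arena && (ctx->ofs + size) <= ctx->arena_size) {
        void* p = ctx->arena + ctx->ofs;
        ctx->ofs += size;
        return p;
    }
    return malloc(size); // arena absent or exhausted: fall back
}

static void ctx_free(malloc_context* ctx, void* p, size_t size)
{
    uint8_t* q = (uint8_t*)p;
    if (ctx->arena && q >= ctx->arena && q < ctx->arena + ctx->arena_size) {
        if (q + align_up(size) == ctx->arena + ctx->ofs)
            ctx->ofs -= align_up(size); // LIFO free: pop the stack
        // non-LIFO frees inside the arena are ignored; the space comes back
        // when the whole context is reset between decompressions
        return;
    }
    free(p);
}
```

Note this design requires the free path to know the allocation's size (or to look it up), which is why threading a context through every allocating object was the painful part.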
I won't be pushing these changes up until I've done a lot more testing. I should probably make a branch.
Saturday, February 7, 2015
Upcoming LZHAM decompressor optimizations
All of these changes are on github:
Over the previous week I've increased the decompressor's average speed by roughly 1.5%-2%, by enabling Huffman decoding acceleration tables on the distance LSB symbol table.
I'm currently testing several changes which decrease the time it takes for the decompressor to initialize, and increase avg. throughput on tiny (<10KB) streams:
- The initial set of Huffman tables are very simple because the symbol frequencies are all 1's. The decompressor now quickly detects this case and bypasses the more general canonical Huffman codesize computation step.
- The decompressor bypasses the construction of Huffman decoding acceleration tables near the beginning of the stream. It computes the approx. cost of constructing the accel. tables and decoding the next X symbols using those tables, versus just decoding the next X symbols without accel. tables, and chooses whichever is predicted to be cheapest. The Huff decode acceleration tables can be rather big (up to 2048 entries), so this is only a win at the beginning of streams (when the tables are rapidly updated).
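The accel-table decision above is a straightforward cost comparison. A sketch with made-up cost constants (the real values would come from measuring the actual table build and per-symbol decode costs, not these illustrative numbers):

```cpp
// Decide whether building a Huffman decode acceleration table pays off for
// the next num_syms_to_decode symbols. Costs are in arbitrary "time units";
// the constants are illustrative, not LZHAM's tuned values.
static bool should_build_accel_table(unsigned table_entries,
                                     unsigned num_syms_to_decode)
{
    const double kBuildCostPerEntry = 1.0; // filling one accel table entry
    const double kFastDecodeCost = 1.0;    // per symbol, using the table
    const double kSlowDecodeCost = 4.0;    // per symbol, bit-by-bit tree walk

    double cost_with = table_entries * kBuildCostPerEntry
                     + num_syms_to_decode * kFastDecodeCost;
    double cost_without = num_syms_to_decode * kSlowDecodeCost;
    return cost_with < cost_without;
}
```

Early in the stream the tables are rebuilt frequently, so few symbols are decoded per rebuild and a 2048-entry table rarely pays for itself; later on the build cost amortizes away and the table always wins.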
x86 timing results on a 64 byte test file (file to file decompression, 64MB dictionary):
Before:
Total time including init, I/O, decompression: .146ms
Decompression only time: .040ms
After:
Total time including init, I/O, decompression: .120ms
Decompression only time: .020ms
x86 timing results on a 3087 byte test file:
Before:
Total: .269ms
Decomp only: .170ms
After:
Total: .230ms
Decomp only: .132ms
These optimizations should significantly reduce the time it takes to reinit() and restart decompression (hopefully by around 30-50%). Next, I'm going to profile the decompressor over the time it takes to start up and decode the first few KB.
Wednesday, February 4, 2015
7zip 9.38 custom codec plugin for LZHAM 1.0
Igor Pavlov and the devs at encode.ru gave me some pointers on how to create a custom 7z codec DLL. Here's the first 32-bit build of the 7zip custom codec plugin DLL, compatible with 7zip 9.38 beta x86 (note I've updated this with 1 decompressor bugfix since the first release a couple days ago):
http://www.tenacioussoftware.com/7z_lzhamcodec_test3.7z
http://www.tenacioussoftware.com/7z_...c_test3_src.7z
To use this: install 9.38 beta (I got it from http://sourceforge.net/projects/seve...es/7-Zip/9.38/), manually create a subdirectory named "Codecs" in your install directory (i.e. "C:\Program Files (x86)\7-Zip\Codecs"), and extract LzhamCodec.DLL into this directory. If you run "7z i" you should see this:
Codecs:
Lib ID Name
0 C 303011B BCJ2
0 C 3030103 BCJ
0 C 3030205 PPC
...
0 C 30401 PPMD
0 40301 Rar1
0 40302 Rar2
0 40303 Rar3
0 C 6F10701 7zAES
0 C 6F00181 AES256CBC
1 C 90101 LZHAM
The ID used in this DLL (90101) may change - I'm working this out with Igor.
This version's command line parameters are similar to 7z-lzham 9.20:
-mx=[0,9] - Controls LZHAM compression level, dictionary size, and extreme parsing. -mx=9 enables the "best of 4" (extreme) parser, which can be very slow.
Details:
-mx=0: lzham_level 0, dict_size_log2 18
-mx=1: 0, 20 (fastest)
-mx=2: 1, 21
-mx=3: 2, 21
-mx=4: 2, 22
-mx=5: 3, 22
-mx=6: 3, 23
-mx=7: 4, 25
-mx=8: 4, 26 (slowest - the default)
-mx=9: 4, 26 best of 4 parsing (even slower)
-mt=on/off or integer - Controls threading (default=on)
-ma=[0,1] - Deterministic parsing. 1=on (default=off)
-md=64K, etc. - Override dictionary size. If you don't set this the dictionary size is inferred via the -mx=# parameter setting.
Example usage:
7z -m0=LZHAM -mx=8 a temp *.dll
To use LZHAM from the GUI, you can put custom parameters on the command line (without -m prefixes), i.e.: "0=bcj 1=lzham x=8". (I think you can only specify algorithm and level options here.) Note if you set the Compression Level pulldown to "Ultra" LZHAM will use best of 4 (extreme) parsing and be pretty slow, so the highest you probably want to use is "Maximum":