I'm currently seeing overall decompression speedups around 1.8x - 3.8x faster vs. previous LZHAM releases on Unity asset bundle files. On relatively incompressible files (like MP3's), it's around 2-2.3x faster, and 30-40% faster on enwik9. This is on a Core i7, I'll have statistics on iOS devices early next week.
I'm removing some experimental stuff in LZHAM that adds little to no real value:
- Got rid of all the Polar coding stuff and some misc. leftover params (like the cacheline modeling stuff)
- No more literal prediction, or delta literal predictions, for slightly faster decoding, lower memory usage, and faster initialization of the decompressor.
- Reduced is_match contexts from 768 to 12. The loss in ratio was a fraction of a percent (if any), but the decompressor can be initialized more quickly and the inner loop is slightly simplified because the prev/prev_prev decoded chars don't need to be tracked any more.
Just 2 main Huffman tables for literals/delta literals now, instead of 128 tables (!) like the previous releases. The tiny improvement in ratio (if any on many files) just didn't justify all the extra complexity. The decompressor's performance is now more stable (i.e. not so dependent on the data being compressed) and I don't need to worry about optimizing the initialization of a zillion Huff tables during the decoder's init.
I'm adding several optional, but extremely useful comp/decomp params:
// Controls tradeoff between ratio and decompression throughput. 0=default, or [1,LZHAM_MAX_TABLE_UPDATE_RATE], higher=faster but lower ratio.
lzham_uint32 m_table_update_rate;
m_table_update_rate is a higher level/simpler way of controlling these 2 optional params:
// Advanced settings - set to 0 if you don't care.
// def=64, typical range 12-128, controls the max interval between table updates, higher=longer interval between updates (faster decode/lower ratio)
lzham_uint32 m_table_max_update_interval;
// def=16, 8 or higher, scaled by 8, controls the slowing of the update update freq, higher=more rapid slowing (faster decode/lower ratio), 8=no slowing at all.
lzham_uint32 m_table_update_interval_slow_rate;
These parameters allow the user to tune the scheduling of the Huffman table updates. The out of the box defaults now cause much less frequent table updating than previous releases. The overall ratio change from the slowest (more frequent) to fastest setting is around 1%. The speed difference during decompression from the slowest to fastest setting is around 2-3x.
Next up: going to generate some CSV files to make some nice graphs, then the iOS and (eventually) Android ports.
Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Saturday, January 17, 2015
Friday, January 16, 2015
LZHAM v1.0 progress
Ported to OSX, and exposed several new compression/decompression parameters to allow the user to configure some of the codec's inner workings: literal/delta_literal bitmasks (or number of literal/delta literal bits - less versatile but simpler), the Huff table max interval between updates, and the rate at which the Huff table update interval slows between updates. These settings are absolutely critical to the decompressor's performance, memory, and CPU cache utilization.
The very early results are promising: 25-30% faster decoding (Core i7) and much less memory usage (still determining) by just tuning the settings (less tables/slower updating), with relatively little impact on compression ratio. The ratio reduction is only a fraction of 1% on the few files I've tested. (Disclaimer: I've only just got this working. These results do make sense -- it takes a bunch of CPU to update the Huff decode tables.)
Also, by reducing the # of Huff tables the decompressor shouldn't bog down nearly so much on mostly incompressible data. The user can currently select between 1-64 tables (separately for literals and delta literals, for up to 128 total tables). The codec supports prediction orders between 0-2, with 2 programmable predictor bitmasks for literals/delta_literals. (I'm not really sure exposing separate masks for literals vs. delta literals is useful, but after my experience optimizing LZMA's options with Unity asset data I'm now leaning to just exposing all sorts of stuff and let the caller figure it out.)
I'm also going to expose dictionary position related bitmasks to feed into the various predictions, just like LZMA, because they are valuable on real-life game data.
Annoyingly, when I lower the compression ratio decompression can get dramatically faster. I believe this has to do with a different mix of the decode loop exercised by the lower ratio bitstream, but I'm not really sure yet (and I don't remember if I figured out why 3+ years ago). I'll be writing a guide on how to tune the various settings to speed up LZHAM's decompressor.
On the downside, the user has more knobs to turn to make max use of the codec.
The very early results are promising: 25-30% faster decoding (Core i7) and much less memory usage (still determining) by just tuning the settings (less tables/slower updating), with relatively little impact on compression ratio. The ratio reduction is only a fraction of 1% on the few files I've tested. (Disclaimer: I've only just got this working. These results do make sense -- it takes a bunch of CPU to update the Huff decode tables.)
Also, by reducing the # of Huff tables the decompressor shouldn't bog down nearly so much on mostly incompressible data. The user can currently select between 1-64 tables (separately for literals and delta literals, for up to 128 total tables). The codec supports prediction orders between 0-2, with 2 programmable predictor bitmasks for literals/delta_literals. (I'm not really sure exposing separate masks for literals vs. delta literals is useful, but after my experience optimizing LZMA's options with Unity asset data I'm now leaning to just exposing all sorts of stuff and let the caller figure it out.)
I'm also going to expose dictionary position related bitmasks to feed into the various predictions, just like LZMA, because they are valuable on real-life game data.
Annoyingly, when I lower the compression ratio decompression can get dramatically faster. I believe this has to do with a different mix of the decode loop exercised by the lower ratio bitstream, but I'm not really sure yet (and I don't remember if I figured out why 3+ years ago). I'll be writing a guide on how to tune the various settings to speed up LZHAM's decompressor.
Wednesday, January 14, 2015
More LZHAM notes
It's been a while since I've made any major changes to LZHAM (except for minor cmake related stuff). This was a codec I wrote over a few nights and weekends while I was also working my day job. I eventually had to let active dev on LZHAM go to sleep because I got "sidetracked" shipping Portal 2. The codec's alpha has already been successfully deployed in several products, such as Planetside 2 and Titanfall, which isn't bad for a few nights of R&D and implementation work.
I covered what I was thinking of doing with LZHAM in this blog post. I have more interest in improving it again. For the types of products I'm now working on, what matters a lot is the title's retention rate, from first starting the product to the customer actually getting into real gameplay. Slow downloads or updates, loading screens, etc. equals lost users. Lost users=lower monetization. We actually measure the retention rate of every aspect of this in the field. So things like background downloading, streaming, proper organization of asset data into Unity asset bundles, and of course good data compression matter massively to us.
Anyhow, some ideas for LZHAM decompression startup and throughput improvements which I can do pretty quickly:
- After much testing on our game data, I now realize I underestimated how useful the various LZMA settings are. Right now LZHAM always uses the upper 3 MSB's of the prev. two literals for literal/delta literal contexts. Allow the user to control all of this: which prev. literal(s) (if any), say up to 8 bytes back, and which bits from those literals, separately for each type of prediction (literals/delta literals).
- In my quest to get LZHAM's ratio up to be similar to LZMA I made several tradeoffs which can greatly impact decompression perf, especially on uncompressible data. Right now the codec must always init and manage 64*2 Huffman tables. Allow the user to reduce or even increase the # of tables.
- LZHAM was designed for "solid" compression, where you give the codec dozens to hundreds of MB's containing many assets, and you don't restart/reinit the codec in between assets. It's like a slow to start drag racer. So it can suck on small files.
I'm not honestly 100% sure what to do about this yet that won't kill decompression perf. The way LZHAM updates Huffman tables seems like an albatross here. Amortized over many MB's it typically works fine, but on small files they can't be updated (adapted) quickly enough. Less tables are probably good here.
I could just integrate something like miniz into the codec, and try using it on each internal compressor block and using whatever is better. But that seems horrible.
- The Huffman table update frequency needs to be better tuned. If I can't think of anything smarter, allow the user to control the update schedule.
Note if you are very serious about fast, high ratio compression and decompression, Rad's Oodle product is very good. Given what I know about it, it's the best (fastest, highest compression, and most scalable/portable) production class lossless codec I know of.
I covered what I was thinking of doing with LZHAM in this blog post. I have more interest in improving it again. For the types of products I'm now working on, what matters a lot is the title's retention rate, from first starting the product to the customer actually getting into real gameplay. Slow downloads or updates, loading screens, etc. equals lost users. Lost users=lower monetization. We actually measure the retention rate of every aspect of this in the field. So things like background downloading, streaming, proper organization of asset data into Unity asset bundles, and of course good data compression matter massively to us.
Anyhow, some ideas for LZHAM decompression startup and throughput improvements which I can do pretty quickly:
- After much testing on our game data, I now realize I underestimated how useful the various LZMA settings are. Right now LZHAM always uses the upper 3 MSB's of the prev. two literals for literal/delta literal contexts. Allow the user to control all of this: which prev. literal(s) (if any), say up to 8 bytes back, and which bits from those literals, separately for each type of prediction (literals/delta literals).
- In my quest to get LZHAM's ratio up to be similar to LZMA I made several tradeoffs which can greatly impact decompression perf, especially on uncompressible data. Right now the codec must always init and manage 64*2 Huffman tables. Allow the user to reduce or even increase the # of tables.
- LZHAM was designed for "solid" compression, where you give the codec dozens to hundreds of MB's containing many assets, and you don't restart/reinit the codec in between assets. It's like a slow to start drag racer. So it can suck on small files.
I'm not honestly 100% sure what to do about this yet that won't kill decompression perf. The way LZHAM updates Huffman tables seems like an albatross here. Amortized over many MB's it typically works fine, but on small files they can't be updated (adapted) quickly enough. Less tables are probably good here.
I could just integrate something like miniz into the codec, and try using it on each internal compressor block and using whatever is better. But that seems horrible.
- The Huffman table update frequency needs to be better tuned. If I can't think of anything smarter, allow the user to control the update schedule.
Note if you are very serious about fast, high ratio compression and decompression, Rad's Oodle product is very good. Given what I know about it, it's the best (fastest, highest compression, and most scalable/portable) production class lossless codec I know of.
Friday, January 9, 2015
Improved Unity asset bundle file compression
Download Times Matter
Sean Cooper, Ryan Inselmann and I have been building a custom lossless archiver designed specifically for Unity asset bundle files. The archiver itself uses several well-known techniques in the hardcore archiving/game repacking world. This post is mostly about how we've begun to tune the LZMA settings used by the archiver to be more effective on Unity asset bundle data. We'll cover the actual archiver in a later post.
LZMA has several knobs you can turn to potentially improve compression:
http://stackoverflow.com/questions/3057171/lzma-compression-settings-details
The LZHAM codec (my faster to decode alternative to LZMA) isn't easily tunable in the same way as LZMA yet, which is a flaw. I totally regret this because tuning these options works:
Total asset bundle asset data (iOS): 197,514,230
http://stackoverflow.com/questions/3057171/lzma-compression-settings-details
The LZHAM codec (my faster to decode alternative to LZMA) isn't easily tunable in the same way as LZMA yet, which is a flaw. I totally regret this because tuning these options works:
Total asset bundle asset data (iOS): 197,514,230
Unity's built-in compression: 91,692,327
Our archiver, un-tuned LZMA: 76,049,530Our archiver, tuned LZMA (48 option trials): 75,231,764
Our archiver, tuned LZMA (225 option trials): 74,154,817
LZ codecs with untunable models/settings are much less interesting to me now. (Yet one more thing to work on in LZHAM.)
Here are the optimal settings we've found for each of our Unity asset classes on iOS. So for example, textures are compressed using different LZMA options (3, 3, 3) vs. animation clips (1, 2, 2).
Best LZMA settings after trying all 225 options. (In case of ties the compressor just selects the lowest lc,lp,pb settings - I just placed a triply nested for() loop around calling LZMA and it chooses the first best as the "best".)
Class (id): lc lp pb
GameObject (1): 8 0 0
Light (108): 2 2 2
Animation (111): 8 2 2
MonoScript (115): 2 0 0
LineRenderer (120): 0 0 0
SphereCollider (135): 0 0 0
SkinnedMeshRenderer (137): 8 4 2
AssetBundle (142): 0 2 2
WindZone (182): 0 0 0
ParticleSystem (198): 0 2 3
ParticleSystemRenderer (199): 8 3 3
Camera (20): 0 2 2
Material (21): 3 0 1
SpriteRenderer (212): 0 0 0
MeshRenderer (23): 8 4 2
Texture2D (28): 8 2 3
MeshFilter (33): 8 4 1
Transform (4): 6 2 2
Mesh (43): 0 0 1
MeshCollider (64): 1 4 1
BoxCollider (65): 7 4 2
AnimationClip (74): 0 2 2
AudioSource (82): 0 0 0
AudioClip (83): 8 0 0
Avatar (90): 2 0 2
AnimatorController (91): 2 0 2
? (95): 7 4 1
TrailRenderer (96): 0 0 0
Tuesday, January 6, 2015
Utils/tools to help transition to OSX from Windows
I've developed software under Kubuntu and Windows for years now. I found adapting to Kubuntu from a Windows world to be amazingly easy (ignoring the random Linux driver installation headaches). After a few days mucking around with some keyboard shortcuts and relearning Bash and I was all set.
These days, the center of gravity in the mobile development world is OSX due to iOS's market share, so it's time I bit the bullet and dive into the Apple world.
I'll admit, transitioning to OSX has been mind numbingly painful at times. (Apple, why do you persist on using such a wonky keyboard layout? Where's my ctrl+left??! No alt+tab?? wtf? ARGH.) I mentioned this to Matt Pritchard, a long time OSX/iOS developer, and he passed along a list of OSX utilities and tools that can help make the transition easier for long-time Windows developers.
Monday, January 5, 2015
BonusXP's hybrid office arrangement
After my Valve experience I'm now deeply interested in how companies arrange and maintain the actual space their employees work in. The first thing I do when I enter a studio is look beyond the reception area and poke around to see how much importance and thought went into their actual workspace. This can tell you a lot about a company. Ignore the marketing and just observe.
BonusXP (makers of Monster Crew and Cave Mania, a growing indie game developer near Dallas in Allen, TX) used feedback from everyone at the studio to decide how their new office would be arranged. They decided on a hybrid mix of a low-density open plan surrounded by normal offices, with some key desk additions to cut down on visual distractions and give devs a better sense of control over their environment.
They also have a pretty sweet optional work from home on Friday policy.
BonusXP (makers of Monster Crew and Cave Mania, a growing indie game developer near Dallas in Allen, TX) used feedback from everyone at the studio to decide how their new office would be arranged. They decided on a hybrid mix of a low-density open plan surrounded by normal offices, with some key desk additions to cut down on visual distractions and give devs a better sense of control over their environment.
They also have a pretty sweet optional work from home on Friday policy.
Key things about this space that I see:
- Tidy: no rats nests, shabby furniture, or giant collections of 2000 nerf guns and action figures.
- Opaque glass dividers on each desk.
- Small offices with windows surrounding the open office area.
- Real desks with actual storage space.
- Low density and lots of space around each mini "pod" of desks.
- Blue color theme, which triggers a relaxing response.
Sunday, January 4, 2015
Robot Entertainment knows how to make open offices work better
There are precious few public pics available of Robot Entertainment's (The Orcs Must Die people near Dallas, TX - one of the post-Ensemble Studios companies) offices. This is a shame, because they've got a very smartly laid out space. (Update: Found more pics.)
The important elements:
Another major difference between Robot's offices verses the spaces of companies with dehumanizing, industrial scale open layouts is persistence and presence. At Robot's (and the old Ensemble Studios) offices, employees can bring in books, references, games, etc. and arrange them about their office in actual physical book cases and shelves. Sometimes Kindle doesn't cut it, especially for historical references. You can't practically do this at companies that attach wheels to your little desks and force you to play bumper cars every quarter.
For completeness, here's their lobby and meeting area:
Some impressions I get about this space by just looking at the pics: openness, welcoming, and team oriented.
- Discipline based "pod" organization and half partitions above eye height.
- Disruptive foot traffic and audio/visual noise mostly confined within a single pod.
- Pervasive availability of usable and visible whiteboards
- High ceilings
- No power cords ducktaped to the floor
- Lower density desk layout.
- Small desks with whiteboards encourages small meetings
And at this studio you get an entire Beer Garden in your offices. No high school lunch room cafeteria here!
Another major difference between Robot's offices verses the spaces of companies with dehumanizing, industrial scale open layouts is persistence and presence. At Robot's (and the old Ensemble Studios) offices, employees can bring in books, references, games, etc. and arrange them about their office in actual physical book cases and shelves. Sometimes Kindle doesn't cut it, especially for historical references. You can't practically do this at companies that attach wheels to your little desks and force you to play bumper cars every quarter.
For completeness, here's their lobby and meeting area:
Some impressions I get about this space by just looking at the pics: openness, welcoming, and team oriented.
Subscribe to:
Posts (Atom)








.jpg)








