Sunday, March 27, 2016

Quotes from "My Year In Startup Hell"

I loved this article on Fortune.com (exerted from "Disrupted: My Misadventure in the Start-Up Bubble"). It covers so many corporate company culture things I've seen or experienced in my game development career in a single article. The article is from a startup perspective, but much of this applies to many other more mature companies. These quotes especially struck a chord with me:
"Arriving here feels like landing on some remote island where a bunch of people have been living for years, in isolation, making up their own rules and rituals and religion and language—even, to some extent, inventing their own reality. This happens at all organizations, but for some reason tech startups seem to be especially prone to groupthink. Every tech startup seems to be like this. Believing that your company is not just about making money, that there is a meaning and a purpose to what you do, that your company has a mission, and that you want to be part of that mission—that is a big prerequisite for working at one of these places."
On people that get fired:
"Dharmesh’s culture code incorporates elements of HubSpeak. For example, it instructs that when someone quits or gets fired, the event will be referred to as “graduation.” In my first month at HubSpot I’ve witnessed several graduations, just in the marketing department. We’ll get an email from Cranium saying, “Team, just letting you know that Derek has graduated from HubSpot, and we’re excited to see how he uses his superpowers in his next big adventure!” Only then do you notice that Derek is gone, that his desk has been cleared out. Somehow Derek’s boss will have arranged his disappearance without anyone knowing about it. People just go up in smoke, like Spinal Tap drummers."
On what I call "Reality Shaping":
"The ideal HubSpotter is someone who exhibits a quality known as GSD, which stands for “get shit done.” This is used as an adjective, as in “Courtney is always in super-GSD mode.” The people who lead customer training seminars are called inbound marketing professors and belong to the faculty at HubSpot Academy. Our software is magical, such that when people use it—wait for it—one plus one equals three. Halligan and Dharmesh first introduced this alchemical concept at HubSpot’s annual customer conference, with a huge slide behind them that said “1 + 1 = 3.” Since then it has become an actual slogan at the company. People use the concept of one plus one equals three as a prism through which to evaluate new ideas. One day Spinner, the woman who runs PR, tells me, “I like that idea, but I’m not sure that it’s one-plus-one-equals-three enough.”
This is so true:
"Another thing I’m learning in my new job is that while people still refer to this business as the “tech industry,” in truth it is no longer really about technology at all. “You don’t get rewarded for creating great technology, not anymore,” says a friend of mine who has worked in tech since the 1980s, a former investment banker who now advises startups. “It’s all about the business model. The market pays you to have a company that scales quickly. It’s all about getting big fast. Don’t be profitable, just get big."
 Yup:
"On top of the fun stuff you create a mythology that attempts to make the work seem meaningful. Supposedly millennials don’t care so much about money, but they’re very motivated by a sense of mission. So, you give them a mission. You tell your employees how special they are and how lucky they are to be here. You tell them that it’s harder to get a job here than to get into Harvard and that because of their superpowers they have been selected to work on a very important mission to change the world. You make a team logo. You give everyone a hat and a T-shirt. You make up a culture code and talk about creating a company that everyone can love. You dangle the prospect that some might get rich."
Umm yea I know the feeling:
"Training takes place in a tiny room, where for two weeks I sit shoulder to shoulder with 20 other new recruits, listening to pep talks that start to sound like the brainwashing you get when you join a cult. It’s everything I ever imagined might take place inside a tech company, only even better."
On the office environment:
"Everyone works in vast, open spaces, crammed next to one another like seamstresses in Bangladeshi shirt factories, only instead of being hunched over sewing machines people are hunched over laptops. Nerf-gun battles rage, with people firing weapons from behind giant flat-panel monitors, ducking and rolling under desks. People hold standing meetings and even walking meetings, meaning the whole group goes for a walk and the meeting takes place while you’re walking."
Personally, I've learned over the years that I really need to live solidly in reality. I can't switch back and forth from "Corporate Enforced Reality #57" to "Real-World Ground-Level Reality" every single day of the week and stay happy and healthy long term.

I've learned that I don't need to associate myself with any corporation to be a "real" developer. If you aren't treated with respect by someone because you aren't associated with a company label, that person may very well not be a person you want to associate yourself with.

Here's a little rant: I think having to rigidly conform to a corporate mental model (like the insane one described at the company above) to earn money is demeaning and even dehumanizing. Folks, 1+1 does not equal 3. This is insane.

Somewhere back in time I must have fallen into some jacked parallel universe where treating workers like utterly replaceable mind controllable automatons is normal, accepted, encouraged, and even something to be proud of. In the universe of 1984, 1+1=3.

And the money the company gives you? It's just binary bits in some bank computer. There are plenty of ways of making money that don't involve becoming mentally insane. Trading off sanity for a bi-weekly set of binary digits added to your account balance actually isn't the greatest idea in my experience. Staying at a job you are unhappy with thinking that, eventually, you will achieve true long lasting happiness during "retirement" is also a pretty extreme way of living life. There are other paths to happiness.

My brain now pushes back when I think about living and working this way again. It basically says "hey that's super unhealthy and unsustainable!". I used to be irrationally fearful of working outside of a corporate bubble. Fear is your worst enemy and can make you much more manipulatable to others.

A long time ago, at Ensemble Studios in Dallas, I was completely wrapped up in our company's special little super insular "tribal" company culture. The company collapsed overnight and we all learned what was actually occurring at the corporate level for the previous 6-9 months. It became super clear that we all were living in a fairy-tale corporate enforced reality bubble. Even many of my "company friends" evaporated overnight into non-friends. It was ugly: even the very personalities of formally awesome coworkers instantly changed.

My ego was tied solidly into this company's culture and products. After it collapsed I had to carefully hit the "ego reset button". I had to strongly resist automatically following the locally exploitative paths laid out for me after Ensemble collapsed.

Monday, January 18, 2016

Compression ratio plots of zlib competitors Brotli and BitKnit

Continuing yesterday's post, here are the compression ratios for all 12k data points (between 256 -128MB) in the LZHAM vs. LZMA test file corpus, on various codecs: LZ4HC L8, Brotli L9, Rad's BitKnit VeryHigh, LZHAM m4, and of course zlib L9.

Vertical Axis = Compression ratio (higher is more compression)
Horizontal Axis = File, sorted purely by zlib's compression ratio
Color = Codec (using the same color coding as my previous post)

The data points have been sorted by zlib's compression ratio, which is why the green line is so nice and smooth. These are the same data points as yesterday's scatter graphs.


LZ4HC vs. BitKnit vs. zlib:



LZ4HC vs. Brotli vs. zlib:



 Brotli vs. BitKnit vs. zlib:



Here are a couple bonus plots, this time for LZHAM vs. LZ4HC or Brotli:



Comments:


It's clear from looking at these plots that simply stating "codec X has a higher ratio than codec Y" is at best a gross approximation. It highly depends on the file's content, the file's size, and even how well a file's content resembles what the codec designer's tuned the codec to handle best.

For example, Brotli has special optimizations (such as a precomputed static dictionary) for textual data. Also, like zlib, it uses entropy coding tables (the Huffman symbol code lengths) precomputed by the compressor, which can give it the edge on smaller files vs. codecs that use purely adaptive table updating approaches like LZHAM. (Also, I would imagine that the more stationary the data source, the more precomputed Huffman tables make sense.)

Another advantage Brotli shares with zlib due to its usage of precomputed Huffman tables: It doesn't need to spend valuable CPU time computing Huffman code lengths during decompression. LZHAM struggles to do this quickly, particularly on small files (less than approx. 4-8KB) where most decompression time is spent computing Huffman code lengths (and not actually decompressing the file!).

It's also possible to design a codec to be very strong at handling binary files. Apparently, BitKnit is tuned more in this direction. It still handles text files well, but it makes some intelligent design tradeoffs that favor really high and symmetrical compression/decompression performance with only a small sacrifice to text file ratios. This tradeoff makes a lot of sense, particularly in the game development context where a lot of data files are in various special binary formats.

Interestingly, Brotli and BitKnit seem to flip flop back and forth as to who is best ratio-wise. There are noticeable clusters of files where Brotli is slightly better, then clusters with BitKnit. I'll be analyzing these clusters soon to attempt to see what's going on. I believe this helps show that this data file corpus has a decent amount of interesting variety.

Finally, Brotli's compression ratio is just about always at least as good as zlib's (or extremely close). IMO, the Brotli team's Zopfli roots are showing strongly here.

Next Steps:


I need to break down these ratio graphs into clusters somehow. So we can then show results for "small text files" vs. "large binary files" etc. As the compressed totals show yesterday, Brotli and BitKnit have approximately equal compression power across the entire corpus. But there are categories of data were one codec is better than the other.

Looking into the future, it may be a good idea for the next major compressor to support both precomputed (Brotli-style) and semi-adaptive (LZHAM and presumably BitKnit-style) entropy table updating approaches.

Thanks to Blue Shift's CEO, John Brooks, for suggesting to chart this way.

Sunday, January 17, 2016

zlib in serious danger of becoming obsolete

Intro


I’m now starting to deeply analyze the performance of two new general purpose data compression codecs. One is Google’s Brotli codec, another is a brand new codec from Rad Game Tools named “BitKnit”. Both codecs are attempting to displace zlib, which is used by the Linux kernel, and is one of the most used compression libraries in the world. So I’m paying very close attention to what’s going on here.

To put things into perspective, in the lossless compression world we’re lucky to see a significant advancement every 5-10 years. Now, we have two independently implemented codecs that are giving zlib serious competition on multiple axes: throughput, ratio, and even code size.

Summary


I’m now using what I think is a very interesting and insightful approach to deeply analyze the practical performance characteristics of lossless codecs. As I learned while working on the Steam Linux/SteamOS project, robust benchmarking can be extremely difficult in practice. So I'm still gathering and analyzing the data, and tweaking how it’s graphed. What I’ve seen so far looks very interesting for multiple reasons.

First, it's looking pretty certain that both BitKnit and Brotli compete extremely well against zlib's decompression performance, but at much higher (LZMA/LZHAM-like) compression ratios. Amazingly, BitKnit’s compressor is also extremely fast, around the same speed as zlib’s. (By comparison, at maximum compression levels, both Brotli’s and LZHAM's compressors are pretty slow.) The graphs in this post only focus on decompression throughput, however. I’m saving the compression throughput analysis for another post.

One rough way of judging the complexity of a compressor vs. others is to compare the number of lines of code in each implementation. BitKnit at 2,700 lines of code (including comments) is smaller than both LZ4 (3,306 - no comments), zlib's (23,845 - no comments, incl. 3k lines of asm), or LZHAM’s (11,651 - no comments). Brotli's is rather large at 47,919 lines (no comments), but some fraction of this consists of embedded static tables.

Interestingly to me, BitKnit’s decompressor uses around half the temporary work RAM of LZHAM's (16k vs. 34k).

New Benchmarking Approach


While writing and analyzing LZHAM I started with a tiny set (like 5-10) of files for early testing. I spent a huge amount of time optimizing the compressor to excel on large text XML files such as enwik8/9, which are popular in the lossless data compression world. I consider this a serious mistake, so I've been rethinking how to best benchmark these systems.

The new codec analysis approach I’m using runs each decompressor on thousands of test corpus files, then I plot the resulting (throughput, ratio) data pairs in Excel using scatter graphs. The data points are colored by codec, and the points are transparent so regions with higher density (or with data points from multiple overlapping codecs) are more easily visualized. This is far better than what the Squash Compression Benchmark does IMO, because at a single glance you can see the results on thousands of (hopefully interesting) files, instead of the results on only a single file at a time from a tiny set of corpus files.

I generated these scatter graphs on 12k data files from the final LZHAM vs. LZMA corpus. There is some value in using these data files, because I used this same test corpus to analyze LZHAM to ensure it was competitive against LZMA. This corpus consists of a mix of game data, traditional textual data, every other compression corpus I could get my hands on, some artificial XML/JSON/UBJ test data, and lots of other interesting stuff I’ve encountered on the various projects I’ve worked on over the years. (Unfortunately, I cannot publicly distribute the bulk of these data files until someone comes up with an approach that allows me to share the corpus in a one way encrypted manner that doesn’t significantly impact throughput or ratio. I have no idea how this could really be done, or even if it's possible.)

The Data


The X axis is decompression throughput (right=faster), and the Y axis is compression ratio (higher=better ratio or more compression). The very bottom of the graph is the uncompressible line (ratio=1.0).

Color code:

Black/Gray = LZHAM
Red = Brotli
Green = zlib
Blue = BitKnit
Yellow = LZ4

Totals for 11,999 files (including uncompressible files):

Uncomp:   2,499,169,096
lz4:      1,167,777,908 2.14
zlib:     1,044,180,362 2.39
brotli:     918,949,263 2.72
bitknit:    898,621,908 2.78
lzham:      882,723,287 2.83

Totals after removing 1,330 files with a zlib compression ratio less than 1.1 (i.e. uncompressible files):

Uncomp:    2,147,530,394
lz4:         815,221,536 2.63
zlib:        693,090,474 3.1
brotli:      568,461,065 3.78
bitknit:     547,869,148 3.92
lzham:       532,235,143 4.03

This is a log2 log2 plot, basically an overview of the data:


This is a zoomed linear plot , looking more closely at the uncompressible (ratio=1) or nearly uncompressible (ratio very close but not 1) regions:



This log2 log2 plot is limited to just LZHAM vs. BitKnit:


Finally, another log2 log2 plot showing just BitKnit vs. zlib:



Current Observations


Fabian Giesen (Rad) and I have noticed several interesting things about these scatter plots:

- The data points with a ratio of 1 (or extremely close to 1) show how well the algorithm handles uncompressible data, which is hopefully near memcpy() performance.

(Note LZMA’s data will be very interesting, because it doesn’t have good handling for uncompressible data.)

There are a handful (around 50-60 depending on the codec) of data points with a ratio slightly below 1 (.963-.999). The majority are small (287-1kb) uncompressible files.

- Slightly "above" this ratio (very close to ratio 1, but not quite), literal handling dominates the decompressor's workload. There are distinct clusters on and near the ratio=1 line for each compressor.

LZHAM actually does kinda well here vs. the others, but it falls apart rapidly as the ratio increases.

- Notice the rough pushed down "<" shape of each algorithm's plot. LZHAM's is pretty noticeable. At the bottom right (ratios at/close to 1.0), literals dominate the decompression time.

Interpreting this as if all algorithms are plain LZ with discrete literals and matches:
As you go "up" to higher ratios, the decompressor has to process more and more matches, which (at least in LZHAM) are more expensive to handle vs. literals. After the “bend”, as you go up and to the right the matches grow increasingly numerous and longer (on average).

- LZHAM has an odd little cluster of data points to the right (on the ratio ~3 line) where it almost keeps up with BitKnit. Wonder what that is exactly? (Perhaps lots of easily decoded rep matches?)

- Notice zlib’s throughput stops increasing and plateaus as the ratio increases - why is that? Somebody needs to dive into zlib’s decompressor and see why it’s not scaling here.

I need to add my implementation of zlib’s core API (miniz) to see how well it compares.

Important Notes:

- The x64 benchmark command line app was run on Win10, 2x Xeon E5-2690 V2 3.0GHz (20 cores/40 threads). Benchmark app is single threaded.

- All test corpus files are between 256 bytes and 127.7 MB

- All algorithms were directly linked into the executable, and the decompressors were invoked in the same way

- Each decompressor is invoked repeatedly in a loop until 10ms has elapsed, this is done 4 times and shortest average runtime is taken

- Brotli was limited to comp level 9, as level 10 can be too slow for rapid experimentation, 16MB dictionary size (the largest it supports)

- LZHAM was limited to using its regular parser (not it's best of X arrivals parser, i.e. the "extreme parsing" flag was disabled), 64MB dictionary size.

- LZ4HC compressor, level 8

- zlib's asm modules were not used in this run. It'll be interesting to see what difference the asm optimizations make.

Next Step


The next major update will also show LZMA, Zstd, and miniz. I'm also going to throw some classes from crunch in here to clusterize the samples, so we can get a better idea of how well each algorithm performs on different classes of data.

I feel strongly that scatter graphs like these can be used to help intelligently guide the design and implementation of new practical compressors.

A big thanks to Fabian “ryg” Giesen at Rad Game Tools for giving me access to "BitKnit" for analysis. BitKnit is going to be released in the next major version of Oodle, Rad’s network and general purpose data compression product.

Thursday, December 17, 2015

Light Indexed Deferred Rendering

I can't find this information anywhere on the net, so I'll put it here. I've always been fascinated by the various alternatives to pure deferred shading, so I was really excited when I first encountered a light indexed variant.

Backstory: Halo Wars

I tired of the challenges inherent in pure deferred rendering approaches by around 2005, so I purposely used several deferred rendering alternatives on Halo Wars. HW combined X360 specific multi-light forward lighting for precise lighting (or for lights that required dynamic shadowing and/or cookies), along with an approximate voxel-based method for much cheaper, lower priority "bulk" lights.

In the voxel based approximate approach, one volume texture contained the total incoming light at each voxel's center (irradiance), and the other contained the weighted average of all the light vectors from all the lights that influenced each voxel (with the total weight in alpha). From this data, it was possible to compute approximate exit radiances at any point/normal. This "voxel light buffering" system was very efficient and scalable, so it was also used to illuminate terrain chunks and meshes by most omni/spot lights in the game. As an added bonus, it was easy and cheap to illuminate particles using this approach too.

We couldn't use multi-light forward on everything because the GPU fillrate cost from each added light was too high, and our object vs. light overlap check (just simple bounding box vs. sphere or cone) was very rough. Especially on terrain chunks, the extra pixel cost was brutal.

I had the GPU update the voxel textures each frame. Lights became as cheap as super low-res sprites, so we could have hundreds of them. The voxel texture's bounds in worldspace always covered the terrain in the view frustum (up to a certain depth and world height). The resolution was either 64x64x16 or 128x128x16 (from memory).

I found this approach interesting because it could achieve both diffuse and specular lighting, and because it visually degraded well when more than one light hit a voxel. Anyhow, after Ensemble collapsed I put all this stuff into the back of my mind.

Source 2

When I first came to Valve after HW shipped (late 2009), I kept encouraging the graphics programmers to look for another viable alternative to deferred rendering. One day, in early to middle 2010, I came in to work and Shanon Drone showed me his amazing prototype implementation of a voxel based forward rendering variant of what is known as Light Indexed Deferred Lighting. (So what would the name be? Voxed Based Light Indexed Forward Lighting?)

From what I remember, his approach voxelized in projection space. Each voxel contained an integer index into a list of lights that overlapped with that voxel's worldspace bounding box. The light lists were stored in a separate texture. The CPU (using SIMD instructions) was used to update the voxel texture, and the CPU also filled in the lightlist texture.

The scene was rendered just like with forward lighting, except the pixel shader fetched from the voxel light index texture, and then iterated over all the lights using a dynamic loop. The light parameters were fetched from the lightlist texture. I don't remember where the # of lights per voxel was stored, it could have been in either texture.

The new approach was ported and tested on X360, and I remember we switched to packing the light indices (or something - it's been a long time!) into an array of constant registers to make the approach practical at all on X360.

I have no idea if this is published anywhere, but I remember the day very clearly as an internal breakthrough. It's been many years now, so I doubt anyone will mind me mentioning it.

Published material on light indexed deferred/forward rendering techniques (or similar variants):

I'll update this blog post as I run across links:

By Damian Trebilco:

Data Compression in Dungeon Boss

Dungeon Boss (DB) is a Unity iOS/Android title I helped Boss Fight Entertainment ship this year. I recently wrote a blog post basically containing a brain dump of all the low-level memory related things we learned in the trenches while trying to make the product stable.

Due to the sheer amount of content and code, DB couldn't have shipped at all on our target devices without aggressively utilizing every built-in Unity compression option we could get our hands on:

- Audio: We found and modified a custom editor script that allowed DB's sound designer to fine tune the music and FX audio bitrates to the absolutely minimum needed. We gave more bits to music vs. sound effects. (It also helped that DB's sound effects purposely have a classic 8-bit feel.)

- Meshes: We used Unity's built-in mesh compression feature as aggressively as we could. This caused a few issues on map meshes that we had to work around by hand.

- Animation: We tuned our FBX Importer compression settings several times to find the best balance between visible artifacts and memory/file savings.

- Texture compression: At the last minute, we had to switch the majority of our textures to PVRTC 2bpp to free up as much memory headroom as possible. Without doing this DB was too unstable on lower end iOS devices. Quality wise, 2bpp looked amazingly good (much better than I expected).

- Asset bundles: We disabled Unity 4.6's built-in LZMA asset bundle compression, grouped the assets by type, then we packed the raw type-grouped asset data using a better compressor (LZHAM). Thankfully, there were enough Unity modding tools available, and Unity was flexible enough, that we could do this without having access to the source code of Unity itself.

Also see this page in the Unity docs for more info on how to examine and reduce file sizes of Unity builds.

Sunday, December 13, 2015

Binary delta compression and seed dictionaries as universal OS services

Some late Saturday night compression ideas:

Brotli's inclusion of a seed dictionary that helps it gain higher ratios on textual data, and our recent work on offline and real-time binary delta compression (patching), has given me this idea:

First, Brotli's static dictionary isn't "breaking" any rules at all. What they've done makes a lot of sense. They've tailored their codec for a very common use case (text), a place where it will have a lot of value. The seed dictionary is a legitimate part of the decompressor, except instead of pure executable code they also tacked on some interesting "universal" data to it.

Which got me thinking, the OS itself should just include a universal seed dictionary, or a set of them for different data types. A small set of seed data, like even just 4-16MB, would be fine for starters. This data can be automatically distributed to clients as part of an OS update or upgrade. It could contain natural language phrases, common x86 opcode sequences, XML, common JPEG quantization tables/markers/headers, common GLSL/HLSL keywords, etc. Once Linux does it, everybody will do it sooner or later.

Normally, when you want to compress some data for either real-time transmission, or archival purposes, you would use plain non-delta based compression. You can't assume the remote side has your seed dictionary right? But this is a weak if not flawed argument, because the remote side already has a compatible decompressor. They can surely download a little more data to have a good, stable universal seed dictionary too. And if they only download it once (through OS updates), then it's not a big cost.

Alternately, we can adopt a new codec that has a good built-in selection of universal seed data, and embed that into the OS. But the seed data should be available to OS processes as well, so they can compress/decompress data against it without having to carry around and distribute their own private seed dictionaries.

Other related thoughts:


Obviously, static dictionaries don't help in all cases. No harm done, just don't use them in those cases.

This method can be applied to networking too. The seed dictionary can be made part of the actual protocol. Packets can be delta compressed against app specific or universal seed dictionaries.

New codecs should just support seed dictionaries as required features. New LZ algorithms should be competitive against bsdiff in binary patching scenarios.

In the open source world, we need something better than bsdiff, which requires huge amounts of RAM on large files (and is just unreliable). Another interesting area of work would be to figure out the "best" universal seed dictionary to use for a OS service like this.

Plain lossless data compression and binary delta compression should not be separate concepts or tools: let's unify them. LZ-Sub shows one potential way forward.