Dungeon Boss (DB) is a Unity iOS/Android title I helped Boss Fight Entertainment ship this year. I recently wrote a blog post basically containing a brain dump of all the low-level memory related things we learned in the trenches while trying to make the product stable.
Due to the sheer amount of content and code, DB couldn't have shipped at all on our target devices without aggressively utilizing every built-in Unity compression option we could get our hands on:
- Audio: We found and modified a custom editor script that allowed DB's sound designer to fine tune the music and FX audio bitrates to the absolutely minimum needed. We gave more bits to music vs. sound effects. (It also helped that DB's sound effects purposely have a classic 8-bit feel.)
- Meshes: We used Unity's built-in mesh compression feature as aggressively as we could. This caused a few issues on map meshes that we had to work around by hand.
- Animation: We tuned our FBX Importer compression settings several times to find the best balance between visible artifacts and memory/file savings.
- Texture compression: At the last minute, we had to switch the majority of our textures to PVRTC 2bpp to free up as much memory headroom as possible. Without doing this DB was too unstable on lower end iOS devices. Quality wise, 2bpp looked amazingly good (much better than I expected).
- Asset bundles: We disabled Unity 4.6's built-in LZMA asset bundle compression, grouped the assets by type, then we packed the raw type-grouped asset data using a better compressor (LZHAM). Thankfully, there were enough Unity modding tools available, and Unity was flexible enough, that we could do this without having access to the source code of Unity itself.
Also see this page in the Unity docs for more info on how to examine and reduce file sizes of Unity builds.
Co-owner of Binomial LLC, working on GPU texture interchange. Open source developer, graphics programmer, former video game developer. Worked previously at SpaceX (Starlink), Valve, Ensemble Studios (Microsoft), DICE Canada.
Thursday, December 17, 2015
Sunday, December 13, 2015
Binary delta compression and seed dictionaries as universal OS services
Some late Saturday night compression ideas:
Brotli's inclusion of a seed dictionary that helps it gain higher ratios on textual data, and our recent work on offline and real-time binary delta compression (patching), has given me this idea:
First, Brotli's static dictionary isn't "breaking" any rules at all. What they've done makes a lot of sense. They've tailored their codec for a very common use case (text), a place where it will have a lot of value. The seed dictionary is a legitimate part of the decompressor, except instead of pure executable code they also tacked on some interesting "universal" data to it.
Which got me thinking, the OS itself should just include a universal seed dictionary, or a set of them for different data types. A small set of seed data, like even just 4-16MB, would be fine for starters. This data can be automatically distributed to clients as part of an OS update or upgrade. It could contain natural language phrases, common x86 opcode sequences, XML, common JPEG quantization tables/markers/headers, common GLSL/HLSL keywords, etc. Once Linux does it, everybody will do it sooner or later.
Normally, when you want to compress some data for either real-time transmission, or archival purposes, you would use plain non-delta based compression. You can't assume the remote side has your seed dictionary right? But this is a weak if not flawed argument, because the remote side already has a compatible decompressor. They can surely download a little more data to have a good, stable universal seed dictionary too. And if they only download it once (through OS updates), then it's not a big cost.
Alternately, we can adopt a new codec that has a good built-in selection of universal seed data, and embed that into the OS. But the seed data should be available to OS processes as well, so they can compress/decompress data against it without having to carry around and distribute their own private seed dictionaries.
Obviously, static dictionaries don't help in all cases. No harm done, just don't use them in those cases.
This method can be applied to networking too. The seed dictionary can be made part of the actual protocol. Packets can be delta compressed against app specific or universal seed dictionaries.
New codecs should just support seed dictionaries as required features. New LZ algorithms should be competitive against bsdiff in binary patching scenarios.
In the open source world, we need something better than bsdiff, which requires huge amounts of RAM on large files (and is just unreliable). Another interesting area of work would be to figure out the "best" universal seed dictionary to use for a OS service like this.
Plain lossless data compression and binary delta compression should not be separate concepts or tools: let's unify them. LZ-Sub shows one potential way forward.
Brotli's inclusion of a seed dictionary that helps it gain higher ratios on textual data, and our recent work on offline and real-time binary delta compression (patching), has given me this idea:
First, Brotli's static dictionary isn't "breaking" any rules at all. What they've done makes a lot of sense. They've tailored their codec for a very common use case (text), a place where it will have a lot of value. The seed dictionary is a legitimate part of the decompressor, except instead of pure executable code they also tacked on some interesting "universal" data to it.
Which got me thinking, the OS itself should just include a universal seed dictionary, or a set of them for different data types. A small set of seed data, like even just 4-16MB, would be fine for starters. This data can be automatically distributed to clients as part of an OS update or upgrade. It could contain natural language phrases, common x86 opcode sequences, XML, common JPEG quantization tables/markers/headers, common GLSL/HLSL keywords, etc. Once Linux does it, everybody will do it sooner or later.
Normally, when you want to compress some data for either real-time transmission, or archival purposes, you would use plain non-delta based compression. You can't assume the remote side has your seed dictionary right? But this is a weak if not flawed argument, because the remote side already has a compatible decompressor. They can surely download a little more data to have a good, stable universal seed dictionary too. And if they only download it once (through OS updates), then it's not a big cost.
Alternately, we can adopt a new codec that has a good built-in selection of universal seed data, and embed that into the OS. But the seed data should be available to OS processes as well, so they can compress/decompress data against it without having to carry around and distribute their own private seed dictionaries.
Other related thoughts:
Obviously, static dictionaries don't help in all cases. No harm done, just don't use them in those cases.
This method can be applied to networking too. The seed dictionary can be made part of the actual protocol. Packets can be delta compressed against app specific or universal seed dictionaries.
New codecs should just support seed dictionaries as required features. New LZ algorithms should be competitive against bsdiff in binary patching scenarios.
In the open source world, we need something better than bsdiff, which requires huge amounts of RAM on large files (and is just unreliable). Another interesting area of work would be to figure out the "best" universal seed dictionary to use for a OS service like this.
Plain lossless data compression and binary delta compression should not be separate concepts or tools: let's unify them. LZ-Sub shows one potential way forward.
Saturday, December 12, 2015
Friday, December 11, 2015
Why by-convention read-only shared state is bad
I'm going to switch topics from my usual stuff to something else I find interesting: multithreading.
Back at Ensemble Studios, I remember working on the multithreaded rendering code in both Age of Empires 3 and Halo Wars. At the time (2003-2004 timeframe), multithreading was a new thing for game engines. Concepts like shared state, producer consumer queues, TLS, semaphores/events, condition variables, job systems, etc. were new things to many game engineers. (And lockfree approaches were considered very advanced, like Area 51 grade stuff.) Age of Empires 3 was a single threaded engine, but we did figure out how to use threading for at least two things I'm aware of: the loading screen, and jobified skinning.
On Age3, it was easy to thread the SIMD skinning code used to render all skinned meshes, because I analyzed the rendering code and determined that the input data we needed for "jobified" skinning was guaranteed to be stable during the execution of the skinning job. (The team also said it seemed "impossible" to do this, which of course just made me more determined to get this feature working and shipped.)
The key thing about making it work was this: I sat down and said to the other rendering programmer who owned this code, "do not change this mesh and bone data in between this spot and this spot in your code -- or the multithreaded skinning code will randomly explode!". We then filled in the newly introduced main thread bubble with some totally unrelated work that couldn't possibly modify our mesh/bone data (or so we thought!), and in our little world all was good.
This approach worked, but it was dangerous and fragile because it introduced a new constraint into the codebase that was hard to understand or even notice by just locally studying the code. If someone would have changed how the mesh renderer worked, or changed the data that our skinning jobs accessed, they could have introduced a whole new class of bugs (such as Heisenbugs) into the system without knowing it. Especially in a large rapidly changing codebase, this approach is a slippery slope.
We were willing to do it this way in the Age3 codebase because very few people (say 2-3 at most in the entire company) would ever touch this very deep rendering code. I made sure the other programmer who owned the system had a mental model of the new constraint, also made sure there were some comments and little checks in there, and all was good.
Importantly, Ensemble had extremely low employee turnover, and this low-level engine code rarely changed. Just in case the code exploded in a way I couldn't predict on a customer's machine, I put in a little undocumented command line option that disabled it completely.
Age3 and Halo Wars made a lot of use of the fork and join model for multithreading. Halo Wars used the fork and join model with a global system, and we allowed multiple threads (particularly the main and render threads) to fork and join independently.

See those sections where you have multiple tasks stacked up in parallel? Those parts of the computation are going to read some data, and if it's some global state it means you have a potential conflict with the main thread. Anything executed in those parallel sections of the overall computation must follow your newly introduced conventions on which shared state data must remain stable (and when), or disaster ensues.
Looking back now, I realize this approach of allowing global access to shared state that is supposed to be read-only and stable by convention or assumption is not a sustainable practice. Eventually, somebody somewhere is going to introduce a change using only their "local" mental model of the code that doesn’t factor in our previously agreed upon data stability constraint, and they will break the system in a way they don't understand. Bugs will come in, the engine will crash in some way randomly, etc. Imagine the damage to a codebase if this approach was followed by dozens of systems, instead of just one.
Creating and modifying large scale software is all about complexity management. Our little Age3 threaded skinning change made the engine more complex in ways that were hard to understand, talk about, or reason about. When you are making decisions about how to throw some work onto a helper thread, you need to think about the overall complexity and sustainability of your approaches at a very high level, or your codebase is eventually going to become buggy and practically unchangeable.
One sustainable engineering solution to this problem is Read Copy Update (RCU, used in the Linux kernel). With RCU, there are simple clear policies to understand, like "you must bump up/down the reference to any RCU object you want to read”, etc. RCU does introduce more complexity, but now it’s a first class engine concept that other programmers can understand and reason about.
RCU is a good approach for sharing large amounts of data, or large objects. But for little bits of shared state, like a single int or whatever, the overhead is pretty high.
Multithreading Skinning on Age of Empires 3
Back at Ensemble Studios, I remember working on the multithreaded rendering code in both Age of Empires 3 and Halo Wars. At the time (2003-2004 timeframe), multithreading was a new thing for game engines. Concepts like shared state, producer consumer queues, TLS, semaphores/events, condition variables, job systems, etc. were new things to many game engineers. (And lockfree approaches were considered very advanced, like Area 51 grade stuff.) Age of Empires 3 was a single threaded engine, but we did figure out how to use threading for at least two things I'm aware of: the loading screen, and jobified skinning.
On Age3, it was easy to thread the SIMD skinning code used to render all skinned meshes, because I analyzed the rendering code and determined that the input data we needed for "jobified" skinning was guaranteed to be stable during the execution of the skinning job. (The team also said it seemed "impossible" to do this, which of course just made me more determined to get this feature working and shipped.)
The key thing about making it work was this: I sat down and said to the other rendering programmer who owned this code, "do not change this mesh and bone data in between this spot and this spot in your code -- or the multithreaded skinning code will randomly explode!". We then filled in the newly introduced main thread bubble with some totally unrelated work that couldn't possibly modify our mesh/bone data (or so we thought!), and in our little world all was good.
This approach worked, but it was dangerous and fragile because it introduced a new constraint into the codebase that was hard to understand or even notice by just locally studying the code. If someone would have changed how the mesh renderer worked, or changed the data that our skinning jobs accessed, they could have introduced a whole new class of bugs (such as Heisenbugs) into the system without knowing it. Especially in a large rapidly changing codebase, this approach is a slippery slope.
We were willing to do it this way in the Age3 codebase because very few people (say 2-3 at most in the entire company) would ever touch this very deep rendering code. I made sure the other programmer who owned the system had a mental model of the new constraint, also made sure there were some comments and little checks in there, and all was good.
Importantly, Ensemble had extremely low employee turnover, and this low-level engine code rarely changed. Just in case the code exploded in a way I couldn't predict on a customer's machine, I put in a little undocumented command line option that disabled it completely.
Age3 and Halo Wars made a lot of use of the fork and join model for multithreading. Halo Wars used the fork and join model with a global system, and we allowed multiple threads (particularly the main and render threads) to fork and join independently.
See those sections where you have multiple tasks stacked up in parallel? Those parts of the computation are going to read some data, and if it's some global state it means you have a potential conflict with the main thread. Anything executed in those parallel sections of the overall computation must follow your newly introduced conventions on which shared state data must remain stable (and when), or disaster ensues.
Complexity Management
Looking back now, I realize this approach of allowing global access to shared state that is supposed to be read-only and stable by convention or assumption is not a sustainable practice. Eventually, somebody somewhere is going to introduce a change using only their "local" mental model of the code that doesn’t factor in our previously agreed upon data stability constraint, and they will break the system in a way they don't understand. Bugs will come in, the engine will crash in some way randomly, etc. Imagine the damage to a codebase if this approach was followed by dozens of systems, instead of just one.
Creating and modifying large scale software is all about complexity management. Our little Age3 threaded skinning change made the engine more complex in ways that were hard to understand, talk about, or reason about. When you are making decisions about how to throw some work onto a helper thread, you need to think about the overall complexity and sustainability of your approaches at a very high level, or your codebase is eventually going to become buggy and practically unchangeable.
One solution: RCU
One sustainable engineering solution to this problem is Read Copy Update (RCU, used in the Linux kernel). With RCU, there are simple clear policies to understand, like "you must bump up/down the reference to any RCU object you want to read”, etc. RCU does introduce more complexity, but now it’s a first class engine concept that other programmers can understand and reason about.
RCU is a good approach for sharing large amounts of data, or large objects. But for little bits of shared state, like a single int or whatever, the overhead is pretty high.
Another solution: Deep Copy
The other sustainable approach is to deep copy the global data into a temporary job-specific context structure, so the job won't access the data as shared/global state. The deep copy approach is even simpler, because the job just has its own little stable snapshot of the data which is guaranteed to stay stable during job execution.
Unfortunately, implementing the deep copy approach can be taxing. What if a low-level helper system used by job code (that previously accessed some global state) doesn’t easily have access to your job context object? You need a way to somehow get this data “down” into deeper systems.
The two approaches I’m aware of are TLS, or (for lack of a better phrase) “value plumbing”. TLS may be too slow, and is crossing into platform specific land. The other option, value plumbing, means you just pass the value (or a pointer to your job context object) down into any lower level systems that need it. This may involve passing the context pointer or value data down several layers through your callstacks.
But, you say, this plumbing approach is ugly because of all these newly introduced method variables! But, it is a sustainable approach in the long term, like RCU. Eventually, you can refactor and clean up the plumbing code to make it cleaner if needed, assuming it’s a problem at all. You have traded off some code ugliness for a solution that is reliable in the long term.
Sometimes, when doing large scale software engineering, you must make changes to codebases that may seem ugly (like value plumbing) as a tradeoff for higher sustainability and reliability. It can be difficult to make these choices, so some higher level thinking is needed to balance the long term pros and cons of all approaches.
But, you say, this plumbing approach is ugly because of all these newly introduced method variables! But, it is a sustainable approach in the long term, like RCU. Eventually, you can refactor and clean up the plumbing code to make it cleaner if needed, assuming it’s a problem at all. You have traded off some code ugliness for a solution that is reliable in the long term.
Conclusion
Sometimes, when doing large scale software engineering, you must make changes to codebases that may seem ugly (like value plumbing) as a tradeoff for higher sustainability and reliability. It can be difficult to make these choices, so some higher level thinking is needed to balance the long term pros and cons of all approaches.
Thursday, December 10, 2015
Okumura's "ar002" and reverse engineering the internals of Katz's PKZIP
Sometime around 1992, still in high school, I upgraded from my ancient 16-bit OS/9 computer to a 80286 DOS machine with a hard drive. I immediately got onto some local BBS's at the blistering speed of 1200 baud and discovered this beautiful little gem of a compression program written in the late 80's by Haruhiko Okumura at Matsusaka University:
http://oku.edu.mie-u.ac.jp/~okumura/compression/ar002/

I studied every bit of ar002 intensely, especially the Huffman related code. I was impressed and learned tons of things from ar002. It used a complex (to a teenager) tree-based algorithm to find and insert strings into a sliding dictionary, and I sensed there must be a better way because its compressor felt fairly slow. (How slow? We're talking low single digit kilobytes per second on a 12MHz machine if I remember correctly, or thousands of CPU cycles per input byte!)
So I started focusing on alternative match finding algorithms, with all experiments done in 8086 real mode assembly for performance because compilers at the time weren't very good. On these old CPU's a human assembly programmer could run circles around C compilers, and every cycle counted.
Having very few programming related books (let alone very expensive compression books!), I focused instead on whatever was available for free. This was stuff downloaded from BBS's, like the archiving tool PKZIP.EXE. PKZIP held a special place, because it was your gateway into this semi-secretive underground world of data distributed on BBS's by thousands of people. PKZIP's compressor internally used mysterious, patented algorithms that accomplished something almost magical (to me) with data. The very first program you needed to download to do much of anything was PKZIP.EXE. To me, data compression was a key technology in this world.
Without PKZIP you couldn't do anything with the archives you downloaded, they were just gibberish. After downloading an archive on your system (which took forever using Z-Modem), you would manually unzip it somewhere and play around with whatever awesome goodies were inside the archive.
Exploring the early data communication networks at 1200 or 2400 baud was actually crazy fun, and this tool and a good terminal program was your gateway into this world. There were other archivers like LHA and ARJ, but PKZIP was king because it had the best practical compression ratio for the time, it was very fast compared to anything else, and it was the most popular.
This command line tool advertised awesomeness. The help text screamed "FAST!". So of course I became obsessed with cracking this tool's secrets. I wanted to deeply understand how it worked and make my own version that integrated everything I had learned through my early all-assembly compression experiments and studying ar002 (and little bits of LHA, but ar002's source was much more readable).

I used my favorite debugger of all time, Borland's Turbo Debugger, to single step through PKZIP:

PKZIP was written by Phil Katz, who was a tortured genius in my book. In my opinion his work at the time is under appreciated. His Deflate format was obviously very good for its time to have survived all the way into the internet era.
I single stepped through all the compression and decompression routines Phil wrote in this program. It was a mix of C and assembly. PKZIP came with APPNOTE.TXT, which described the datastream format at a high level. Unfortunately, at the time it lacked some key information about how to decode each block's Huffman codelengths (which were themselves Huffman coded!), so you had to use reverse engineering techniques to figure out the rest. Also most importantly, it only covered the raw compressed datastream format, so the algorithms were up to you to figure out.
The most important thing I learned while studying Phil's code: the entire sliding dictionary was literally moved down in memory every ~4KB of data to compress. (Approximately 4KB - it's been a long time.) I couldn't believe he didn't use a ring buffer approach to eliminate all this data movement. This little realization was key to me: PKZIP spent many CPU cycles just moving memory around!
PKZIP's match finder, dictionary updater (which used a simple rolling hash), and linked list updater functions were all written in assembly and called from higher-level C code. The assembly code was okay, but as a demoscener I knew it could be improved (or at least equaled) with tighter code and some changes to the match finder and dictionary updating algorithms.
Phil's core loops would use 32-bit instructions on 386 CPU's, but strangely he turned off/on interrupts constantly around the code sequences using 32-bit instructions. I'm guessing he was trying to work around interrupt handlers or TSR's that borked the high word in 32-bit registers.
To fully reverse engineer the format, I had to feed in very tiny files into PKZIP's decompressor and single step through the code near the beginning of blocks. If you paid very close attention, you could build a mental model of what the assembly code was doing relative to the datastream format described in APPNOTE.TXT. I remember doing it, and it was slow, difficult work (in an unheated attic on top of things).
I wrote my own 100% compatible codec in all-assembly using what I thought were better algorithms, and it was more or less competitive against PKZIP. Compression ratio wise, it was very close to Phil's code. I started talking about compression and PKZIP on a Fidonet board somewhere, and this led to the code being sublicensed for use in EllTech Development's "Compression Plus" compression/archiving middleware product.
For fun, here's the ancient real-mode assembly code to my final Deflate compatible compressor. Somewhere at the core of my brain there is still a biologically-based x86-compatible assembly code optimizer. Here's the core dictionary updating code, which scanned through new input data and updated its hash-based search accelerator:
As an exercise while learning C I ported the core hash-based LZ algorithms I used from all-assembly. I uploaded two variants to local BBS's as "prog1.c" and "prog2.c". These little educational compression programs were referenced in the paper "A New Approach to Dictionary-Based Lossless Compression" (2004), and the code is still floating around on the web.
I rewrote my PKZIP-compatible codec in C, using similar techniques, and this code was later purchased by Microsoft (when they bought Ensemble Studios). It was used by Age of Empires 1/2, which sold tens of millions of copies (back when games shipped in boxes this was a big deal). I then optimized this code to scream on the Xbox 360 CPU, and this variant shipped on Halo Wars, Halo 3, and Forza 2. So if you've played any of these games, somewhere in there you where running a very distant variant of the above ancient assembly code.
Eventually the Deflate compressed bitstream format created by Phil Katz made its way into an internet standard, RFC 1951.
A few years ago, moonlighting while at Valve, I wrote an entirely new Deflate compatible codec called "miniz" but this time with a zlib compatible API. It lives in a single source code file, and the entire stream-capable decompressor lives in a single C function. I first wrote it in C++, then quickly realized this wasn't so hot of an idea (after feedback from the community) so I ported it to pure C. It's now on Github here. In its lowest compression mode miniz also sports a very fast real-time compressor. miniz's single function decompression function has been successfully compiled and executed on very old processors, like the 65xxx series.
I don't think it's well known that the work of compression experts and experimenters in Japan significantly influenced early compression programmers in the US. There's a very interesting early history of Japanese data compression work on Haruhiko Okumura's site here. Back then Japanese and American compression researchers would communicate and share knowledge:
Also, I believe Phil Katz deserves more respect for creating Inflate and Deflate. The algorithm obviously works, and the compressed data format is an internet standard so it will basically live forever. These days, Deflate seems relatively simple, but at the time it was anything but simple to workers in this field. Amazingly, good old Deflate still has some legs, as Zopfli and 7-zip's optimizing Deflate-compatible compressors demonstrate.
Finally, I personally learned a lot from studying the code written by these inspirational early data compression programmers. I'll never get to meet Phil Katz, but perhaps one day I could meet Dr. Okumura and say thanks.
http://oku.edu.mie-u.ac.jp/~okumura/compression/ar002/

I studied every bit of ar002 intensely, especially the Huffman related code. I was impressed and learned tons of things from ar002. It used a complex (to a teenager) tree-based algorithm to find and insert strings into a sliding dictionary, and I sensed there must be a better way because its compressor felt fairly slow. (How slow? We're talking low single digit kilobytes per second on a 12MHz machine if I remember correctly, or thousands of CPU cycles per input byte!)
So I started focusing on alternative match finding algorithms, with all experiments done in 8086 real mode assembly for performance because compilers at the time weren't very good. On these old CPU's a human assembly programmer could run circles around C compilers, and every cycle counted.
Having very few programming related books (let alone very expensive compression books!), I focused instead on whatever was available for free. This was stuff downloaded from BBS's, like the archiving tool PKZIP.EXE. PKZIP held a special place, because it was your gateway into this semi-secretive underground world of data distributed on BBS's by thousands of people. PKZIP's compressor internally used mysterious, patented algorithms that accomplished something almost magical (to me) with data. The very first program you needed to download to do much of anything was PKZIP.EXE. To me, data compression was a key technology in this world.
Without PKZIP you couldn't do anything with the archives you downloaded, they were just gibberish. After downloading an archive on your system (which took forever using Z-Modem), you would manually unzip it somewhere and play around with whatever awesome goodies were inside the archive.
Exploring the early data communication networks at 1200 or 2400 baud was actually crazy fun, and this tool and a good terminal program was your gateway into this world. There were other archivers like LHA and ARJ, but PKZIP was king because it had the best practical compression ratio for the time, it was very fast compared to anything else, and it was the most popular.
This command line tool advertised awesomeness. The help text screamed "FAST!". So of course I became obsessed with cracking this tool's secrets. I wanted to deeply understand how it worked and make my own version that integrated everything I had learned through my early all-assembly compression experiments and studying ar002 (and little bits of LHA, but ar002's source was much more readable).

I used my favorite debugger of all time, Borland's Turbo Debugger, to single step through PKZIP:

PKZIP was written by Phil Katz, who was a tortured genius in my book. In my opinion his work at the time is under appreciated. His Deflate format was obviously very good for its time to have survived all the way into the internet era.
I single stepped through all the compression and decompression routines Phil wrote in this program. It was a mix of C and assembly. PKZIP came with APPNOTE.TXT, which described the datastream format at a high level. Unfortunately, at the time it lacked some key information about how to decode each block's Huffman codelengths (which were themselves Huffman coded!), so you had to use reverse engineering techniques to figure out the rest. Also most importantly, it only covered the raw compressed datastream format, so the algorithms were up to you to figure out.
The most important thing I learned while studying Phil's code: the entire sliding dictionary was literally moved down in memory every ~4KB of data to compress. (Approximately 4KB - it's been a long time.) I couldn't believe he didn't use a ring buffer approach to eliminate all this data movement. This little realization was key to me: PKZIP spent many CPU cycles just moving memory around!
PKZIP's match finder, dictionary updater (which used a simple rolling hash), and linked list updater functions were all written in assembly and called from higher-level C code. The assembly code was okay, but as a demoscener I knew it could be improved (or at least equaled) with tighter code and some changes to the match finder and dictionary updating algorithms.
Phil's core loops would use 32-bit instructions on 386 CPU's, but strangely he turned off/on interrupts constantly around the code sequences using 32-bit instructions. I'm guessing he was trying to work around interrupt handlers or TSR's that borked the high word in 32-bit registers.
To fully reverse engineer the format, I had to feed in very tiny files into PKZIP's decompressor and single step through the code near the beginning of blocks. If you paid very close attention, you could build a mental model of what the assembly code was doing relative to the datastream format described in APPNOTE.TXT. I remember doing it, and it was slow, difficult work (in an unheated attic on top of things).
I wrote my own 100% compatible codec in all-assembly using what I thought were better algorithms, and it was more or less competitive against PKZIP. Compression ratio wise, it was very close to Phil's code. I started talking about compression and PKZIP on a Fidonet board somewhere, and this led to the code being sublicensed for use in EllTech Development's "Compression Plus" compression/archiving middleware product.
For fun, here's the ancient real-mode assembly code to my final Deflate compatible compressor. Somewhere at the core of my brain there is still a biologically-based x86-compatible assembly code optimizer. Here's the core dictionary updating code, which scanned through new input data and updated its hash-based search accelerator:
;--------------- HASH LOOP
ALIGN 4
HDLoopB:
ReptCount = 0
REPT 16
Mov dx, [bp+2]
Shl bx, cl
And bh, ch
Xor bl, dl
Add bx, bx
Mov ax, bp
XChg ax, [bx+si]
Mov es:[di+ReptCount], ax
Inc bp
Shl bx, cl
And bh, ch
Xor bl, dh
Add bx, bx
Mov ax, bp
XChg ax, [bx+si]
Mov es:[di+ReptCount+2], ax
Inc bp
ReptCount = ReptCount + 4
ENDM
Add di, ReptCount
Dec [HDCounter]
Jz HDOdd
Jmp HDLoopB
As an exercise while learning C I ported the core hash-based LZ algorithms I used from all-assembly. I uploaded two variants to local BBS's as "prog1.c" and "prog2.c". These little educational compression programs were referenced in the paper "A New Approach to Dictionary-Based Lossless Compression" (2004), and the code is still floating around on the web.
I rewrote my PKZIP-compatible codec in C, using similar techniques, and this code was later purchased by Microsoft (when they bought Ensemble Studios). It was used by Age of Empires 1/2, which sold tens of millions of copies (back when games shipped in boxes this was a big deal). I then optimized this code to scream on the Xbox 360 CPU, and this variant shipped on Halo Wars, Halo 3, and Forza 2. So if you've played any of these games, somewhere in there you where running a very distant variant of the above ancient assembly code.
Eventually the Deflate compressed bitstream format created by Phil Katz made its way into an internet standard, RFC 1951.
A few years ago, moonlighting while at Valve, I wrote an entirely new Deflate compatible codec called "miniz" but this time with a zlib compatible API. It lives in a single source code file, and the entire stream-capable decompressor lives in a single C function. I first wrote it in C++, then quickly realized this wasn't so hot of an idea (after feedback from the community) so I ported it to pure C. It's now on Github here. In its lowest compression mode miniz also sports a very fast real-time compressor. miniz's single function decompression function has been successfully compiled and executed on very old processors, like the 65xxx series.
Conclusion
I don't think it's well known that the work of compression experts and experimenters in Japan significantly influenced early compression programmers in the US. There's a very interesting early history of Japanese data compression work on Haruhiko Okumura's site here. Back then Japanese and American compression researchers would communicate and share knowledge:
"At one time a programmer for PK and I were in close contact. We exchanged a lot of ideas. No wonder PKZIP and LHA are so similar."I'm guessing he's referring to Phil Katz? (Who else could it have been?)
Finally, I personally learned a lot from studying the code written by these inspirational early data compression programmers. I'll never get to meet Phil Katz, but perhaps one day I could meet Dr. Okumura and say thanks.
The awesomeness that is ClangFormat
While working on the Linux project at Valve we started using this code formatting tool from the LLVM folks:
http://clang.llvm.org/docs/ClangFormat.html
Mike Sartain (now at Rad) recommended we try it. ClangFormat is extremely configurable, and can handle more or less any convention you could want. It's a top notch tool. Here's vogl's .clang-format configuration file to see how easy it is to configure.
One day on the vogl GL debugger project, I finally got tired of trying to enforce or even worry about the lowest level bits of coding conventions. Things like if there should be a space before/after pointers, space after commas, bracket styles, switch styles, etc.
I find codebases with too much style diversity increase the mental tax of parsing code, especially for newcomers. Even worse, for both newcomers and old timers: you need to switch styles on the fly as you're modifying code spread throughout various files (to ensure the new code fits in to its local neighbor). After some time in such a codebase, you do eventually build a mental model and internalize each local neighborhood's little style oddities. This adds no real value I can think of, it only subtracts as far as I can tell.
I have grown immune to most styles now, but switching between X different styles to write locally conforming code seems like an inefficient use of programmer time.
http://clang.llvm.org/docs/ClangFormat.html
One day on the vogl GL debugger project, I finally got tired of trying to enforce or even worry about the lowest level bits of coding conventions. Things like if there should be a space before/after pointers, space after commas, bracket styles, switch styles, etc.
I have grown immune to most styles now, but switching between X different styles to write locally conforming code seems like an inefficient use of programmer time.
In codebases like this, trying to tell newcomers to use a single nice and neat company sanctioned style convention isn't practically achievable. A codebase's style diversity tends to increase over time unless you stop it early. Your codebase is stuck in a style vortex.
So sometimes it makes sense to just blow away the current borked style and do a drive-by ClangFormat on all files and check them back in. Of course this can make diff'ing and 3-way merging harder, but after a while that issue mostly becomes moot as churn occurs. It's a traumatic thing, yes, but doable.
Anyhow, here we are in 2015. We individually have personal super computers, and yet we're still editing text on glass TTY's (or their LCD equivalents). I now have more RAM than the hard drives I owned until 2000 or so. Hardware technology has improved massively, but software technology hasn't caught up yet.
Next, you can integrate ClangFormat into your checkin process, or into the editor. Require all submitted code to be first passed through ClangFormat. Detecting divergence can be part of the push request process, or something.
On new codebases, be sure to figure this out early or you'll get stuck in the style diversity vortex.
On new codebases, be sure to figure this out early or you'll get stuck in the style diversity vortex.
Microsoft's old QuickBASIC product from the 80's would auto-format to a standard style as lines were being entered and edited.
Perhaps in the far future (or a current alternate universe), the local coding style can just be an editor, diff-time, and grep-time only thing. Let's ditch text completely and switch to some sort of parse tree or AST that also preserves human things like comments.
Then, the editor just loads and converts to text, you edit there, and it converts back to binary for saving and checkin. With this approach, it should be possible to switch to different code views to visualize and edit the code in different ways. (I'm sure this has all been thought of 100 times already, and it ended in tears each time it was tried.)
As a plus, once the code has been pre-parsed perhaps builds can go a little faster.
Then, the editor just loads and converts to text, you edit there, and it converts back to binary for saving and checkin. With this approach, it should be possible to switch to different code views to visualize and edit the code in different ways. (I'm sure this has all been thought of 100 times already, and it ended in tears each time it was tried.)
As a plus, once the code has been pre-parsed perhaps builds can go a little faster.
Monday, December 7, 2015
The future of GPU texture compression
Google engineers were the first to realize the value of crunch (original site here), my advanced lossy texture compression library and command line toolset for DXTc textures that was recently integrated into Unity 5.x. Here's Brandon Jones at Google describing how crunch can be used in WebGL apps in his article "Saving Bandwidth and Memory with WebGL and Crunch", from the book "HTML5 Game Development Insights".

Intel: Fast ISPC Texture Compressor
NVidia: GPU Accelerated Texture Compression
libsquish
These are awesome, high quality GPU texture compression tools/libs, with lots of customers. Unfortunately they solve the wrong problem.

Won Chun (then at Google, now at Rad) came up with a great idea a few years back. What the web and game engine worlds could really use is a "Universal Crunch" format. A GPU agnostic "download anywhere" format, that can be quickly transcoded into any other major format, like DXTc, or PVRTC, or ASTC, etc. Such a texture codec would be quite an undertaking, but I've been thinking about it for years and I think it's possible. Some quality tradeoffs would have to be made, of course, but if you like working on GPU texture compression problems, or want to commercialize something in this space, perhaps go in this direction.
While I was writing crunch I was only thinking "this is something useful for console and PC game engines". I had no idea it could be useful for web or WebGL apps. Thinking back, I should have sat down and asked myself "what other software technology, or what other parts of the stack, will need to deal with compressed GPU textures"? (One lesson learned: Learn Javascript!)
Anyhow, crunch is an example of the new class of compression solutions opened up by collapsing parts of the traditional game data processing tool chain into single, unified solutions.
So let's go forward and break down some artificial barriers and combine knowledge across a few different problem domains:
- Game development art asset authoring methods
- Game engine build pipelines/data preprocessing
- GPU texture compression
- Data compression
- Rate distortion optimization
The following examples are for DXTc, but they apply to other formats like PVRTC/ETC/etc. too. (Of course, many companies have different takes on the pipelines described here. These are just vanilla, general examples. Id Software's megatexturing and Allegorithmic's tech use very different approaches.)

The previous way of creating DXTc GPU textures was (this example is the Halo Wars model):
1. Artists save texture or image as an uncompressed file (like .TGA or .PNG) from Photoshop etc.
2. We call a command line tool which grinds away to compress the image to the highest quality achievable at a fixed bitrate (the lossy GPU texture compression step). Alternately, we can use the GPU to accelerate the process to near-real time. (Both solve the same problem.)
This is basically a fixed rate lossy texture compressor with a highly constrained standardized output format compatible with GPU texture hardware.
Now we have a .DDS file, stored somewhere in the game's repo.
3. To ship a build, the game's asset bundle or archive system losslessly packs the texture, using LZ4, LZO, Deflate, LZX, LZMA, etc. - this data gets shipped to end users

The "current" new way is a little less complex (at the high level) because we delete the lossless compression step in stage 3. Step 2 now borrows a "new" concept from the video compression world, Rate Distortion Optimization (RDO), and applies it to GPU texture compression:
1. Artists selects a JPEG-style quality level and saves texture or image as an uncompressed file (like .TGA or .PNG) from Photoshop etc.
2. We call a command line tool called "crunch" that combines lossy clusterized DXTc compression with VQ, followed by an custom optimizing lossless backend coder. Now we have a .CRN file at some quality level, stored somewhere in the game's repo
3. To ship a build, game's asset bundle or archive system stores the .CRN file uncompressed (because it's already compressed earlier) - this data gets shipped to end users
The most advanced game engines, such as Unity and some other AAA in-house game engines, do it more or less this way now.

1. Artists hits a "Save" button, and a preview window pops up. Artist can tune various compression options in real-time to find the best balance between lossy compression artifacts and file size. (Why not? It's their art. This is also the standard web way, but with JPEG.) "OK" button saves a .CRN and .PNG file simultaneously.
2. To ship a build, game's asset bundle or archive system stores the .CRN file uncompressed (because it's already been compressed) - this data gets shipped to end users
So let's go forward and break down some artificial barriers and combine knowledge across a few different problem domains:
- Game development art asset authoring methods
- Game engine build pipelines/data preprocessing
- GPU texture compression
- Data compression
- Rate distortion optimization
The following examples are for DXTc, but they apply to other formats like PVRTC/ETC/etc. too. (Of course, many companies have different takes on the pipelines described here. These are just vanilla, general examples. Id Software's megatexturing and Allegorithmic's tech use very different approaches.)
The old way

The previous way of creating DXTc GPU textures was (this example is the Halo Wars model):
1. Artists save texture or image as an uncompressed file (like .TGA or .PNG) from Photoshop etc.
2. We call a command line tool which grinds away to compress the image to the highest quality achievable at a fixed bitrate (the lossy GPU texture compression step). Alternately, we can use the GPU to accelerate the process to near-real time. (Both solve the same problem.)
This is basically a fixed rate lossy texture compressor with a highly constrained standardized output format compatible with GPU texture hardware.
Now we have a .DDS file, stored somewhere in the game's repo.
3. To ship a build, the game's asset bundle or archive system losslessly packs the texture, using LZ4, LZO, Deflate, LZX, LZMA, etc. - this data gets shipped to end users
The Current New Way

The "current" new way is a little less complex (at the high level) because we delete the lossless compression step in stage 3. Step 2 now borrows a "new" concept from the video compression world, Rate Distortion Optimization (RDO), and applies it to GPU texture compression:
2. We call a command line tool called "crunch" that combines lossy clusterized DXTc compression with VQ, followed by an custom optimizing lossless backend coder. Now we have a .CRN file at some quality level, stored somewhere in the game's repo
3. To ship a build, game's asset bundle or archive system stores the .CRN file uncompressed (because it's already compressed earlier) - this data gets shipped to end users
The most advanced game engines, such as Unity and some other AAA in-house game engines, do it more or less this way now.
The Other New Way (that nobody knows about)

1. Artists hits a "Save" button, and a preview window pops up. Artist can tune various compression options in real-time to find the best balance between lossy compression artifacts and file size. (Why not? It's their art. This is also the standard web way, but with JPEG.) "OK" button saves a .CRN and .PNG file simultaneously.
2. To ship a build, game's asset bundle or archive system stores the .CRN file uncompressed (because it's already been compressed) - this data gets shipped to end users
But step #1 seems impossible right? crunch's compression engine is notoriously slow, even on a 20 core Xeon machines. Most teams build .CRN data in the cloud using hundreds to thousands of machines. I jobified the hell out of crunch's compressor, but it's still very slow.
Internally, the crunch library has a whole "secret" set of methods and classes that enable this way forward. (Interested? Start looking in the repo in this file here.)
The Demo of real-time crunch compression
Here's a Windows demo showing crunch-like compression done in real-time. It's approximately 50-100x faster than the command line tool's compression speed. (I still have the source to this demo somewhere, let me know if you would like it released.)
This demo utilizes the internal classes in crnlib to do all the heavy lifting. All the real code is already public. These classes don't output a .CRN file though, they just output plain .DDS files which are then assumed to be losslessly compressed later. But there's no reason why a fast and simple (non-optimizing) .CRN backend couldn't be tacked on, the core concepts are all the same.
One of the key techniques used to speed up the compression process in the QDXT code demonstrated in this demo is jobified Tree Structured VQ (TSVQ), described here.
GPU texture compression tools: What we really want

The engineers working on GPU texture compression don't always have a full mental model of how texture assets are actually utilized by game makers. Their codecs are typically optimized for either highest possible quality (without taking eons to compress), or they optimize for fastest compression time with minimal to no quality loss (relative to offline compression). These tools ignore the key distribution problems that their customers face completely, and they don't allow artists to control the tradeoff between quality vs. filesize like 25 year old standard formats such as JPEG do.
Good examples of this class of tools:
NVidia: GPU Accelerated Texture Compression
libsquish
These are awesome, high quality GPU texture compression tools/libs, with lots of customers. Unfortunately they solve the wrong problem.
What we really want, are libraries and tools that give us additional options that help solve the distribution problem, like rate distortion optimization. (As an extra bonus, we want new GPU texture formats compatible with specialized techniques like VQ/clusterization/etc. But now I'm probably asking for too much.)
The GPU vendors are the best ones to bridge the artificial divides described earlier. This is some very specialized technology, and the GPU format engineers just need to learn more about compression, machine learning, entropy coding, etc. Make sure, when you are designing a new GPU texture format, that you release something like crunch for that format, or it'll be a 2nd class format to your customers.
The GPU vendors are the best ones to bridge the artificial divides described earlier. This is some very specialized technology, and the GPU format engineers just need to learn more about compression, machine learning, entropy coding, etc. Make sure, when you are designing a new GPU texture format, that you release something like crunch for that format, or it'll be a 2nd class format to your customers.
Now, the distant future

Won Chun (then at Google, now at Rad) came up with a great idea a few years back. What the web and game engine worlds could really use is a "Universal Crunch" format. A GPU agnostic "download anywhere" format, that can be quickly transcoded into any other major format, like DXTc, or PVRTC, or ASTC, etc. Such a texture codec would be quite an undertaking, but I've been thinking about it for years and I think it's possible. Some quality tradeoffs would have to be made, of course, but if you like working on GPU texture compression problems, or want to commercialize something in this space, perhaps go in this direction.
Subscribe to:
Posts (Atom)

