Saturday, October 31, 2015

15 Reasons why developers won't use your awesome codec

Getting other developers to use your code in their products is surprisingly difficult.  A lot of this applies to open source development in general:

1. "Nobody's ever been fired for choosing IBM."
The codec is perceived to be "too new" (i.e. less than 5-10 years old), and the developer is simply afraid to adopt it.

2. Inertia: The developer has a super favorite codec they believe is useful for every scenario and they've already hooked it up, so they don't want to make the investment of switching to something else.

3. Politics: The developer irrationally refuses to even look at your codec because you work for the same company as they do.

4. Perceived lack of support: If the codec fails in the field, who's gonna help us debug it?

(I believe this is one reason why Rad's compression products are perceived to offer enough value to be worth purchasing.)

5. Linus hates C++: The codec is written in C++, so it's obviously crap.

6. Bad packaging: It's not wrapped up into a trivial-to-compile-and-use library with a dead simple C-style API.

7. Too exotic: The developer just doesn't understand it.

8. Untested on mobile: It's been designed and tested on a desktop CPU, with unknown mobile performance (if it compiles/works at all).

9. Memory usage: It's perceived to use too much RAM.

10. Executable size: The codec's executable size is perceived to be too large.

11. Lacks some feature(s): The codec doesn't support streaming, or the decompressor doesn't support in-place decompression.

12. Performance: The thing is #1 on the compression ratio charts, but it's just stupendously slow.

13. Robustness: Your codec failed us once and so we no longer trust it.

14. Licensing issues: GPL, LGPL, etc. is the kiss of death.

15. Patent FUD: A patent troll has mentioned that your codec may be infringing on one of their patents, so the world is now afraid to use it. It doesn't matter if your codec doesn't actually infringe.

Thursday, October 29, 2015

Grab bag list of LZHAM improvements

Putting this somewhere so I don't forget it:

Patching specific:

- Allow the decompressor to refer to a static dictionary anywhere in memory (i.e. no memcpy() needed into the decompressor's dictionary buffer)

- bsdiff is so simple looking. Why does it work so well when patching very similar files?

- For patching, investigate allowing the compressor to emit "delta rep" matches, or rep matches with an optional plus/minus distance adjustment.

- Use Matt Mahoney's FV tool on patch file scenarios.

- Add a visualization mode to LZHAM's compressor, like FV.

General improvements:

- Symbol price accuracy: the parsing jobs use approximate symbol statistics that are locked at the beginning of each block. Examine how inaccurate these statistics really are.

- Try to SIMD the Huffman table update routines.

- The codec is too focused on the totally general case, which is streaming. Many useful scenarios do not involve streaming at all.

Add a "small file" optimization mode to LZHAM's compressor, which can be used as a hint to the compressor that it's OK to try encoding the first block in multiple ways.

Add a decompressor variant that doesn't support streaming at all, to reduce the # of decompressor basic blocks (idea from John Brooks).

- Add a compile time option that reduces the size of the decompressor as much as possible, even if it sacrifices perf.

- The big Huffman tables (like for literals, delta literals, match len/dist categories) are all initialized to symbol frequencies of 1's, which is wasteful for things like text files. Implement some way for the compressor to have control over this, like escape codes to jam small regions of the symbol table frequencies to 1's, or perhaps configuration bits at the start of the stream.

- LZHAM's Huffman table decoding fastbits (symbol decoding acceleration) table is too large on very small streams (an observation due to Charles Bloom). The decoder should start with small tables and grow them over time.

- From John Brooks: Combine Deflate's idea of storing compressed Huffman table codelengths in the data stream with LZHAM's current approach of rebuilding the tables over time. At the start of streams, use compressed codelengths, then switch to dynamic.

- Add a configuration bit (global or per block?) to completely disable rep matches, which I believe will help a small amount on text files. Have the compressor try this optimization out.

- Steal the idea from LZMA of global configuration settings that tweak some of its prediction models, so the user can call LZHAM multiple times with different settings and choose the smallest result (see the sketch after this list).

- There are many compromise decisions in LZHAM. For example, the decompressor state machine transition table can't be varied, the bitwise arithmetic coder's adaptation rate can't be varied, and the Huffman table update interval is user controllable but not automatically optimized. Allowing the compressor to optimize along these axes could result in gains.

- Further profile and improve LZHAM's small block perf (reduce startup cost, increase throughput near start of streams).
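
As a rough sketch of the "try several configurations, keep the smallest" idea above: the CodecConfig knobs and the compress_with_config callback below are hypothetical, since LZHAM doesn't expose these yet.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical per-stream tuning knobs the compressor could expose (these don't exist in LZHAM today).
struct CodecConfig {
    bool disable_rep_matches;    // e.g. the proposed "no rep matches" bit for text-like data
    uint32_t table_update_rate;  // Huffman table update interval
};

// The actual codec call is passed in, since the knobs above are hypothetical.
using CompressFn = std::function<std::vector<uint8_t>(const std::vector<uint8_t>&, const CodecConfig&)>;

// Brute force: compress with each candidate config and keep the smallest output.
// The winning config would be signaled to the decompressor in the stream header.
std::vector<uint8_t> compress_best(const std::vector<uint8_t>& src,
                                   const std::vector<CodecConfig>& candidates,
                                   const CompressFn& compress_with_config) {
    std::vector<uint8_t> best;
    for (const CodecConfig& cfg : candidates) {
        std::vector<uint8_t> out = compress_with_config(src, cfg);
        if (best.empty() || out.size() < best.size())
            best.swap(out);
    }
    return best;
}
```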

Platform specific:

- Get Emscripten compilation of the decompressor working, for JavaScript support

- Deeply examine and optimize the generated assembly of the decompressor on ARM

Traumatic:

- Just dump arithmetic and Huffman coding and switch to something like rANS. I think I'll need to do this to ultimately compete against Brotli.

Very long term pie in the sky stuff:

I have a frighteningly complex ROLZ branch of LZHAM alpha somewhere that works. Re-evaluate it.

Compression is Compilation

This post's title was inspired by the title of Paul Khuong's blog post "Refactoring With LZ77: Compression Is Compilation (?)". I doubt this idea is new; it's just the best viewpoint (or mental model) for me to have while working on LZHAM's successor. I've explained it to several developers and haven't been laughed out of the room yet.

I think the usual description given of LZ coding is a very mechanical (and boring) one that dwells too much on details like match finding and entropy coding.

To create LZHAM I had to rethink the entire way I thought about (or mentally modeled) LZ77 codecs, basically from a compression-centric viewpoint to a compiler+VM-centric viewpoint. Before, while working on Deflate compatible codecs optimized for specialized processors like the X360's, I focused intensely on lower level things like Huffman coding, match finding, simplistic parsing algorithms (greedy vs. lazy matching), and CPU specific decompressor optimizations.

My new viewpoint is that a LZ compressor is a compiler. The output data from a LZ compressor is an executable program consisting of very efficiently encoded instructions for a sort of virtual CPU. The decompressor is a type of virtual machine that decodes and executes the instructions encoded in the compressed data in order to precisely decode the original source data. The VM's output is the decompressed data. The compiler/compressor's goal is to produce a program that when executed results in the source data.
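
To make the VM analogy concrete, here's a minimal sketch (not LZHAM's or any real codec's actual instruction format) of a decompressor that "executes" a two-instruction program: emit a literal byte, or copy len bytes starting dist bytes back in the output.

```cpp
#include <cstdint>
#include <vector>

// A toy "LZ VM": the compressed stream is a program made of two instruction kinds.
struct Inst {
    bool is_match;   // false = emit literal, true = copy from earlier output
    uint8_t literal; // valid when is_match == false
    uint32_t dist;   // match distance (bytes back from the current output position)
    uint32_t len;    // match length
};

// "Execute" the program: each instruction appends bytes to the output (the VM's memory).
// Assumes dist <= bytes already output for every match instruction.
std::vector<uint8_t> run_lz_program(const std::vector<Inst>& program) {
    std::vector<uint8_t> out;
    for (const Inst& inst : program) {
        if (!inst.is_match) {
            out.push_back(inst.literal);
        } else {
            // Byte-by-byte copy so overlapping matches (dist < len) behave as in real LZ decoders.
            size_t src = out.size() - inst.dist;
            for (uint32_t i = 0; i < inst.len; i++)
                out.push_back(out[src + i]);
        }
    }
    return out;
}
```

Everything a real codec adds on top of this (rep-match "registers", state machine contexts, entropy coded instruction fields) is about encoding this program in fewer bits and executing it faster.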

A modern LZ compressor like LZMA or LZHAM has a backend "Instruction selection" optimizer: "Typically, instruction selection is implemented with a backwards dynamic programming algorithm which computes the "optimal" tiling for each point starting from the end of the program and based from there." If you understand this then you will easily understand LZ "Optimal Parsing", which is more or less the same thing. (Charles Bloom and Yann Collet have some good posts on optimal parsing. There are also several decent thread replies on optimal parsing at encode.ru.)
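
Here's a heavily simplified sketch of that backwards dynamic program, assuming we're handed a per-position literal price and a set of candidate matches with prices. A real optimal parser like LZMA's or LZHAM's also threads rep distances and coder state through the DP, and its prices come from adaptive statistics rather than fixed inputs.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct Match { uint32_t len; double price; };  // a candidate match starting at some position

// Hypothetical inputs: literal_price[i] = cost in bits of coding byte i as a literal,
// matches[i] = candidate matches starting at position i (assumed to fit within the input).
// Returns the cheapest total cost; a real parser also records which choice "won" at each position.
double optimal_parse_cost(const std::vector<double>& literal_price,
                          const std::vector<std::vector<Match>>& matches) {
    const size_t n = literal_price.size();
    std::vector<double> cost(n + 1, std::numeric_limits<double>::infinity());
    cost[n] = 0.0;  // no bytes left to code
    for (size_t i = n; i-- > 0; ) {
        // Option 1: code byte i as a literal.
        cost[i] = literal_price[i] + cost[i + 1];
        // Option 2: code a match starting at i (a "tile" covering len bytes).
        for (const Match& m : matches[i])
            cost[i] = std::min(cost[i], m.price + cost[i + m.len]);
    }
    return cost[0];
}
```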

The endless LZ77-based variants have a wide variety of "CPU" instruction sets, registers, and global state. They use a wide variety of entropy coding backends (raw bits, arithmetic, Huffman, or newfangled things like ANS/FSE) to encode instructions. Examples of significant LZ77 variants that change up the instruction set and/or VM "registers" are ROLZ and LZP.

Also, in modern LZ77 implementations there is still a large amount of redundancy in how the instructions are selected and coded. The vanilla LZ codespace is just too flexible. Charles Bloom is one of the trail blazers in publishing observations about the inefficiencies and opportunities in LZ compression. His LAM ("Literals After Mismatches") exclusion concept is one example.

Some of these varieties are designed to trade off compression ratio (i.e. instruction encoding efficiency) for extremely high compression and decompression (i.e. VM execution) throughput, like LZ4 or Snappy. And some varieties very carefully focus on balancing compression ratio vs. decompression throughput (with slow, usually offline, compressors), like LZMA, LZHAM, and brotli.

I think LZHAM is useful at all because it has an instruction set optimizer (near-optimal parsing using dynamic programming), a carefully chosen instruction set and state "registers" (basically LZMA's), and a fast entropy coding scheme that blends Huffman and bitwise arithmetic coding in a way that trades off a tiny amount of ratio for higher symbol decoding throughput (i.e. VM execution speed) vs. LZMA.

LZHAM shows one way to compete against another, well entrenched codec: choose one important axis to improve upon (say decompression performance), but keep the same (or better) compression ratio as your competitor. If all other things are equal, the new codec is "better". (Side note: another way to compete against an existing codec is to clone its API and high-level behavior and semantics as closely as you can. This is what miniz and LZHAM do vs. zlib. These codecs even try to clone zlib's obscure flush semantics.)

At an even higher level, data compression (not just LZ77) can also be viewed as a type of Machine Learning followed by Entropy Coding. LZ77 is an intensely practical/efficient compromise that constrains the implementation in just the right way: byte level prediction/matching, with a high average number of bytes output per decoded LZ instruction during decompression.

LZ codecs like zlib, LZ4, LZMA, brotli, LZHAM, Tornado, and Rad Game Tool's Oodle product all strike some of the most interesting, and currently best (or very nearly the best) compromises and choices between ratio, various important measures of throughput, memory consumption, code size, optimization level, bug support, platform support, library update frequency, etc.

Benchmarks like Squash and the Large Text Compression Benchmark show the performance of a variety of codecs (not just LZ based ones), executed on different types of CPUs with different categories and sizes of source data.

Game developers need to pitch in and contribute samples of their shipping data to the public domain, so LZ codec authors can make implementation and optimization decisions tuned to their unique datasets. We need a "Game Data Compression Benchmark" site.

Tuesday, October 27, 2015

Quick EXE patching experiment: bsdiff/xdelta vs. lzham

This is just a couple of data points, so I'm not going to read much into it. I'm just trying to get a feel for how LZHAM's current support for patch generation stacks up vs. other well known tools.

Unity46.exe -> Unity52.exe:

Unity46.exe: 30476872
Unity52.exe: 41259480
lzham_codec_devel (1): 8737015 
lzham_codec_devel (2): 8605360
bsdiff: 12320793
xdelta: 14233171

Unity510.exe -> Unity512.exe:

Unity510.exe: 52680480
Unity512.exe: 52740568
lzham_codec_devel (1): 5042088
lzham_codec_devel (2): 4736124
bsdiff: 3810343
xdelta: 11945751 

bsdiff: Windows version, based off bsdiff 4.1
xdelta: -9 (max compression)
lzham 1 settings: -m4 -x8 -d27
lzham 2 settings: -o -e -b -h4 -m4 -x16 -d27

Note about the lzham settings: setting 2 is ridiculously slow (and, because of -x16, required rebuilding the compressor DLL), but it shows the highest ratios lzham is capable of at the moment.

Quick and dirty conclusions:
xdelta isn't doing well here.

When the old/new files resemble each other (Unity510.exe vs. Unity512.exe), bsdiff beats LZHAM by a significant margin.

On the other hand, when the files aren't so similar (Unity46.exe vs. Unity52.exe), the tables are turned and LZHAM beats bsdiff. Hmm.

File patching using data compression with flushing

This neat idea is from Alexander Suvorov at Unity. There's a similar approach described in "Data Compression: The Complete Reference", section 8.14.2 File Differencing: VCDIFF, but it's not described in a way that is easily adapted to existing codecs. (Note any errors below are due to me, not Alexander.)

This simple file patching (or differential compression) technique is compatible with any deterministic lossless codec that supports flushing during compression, like zlib and LZHAM. It doesn't require the codec to explicitly support patch file generation, just flushing during compression. (See the LZHAM_FULL_FLUSH etc. enums in lzham.h.)

Let's say you want to transmit a new version of a file (let's call it file B) to a remote client, and the remote client already has the first version (file A). Here's how you can create a patch file which the remote client can use to recover file B from file A.

Assumptions: Both the local and remote clients have the same codec, use the same codec settings, the compressor is deterministic, and flushing doesn't reset the compressor's internal state (it just compresses all input provided so far). For LZ, the dictionary size must be at least filesize(A)+filesize(B) bytes.

1. The remote client compresses file A and flushes the compressor, resulting in a blob of compressed data (let's call this "CA").

Of course, it's probable the remote client already has the compressed version of file A, in which case you skip this step.

2. The local client also compresses file A and flushes the compressor, resulting in a blob of compressed data which is the same as CA in step 1. Then it immediately compresses file B (using the same compressor instance), resulting in another blob of compressed data (CB). 

The flush doesn't reset the compressor's internal state, it just causes it to output as much compressed data as possible aligned to a byte boundary, so when it compresses file B it'll take advantage of matches against file A.

Said another way: you compress a total of filesize(A)+filesize(B) bytes using one instance of the compressor (the compressor's internal state is not reset between files; for LZ this means the sliding dictionary is not reset during the flush), but flush between A and B. The compressor flush cleanly separates the compressed data needed to decompress file A from the compressed data needed to decompress file B.

3. Key step: Only transmit CB to the remote client (i.e. only the compressed data occurring immediately after the flush). The local client doesn't need to transmit CA because the remote client has already computed it in step 1.

4. Now the remote client combines compressed blobs CA and CB into a single compressed blob and decompresses the entire thing. It ignores the first filesize(A) decompressed bytes, resulting in file B. That's it.
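
Here's a minimal sketch of steps 2-3 using zlib, chosen only because its flush semantics are widely known (the LZHAM flush enums mentioned above play the same role). With zlib the right flush is Z_SYNC_FLUSH, which flushes to a byte boundary without resetting the match window; Z_FULL_FLUSH resets the window, which would defeat the trick.

```cpp
#include <cstring>
#include <vector>
#include <zlib.h>

// Compress fileA, sync-flush, then compress fileB with the same deflate stream.
// Returns only the bytes emitted after the flush (CB in the text above).
// Error checks are omitted for brevity.
std::vector<uint8_t> make_patch(const std::vector<uint8_t>& fileA,
                                const std::vector<uint8_t>& fileB) {
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    deflateInit(&zs, Z_BEST_COMPRESSION);

    std::vector<uint8_t> out(deflateBound(&zs, fileA.size() + fileB.size()) + 64);

    // Step 2a: compress file A and flush to a byte boundary (this produces CA).
    zs.next_in = const_cast<Bytef*>(fileA.data());
    zs.avail_in = static_cast<uInt>(fileA.size());
    zs.next_out = out.data();
    zs.avail_out = static_cast<uInt>(out.size());
    deflate(&zs, Z_SYNC_FLUSH);
    const size_t patch_start = zs.total_out;  // everything after this point is CB

    // Step 2b: compress file B with the same stream, so it can match against file A.
    zs.next_in = const_cast<Bytef*>(fileB.data());
    zs.avail_in = static_cast<uInt>(fileB.size());
    deflate(&zs, Z_FINISH);
    const size_t patch_end = zs.total_out;
    deflateEnd(&zs);

    // Step 3: only CB needs to be transmitted; the remote side regenerates CA itself.
    return std::vector<uint8_t>(out.begin() + patch_start, out.begin() + patch_end);
}
```

The remote client runs the same code on file A alone (stopping at the sync flush) to regenerate CA, appends the received patch bytes, inflates the concatenation, and discards the first filesize(A) output bytes. Note that deflate's 32KB window caps how far back into file A the matches from file B can reach, which is exactly why a large-dictionary codec like LZHAM is interesting here.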

Several downsides to this idea: The remote client must compress file A (costing CPU and/or storage), the dictionary size must be large enough to encompass both files (costing RAM), and the remote client must decompress both files. These requirements could be too expensive for mobile with large files.

Patch file generation using LZHAM is simpler because it explicitly supports seed dictionaries. You set the seed dictionary to file A, compress file B, resulting in a compressed patch file that recovers B from A. This still can require too much RAM for mobile, because you need a dictionary of at least filesize(A) bytes.
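
For comparison, here's the same idea expressed with zlib's closest analog to a seed dictionary, a preset dictionary set via deflateSetDictionary() (again limited to deflate's 32KB window, whereas LZHAM's seed dictionary can cover all of file A).

```cpp
#include <cstring>
#include <vector>
#include <zlib.h>

// Patch generation via a preset dictionary: prime the compressor with (the tail of) file A,
// then compress file B. The decompressor must be primed with the same dictionary
// (inflateSetDictionary, after inflate returns Z_NEED_DICT) to recover B.
// Error checks are omitted for brevity.
std::vector<uint8_t> make_dict_patch(const std::vector<uint8_t>& fileA,
                                     const std::vector<uint8_t>& fileB) {
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    deflateInit(&zs, Z_BEST_COMPRESSION);
    deflateSetDictionary(&zs, fileA.data(), static_cast<uInt>(fileA.size()));  // zlib keeps only the last 32KB

    std::vector<uint8_t> patch(deflateBound(&zs, fileB.size()));
    zs.next_in = const_cast<Bytef*>(fileB.data());
    zs.avail_in = static_cast<uInt>(fileB.size());
    zs.next_out = patch.data();
    zs.avail_out = static_cast<uInt>(patch.size());
    deflate(&zs, Z_FINISH);
    patch.resize(zs.total_out);
    deflateEnd(&zs);
    return patch;
}
```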