Monday, January 19, 2015

Parallelized download+decomp performance of various codecs

I finished porting and testing LZHAM v1.0 on OSX today. Everything works, including multithreaded compression. I've also tuned the Huffman table updating options more. Next stop is iOS.

The graphs in the previous post show the summation of load_time+decomp_time at various download rates. For large amounts of data (not small blocks), it makes sense to decompress each buffer of compressed data as it becomes available (from the disk or network) using streaming decompression, instead of waiting for all the compressed data to be fully downloaded first.

The following graphs show uncompressed_size / MAX(load_time, decomp_time) at various download rates on ~166MB of uncompressed Unity iOS asset bundle data compressed as a single .TAR file with various codecs. (Of course, in case it's not obvious, I'm assuming everywhere here that the data has already been pre-compressed. I am not including compression times anywhere here.)

My brain is tired, but this should roughly approximate the total amount of time it would take to deliver the uncompressed data (ignoring the effects of buffering and other overheads). I'm also assuming it's a single large stream of compressed data and you're not doing something fancy like parallelizing decompression.

The version of LZHAM referenced in this post is v1.0, which I'll be releasing on github hopefully this week (not the old alpha on Google Code).

Note: I used the regular LZ4 compressor for these graphs, *not* LZ4-HC which compresses slightly better.

1. 3.3GHz CPU, X axis scale=281.63 megabytes/sec.

Graph 1: On fast CPU's, up until 13 MB/sec. download rates LZMA is the winner (because its ratio is highest), then up to 95MB/sec. LZHAM v1.0 switches to the winner (because its ratio is almost as high as LZMA but it decompresses more quickly - at these rates LZMA is just too slow to keep up with the download rate). After than, good old Deflate (miniz's decompressor) is best up until ~130MB/sec., then LZ4 takes over from there.

2. 3.3GHz CPU, X axis scale=2.35 megabytes/sec. (zoomed version of above)

Graph 2: A zoomed in version of the left of graph 1 showing more detail at the slower downloads speeds. Graph 2 shows that on fast CPU's and slow networks, LZMA is the clear winner because all that matters to the overall delivery time is compression ratio. LZHAM is close but loses because it has a slightly lower ratio (usually 1-4% lower) on this data. The other codecs just have too low a ratio to compete at these low network speeds.

3. VERY SLOW CPU, X axis scale=4.7 megabytes/sec.

Graph 3: A much slower CPU (simulated 1/12th the performance of my desktop CPU - just multiplied the decomp time by 12). This graph is zoomed in to show detail at low network speeds. LZMA is the winner up to ~1MB/sec., then it plateaus because it's too slow to keep up on this slow ass CPU. LZHAM can sustain up to 3MB/sec, then Deflate takes over. From there LZ4 will eventually take over as the winner at the higher network rates because Deflate eventually won't be able to keep up. At the very highest network rates it makes more sense to not even bother using compression at all, because the CPU won't be able to keep up, even with LZ4.

My 2 cents: The "best" codec to use (where "best" minimizes the amount of time the user needs to wait for the full download) depends on the client's CPU speed, the codec's compression ratio, and the download (or disk) rate. A smart content server (that doesn't give a crap about how much data it actually sends over the network) could choose the best codec to use depending on these factors.

To minimize content delivery (download) time on fast desktop CPU's, use something like LZMA or LZHAM. Single threaded LZMA doesn't scale beyond ~13MB/sec. on my CPU, where LZHAM will scale up to 95MB/sec. but has a slight download time penalty (around 1-4%) at the slower download rates.

On slow CPU's, LZMA plateaus very early. Beyond this download rate, LZHAM v1.0 is the winner, then Deflate followed by LZ4.

Of course, we do care about how much data must be downloaded, due to ISP caps, monthly rate plans, etc. In this scenario, we need to stick to using a high ratio codec like LZMA or LZHAM. LZMA doesn't scale well on slow CPU's, so we need something like LZHAM that has a similar ratio but scales more effectively.

To ultimately improve the current state of the art, we need a new codec with higher ratios than LZMA that doesn't run like a complete dog, use massive amounts of RAM, or require large #'s of cores to be practical. I don't know of any open source codecs that fit these requirements yet. Somebody needs to write one.


  1. I'm surprised that deflate is worse than lz4 - in my mental model lz4 was superior. Are you using LZ4-HC? LZ4-HC compresses significantly better compared to LZ4 (albeit significantly slower); decompression performance is usually roughly equal so for off-line compression you should definitely use that instead of the baseline compressor.

    1. I used plain LZ4 - I'll switch to the HC version. If its ratio is close to zlib's it'll be superior (or really close) for offline compression.

    2. LZ4-HC ratio can be on par with deflate; it depends on the data. deflate has a tiny window (32k!) but does Huffman after literals; LZ4-HC has a big window but only does LZ.

      I seem to remember checking the ratios on significant amount of source code and they were reasonably close... I may be wrong.