Wednesday, December 2, 2015

A graph submission API for lossless data compression

Earlier today I was talking with John Brooks (CEO of Blue Shift Inc.) about my previous blog post (adding a new CompressQuery()API to lossless compressors). It's an easy API to understand and add to existing compressors, and I know it's useful, but it seems like only the first most basic step.

For fun, let's try path finding into the future and see if we can add some more API's that expose more possibilities. How about these API's, which enable the caller to explore the solution space more deeply:

- CompressPush(): Push compressor's internal state
- CompressPop(): Pop compressor's internal state
- CompressQuery(): Determine how many bits it would take to compress a blob
- CompressContinue(): Commit some data generating some compressed output

Once we have these new API's (push/pop/query, we already have commit) we can use the compressor to explore data graphs in order to compose the smallest compressed output.

The Current Situation

Here's what we do with compressors today:

There are two classes of nodes here that represent different concepts.

The blue nodes (A, B, C, etc.) represent internal compressor states, and the black nodes represent some data you've given to the compressor. Black nodes represent calls to CompressContinue() with some data to compress. You put in some data by calling CompressContinue(), and the compressor moves from internal states A all the way to F, and at the end you have some compressed data that will recover the data blobs input in nodes G, H, I, etc. at decompression time. Whenever the compressor moves from one blue node to the next it'll hand you some compressed bits.

Now let's introduce CompressQuery()

Let's see what possibilities CompressQuery() opens up:

Now the black nodes represent "trials". In this graph, the compressor starts in state A, and we conduct three trials labeled B, C, and D using the CompressQuery() API to determine the cheapest path to take (i.e. the path with the highest compression ratio). After figuring out how to get into compressor state E in the cheapest way (i.e. the fewest amount of compressed bits), we "commit" the data we want to really compress by calling CompressContinue(), which takes us into state E (and also gives us a blob of compressed data). We repeat the process for trials F, G, H, which gets the compressor into state I, etc. At Y we have fully compressed the input data stream and we're done.

CompressQuery() is a good, logical first step but it's too shallow. It's just a purely local optimization tool.

Let's go further: push/pop the compressor state

Sometimes you're going to want to explore more deeply, into a forest of trials, to find the optimal solution. You're going to need to push the current compressor's state, do some experiments, then pop it to conduct more experiments. This approach could result in higher compression than just purely local optimization.

Imagine something like this:

At compressor state A we first push the compressor's internal state. Now conduct three trials (C, D, E), giving us compressor state G, etc. At L we're done, so we pop back to node A and explore the bottom forest. Once we've found the best solution we pop back to A and commit the black nodes with the best results to get the final compressed data.


The main point of this post: Lossless data compressors don't need to be opaque black boxes fed fixed, purely linear, data streams. Tightly coupling our data generation code with the backend compressor can enable potentially much higher ratios.


  1. Do you mean like media filters that winrar uses but more specialized to your data?

    BTW I would love it if you could somehow work together with 7-zip to get LZHAM in the GUI.

  2. For 1st question, yes.

    About 7-zip: Yes, this is one of my goals, I will contact Igor and ask him. I did this myself in the 7-zip codebase from years ago, and it wasn't that hard. But now we have a new plugin API in 7-zip, and that would need to be hooked up. Igor is the right person to do it. (I would do it and give him the code, but I don't have the free time.)

    Right now we have two 7-zip codec plugins written (LZHAM and Zstd), Zstd one is coming. After that, the squash universal library.