Tuesday, December 5, 2023

Announcing CPNG, "Compatible Network Graphics", a backwards compatible fork of PNG

CPNG ("Compatible Network Graphics") is a 100% backwards compatible fork of the ~30 year old PNG image format, which is still thoroughly, and probably forever, stuck in the 1990's. I've been quietly designing, prototyping, and shipping CPNG's key features over the past couple years on github.

Why continue messing with PNG at all? Because if you can figure out how to squeeze new features into PNG in a backwards compatible way, it's instantly compatible with all browsers, OS's, engines, etc. This is valuable.

I realized, as long as my software is significantly faster for encoding/decoding (>10x encode, >2-3x decode - per thread) and is 100% backwards compatible, I'll have gathered the momentum and library adoption to also add some badly needed new features: faster encoding/decoding by constraining and simplifying the Deflate stream to pixels vs. bytes, SIMD encoding, multithreading so performance can scale up to modern systems, and real HDR pixels (FP16 IEEE half float and LOGLUV32) so it scales to modern and future HDR monitors/tool chains.

Crucially, the CPNG data stream must be entirely backwards compatible with the existing PNG format, like how color TV was introduced but it still worked on B&W TV's. New features can be added, as long as existing decoders ignore them and return a viewable image:

1. Constrained Deflate for huge (10-25x) encoding speedups vs. most other libraries, and 2-3x decoding speedups: this has shipped already in fpng/fpnge.

I documented the Constrained Deflate feature utilized by fpng here: 

Other encoder/decoder authors can utilize this feature today, just by following this spec. CPNG will leverage this approach at first, before moving to a SIMD encoding approach (like Google's fpnge utilizes).

2. Multithreaded encoding/decoding using multiple IDAT's: Apple does this already.

We're putting a seek (or offset) table into the CPNG ancillary chunk. The image will be optionally parallel encodable and decodable in strips.

This also parallelizes the painful Alder-32 and CRC-32 steps. To remain backwards compatible, the CPNG format still has to (somewhat ridiculously) compute both checksums, putting it at a disadvantage vs. other modern image formats. (Who thought putting two checksums into an image format was a good idea, anyway?)

3. FP16 (IEEE 754-2008 half float) support, but with a lossless tone mapped fallback. One approach has already been demonstrated in the png16 repo. This feature exploits PNG's already built-in support for 16-bit pixels.

Here are some encoded 48-bit PNG example images. These images are losslessly tone mapped:

These PNG files are tone mapped 16-bit images, but internally they are also standard half float images. A small 256 entry lookup table is stored in an ancillary chunk, so a CPNG HDR decoder can retrieve the half float pixels from the 16-bit unsigned pixels. See the png16 repo source for more details on how this approach works.

HDR CPNG images must remain viewable, in some reasonable way, by all browsers and OS's that only support PNG, even if they are HDR images. This requires a lossless tonemap operator of some sort, so the legacy PNG image is viewable as an SDR image, but the HDR data can always be losslessly recovered using a simple and very fast procedure. 

4. LOGLUV32 - This is Ward's very space efficient, perceptually lossless HDR pixel format:

The L16 portion is stored in the legacy PNG image as a 16-bit grayscale image, so the basic image remains viewable as a SDR PNG file. We'll have to losslessly tonemap these log luminance pixels, using a technique similar to the approach used for FP16 pixels. 

The U8V8 portion (color) is stored in ancillary chunk(s), compressed as another 8,8 image. For alpha support, the CPNG file can exploit the LA16 vs. L16 portion of the PNG file. These ancillary chunks will be ignored by old PNG readers.

The separate chunk for UV color data is needed, otherwise I don't see how the 16-bits of UV color can be placed into the PNG without older viewers displaying a totally unviewable image. Older PNG readers viewing/previewing just tone mapped luminance should be fine.

Friday, May 12, 2023

The claimed rANS decoding advantages are being exaggerated

To put it simply:

SSE 4.1, on an Intel Ice Lake CPU:

SIMD Range: 64 streams: 738 MiB/sec., 436,349 bytes

SIMD Range: 16 streams: 619.1 MiB/sec.

SIMD rANS: 8 streams: 556.5 MiB/s, 435,626 bytes

SIMD Huffman: 32 streams: 1.1 GiB/sec.

The vectorized 24-bit Pavlov/Subbotin range coder is here

The rANS decoder I'm comparing against appears to be well implemented. (If I could use rANS, which I can't because it's patented, I would start with this code.) It could be unrolled more to 16 (vs. 8) or more streams, and that's going to boost decoding perf. some. Let's say ~25% faster, so ~695 MiB/sec. Range is still comparable, and Huffman is of course faster. The benchmark data is showing it's pointless to risk infringement.

Also, if you show me probably any vectorized rANS decoder, many of the vectorization techniques also easily apply to SIMD range and Huffman decoders too. They each have to do many of the same things, and this helps equalize the perceived performance advantages of rANS decoding. 

Ultimately working around range coding's division requirement is not the biggest problem. It's the gathers and keeping all the lanes normalized and fed efficiently - which are the same problems the other decoders face. Once you solve these problems for one entropy coder you can apply them to the others. Many of the implementation differences of the actual entropy coding technique melt away once vectorized.

Note encoding is a different beast. I've been unable to see a way to easily interleave the output bytes from X parallel range coders. The SSE 4.1 range sample code linked to above has to swizzle the bytes after encoding into a single stream, however there is no special side band or signaling information needed (just like with rANS). (This is just like my LZHAM codec, which has been interleaving arithmetic and Huffman coded streams with no extra signaling or side band data for many years now - including before FSE/ANS was a thing.) With rANS encoding interleaving the outputs seems very easy. With more research into other entropy coders perhaps it's possible to eliminate this advantage too.

A straightforward AVX-2 port of this SSE 4.1 range coder is ~65% faster on Ice Lake, putting it roughly in the same ballpark as the AVX-2 rANS decoders I've seen benchmarked. This makes sense because many of the bottlenecks are similar and have little to do with the actual entropy coding technique.

(Note I could upload my SSE 4.1 Huffman decoder to github, but honestly 1.1 GiB/sec. is not a big deal. Fast scalar Huffman on my machine gets ~800 MiB/sec. SSE 4.1 lacks easy variable per-lane bit shifting which hurts Huffman a lot. Straightforward AVX-2 Huffman decoding with an 8-bit alphabet gets ~1.8 GiB/sec. on this machine.)

Friday, May 5, 2023

LZ_XOR/LZ_ADD progress

I'm tired of all the endless LZ clones, so I'm trying something different.

I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)

These new instructions are like the usual LZ matches, except XOR's are followed by a list of entropy coded byte values that are XOR'd to the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win (to a big win), but I'm still benchmarking it.

The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own - because there's nobody to copy ideas from. The usual bag of parsing heuristic tricks that everybody copies from LZMA don't work anymore, or have to be modified.

One prototype is byte oriented and is somewhat fast to decompress (>1 GiB/sec.), the other is like LZMA and uses a bitwise range coder. Fuzzy matching is difficult but I've made a lot of headway. It's no longer a terrifying search problem, now it's just scary.

The ratio of XOR's vs. literals or COPY ops highly depends on the source data. On plain text XOR's are weak and not worth the trouble. They're extremely strong on audio and image data, and they excel on binary or structured content. 

With the LZMA-like codec LZ_XOR instructions using mostly 0 delta bytes can become so cheap to code they can be preferred over COPY's, which is at first surprising to see. It can be cheaper to extend an LZ_XOR with some more delta bytes vs. truncating one and starting a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's. 

COPY ops must stop on the first mismatch, while XOR ops can match right through minor mismatches and still have net gain. Adding XOR ops can drastically reduce the # of overall instructions the VM ("decompressor") has to process, and also gives the parser an expanded amount of freedom to trade off reduced instructions vs. ratio. It's not all about ratio, it's about decompression speed.

Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.

The other improvement that shows net gain on many files is to emit an optional "history distance delta offset" value. This allows the encoder to specify a [-128,127] offset relative to one of the "REP" match history distances. The offset is entropy coded.

Sunday, April 30, 2023

Somewhere Range Coding went off the rails

I've lost track of the number of Range Coders I've seen floating around for the past ~25 years. Subbotin's coders released in the late 90's/early 2000's seem to be very influential. Each implementation has different tradeoffs that greatly impact a vectorized implementation. There is no single definitive range coder design, and I've seen 3 very different ways to handle carries. I learned a lot implementing my first vectorized range decoder.

Most of the tradeoffs boil down to how carries are handled or how many state registers are used by the decoder. You can delay the carries during encoding (Subbotin's earliest "carry-less" or delayed carry encoder that I can find, which is very clever), or fixup the already encoded bytes by propagating the carries backwards in memory during encoding (LZMA's binary range coder, which is similar to Subbotin's), or move the carry logic into the decoder (another by Subbotin called the "Russian People's Range Coder"). There are also decoder variants that track low/code/range vs. just code/range (LZMA's and Subbotin's earliest).

Importantly, range coders using this renormalization method during decoding are unnecessarily complex:


Here's a different Subbotin variant that also uses overly complex and obfuscated renormalization logic during decoding. (Please just stop guys - this is the wrong direction.)

Why is all that messy conditional logic in the decoder renormalization step? It appears to be carry related, but the carries should have been handled during encoding in some way. It's unnecessary to create a working and correct range coder.

Instead do this, from Subbotin's first delayed carry range coder. This is easily vectorizable and he only uses 2 variables (code and range). This is quite similar to LZMA's and LZHAM's, except it doesn't need to propagate carries back in the buffer containing the output bytes.


Another important note: If you switch to 24-bits vs. 32-bits, you don't need to use integer division. (Note division isn't needed for a binary range coder. It's also not strictly needed for a range coder that only supports small alphabets. The division enables using a fast decoding table for large alphabets.) Instead you can use single precision division and truncate the results. This option is crucial for vectorization on modern CPU's. Modern vectorized single precision division is fairly fast now. Or said in a different way, it's not so slow that it's a show stopper. 

In vectorized implementations of both Huffman and range decoding, most of the instructions are actually not related to the actual entropy coding method. Instead they are common things like fetching from the decoder's lookup table (using some sort of manual or explicit gather), and fetching source bytes and distributing them to the lanes during renormalization. The FP division is a small part of the whole process. 

What seems to slow down vectorized range decoding is the more complex renormalization vs. Huffman, which in my implementation can fetch between [0,2] bytes. By comparison, my vectorized Huffman decoders fetch either 0 or 2 bytes per step.

All of these details are starting to matter now that rANS coding is falling under patent protection. Range Coding is a ~44 years old technique.

Comparing Vectorized Huffman and Range Decoding vs. rANS (Or: rANS entropy coding is overrated)

The point here is to show that both Huffman and range decoding are all vectorizable and competitive vs. rANS in software.

What I care about is fast entropy decoding. Even scalar encoding of any of these techniques is already fast enough for most use cases. For the distribution use case, you encode once on some server(s) and then distribute to millions of users.

I've constructed fast SSE 4.1 and AVX-2 vectorized decoders for range coding and Huffman coding. The SSE 4.1 vectorized 24-bit range decoder (using just 16 streams for simplicity) is here on github. This code is easily ported to AVX-2 and Huffman in a day or two. The Steam Hardware Survey shows that 99.58% of surveyed users (as of 3/23) support SSE 4.1, so it seems pointless to focus on or benchmark scalar decoders any more. 

There's nothing particularly special or exotic about these implementations. The range coder is LZMA's binary range coder (which is also used by LZHAM), which I adapted to handle 8-bit symbols with 12-bit probabilities and 24-bit registers. It appears quite similar to a range coder released by Dmitry Subbotin, except it handles carries differently. The Huffman decoder uses the Package Merge algorithm. 

The tricky renormalization step (where each vector lane is fed with more bytes from the source streams) is basically the same between rANS, Huffman and range decoding. Range decoding needs integer division which is scary at first, but the CPU engineers have solved that problem (if all you need is 24-bits). Also, each decoder uses some sort of table to accelerate decoding. 

The vectorized versions are surprisingly similar once implemented. They do some math, do a gather from a table, write the output symbols, do some more math, then fetch some bytes from the source buffer and renormalize and distribute to each lane in parallel these source bytes. Once you've implemented one well you've more or less got the others.

Here are some benchmarks on the same machine (a Dell Ice Lake laptop, Core i7 1065G7) on Calgary Corpus book1 (768,771 bytes). All coders use 8-bit symbols. 

SSE 4.1 Huffman/Range Decoding

  • Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
  • Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)

AVX-2 Huffman/Range Decoding

  • Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
  • Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes

Notes: The range coder uses 12 bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.

Collet's Scalar Huffman/FSE Decoding

Using his fast scalar Huffman and FSE encoders (FSE -b -h, or FSE -b):
  • Huffman: 840 MiB/sec., 439,150 bytes
  • FSE: 365 MiB/sec., 437,232 bytes
Also see FPC on github by Konstantinos Agiannis, which gets on a Core i5 ~814 MiB/sec. for scalar length-limited Huffman decoding.

Giesen's Vectorized rANS Decoding

Using exam_simd_sse41, running on the same machine using WSL and compiled with clang v10:

  • rANS: 213.9 MiB/s, 435,604 bytes
  • interleaved rANS: 324.7 MiB/s, 435,606 bytes
  • SIMD rANS: 556.5 MiB/s, 435,626 bytes

Under Windows 10 all code was built with MSVC 2019 x64.

It's easy to construct a Huffman decoding table that supplies 2-3 symbols per decode step.  (My Huffman benchmarks don't demonstrate this ability, but they could if I limited the max Huffman code size to 6 or 7 bits, which is practical for some problems.) AFAIK this ability has not been demonstrated with rANS. Until it is, Huffman isn't going anywhere. Considering rANS is patented, it's just not worth the risk and trouble for questionable decoding gains when usable alternatives exist. I'll revisit this in 20+ years (the lifetime of patents in the US).

Monday, April 24, 2023

Vectorized interleaved Range Coding using SSE 4.1

In order to avoid the current (and upcoming) ANS/rANS entropy coding patent minefield, we're avoiding it and using vectorized Range Coding instead. Here's a 24-bit SSE 4.1 example using 16 interleaved streams. This example decoder gets 550-700 megabytes/sec. with 8-bit alphabets on various Intel/AMD CPU's I've tried:

More on the rANS patent situation (from early 2022):

This decoder design is practical on any CPU or GPU that supports fast hardware integer or float division. It explicitly uses 24-bit registers to sidestep issues with float divides. I've put much less work on optimizing the encoder, but the key step (the post-encode byte swizzle) is the next bottleneck to address.

Friday, April 21, 2023

Faster LZ is not the answer to 150-250+ GB video game downloads

When the JPEG folks were working on image compression, they didn't create a better or faster LZ. Instead they developed new approaches. I see games growing >150GB and then graphs like this, and it's obvious the game devs are going in the wrong direction:


(Note these benchmarks are great and extremely useful.)

Separating out the texture encoding stage from the lossless stage is a compromise. I first did this in my "crunch" library around 15 years ago. It was called "RDO mode". You can swizzle the ASTC/BC1-7 bits before LZ, and precondition them, and that'll help, but the two steps are still disconnected. Instead combine the texture and compression steps (like crunch's .CRN mode - shipped by Unity for BC1-5.) 

Alternatively: defer computing the GPU texture data until right before it's actually needed and cache it. Ship the texture signal data using existing image compression technology, which at this point is quite advanced. For normal maps, customize or tune existing tech to handle them without introducing excessive angular distortion. I think both ideas are workable.

Also, these LZ codecs are too fast. They are designed for fast loading and streaming off SSD's. Who cares about shaving off a few hundred ms (or a second) when it takes hours or days to download the product onto the SSD?

Somebody could develop a 10x faster Oodle (or an Oodle that compresses 1-2% better) and we're still going to wait many hours or days to actually use the product. And then there's the constant updates. This approach doesn't scale.

It's fun and sexy to work on faster LZ but the real problem (and value add) doesn't call for better or more lossless tech. This is a distraction. If trends continue the downloads and updates will be measured in terms of fractional or 1+ week(s).

Sunday, April 2, 2023

The Dark Horse of the Image Codec World: Near-Lossless Image Formats Using Ultra-Fast LZ Codecs

I think simple ultra-high speed lossy (or near-lossless) image codecs, built from the new generation of fast LZ codecs, are going to become more relevant in the future.

Computing bottlenecks change over time. As disk space, disk bandwidth, and internet bandwidth increases, older image codecs that squeeze every last bit out of the resulting file become less valuable for many use cases. Eventually websites or products using noticeably lossy compressed images will be less attractive. The bandwidth savings from overly lossy image codecs will become meaningless, and the CPU/user time and battery or grid energy spent on complex decompression steps will be wasted.

Eventually, much simpler codecs with lower (weaker) compression ratios that introduce less distortion, but have blazingly fast decompression rates are going to become more common. This core concept motivates this work.

One way to construct a simple lossless or lossy image codec with fast decompression is to combine a custom encoding tool with the popular lossless LZ4 codec. LZ4's compression and decompression is extremely fast, and the library is reliable, updated often, extensively fuzz tested, and very simple to use. 

To make it lossy, the encoder needs to precondition the image data so when it's subsequently compressed by LZ4, the proportion of 4+ byte matches vs. literals is increased compared to the original image data. I've been constructing LZ Preconditioners, and building new LZ codecs that amend themselves to this preconditioning step, over the past year.

Such a codec will not be able to compete against JPEG, WebP, JPEG 2000, etc. for perceived quality per bit. However, it'll be extremely fast to decode, very simple, and will likely not bloat executables because the LZ4 library is already present in many codebases. Using LZ4 introduces no new security risks.

This LZ preconditioning step must be done in a way that minimizes obvious visible artifacts, as perceived by the Human Visual System (HVS). This tradeoff, of increasing distortion but reducing the bitrate is a classic application of Rate-distortion theory. This is well-known in video coding, and now in GPU texture encoding (which I introduced in 2009 with my "crunch" compression library).

The rdopng tool on github supports creating lossy LZ4 compressed RGB/RGBA images using a simple rate-distortion model. (Lossless is next, but I wanted to solve the harder problem first.) During the preconditioning step, the LZ4 rate in bits is approximated using a sliding dictionary and a match finder. For each potential match replacement which would introduce distortion into the lookahead buffer, the preconditioner approximates the introduced visual error by computing color error distances in a scaled Oklab perceptual colorspace. (Oklab is one of the most powerful colorspaces I've used for this sort of work. There are better colorspaces for compression, but Oklab is simple to use and well-documented.)

Perceptually, distortions introduced into regions of images surrounded by high frequency details are less noticeable vs. regions containing smooth or gradient features. Before preconditioning, the encoder computes two error scaling masks which indicate which areas of the image contains large or small gradients/smooth regions. These scaling masks suppress introducing distortions (by using longer or more matches) if doing so would be too noticeable to the HVS. This step has a large impact on bitrate and can be improved.

To speed up encoding, the preconditioner only examines a window region above and to the left of the lookahead buffer. LZ4's unfortunate minimum match size of 4 bytes complicates encoding of 24bpp RGB images. Encoding is not very fast due to this search process, but it's possible to thread it by working on different regions of the image in parallel. The encoder is a proof of principle and testing grounds, and not as fast as it could be, but it works.

The encoder also supports angular error metrics for encoding tangent space normal maps.

LZ4I images are trivial to decode in any language. A LZ4I image consists of a simple header followed by the LZ4 compressed 24bpp RGB (R first) or RGBA pixel data:

Some example encodings:

Lossless original PNG image, 19.577 bits/pixel PNG, or 20.519 bits/pixel LZ4I:

Lossy LZ4I: 42.630 709 Y dB,  12.985 bits/pixel, 1.1 gigapixels/sec. decompression rate using LZ4_decompress_safe() (on a mobile Core i7 1065G7 at 1.3GHz base clock):

Biased delta image:

Greyscale histogram of biased delta image:

A more lossy LZ4I encoding, 38.019 709 Y dB, 8.319 bits/pixel:

Biased delta image:

Greyscale delta image histogram:

Lossless original: 14.779 bits/pixel PNG, 16.551 bits/pixel LZ4I:

Lossy LZ4I: 45.758 709 Y dB, 7.433 bits/pixel (less than half the size vs. lossless LZ4I!):

Biased delta image:

rdopng also supports lossy QOI and PNG encoding. QOI is a particularly attractive for lossy compression because the encoding search space is tiny, however decompression is slower. Lossy QOI encoding is extremely fast vs. PNG/LZ4.

It's also possible to construct specialized preconditioners for other LZ codecs, such as LZMA, Brotli, Zstandard, or LZSSE. Note the LZ4 preconditioner demonstrated here is universal (i.e. it's compatible with any LZ codec) because it just introduces more or longer matches, but it doesn't exploit the individual LZ commands supported by each codec.

LZSSE is particularly attractive as a preconditioning target because it's 30-40% faster than LZ4 and has a higher ratio. This is next. A format that uses tiled decompression and multiple threads is also a no-brainer. Ultimately I think LZ or QOI-like variants will be very strong contenders in the future.

Thursday, March 30, 2023

EA/Microsoft Neural Network GPU Texture Compression Patents

Both Microsoft and EA have patented various ways of bolting on neural networks to GPU (ASTC/BC7/BC6h) texture compressors, in order to accelerate determining the compression params (mode, partition etc.):

EA: https://patents.google.com/patent/US10930020B2/en

Microsoft: https://patents.google.com/patent/US10504248B2/en

What this boils down to: techniques like TexNN are potentially patented. This work was done in 2017/2018 but it wasn't public until 2019:

Over a decade ago, I and others researched attempting to bolt real-time GPU tex compressors after existing lossy compressors (like JPEG etc.). The results suffer because you're dealing with 2 generations of lossy artifacts. Also existing image compressors aren't designed with normal maps etc. 

If you're working on a platform with limited computing resources (the web, mobile), or limited to no SIMD (web), real-time recompression has additional energy and resource constraints.

Both my "crunch" library and Basis Universal bypass the need for fast real-time texture compression entirely. They compress directly in the GPU texture domain. This approach is used by Unity and many AAA video games:

I've been researching the next series of codecs for Basis Universal. This is why I wrote RDO PNG, QOI and LZ4, and bc7enc_rdo.

I am strongly anti-software patent. All the EA/MS patents will do is push developers towards open solutions, which will likely be better in the long run anyway. Their patents add nothing and were obvious. These corporations incentivize their developers to patent everything they can get their hands on (via bonuses etc.), which ultimately results in exploiting the system by patenting trivial or obvious ideas. Ultimately this slows innovation and encourages patent wars, which is bad for everybody.

It's possible to speed up a BC7 encoder without using neural networks. An encoder can first try the first 16 partition patterns, and find which is best. The results can be used to predict which more complex patterns are likely to improve the results. See the table and code here - this works:

It's potentially possible to use a 3 or 4 level hierarchy to determine the BC7 partition pattern. bc7enc.cpp only uses 2 levels. This would reduce the # of partition patterns to examine to only a handful.

To cut down the # of BC7 modes to check, you can first rule out mode 7 because it's only useful for blocks containing alpha. Then try mode 6. If it's good enough, stop. Then try mode 1. If that's good enough, stop, etc. Only a subset of blocks need to use the complex multiple subset modes. In many cases the 3 subset modes can be ignored entirely with little noticeable impact on the results. The component "rotation" feature is usually low value.

These optimizations cause divergence in SIMD encoders, unfortunately. The Neural Network encoders also suffer from the same problem. Neural Network encoders also must be trained, and if the texture to compress doesn't resemble the training data they could have severe and unpredictable quality cliffs. 

TexNN was first published here: