Richard Geldreich's Blog

Tuesday, December 5, 2023

Announcing CPNG, "Compatible Network Graphics", a backwards compatible fork of PNG

CPNG ("Compatible Network Graphics") is a 100% backwards compatible fork of the ~30 year old PNG image format, which is still thoroughly, and probably forever, stuck in the 1990's. I've been quietly designing, prototyping, and shipping CPNG's key features over the past couple years on github.

Why continue messing with PNG at all? Because if you can figure out how to squeeze new features into PNG in a backwards compatible way, it's instantly compatible with all browsers, OS's, engines, etc. This is valuable.

I realized, as long as my software is significantly faster for encoding/decoding (>10x encode, >2-3x decode - per thread) and is 100% backwards compatible, I'll have gathered the momentum and library adoption to also add some badly needed new features: faster encoding/decoding by constraining and simplifying the Deflate stream to pixels vs. bytes, SIMD encoding, multithreading so performance can scale up to modern systems, and real HDR pixels (FP16 IEEE half float and LOGLUV32) so it scales to modern and future HDR monitors/tool chains.

Crucially, the CPNG data stream must be entirely backwards compatible with the existing PNG format, like how color TV was introduced but it still worked on B&W TV's. New features can be added, as long as existing decoders ignore them and return a viewable image:

1. Constrained Deflate for huge (10-25x) encoding speedups vs. most other libraries, and 2-3x decoding speedups: this has shipped already in fpng/fpnge.

I documented the Constrained Deflate feature utilized by fpng here:

https://github.com/richgel999/fpng/wiki/fdEC-PNG-chunk-definition

Other encoder/decoder authors can utilize this feature today, just by following this spec. CPNG will leverage this approach at first, before moving to a SIMD encoding approach (like Google's fpnge utilizes).

2. Multithreaded encoding/decoding using multiple IDAT's: Apple does this already.

We're putting a seek (or offset) table into the CPNG ancillary chunk. The image will be optionally parallel encodable and decodable in strips.

This also parallelizes the painful Alder-32 and CRC-32 steps. To remain backwards compatible, the CPNG format still has to (somewhat ridiculously) compute both checksums, putting it at a disadvantage vs. other modern image formats. (Who thought putting two checksums into an image format was a good idea, anyway?)

3. FP16 (IEEE 754-2008 half float) support, but with a lossless tone mapped fallback. One approach has already been demonstrated in the png16 repo. This feature exploits PNG's already built-in support for 16-bit pixels.

Here are some encoded 48-bit PNG example images. These images are losslessly tone mapped:

https://github.com/richgel999/png16/tree/main/bin

These PNG files are tone mapped 16-bit images, but internally they are also standard half float images. A small 256 entry lookup table is stored in an ancillary chunk, so a CPNG HDR decoder can retrieve the half float pixels from the 16-bit unsigned pixels. See the png16 repo source for more details on how this approach works.

HDR CPNG images must remain viewable, in some reasonable way, by all browsers and OS's that only support PNG, even if they are HDR images. This requires a lossless tonemap operator of some sort, so the legacy PNG image is viewable as an SDR image, but the HDR data can always be losslessly recovered using a simple and very fast procedure.

4. LOGLUV32 - This is Ward's very space efficient, perceptually lossless HDR pixel format:

"The LogLuv Encoding for Full Gamut, High Dynamic Range Images, Gregory Ward Larson, Silicon Graphics, Inc."

The L16 portion is stored in the legacy PNG image as a 16-bit grayscale image, so the basic image remains viewable as a SDR PNG file. We'll have to losslessly tonemap these log luminance pixels, using a technique similar to the approach used for FP16 pixels.

The U8V8 portion (color) is stored in ancillary chunk(s), compressed as another 8,8 image. For alpha support, the CPNG file can exploit the LA16 vs. L16 portion of the PNG file. These ancillary chunks will be ignored by old PNG readers.

The separate chunk for UV color data is needed, otherwise I don't see how the 16-bits of UV color can be placed into the PNG without older viewers displaying a totally unviewable image. Older PNG readers viewing/previewing just tone mapped luminance should be fine.

The claimed rANS decoding advantages are being exaggerated

To put it simply:

SSE 4.1, on an Intel Ice Lake CPU:

SIMD Range: 64 streams: 738 MiB/sec., 436,349 bytes

SIMD Range: 16 streams: 619.1 MiB/sec.

SIMD rANS: 8 streams: 556.5 MiB/s, 435,626 bytes

SIMD Huffman: 32 streams: 1.1 GiB/sec.

The vectorized 24-bit Pavlov/Subbotin range coder is here.

The rANS decoder I'm comparing against appears to be well implemented. (If I could use rANS, which I can't because it's patented, I would start with this code.) It could be unrolled more to 16 (vs. 8) or more streams, and that's going to boost decoding perf. some. Let's say ~25% faster, so ~695 MiB/sec. Range is still comparable, and Huffman is of course faster. The benchmark data is showing it's pointless to risk infringement.

Also, if you show me probably any vectorized rANS decoder, many of the vectorization techniques also easily apply to SIMD range and Huffman decoders too. They each have to do many of the same things, and this helps equalize the perceived performance advantages of rANS decoding.

Ultimately working around range coding's division requirement is not the biggest problem. It's the gathers and keeping all the lanes normalized and fed efficiently - which are the same problems the other decoders face. Once you solve these problems for one entropy coder you can apply them to the others. Many of the implementation differences of the actual entropy coding technique melt away once vectorized.

Note encoding is a different beast. I've been unable to see a way to easily interleave the output bytes from X parallel range coders. The SSE 4.1 range sample code linked to above has to swizzle the bytes after encoding into a single stream, however there is no special side band or signaling information needed (just like with rANS). (This is just like my LZHAM codec, which has been interleaving arithmetic and Huffman coded streams with no extra signaling or side band data for many years now - including before FSE/ANS was a thing.) With rANS encoding interleaving the outputs seems very easy. With more research into other entropy coders perhaps it's possible to eliminate this advantage too.

A straightforward AVX-2 port of this SSE 4.1 range coder is ~65% faster on Ice Lake, putting it roughly in the same ballpark as the AVX-2 rANS decoders I've seen benchmarked. This makes sense because many of the bottlenecks are similar and have little to do with the actual entropy coding technique.

(Note I could upload my SSE 4.1 Huffman decoder to github, but honestly 1.1 GiB/sec. is not a big deal. Fast scalar Huffman on my machine gets ~800 MiB/sec. SSE 4.1 lacks easy variable per-lane bit shifting which hurts Huffman a lot. Straightforward AVX-2 Huffman decoding with an 8-bit alphabet gets ~1.8 GiB/sec. on this machine.)

Friday, May 5, 2023

LZ_XOR/LZ_ADD progress

I'm tired of all the endless LZ clones, so I'm trying something different.

I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)

These new instructions are like the usual LZ matches, except XOR's are followed by a list of entropy coded byte values that are XOR'd to the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win (to a big win), but I'm still benchmarking it.

The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own - because there's nobody to copy ideas from. The usual bag of parsing heuristic tricks that everybody copies from LZMA don't work anymore, or have to be modified.

One prototype is byte oriented and is somewhat fast to decompress (>1 GiB/sec.), the other is like LZMA and uses a bitwise range coder. Fuzzy matching is difficult but I've made a lot of headway. It's no longer a terrifying search problem, now it's just scary.

The ratio of XOR's vs. literals or COPY ops highly depends on the source data. On plain text XOR's are weak and not worth the trouble. They're extremely strong on audio and image data, and they excel on binary or structured content.

With the LZMA-like codec LZ_XOR instructions using mostly 0 delta bytes can become so cheap to code they can be preferred over COPY's, which is at first surprising to see. It can be cheaper to extend an LZ_XOR with some more delta bytes vs. truncating one and starting a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's.

COPY ops must stop on the first mismatch, while XOR ops can match right through minor mismatches and still have net gain. Adding XOR ops can drastically reduce the # of overall instructions the VM ("decompressor") has to process, and also gives the parser an expanded amount of freedom to trade off reduced instructions vs. ratio. It's not all about ratio, it's about decompression speed.

Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.

The other improvement that shows net gain on many files is to emit an optional "history distance delta offset" value. This allows the encoder to specify a [-128,127] offset relative to one of the "REP" match history distances. The offset is entropy coded.

Sunday, April 30, 2023

Somewhere Range Coding went off the rails

I've lost track of the number of Range Coders I've seen floating around for the past ~25 years. Subbotin's coders released in the late 90's/early 2000's seem to be very influential. Each implementation has different tradeoffs that greatly impact a vectorized implementation. There is no single definitive range coder design, and I've seen 3 very different ways to handle carries. I learned a lot implementing my first vectorized range decoder.

Most of the tradeoffs boil down to how carries are handled or how many state registers are used by the decoder. You can delay the carries during encoding (Subbotin's earliest "carry-less" or delayed carry encoder that I can find, which is very clever), or fixup the already encoded bytes by propagating the carries backwards in memory during encoding (LZMA's binary range coder, which is similar to Subbotin's), or move the carry logic into the decoder (another by Subbotin called the "Russian People's Range Coder"). There are also decoder variants that track low/code/range vs. just code/range (LZMA's and Subbotin's earliest).

Importantly, range coders using this renormalization method during decoding are unnecessarily complex:

https://github.com/jkbonfield/rans_static/blob/master/arith_static.c#L222

Here's a different Subbotin variant that also uses overly complex and obfuscated renormalization logic during decoding. (Please just stop guys - this is the wrong direction.)

Why is all that messy conditional logic in the decoder renormalization step? It appears to be carry related, but the carries should have been handled during encoding in some way. It's unnecessary to create a working and correct range coder.

Instead do this, from Subbotin's first delayed carry range coder. This is easily vectorizable and he only uses 2 variables (code and range). This is quite similar to LZMA's and LZHAM's, except it doesn't need to propagate carries back in the buffer containing the output bytes.

https://gist.github.com/richgel999/d522e318bf3ad67e019eabc13137350a

Another important note: If you switch to 24-bits vs. 32-bits, you don't need to use integer division. (Note division isn't needed for a binary range coder. It's also not strictly needed for a range coder that only supports small alphabets. The division enables using a fast decoding table for large alphabets.) Instead you can use single precision division and truncate the results. This option is crucial for vectorization on modern CPU's. Modern vectorized single precision division is fairly fast now. Or said in a different way, it's not so slow that it's a show stopper.

In vectorized implementations of both Huffman and range decoding, most of the instructions are actually not related to the actual entropy coding method. Instead they are common things like fetching from the decoder's lookup table (using some sort of manual or explicit gather), and fetching source bytes and distributing them to the lanes during renormalization. The FP division is a small part of the whole process.

What seems to slow down vectorized range decoding is the more complex renormalization vs. Huffman, which in my implementation can fetch between [0,2] bytes. By comparison, my vectorized Huffman decoders fetch either 0 or 2 bytes per step.

All of these details are starting to matter now that rANS coding is falling under patent protection. Range Coding is a ~44 years old technique.

Comparing Vectorized Huffman and Range Decoding vs. rANS (Or: rANS entropy coding is overrated)

The point here is to show that both Huffman and range decoding are all vectorizable and competitive vs. rANS in software.

What I care about is fast entropy decoding. Even scalar encoding of any of these techniques is already fast enough for most use cases. For the distribution use case, you encode once on some server(s) and then distribute to millions of users.

I've constructed fast SSE 4.1 and AVX-2 vectorized decoders for range coding and Huffman coding. The SSE 4.1 vectorized 24-bit range decoder (using just 16 streams for simplicity) is here on github. This code is easily ported to AVX-2 and Huffman in a day or two. The Steam Hardware Survey shows that 99.58% of surveyed users (as of 3/23) support SSE 4.1, so it seems pointless to focus on or benchmark scalar decoders any more.

There's nothing particularly special or exotic about these implementations. The range coder is LZMA's binary range coder (which is also used by LZHAM), which I adapted to handle 8-bit symbols with 12-bit probabilities and 24-bit registers. It appears quite similar to a range coder released by Dmitry Subbotin, except it handles carries differently. The Huffman decoder uses the Package Merge algorithm.

The tricky renormalization step (where each vector lane is fed with more bytes from the source streams) is basically the same between rANS, Huffman and range decoding. Range decoding needs integer division which is scary at first, but the CPU engineers have solved that problem (if all you need is 24-bits). Also, each decoder uses some sort of table to accelerate decoding.

The vectorized versions are surprisingly similar once implemented. They do some math, do a gather from a table, write the output symbols, do some more math, then fetch some bytes from the source buffer and renormalize and distribute to each lane in parallel these source bytes. Once you've implemented one well you've more or less got the others.

Here are some benchmarks on the same machine (a Dell Ice Lake laptop, Core i7 1065G7) on Calgary Corpus book1 (768,771 bytes). All coders use 8-bit symbols.

SSE 4.1 Huffman/Range Decoding

Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)

AVX-2 Huffman/Range Decoding

Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes

Notes: The range coder uses 12 bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.

Collet's Scalar Huffman/FSE Decoding

Using his fast scalar Huffman and FSE encoders (FSE -b -h, or FSE -b):

Huffman: 840 MiB/sec., 439,150 bytes
FSE: 365 MiB/sec., 437,232 bytes

Also see FPC on github by Konstantinos Agiannis, which gets on a Core i5 ~814 MiB/sec. for scalar length-limited Huffman decoding.

Giesen's Vectorized rANS Decoding

Using exam_simd_sse41, running on the same machine using WSL and compiled with clang v10:

rANS: 213.9 MiB/s, 435,604 bytes
interleaved rANS: 324.7 MiB/s, 435,606 bytes
SIMD rANS: 556.5 MiB/s, 435,626 bytes

Under Windows 10 all code was built with MSVC 2019 x64.

It's easy to construct a Huffman decoding table that supplies 2-3 symbols per decode step. (My Huffman benchmarks don't demonstrate this ability, but they could if I limited the max Huffman code size to 6 or 7 bits, which is practical for some problems.) AFAIK this ability has not been demonstrated with rANS. Until it is, Huffman isn't going anywhere. Considering rANS is patented, it's just not worth the risk and trouble for questionable decoding gains when usable alternatives exist. I'll revisit this in 20+ years (the lifetime of patents in the US).

Monday, April 24, 2023

Vectorized interleaved Range Coding using SSE 4.1

In order to avoid the current (and upcoming) ANS/rANS entropy coding patent minefield, we're avoiding it and using vectorized Range Coding instead. Here's a 24-bit SSE 4.1 example using 16 interleaved streams. This example decoder gets 550-700 megabytes/sec. with 8-bit alphabets on various Intel/AMD CPU's I've tried:

https://github.com/richgel999/sserangecoding

More on the rANS patent situation (from early 2022):

https://www.theregister.com/2022/02/17/microsoft_ans_patent/

This decoder design is practical on any CPU or GPU that supports fast hardware integer or float division. It explicitly uses 24-bit registers to sidestep issues with float divides. I've put much less work on optimizing the encoder, but the key step (the post-encode byte swizzle) is the next bottleneck to address.

Friday, April 21, 2023

Faster LZ is not the answer to 150-250+ GB video game downloads

When the JPEG folks were working on image compression, they didn't create a better or faster LZ. Instead they developed new approaches. I see games growing >150GB and then graphs like this, and it's obvious the game devs are going in the wrong direction:

https://aras-p.info/blog/2023/01/31/Float-Compression-2-Oodleflate/

(Note these benchmarks are great and extremely useful.)

Separating out the texture encoding stage from the lossless stage is a compromise. I first did this in my "crunch" library around 15 years ago. It was called "RDO mode". You can swizzle the ASTC/BC1-7 bits before LZ, and precondition them, and that'll help, but the two steps are still disconnected. Instead combine the texture and compression steps (like crunch's .CRN mode - shipped by Unity for BC1-5.)

Alternatively: defer computing the GPU texture data until right before it's actually needed and cache it. Ship the texture signal data using existing image compression technology, which at this point is quite advanced. For normal maps, customize or tune existing tech to handle them without introducing excessive angular distortion. I think both ideas are workable.

Also, these LZ codecs are too fast. They are designed for fast loading and streaming off SSD's. Who cares about shaving off a few hundred ms (or a second) when it takes hours or days to download the product onto the SSD?

Somebody could develop a 10x faster Oodle (or an Oodle that compresses 1-2% better) and we're still going to wait many hours or days to actually use the product. And then there's the constant updates. This approach doesn't scale.

It's fun and sexy to work on faster LZ but the real problem (and value add) doesn't call for better or more lossless tech. This is a distraction. If trends continue the downloads and updates will be measured in terms of fractional or 1+ week(s).