Tuesday, September 27, 2016

How to use crunch's GPU block encoder test vector generator

This option selects a different mode of operation from crunch's usual texture file conversion role. It causes the tool to crawl through a directory and load every .PNG file it finds there. It then randomly selects a percentage of the 4x4 pixel blocks from each image and appends the selected blocks into one or more 4096x4096 output images. These output images can then be used as test vectors to compare different block encoders.

crunch -corpus_gen -deep .035 -width 4096 -height 4096 -in J:\dev\test_images\*.png

You can specify multiple -in arguments, and -in @file.txt reads a text listing file of files/directories to load or scan.
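Here's a minimal sketch of the block sampling idea, under my own assumptions (the Image struct, helper name, and packing order are hypothetical, not crunch's actual code):

#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of the corpus generator's inner loop: copy a randomly
// chosen fraction ("deep") of each source image's 4x4 blocks into a large
// output atlas. A real implementation would start a new output image once
// the atlas fills up.
struct Image { int width, height; std::vector<uint32_t> pixels; /* RGBA */ };

static void append_random_blocks(const Image& src, Image& atlas,
                                 int& next_block, float deep /* e.g. .035 */)
{
    const int blocks_x = src.width / 4, blocks_y = src.height / 4;
    const int atlas_blocks_x = atlas.width / 4;
    for (int by = 0; by < blocks_y; by++)
        for (int bx = 0; bx < blocks_x; bx++)
        {
            if (rand() / float(RAND_MAX) > deep)
                continue; // keep only ~deep of the source blocks
            // Destination block position in the 4096x4096 atlas.
            int dx = (next_block % atlas_blocks_x) * 4;
            int dy = (next_block / atlas_blocks_x) * 4;
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++)
                    atlas.pixels[(dy + y) * atlas.width + dx + x] =
                        src.pixels[(by * 4 + y) * src.width + bx * 4 + x];
            next_block++;
        }
}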

The -corpus_test option can be used to compare the different DXT encoders supported by crunch, using images generated using -corpus_gen.

Here's a very zoomed-in example from the test vector generator:



Notice how the blocks are sorted using the sum of the R, G, and B standard deviations as the key.
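A minimal sketch of how that sort key could be computed for a single 4x4 block (the block layout and function name are my assumptions, not crunch's actual code):

#include <cmath>
#include <cstdint>

// Hypothetical 4x4 RGBA block: 16 pixels, 4 bytes each (R, G, B, A).
struct Block16 { uint8_t pixels[16][4]; };

// Sort key: sum of the per-channel (R, G, B) standard deviations.
static float block_sort_key(const Block16& b)
{
    float key = 0.0f;
    for (int c = 0; c < 3; c++)
    {
        float sum = 0.0f, sum2 = 0.0f;
        for (int i = 0; i < 16; i++)
        {
            float v = b.pixels[i][c];
            sum += v;
            sum2 += v * v;
        }
        float mean = sum / 16.0f;
        float var = sum2 / 16.0f - mean * mean; // population variance
        key += std::sqrt(var > 0.0f ? var : 0.0f);
    }
    return key;
}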

Sunday, September 25, 2016

More on SSIM

This paper is referenced in the SSIM article on Wikipedia:

"A comprehensive assessment of the structural similarity index"
http://link.springer.com/article/10.1007/s11760-009-0144-1
"In this paper, it is shown, both empirically and analytically, that the index is directly related to the conventional, and often unreliable, mean squared error. In the first evaluation, the two metrics are statistically compared with one another. Then, in the second, a pair of functions that algebraically connects the two is derived. These results suggest a much closer relationship between the structural similarity index and mean squared error."
"This research, however, appears to be the first to directly consider the statistical relationships between the two methods. As well, this work develops a pair of mathematical functions that directly link the two. Given these findings, one is left to question whether the structural similarity index is ready for widespread adoption."
Interesting! I get the feeling there's more to SSIM than meets the eye. Unfortunately, this paper is behind a paywall. Another quote from the paper:
"These findings suggest a reasonably significant level of correlation between the SSIM and MSE. Values range from r = 0.6364 to r = 1.0000, with an average of r = 0.9116 and a variance of 0.007. An average this large, along with a small variance, suggests that most of the correlations are decidedly significant. Clearly, when ordering coded images, the SSIM and MSE often choose similar arrangements. Results such as this are likely a sign of a deeper relationship between the two methods."
Hmm, okay. So MSE and SSIM are highly correlated. The paper even has simple algorithms to convert between MSE<->SSIM. Perhaps I could use these algorithms to help optimize my SSIM code. (Just joking.) From the conclusion:
"Collectively, these findings suggest that the performance of the SSIM is perhaps much closer to that of the MSE than some might claim. Consequently, one is left to question the legitimacy of many of the applications of the SSIM."
Got it. Here's another interesting paper, this one not behind a paywall:

"Mystery behind similarity measures MSE and SSIM"
https://pdfs.semanticscholar.org/8a92/541e46fc4b8237c4e611401d601c8ecc6893.pdf

Some quotes:
"We see that it is based on the same sample moments and correlation coefficient as MSE. So this is the first observation/property or mystery revealed about MSE and SSIM: both measures are composed of the same parameters which are only combined in a different way."
"So the third observation for SSIM is its instability around zero point (0,0) and the fourth one – it can be used only for data of the same sign. The authors of SSIM solve these problems by introducing small constants and restricting the usage to non-negative data only, respectively."
"The fifth observation for Dice measure and thus for SSIM too is that it depends on the absolute values of input parameters. First, it is insensitive at all if one of the parameters is equal 0. Secondly, its sensitivity is decreasing by the increase of absolute parameter values."
Hmm, none of that sounds great to me. They go on to introduce their own metric they call CMSC, and claim "all proposed measures are free of drawbacks of MSE and SSIM and thus are more suitable as objective similarity/quality measures not only for the images but any signals."

John Brooks at Blue Shift experimented with using SSIM in his new ETC1/2 encoder, etc2comp. In a conversation about SSIM, he said:
"It [SSIM] becomes insensitive in high-contrast areas. SSIM is all about matching contrast & structure. But Block Truncation Coding by its nature is increasing contrast because it posterizes color transitions to 4 selector values. This made the encoder freak out and try to reduce contrast to compensate, making the encoding look crappy. I think it might be the right tool for high-level jobs, but was a poor tool for driving low-level encoder behavior."
"BTC trades 16 shades for 4 which means sharper transitions and more contrast when measured against the original. It also usually means less structure than the original due to posterizing 16-to-4. But neither artifact can be controlled by the encoder as they are a result of the encoding, so it's very hard to navigate the encoding search space when SSIM is so outside its design parameters."
Sounds pretty reasonable to me. I'm going to be doing some testing using an ETC1 encoder optimized for SSIM very soon. Let's see what happens.

Image error metrics

While developing and refining crunch I used a matrix of statistics like this:

RGB Total   Error: Max:  73, Mean: 17.404, MSE: 176.834, RMSE: 13.298, PSNR: 25.655, SSIM: 0.000000
RGB Average Error: Max:  73, Mean: 5.801, MSE: 58.945, RMSE: 7.678, PSNR: 30.426, SSIM: 0.907993
Luma        Error: Max:  64, Mean: 4.640, MSE: 37.593, RMSE: 6.131, PSNR: 32.380, SSIM: 0.945000
Red         Error: Max:  69, Mean: 5.387, MSE: 52.239, RMSE: 7.228, PSNR: 30.951, SSIM: 0.921643
Green       Error: Max:  70, Mean: 5.052, MSE: 48.298, RMSE: 6.950, PSNR: 31.291, SSIM: 0.934051
Blue        Error: Max:  73, Mean: 6.966, MSE: 76.296, RMSE: 8.735, PSNR: 29.306, SSIM: 0.868285

I computed these stats from a PNG image uploaded by @dougallj showing the progress he's been making on his experimental ETC1 encoder with kodim18, originally from here:


The code that computes this stuff is actually used by the DXT1 front-end to determine how the 8x8 "macroblocks" should be tiled.

The per-channel stuff is useful for debugging, and for tuning the encoder's perceptual RGB weights (which are only used when the compressor is in perceptual mode). Per-channel stats are also useful for getting a rough idea of what weights a closed-source block encoder uses.
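For reference, here's a minimal sketch of how the per-channel Max/Mean/MSE/RMSE/PSNR figures above can be computed; the struct and function names are mine, and the 255 peak value assumes 8-bit channels (this is not crunch's actual code):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct ErrorStats { int max_err; double mean, mse, rmse, psnr; };

// Compute Max/Mean/MSE/RMSE/PSNR for one 8-bit channel of two same-sized images.
static ErrorStats compute_channel_stats(const std::vector<uint8_t>& a,
                                        const std::vector<uint8_t>& b)
{
    ErrorStats s = { 0, 0.0, 0.0, 0.0, 0.0 };
    const size_t n = a.size();
    for (size_t i = 0; i < n; i++)
    {
        int e = std::abs(int(a[i]) - int(b[i]));
        s.max_err = std::max(s.max_err, e);
        s.mean += e;
        s.mse += double(e) * e;
    }
    s.mean /= n;
    s.mse /= n;
    s.rmse = std::sqrt(s.mse);
    // PSNR relative to the 8-bit peak of 255; clamped for identical images.
    s.psnr = s.mse > 0.0 ? 10.0 * std::log10(255.0 * 255.0 / s.mse) : 99.0;
    return s;
}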

Here's a useful PCA paper I found while writing HW1's renderer

I used this technique in a real-time GPU DXT1 encoder I wrote around 10 years ago:

"Candid Covariance-Free Incremental Principal Component Analysis"
http://www.cse.msu.edu/~weng/research/CCIPCApami.pdf

With this approach you can compute a decent-enough PCA in a few lines of shader code.
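Here's a rough sketch of the basic idea applied to finding the principal axis of a block's color cloud: the covariance-free incremental update from the paper, with the amnesic parameter omitted for simplicity. The vector helpers and initialization are my assumptions, not the paper's or HW1's actual code:

#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 add(Vec3 a, Vec3 b)    { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3 sub(Vec3 a, Vec3 b)    { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3 scale(Vec3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
static float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Estimate the principal axis of a set of colors with a covariance-free
// incremental PCA update (after Weng et al.), i.e. without ever forming the
// 3x3 covariance matrix.
static Vec3 principal_axis_ccipca(const Vec3* colors, int count)
{
    // Mean-center the samples around the average color.
    Vec3 mean = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < count; i++) mean = add(mean, colors[i]);
    mean = scale(mean, 1.0f / count);

    Vec3 v = { 1.0f, 1.0f, 1.0f }; // initial axis guess
    for (int i = 0; i < count; i++)
    {
        Vec3 u = sub(colors[i], mean);
        float n = float(i + 1);
        float vlen = std::sqrt(dot(v, v));
        if (vlen < 1e-8f) { v = u; continue; }
        // v <- ((n-1)/n) * v + (1/n) * u * dot(u, v/|v|)
        v = add(scale(v, (n - 1.0f) / n), scale(u, dot(u, v) / (vlen * n)));
    }
    return v; // unnormalized estimate of the dominant eigenvector
}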

HW1 used this encoder to compress all of the GPU splatted terrain textures into a GPU texture cache. One of my coworkers, Colt McAnlis, designed and wrote the game's amazing terrain texture caching system.

SSIM

Alright, I'm implementing SSIM. There are like 30 different implementations on the web, and most either rely on huge dependencies like OpenCV or have crappy licenses. So which one do I compare mine to? The situation with SSIM seems worse than PSNR. There are just so many variations on how to compute this thing.

I'm choosing this implementation for comparison purposes, because I already have the fundamental image processing primitives handy:

http://mehdi.rabah.free.fr/SSIM/SSIM.cpp

On Multi-Scale SSIM: I've been given conflicting information on whether or not this is actually useful to me. Let's first try regular SSIM.
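For reference, the per-window SSIM term itself is simple once the local statistics are in hand. Here's a minimal sketch of the standard single-scale formula for 8-bit data (my own simplification, not the implementation linked above):

#include <cmath>

// Single-window SSIM from precomputed local statistics (standard constants
// for 8-bit data: C1 = (0.01*255)^2, C2 = (0.03*255)^2).
static double ssim_window(double mean_x, double mean_y,
                          double var_x, double var_y, double cov_xy)
{
    const double C1 = (0.01 * 255.0) * (0.01 * 255.0);
    const double C2 = (0.03 * 255.0) * (0.03 * 255.0);
    double num = (2.0 * mean_x * mean_y + C1) * (2.0 * cov_xy + C2);
    double den = (mean_x * mean_x + mean_y * mean_y + C1) * (var_x + var_y + C2);
    return num / den;
}
// The overall SSIM score is the mean of ssim_window() over all (typically
// Gaussian-weighted) local windows in the image.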

For testing, I compared my implementation, which uses my own float image processing code, against the code above, which uses doubles and OpenCV. To generate some distorted test images, I loaded kodim18 into Paint Shop Pro X8 and saved it at various JPEG quality levels from 1-99. I then ran the two tools and graphed the results in Excel:




The X axis represents the quality levels, from highest to lowest quality, and the Y axis is SSIM. The 12 PSP JPEG quality levels tested are 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 99.

Thanks to John Brooks at Blue Shift for feedback on this post.

Friday, September 23, 2016

About the HW1 codebase having "too many globals"

First off, this project was a death march. What Paul Bettner (formerly Ensemble, now at Playful Corp) publicly said years ago is true: Ensemble Studios was addicted to crunching. I lived, breathed, and slept that codebase. We had demos every 4-8 weeks or something. This time in my life was beyond intense. I totally understand why Microsoft shut us down, because we really needed to be put out of our collective misery.

I was more or less addicted to crunch at Ensemble. I remember working so much, and being so consumed with work on this game, that the muscles in my neck would basically "lock up". Working on all those demo milestones was a 3 year adventure. That team was so amazing, and we all got along so well. I could never do it again like that unless lives depended on it.

Anyhow, the engine/tools team on that project built a low-level, very 360-specific "game OS" in C++ for the simulation team. Why did we build a whole new engine from the ground up? Because the Age3 engine just completely melted down after Billy Khan and I ported it to 360. (That was 4 months of the most painful, mind numbing full-time coding, porting and debugging I've ever done.)

The Age3 360 port ran at ~7 FPS, on a single thread, and took 3-5 minutes to load. After I got the net code working on 360 (no easy task, because Age3 used the Win32 window message-based Winsock API's), we played a few brutally slow multiplayer games on the 360. It was pretty bad.

Of course, we could have spent months trying to optimize and thread this engine to get it above 30Hz. But Billy and I had just rolled off Age3, where we spent months optimizing and tuning the engine to run well on PC's. I also had a bunch of new 360-specific rendering features I wanted to implement, and doing this in the old PC-centric codebase would have been a nightmare.

The HW1 engine consisted of many global managers, very heavy use of synchronous/asynchronous cross-thread messaging, and lightweight platform-specific wrappers built on top of the Win32 and D3D API's. The renderer, animation, sound, streaming, decompression, networking, and overlapped I/O systems were heavily multithreaded. (Overlapped I/O actually worked properly on Xbox 360's OS.) We used 360-specific D3D9 extensions that allowed us to compose command buffers from multiple threads, and we carefully managed all GPU physical memory ourselves just like a driver would. There are lots of other cool things we did on HW1 that I'll cover here on rainy days.

The original idea of using message passing for most of our parallelism in our next engine came from Bill Jackson, now CCO at Boss Fight Entertainment in Dallas. I implemented it and refined the idea before I really understood how useful it was. It was inspired by message passing and concurrency in Erlang. It worked well and was really fun to use, but it was hard to debug. Something like 5,000 intra- and inter-thread messages were involved in loading a map in the background while Scaleform UI was playing back on its own core. We also had a simple job system, but most of our concurrency was implemented using message passing. (See this article on a similar message passing system by Nicholas Vining.)
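As a rough illustration of the general pattern (my own minimal sketch, not HW1 code): each service owns a queue that its thread drains, and any other thread can post messages to it asynchronously.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Minimal cross-thread message queue: a "service" drains its own queue on
// its own thread, and any other thread can post work to it asynchronously.
class MessageQueue
{
public:
    void post(std::function<void()> msg)
    {
        { std::lock_guard<std::mutex> lock(m_mutex); m_queue.push_back(std::move(msg)); }
        m_cv.notify_one();
    }
    // Called from the owning service's thread; blocks until a message arrives.
    void pump_one()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        auto msg = std::move(m_queue.front());
        m_queue.pop_front();
        lock.unlock();
        msg();
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_queue;
};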

We tried to follow our expression of the Unix philosophy on this game: Lots of little objects, tools, and services interacting in an ecosystem. Entire "game OS" services were designed to only send/receive and process messages on particular 360 CPU cores.

My manager and I created this powerful, highly abstracted virtual file I/O system with streaming support. The entire game (except the 360 executable) could quickly load over the network using TCP/IP, or off the hard drive or DVD using package files. Hot reloading was supported over the network, so artists could watch their textures, models, animations, terrain, and lights change in real-time. We had the entire company (artists, designers, programmers) using this system.

Something like singletons made no sense for the managers. These services were abstracting away one specific global piece of hardware or global C API, so why bother. I've been told the C-based Halo codebases "followed not strictly the same philosophy, but of the same mind".

This codebase was very advanced for its time. It made the next series of codebases I learned and enhanced feel 5-10 years behind the times. I don't talk about it because this entire period of my life was so intense.

Wednesday, September 21, 2016

ETC1/2 vs. DXT1 texture compression benchmark

I'm using the same testing tool, dataset and methodology explained in my ETC1/2 benchmark. In this benchmark, I've added in my vanilla (non-RDO/CRN) DXT1 block encoder (really, its DXT1 endpoint optimizer class), which is derived from crunch's.

In 2009 my DXT1 encoder was as good as or better than all of the available DXT1 compressors I tested it against, such as squish, ATI Compressonator, NVidia's original and old NVDXT library, and D3DX's. I'm not sure how much has changed in DXT1 compression since then. I can also throw in other DXT1 encoders if there's interest.

RGB error metrics:


Here's just ETC2 vs. DXT1:


This is fascinating!

Next up: BC7.