Sunday, September 25, 2016

SSIM

Alright, I'm implementing SSIM. There are like 30 different implementations on the web, and most either rely on huge dependencies like OpenCV or have crappy licenses. So which one do I compare mine to? The situation with SSIM seems worse than PSNR. There are just so many variations on how to compute this thing.

I'm choosing this implementation for comparison purposes, because I already have the fundamental image processing primitives handy:

http://mehdi.rabah.free.fr/SSIM/SSIM.cpp

On Multi-Scale SSIM: I've been given conflicting information on whether or not this is actually useful to me. Let's first try regular SSIM.
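For reference, here is a rough sketch of the core SSIM statistic in C++. It is not the implementation linked above and not my production code: it assumes 8-bit grayscale input in a flat vector, treats the whole image as one window, and uses plain (unweighted) means and variances. Regular SSIM computes this same expression over small local windows (typically Gaussian weighted) and averages the results. The function name global_ssim is made up for illustration; the constants K1 = 0.01 and K2 = 0.03 are the usual published defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Single-window ("global") SSIM between two 8-bit grayscale images of equal size.
// Simplification: real SSIM evaluates this per local window and averages.
double global_ssim(const std::vector<std::uint8_t>& x,
                   const std::vector<std::uint8_t>& y)
{
    assert(!x.empty() && x.size() == y.size());
    const double n = static_cast<double>(x.size());

    // Means of both images.
    double sum_x = 0.0, sum_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum_x += x[i];
        sum_y += y[i];
    }
    const double mu_x = sum_x / n;
    const double mu_y = sum_y / n;

    // Variances and covariance (population form, divided by n).
    double var_x = 0.0, var_y = 0.0, cov_xy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mu_x;
        const double dy = y[i] - mu_y;
        var_x += dx * dx;
        var_y += dy * dy;
        cov_xy += dx * dy;
    }
    var_x /= n;
    var_y /= n;
    cov_xy /= n;

    // Stabilizing constants: C1 = (K1*L)^2, C2 = (K2*L)^2 with L = 255.
    const double L = 255.0;
    const double c1 = (0.01 * L) * (0.01 * L);
    const double c2 = (0.03 * L) * (0.03 * L);

    return ((2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)) /
           ((mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2));
}
```

Identical images score 1.0, and the score drops (it can even go negative) as structure diverges. Differences in window size, weighting, and downsampling are exactly where the many implementations on the web disagree.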

For testing, I compared my implementation, using my own float image processing code, vs. the code above that uses doubles and OpenCV. To generate some distorted test images, I loaded kodim18 into Paint Shop Pro X8 and saved to various JPEG quality levels from 1-99. I then ran the two tools and graphed the results in Excel:




The X axis represents the various quality levels, from highest to lowest quality. The 12 PSP JPEG quality levels tested are 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 99. Y axis is SSIM.

Thanks to John Brooks at Blue Shift for feedback on this post.

Friday, September 23, 2016

About the HW1 codebase having "too many globals"

First off, this project was a death march. What Paul Bettner (formerly Ensemble, now at Playful Corp) publicly said years ago is true: Ensemble Studios was addicted to crunching. I lived, breathed, and slept that codebase. We had demos every 4-8 weeks or something. This time in my life was beyond intense. I totally understand why Microsoft shut us down, because we really needed to be put out of our collective misery.

I was more or less addicted to crunch at Ensemble. I remember working so much, and being so consumed with work on this game, that the muscles in my neck would basically "lock up". Working on all those demo milestones was a 3 year adventure. That team was so amazing, and we all got along so well. I could never do it again like that unless lives depended on it.

Anyhow, the engine/tools team on that project built a low-level, very 360-specific "game OS" in C++ for the simulation team. Why did we build a whole new engine from the ground up? Because the Age3 engine just completely melted down after Billy Khan and I ported it to 360. (That was 4 months of the most painful, mind numbing full-time coding, porting and debugging I've ever done.)

The Age3 360 port ran at ~7 FPS, on a single thread, and took 3-5 minutes to load. After I got the net code working on 360 (no easy task, because Age3 used the Win32 window message-based Winsock API's), we played a few brutally slow multiplayer games on the 360. It was pretty bad.

Of course, we could have spent months trying to optimize and thread this engine to get it above 30Hz. But Billy and I had just rolled off Age3, where we had spent months optimizing and tuning the engine to run well on PC's. I also had a bunch of new 360-specific rendering features I wanted to implement, and doing this in the old PC-centric codebase would have been a nightmare.

The HW1 engine consisted of many global managers, very heavy use of synchronous/asynchronous cross-thread messaging, and lightweight platform-specific wrappers built on top of the Win32 and D3D API's. The renderer, animation, sound, streaming, decompression, networking, and overlapped I/O systems were heavily multithreaded. (Overlapped I/O actually worked properly on Xbox 360's OS.) We used 360-specific D3D9 extensions that allowed us to compose command buffers from multiple threads, and we carefully managed all GPU physical memory ourselves just like a driver would. There are lots of other cool things we did on HW1 that I'll cover here on rainy days.

The original idea for using message passing for most of our parallelism in our next engine was from Bill Jackson, now CCO at Boss Fight Entertainment in Dallas. I implemented it and refined the idea before I really understood how useful it was. It was inspired by message passing and concurrency in Erlang. It worked well and was really fun to use, but was hard to debug. Something like 5,000 intra- and inter-thread messages were involved in loading a map in the background while Scaleform UI was playing back on its own core. We also had a simple job system, but most of our concurrency was implemented using message passing. (See this article on a similar Message Passing system by Nicholas Vining.)

We tried to follow our expression of the Unix philosophy on this game: Lots of little objects, tools, and services interacting in an ecosystem. Entire "game OS" services were designed to only send/receive and process messages on particular 360 CPU cores.

My manager and I created this powerful, highly abstracted virtual file I/O system with streaming support. The entire game (except the 360 executable) could quickly load over the network using TCP/IP, or off the hard drive or DVD using package files. Hot reloading was supported over the network, so artists could watch their textures, models, animations, terrain, and lights change in real-time. We had the entire company (artists, designers, programmers) using this system.

Something like singletons made no sense for the managers. These services were abstracting away one specific global piece of hardware or global C API, so why bother. I've been told the C-based Halo codebases "followed not strictly the same philosophy, but of the same mind".

This codebase was very advanced for its time. It made the next series of codebases I learned and enhanced feel 5-10 years behind the times. I don't talk about it because this entire period of time in my life was so intense.

Wednesday, September 21, 2016

ETC1/2 vs. DXT1 texture compression benchmark

I'm using the same testing tool, dataset and methodology explained in my ETC1/2 benchmark. In this benchmark, I've added in my vanilla (non-RDO/CRN) DXT1 block encoder (really, it's a DXT1 endpoint optimizer class), which is derived from crunch's.

In 2009 my DXT1 encoder was as good as or better than all available DXT1 compressors that I tested it against, such as squish, ATI Compressonator, NVidia's original and old NVDXT library, and D3DX's. Not sure how much change has occurred in DXT1 compression since that time. I can also throw in other DXT1 encoders if there's interest.

RGB error metrics:


Here's just ETC2 vs. DXT1:


This is fascinating!

Next up: BC7.

Tuesday, September 20, 2016

Let's try DXT1 vs. ETC1/2 benchmarks

John Brooks at Blue Shift brought up this idea earlier. I think it's a great idea! I love good old DXT1 (or "BC1" as some call it). Let's see how ETC2 in particular compares against my old favorite.

Monday, September 19, 2016

Important note about PSNR

Yes, I know PSNR (and RMSE, etc.) is not an ideal quality metric for image and video compression. Keep in mind there is a large diversity of data stored as textures in modern games and applications: Albedo maps, specular maps, gloss maps, normal maps, light maps, various engine-specific multichannel control maps, 2D sprites, transparency (alpha) maps, satellite photos, cubemaps, etc. And let's not even talk about how anisotropic filtering, shading, normal mapping, shadowing, etc. impacts perceived quality once these textures are mapped onto 3D meshes.

RGB and Luma PSNR are simple and, in my experience writing and tuning crunch, reliable enough for practical usage. I'm not writing an image or video compressor, I'm writing a texture compressor.

How to compute PSNR (from an old Berkeley course)

This was part of Berkeley's CS294 Fall '97 courseware on "Multimedia Systems and Applications", but it got moved and disappeared. It was a useful little page so I'm duplicating it here for reference purposes:

https://web.archive.org/web/20090418023748/http://bmrc.berkeley.edu/courseware/cs294/fall97/assignment/psnr.html

https://web.archive.org/web/20090414211107/http://bmrc.berkeley.edu/courseware/cs294/fall97/index.html


Image Quality Computation


Signal-to-noise (SNR) measures are estimates of the quality of a reconstructed image compared with an original image. The basic idea is to compute a single number that reflects the quality of the reconstructed image. Reconstructed images with higher metrics are judged better. In fact, traditional SNR measures do not equate with human subjective perception. Several research groups are working on perceptual measures, but for now we will use the signal-to-noise measures because they are easier to compute. Just remember that higher measures do not always mean better quality.

The actual metric we will compute is the peak signal-to-reconstructed image measure which is called PSNR. Assume we are given a source image f(i,j) that contains N by N pixels and a reconstructed image F(i,j) where F is reconstructed by decoding the encoded version of f(i,j). Error metrics are computed on the luminance signal only so the pixel values f(i,j) range between black (0) and white (255).

First you compute the mean squared error (MSE) of the reconstructed image as follows:

MSE = ( sum over all i,j of [f(i,j) - F(i,j)]^2 ) / N^2

The summation is over all pixels. The root mean squared error (RMSE) is the square root of MSE. Some formulations use N rather than N^2 in the denominator for MSE.

PSNR in decibels (dB) is computed by using

PSNR = 20 * log10(255 / RMSE)

Typical PSNR values range between 20 and 40. They are usually reported to two decimal places (e.g., 25.47). The actual value is not meaningful, but the comparison between two values for different reconstructed images gives one measure of quality. The MPEG committee used an informal threshold of 0.5 dB PSNR to decide whether to incorporate a coding optimization because they believed that an improvement of that magnitude would be visible.

Some definitions of PSNR use 255^2/MSE rather than 255/RMSE. Either formulation will work because we are interested in the relative comparison, not the absolute values. For our assignments we will use the definition given above.
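The MSE, RMSE, and PSNR steps described above can be sketched in C++ like this. It is only a minimal sketch: it assumes 8-bit luminance samples stored in flat vectors of equal length, it divides by the total pixel count (the N^2 form), and it uses the common 20 * log10(255 / RMSE) definition. The names mean_squared_error and psnr_db are mine, not from the courseware, and returning infinity for identical images is just one reasonable convention.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Mean squared error between a source image and its reconstruction.
// Both are 8-bit luminance images stored as flat arrays of the same size.
double mean_squared_error(const std::vector<std::uint8_t>& src,
                          const std::vector<std::uint8_t>& recon)
{
    assert(!src.empty() && src.size() == recon.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const double d = static_cast<double>(src[i]) - static_cast<double>(recon[i]);
        sum += d * d;
    }
    return sum / static_cast<double>(src.size()); // divide by total pixel count
}

// PSNR in dB, using 20 * log10(255 / RMSE).
// Identical images have zero error, so PSNR is reported as +infinity here.
double psnr_db(const std::vector<std::uint8_t>& src,
               const std::vector<std::uint8_t>& recon)
{
    const double mse = mean_squared_error(src, recon);
    if (mse == 0.0)
        return std::numeric_limits<double>::infinity();
    return 20.0 * std::log10(255.0 / std::sqrt(mse));
}
```

For example, if every pixel is off by exactly 10 levels, MSE is 100, RMSE is 10, and PSNR comes out to roughly 28.13 dB.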

The other important technique for displaying errors is to construct an error image which shows the pixel-by-pixel errors. The simplest computation of this image is to create an image by taking the difference between the reconstructed and original pixels. These images are hard to see because zero difference is black and most errors are small numbers which are shades of black. The typical construction of the error image multiplies the difference by a constant to increase the visible difference and translates the entire image to a gray level. The computation is

E(i,j) = 2 * [f(i,j) - F(i,j)] + 128

You can adjust the constant (2) or the translation (128) to change the image. Some people use white (255) to signify no error and difference from white as an error which means that darker pixels are bigger errors.
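Here is a small C++ sketch of that error image construction. The assumptions: 8-bit luminance in flat vectors, the difference is taken as original minus reconstructed (the sign convention is my guess, flip it if you prefer), and out-of-range results are clamped to 0..255, which the text above doesn't specify. The function name error_image and its parameters are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a visualization of per-pixel error: scale * (src - recon) + bias,
// clamped to the displayable 0..255 range. Zero error maps to mid-gray (bias).
std::vector<std::uint8_t> error_image(const std::vector<std::uint8_t>& src,
                                      const std::vector<std::uint8_t>& recon,
                                      int scale = 2, int bias = 128)
{
    assert(src.size() == recon.size());
    std::vector<std::uint8_t> out(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const int diff = static_cast<int>(src[i]) - static_cast<int>(recon[i]);
        const int e = scale * diff + bias;
        out[i] = static_cast<std::uint8_t>(std::max(0, std::min(255, e)));
    }
    return out;
}
```

With the defaults, a pixel with no error comes out as 128, and an error of 10 levels shifts it by 20 gray levels in one direction or the other.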


References

A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995).

M. Rabbani and P.W. Jones, Digital Image Compression Techniques, Vol TT7, SPIE Optical Engineering Press, Bellevue, Washington (1991).

ETC1 and ETC1/2 Texture Compressor Benchmark

(This is a "sticky" blog post. I'll keep this page up to date as interesting or important events happen. Examples: When a new practical ETC encoder gets released, or when ETC codecs are significantly updated.)

The main purpose behind this particular benchmark is to conduct a deep survey of every known practical ETC1/2 encoder, so I can be sure basislib's ETC1 and universal encoders are very high quality. I want to closely understand where this space is at, and where it's going. This is exactly what I did while writing crunch. I need a very high quality, stable, and scalable ETC1/2 block parameter optimizer that works with potentially many thousands of input pixels. rg_etc1's internal ETC1 optimizer is the only thing I have right now that solves this problem.

I figured this data would be very useful to other developers, so here's a highest achievable quality benchmark of the following four practical ETC1/2 compressors:

  • etc2comp: A full-featured ETC1/2 encoder developed by engineers at Blue Shift and sponsored by Google. Supports both RGB and perceptual error metrics.
  • etcpak: Extremely fast, ETC1 and partial ETC2 (planar blocks only), RGB error metrics only
  • Intel ISPC Texture Compressor: A very fast ETC1 compressor, RGB error metrics only
  • basislib ETC1: An updated version of my open source ETC1 block encoder, rg_etc1. Supports both RGB and perceptual error metrics (unlike rg_etc1).

The test files were ~1,500 .PNG textures from the larger test corpus I used to tune crunch. Each texture was compressed using each encoder, then unpacked using rg_etc1 modified to support the 3 new ETC2 block types (planar, T, and H).

Benchmarking like this is surprisingly tricky. The API's to all the encoders are different, most are not well documented, and even exactly how you compute PSNR (because there are multiple definitions each with slightly different equations) isn't super well defined. Please see the "developer feedback" notes below.

I've sanity checked these results by writing .KTX files, converting them to .PNG using Mali's GPU Texture Compression Tool (which thankfully worked, because the .KTX format is iffy when it comes to interchange), then computing PSNR's using ImageMagick's "compare" tool. Thanks to John Brooks at Blue Shift for helping me verify the data for etc2comp, and helping me track down and fix the effort=100.0 issue in the first release of this benchmark.

I also have performance statistics, which I'll cover in a future post. The perf. data I have for etcpak isn't usable for accurate timing right now, because the etcpak code I'm calling is only single threaded and includes some I/O.

This first graph compares all four compressors in ETC1 mode, using RGB (average) PSNR.

Error Metric: Avg. RGB


ETC1:


The next graph enables ETC2 support in the encoders that support it, currently just etc2comp and etcpak:

ETC1/2:


etc2comp in ETC2 mode really shines at the lower quality levels. Below approximately 32 dB, the minimum expected quality improvement from ETC2 appears to be significant. Above ~32 dB, the minimum expected improvement drops down a bit, closer to ETC1's quality level. (Which seems to make sense, as ETC2 was designed to better handle blocks that ETC1 is weak at.)

etcpak doesn't support T and H blocks, so it suffers a lot here. This is why it's very important to pay attention to benchmarks like this one, because quality (even in ETC2-capable or aware compressors) can vary widely between libraries.

Error Metric: Perceptual



[IN PROGRESS]


Developer Feedback

  • ISPC: I had to copy ispc.exe into your project directory for it to build in my VS2015 solution. That brought down the "out of the box" experience of getting your stuff into my solution. On the upside, your API was dead simple to figure out and was very "pure" - as it should be. (However, you should rename "stride" to "stride_in_bytes". I've seen at least one programmer get it wrong and I had to help them.)
  • etcpak: Can you add a single API to do compression with multithreading, like etc2comp? And have it return a double of how much time it takes to actually execute, excluding file I/O stuff. Your codec is so fast that I/O times will seriously skew the statistics.
  • etc2comp: Hey, ETC1 is still extremely important. Intel, basislib, and rg_etc1 all have higher ETC1 quality than etc2comp. Also, could you add some defines like this to etc.h so developers know how to correctly call the public Etc::Encode() API:

#define ETCCOMP_MIN_EFFORT_LEVEL (0.0f)
#define ETCCOMP_DEFAULT_EFFORT_LEVEL (40.0f)
#define ETCCOMP_MAX_EFFORT_LEVEL (100.0f)
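
To show why defines like these would help callers, here's a tiny sketch of how a tool might use them to keep an effort value in range before handing it to the encoder. The clamp_effort helper is purely hypothetical (it isn't part of etc2comp), and the defines are repeated only so the snippet stands alone.

```cpp
#include <cassert>

// Suggested defines from above, repeated so this snippet is self-contained.
#define ETCCOMP_MIN_EFFORT_LEVEL (0.0f)
#define ETCCOMP_DEFAULT_EFFORT_LEVEL (40.0f)
#define ETCCOMP_MAX_EFFORT_LEVEL (100.0f)

// Hypothetical helper (not an etc2comp API): force a requested effort
// value into the documented [min, max] range before calling the encoder.
inline float clamp_effort(float effort)
{
    if (effort < ETCCOMP_MIN_EFFORT_LEVEL)
        return ETCCOMP_MIN_EFFORT_LEVEL;
    if (effort > ETCCOMP_MAX_EFFORT_LEVEL)
        return ETCCOMP_MAX_EFFORT_LEVEL;
    return effort;
}
```

A benchmark harness could then pass clamp_effort(user_value) instead of a raw number, so an out-of-range effort setting can't silently skew the results.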

Notes


  • 9/20: I fixed etc2comp's "effort" setting, added Intel's compressor, and removed the perceptual graphs (for now) to speed things up.
  • 9/20: Changed title and purpose of this post to a sticky benchmark page. I'm now moving into the public texture compression benchmarking space - why not? It's fun!