Saturday, December 10, 2016

"The Ballad of the Green Beret"

I heard this playing at the local Pagliacci's recently, and I realized it's one of the tunes my father used to play all the time. He was in Vietnam in, I think, '68 or '69, totally lost, alone in the jungle, and was saved by a branch of the Special Forces called the Green Berets.


Monday, November 28, 2016

Why Age3 used low poly skinned meshes

Age3 used CPU skinning of relatively low poly models (even in "high" model mode). This was a technical design misstep made by the Age3 team before I joined near the end of production. To help mitigate it, I rewrote the skinning code to be multithreaded. Unfortunately, by the time I came on board the artists had already created a ton of low poly skinned meshes.

I also built the skinning DLL with Intel's compiler, so I was able to easily rewrite all the skinning code with SSE1/2 ops via compiler intrinsics. Back in those days MSVC's support for vector intrinsics was weaker than the Intel compiler's. (I'm also the developer to blame for Age3's SSE requirement, which bit some owners of very early AMD processors who otherwise could have played the title at low frame rates.)

Anyhow, I mention this because if you play Age3 today, like on a 4k monitor, the game's terrain and other effects hold up pretty well. Except the skinned character models look terribly low poly by comparison. On Halo Wars I used GPU skinning, instanced rendering, and I heavily jobified the animation system.

Another little note about the Halo Wars engine

There's still a lot of misunderstanding out there about where the Halo Wars engine technology came from. Starting in very early 2005 the HW team wrote a new engine pretty much from scratch. The Age3 code was only single threaded, didn't use SIMD, and consumed huge amounts of RAM. (Age3 used over 32MB just for UTF16 strings - not good for a console game!) The "Bang!" engine ran at ~7Hz and took around three to five minutes to load on the early Xbox 360 devkits.

Colt McAnlis (now at Google), Billy Khan (now at id Software) and I wrote the entire Xbox 360-only renderer almost from scratch. We started out with Age3's particle renderer and my "wrench" demo deferred shading engine for SM 2.0 hardware. Ensemble Studios basically gave us a blank check to do whatever we wanted on Xbox 360. (What good times!)

Age3's particle engine (written partially or mostly by Graeme Devine, now at Magic Leap) was so good that the artists refused to allow us to rewrite it. Billy and I threaded it by converting it into jobs, and we SIMD'ified all the key loops using Altivec ops. We also offloaded as many computations as we could into vertex/pixel shaders, to cut down on the very high CPU cost of the original code.

The Halo Wars particle engine would have run circles around Age3's (once ported back to x86).

Please don't get me wrong, Age3 was a beautiful and fun game, and I loved working on it. The team was super easy and pleasant to work with. Just remember that Halo Wars was created by a very different team with different goals. We had some pretty awesome goals for the next Halo Wars, but the studio was shut down.

Sunday, October 23, 2016

RDO ETC1 compression examples

I've compressed the Kodak test images at various settings using the prototype RDO ETC1 compressor I've been working on recently. You can download a .7z archive containing the RDO compressed .KTX files and unpacked PNGs here. The .KTX files can be loaded using the Mali Texture Compression Tool (v4.3.0).

Here are the unpacked images for 512 endpoints and 1024 selectors (1.65 average bits/texel vs. 2.85 average bits/texel for non-RDO ETC1):

























Non-RDO:
best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876
best_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
best_compressed_size: Avg: 140136.166667, Std Dev: 16793.846305, Min: 107386.000000, Max: 171165.000000, Mean: 138750.000000
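The size and bitrate figures above appear to be related by simple arithmetic. The sketch below is my reading of the numbers, not code from the compressor: it assumes each Kodak image is 768x512 texels, that ETC1 stores one 8-byte block per 4x4 texels, and that the .KTX file adds a 68-byte header (a 64-byte KTX 1 header plus one 4-byte image size field, with no key/value data). The function names are hypothetical.

```python
# Assumptions (not stated in the post): Kodak images are 768x512,
# and the .KTX container adds 68 bytes on top of the ETC1 payload.
KODAK_WIDTH, KODAK_HEIGHT = 768, 512
KTX_OVERHEAD_BYTES = 68


def etc1_payload_bytes(width, height):
    """ETC1 stores each 4x4 texel block in 8 bytes (4 bits/texel)."""
    blocks_x = (width + 3) // 4
    blocks_y = (height + 3) // 4
    return blocks_x * blocks_y * 8


def bits_per_texel(compressed_bytes, width, height):
    """Average bits per texel for a losslessly compressed texture file."""
    return compressed_bytes * 8.0 / (width * height)


# Uncompressed .KTX size for one Kodak image under the assumptions above.
orig_size = etc1_payload_bytes(KODAK_WIDTH, KODAK_HEIGHT) + KTX_OVERHEAD_BYTES

# Average compressed size reported for non-RDO ETC1.
avg_bpt = bits_per_texel(140136.166667, KODAK_WIDTH, KODAK_HEIGHT)
```

Under these assumptions `orig_size` matches the reported best_orig_size, and `avg_bpt` lands on the reported best_bits_per_texel average.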

RDO (#endpoints_#selectors):

512_256:
rdo_luma_psnr: Avg: 31.638530, Std Dev: 2.891301, Min: 25.210732, Max: 35.657692, Mean: 33.023266
rdo_luma_ssim: Avg: 0.903939, Std Dev: 0.022998, Min: 0.839709, Max: 0.941335, Mean: 0.902615
rdo_bits_per_texel: Avg: 1.478541, Std Dev: 0.211604, Min: 1.075765, Max: 1.888489, Mean: 1.453206

512_512:
rdo_luma_psnr: Avg: 32.549770, Std Dev: 2.950959, Min: 25.927277, Max: 36.671211, Mean: 34.135223
rdo_luma_ssim: Avg: 0.916562, Std Dev: 0.020127, Min: 0.860491, Max: 0.950293, Mean: 0.915359
rdo_bits_per_texel: Avg: 1.555512, Std Dev: 0.211616, Min: 1.142314, Max: 1.969767, Mean: 1.533732

512_1024:
rdo_luma_psnr: Avg: 33.600601, Std Dev: 2.981399, Min: 26.842752, Max: 37.809361, Mean: 35.187038
rdo_luma_ssim: Avg: 0.928182, Std Dev: 0.017318, Min: 0.879742, Max: 0.957868, Mean: 0.926356
rdo_bits_per_texel: Avg: 1.648000, Std Dev: 0.208101, Min: 1.249207, Max: 2.055928, Mean: 1.623047

512_2048:
rdo_luma_psnr: Avg: 34.828563, Std Dev: 2.959008, Min: 27.984495, Max: 38.820568, Mean: 36.302998
rdo_luma_ssim: Avg: 0.939762, Std Dev: 0.014454, Min: 0.898490, Max: 0.964750, Mean: 0.938300
rdo_bits_per_texel: Avg: 1.765885, Std Dev: 0.208030, Min: 1.368184, Max: 2.174438, Mean: 1.735372

512_4096:
rdo_luma_psnr: Avg: 36.244860, Std Dev: 2.824295, Min: 29.513725, Max: 39.823002, Mean: 37.670746
rdo_luma_ssim: Avg: 0.951658, Std Dev: 0.011454, Min: 0.918562, Max: 0.971457, Mean: 0.950959
rdo_bits_per_texel: Avg: 1.924732, Std Dev: 0.210003, Min: 1.535360, Max: 2.343709, Mean: 1.893290

1024_4096:
rdo_luma_psnr: Avg: 36.375379, Std Dev: 2.881440, Min: 29.531380, Max: 40.141788, Mean: 37.697235
rdo_luma_ssim: Avg: 0.952464, Std Dev: 0.011512, Min: 0.918884, Max: 0.972384, Mean: 0.951676
rdo_bits_per_texel: Avg: 1.992114, Std Dev: 0.220525, Min: 1.569580, Max: 2.418762, Mean: 1.949666

Effect of ETC1 selector quantization on Luma SSIM/PSNR

This is like the previous post, except this time only the selectors are quantized while the endpoints are left alone. Kodak test images, perceptual colorspace metrics:





Stats for non-RDO ETC1 compression:

best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876

RDO selectors 8192:

rdo_luma_psnr: Avg: 38.225255, Std Dev: 2.628415, Min: 31.853958, Max: 41.955276, Mean: 39.500504
rdo_luma_ssim: Avg: 0.966271, Std Dev: 0.007768, Min: 0.944449, Max: 0.981821, Mean: 0.966354
rdo_bits_per_texel: Avg: 2.366380, Std Dev: 0.231610, Min: 1.902201, Max: 2.793721, Mean: 2.337708

RDO selectors 4096:

rdo_luma_psnr: Avg: 36.581700, Std Dev: 2.874786, Min: 29.814810, Max: 40.718441, Mean: 37.796730
rdo_luma_ssim: Avg: 0.953993, Std Dev: 0.010954, Min: 0.922887, Max: 0.973516, Mean: 0.953305
rdo_bits_per_texel: Avg: 2.132147, Std Dev: 0.220503, Min: 1.668640, Max: 2.535848, Mean: 2.094666

RDO selectors 2048:

rdo_luma_psnr: Avg: 35.129581, Std Dev: 2.967410, Min: 28.291447, Max: 39.650620, Mean: 36.413860
rdo_luma_ssim: Avg: 0.942579, Std Dev: 0.013760, Min: 0.903846, Max: 0.967203, Mean: 0.941114
rdo_bits_per_texel: Avg: 1.969779, Std Dev: 0.216071, Min: 1.506246, Max: 2.368530, Mean: 1.930033

RDO selectors 1024:

rdo_luma_psnr: Avg: 33.915408, Std Dev: 2.963184, Min: 27.143675, Max: 38.416290, Mean: 35.294361
rdo_luma_ssim: Avg: 0.931751, Std Dev: 0.016440, Min: 0.886028, Max: 0.960691, Mean: 0.929749
rdo_bits_per_texel: Avg: 1.848387, Std Dev: 0.216314, Min: 1.378805, Max: 2.245748, Mean: 1.809530

RDO selectors 512:

rdo_luma_psnr: Avg: 32.898390, Std Dev: 2.920482, Min: 26.292456, Max: 37.282799, Mean: 34.293579
rdo_luma_ssim: Avg: 0.920788, Std Dev: 0.019035, Min: 0.868281, Max: 0.953666, Mean: 0.918912
rdo_bits_per_texel: Avg: 1.753840, Std Dev: 0.215968, Min: 1.278585, Max: 2.150350, Mean: 1.717773

RDO selectors 256:

rdo_luma_psnr: Avg: 32.036631, Std Dev: 2.866251, Min: 25.595591, Max: 36.275482, Mean: 33.285240
rdo_luma_ssim: Avg: 0.909641, Std Dev: 0.021761, Min: 0.849128, Max: 0.946493, Mean: 0.907937
rdo_bits_per_texel: Avg: 1.673566, Std Dev: 0.215763, Min: 1.187663, Max: 2.065999, Mean: 1.631165

RDO selectors 128:

rdo_luma_psnr: Avg: 31.255766, Std Dev: 2.800476, Min: 24.977221, Max: 35.173336, Mean: 32.437733
rdo_luma_ssim: Avg: 0.896458, Std Dev: 0.024306, Min: 0.827130, Max: 0.934879, Mean: 0.895064
rdo_bits_per_texel: Avg: 1.600956, Std Dev: 0.215559, Min: 1.127218, Max: 1.991862, Mean: 1.550741

Saturday, October 22, 2016

Effect of ETC1 endpoint quantization on Luma SSIM/PSNR

In this test on the 24 Kodak images I quantized the ETC1 block colors/intensity tables (or what I've been calling "endpoints", from DXT1/BC1 terminology) to 128 clusters, but the selectors were not quantized at all. 128 clusters for endpoints is at the edge of usability for many photos.
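To make "quantizing endpoints to N clusters" concrete, here is a minimal sketch using plain k-means (Lloyd's algorithm) as a stand-in. The post doesn't say which clustering method the compressor uses, so this is not the actual implementation. Representing an endpoint as an (R, G, B, intensity table index) tuple, comparing endpoints with squared Euclidean distance, and the name `quantize_endpoints` are all assumptions made for illustration.

```python
import random


def quantize_endpoints(endpoints, num_clusters, iterations=8, seed=0):
    """Cluster endpoint tuples with plain k-means (a stand-in technique).

    Returns (codebook, assignment): one centroid per cluster, and the
    cluster index chosen for each input endpoint.
    """
    rng = random.Random(seed)
    codebook = rng.sample(endpoints, num_clusters)
    assignment = [0] * len(endpoints)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iterations):
        # Assignment step: map every endpoint to its nearest centroid.
        for i, e in enumerate(endpoints):
            assignment[i] = min(range(num_clusters),
                                key=lambda c: dist2(e, codebook[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(num_clusters):
            members = [endpoints[i] for i, a in enumerate(assignment) if a == c]
            if members:
                codebook[c] = tuple(sum(col) / len(members)
                                    for col in zip(*members))
    return codebook, assignment


# Hypothetical endpoints: (R, G, B, intensity table index).
example = [(0, 0, 0, 0), (1, 1, 1, 0), (200, 200, 200, 7), (201, 201, 201, 7)]
codebook, assignment = quantize_endpoints(example, num_clusters=2)
```

After clustering, every block whose endpoint falls into the same cluster would share one codebook entry, which is where the bitrate savings come from. A real encoder would also need to round the centroids back to values that are legal in ETC1.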

This test also adaptively limits blocks to only a single endpoint (versus a unique endpoint for each subblock), if doing so doesn't lower the block's PSNR by more than 1.25 dB.
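The single-endpoint decision can be sketched as a simple PSNR comparison. This is only an illustration of the rule as described: the function names are hypothetical, it assumes 8-bit samples (peak value 255), and it assumes the encoder has already measured the block's mean squared error under both encodings.

```python
import math


def psnr(mse, peak=255.0):
    """PSNR in dB for a given mean squared error (8-bit samples assumed)."""
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse)


def use_single_endpoint(mse_per_subblock, mse_shared, max_loss_db=1.25):
    """True if sharing one endpoint costs at most max_loss_db of block PSNR.

    mse_per_subblock: block MSE with a unique endpoint for each subblock.
    mse_shared: block MSE with both subblocks forced to one endpoint.
    """
    if mse_shared <= mse_per_subblock:
        return True  # Sharing is free (or better), so always take it.
    loss_db = psnr(mse_per_subblock) - psnr(mse_shared)
    return loss_db <= max_loss_db
```

For example, going from an MSE of 10 to an MSE of 12 costs under 1 dB, so the block would be limited to a single endpoint, while doubling the MSE costs about 3 dB and would be rejected.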

Anyhow, these two graphs show that this process is quite effective. Even at only 128 clusters, the overall SSIM is only reduced by around .01, while the bitrate is reduced by around .4 - .5 bits/texel.

The results look surprisingly good. I've made great progress on quality per bit over the previous few weeks, and I'll be posting images and .KTX files in a day or so.



Two more graphs, with 3 different endpoint quantization settings:


Overall stats:

ETC1 (no quantization):
best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876

128 endpoints:
rdo_luma_psnr: Avg: 38.042171, Std Dev: 1.874003, Min: 34.209053, Max: 41.065495, Mean: 38.749592
rdo_luma_ssim: Avg: 0.974083, Std Dev: 0.004284, Min: 0.960817, Max: 0.983318, Mean: 0.974376
rdo_bits_per_texel: Avg: 2.351300, Std Dev: 0.318168, Min: 1.788859, Max: 2.967855, Mean: 2.344340

512 endpoints:
rdo_luma_psnr: Avg: 39.239567, Std Dev: 2.001313, Min: 34.834538, Max: 41.839687, Mean: 40.379951
rdo_luma_ssim: Avg: 0.979648, Std Dev: 0.002847, Min: 0.973445, Max: 0.987098, Mean: 0.979329
rdo_bits_per_texel: Avg: 2.617640, Std Dev: 0.345818, Min: 2.031942, Max: 3.296285, Mean: 2.604553

1024 endpoints:
rdo_luma_psnr: Avg: 39.490915, Std Dev: 2.033055, Min: 34.942341, Max: 42.026814, Mean: 40.666183
rdo_luma_ssim: Avg: 0.980563, Std Dev: 0.002673, Min: 0.976034, Max: 0.987617, Mean: 0.980514
rdo_bits_per_texel: Avg: 2.693218, Std Dev: 0.356560, Min: 2.069397, Max: 3.390055, Mean: 2.668416
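The tradeoff at each setting can be read off the averages above with a little arithmetic. The sketch below only subtracts the reported averages from the unquantized ETC1 baseline; the dictionary layout and the name `tradeoff` are mine.

```python
# Average luma SSIM and bits/texel copied from the stats above.
BASELINE = {"ssim": 0.983419, "bpt": 2.851078}  # ETC1, no quantization
QUANTIZED = {
    128: {"ssim": 0.974083, "bpt": 2.351300},
    512: {"ssim": 0.979648, "bpt": 2.617640},
    1024: {"ssim": 0.980563, "bpt": 2.693218},
}


def tradeoff(num_endpoints):
    """Return (SSIM lost, bits/texel saved) relative to the baseline."""
    q = QUANTIZED[num_endpoints]
    return BASELINE["ssim"] - q["ssim"], BASELINE["bpt"] - q["bpt"]


for n in sorted(QUANTIZED):
    ssim_lost, bpt_saved = tradeoff(n)
    print(f"{n:5d} endpoints: SSIM -{ssim_lost:.6f}, {bpt_saved:.6f} bits/texel saved")
```

At 128 endpoints this works out to a bit under .01 SSIM lost for roughly .5 bits/texel saved, and both numbers shrink as the endpoint count grows.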

The next 2 graphs show RDO ETC1 compression on the Kodak test images with endpoint quantization effectively disabled (for all practical purposes). Note that adaptive subblock utilization is still enabled here, so it's possible for a block's subblocks to be forced to use the same block colors/intensity tables (endpoints) if the quality loss is < 1.25 dB.

Tests like this are important because they show that the RDO compressor is able to utilize all the features available in ETC1: flip/non-flipped, differential/absolute block color encoding, subblocks, etc.



Overall stats:

rdo_luma_psnr: Avg: 39.766113, Std Dev: 2.066657, Min: 35.116722, Max: 42.367085, Mean: 40.845627
rdo_luma_ssim: Avg: 0.981710, Std Dev: 0.002428, Min: 0.978301, Max: 0.988114, Mean: 0.981266
rdo_bits_per_texel: Avg: 2.754947, Std Dev: 0.365874, Min: 2.098104, Max: 3.464823, Mean: 2.714681
rdo_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
rdo_compressed_size: Avg: 135411.166667, Std Dev: 17983.452669, Min: 103126.000000, Max: 170303.000000, Mean: 133432.000000

best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876
best_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
best_compressed_size: Avg: 140136.166667, Std Dev: 16793.846305, Min: 107386.000000, Max: 171165.000000, Mean: 138750.000000

The next graphs are just like the previous ones, except the adaptive subblock feature is disabled. They show that RDO ETC1 with no quantization is virtually identical to basic (highest quality, block by block) ETC1 compression.




Overall stats:

rdo_luma_psnr: Avg: 39.991337, Std Dev: 2.109917, Min: 35.276287, Max: 42.721352, Mean: 41.098907
rdo_luma_ssim: Avg: 0.982858, Std Dev: 0.002269, Min: 0.979608, Max: 0.988770, Mean: 0.982394
rdo_bits_per_texel: Avg: 2.853771, Std Dev: 0.348101, Min: 2.188131, Max: 3.518412, Mean: 2.828857
rdo_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
rdo_compressed_size: Avg: 140268.541667, Std Dev: 17109.836167, Min: 107551.000000, Max: 172937.000000, Mean: 139044.000000

best_luma_psnr: Avg: 40.009226, Std Dev: 2.154732, Min: 35.193684, Max: 42.750275, Mean: 41.113007
best_luma_ssim: Avg: 0.983419, Std Dev: 0.002109, Min: 0.980131, Max: 0.989254, Mean: 0.983190
best_bits_per_texel: Avg: 2.851078, Std Dev: 0.341672, Min: 2.184774, Max: 3.482361, Mean: 2.822876
best_orig_size: Avg: 196676.000000, Std Dev: 0.000000, Min: 196676.000000, Max: 196676.000000, Mean: 196676.000000
best_compressed_size: Avg: 140136.166667, Std Dev: 16793.846305, Min: 107386.000000, Max: 171165.000000, Mean: 138750.000000

Thursday, October 20, 2016

Rate distortion performance of Basis ETC1 RDO+LZMA on the Kodak test set

At 3 quality levels, using REC709 perceptual colorspace metrics. This compares plain ETC1 (with no lossless compression), basislib highest quality ETC1+LZMA, and basislib RDO+LZMA.

"S" = selectors, "E" = endpoints.

crunch-style adaptive endpoint quantization at the block/subblock level is supported, but not at the macroblock (2x2 block) level yet. Also, the KTX writer backend is greedy, meaning it doesn't try to choose the combination of selectors+endpoints that results in the fewest compressed bits output by LZMA (or LZHAM). The lack of both features hurts compression. I have several other improvements to both quality and bitrate coming, but this is a good milestone.
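As a rough illustration of what a non-greedy backend would be optimizing, the sketch below scores candidate byte streams by their LZMA-compressed size using Python's standard `lzma` module and keeps the smallest. It is not the basislib backend: the candidate payloads are made-up stand-ins for encoded block data, and `lzma_size` and `pick_smallest` are hypothetical helper names.

```python
import lzma
import random


def lzma_size(payload):
    """Number of bytes LZMA produces for this payload."""
    return len(lzma.compress(payload, preset=9))


def pick_smallest(candidates):
    """Choose the candidate encoding that LZMA compresses best."""
    return min(candidates, key=lzma_size)


# Stand-in payloads: one reuses the same few byte patterns (as heavily
# quantized selectors/endpoints would), the other is effectively noise.
repetitive = bytes([1, 2, 3, 4]) * 1000
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(4000))

best = pick_smallest([noisy, repetitive])
```

Running the compressor on every candidate is expensive, so a practical encoder would likely need a cheaper estimate of the compressed size. The point here is only that the selection criterion is the compressed output size rather than per-block error alone.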



With a few more quality levels: