Thursday, April 16, 2020

Yet another BC1 encoder benchmark

A benchmark of stb_dxt v1.09, icbc, rgbcx v1.12, original crunch, and Unity's optimized variant of crunch. Both 4-color and 3-color blocks can be used, but transparent texels are not used to represent black/dark texels in this benchmark. The test set is a diverse assortment of 100 textures (not just images).

Same benchmark except this time with 3-color transparent texels used for black or dark texels in rgbcx (purple samples):

Here's an update, now with nvdxt.exe (black sample) and ispc_texcomp (brown sample). Note that the nvdxt.exe time is approximate because I had to spawn nvdxt.exe as a separate process, and it loads a .png and saves a .dds file. I spawned it twice: once without timing it, then immediately again with timing.

nvdxt.exe command line:

nvdxt.exe -nomipmap -quality_highest -rms_threshold 50 -file image.png -output -dxt1c -weight 1.0 1.0 1.0

Wednesday, April 15, 2020

.basis file format specification

[This is a work in progress, and the formatting isn't ideal. It will be copied & pasted into the Basis Universal wiki once it's done, and deleted from here.]

The Basis Universal GPU texture codec supports reading and writing ".basis" files. Currently the file format supports ETC1S or UASTC 4x4 texture data:

  • ETC1S is a simplified subset of ETC1.

The mode is always differential (diff bit=1), the Rd, Gd, and Bd color deltas are always (0,0,0), and the flip bit is always set. ETC1S texture data is fully compliant with all existing software and hardware ETC1 decoders. Existing encoders can be easily modified to limit their output to ETC1S.

  • UASTC 4x4 is a 19 mode subset of the ASTC texture format. Its specification is here. UASTC texture data can always be losslessly transcoded to ASTC.
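The ETC1S constraints listed above can be checked directly on an 8-byte ETC1 block. The following is a minimal sketch, not code from Basis Universal: the function name is hypothetical, and it assumes the standard ETC1 block layout (bytes 0-2 hold a 5-bit base color plus a 3-bit delta per channel in differential mode, and byte 3 holds the diff bit in bit 1 and the flip bit in bit 0). It only tests the three properties named above.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if an 8-byte ETC1 block satisfies the three ETC1S constraints
// described in the text: differential mode, zero color deltas, flip bit set.
// Assumption: standard ETC1 layout, where bytes 0..2 are R/G/B (5-bit base in
// the top bits, 3-bit delta in the low bits) and byte 3 carries diff/flip.
bool is_etc1s_block(const uint8_t block[8])
{
    assert(block != nullptr);
    const bool diff = (block[3] & 2) != 0;   // diff bit
    const bool flip = (block[3] & 1) != 0;   // flip bit
    const bool zero_deltas = ((block[0] & 7) == 0) &&
                             ((block[1] & 7) == 0) &&
                             ((block[2] & 7) == 0);
    return diff && flip && zero_deltas;
}
```

A check like this could serve as a quick validator when modifying an existing ETC1 encoder to emit only ETC1S blocks.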

At a high level, a typical .basis file consists of multiple sections:

  • The file header
  • Optional ETC1S compressed endpoint/selector codebooks
  • Optional ETC1S Huffman table information
  • A required "slice" description array describing the resolutions and file offset/compressed sizes of each texture slice present in the file
  • 1 or more slices containing ETC1S or UASTC compressed texture data. 
  • For future expansion, the format supports an "extended" header which may be located anywhere in the file. This section contains .PNG-like chunked data. 

Apart from the header, which must always be present at the start of the file, the other sections can appear in any order.


// basis_file_header::m_tex_type
enum basis_texture_type
{
  cBASISTexType2D = 0,
  cBASISTexType2DArray = 1,
  cBASISTexTypeCubemapArray = 2,
  cBASISTexTypeVideoFrames = 3,
  cBASISTexTypeVolume = 4
};

// basis_slice_desc::flags
enum basis_slice_desc_flags
{
  cSliceDescFlagsHasAlpha = 1,
  cSliceDescFlagsFrameIsIFrame = 2
};

// basis_file_header::m_tex_format
enum basis_tex_format
{
  cETC1S = 0,
  cUASTC4x4 = 1
};

// basis_file_header::m_flags
enum basis_header_flags
{
  cBASISHeaderFlagETC1S = 1,
  cBASISHeaderFlagYFlipped = 2,
  cBASISHeaderFlagHasAlphaSlices = 4
};

File Structures

All individual values in all file structures are byte aligned and always little endian. The structs have no padding (i.e. they are declared with #pragma pack(1)).

Struct basis_file_header

The file header must always be at the beginning of the file.

struct basis_file_header
{
  uint16      m_sig;              // 2 byte file signature
  uint16      m_ver;              // File version
  uint16      m_header_size;      // Header size in bytes, sizeof(basis_file_header) or 0x4D
  uint16      m_header_crc16;     // CRC16/genibus of the remaining header data

  uint32      m_data_size;        // The total size of all data after the header
  uint16      m_data_crc16;       // The CRC16 of all data after the header

  uint24      m_total_slices;     // The number of compressed slices
  uint24      m_total_images;     // The total # of images
  byte        m_tex_format;       // enum basis_tex_format
  uint16      m_flags;            // enum basis_header_flags
  byte        m_tex_type;         // enum basis_texture_type
  uint24      m_us_per_frame;     // Video: microseconds per frame

  uint32      m_reserved;         // For future use
  uint32      m_userdata0;        // For client use
  uint32      m_userdata1;        // For client use

  uint16      m_total_endpoints;          // ETC1S: The number of endpoints in the endpoint codebook
  uint32      m_endpoint_cb_file_ofs;     // ETC1S: The compressed endpoint codebook's file offset relative to the header
  uint24      m_endpoint_cb_file_size;    // ETC1S: The compressed endpoint codebook's size in bytes

  uint16      m_total_selectors;          // ETC1S: The number of selectors in the selector codebook
  uint32      m_selector_cb_file_ofs;     // ETC1S: The compressed selector codebook's file offset relative to the header
  uint24      m_selector_cb_file_size;    // ETC1S: The compressed selector codebook's size in bytes

  uint32      m_tables_file_ofs;          // ETC1S: The file offset of the compressed Huffman codelength tables.
  uint32      m_tables_file_size;         // ETC1S: The file size in bytes of the compressed Huffman codelength tables.

  uint32      m_slice_desc_file_ofs;      // The file offset to the slice description array, usually follows the header

  uint32      m_extended_file_ofs;        // The file offset of the "extended" header and compressed data, for future use
  uint32      m_extended_file_size;       // The file size in bytes of the "extended" header and compressed data, for future use
};

  • m_sig is always 'B' * 256 + 's', or 0x4273.
  • m_ver is currently always 0x10.
  • m_header_size is sizeof(basis_file_header). It's always 0x4D.
  • m_header_crc16 is the CRC-16 of the remaining header data. See the "CRC-16" section for more information.
  • m_data_size, m_data_crc16: The size of all data following the header, and its CRC-16.
  • m_total_slices: The total number of slices, in the range [1, 2^24-1].
  • m_total_images: The total number of images (where one image can contain multiple mipmap levels, and each mipmap level is a different slice).
  • m_tex_format: basis_tex_format. Either cETC1S (0), or cUASTC4x4 (1).
  • m_flags: A combination of flags from the basis_header_flags enum.
  • m_tex_type: The texture type, from enum basis_texture_type
  • m_us_per_frame: Microseconds per frame, only valid for cBASISTexTypeVideoFrames texture types.
  • m_total_endpoints, m_endpoint_cb_file_ofs, m_endpoint_cb_file_size: Information about the compressed ETC1S endpoint codebook: The total # of entries, the offset to the compressed data, and the compressed data's size.
  • m_total_selectors, m_selector_cb_file_ofs, m_selector_cb_file_size: Information about the compressed ETC1S selector codebook: The total # of entries, the offset to the compressed data, and the compressed data's size.
  • m_tables_file_ofs, m_tables_file_size: The file offset and size of the compressed Huffman tables for ETC1S format files. 
  • m_slice_desc_file_ofs: The file offset to the array of slice description structures. There will be m_total_slices structures at this file offset.
  • m_extended_file_ofs, m_extended_file_size: The "extended" header, for future expansion. Currently unused.
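Because every field is byte aligned, little endian, and unpadded, the header can be read field by field at fixed byte offsets without declaring a packed struct. The sketch below illustrates this under stated assumptions: `read_le`, `basis_header_summary`, and `read_basis_header_summary` are hypothetical names introduced here, and the offsets are simply the running sum of the field sizes in the struct above (which totals 0x4D bytes). It only extracts a handful of fields and does not verify the CRCs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads an unsigned little-endian integer of num_bytes (1 to 4) bytes.
static uint32_t read_le(const uint8_t* p, size_t num_bytes)
{
    assert(num_bytes >= 1 && num_bytes <= 4);
    uint32_t v = 0;
    for (size_t i = 0; i < num_bytes; i++)
        v |= static_cast<uint32_t>(p[i]) << (8 * i);
    return v;
}

// Hypothetical subset of the header fields, for illustration only.
struct basis_header_summary
{
    uint32_t ver, total_slices, total_images;
    uint32_t tex_format, flags, tex_type, slice_desc_file_ofs;
};

// Returns false if the buffer is too small or the signature/header size is
// not what the text above describes. Offsets are derived by summing the
// sizes of the preceding fields in basis_file_header.
bool read_basis_header_summary(const uint8_t* data, size_t size, basis_header_summary& out)
{
    const size_t cHeaderSize = 0x4D;
    if (!data || size < cHeaderSize) return false;
    if (read_le(data + 0, 2) != 0x4273) return false;       // m_sig ('s','B' on disk)
    if (read_le(data + 4, 2) != cHeaderSize) return false;  // m_header_size
    out.ver                 = read_le(data + 2, 2);   // m_ver
    out.total_slices        = read_le(data + 14, 3);  // m_total_slices (uint24)
    out.total_images        = read_le(data + 17, 3);  // m_total_images (uint24)
    out.tex_format          = read_le(data + 20, 1);  // m_tex_format
    out.flags               = read_le(data + 21, 2);  // m_flags
    out.tex_type            = read_le(data + 23, 1);  // m_tex_type
    out.slice_desc_file_ofs = read_le(data + 65, 4);  // m_slice_desc_file_ofs
    return true;
}
```

Reading by offset sidesteps compiler-specific packing pragmas and the lack of a native 24-bit integer type.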

Struct basis_slice_desc

struct basis_slice_desc
{
    uint24 m_image_index;
    uint8 m_level_index;
    uint8 m_flags;

    uint16 m_orig_width;
    uint16 m_orig_height;

    uint16 m_num_blocks_x;
    uint16 m_num_blocks_y;

    uint32 m_file_ofs;
    uint32 m_file_size;

    uint16 m_slice_data_crc16;
};

  • m_image_index: The index of the source image provided to the encoder (will always appear in order from first to last, first image index is 0, no skipping allowed)
  • m_level_index: The mipmap level index (mipmaps will always appear from largest to smallest)
  • m_flags: enum basis_slice_desc_flags
  • m_orig_width: The original image width (may not be a multiple of 4 pixels)
  • m_orig_height: The original image height (may not be a multiple of 4 pixels)
  • m_num_blocks_x: The slice's block X dimensions. Each block is 4x4 pixels. The slice's pixel resolution may or may not be a power of 2.
  • m_num_blocks_y: The slice's block Y dimensions. 
  • m_file_ofs: Offset from the header to the start of the slice's data
  • m_file_size: The size of the compressed slice data in bytes
  • m_slice_data_crc16: The CRC16 of the compressed slice data, for extra-paranoid use cases
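As a rough illustration of how the slice description array might be walked, the sketch below reads one 23-byte `basis_slice_desc` entry (the sum of the field sizes above) and applies two sanity checks. The names `read_le`, `slice_info`, and `read_slice_desc` are hypothetical. Two assumptions are made that the text does not spell out: offsets are treated as measured from the start of the file (where the header begins), and the 4x4 block grid is expected to cover the original image dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads an unsigned little-endian integer of num_bytes (1 to 4) bytes.
static uint32_t read_le(const uint8_t* p, size_t num_bytes)
{
    assert(num_bytes >= 1 && num_bytes <= 4);
    uint32_t v = 0;
    for (size_t i = 0; i < num_bytes; i++)
        v |= static_cast<uint32_t>(p[i]) << (8 * i);
    return v;
}

// Hypothetical unpacked form of basis_slice_desc.
struct slice_info
{
    uint32_t image_index, level_index, flags;
    uint32_t orig_width, orig_height, num_blocks_x, num_blocks_y;
    uint32_t file_ofs, file_size, crc16;
};

const size_t cSliceDescSize = 23; // 3+1+1 + 2+2 + 2+2 + 4+4 + 2 bytes

// Reads slice description number slice_index from the array located at
// slice_desc_file_ofs. Assumption: offsets are relative to the file start.
bool read_slice_desc(const uint8_t* file, size_t file_len, uint32_t slice_desc_file_ofs,
                     uint32_t slice_index, slice_info& s)
{
    const uint64_t ofs = static_cast<uint64_t>(slice_desc_file_ofs) +
                         static_cast<uint64_t>(slice_index) * cSliceDescSize;
    if (!file || ofs + cSliceDescSize > file_len) return false;
    const uint8_t* p = file + ofs;
    s.image_index  = read_le(p + 0, 3);
    s.level_index  = read_le(p + 3, 1);
    s.flags        = read_le(p + 4, 1);
    s.orig_width   = read_le(p + 5, 2);
    s.orig_height  = read_le(p + 7, 2);
    s.num_blocks_x = read_le(p + 9, 2);
    s.num_blocks_y = read_le(p + 11, 2);
    s.file_ofs     = read_le(p + 13, 4);
    s.file_size    = read_le(p + 17, 4);
    s.crc16        = read_le(p + 21, 2);
    // Sanity checks (assumptions): blocks cover the image, data lies in the file.
    if (s.num_blocks_x * 4 < s.orig_width || s.num_blocks_y * 4 < s.orig_height) return false;
    if (static_cast<uint64_t>(s.file_ofs) + s.file_size > file_len) return false;
    return true;
}
```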

CRC-16 Function

.basis files use CRC-16/genibus (aka CRC-16 EPC, CRC-16 I-CODE, CRC-16 DARC) checksums. Here's an example function in C++:

uint16_t crc16(const void* r, size_t size, uint16_t crc)
{
  crc = ~crc;
  const uint8_t* p = static_cast<const uint8_t*>(r);
  for ( ; size; --size)
  {
    const uint16_t q = *p++ ^ (crc >> 8);
    uint16_t k = (q >> 4) ^ q;
    crc = (((crc << 8) ^ k) ^ (k << 5)) ^ (k << 12);
  }
  return static_cast<uint16_t>(~crc);
}

This function is called with 0 in the final "crc" parameter when computing CRC-16's of file data.
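As a sketch of how the header CRC might be checked, the code below repeats the example function above (with explicit casts, so the block compiles on its own) and adds a hypothetical `verify_header_crc` helper. It assumes that "the remaining header data" means every header byte after m_header_crc16, i.e. bytes 8 through 0x4C; that reading is an interpretation of the field description, not something the text states byte for byte. For the ASCII string "123456789" the function should return 0xD64E, the published check value for CRC-16/GENIBUS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same routine as the example function in the text, with explicit casts.
uint16_t crc16(const void* r, size_t size, uint16_t crc)
{
    assert(r != nullptr || size == 0);
    crc = static_cast<uint16_t>(~crc);
    const uint8_t* p = static_cast<const uint8_t*>(r);
    for (; size; --size)
    {
        const uint16_t q = static_cast<uint16_t>(*p++ ^ (crc >> 8));
        const uint16_t k = static_cast<uint16_t>((q >> 4) ^ q);
        crc = static_cast<uint16_t>((((crc << 8) ^ k) ^ (k << 5)) ^ (k << 12));
    }
    return static_cast<uint16_t>(~crc);
}

// Hypothetical helper: compares the stored m_header_crc16 (offset 6) against
// the CRC of the header bytes that follow it.
// Assumption: the CRC covers offsets 8 .. 0x4C inclusive.
bool verify_header_crc(const uint8_t* header, size_t size)
{
    const size_t cHeaderSize = 0x4D, cCrcStart = 8;
    if (!header || size < cHeaderSize) return false;
    const uint16_t stored = static_cast<uint16_t>(header[6] | (header[7] << 8));
    return crc16(header + cCrcStart, cHeaderSize - cCrcStart, 0) == stored;
}
```

The same pattern would apply to m_data_crc16 and m_slice_data_crc16, using the corresponding offset and size fields.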

Thursday, April 9, 2020

BC1 encoding initial endpoint determination benchmark

Benchmark of BC1 encoders using different methods to determine the initial endpoints: 

stb_dxt.h PCA: 35.754 dB, .551 us/block
rgbcx.h PCA: 35.794 dB, .651 us/block
rgbcx.h PCA+inset: 35.925 dB, .640 us/block
rgbcx.h 2D LS+inset+opt round: 35.920 dB, .541 us/block
rgbcx.h bounds+inset+XY covar: 35.836 dB, .472 us/block

This is across 100 textures, so even small avg. improvements are significant. Amazingly, the inset method (a few lines of code) buys rgbcx.h PCA .131 dB! All encoders should be doing this. You *must* pay attention to every little detail in these texture encoders.

Quality is performance in competitive texture block encoding, so even small boosts in quality allow us to dial down the # of total orderings to check for the same average quality. This leads to a more competitive encoder.


- bounds+inset+XY covar method is Castano's/van Waveren's. 
All encoders should be applying the "inset" method described in this paper, because from a quantization perspective it makes perfect sense.

- 2D LS is Humus's method, ported to mostly integer math, with added inset+optimal rounding to 565.

- stb_dxt.h and rgbcx.h PCA is 3D integer PCA (3x3 covar+4 power iters, pick 2 colors along principal axis). 

- PCA+inset+optimal rounding does PCA, picks 2 colors, then lerps the 2 colors by 1/16 or 15/16, then optimal rounds to 565.
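The inset and rounding steps in the last item can be sketched as follows. This is not rgbcx.h's code: the names are hypothetical, the endpoints are taken as already chosen (for example by PCA), and "optimal rounding" is interpreted here as picking the 5-bit or 6-bit value whose bit-replicated 8-bit expansion lies nearest the target, which is one plausible reading of the term.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Expands an n-bit channel value (n = 5 or 6) to 8 bits by bit replication.
static int expand_to_8(int v, int bits)
{
    return (v << (8 - bits)) | (v >> (2 * bits - 8));
}

// Brute force: returns the n-bit value whose 8-bit expansion is nearest target.
static int round_to_bits(float target, int bits)
{
    assert(bits == 5 || bits == 6);
    int best = 0;
    float best_err = 1e9f;
    for (int v = 0; v < (1 << bits); v++)
    {
        const float err = std::fabs(static_cast<float>(expand_to_8(v, bits)) - target);
        if (err < best_err) { best_err = err; best = v; }
    }
    return best;
}

struct color565 { int r, g, b; }; // r,b in [0,31], g in [0,63]

// Moves two RGB endpoints (components in 0..255) toward each other by 1/16 of
// their distance, i.e. lerps at 1/16 and 15/16, then rounds each to 565.
void inset_and_round(const float lo[3], const float hi[3], color565& out_lo, color565& out_hi)
{
    const int bits[3] = { 5, 6, 5 };
    int l[3], h[3];
    for (int c = 0; c < 3; c++)
    {
        const float inset_lo = lo[c] + (hi[c] - lo[c]) * (1.0f / 16.0f);
        const float inset_hi = lo[c] + (hi[c] - lo[c]) * (15.0f / 16.0f);
        l[c] = round_to_bits(inset_lo, bits[c]);
        h[c] = round_to_bits(inset_hi, bits[c]);
    }
    out_lo = { l[0], l[1], l[2] };
    out_hi = { h[0], h[1], h[2] };
}
```

A production encoder would likely replace the brute-force search with a lookup table or a closed-form rounding rule; the loop is used here only to keep the intent obvious.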

Wednesday, April 8, 2020

AMD GPU BC1 decoding lookup tables

Here are the lookup tables you can use to determine how AMD GPUs decode BC1 textures:

These tables were gathered straight from a Radeon RX 580 by using a small D3D9 app that rendered a textured BC1 quad with point sampling and did a CPU readback. I used this same D3D9 app on an NVidia 1080 and the pixels I read back exactly matched what the NV BC1 formulas on the web predicted, so I'm confident in the approach.

For selectors 0 and 1, the 5->8 and 6->8 endpoint conversion just uses bitshifts/OR's (same as ideal BC1). For 4-color selector 2, use the tables. For selector 3, just invert the low/high endpoints. (I've verified you can do this.) For 3-color selector 2, use the tables.

To access the tables, use [color0_component*32+color1_component], or *64 for 6-bits:
Block Compression (Direct3D 10) - Win32
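The indexing rule and the selector 0/1 conversion described above can be written as a few small helpers. This is only a sketch with hypothetical function names; the table contents themselves are not reproduced here, and the bit-replication formulas are the usual ideal BC1 5->8 and 6->8 expansions that the text refers to as bitshifts/ORs.

```cpp
#include <cassert>
#include <cstdint>

// Index into a 32*32-entry table for 5-bit (red/blue) endpoint components.
inline uint32_t amd_table_index5(uint32_t c0, uint32_t c1)
{
    assert(c0 < 32 && c1 < 32);
    return c0 * 32 + c1;
}

// Index into a 64*64-entry table for 6-bit (green) endpoint components.
inline uint32_t amd_table_index6(uint32_t c0, uint32_t c1)
{
    assert(c0 < 64 && c1 < 64);
    return c0 * 64 + c1;
}

// Selectors 0 and 1: expand the endpoint components to 8 bits by replicating
// the high bits into the low bits (shift + OR).
inline uint32_t expand5(uint32_t c) { assert(c < 32); return (c << 3) | (c >> 2); }
inline uint32_t expand6(uint32_t c) { assert(c < 64); return (c << 2) | (c >> 4); }
```

If "invert the low/high endpoints" for 4-color selector 3 means swapping the two endpoint components before the lookup, that case would become `amd_table_index5(c1, c0)` (or the 6-bit equivalent) into the same table used for selector 2.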

Converting the tables to formulas sounds like an interesting puzzle.

Example showing exactly how to use the tables to decode AMD BC1:

Tuesday, April 7, 2020

CPU BC1 Encoding Pareto Frontier

rgbcx.h now defines the BC1 Pareto Frontier for high quality CPU BC1 encoding (i.e. it's stronger than all other available practical high quality CPU encoders for both performance and quality):



I didn't include AMD Compressonator's encoder because in previous benchmarks (conducted by others) it was beaten by a weaker version of rgbcx.h for both perf. and quality.

The overall CPU BC1 Pareto frontier is defined by ispc_texcomp (at low quality: ~33.1 dB) and rgbcx for any higher quality level. We're going to need SIMD to compete against ispc_texcomp BC1 (a weak stb_dxt clone), which is my next major goal.

To get rgbcx to compete against icbc for max. quality I had to add prioritized cluster fit support for 3-color blocks (not just 4).

It's possible to permit rgbcx to go to even higher quality levels by enlarging the total ordering tables. They're currently limited to 32 entries per total ordering.

I think rgbcx.h's max quality is slightly higher than icbc's HQ mode because prioritized cluster fit can afford to do optimal rounding and evaluate accurate MSE errors in every trial. Regular cluster fit can't afford to do so because it has to evaluate so many total orderings.


Saturday, April 4, 2020

New BC1 benchmark

Optimizing BC1 encoding is still useful and interesting because the same core algorithms are used in BC7 and ASTC/UASTC encoders. Most improvements made to BC1 encoding carry over nicely to the 2-bit and 3-bit selector modes of other formats.

Here's my latest benchmark:

The highest performing samples (above 37 dB) are rgbcx in 3-color block mode, where it can use transparent black colors (selector 3) for opaque black or very dark texels. (The only other BC1 encoder that might support this mode is the one in NVidia Texture Tools, but I'm not sure.) This technically turns opaque textures into textures with a useless alpha channel, but if the engine or shader just ignores alpha then this mode performs exceptionally well in the average case. The flags are cEncodeBC1Use3ColorBlocksForBlackPixels | cEncodeBC1Use3ColorBlocks.

This mode is super useful because it allows the 3-color block encoder to focus the endpoints on the brighter texels within the block, potentially greatly increasing quality. Blocks with very dark or black texels are common in practice.

If your engine supports ignoring the alpha channel in sampled BC1 textures then everyone using BC1 should be using encoders that support this.
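To make the 3-color mode concrete, here is a minimal sketch of how a decoder derives a BC1 block's palette from its two 16-bit endpoints, using the ideal formulas (real GPUs round the interpolated colors slightly differently). The type and function names are hypothetical. When color0 <= color1 the block is in 3-color mode and selector 3 decodes to transparent black, which is the entry an encoder can assign to black or very dark texels if alpha is ignored.

```cpp
#include <cassert>
#include <cstdint>

struct rgba8 { uint8_t r, g, b, a; };

// Unpacks a 565 color to 8 bits per channel by bit replication.
static rgba8 unpack_565(uint16_t c)
{
    const uint32_t r = (c >> 11) & 31, g = (c >> 5) & 63, b = c & 31;
    return { static_cast<uint8_t>((r << 3) | (r >> 2)),
             static_cast<uint8_t>((g << 2) | (g >> 4)),
             static_cast<uint8_t>((b << 3) | (b >> 2)), 255 };
}

// Computes (wa*a + wb*b) / (wa + wb) per RGB channel, alpha forced to 255.
static rgba8 blend(const rgba8& a, const rgba8& b, uint32_t wa, uint32_t wb)
{
    const uint32_t d = wa + wb;
    return { static_cast<uint8_t>((wa * a.r + wb * b.r) / d),
             static_cast<uint8_t>((wa * a.g + wb * b.g) / d),
             static_cast<uint8_t>((wa * a.b + wb * b.b) / d), 255 };
}

// Builds the 4-entry palette of a BC1 block (ideal formulas).
void bc1_palette(uint16_t color0, uint16_t color1, rgba8 pal[4])
{
    assert(pal != nullptr);
    pal[0] = unpack_565(color0);
    pal[1] = unpack_565(color1);
    if (color0 > color1)
    {
        // 4-color block: two interpolated colors at 1/3 and 2/3.
        pal[2] = blend(pal[0], pal[1], 2, 1);
        pal[3] = blend(pal[0], pal[1], 1, 2);
    }
    else
    {
        // 3-color block: one midpoint color, selector 3 is transparent black.
        pal[2] = blend(pal[0], pal[1], 1, 1);
        pal[3] = { 0, 0, 0, 0 };
    }
}
```

With selector 3 covering the dark texels, the two endpoints only need to span the remaining brighter colors in the block.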


rgbcx.h flags:

- h is cEncodeBC1HighQuality
- ut is cEncodeBC1UseLikelyTotalOrderings
- ub is cEncodeBC1Use3ColorBlocksForBlackPixels
- 3 is cEncodeBC1Use3ColorBlocks

From the benchmarks I've seen it appears NVidia Texture Tools BC1 is around the same perf. as libsquish at slightly higher quality:

I believe this was rgbcx using 10 total orderings (the default setting). The max is 32, and every additional total ordering increases average quality. So at higher settings rgbcx is likely competitive against nvtt while being faster.

I'm currently working on integrating NVTT into my test app.

Friday, February 28, 2020

UASTC benchmark

RGB PSNR over a 1,048,576 4x4 block compression torture test (random blocks from 81 test textures):
    Near-opt BC7 (BC7E slower):   41.743
    astcenc_thorough:             40.892
    UASTC (veryslow)->ASTC:       40.373
    UASTC (veryslow)->BC7:        39.965
    UASTC (slower)->ASTC:         40.163
    UASTC (slower)->BC7:          39.782
    UASTC (default)->ASTC:        39.372
    UASTC (default)->BC7:         39.171
    UASTC (faster)->ASTC:         39.269
    UASTC (fastest)->ASTC:        34.654
    UASTC (fastest)->BC7:         34.554
    ispc_texcomp ASTC alpha_slow: 39.768
    stb_dxt BC1 HIGHQUAL:         32.479
    UASTC (slower)->BC1:          32.148
    UASTC (fastest)->BC1:         32.256
    UASTC (slower)->ETC1:         30.956
    UASTC (fastest)->ETC1:        30.113
    UASTC (slower)->R11:          37.942
The 4096x4096 .PNG is here.

The EAC R11 format is R PSNR, and is included for comparison purposes.

Notice that the UASTC->BC1 quality actually increased when going from "slower" to "fastest" mode. This is because in "fastest" mode, almost all the blocks used UASTC mode 0, which is more compatible with BC1. (UASTC has 1-2 bits of BC1 hints per block that allow the UASTC block to be converted directly to BC1 blocks, skipping real-time encoding.)
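For reference, figures like the ones in the table above are typically computed as 10*log10(255^2 / MSE). The sketch below shows that formula for tightly packed 8-bit RGB data. The function name is hypothetical, and the exact averaging convention used in the benchmark (for example, whether MSE is pooled over all blocks before taking the log) is not stated in the text, so this is only the textbook definition. The 100 dB value returned for identical inputs is an arbitrary cap chosen here to avoid a division by zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// PSNR in dB between two buffers of 8-bit samples (e.g. tightly packed RGB,
// so num_values = num_pixels * 3). MSE is averaged over all samples.
double psnr_8bit(const uint8_t* a, const uint8_t* b, size_t num_values)
{
    assert(a != nullptr && b != nullptr && num_values > 0);
    double sum_sq = 0.0;
    for (size_t i = 0; i < num_values; i++)
    {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum_sq += d * d;
    }
    const double mse = sum_sq / static_cast<double>(num_values);
    if (mse == 0.0)
        return 100.0; // arbitrary cap for identical inputs (assumption)
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```

Passing only the red channel to the same function would give the single-channel (R PSNR) figure used for the R11 row.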