ispc_texcomp and libsquish both use SIMD instructions, while the others are scalar.
This is BC1's "Pareto Frontier", which is a key concept used in lossless compression benchmarking: an encoder is on the frontier if no other encoder is both faster and higher quality. (There are 2 BC1 codecs missing: the ones in NVIDIA Texture Tools and AMD Compressonator. I don't think either would change this graph in a fundamental way, but I'll add them.) This applies to BC7/UASTC/ASTC (RGB/RGBA Direct CEMs) too, because the same core algorithms are used on each subset. Any basic improvements made here will benefit all similar endpoint-centric texture formats, so this frontier matters to us a great deal. (Note that GPU encoders may be much faster in an absolute sense, but they all boil down to using the same basic algorithms, which is what we're really interested in here.)
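To make the frontier idea concrete, here's a minimal sketch of how you could pick the frontier out of a list of benchmark results. The function name `pareto_frontier` and the codec names and numbers in the usage note are placeholders I made up for illustration; they are not measurements from the graph.

```python
def pareto_frontier(results):
    """Return the names of the results that are not dominated by any other.

    results: list of (name, encode_time, quality) tuples, where a lower
    encode_time and a higher quality (e.g. PSNR) are both better.
    """
    frontier = []
    best_quality = float("-inf")
    # Walk from fastest to slowest. For equal times, visit the higher
    # quality entry first so the lower quality one is rejected.
    for name, encode_time, quality in sorted(results, key=lambda r: (r[1], -r[2])):
        # A slower result only earns its place if it beats the quality of
        # everything that is at least as fast.
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier
```

For example, with placeholder values `[("codec_a", 1.0, 30.0), ("codec_b", 2.0, 35.0), ("codec_c", 3.0, 34.0), ("codec_d", 4.0, 36.0)]`, `codec_c` drops out because `codec_b` is both faster and higher quality.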
For BC1 encoding (and this applies to BC7/ASTC and GPU encoders too), here are the key things I've learned about how to best balance quality vs. perf:
1. Do PCA to find the block's principal axis, then find the 2 colors in the block that are furthest apart along this axis. Use these colors as initial endpoints.
Note it's slightly better (and faster/simpler) to use 2 colors from the block as initial endpoints - not the versions projected along the axis. (The "stb_dxt" approach.)
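Here's a rough sketch of this step in plain Python, just to show the idea. It assumes the block is a list of `(r, g, b)` tuples, and it approximates the principal axis with a few power iterations on the 3x3 covariance matrix, which is my choice for the sketch and only one of several ways to do the PCA. The function name `initial_endpoints` and the `iters` parameter are hypothetical. It returns 2 actual colors from the block, not the projected versions.

```python
def initial_endpoints(pixels, iters=8):
    """Pick 2 block colors furthest apart along the (approximate) principal axis.

    pixels: non-empty list of (r, g, b) tuples, e.g. the 16 texels of a 4x4 block.
    """
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]

    # Accumulate the 3x3 covariance matrix (unnormalized; the scale doesn't
    # affect the direction of the principal axis).
    cov = [[0.0] * 3 for _ in range(3)]
    for p in pixels:
        d = [p[c] - mean[c] for c in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j]

    # Start the power iteration from the covariance row with the largest
    # variance. This is a heuristic starting guess, not a guarantee.
    start = max(range(3), key=lambda i: cov[i][i])
    axis = list(cov[start])

    for _ in range(iters):
        nxt = [sum(cov[i][j] * axis[j] for j in range(3)) for i in range(3)]
        norm = max(abs(v) for v in nxt)
        if norm == 0.0:
            break  # solid-color block: any axis will do
        axis = [v / norm for v in nxt]

    # Project every color onto the axis and keep the 2 extremes.
    dots = [sum(p[c] * axis[c] for c in range(3)) for p in pixels]
    lo = pixels[min(range(n), key=dots.__getitem__)]
    hi = pixels[max(range(n), key=dots.__getitem__)]
    return lo, hi
```

A real encoder would do this with integer or SIMD math and then quantize the 2 colors to 5:6:5, but the structure is the same: covariance, axis, project, take the min and max.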