I took notes as I was porting my new BC7 encoder from C to ispc. First, be sure to read and re-read the user guide, performance guide, and FAQ. This compiler tech kicks ass and I hope Intel keeps throwing resources at it. My initial port of 3k lines of C and initial stabs at vectorization were only ~2x faster, but after tuning the inner loops, performance shot up to over 5x vs. regular C code (on AVX). All without having to use a single ugly intrinsic instruction.
I'm new to ispc so hopefully I haven't made any mistakes below, but here's what I learned during the process:
If you're not starting from scratch, port your code to plain C with minimal dependencies and test that first. Make some simple helper functions like clamp(), min(), etc. that look like the ispc standard lib's, so when you do get to ispc you can easily switch them over to use the stdlib's (which is very important for performance).
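As a rough illustration of the helper idea, here is a minimal plain-C sketch. It assumes C11 (for _Generic) and only covers int and float; the typed names (min_i, clamp_f, etc.) are made up for this example. The argument order follows the ispc stdlib's clamp(v, low, high), min(a, b) and max(a, b), so call sites should carry over unchanged.

```c
/* Typed helpers; the _i/_f names are hypothetical, chosen for this sketch. */
static inline int   min_i(int a, int b)     { return a < b ? a : b; }
static inline int   max_i(int a, int b)     { return a > b ? a : b; }
static inline float min_f(float a, float b) { return a < b ? a : b; }
static inline float max_f(float a, float b) { return a > b ? a : b; }

/* clamp(v, low, high): same argument order as the ispc stdlib version. */
static inline int clamp_i(int v, int low, int high)
{
    return min_i(max_i(v, low), high);
}

static inline float clamp_f(float v, float low, float high)
{
    return min_f(max_f(v, low), high);
}

/* C11 _Generic picks the int or float helper from the first argument,
   which mimics the overloaded names ispc provides. */
#define min(a, b)           _Generic((a), int: min_i,   float: min_f)(a, b)
#define max(a, b)           _Generic((a), int: max_i,   float: max_f)(a, b)
#define clamp(v, low, high) _Generic((v), int: clamp_i, float: clamp_f)(v, low, high)
```

Once the code moves to ispc, these definitions can simply be deleted and the existing clamp()/min()/max() calls resolve to the stdlib versions.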
Then port your C code to ispc, but put "uniform" on ALL variables and pointers. Test that and make sure it still works. In my experience so far you should have minimal problems at this stage, assuming you put uniforms everywhere. Now you can dive into vectorizing the thing. I would first figure out how things should be laid out in memory and go from there. You may be able to just vectorize the hotspots, or you may have to vectorize the entire thing (like I did, which took hours of messing around with uniform/varying keywords).
While developing and recompiling over and over again, temporarily change your --target option to target only one specific instruction set: --target=avx (or SSE2, etc.). There's little use targeting a bunch of different instruction sets (SSE, SSE2, AVX, AVX2, etc.) while developing new code. At this point all you care about is getting it working correctly.
The mental model is like shaders but for the CPU. Conceptually, the entire program gang executes each instruction, but the results can be masked off on a per-lane basis. If you are comfortable with shaders you will get this model immediately. Just beware there's a lot of variability in the CPU cost of operations, and optimal code sequences can be dramatically faster than slower ones. Study the generated assembly of your hotspots in the debugger and experiment. CPU SIMD instruction sets seem more brittle than ones for GPUs (why?).
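To make the masking idea concrete, here is a small plain-C emulation of how a gang might run a simple if/else. This is only a conceptual model, not what ispc literally emits: the gang width of 8 and the function name are assumptions for the sketch.

```c
#define GANG_SIZE 8  /* assumed gang width, e.g. 8 x 32-bit lanes on AVX */

/* Emulates a gang running:  if (x < 0) y = -x; else y = 2 * x;
   Every lane walks through both sides of the branch; a per-lane mask
   decides which result is actually kept. */
static void gang_abs_or_double(const int x[GANG_SIZE], int y[GANG_SIZE])
{
    int mask[GANG_SIZE];

    /* Evaluate the condition once per lane. */
    for (int lane = 0; lane < GANG_SIZE; lane++)
        mask[lane] = x[lane] < 0;

    /* "then" side: only lanes with the mask set store a result. */
    for (int lane = 0; lane < GANG_SIZE; lane++)
        if (mask[lane])
            y[lane] = -x[lane];

    /* "else" side: same walk with the mask inverted. */
    for (int lane = 0; lane < GANG_SIZE; lane++)
        if (!mask[lane])
            y[lane] = 2 * x[lane];
}
```

The point of the model is that a divergent branch costs roughly the sum of both sides, since the gang executes each side with some lanes masked off.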
A single pointer deref can hide a super expensive gather or scatter. Don't ignore the compiler warnings. These warnings are critical and can help you understand what the compiler is actually doing with your code. Examine every gather and scatter and understand why the compiler is doing them. If these operations are in your hotspots/inner loops then rethink how your data is stored in memory. (I can't emphasize this enough: scatters and gathers kill performance unless you are lucky enough to have a Xeon Phi.)
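One common way to rethink the layout is to go from an array of structs to a struct of arrays, so that the values the lanes need sit next to each other in memory. The plain-C sketch below only models the idea; the pixel types, the gang width of 8 and the function name are assumptions for illustration.

```c
#define GANG_SIZE 8  /* assumed gang width */

/* Array-of-structs: lane i's red value lives sizeof(pixel_aos) bytes after
   lane i-1's, so reading "r" for the whole gang is a strided access that
   tends to turn into a gather. */
typedef struct { float r, g, b, a; } pixel_aos;

/* Struct-of-arrays: all the red values are contiguous, so reading "r" for
   the whole gang can be a single vector load. */
typedef struct {
    float r[GANG_SIZE];
    float g[GANG_SIZE];
    float b[GANG_SIZE];
    float a[GANG_SIZE];
} pixel_soa;

/* Transpose once up front so the inner loops only see contiguous data. */
static void aos_to_soa(const pixel_aos in[GANG_SIZE], pixel_soa *out)
{
    for (int lane = 0; lane < GANG_SIZE; lane++) {
        out->r[lane] = in[lane].r;
        out->g[lane] = in[lane].g;
        out->b[lane] = in[lane].b;
        out->a[lane] = in[lane].a;
    }
}
```

Paying for the transpose once outside the hotspot is usually a better trade than gathering on every inner-loop iteration, though it is worth measuring in each case.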
varying and uniform take on godlike properties in ispc. You must master them. A "varying struct" means the struct is internally expanded so that each member holds one value per program instance in the gang. As a result, sizeof(uniform struct) != sizeof(varying struct). While porting I had to check, recheck, and check again all uniform and varying keywords everywhere in my code.
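A rough way to picture the size difference in plain C, assuming a gang width of 8 (the struct and its members are hypothetical, and the exact layout ispc uses may differ in detail):

```c
#define GANG_SIZE 8  /* assumed gang width */

/* Picture of a "uniform" struct: one value per member. */
typedef struct {
    float lo;
    float hi;
    int   count;
} range_uniform;

/* Picture of the "varying" version of the same struct: every member is
   widened to hold one value per program instance. */
typedef struct {
    float lo[GANG_SIZE];
    float hi[GANG_SIZE];
    int   count[GANG_SIZE];
} range_varying;

/* Any code that assumed the two have the same size is now wrong. */
_Static_assert(sizeof(range_varying) != sizeof(range_uniform),
               "varying and uniform structs differ in size");
```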
You need to master pointers with ispc, which are definitely tricky at first. The pointee is uniform by default, but the pointer itself is varying by default, which isn't always what you want. "varying struct *uniform ptr" is a uniform pointer to a varying struct (read it right to left). In most cases, I wanted varying structs and uniform pointers to them.
Find all memset/memmove/memcpy calls and examine them extremely closely. In many cases, they won't work as expected after vectorization. Check all sizeof()'s too. The compiler won't always give you warnings when you do something obviously dumb. In most cases I just replaced them with hand-rolled loops to copy/initialize the values, because once you switch to varying types all bets are off as to whether a memset() will do the expected thing.
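Here is one way this can go wrong, modeled in plain C. The sketch assumes a gang width of 8 and hypothetical struct and function names: a memset() whose size still refers to the old single-value struct only clears a small prefix of the widened one, while a hand-rolled loop touches every lane.

```c
#include <string.h>

#define GANG_SIZE 8  /* assumed gang width */

/* Hypothetical result struct before and after vectorization. */
typedef struct { int best_err; int best_mode; } result_uniform;
typedef struct { int best_err[GANG_SIZE]; int best_mode[GANG_SIZE]; } result_varying;

/* The kind of call that survives a port untouched: the size is still that
   of the single-value struct, so only the first sizeof(result_uniform)
   bytes of the widened struct are cleared. */
static void clear_with_stale_sizeof(result_varying *r)
{
    memset(r, 0, sizeof(result_uniform));
}

/* Hand-rolled replacement: initialize every member for every lane. */
static void clear_all_lanes(result_varying *r)
{
    for (int lane = 0; lane < GANG_SIZE; lane++) {
        r->best_err[lane]  = 0;
        r->best_mode[lane] = 0;
    }
}
```

The explicit loop is more typing, but it makes it obvious which members and which lanes get written.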
Sometimes, code sequences in vectorized code just don't work right. I had some code that inserted an element into a sorted list, that wouldn't work right until I rearranged it. Maybe it was something silly I did, but it pays to litter your code with assert()'s until you get things working.
assert()'s aren't automatically disabled in release, you must use "--opt=disable-assertions" to turn them off. assert()'s in vectorized code can be quite slow. The compiler should probably warn you about assert()'s when optimizations are enabled.
print("%", var); is how you print things (not "%u" or "%f" etc.). Double parentheses around the value means the lane was masked out. If using Visual Studio I wouldn't fully trust the locals window when debugging - use print().
Once you start vectorizing, either the compiler is going to crash, or the compiler is going to generate function prologs that immediately crash. Both events are unfortunately going to happen until it's more mature. For the func. prolog crashes, in most if not all cases this was due to a mismatch between the varying/uniform attributes of the passed in pointers to functions that didn't cause compiler errors or warnings. Check and double check your varying and uniform attributes on your pointers. Fix your function parameters until the crash goes away. These were quite painful early on. To help track them down, #if 0 out large sections of code until it works, then slowly bring code in until it fails.
The latest version of ispc (1.9.2) supports limited debugging with Visual Studio. Examining structs with bools doesn't seem to work, and the locals window is very iffy but more or less works. Single stepping works. Profiling works but seems a little iffy.
If you start to really fight the compiler on a store somewhere, you've probably got something wrong with your varying/uniform keywords. Rethink your data and how your code manipulates it.
If you're just starting a port and are new to ispc, and you wind up with a "varying varying" pointer then it's ok to be paranoid. It's probably not really what you want.
I experienced obvious codegen issues with uniform shifts and logical ORs of uint16 values. Once I cast them to uint32s the problems went away. Be wary of integer shifts, which I had issues with in a few spots.
Some very general/hand-wavy recommendations with vectorized code: Prefer SP math over DP. Prefer FP math over integer math. Prefer 32-bit integer math over 64-bit. Prefer signed vs. unsigned integers. Prefer FP math vs. looking stuff up from tables if using the tables requires gathering. Avoid uint64's. Prefer 32-bit int math intermediates vs. 8-bit. Prefer simpler algorithms that load from constant array entries in a table (so all lanes lookup at the same location in the table), vs. more complex algorithms that require table lookups with unique per-lane indices.
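The table lookup point can be modeled in plain C as follows. This is a conceptual sketch with an assumed gang width of 8 and made-up function names: when every lane uses the same index, the lookup is one scalar load that gets broadcast, whereas per-lane indices mean one load per lane from unrelated addresses, which is what a gather amounts to.

```c
#define GANG_SIZE 8  /* assumed gang width */

/* Same table entry for every lane: one scalar load, then a broadcast. */
static void lookup_uniform_index(const float *table, int index,
                                 float out[GANG_SIZE])
{
    float v = table[index];              /* single memory access */
    for (int lane = 0; lane < GANG_SIZE; lane++)
        out[lane] = v;
}

/* Unique index per lane: each lane reads from a different address, so the
   hardware has to do GANG_SIZE separate accesses (a gather). */
static void lookup_varying_index(const float *table,
                                 const int index[GANG_SIZE],
                                 float out[GANG_SIZE])
{
    for (int lane = 0; lane < GANG_SIZE; lane++)
        out[lane] = table[index[lane]];  /* one access per lane */
}
```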
Study stdlib.ispc. Prefer the stdlib's clamp() over custom functions, and prefer the stdlib over your own stuff for min, max, etc. The compiler apparently will not divine that what you are doing is just a clamp, so use the stdlib functions to get good SIMD code.
Use uniforms as much as you possibly can. Make loop iterators uniform by default, including when you start iterating at 0 and the high loop limit is varying.
Use cif() etc. on conditionals which will strongly be taken or not taken by the entire gang. Compilation can get noticeably slower as you switch to cif().
A few min's or max's and some boolean/bit twiddling ops can be much faster than the equivalent multiple if() statements. Study the SSE2 etc. instruction sets because there are some powerful things in there. Prefer building code out of helpers like select() from the stdlib for performance.
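As a scalar illustration of replacing if() statements with a select, here is a minimal plain-C sketch. The function names are hypothetical; select_i32 only mirrors the argument order of the ispc stdlib's select(cond, t, f) and is not the stdlib's implementation.

```c
#include <stdint.h>

/* Branch-free equivalent of (cond ? t : f). */
static inline int32_t select_i32(int cond, int32_t t, int32_t f)
{
    int32_t mask = -(int32_t)(cond != 0);  /* all ones if cond, else all zeros */
    return (t & mask) | (f & ~mask);
}

/* Replaces:  if (v < 0) v = 0;  if (v > 255) v = 255;
   with two selects, i.e. a max followed by a min. */
static inline int32_t clamp_to_byte(int32_t v)
{
    int32_t lo = select_i32(v < 0, 0, v);   /* max(v, 0)    */
    return select_i32(lo > 255, 255, lo);   /* min(lo, 255) */
}
```

In vectorized code the same pattern maps onto per-lane blend, min and max instructions instead of masked control flow.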
Things that usually make perfect sense in CPU code, like early outs, may actually just hurt you with SIMD code. If your early out checks have to check all lanes, and it's an uncommon early out, consider just removing or rethinking them.
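A plain-C model of why gang-wide early outs pay off less often than scalar ones (assumed gang width of 8; refine() and the other names are hypothetical stand-ins): the early out only fires when every lane agrees, similar in spirit to testing a condition with the ispc stdlib's all(), so a single lane that needs more work sends the whole gang down the slow path, and the check itself costs something every time.

```c
#define GANG_SIZE 8  /* assumed gang width */

/* Cross-lane reduction: true only if the condition holds in every lane. */
static int all_lanes(const int cond[GANG_SIZE])
{
    for (int lane = 0; lane < GANG_SIZE; lane++)
        if (!cond[lane])
            return 0;
    return 1;
}

/* Hypothetical stand-in for the expensive path. */
static int refine(int err) { return err / 2; }

/* Returns 1 if the gang had to run the expensive path, 0 if it got out early. */
static int process_gang(const int err[GANG_SIZE], int threshold,
                        int out[GANG_SIZE])
{
    int good_enough[GANG_SIZE];

    for (int lane = 0; lane < GANG_SIZE; lane++) {
        good_enough[lane] = err[lane] <= threshold;
        out[lane] = err[lane];
    }

    /* The early out only helps when ALL lanes are already good enough. */
    if (all_lanes(good_enough))
        return 0;

    /* Otherwise the gang runs the slow path, with finished lanes masked off. */
    for (int lane = 0; lane < GANG_SIZE; lane++)
        if (!good_enough[lane])
            out[lane] = refine(err[lane]);
    return 1;
}
```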