Not long ago, I was doing something that needed to be fast and I tried the old t...

to3m · on June 8, 2012

Nobody would make this argument if, during testing, it transpired that reading files over a 100Mbit network connection was faster than reading files from the local SSD. Sure, one should always measure rather than pontificate. And yes, in this hypothetical case, one has measured, and it truly turns out it's quicker... to read over the network. But one should always consider: do the results make any sense whatsoever at all, or do they suggest that further investigation might be warranted?

And similarly, it would be an unusual CPU indeed - so unusual in fact that you'd probably already know that you were working on it, and would mention it as part of this crazy story of the time that divisions were faster than shifts - that managed to mess up single-bit shifts so badly that they were slower than an integer division.

Perhaps in a higher-level language, one might find that x>>1 were faster than x/2, for some reason unrelated to the relative speeds of the two operations. But if one is working in C or C++ (and I assume this is the case, if one is bothering with stuff like replacing divides with shifts), and yet the shift is still the slower, then this is most unexpected, and probably evidence that something has gone wrong.

(The division algorithm is inherently more complicated than a shift, and so divisions are always going to take more time than a shift. You can expect a shift to take 1 cycle on a modern CPU. A division would take more like 20. It will probably be the slowest integer instruction possible, and it might not be accessible from every pipeline. Sample division latencies for vaguely modern CPUs: 20+ cycles on x86, ~35 cycles on MIPS, ~20 cycles on POWER, ARM doesn't seem to have it. Shifts, on the other hand - always 1 cycle. ARM even gives them away for free.)

chrisbennet · on June 17, 2012

I just noticed your response. The machine I timed it on was an older Lenova laptop with some sort of dual core Intel processor - indeed not an "unusual" CPU.

In the end, it's results are what matter even if you (or I) don't understand why. For all I know, the compiler parallelized some of the code with divisions and didn't for the shift using code. The thing is, you can't just look at what "should" be fastest and my experience was a good reminder of that.

In the old days, one optimization technique we used was to "unroll loops" i.e. instead of:

for (int i=0; i<8; i++) a[i] += b[i];

We'd do this:

// Unrolled loop a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; ... a[7] += b[7];

The unroll loop has less clock cycle according to the book. Great - unless the larger "faster" code no longer fits in the CPU cache, now it's actually slower. The only way to tell is to measure it.

Here is the code. Substitute ">>1" for the "/2" portions and compare.

// pBuffer contains a single horizontal line in a bitmap // destBuffer will contain the first derivative convolution i.e. if you graph it, it would look like peaks at the edges of things in the image

void Convolution1D_FastFirstDerivative(int* pBuffer, int sizeOfBuffer, int destBuffer) { int lastPos = sizeOfBuffer - 1; int isum = 0;

// Special case start of buffer where filter "hangs off edge" isum = -pBuffer[0] / 2; // 0.5 //sum += pBuffer[0] * 0; // * 0 isum += pBuffer[1] / 2; destBuffer[0] = isum;

for (int i=1; i<sizeOfBuffer-1; i++) { isum = -pBuffer[i-1] / 2; // * -0.5 //sum += pBuffer[lastPos] * 0; // * 0 isum += pBuffer[i+1] / 2; // * 0.5 destBuffer[i] = isum; }

// Special case end of buffer where filter "hangs off edge" isum = -pBuffer[lastPos-1] / 2; // * -0.5 //sum += pBuffer[lastPos] * 0; // * 0 isum += pBuffer[lastPos] / 2; // * 0.5 destBuffer[lastPos] = isum;

}

Sorry, I wish I could fix the formatting.