> The second implementation is faster because it doesn't require any memory accesses.
This is not necessarily true. If this function is in a hot loop and the entire lookup table fits in L1 cache, you can treat KIND_TABLE[id] as taking only ~1-2 cycles.
> Jeff Dean also posed this question in his talk: How long does it take to quicksort 1 billion 4-byte integers? It surprised me that he estimated it by simply counting the number of branch mispredictions, which is described in this post.
This is no longer true. In state-of-the-art implementations, quicksort is implemented in a branchless fashion. See https://github.com/orlp/pdqsort for details (I'm the author of pdqsort, old username on HN). The trick is to replace directly swapping misplaced elements in a loop with first collecting a buffer of elements on the left that should be on the right, and a buffer of elements on the right that should be on the left. This can be done in a branchless manner:
    int buffer[64];
    int buffer_num = 0;
    int buffer_max_size = 64;
    for (int i = 0; i < buffer_max_size; ++i) {
        // With branch:
        if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
        // Without:
        buffer[buffer_num] = i; buffer_num += (elements[i] < pivot);
    }
These buffers can then also be swapped branchlessly.
AFAIK this is not quite right. They take only 1/4th of a cycle on average, but that is because of pipelining. If you have a dependency on the result of an ALU operation, you will still have to wait the full latency (1 cycle) before you can continue.
This makes sense, as the clock is also somewhat of a 'driving force' for pushing signals through the chip from one part to the other. (Some architectures have 'zero cost' operations, I believe, but these are usually baked into the pipeline and have to be turned on or off depending on need.)
This is unrelated, but I wonder what GCC's new platform-specific method generator would do to this section of code [0]. There are a few other bits in here that look like all vector ops, and I'd love to see what would happen to this after the compiler gets a crack at it.
Also, if someone could tell me whether this is a stupid idea: what would be the result of writing a version of this sort as a constexpr for fixed-size arrays? For instance, pdqsort512 to sort a 512-element set. Just a thought. I'd love to see what G++ is able to cobble together.
Thanks for the link on sorting -- that looks really interesting!
I think it's fair to say that the expression is no slower than the lookup table. Even if missing the cache is rare, the penalty is large, so the table's average case can still be significantly worse.
I freely admit the problem is contrived, but I think most people would agree it is actually an optimization! If you really want to nitpick, you might want to count the "opportunity cost" of some other 221 bytes not being in L1 cache.
I haven't read the rest of your problem closely, but I usually think in terms of _cache pressure_. A lookup table can be very fast, but adds to the cache pressure. A large implementation with many special smart cases can also be very fast, but also adds to the cache pressure (instructions also need to be loaded from memory).
Another thing I don't believe you mention is that not every n-instruction solution is equal (even if we only use single cycle instructions as possibilities). This is because of instruction-level parallelism.
(a _ b) _ (c _ d) only takes two cycles to evaluate on a modern CPU.
I presume they are placeholders to represent any binary operator, i.e. +, -, etc. The operator should be the same in all the placeholders or have the same precedence for the two expressions to be equivalent.
It is. You can use the `setl` instruction to conditionally set a register depending on whether the previous `cmp` was less. Then, you simply do an `add` with the produced value, which is either 0 or 1.
Your example is incorrect. But even if it were correct, a much better way of visualizing this is by unrolling the loop (as is done in my pdqsort implementation): https://godbolt.org/g/qWkmG4
The odd writing style of first doing all comparisons is intentional - it increases the parallelism and reduces the data dependencies in the generated code. Note how all the colors are mixed through each other: this is a good sign.
Also note that there are no branches at all in the generated code. It's 'straight' code that the CPU can just march through at full speed. This is what makes it so stupidly fast.
I think you misunderstood the OP (or maybe I have?) - it wasn't about parallelism, just about that one use of a comparison inside the addition. I showed that compilers don't need to generate conditional jumps for that, that's all.
EDIT: To clarify, my example was completely contrived, I just reused your variable names for some sense of familiarity.
Exactly, that's what I asked about. I didn't know about setl and thus couldn't figure out how to get around the conditional branch. Thanks for the explanation!
That's applicable if you compile for 386 or a newer x86 and your compiler generates the code you expect. On AArch32 (ARM) you could use conditional MOV and on AArch64 you could use CSET. I'm sure that other architectures also have similar instructions, however it's not a guarantee that you'll end up with branchless machine code.