> The second implementation is faster because it doesn't require any memory accesses.
This is not necessarily true. If this function is in a hot loop and the entire lookup table fits in L1 cache, you can treat KIND_TABLE[id] as taking only ~1-2 cycles.
> Jeff Dean also posed this question in his talk: How long does it take to quicksort 1 billion 4-byte integers? It surprised me that he estimated it by simply counting the number of branch mispredictions, which is described in this post.
This is no longer true. In state-of-the-art implementations, quicksort is implemented in a branchless fashion. See https://github.com/orlp/pdqsort for details (I'm the author of pdqsort, old username on HN). The trick is to replace directly swapping misplaced elements in a loop with first collecting a buffer of elements on the left that should be on the right, and a buffer of elements on the right that should be on the left. This can be done in a branchless manner:
    int buffer[64];
    int buffer_num = 0;
    int buffer_max_size = 64;
    for (int i = 0; i < buffer_max_size; ++i) {
        // With branch:
        if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
        // Without:
        buffer[buffer_num] = i; buffer_num += (elements[i] < pivot);
    }
These buffers can then also be swapped branchlessly.
AFAIK this is not quite right. They take only 1/4th of a cycle on average, but that is because of pipelining. If you have a dependency on the result of an ALU operation, you will still have to wait the full latency (1 cycle) before you can continue.
This makes sense, as the clock is also somewhat of a 'driving force' for pushing signals through the chip from one part to the other. (Some architectures have 'zero cost' operations, I believe, but these are usually baked into the pipeline and have to be turned on or off depending on need.)
This is unrelated, but I wonder what GCC's new platform-specific method generator would do to this section of code [0]. There are a few other bits in here that look like all vector ops, and I'd love to see what would happen to this after the compiler gets a crack at it.
Also, if someone could tell me whether this is a stupid idea: what would be the result of writing a version of this sort as a constexpr for fixed-size arrays? For instance, pdqsort512 to sort a 512-element set. Just a thought. I'd love to see what G++ is able to cobble together.
Thanks for the link on sorting -- that looks really interesting!
I think it's fair to say that the expression is no slower than the lookup table. Even if missing the cache is rare, the penalty is large, so the table's average case can still be significantly worse.
I freely admit the problem is contrived, but I think most people would agree it is actually an optimization! If you really want to nitpick, you might want to count the "opportunity cost" of some other 221 bytes not being in L1 cache.
I haven't read the rest of your problem closely, but I usually think in terms of _cache pressure_. A lookup table can be very fast, but adds to the cache pressure. A large implementation with many special smart cases can also be very fast, but also adds to the cache pressure (instructions also need to be loaded from memory).
Another thing I don't believe you mention is that not every n-instruction solution is equal (even if we only use single cycle instructions as possibilities). This is because of instruction-level parallelism.
(a _ b) _ (c _ d) only takes two cycles to evaluate on a modern CPU.
I presume they are placeholders to represent any binary operator, i.e. +, -, etc. The operator should be the same in all the placeholders or have the same precedence for the two expressions to be equivalent.
It is. You can use the `setl` instruction to conditionally set a register depending on whether the previous `cmp` was less. Then, you simply do an `add` with the produced value, which is either 0 or 1.
Your example is incorrect. But even if it were correct, a much better way of visualizing this is by unrolling the loop (as is done in my pdqsort implementation): https://godbolt.org/g/qWkmG4
The odd writing style of first doing all comparisons is intentional - it increases the parallelism and reduces the data dependencies in the generated code. Note how all the colors are mixed through each other: this is a good sign.
Also note that there are no branches at all in the generated code. It's 'straight' code that the CPU can just march through at full speed. This is what makes it so stupidly fast.
I think you misunderstood the OP (or maybe I have?) - it wasn't about parallelism, just about that one use of a comparison inside the addition. I showed that compilers don't need to generate conditional jumps for that, that's all.
EDIT: To clarify, my example was completely contrived, I just reused your variable names for some sense of familiarity.
Exactly, that's what I asked about. I didn't know about setl and thus couldn't figure out how to get around the conditional branch. Thanks for the explanation!
That's applicable if you compile for 386 or a newer x86 and your compiler generates the code you expect. On AArch32 (ARM) you could use conditional MOV and on AArch64 you could use CSET. I'm sure that other architectures also have similar instructions, however it's not a guarantee that you'll end up with branchless machine code.