I was stumped as to how this was possible, until I saw that it relied on memory access relative to a computed offset, for which as far as I know there are no SIMD instructions. I guess it could be argued that a texture in a shader would provide similar functionality.
Since the title mentioned GPU-friendliness: GPUs are built around scatter/gather memory accesses. The actual performance of a memory access of course still depends on locality. The first load in a parallel binary search will be fast, since all threads load the same element; later loads can get progressively worse.
That said, the code in the article doesn't actually improve anything specifically for GPUs, because standard binary searches are already "branchless" for practical purposes. Really, the only difference between the article and a standard binary search is that the article's code manually unrolls the search loop. That, and some (not all) binary search implementations can do an early-out if the comparison happens to hit equality.
None of these changes make a SIMD- or GPU-specific difference, because loop unrolling doesn't make a difference in terms of control-flow divergence.
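For reference, the "standard branchless" form being compared against doesn't need any unrolling at all. Here's a minimal sketch (my own code, not the article's) where the conditional expression typically compiles to a conditional move rather than a jump:

```cpp
#include <cstddef>

// Branchless lower-bound sketch: each step halves the window, and the
// comparison result selects the new base without an if/else branch.
std::size_t lower_bound_branchless(const int* a, std::size_t n, int needle) {
    const int* base = a;
    while (n > 1) {
        std::size_t half = n / 2;
        // No jump needed: the comparison just selects the new base pointer.
        base = (base[half - 1] < needle) ? base + half : base;
        n -= half;
    }
    // Final fix-up: step past the last element if it's still below the needle.
    return static_cast<std::size_t>(base - a) + (base[0] < needle);
}
```

Whether the ternary actually becomes a cmov is up to the compiler, but in practice clang and gcc at -O2 do it for this shape of loop.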
I think to really reap the benefits of this you'll want to actually reorganize the in-memory layout of the tree, to make sure all the elements needed for a comparison end up in the same cache line.
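One way to do that reorganization is the BFS ("Eytzinger") layout: store the implicit search tree level by level, so the first few levels, which every search touches, share the same cache lines. A sketch, assuming a sorted input array (helper names are mine):

```cpp
#include <cstddef>
#include <vector>

// Fill `out` in BFS (Eytzinger) order by walking the implicit tree in-order
// while consuming the sorted array left to right.
void to_eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
                  std::size_t& i, std::size_t k = 0) {
    if (k < out.size()) {
        to_eytzinger(sorted, out, i, 2 * k + 1);  // left subtree
        out[k] = sorted[i++];                     // in-order visit
        to_eytzinger(sorted, out, i, 2 * k + 2);  // right subtree
    }
}

// Search the Eytzinger-ordered array; returns the index of `needle`,
// or out.size() if it's absent. Child selection is branch-free arithmetic.
std::size_t eytzinger_search(const std::vector<int>& out, int needle) {
    std::size_t k = 0;
    while (k < out.size()) {
        if (out[k] == needle) return k;
        k = 2 * k + 1 + (out[k] < needle);  // left child + 0 or 1
    }
    return out.size();
}
```

The pkhuong paper linked elsewhere in this thread benchmarks exactly this kind of layout against sorted-order binary search.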
I haven't been following this closely, but the last time I checked scatter-gather loads were really really slow.
As the saying goes, "memory is the new disk". Data structures like B-trees which were designed to improve the performance of data accesses from disk are now useful for improving the performance of algorithms in memory.
I don't have any data readily available, but it's generally understood that you want a cache-aware sort, yes. Usually I see people opt for samplesort rather than sorting chunks and merging, though.
Thank you, scatter-gather was the Google term I was looking for, though I couldn't immediately find any architectures that implement vector scatter-gather. Are you aware of any?
Not sure what a scalar scatter/gather would be, but NV and AMD GPUs do the vector one (you can read/write n values from n offsets with a single instruction).
Is it reasonable to estimate that the performance of this code will be bounded by log2(n) memory round-trips per lookup?
If so, could you halve that to log2(n)/2 round-trips by issuing parallel memory requests for the midpoint at step n and both candidate midpoints at step n+1?
Have a look at https://arxiv.org/abs/1509.05053 from pkhuong. That paper discusses this issue and several variants: e.g. branchless binary search, branchy binary search relying on speculative execution as a form of prefetching, and branchless binary search with explicit prefetching.
The answer seems to be "yes" judging by their dramatic conclusion:
> Our conclusion is contrary to our expectations and to previous findings and goes against conventional wisdom, which states that accesses to RAM are slow, and should be minimized. A more accurate statement, that accounts for our findings, is that accesses to RAM have high latency and this latency needs to be mitigated.
In other words, to optimize memory access you should think like a network engineer: treat RAM as a "long fat pipe" and apply the usual latency-hiding optimizations, like pipelining, that you would use in TCP or HTTP.
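As a concrete sketch of that mindset, here's the branchless search with explicit prefetching of both possible next midpoints, so whichever one the comparison selects is already in flight (my own code; __builtin_prefetch is a GCC/Clang extension, and whether it actually helps depends heavily on the machine and the array size):

```cpp
#include <cstddef>

// Branchless lower bound with explicit prefetching: before each comparison,
// hint both midpoints the *next* iteration might touch, overlapping memory
// latency with the current step's work.
std::size_t lower_bound_prefetch(const int* a, std::size_t n, int needle) {
    std::size_t base = 0, len = n;
    while (len > 1) {
        std::size_t half = len / 2;
        // Both candidates for the next level; only one will be used, but
        // prefetching is just a hint, so the wasted one costs little.
        __builtin_prefetch(&a[base + half / 2]);
        __builtin_prefetch(&a[base + half + half / 2]);
        base += (a[base + half - 1] < needle) ? half : 0;
        len -= half;
    }
    return base + (a[base] < needle);
}
```

This is roughly the "branchless + explicit prefetch" variant from the paper's taxonomy, modulo details.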
Does anyone know of any comprehensive resources covering techniques for writing branchless code or otherwise optimizing for GPUs? I learned about masking the other day, but this sort of stuff seems to be covered only in scattered blog posts and institutional knowledge.
Professional CUDA C Programming by Cheng, Grossman and McKercher would be my recommendation. They show you how to do many optimizations, motivating them with real-world examples: matrix addition, reduction of n elements, and matrix transpose, among others. They give the code, then show what it looks like after optimization 1, 2, 3, etc., along with the speed-up each optimization achieved over the naive implementation.
Oh, that's great. I actually have it lying here next to me, but I haven't opened it yet because I went with OpenCL, and I haven't even tried all the options in OpenCL yet. I guess I need to reprioritize. Thanks.
Here's a godbolt to play with as well, showing how it does actually compile down to being branchless in clang (you can switch it to use other compilers too)
https://goo.gl/bywp8b
In case anybody was wondering, the x86 assembly is doing the multiply trick. The rather confusing "lea" instruction -- Load Effective Address -- is used as a space-efficient way of saying "take one register, multiply by 1, 2, 4, or 8, add another register, and put it in a third register". The thing in the register that gets multiplied by that constant is either a 1 or a 0.
If the constants had been less convenient numbers, like 42, the compiler could have instead used a conditional move instruction -- more general-purpose, but I believe slightly larger and often ever-so-slightly slower.
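To make the codegen concrete, a step like the following is the kind of thing that maps onto lea (purely illustrative; the function name and values are mine, not from the article):

```cpp
#include <cstddef>

// The comparison yields 0 or 1; multiplying by 4 (a legal lea scale factor)
// lets the compiler emit a scale-and-add instead of a jump. With an awkward
// constant like 42 it would likely fall back to a conditional move.
std::size_t step(const int* haystack, int needle, std::size_t i) {
    return i + static_cast<std::size_t>(haystack[i + 4] <= needle) * 4;
}
```

As always, the exact instruction selection depends on the compiler and optimization level; godbolt is the easy way to check.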
It's a conditional instruction, but it can be implemented without branching. The difference is that you can execute arbitrary instructions after a branch, whereas the select (the ?: operator) does not change the instruction flow. SIMD processors can do a parallel select but cannot take parallel diverging branches.
I rewrote the OP's code using ifs, and it compiles to exactly the same instructions. You can then add -O3 -msse -msse2 as compiler flags; the code will be optimized in about the same way.
movfuscator doesn't really obfuscate all that much, other than just being weird.
Its whole shtick is to compute offsets into its memory space differently based on the input data, but never to change the instruction stream based on that data. Computations that shouldn't contribute to the final result are simply given a scratch offset that is never read back, but the instruction stream never deviates.
This SIMD code is also ultimately computing different offsets into memory space (contributing 0 to that lane's offset on each iteration if that SIMD lane has already found the node it's looking for), but keeping the instruction stream consistent across the 'separate' lanes.
If I'm doing my mental compilation correctly, these can't actually be parallelized, due to 'ret' being referenced and set by each line. I'm curious if this is actually the case, and if an actual parallel implementation is possible (and how fast).
Yeah, the data dependencies mean this will never be fast.
Additionally, "(haystack[4] <= needle) ? 4 : 0" is basically a branch. I mean, a smart compiler could make it a conditional move, but that is in no way guaranteed.
This isn't a parallelized implementation, no. To have that, imagine that instead of a size_t for ret, that it was something like size_t[4], and that each operation you did acted on all 4 items at once (this is how SIMD operations work, they do the same thing to each item in a vector of values).
You would end up returning a size_t[4], which would have the results of your 4 searches in it, which ran in parallel.
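Sketched out, that size_t[4] version might look like the following scalar emulation, where every lane executes the same operation each step (real SIMD would use gather loads or one GPU thread per lane; the names here are mine):

```cpp
#include <cstddef>

// Four binary searches in lock-step: one window start per lane, and each
// round does the same compare-and-offset in every lane. The comparison
// contributes either `half` or 0 to that lane's offset, never a divergent
// branch, which is exactly the execution model SIMD hardware wants.
void search4(const int* haystack, std::size_t n,
             const int (&needles)[4], std::size_t (&ret)[4]) {
    std::size_t lo[4] = {};
    for (std::size_t len = n; len > 1; len -= len / 2) {
        std::size_t half = len / 2;
        for (int lane = 0; lane < 4; ++lane)
            lo[lane] += (haystack[lo[lane] + half - 1] < needles[lane]) ? half : 0;
    }
    for (int lane = 0; lane < 4; ++lane)
        ret[lane] = lo[lane] + (haystack[lo[lane]] < needles[lane]);
}
```

A nice side effect even on a scalar CPU: the four loads per round are independent, so their memory latencies can overlap instead of serializing.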
Regarding the ternary operator, you can always check your assembly (here's some that shows it https://godbolt.org/g/BM6tfc), but the article also shows how to do it with a multiply instead of the ternary operator.
The biggest thing sticking out of this as being slow would be the cache unfriendliness of binary searching, but the link at the end shows how to clear that up (which is not the point of the article, but the ideas are compatible and combinable).
I think the branching could be avoided by using a cast from boolean to integer, "(haystack[4] <= needle)*4", instead of a ternary operator. It doesn't help at all with the data lookups, though, which is the bigger problem.
It depends on the optimization level. Using no optimization, gcc 5.4 will use branches for this case, but using -O2, it avoids the branches (on x86_64). Verified by using gcc -S to show the assembly generated.
You could always run several searches in parallel with normal binary searches as well, since this offers no substantial advantages over a branching method.
Technically a comparison operator would be a form of branching, because you have comparisons and jumps based on flags, the same as what happens when you compile an if statement.
You are completely right, I'll mention this in the post (In that case do a branchless comparison based on subtraction or similar?). Fwiw when doing simd, you'd use intrinsics which wouldn't have this issue, but it is a valid point and completely true. Thanks!
Also of interest are sorting networks https://en.wikipedia.org/wiki/Sorting_network which can be used to perform branchless sort on parallel hardware.
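For example, the optimal 4-input network needs only five compare-exchanges, and each one is a fixed min/max pair rather than a data-dependent branch (a quick sketch, not tuned code):

```cpp
#include <algorithm>

// Compare-exchange: after this, a <= b. min/max compile branch-free on
// most targets, and in SIMD they are single min/max instructions.
inline void cswap(int& a, int& b) {
    int lo = std::min(a, b), hi = std::max(a, b);
    a = lo;
    b = hi;
}

// 4-element sorting network: the same five comparators run for every
// input order, so all SIMD lanes stay in lock-step.
void sort4(int v[4]) {
    cswap(v[0], v[1]); cswap(v[2], v[3]);  // stage 1
    cswap(v[0], v[2]); cswap(v[1], v[3]);  // stage 2
    cswap(v[1], v[2]);                     // stage 3
}
```

That fixed instruction sequence is exactly why sorting networks show up in GPU kernels and in the small-case base of SIMD sorts.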