SIMD / GPU Friendly Branchless Binary Search (demofox.org)
82 points by Atrix256 on June 20, 2017 | 46 comments



I was stumped as to how this was possible, until I saw that it relied on memory access relative to a computed offset, for which as far as I know there are no SIMD instructions. I guess it could be argued that a texture in a shader would provide similar functionality.

Also of interest are sorting networks https://en.wikipedia.org/wiki/Sorting_network which can be used to perform branchless sort on parallel hardware.
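For a flavor of how those work, here's a toy 4-input sorting network in C++ (5 compare-exchanges, depth 3; my own sketch, not taken from the Wikipedia page). The compare-exchange uses min/max, which typically compiles to conditional moves or SIMD min/max rather than branches:

    #include <algorithm>
    #include <cstdio>

    // Branchless compare-exchange: afterwards a <= b.
    static void cmpxchg(int& a, int& b)
    {
        int lo = std::min(a, b);
        int hi = std::max(a, b);
        a = lo;
        b = hi;
    }

    // Optimal 4-input sorting network: 5 comparators, depth 3.
    static void sort4(int v[4])
    {
        cmpxchg(v[0], v[1]);
        cmpxchg(v[2], v[3]);
        cmpxchg(v[0], v[2]);
        cmpxchg(v[1], v[3]);
        cmpxchg(v[1], v[2]);
    }

    int main()
    {
        int v[4] = {3, 1, 4, 2};
        sort4(v);
        printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]); // prints: 1 2 3 4
    }

The same fixed comparator sequence runs regardless of the data, which is what makes it a good fit for SIMD lanes and GPU threads.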


Since the title mentioned GPU-friendliness: GPUs are built all around scatter/gather memory accesses. The actual performance of memory accesses of course still depends on locality. The first load in a parallel binary search will be fast, since all threads will load the same element; later loads can get progressively worse.

That said, the code in the article doesn't actually improve anything specifically for GPUs, because standard binary searches are already "branchless" for practical purposes. Really, the only difference between the article and a standard binary search is that the article's code manually unrolls the search loop. That, plus the fact that some (not all) binary search implementations can do an early-out if the comparison happens to hit equality.

None of these changes make a SIMD- or GPU-specific difference, because loop unrolling doesn't make a difference in terms of control-flow divergence.


> memory access relative to a computed offset, for which as far as I know there are no SIMD instructions.

Sure there are?

VGATHERDPD and friends. They have been there for a couple of generations, but are only now getting fast.
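For anyone curious, a minimal example in C++ intrinsics (compile with -mavx2). This uses VPGATHERDD, the 32-bit integer sibling of VGATHERDPD; the table and indices are made-up demo data:

    #include <immintrin.h>
    #include <cstdio>

    int main()
    {
        int table[16] = {0, 10, 20, 30, 40, 50, 60, 70,
                         80, 90, 100, 110, 120, 130, 140, 150};

        // Eight independent indices, one per SIMD lane.
        __m256i idx = _mm256_setr_epi32(3, 0, 7, 2, 5, 1, 6, 4);

        // VPGATHERDD: loads table[idx[i]] for all 8 lanes at once.
        // The last argument is the scale in bytes.
        __m256i vals = _mm256_i32gather_epi32(table, idx, 4);

        alignas(32) int out[8];
        _mm256_store_si256((__m256i*)out, vals);
        for (int i = 0; i < 8; i++)
            printf("%d ", out[i]); // prints: 30 0 70 20 50 10 60 40
        printf("\n");
    }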


I think in order to really reap the benefits of this you'll want to actually reorganize the in-memory layout of the trees in order to make sure all elements needed for the comparison end up in the same cache line.

I haven't been following this closely, but the last time I checked scatter-gather loads were really really slow.

Chapter 3.3, from page 27 (PDF page 43) onward, of this might be interesting: https://www.researchgate.net/profile/Florian_Gross/publicati...

Also contains a survey of some other related data structures and algorithms.


As the saying goes, "memory is the new disk". Data structures like B-trees which were designed to improve the performance of data accesses from disk are now useful for improving the performance of algorithms in memory.


Does anybody have any data on whether external sorting [1] is the way to go these days for CPU-based sorts?

That is, use quicksort for chunks up to the size of the L1/L2/L3 cache (depending on benchmarking) and then mergesort the chunks.

[1] https://en.wikipedia.org/wiki/External_sorting
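Roughly this shape, I mean (a sketch; CHUNK is a placeholder you'd tune against your cache sizes by benchmarking):

    #include <algorithm>
    #include <vector>

    // Sort cache-sized chunks (std::sort is introsort, i.e. mostly
    // quicksort), then merge adjacent runs bottom-up.
    void chunked_sort(std::vector<int>& v)
    {
        const size_t CHUNK = 1 << 16; // placeholder, benchmark this
        for (size_t i = 0; i < v.size(); i += CHUNK)
            std::sort(v.begin() + i,
                      v.begin() + std::min(i + CHUNK, v.size()));

        // Merge pairs of runs until a single sorted run remains.
        for (size_t run = CHUNK; run < v.size(); run *= 2)
            for (size_t i = 0; i + run < v.size(); i += 2 * run)
                std::inplace_merge(v.begin() + i,
                                   v.begin() + i + run,
                                   v.begin() + std::min(i + 2 * run, v.size()));
    }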


You can have a look at the best sorting algorithms (for various definitions of "best") here: http://sortbenchmark.org/


That's for big sorts. The question here is about sorting with a single CPU.


Look at the Joule category.


I don't have any data readily available, but it's generally understood that you want a cache-aware sort, yes. Usually I see people opt for samplesort rather than sorting chunks and merging, though.


Thanks for your opinion.

I did not know about samplesort - found it on Wikipedia.


Thank you, scatter-gather was the google term I was looking for, though I couldn't immediately find any architectures that implement a vector scatter-gather. Are you aware of any?


Not sure what a scalar scatter/gather would be, but NV and AMD GPUs do the vector one (you can read/write n values from n offsets with a single instruction).


The Cray absolutely did. How else do you do sparse matrix multiply?


Is it reasonable to estimate that the performance of this code will be bounded by log2(n) memory round-trips per lookup?

If so then could you double performance to log2(n)/2 lookups by issuing parallel memory requests for location n and also both candidates for location n+1?


Have a look at https://arxiv.org/abs/1509.05053 from pkhuong. They discuss this issue and other variants in that paper, e.g. branchless binary search, branchy binary search + speculative execution as a form of prefetching, and branchless binary search with explicit prefetching.
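To illustrate the explicit-prefetch variant (my own sketch, not the paper's exact code; __builtin_prefetch is a GCC/Clang builtin):

    #include <cstddef>

    // Branchless binary search that prefetches both candidate
    // midpoints for the next iteration, so one of the two loads is
    // already in flight whichever way this step resolves.
    // Returns the index of the first element >= needle, or n if none.
    size_t lower_bound_prefetch(const int* a, size_t n, int needle)
    {
        if (n == 0) return 0;
        const int* base = a;
        size_t len = n;
        while (len > 1) {
            size_t half = len / 2;
            size_t rest = len - half;
            __builtin_prefetch(&base[rest / 2]);        // if we stay in the lower half
            __builtin_prefetch(&base[half + rest / 2]); // if we move to the upper half
            base += (base[half] < needle) ? half : 0;
            len = rest;
        }
        return (size_t)(base - a) + (*base < needle);
    }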


Seriously awesome paper!

The answer seems to be "yes" judging by their dramatic conclusion:

> Our conclusion is contrary to our expectations and to previous findings and goes against conventional wisdom, which states that accesses to RAM are slow, and should be minimized. A more accurate statement, that accounts for our findings, is that accesses to RAM have high latency and this latency needs to be mitigated.

In other words: to optimize memory access you should think like a network engineer - treat RAM as a "long fat pipe" and do the usual optimizations like pipelining that you would in TCP or HTTP or ....

Related: "Computer architecture for network engineers." https://github.com/lukego/blog/issues/18.


Does anyone know of any comprehensive resources that cover various techniques for doing things branchless or otherwise optimizing for the GPU? I learned about masking the other day, but it seems this sort of stuff is only covered in dispersed blog posts and institutional knowledge.


Professional CUDA C Programming by Cheng, Grossman and McKercher would be my recommendation. They show you how to do many optimizations, motivating them using real-world examples like matrix addition, reduction of n elements, and matrix transpose, among others. They give the code, then show you what the code looks like after optimization 1, 2, 3, etc., along with what kind of speed-up they were able to obtain over the naive implementation using each optimization.


Oh that's great - I actually have it lying here next to me, but I haven't opened it yet because I went with OpenCL, and I haven't even tried all the options in OpenCL yet. But then I guess I need to reprioritize. Thanks.


This is not branchless.

It's if-less. But the ternary operator (?:) is a branching instruction.


Not necessarily. A compiler can optimize ?x:0 into a multiply. For example,

  f(x) ? y : 0
can also be written as

  (int)(bool)f(x) * y


Here's a godbolt to play with as well, showing how it does actually compile down to being branchless in clang (you can switch it to use other compilers too) https://goo.gl/bywp8b


In case anybody was wondering, the x86 assembly is doing the multiply trick. The rather confusing "lea" instruction -- Load Effective Address -- is used as a space-efficient way of saying "take one register, multiply by 1, 2, 4, or 8, add another register, and put it in a third register". The thing in the register that gets multiplied by that constant is either a 1 or a 0.

If the constants had been less convenient numbers, like 42, the compiler could have instead used a conditional move instruction -- more general-purpose, but I believe slightly larger and often ever-so-slightly slower.


It's a conditional instruction, but it can be implemented without branching. The difference is that you can execute arbitrary instructions after a branch, whereas a select (operator ?:) does not change the instruction flow. SIMD processors can do a parallel select but cannot do parallel diverging branches.
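A tiny SSE2 sketch of such a select: four lanes each choose between two values based on a per-lane comparison, with no change in control flow:

    #include <emmintrin.h>
    #include <cstdio>

    int main()
    {
        __m128i a = _mm_setr_epi32(1, 5, 3, 7);
        __m128i b = _mm_setr_epi32(4, 4, 4, 4);

        // mask lanes are all-ones where a > b, all-zeros elsewhere.
        __m128i mask = _mm_cmpgt_epi32(a, b);

        // Per-lane select: mask ? a : b. Every lane runs the same
        // instructions; only the data differs.
        __m128i sel = _mm_or_si128(_mm_and_si128(mask, a),
                                   _mm_andnot_si128(mask, b));

        int out[4];
        _mm_storeu_si128((__m128i*)out, sel);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); // 4 5 4 7
    }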


This, and using explicit multiplication as an alternative depending on your compiler, is mentioned in the article.


Huh, neat. This is really similar to the underlying mechanism used in the compiler that only used x86 mov instructions.

https://github.com/xoreaxeaxeax/movfuscator


Not really:

1) Doesn't obfuscate

2) Comparison operators compile very similarly to a conditional statement (e.g. if, ternary operator, etc.)

https://godbolt.org/g/xXecW2

I rewrote the OP's code using ifs, and it compiles to exactly the same instructions. If you then add -O3 -msse -msse2 as compiler flags, the code will be optimized in about the same way.


Yes really.

movfuscator doesn't really obfuscate all that much, other than just being weird.

Its whole shtick is to compute offsets into its memory space differently based on the input data, but not to change the instruction stream based on the data. That way, computations that shouldn't contribute to the final result are simply given a scratch offset that isn't read back, but the instruction stream never deviates.

This SIMD code is also ultimately computing different offsets into memory space (contributing 0 to that lane's offset on each iteration if that SIMD lane has already found the node it's looking for), but keeping the instruction stream consistent across the 'separate' lanes.
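A toy illustration of that offset trick (not movfuscator's actual output): pick the result by indexing with the condition instead of branching on it.

    #include <cstdio>

    // The condition selects which slot gets read back; the
    // instruction stream is identical either way, only the
    // memory offset changes.
    int select_by_offset(int cond, int if_true, int if_false)
    {
        int slots[2] = { if_false, if_true };
        return slots[cond != 0];
    }

    int main()
    {
        printf("%d\n", select_by_offset(1, 42, 7)); // 42
        printf("%d\n", select_by_offset(0, 42, 7)); // 7
    }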


If I'm doing my mental compilation correctly, these can't actually be parallelized, due to 'ret' being referenced and set by each line. I'm curious if this is actually the case, and if an actual parallel implementation is possible (and how fast).


Yeah, the data dependencies mean this will never be fast.

Additionally, "(haystack[4] <= needle) ? 4 : 0" is basically a branch. I mean, a smart compiler could make it a conditional move, but that is in no way guaranteed.


This isn't a parallelized implementation, no. To have that, imagine that instead of a size_t for ret, it was something like size_t[4], and that each operation you did acted on all 4 items at once (this is how SIMD operations work: they do the same thing to each item in a vector of values).

You would end up returning a size_t[4], which would have the results of your 4 searches in it, which ran in parallel.

Regarding the ternary operator, you can always check your assembly (here's some that shows it https://godbolt.org/g/BM6tfc), but the article also shows how to do it with a multiply instead of the ternary operator.

The biggest thing sticking out of this as being slow would be the cache unfriendliness of binary searching, but the link at the end shows how to clear that up (which is not the point of the article, but the ideas are compatible and combinable).
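To make the size_t[4] idea concrete, here's a scalar sketch of the lockstep structure: 4 needles searched against a sorted 7-element haystack, with plain arrays standing in for SIMD registers (the step/index scheme is a reconstruction in the article's spirit, not copied from it):

    #include <cstdio>

    // Every "lane" runs the same three steps; only the data differs,
    // which is exactly the shape SIMD hardware wants. ret[i] ends up
    // as the number of haystack elements <= needle[i].
    void search4(const int haystack[7], const int needle[4], int ret[4])
    {
        for (int lane = 0; lane < 4; lane++)
            ret[lane] = 0;

        for (int step = 4; step >= 1; step /= 2)      // steps 4, 2, 1
            for (int lane = 0; lane < 4; lane++)      // same op, all lanes
                ret[lane] += (haystack[ret[lane] + step - 1] <= needle[lane]) * step;
    }

    int main()
    {
        int haystack[7] = {1, 2, 3, 5, 8, 13, 21};
        int needles[4]  = {0, 5, 9, 21};
        int ret[4];
        search4(haystack, needles, ret);
        printf("%d %d %d %d\n", ret[0], ret[1], ret[2], ret[3]); // 0 4 5 7
    }

With real SIMD you'd replace the inner lane loop with vector ops (and a gather for the haystack loads).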


I think the branching could be avoided by using a cast from boolean to integer, "(haystack[4] <= needle)*4", instead of a ternary operator. It doesn't help at all with the data lookups, though, which is the bigger problem.


It depends on the optimization level. Using no optimization, gcc 5.4 will use branches for this case, but using -O2, it avoids the branches (on x86_64). Verified by using gcc -S to show the assembly generated.


Sure, but you could run several searches in parallel on a SIMD architecture with this.


You could always run several searches in parallel with normal binary searches as well, since this offers no substantial advantages over a branching method.

Unless I'm missing something major?


How would you run several searches in parallel on a single core without doing something like this?



Try GCC vector extensions, which allow comparing a scalar to a vector.
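For example (a sketch; needs a reasonably recent GCC or Clang, which broadcast the scalar operand across the lanes):

    #include <cstdio>

    typedef int v4si __attribute__((vector_size(16)));

    int main()
    {
        v4si haystack_vals = {2, 5, 11, 17};

        // Vector <= scalar: each lane becomes -1 (true) or 0 (false).
        v4si le = haystack_vals <= 10;

        // Branchless per-lane arithmetic: -le is 1 where true.
        v4si step = -le * 4;

        printf("%d %d %d %d\n", step[0], step[1], step[2], step[3]); // 4 4 0 0
    }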


Technically a comparison operator would be a form of branching, because you have comparisons and jumps based off flags, same as what happens when you compile an if statement.

It becomes more evident when you see the generated code. https://godbolt.org/g/Mu9KBY

e.g:

    cmp     rax, QWORD PTR [rbp-24] ; compare a and b
    ja      .L2                     ; is a above b? if so, jump to .L2; otherwise proceed
In fact this statement:

    size_t ret = (haystack[4] <= needle) * 4;
...is 100% equivalent to this once compiled:

    register size_t ret;
    if (haystack[4] <= needle) {
        ret = 4;
    } else {
        ret = 0;
    }
So the code is not really branchless, just if-less.


To the trolls that downvoted my post without explaining why, here is the proof: https://godbolt.org/g/xXecW2

I rewrote the function using if statements making it compile to exactly the same instructions.

If you optimize the code with compiler flags (e.g. -msse and -msse2 for SSE/SSE2, -O3 for optimization), the generated instructions are exactly the same.

My point being: the code in the blog post is not branchless. The absence of if statements doesn't mean it's branchless.

If you care to disagree here by downvoting I dare you to reply explaining why you think this is wrong.


You didn't read the post properly. There is a subsection for a version without ternary operators.


The version I compared against was the one using numeric comparison ( <= ) not ternary operators. It's in the link I shared.

So I hope you revert your downvote or provide proof that what I am saying is incorrect.


You are completely right, I'll mention this in the post (in that case, do a branchless comparison based on subtraction or similar?). FWIW, when doing SIMD you'd use intrinsics, which wouldn't have this issue, but it is a valid point and completely true. Thanks!


"You are right!" + no upvote = bad


I'm not the one who downvoted you, but I am the one who wrote the article :P



