I was stumped as to how this was possible, until I saw that it relied on memory access relative to a computed offset, for which as far as I know there are no SIMD instructions. I guess it could be argued that a texture in a shader would provide similar functionality.
Since the title mentioned GPU-friendliness: GPUs are built around scatter/gather memory accesses. The actual performance of a memory access of course still depends on locality. The first load in a parallel binary search will be fast, since all threads load the same element; later loads can get progressively worse.
That said, the code in the article doesn't actually improve anything specifically for GPUs, because standard binary searches are already "branchless" for practical purposes. Really, the only difference between the article and a standard binary search is that the article's code manually unrolls the search loop. That, and some (not all) binary search implementations can do an early-out if the comparison happens to hit equality.
None of these changes make a SIMD- or GPU-specific difference, because loop unrolling doesn't make a difference in terms of control-flow divergence.
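For reference, the "standard branchless" form being compared against doesn't need any unrolling at all. Here's a minimal sketch (my own code, not the article's) where the conditional expression typically compiles to a conditional move rather than a jump:

```cpp
#include <cstddef>

// Branchless lower-bound sketch: each step halves the window, and the
// comparison result selects the new base without an if/else branch.
std::size_t lower_bound_branchless(const int* a, std::size_t n, int needle) {
    const int* base = a;
    while (n > 1) {
        std::size_t half = n / 2;
        // No jump needed: the comparison just selects the new base pointer.
        base = (base[half - 1] < needle) ? base + half : base;
        n -= half;
    }
    // Final fix-up: step past the last element if it's still below the needle.
    return static_cast<std::size_t>(base - a) + (base[0] < needle);
}
```

Whether the ternary actually becomes a cmov is up to the compiler, but in practice clang and gcc at -O2 do it for this shape of loop.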
I think to really reap the benefits of this you'll want to actually reorganize the in-memory layout of the tree, to make sure all the elements needed for a comparison end up in the same cache line.
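One way to do that reorganization is the BFS ("Eytzinger") layout: store the implicit search tree level by level, so the first few levels, which every search touches, share the same cache lines. A sketch, assuming a sorted input array (helper names are mine):

```cpp
#include <cstddef>
#include <vector>

// Fill `out` in BFS (Eytzinger) order by walking the implicit tree in-order
// while consuming the sorted array left to right.
void to_eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
                  std::size_t& i, std::size_t k = 0) {
    if (k < out.size()) {
        to_eytzinger(sorted, out, i, 2 * k + 1);  // left subtree
        out[k] = sorted[i++];                     // in-order visit
        to_eytzinger(sorted, out, i, 2 * k + 2);  // right subtree
    }
}

// Search the Eytzinger-ordered array; returns the index of `needle`,
// or out.size() if it's absent. Child selection is branch-free arithmetic.
std::size_t eytzinger_search(const std::vector<int>& out, int needle) {
    std::size_t k = 0;
    while (k < out.size()) {
        if (out[k] == needle) return k;
        k = 2 * k + 1 + (out[k] < needle);  // left child + 0 or 1
    }
    return out.size();
}
```

The pkhuong paper linked elsewhere in this thread benchmarks exactly this kind of layout against sorted-order binary search.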
I haven't been following this closely, but the last time I checked scatter-gather loads were really really slow.
As the saying goes, "memory is the new disk". Data structures like B-trees which were designed to improve the performance of data accesses from disk are now useful for improving the performance of algorithms in memory.
I don't have any data readily available, but it's generally understood that you want a cache-aware sort, yes. Usually I see people opt for samplesort rather than sorting chunks and merging, though.
Thank you, scatter-gather was the Google term I was looking for, though I couldn't immediately find any architectures that implement vector scatter-gather. Are you aware of any?
Not sure what a scalar scatter/gather would be, but NV and AMD GPUs do the vector one (you can read/write n values from n offsets with a single instruction).
Is it reasonable to estimate that the performance of this code will be bounded by log2(n) memory round-trips per lookup?
If so, could you halve that to log2(n)/2 round-trips by issuing parallel memory requests for the midpoint at step n and both candidate midpoints at step n+1?
Have a look at https://arxiv.org/abs/1509.05053 from pkhuong. That paper discusses this issue and several variants: e.g. branchless binary search, branchy binary search relying on speculative execution as a form of prefetching, and branchless binary search with explicit prefetching.
The answer seems to be "yes" judging by their dramatic conclusion:
> Our conclusion is contrary to our expectations and to previous findings and goes against conventional wisdom, which states that accesses to RAM are slow, and should be minimized. A more accurate statement, that accounts for our findings, is that accesses to RAM have high latency and this latency needs to be mitigated.
In other words, to optimize memory access you should think like a network engineer: treat RAM as a "long fat pipe" and apply the usual latency-hiding optimizations, like pipelining, that you would use in TCP or HTTP.
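As a concrete sketch of that mindset, here's the branchless search with explicit prefetching of both possible next midpoints, so whichever one the comparison selects is already in flight (my own code; __builtin_prefetch is a GCC/Clang extension, and whether it actually helps depends heavily on the machine and the array size):

```cpp
#include <cstddef>

// Branchless lower bound with explicit prefetching: before each comparison,
// hint both midpoints the *next* iteration might touch, overlapping memory
// latency with the current step's work.
std::size_t lower_bound_prefetch(const int* a, std::size_t n, int needle) {
    std::size_t base = 0, len = n;
    while (len > 1) {
        std::size_t half = len / 2;
        // Both candidates for the next level; only one will be used, but
        // prefetching is just a hint, so the wasted one costs little.
        __builtin_prefetch(&a[base + half / 2]);
        __builtin_prefetch(&a[base + half + half / 2]);
        base += (a[base + half - 1] < needle) ? half : 0;
        len -= half;
    }
    return base + (a[base] < needle);
}
```

This is roughly the "branchless + explicit prefetch" variant from the paper's taxonomy, modulo details.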
Does anyone know of any comprehensive resources covering techniques for writing branchless code or otherwise optimizing for GPUs? I learned about masking the other day, but this sort of stuff seems to be covered only in scattered blog posts and institutional knowledge.
Professional CUDA C Programming by Cheng, Grossman and McKercher would be my recommendation. They show you how to do many optimizations, motivating them with real-world examples: matrix addition, reduction of n elements, and matrix transpose, among others. They give the code, then show what it looks like after optimization 1, 2, 3, etc., along with the speed-up each optimization achieved over the naive implementation.
Oh, that's great. I actually have it lying here next to me, but I haven't opened it yet because I went with OpenCL, and I haven't even tried all the options in OpenCL yet. I guess I need to reprioritize. Thanks.
Here's a godbolt to play with as well, showing how it does actually compile down to being branchless in clang (you can switch it to use other compilers too)
https://goo.gl/bywp8b
In case anybody was wondering, the x86 assembly is doing the multiply trick. The rather confusing "lea" instruction -- Load Effective Address -- is used as a space-efficient way of saying "take one register, multiply by 1, 2, 4, or 8, add another register, and put it in a third register". The thing in the register that gets multiplied by that constant is either a 1 or a 0.
If the constants had been less convenient numbers, like 42, the compiler could have instead used a conditional move instruction -- more general-purpose, but I believe slightly larger and often ever-so-slightly slower.
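To make the codegen concrete, a step like the following is the kind of thing that maps onto lea (purely illustrative; the function name and values are mine, not from the article):

```cpp
#include <cstddef>

// The comparison yields 0 or 1; multiplying by 4 (a legal lea scale factor)
// lets the compiler emit a scale-and-add instead of a jump. With an awkward
// constant like 42 it would likely fall back to a conditional move.
std::size_t step(const int* haystack, int needle, std::size_t i) {
    return i + static_cast<std::size_t>(haystack[i + 4] <= needle) * 4;
}
```

As always, the exact instruction selection depends on the compiler and optimization level; godbolt is the easy way to check.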
It's a conditional instruction, but it can be implemented without branching. The difference is that you can execute arbitrary instructions after a branch, whereas the select (the ?: operator) does not change the instruction flow. SIMD processors can do a parallel select but cannot take parallel diverging branches.
I rewrote the OP's code using ifs, and it compiles to exactly the same instructions. You can then add -O3 -msse -msse2 as compiler flags; the code will be optimized in about the same way.
movfuscator doesn't really obfuscate all that much, other than just being weird.
Its whole shtick is to compute offsets into its memory space differently based on the input data, but never to change the instruction stream based on that data. Computations that shouldn't contribute to the final result are simply given a scratch offset that is never read back, but the instruction stream never deviates.
This SIMD code is also ultimately computing different offsets into memory space (contributing 0 to that lane's offset on each iteration if that SIMD lane has already found the node it's looking for), but keeping the instruction stream consistent across the 'separate' lanes.
If I'm doing my mental compilation correctly, these can't actually be parallelized, due to 'ret' being referenced and set by each line. I'm curious if this is actually the case, and if an actual parallel implementation is possible (and how fast).
Yeah, the data dependencies mean this will never be fast.
Additionally, "(haystack[4] <= needle) ? 4 : 0" is basically a branch. I mean, a smart compiler could make it a conditional move, but that is in no way guaranteed.
This isn't a parallelized implementation, no. To have that, imagine that instead of a size_t for ret, that it was something like size_t[4], and that each operation you did acted on all 4 items at once (this is how SIMD operations work, they do the same thing to each item in a vector of values).
You would end up returning a size_t[4], which would have the results of your 4 searches in it, which ran in parallel.
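Sketched out, that size_t[4] version might look like the following scalar emulation, where every lane executes the same operation each step (real SIMD would use gather loads or one GPU thread per lane; the names here are mine):

```cpp
#include <cstddef>

// Four binary searches in lock-step: one window start per lane, and each
// round does the same compare-and-offset in every lane. The comparison
// contributes either `half` or 0 to that lane's offset, never a divergent
// branch, which is exactly the execution model SIMD hardware wants.
void search4(const int* haystack, std::size_t n,
             const int (&needles)[4], std::size_t (&ret)[4]) {
    std::size_t lo[4] = {};
    for (std::size_t len = n; len > 1; len -= len / 2) {
        std::size_t half = len / 2;
        for (int lane = 0; lane < 4; ++lane)
            lo[lane] += (haystack[lo[lane] + half - 1] < needles[lane]) ? half : 0;
    }
    for (int lane = 0; lane < 4; ++lane)
        ret[lane] = lo[lane] + (haystack[lo[lane]] < needles[lane]);
}
```

A nice side effect even on a scalar CPU: the four loads per round are independent, so their memory latencies can overlap instead of serializing.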
Regarding the ternary operator, you can always check your assembly (here's some that shows it https://godbolt.org/g/BM6tfc), but the article also shows how to do it with a multiply instead of the ternary operator.
The biggest thing sticking out of this as being slow would be the cache unfriendliness of binary searching, but the link at the end shows how to clear that up (which is not the point of the article, but the ideas are compatible and combinable).
I think the branching could be avoided by using a cast from boolean to integer, "(haystack[4] <= needle)*4", instead of a ternary operator. It doesn't help at all with the data lookups, though, which is the bigger problem.
It depends on the optimization level. Using no optimization, gcc 5.4 will use branches for this case, but using -O2, it avoids the branches (on x86_64). Verified by using gcc -S to show the assembly generated.
You could always run several searches in parallel with normal binary searches as well, since this offers no substantial advantages over a branching method.
Technically a comparison operator would be a form of branching, because you have comparisons and jumps based on flags, the same as what happens when you compile an if statement.
You are completely right, I'll mention this in the post (In that case do a branchless comparison based on subtraction or similar?). Fwiw when doing simd, you'd use intrinsics which wouldn't have this issue, but it is a valid point and completely true. Thanks!
Also of interest are sorting networks https://en.wikipedia.org/wiki/Sorting_network which can be used to perform branchless sort on parallel hardware.
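For example, the optimal 4-input network needs only five compare-exchanges, and each one is a fixed min/max pair rather than a data-dependent branch (a quick sketch, not tuned code):

```cpp
#include <algorithm>

// Compare-exchange: after this, a <= b. min/max compile branch-free on
// most targets, and in SIMD they are single min/max instructions.
inline void cswap(int& a, int& b) {
    int lo = std::min(a, b), hi = std::max(a, b);
    a = lo;
    b = hi;
}

// 4-element sorting network: the same five comparators run for every
// input order, so all SIMD lanes stay in lock-step.
void sort4(int v[4]) {
    cswap(v[0], v[1]); cswap(v[2], v[3]);  // stage 1
    cswap(v[0], v[2]); cswap(v[1], v[3]);  // stage 2
    cswap(v[1], v[2]);                     // stage 3
}
```

That fixed instruction sequence is exactly why sorting networks show up in GPU kernels and in the small-case base of SIMD sorts.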