Bitwise Binary Search: Elegant and Fast (orlp.net)
112 points by usefulcat on May 18, 2023 | 25 comments



> However, I’ve found that some compilers, e.g. GCC on x86-64 will refuse to make this variant branchless. I hate how fickle compilers can be sometimes, and I wish compilers exposed not just the likely/unlikely attributes, but also an attribute that allows you to mark something as unpredictable to nudge the compiler towards using branchless techniques like CMOV’s.

There has been a Clang bug open about this for more than 4 years: https://bugs.llvm.org/show_bug.cgi?id=40027

Naturally, this came up in the context of binary searches, but it is the same for any tree traversal where each branch has maximum entropy (which means it's optimal from a traversal time point of view). Clang is especially reluctant to emit predicated instructions (this is a major pet peeve of mine).

In my codebase I ended up having a function

    template <class T, class I, class Compare>
    I argmax(const T& a, const T& b, I i, I j, Compare cmp) {
      return cmp(a, b) ? j : i;
    }
which I then specialize for various integer types and code it with inline ASM to use a CMOV. This allows me to write all kinds of binary searches and tree traversals without having to do inline ASM every time. I'd really prefer not to have to do that, but the performance impact is significant.
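
For the curious, a minimal sketch of what such a specialization could look like (my guess at the shape, not the parent's actual code), assuming x86-64, GCC/Clang extended asm, 64-bit keys and std::less:

    #include <cstddef>
    #include <cstdint>
    #include <functional>

    // Hypothetical overload: forces a CMOV for the "pick j if a < b, else i" step.
    // The only clobber is the flags register and no memory is touched.
    inline size_t argmax(const uint64_t& a, const uint64_t& b,
                         size_t i, size_t j, std::less<uint64_t>) {
        // cmp sets CF when a < b (unsigned); cmovb then replaces i with j.
        __asm__("cmp %[b], %[a]\n\t"
                "cmovb %[j], %[i]"
                : [i] "+r"(i)
                : [a] "r"(a), [b] "r"(b), [j] "r"(j)
                : "cc");
        return i;
    }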

In this post, I don't think there is a comparison with a classical binary search implemented in a branch-free fashion? Searching along dyadic intervals is a cute trick, but it's not clear whether the improvement comes from that or simply from being branch-free.


Interesting, I’ve often heard that introducing inline assembly is likely to confuse the optimizer more than anything else. Do you happen to have any examples of this technique that you can share?


Generally, if you specify the list of clobbers tightly and don't clobber memory, things can end up mostly reasonable.


> In this post, I don't think there is a comparison with a classical binary search implemented in a branch-free fashion? Searching along dyadic intervals is a cute trick, but it's not clear whether the improvement comes from that or simply from being branch-free.

There isn't. Ultimately I felt that the article should be an algorithmic one on bitwise binary search, not one about finding the best binary search implementation. The latter changes heavily based on compiler, architecture, input distribution, cache size, etc, and there's no way a simple article could do it justice.

For your curiosity I have implemented a classic binary search branch-free as well:

    template<typename It, typename T, typename Cmp>
    It lower_bound_classic_branchless(It begin, It end, const T& value, Cmp comp) {
        size_t n = 1 + end - begin;
        size_t i = -1;
        while (n > 1) {
            i += comp(begin[i + n/2], value) ? n/2 : 0;
            n -= n/2;
        }
        return begin + i + 1;
    }
It turns out that it performs exactly the same number of comparisons as my lower_bound_overlap.

For larger sizes, lower_bound_overlap is better, on Apple M1: https://i.imgur.com/dCvlZZj.png

For smaller sizes they are essentially the same: https://i.imgur.com/04s4M6r.png

Looking at the assembly, we see that lower_bound_overlap has a (naturally) longer preamble before the main loop, but has a tighter core loop.

lower_bound_overlap:

    ; %bb.0:
        subs x8, x1, x0
        b.eq LBB14_4
    ; %bb.1:
        asr x8, x8, #2
        clz x9, x8
        mov x10, #-9223372036854775808
        lsl x11, x8, #1
        and x11, x11, #0xfffffffffffffffc
        ldr w11, [x0, x11]
        lsr x10, x10, x9
        ldr w9, [x2]
        sub x8, x8, x10
        cmp w11, w9
        csinv x8, x8, xzr, lo
        lsr x10, x10, #1
        cbz x10, LBB14_3
    LBB14_2:                                ; =>This Inner Loop Header: Depth=1
        add x11, x10, x8
        ldr w12, [x0, x11, lsl #2]
        cmp w12, w9
        csel x8, x11, x8, lo
        lsr x10, x10, #1
        cbnz x10, LBB14_2
    LBB14_3:
        add x8, x0, x8, lsl #2
        add x0, x8, #4                      ; =4
    LBB14_4:
        ret

lower_bound_classic_branchless:

    ; %bb.0:
        sub x8, x1, x0
        add x8, x8, #4                      ; =4
        asr x8, x8, #2
        cmp x8, #2                          ; =2
        b.lo LBB8_3
    ; %bb.1:
        ldr w10, [x2]
        mov x9, #-1
    LBB8_2:                                 ; =>This Inner Loop Header: Depth=1
        lsr x11, x8, #1
        add x12, x9, x11
        ldr w12, [x0, x12, lsl #2]
        cmp w12, w10
        csel x12, x11, xzr, lo
        add x9, x12, x9
        sub x8, x8, x11
        cmp x8, #1                          ; =1
        b.hi LBB8_2
        b LBB8_4
    LBB8_3:
        mov x9, #-1
    LBB8_4:
        add x8, x0, x9, lsl #2
        add x0, x8, #4                      ; =4
        ret


Thanks for checking!

> It turns out that it performs exactly the same number of comparisons as my lower_bound_overlap.

This is surprising; shouldn't the number of comparisons be identical to the standard implementation?


> This is surprising; shouldn't the number of comparisons be identical to the standard implementation?

No, to make it branchless (in an efficient way, it can be done regardless of course), you always take ceil(n / 2) as your remaining size, whereas the branchy version chooses either floor(n / 2) or ceil(n / 2) as appropriate.

And in fact, for a branchless implementation you don't want the number of comparisons to match the standard implementation, because a data-dependent iteration count would make the branch on loop exit unpredictable.

I did find that by changing the implementation slightly:

    template<typename It, typename T, typename Cmp>
    It lower_bound(It begin, It end, const T& value, Cmp comp) {
        size_t n = 1 + end - begin;
        size_t i = -1;
        while (n > 1) {
            size_t half = n / 2;
            size_t m = i + half;
            if (comp(begin[m], value)) i = m;
            n -= half;
        }
        return begin + i + 1;
    }

clang gives slightly better code on the Apple M1 (avoiding an unnecessary add in the tight loop):

    ; %bb.0:
        sub x8, x1, x0
        add x8, x8, #4                      ; =4
        asr x8, x8, #2
        cmp x8, #2                          ; =2
        b.lo LBB8_3
    ; %bb.1:
        ldr w10, [x2]
        mov x9, #-1
    LBB8_2:                                 ; =>This Inner Loop Header: Depth=1
        lsr x11, x8, #1
        add x12, x11, x9
        ldr w13, [x0, x12, lsl #2]
        cmp w13, w10
        csel x9, x12, x9, lo
        sub x8, x8, x11
        cmp x8, #1                          ; =1
        b.hi LBB8_2
        b LBB8_4
    LBB8_3:
        mov x9, #-1
    LBB8_4:
        add x8, x0, x9, lsl #2
        add x0, x8, #4                      ; =4
        ret

which results in the classic algorithm beating the overlap method on smaller sizes, but not on larger sizes, on Apple M1: https://i.imgur.com/VjD5EK7.png


As the article says, overlapping halves turns out to be faster than using powers of two in the general case. But for small enough searches the bitwise method can be run in vector registers! I talked about this at [0], slides as zipped html at [1] (and I forgot the last click on the multi-byte data slide that shifts the indices for the final result!). The overlap method is probably also possible, but the simpler bitwise operations work out better in SIMD.

[0] https://youtu.be/paxIkKBzqBU?t=610

[1] https://www.dyalog.com/uploads/conference/dyalog19/presentat...
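
For readers who haven't watched the talk: a much simpler sketch (mine, not the bitwise-in-SIMD method from the slides) of the general idea that a small fixed-size search can live entirely in vector registers, assuming AVX2 and GCC/Clang's __builtin_popcount:

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Lower bound of `value` in a sorted array of exactly 8 int32 keys:
    // compare all keys against the needle at once, then count the lanes
    // that are strictly less than it (a prefix, since the array is sorted).
    size_t lower_bound8(const int32_t keys[8], int32_t value) {
        __m256i k  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(keys));
        __m256i v  = _mm256_set1_epi32(value);
        __m256i lt = _mm256_cmpgt_epi32(v, k);             // lanes where keys[i] < value
        unsigned mask = static_cast<unsigned>(
            _mm256_movemask_ps(_mm256_castsi256_ps(lt)));  // one bit per 32-bit lane
        return static_cast<size_t>(__builtin_popcount(mask));
    }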


Done in hardware decades ago.

https://en.wikipedia.org/wiki/Successive-approximation_ADC

Branchless and everything.


I made an implementation of that same articles algorithm in Python using C-extensions, and it beat the stdlib implementation across many sizes of arrays. https://github.com/juliusgeo/branchless_bisect



There's something wrong in the comparison calculation.

To begin with, when n is exactly 2^k-1, I would expect all the functions to show k comparisons, including Skarupke's. In the plot, all the curves overlap at 64, 127 and 255 except Skarupke's which is off by 1.

When n is not 2^k-1, again Skarupke's curve is nearly always 1 above the green one (lower_bound_overlap). That can't be correct: analyzing lower_bound_skarupke() and lower_bound_overlap(), I expect the former to never perform more comparisons than the latter. In fact, if after the first compare the target value lies in the left partition, they both perform another k compares, k being log2(bit_floor(n)). However, if the target value lies in the right partition, then lower_bound_overlap() again performs k compares, while lower_bound_skarupke() recalculates k based on the size of the right partition. Since the right partition can never be bigger than n/2, the new k can't be larger than the previous one.

For this reason I expect the blue curve to copy the green one for all values of n right of the midpoint between two powers of 2, e.g. between 64+32 and 128, or between 128+64 and 255, and to perform better for the other values of n.


> To begin with, when n is exactly 2^k-1, I would expect all the functions to show k comparisons, including Skarupke's. In the plot, all the curves overlap at 64, 127 and 255 except Skarupke's which is off by 1.

I'm afraid you misread the plot, although I admit it is hard to see. All the curves overlap at 31, 63, 127, etc, except Skarupke's. This is the raw data: https://gist.github.com/orlp/782369f5c047447bbaa924c13bdc6b3...

The reason Skarupke's method is less efficient is that it does not take into account that the optimal size for binary search is 2^k - 1, so it compares at powers of two. This causes it to need an extra comparison; note the final line of the algorithm:

   return begin + comp(*begin, value);
None of my methods have this extra comparison at the end. For simplicity, let's analyze n = 7 for Skarupke's method:

    size_t step = bit_floor(length); // step = 4

    // First comparison.
    if (step != length && comp(begin[step], value)) {
        // For simplicity, let's assume the comparison fails.
    }

    // Second comparison.
    step /= 2; // step = 2
    if (comp(begin[step], value)) begin += step;

    // Third comparison.
    step /= 2; // step = 1
    if (comp(begin[step], value)) begin += step;

    step /= 2; // step = 0, break loop

    // Fourth comparison.
    return begin + comp(*begin, value);
Oops, that's four comparisons instead of the optimal three.


> I'm afraid you misread the plot

No, I misread Skarupke's algorithm: I completely missed the last comparison. You're right.

Now, I wonder if your "overlap" algorithm, but with Skarupke's partitioning scheme, wouldn't perform even better, on average?


From the article, "A commonly heard advice is to not use binary search for small arrays, but to use a linear search instead. I find that not to be true on the Apple M1 for integers, at least compared to my branchless binary search, when searching a runtime-sized but otherwise fixed size array".

I expect the linear search is better when the data being searched isn't yet in cache because it allows the memory subsystem to better predict what the CPU will access next. I'm not sure how large this effect is. It would be interesting to see the benchmarks redone on uncached data.


Random Idea: It might be interesting to try to implement a bitwise binary search as a single additional x86 (or other ISA) instruction, implemented behind the scenes with multiple supporting microcode steps/instructions. The idea is that if it could be implemented in microcode, it could potentially be even faster than a software version that compiles to multiple instructions...

Anyway, great article!


That would only help if the bottleneck of your algorithm is getting the code from memory to the CPU and having the CPU decode the instructions. But if your code is that big then it wouldn't fit in microcode either.

Pipelined CPUs are insanely efficient at executing code at the maximum speed the execution units would permit, so having more complex code encoded in a macro instruction offers diminishing returns, until it actually becomes detrimental because the surface area you could devote to general caches is now occupied by stuff that doesn't help all that much.


>That would only help if the bottleneck of your algorithm is getting the code from memory to the CPU and having the CPU decode the instructions.

In the context you use them in, kindly define: "bottleneck", "algorithm", "code", and "decode"...

>But if your code is that big then it wouldn't fit in microcode either.

How big is the code for binary search?

>Pipelined CPUs are insanely efficient at executing code at the maximum speed the execution units would permit, so having more complex code encoded in a macro instruction offers diminishing returns, until it actually becomes detrimental because the surface area you could devote to general caches is now occupied by stuff that doesn't help all that much.

Does microcode occupy the "surface area" devoted to "general caches"?


I'm sorry I don't have time at the moment to go deeper on this, but I'll add two things that may be useful for the reader:

1. The Intel Optimization Reference, under Section 3.5.1, advises:

"Favor single-micro-operation instructions."

"Avoid using complex instructions (for example, enter, leave, or loop) that have more than 4 micro-ops and require multiple cycles to decode. Use sequences of simple instructions instead."

2. By surface I meant literally silicon surface area. That and transistor count (and other aspects like fan-out and clock domains etc) are the major aspects you trade when engineering a CPU.

If you need a larger microcode ROM to store larger microprograms, you also need more bits to address into the microcode ROM, and that makes microcode programs even larger! All this consumes surface area and transistors that could be devoted to something else.

Sure, if you could fit the binary search instruction microcode in the existing spare space of the microcode ROM you wouldn't have that problem but you'd still be competing with other potential use cases of those alleged "microcode speedups". What about a UTF-8 string length instruction, would that be more important? Etc


>"1. The Intel Optimization Reference, under Section 3.5.1, advises:

'Favor single-micro-operation instructions.'

'Avoid using complex instructions (for example, enter, leave, or loop) that have more than 4 micro-ops and require multiple cycles to decode. Use sequences of simple instructions instead.'"

Do you know WHY exactly it says that? Please explain.

>2. By surface I meant literally silicon surface area. That and transistor count (and other aspects like fan-out and clock domains etc) are the major aspects you trade when engineering a CPU.

How are hardware transistor count, fan-out, clock domains, etc. relevant to adding to/changing the microcode instructions for a CPU via a software microcode patch?

>If you need a larger microcode ROM to store larger microprograms, you also need more bits to address into the microcode ROM, and that makes microcode programs even larger! All this consumes surface area and transistors that could be devoted to something else.

What CPU are we talking about?

>"Sure, if you could fit the binary search instruction microcode in the existing spare space of the microcode ROM"

This is what we are talking about.

>"you wouldn't have that problem"

No you wouldn't.

>"but you'd still be competing with other potential use cases of those alleged "microcode speedups". What about a UTF-8 string length instruction, would that be more important? Etc"

What are all of the "other potential use cases"?

How would adding an additional microcode instruction to a CPU that had room for it compete with all of the "other potential use cases"?

What exactly do you mean by "UTF-8 string length instruction"?

What exactly do you mean by "string length instruction"?

?


One thing I'm confused about: Did the author try vectorizing the linear search implementation? (Of course, it is possible that even if they did not, the compiler did.) I would imagine that vectorization is behind the advice to use linear searches on small arrays.


> I would imagine that vectorization is behind the advice to use linear searches on small arrays.

I always thought a core part of it is that searching linearly is very cache efficient due to easy prefetching of linear data, and that this is simply faster when you don't have to search too much. Of course, that and vectorisation together are even better.
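
As a rough illustration (mine, not from the thread) of why linear search vectorises so well: the whole thing can be written as a branchless counting loop, which GCC and Clang will usually auto-vectorize for integer keys:

    #include <cstddef>
    #include <cstdint>

    // Lower bound by counting how many keys compare less than `value`.
    // No data-dependent branch in the loop body, so the compiler is free
    // to turn it into SIMD compares plus a horizontal sum.
    size_t lower_bound_linear(const int32_t* a, size_t n, int32_t value) {
        size_t count = 0;
        for (size_t i = 0; i < n; ++i)
            count += a[i] < value;
        return count;  // index of the first element >= value
    }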


I did not; the benchmarking code is available at https://github.com/orlp/bitwise-binary-search. The compiler also did not appear to have autovectorized it.

That said, I would expect the overhead of vectorization to dominate at smaller sizes, and the fact that linear search is... well... linear to dominate at larger sizes. There might be a sweet spot for a few dozen elements where it would be the best, but I think it'd be rather niche (and strictly limited to floats/integers).

Regarding cache effects of linear search: this entire article is written under the assumption that your data is in cache; if not, your results will vary (wildly).


> We want to find the lower bound of some element x in this array.

This should be “the greatest lower bound” (aka infimum).


One of my favorite bloggers on this platform.


That is very kind of you :)



