I wonder how much, if any, this can be improved by prefetching the indexes you might access next step?
I haven't tested it directly on binary search, but I've done some synthetic tests of the same approach. The answer seems to be: the results can be really good!
There are several issues, though.
First, your gains will usually come as increased throughput rather than reduced latency. And this is usually a tradeoff that costs extra memory bandwidth, since you're loading data you may not end up needing. If you are running on a single core, this is usually a good tradeoff. If you are already parallel across all cores, it may not be.
The second is that you need to be able to "batch" your requests. That is, you need to be prefetching for the next request while you are handling the next level of the current one (see the sketch below). The shorter the final processing time once you have the answer, the larger your batches need to be.
The last is that each CPU core has a limit on the number of outstanding cacheline requests to RAM: usually about 10. Per Little's Law (outstanding requests = throughput × latency), that caps how large a batch can usefully be. The question then becomes whether that batch size is possible in your workload, and how much benefit it provides.
But there is a sweet spot where you can get large gains (2-3x single processor throughput) if you have enough requests available, if you can tolerate a slightly slower response time, and if the processing required once you have a result is significant (comparable to memory latency).
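To make the batching concrete, here's a minimal C sketch of the kind of loop I mean: a handful of binary searches run in lockstep, using GCC/Clang's __builtin_prefetch to get the whole batch's misses in flight at once. The function name, batch size, and details are illustrative, not the exact code I used.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: a lower_bound-style search over a sorted int32_t
       array, run for a small batch of keys in lockstep.  At each level the
       whole batch's cache misses are issued together, so the batch pays
       roughly one memory latency per level instead of one per key.  BATCH
       is capped in practice by the ~10 line fills a core can keep
       outstanding. */
    enum { BATCH = 8 };

    void batched_lower_bound(const int32_t *a, size_t n,
                             const int32_t *keys, size_t *out)
    {
        size_t lo[BATCH] = {0};
        size_t len = n;                     /* assumes n >= 1 */

        while (len > 1) {
            size_t half = len / 2;

            /* First start the loads for every key at this level... */
            for (int i = 0; i < BATCH; i++)
                __builtin_prefetch(&a[lo[i] + half]);

            /* ...then do the branch-free advances; by the time each line
               is touched it is (hopefully) already on its way from RAM. */
            for (int i = 0; i < BATCH; i++)
                lo[i] = (a[lo[i] + half] < keys[i]) ? lo[i] + half : lo[i];

            len -= half;
        }

        /* Index of the first element >= key. */
        for (int i = 0; i < BATCH; i++)
            out[i] = lo[i] + (a[lo[i]] < keys[i]);
    }

The ordering of the two inner loops is the point: every prefetch in the batch is issued before any of the dependent loads, so the misses overlap instead of serializing.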
Oh, and you also need to be willing to write the critical sections in assembly. Even if you figure out the combination of fragile syntax and compiler options that produces the branch-free logic you need, convincing an optimizing compiler to issue your apparently unnecessary prefetches at exactly the point where you need them is next to impossible.
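For a sense of what that fight looks like, this (again, just a sketch) is the kind of per-step code you need the compiler to leave exactly as written when you interleave it round-robin across a batch: a conditional move plus a prefetch of the line this key's next step will need. Nothing stops an optimizing compiler from turning the ternary back into a branch or scheduling the prefetch somewhere useless, which is what pushes you into assembly for the critical section.

    /* Illustrative single step of one key's search, meant to be called
       round-robin across the batch so that the prefetch issued here has
       arrived by the time this key's next step runs.  The ternary is meant
       to become a conditional move and the prefetch is meant to stay right
       here; the compiler is entitled to do neither. */
    static inline size_t search_step(const int32_t *a, size_t lo,
                                     size_t *len, int32_t key)
    {
        size_t half = *len / 2;
        lo = (a[lo + half] < key) ? lo + half : lo;   /* hope: cmov */
        *len -= half;
        __builtin_prefetch(&a[lo + *len / 2]);        /* next level's probe */
        return lo;
    }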
This paper isn't on exactly this topic, but it leans heavily in the same direction:
"Rethinking SIMD Vectorization for In-Memory Databases", Orestis Polychroniou, Arun Raghavan, Kenneth A. Ross (SIGMOD 2015): http://www.cs.columbia.edu/~orestis/sigmod15.pdf