A "simple" memory layout for cache friendliness (assuming you don't have to do updates on your tree; there exists a more complicated cache-friendly dynamic tree structure, too) is a recursive van Emde Boas layout. This memory layout will achieve performance no more than a (small) constant multiple worse than an omniscent cache in terms of the number of cache misses, and it requires almost no tuning to work in any caching hierarchy.
As SSE/AVX registers get wider and wider you might as well compare an entire cache line (or two) at a time. But the overhead of building the level-order k-ary tree means you need to do a whole lot of lookups for each insert... so it doesn't apply to that many problems. Unless you're building a search engine. Then it applies a lot.
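To make "compare an entire cache line at a time" concrete, here is a hedged sketch using AVX2 intrinsics (my own illustration, not anyone's posted code): sixteen sorted 32-bit keys fill a 64-byte line, and two compares plus a popcount give the search key's rank within the line, i.e. which of the 17 children of a k-ary node to descend into.

    #include <immintrin.h>
    #include <stdint.h>

    /* Count how many of the 16 sorted int32 keys in one 64-byte cache line
     * are less than `needle`; that count (0..16) is the child to follow. */
    static inline int rank_in_cache_line(const int32_t *line, int32_t needle)
    {
        __m256i n   = _mm256_set1_epi32(needle);
        __m256i lo  = _mm256_loadu_si256((const __m256i *)line);
        __m256i hi  = _mm256_loadu_si256((const __m256i *)(line + 8));
        /* lanes where needle > key, i.e. key < needle */
        __m256i lt0 = _mm256_cmpgt_epi32(n, lo);
        __m256i lt1 = _mm256_cmpgt_epi32(n, hi);
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt0))
                 | (_mm256_movemask_ps(_mm256_castsi256_ps(lt1)) << 8);
        return __builtin_popcount(mask);
    }

(Compile with -mavx2; one wider compare per line is possible with AVX-512.)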
I also really liked the nice package there for testing it out yourself, creating a results package, and the clear email instructions for contributing the results.
Is there any good way of running these kinds of experiments on a variety of hardware, possibly with other people's help?
Binary search doesn't make much sense on modern processors anyways.
We tend to end up comparing keys that are much smaller than the cache lines, and the memory access takes so long compared to the actual calculation that you may as well check everything in the cache line "while you're at it".
Which ends up meaning that in practice, you may as well do a k-ary search.
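A minimal scalar sketch of that k-ary idea (my own illustration, assuming a static level-order array of 17-way nodes, each holding 16 sorted int32 keys padded with INT32_MAX; none of this layout is from the article):

    #include <stdint.h>
    #include <stddef.h>

    #define B    17          /* branching factor: 16 keys = one cache line of int32 */
    #define KEYS (B - 1)

    /* keys: n_nodes blocks of KEYS sorted keys, nodes numbered in level order
     * (children of node k are k*B + 1 .. k*B + B).  Returns 1 if present. */
    int ktree_contains(const int32_t *keys, size_t n_nodes, int32_t needle)
    {
        size_t node = 0;
        while (node < n_nodes) {
            const int32_t *blk = keys + node * KEYS;
            int rank = 0;                       /* how many keys < needle      */
            for (int i = 0; i < KEYS; i++) {    /* read the whole cache line   */
                if (blk[i] == needle) return 1;
                rank += (blk[i] < needle);
            }
            node = node * B + 1 + rank;         /* descend into child #rank    */
        }
        return 0;
    }

The inner loop over all 16 keys is essentially free once the line is resident, and it is exactly the loop that the SIMD compare mentioned above would replace.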
I wonder how much, if any, this can be improved by prefetching the indexes you might access next step?
Poul-Henning Kamp already wrote about it (https://web.archive.org/web/20150523191845/http://queue.acm....). Long story short: with the correct data structure, and assuming you are going to be limited by your storage, you can expect anything from small wins when your RAM is empty to a 10x win when your RAM is full (and you have to fault pages).
Not quite the same thing, because you want to pull in a fixed number of bytes at a time, which can be a variable number of keys. But yes, pretty close.
And yes, there is nothing new in the field of CS, ever, it seems. Although note that that article is talking about a B-heap (with insertion, etc.), whereas this is a B-tree on straight lookups.
I wonder how much, if any, this can be improved by prefetching the indexes you might access next step?
I haven't tested it directly on binary search, but I've done some synthetic tests of the same approach. The answer seems to be: the results can be really good!
There are several issues, though.
First, your gains will usually be in increased throughput rather than reduced latency. And it is a tradeoff that requires greater consumption of memory bandwidth, since you're loading data you don't need. If you are running on a single core, this is usually a good tradeoff. If you are already parallel on all cores, it may not be.
The second is that you need to be able to "batch" your requests. That is, you need to be prefetching for the next request while you are handling the next level for the current request. The shorter the final processing time once you have the answer, the larger your batches need to be.
The last is that each CPU has a limit on the number of outstanding requests for cachelines from RAM: usually 10. This limits the size of your batches in accordance with Little's Law. The question then becomes whether that batch size is possible, and how much benefit it provides.
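(To put rough, purely illustrative numbers on that: Little's Law says throughput ≈ outstanding requests / latency, so with ~10 outstanding line fills and ~100 ns of main-memory latency a single core tops out around 10 / 100 ns ≈ 100 million cache-line fetches per second, however many searches you interleave.)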
But there is a sweet spot where you can get large gains (2-3x single processor throughput) if you have enough requests available, if you can tolerate a slightly slower response time, and if the processing required once you have a result is significant (comparable to memory latency).
Oh, and you also need to be willing to write the critical sections in assembly. Even if you figure out the combination of fragile syntax and compiler options that produces the branch-free logic you need, convincing an optimizing compiler to issue your apparently unnecessary prefetches at exactly the point in time you need them is next to impossible.
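For what it's worth, here is roughly what the batched version can look like in plain C with __builtin_prefetch rather than assembly; it's only a sketch over an Eytzinger (implicit-heap) layout, and, as the parent says, the compiler may not schedule the prefetches exactly where you want them.

    #include <stdint.h>
    #include <stddef.h>

    #define G 8   /* batch size, bounded by the ~10 outstanding misses per core */

    /* tree[1..n]: sorted data in Eytzinger (level-order heap) layout.
     * For each of the G needles, writes the 1-based Eytzinger index of the
     * first element >= needle (0 if there is none).  All G searches advance
     * one tree level per pass, each prefetching the node it touches next. */
    void batched_lower_bound(const int32_t *tree, size_t n,
                             const int32_t *needles, size_t *out)
    {
        size_t pos[G];
        for (int q = 0; q < G; q++) pos[q] = 1;          /* start at the root */

        int levels = 0;                                   /* floor(log2 n) + 1 */
        for (size_t m = n; m; m >>= 1) levels++;

        for (int l = 0; l < levels; l++) {
            for (int q = 0; q < G; q++) {
                size_t p = pos[q];
                if (p <= n) {
                    p = 2 * p + (tree[p] < needles[q]);   /* descend one level */
                    if (p <= n)
                        __builtin_prefetch(&tree[p], 0, 1);
                    pos[q] = p;
                }
            }
        }
        /* standard Eytzinger decode: strip the trailing "went right" steps */
        for (int q = 0; q < G; q++)
            out[q] = pos[q] >> (__builtin_ctzll(~(unsigned long long)pos[q]) + 1);
    }

The point of the two nested loops is exactly the batching described above: while search q's next node is still in flight, the CPU has the other G-1 searches' current nodes to chew on.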
This paper isn't quite on this exact topic, but leans heavily in the same direction:
http://supertech.csail.mit.edu/cacheObliviousBTree.html
More generally: in the last 10-15 years there has been a lot of work on developing algorithms that are compatible with caching, but which do not need to be tuned to the details of a particular cache hierarchy on which they run. For more info, see
Demaine, Erik D. "Cache-Oblivious Algorithms and Data Structures." http://erikdemaine.org/papers/BRICS2002/paper.pdf
Erik Demaine's 6.046 lectures on this topic are available on YouTube: https://www.youtube.com/watch?v=cJOHERGcGm4 and https://www.youtube.com/watch?v=zjUDy6a5vx4
One cache-oblivious dynamic tree structure is also due to Demaine: http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/