A "simple" memory layout for cache friendliness (assuming you don't have to do updates on your tree; there exists a more complicated cache-friendly dynamic tree structure, too) is a recursive van Emde Boas layout. This memory layout will achieve performance no more than a (small) constant multiple worse than an omniscent cache in terms of the number of cache misses, and it requires almost no tuning to work in any caching hierarchy.
As SSE/AVX registers get wider and wider you might as well compare an entire cache line (or two) at a time. But the overhead of building the level-order k-ary tree means you need to do a whole lot of lookups for each insert... so it doesn't apply to that many problems. Unless you're building a search engine. Then it applies a lot.
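To make "compare an entire cache line at a time" concrete, here is a hedged sketch using AVX2 intrinsics (my own illustration, not anyone's posted code): sixteen sorted 32-bit keys fill a 64-byte line, and two compares plus a popcount give the search key's rank within the line, i.e. which of the 17 children of a k-ary node to descend into.

    #include <immintrin.h>
    #include <stdint.h>

    /* Count how many of the 16 sorted int32 keys in one 64-byte cache line
     * are less than `needle`; that count (0..16) is the child to follow. */
    static inline int rank_in_cache_line(const int32_t *line, int32_t needle)
    {
        __m256i n   = _mm256_set1_epi32(needle);
        __m256i lo  = _mm256_loadu_si256((const __m256i *)line);
        __m256i hi  = _mm256_loadu_si256((const __m256i *)(line + 8));
        /* lanes where needle > key, i.e. key < needle */
        __m256i lt0 = _mm256_cmpgt_epi32(n, lo);
        __m256i lt1 = _mm256_cmpgt_epi32(n, hi);
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt0))
                 | (_mm256_movemask_ps(_mm256_castsi256_ps(lt1)) << 8);
        return __builtin_popcount(mask);
    }

(Compile with -mavx2; one wider compare per line is possible with AVX-512.)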
I also really liked the nice package there for testing it out yourself, creating a results package, and the clear email instructions for contributing the results.
Is there any good way of running these kinds of experiments on a variety of hardware, possibly with other people's help?
Binary search doesn't make much sense on modern processors anyways.
We tend to end up comparing keys that are much smaller than the cache lines, and the memory access takes so long compared to the actual calculation that you may as well check everything in the cache line "while you're at it".
Which ends up meaning that in practice, you may as well do a k-ary search.
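A minimal scalar sketch of that k-ary idea (my own illustration, assuming a static level-order array of 17-way nodes, each holding 16 sorted int32 keys padded with INT32_MAX; none of this layout is from the article):

    #include <stdint.h>
    #include <stddef.h>

    #define B    17          /* branching factor: 16 keys = one cache line of int32 */
    #define KEYS (B - 1)

    /* keys: n_nodes blocks of KEYS sorted keys, nodes numbered in level order
     * (children of node k are k*B + 1 .. k*B + B).  Returns 1 if present. */
    int ktree_contains(const int32_t *keys, size_t n_nodes, int32_t needle)
    {
        size_t node = 0;
        while (node < n_nodes) {
            const int32_t *blk = keys + node * KEYS;
            int rank = 0;                       /* how many keys < needle      */
            for (int i = 0; i < KEYS; i++) {    /* read the whole cache line   */
                if (blk[i] == needle) return 1;
                rank += (blk[i] < needle);
            }
            node = node * B + 1 + rank;         /* descend into child #rank    */
        }
        return 0;
    }

The inner loop over all 16 keys is essentially free once the line is resident, and it is exactly the loop that the SIMD compare mentioned above would replace.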
I wonder how much, if any, this can be improved by prefetching the indexes you might access next step?
Poul-Henning Kamp already wrote about it (https://web.archive.org/web/20150523191845/http://queue.acm....). Long story short: with the correct data structure, and assuming you are going to be limited by your storage, you can expect anything from small wins when your RAM is empty to a 10x win when your RAM is full (and you have to fault pages).
Not quite the same thing, because you want to pull in a fixed number of bytes at a time, which can be a variable number of keys. But yes, pretty close.
And yes, there is nothing new in the field of CS, ever, it seems. Although note that that article is talking about a B-heap (with insertion, etc.), whereas this is a B-tree on straight lookups.
I wonder how much, if any, this can be improved by prefetching the indexes you might access next step?
I haven't tested it directly on binary search, but I've done some synthetic tests of the same approach. The answer seems to be: the results can be really good!
There are several issues, though.
First, your gains will usually be in increased throughput rather than reduced latency. And it is a tradeoff that requires greater consumption of memory bandwidth, since you're loading data you don't need. If you are running on a single core, this is usually a good tradeoff. If you are already parallel on all cores, it may not be.
The second is that you need to be able to "batch" your requests. That is, you need to be prefetching for the next request while you are handling the next level for the current request. The shorter the final processing time once you have the answer, the larger your batches need to be.
The last is that each CPU has a limit on the number of outstanding requests for cachelines from RAM: usually 10. This limits the size of your batches in accordance with Little's Law. The question then becomes whether that batch size is possible, and how much benefit it provides.
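(To put rough, purely illustrative numbers on that: Little's Law says throughput ≈ outstanding requests / latency, so with ~10 outstanding line fills and ~100 ns of main-memory latency a single core tops out around 10 / 100 ns ≈ 100 million cache-line fetches per second, however many searches you interleave.)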
But there is a sweet spot where you can get large gains (2-3x single processor throughput) if you have enough requests available, if you can tolerate a slightly slower response time, and if the processing required once you have a result is significant (comparable to memory latency).
Oh, and you also need to be willing to write the critical sections in assembly. Even if you figure out the combination of fragile syntax and compiler options that produces the branch-free logic you need, convincing an optimizing compiler to issue your apparently unnecessary prefetches at exactly the point in time you need them is next to impossible.
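For what it's worth, here is roughly what the batched version can look like in plain C with __builtin_prefetch rather than assembly; it's only a sketch over an Eytzinger (implicit-heap) layout, and, as the parent says, the compiler may not schedule the prefetches exactly where you want them.

    #include <stdint.h>
    #include <stddef.h>

    #define G 8   /* batch size, bounded by the ~10 outstanding misses per core */

    /* tree[1..n]: sorted data in Eytzinger (level-order heap) layout.
     * For each of the G needles, writes the 1-based Eytzinger index of the
     * first element >= needle (0 if there is none).  All G searches advance
     * one tree level per pass, each prefetching the node it touches next. */
    void batched_lower_bound(const int32_t *tree, size_t n,
                             const int32_t *needles, size_t *out)
    {
        size_t pos[G];
        for (int q = 0; q < G; q++) pos[q] = 1;          /* start at the root */

        int levels = 0;                                   /* floor(log2 n) + 1 */
        for (size_t m = n; m; m >>= 1) levels++;

        for (int l = 0; l < levels; l++) {
            for (int q = 0; q < G; q++) {
                size_t p = pos[q];
                if (p <= n) {
                    p = 2 * p + (tree[p] < needles[q]);   /* descend one level */
                    if (p <= n)
                        __builtin_prefetch(&tree[p], 0, 1);
                    pos[q] = p;
                }
            }
        }
        /* standard Eytzinger decode: strip the trailing "went right" steps */
        for (int q = 0; q < G; q++)
            out[q] = pos[q] >> (__builtin_ctzll(~(unsigned long long)pos[q]) + 1);
    }

The point of the two nested loops is exactly the batching described above: while search q's next node is still in flight, the CPU has the other G-1 searches' current nodes to chew on.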
This paper isn't quite on this exact topic, but leans heavily in the same direction:
http://supertech.csail.mit.edu/cacheObliviousBTree.html
More generally: in the last 10-15 years there has been a lot of work on developing algorithms that are compatible with caching, but which do not need to be tuned to the details of a particular cache hierarchy on which they run. For more info, see
Demaine, Erik D. "Cache-Oblivious Algorithms and Data Structures." http://erikdemaine.org/papers/BRICS2002/paper.pdf
Erik Demaine's 6.046 lectures on this topic are available on YouTube: https://www.youtube.com/watch?v=cJOHERGcGm4 and https://www.youtube.com/watch?v=zjUDy6a5vx4
One cache-oblivious dynamic tree structure is also due to Demaine: http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/