B-Heap vs. Binary Heap (2010) (acm.org)
238 points by navinsylvester on March 18, 2018 | 36 comments



A great reminder that spatial locality can be critical to performance. But that's actually quite well known, and maybe a more important lesson would be: don't take performance for granted, even in the standard libraries of well-established languages or in textbook algorithms; there may be plenty of room for improvement for your use case (if you need it).

That said, the article feels quite arrogant. As other commenters say, the idea behind the optimization is nothing new, just some basic concepts well applied. That is a job well done, but it probably doesn't justify the tone (that I, at least, perceive) of the article.


Agreed, an 'I was able to optimize this' tone would be better suited than the current 'I am the only one who does this right'.

The author is misguided if he thinks that an algorithm being called 'optimal' means it will give the best result in his particular use case. It usually means that the algorithm has optimal asymptotic runtime. In particular, asymptotic analysis completely ignores caching issues.

That said, the technical details are still worth reading if you have the time. I also agree with the author that CS programs are somewhat losing touch with reality.


I read it as the opposite: he's calling out the educated programmers' arrogance for thinking that they have already figured things out because they know which algorithms are theoretically optimal, and that they don't need to bother looking at the actual use-case.

And he's not putting himself above falling into that trap either:

> So why are you, and I, still doing it wrong?

Also, this is an article that was published in an ACM magazine, under the header of "The Bike Shed" - which I would assume is a "provocative debate"-type of column. Keep both that target audience and intent behind the article in mind when evaluating the overall tone of it.


But the article is named "you're doing it wrong", not "we're doing it wrong". I know enough people who would not fall in this trap. Maybe I misinterpreted the tone, but I think it has a slight air of arrogance -- but apart from that it's a fine article.


"You're doing it wrong" was a pretty common meme in 2011, I wouldn't take too much from it being the title of the piece other than the author was trying to be playful.


Ah, true, I didn't pick up on that. Maybe I did take the tone too seriously.


Well, historical context is easy to lose, especially with internet memes. I recently came across an older article[0] with this sentence in the middle:

> (The Imgur previously featured in this story doesn’t credit him at all, which is some bogus, Fat Jew-style shit.)

Being unaware that "Fat Jew" was the handle of someone who got internet famous but was caught stealing other people's jokes, I was really puzzled for a second what could have warranted that random anti-Semitic statement.

[0] https://news.avclub.com/ice-t-has-some-startling-information...

[1] https://www.reddit.com/r/wholesomebpt/comments/841k3c/ice_t_...


I find it interesting that he is really concerned with page-faults as opposed to cache misses. Most of the recent stuff I've read on performance focuses much more on the cache.

Usually, I've seen using swap space met with derision. I'd only expect page faults to really happen when using a memory-mapped file. Again on the note of caches, I'd expect more fanfare to be made about the TLB when accessing a different page than about actual page faults.


He’s concerned with page faults because his workload involves disk access. The admonitions about cache locality matter more when your workload fits in RAM.

Also, he’s not using swap as such, he’s using memory-mapped IO. If you have to hit the disk, you have to hit the disk, might as well try to do it in an OS-aware fashion.
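
For what it's worth, a minimal sketch of what "hit the disk in an OS-aware fashion" can look like with memory-mapped IO (POSIX mmap/madvise; the file name and access pattern here are invented for illustration, this is not Varnish's code):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("cache.bin", O_RDONLY);   /* hypothetical data file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; the kernel pages it in on demand. */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Tell the VM system how we intend to access the mapping,
           so it can adjust its read-ahead behaviour. */
        madvise(p, st.st_size, MADV_RANDOM);

        /* The first touch of a page may cost a page fault plus a disk
           read; later touches are plain memory accesses. */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += p[i];
        printf("checksum-ish: %lu\n", sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }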


The internal caches of the CPU are orders of magnitude faster than handling a page fault (which means: CPU traps, switch to kernel, VM lookup, issue I/O request, block for completion, switch to user).

TFA's context is a larger-than-memory data structure mapped from disk, where heap block size = page size.


> This said, the article feels quite arrogant.

Agreed. Its tone is very off-putting. From the article:

> Not just wrong as in not perfect, but wrong as in wasting half, or more, of your performance.

I wish articles would move away from this use of 'wrong'. Neither of those definitions is what I understand when I read 'wrong'. If you mean sub-optimal, then use that word instead. Feels like much of the rest of the article is built around this misused terminology.


It's pretty much the same reason why databases use B-trees and not binary trees --- when you have to consider the nonuniform latency of storage access, traditional algorithmic analysis that assumes every memory address is accessible in the same time as any other becomes extremely misleading.


Every memory access can be bounded by some large enough constant (just assume that there's no cache in your CPU), so traditional analysis is still fine; that people don't like how slow memory access then turns out to be is another matter. O(N^2) might be faster than O(N log N) if the constant factor of the latter algorithm is bad enough.
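
To put (made-up) numbers on that last point: suppose the O(N log N) algorithm really costs 100*N*log2(N) operations and the O(N^2) one costs N*N. A quick check of where they cross over:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Illustrative constants only: 100*N*log2(N) vs. N*N. */
        for (double n = 2; n <= (1 << 20); n *= 2) {
            double nlogn = 100.0 * n * log2(n);
            double nsq   = n * n;
            printf("N=%8.0f  n-log-n cost=%14.0f  n^2 cost=%14.0f  -> %s\n",
                   n, nlogn, nsq,
                   nlogn < nsq ? "N log N cheaper" : "N^2 cheaper");
        }
        return 0;
    }

With those (invented) constants the quadratic algorithm wins for every N below roughly a thousand.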


As I like to put it: "Random Access Memory isn't random access any more."


I agree with his rant about "CS departments' algorithmic analysis" failing to cover these real-world hardware issues. That was certainly true of my ~1997 undergrad CS course at the University of St Andrews, which was otherwise exemplary.

The only thing I don't understand is why his data structure for finding the least-recently-used item wasn't just a linked list, where each time you touch an object you move it to one end of the list, and when you want to discard an object you remove it from the other end. That'd be o(1) and would cause at most one page fault per update, and none for the add and remove cases (since the two ends would surely be paged in). And it'd be a lot easier than his fancy thing. I guess I missed something.
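
(For concreteness, here is a minimal sketch of that list-based scheme, with invented names; as pointed out downthread, it solves a slightly different problem than the one in the article. New objects go in via lru_insert_head, and lru_touch assumes the object is already on the list.)

    struct obj {
        struct obj *lru_prev, *lru_next;    /* LRU list linkage */
        /* ... payload ... */
    };

    static struct obj *lru_head;   /* most recently used */
    static struct obj *lru_tail;   /* least recently used */

    static void lru_unlink(struct obj *o)
    {
        if (o->lru_prev) o->lru_prev->lru_next = o->lru_next;
        else             lru_head = o->lru_next;
        if (o->lru_next) o->lru_next->lru_prev = o->lru_prev;
        else             lru_tail = o->lru_prev;
    }

    static void lru_insert_head(struct obj *o)
    {
        o->lru_prev = NULL;
        o->lru_next = lru_head;
        if (lru_head) lru_head->lru_prev = o;
        lru_head = o;
        if (lru_tail == NULL) lru_tail = o;
    }

    /* On every access to an object already on the list: move it to the
       head.  Constant time, touches only a handful of nodes (and hence
       pages). */
    static void lru_touch(struct obj *o)
    {
        lru_unlink(o);
        lru_insert_head(o);
    }

    /* When space is needed: evict from the tail. */
    static struct obj *lru_evict(void)
    {
        struct obj *o = lru_tail;
        if (o != NULL) {
            lru_unlink(o);
            o->lru_prev = o->lru_next = NULL;
        }
        return o;
    }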

It might be worth pointing out that NVMe or Intel Optane would solve his problem (1) without his fancy data structure too. Some might go as far as to say that while his fancy data structure is outdated now, the stuff that the CS departments teach became correct again. Maybe St Andrews had the right idea all along.

1) Or huge-pages or NUMA node pinning, to some extent.


> I agree with his rant about "CS departments' algorithmic analysis" failing to cover these real-world hardware issues. That was certainly true of my ~1997 undergrad CS course at the University of St Andrews, which was otherwise exemplary.

I counter with similarly anecdotal evidence from 2012. :)

While we had a normal "Algorithms and Datastructures" course that taught the classical theory, there was also a separate course about "Algorithm Engineering" which was (among other things) about performance in the presence of the memory hierarchy. The upshot is that there are different cost models that you have to use in the analysis of your algorithms (e.g., IO complexity), but the techniques that you use for the analysis are the same.

---

To be honest, I think the system is working as intended here. Undergraduate courses are not really designed to push people to the forefront of current research. They're intended to give you the basic skills you need to catch up to modern developments yourself.

I think what's happening here is that people remember "optimality proofs" from their algorithms courses and take that to mean that there is nothing more to be done (that's the sentiment I got from this blog post anyway). The problem with this is that in this context "optimality" referred to a technical definition - optimal with respect to some specific simplified model of computation. The result is probably not optimal with respect to a more complicated model!


> The only thing I don't understand is why his data structure for finding the least-recently-used item wasn't just a linked list

Because he is not talking about expiring the least-recently-used item, but about expiring the items whose "cache time" has run out. That is, when you add a new item to the cache, you add it with some lifetime (e.g. an integer). Then you have some clock (another integer), and what Varnish wants to do is expire all items whose expiry time is lower than the current clock.

With a (doubly-)linked list you would have to traverse the list on insertion to place the item in the correct spot...
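
For the same reason the natural fit is a priority queue keyed on expiry time, which is where the article's binary heap / B-heap comes in. A bare-bones sketch of that pattern (invented names, fixed capacity, no error handling; the real Varnish code is considerably more involved):

    #include <stddef.h>

    #define MAXITEMS 1024

    static double heap[MAXITEMS + 1];   /* expiry times, 1-based: heap[1..n] */
    static size_t n;

    static void swap_nodes(size_t a, size_t b)
    {
        double t = heap[a]; heap[a] = heap[b]; heap[b] = t;
    }

    static void sift_up(size_t i)
    {
        while (i > 1 && heap[i] < heap[i / 2]) {    /* parent of i is i/2 */
            swap_nodes(i, i / 2);
            i /= 2;
        }
    }

    static void sift_down(size_t i)
    {
        for (;;) {
            size_t l = 2 * i, r = 2 * i + 1, m = i; /* children: 2i, 2i+1 */
            if (l <= n && heap[l] < heap[m]) m = l;
            if (r <= n && heap[r] < heap[m]) m = r;
            if (m == i) return;
            swap_nodes(i, m);
            i = m;
        }
    }

    static void timer_insert(double expiry)     /* assumes n < MAXITEMS */
    {
        heap[++n] = expiry;
        sift_up(n);
    }

    /* Expire everything whose deadline has passed. */
    static void timer_expire(double now)
    {
        while (n > 0 && heap[1] < now) {
            /* ... evict the cache object that heap[1] belongs to ... */
            heap[1] = heap[n--];
            sift_down(1);
        }
    }

The i/2, 2i, 2i+1 index arithmetic is what the article's B-heap then rearranges so that those parent/child comparisons mostly stay within one VM page.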


That explains it. Thanks.


I wholeheartedly agree that many—if not most—algorithms classes don’t sufficiently talk about the limitations of the models used and how to handle this. I’m a big fan of Algorithm Engineering, which is best imagined as a cycle of design, analysis, implementation and experiments, with each stage influencing the next. This has led to some exceptional results (e.g. the route planning algorithms used by Google/Bing/Apple Maps, which are around 5 orders of magnitude faster than Dijkstra’s algorithm by using clever pre-processing that exploits the structure of road networks. Their asymptotic query complexity on general graphs is still the same as Dijkstra’s, iirc)

Nitpick, but your list-based LRU cache’s operations wouldn’t be in o(1) but rather O(1) – little oh means strictly slower growth, in this case sub-constant. Probably just a typo though. And it could be further optimised for locality by using a cyclic buffer instead of a linked list (-> no pointer chasing + sequential access), but see kilburn’s answer (the structure in the paper solves a different problem)


Do you have an article you would recommend on the route planning optimizations? That sounds interesting.


There’s a great recent-ish survey paper at https://arxiv.org/pdf/1504.05140.pdf which will give you pointers to papers in all directions :)


> I agree with his rant about "CS departments' algorithmic analysis" failing to cover these real-world hardware issues.

I disagree. A CS algorithms class is about the logical/mathematical understanding of algorithms. If you want to understand hardware implications, you are generally taught that in a computer architecture class. Blaming the algorithms class for a lack of consideration for real-world hardware issues is like blaming a computability class because the Turing machine doesn't factor in hardware issues. You abstract those out to make general logical observations.


I disagree. It should at least cover what aspects should be considered when designing an algorithm for use in the real world. You dismiss the entire field of experimental algorithmics as if it were nothing more than a bunch of algorithmicists who know some computer architecture things, but there’s a lot more to it.

I’m not arguing against teaching classical algorithm design and analysis, they’re both incredibly important imho. But we should stop ignoring the shortcomings of the models used, hand-waving them away with “the rest is just a bit of engineering”. It’s not.


There are theoretical models of random-access machines with fixed-size caches, on which data structures like B-trees and others can be (and are) analysed for asymptotic performance. It's usually not taught until late undergrad or graduate level though.

(You may have heard the terms "cache aware" and "cache oblivious" used to describe algorithms. These terms come out of theoretical CS, describing algorithms with good performance on these "RAM with cache" machines when the cache size is known or unknown ahead of time, respectively.)


A similar but more sophisticated use of this idea was Intel Research's seminal paper on the FAST tree index, also from 2010.

http://www.webislands.net/pubs/FAST__SIGMOD10.pdf



If anyone wants to read more, there are (pre-dating this article) papers on cache-friendly/oblivious priority queues, e.g.

http://erikdemaine.org/papers/BufferTrees_STOC2002/paper.pdf

The rough gist is that if the only thing you need to be able to do is insert and pop-min, then you can buffer up roughly a page's worth of incomplete work for each page, getting amortized performance that tracks the optimal number of cache transfers at each level of the cache.

Further, in the paper above you are introduced (in the related work) to the citation:

"A. LaMarca and R. E. Ladner. The influence of caches on the performance of heaps. Journal of Experimental Algorithmics, 1, 1996."

which describes the author's "B-Heap" (at least, it describes a b-ary heap where you pick b to match your cache line/page size).
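
(For the curious, the index arithmetic for such a b-ary heap is only a small tweak on the binary case. A sketch, with invented names and b = 8 chosen so that, with 8-byte keys and 64-byte cache lines, all of a node's children fit in one line; a real implementation typically also pads/offsets the array so sibling groups actually align to line boundaries.)

    #define B 8     /* fan-out, matched to cache line / page size */

    /* 0-based array layout: children of i are B*i+1 .. B*i+B. */
    static size_t heap_parent(size_t i)      { return (i - 1) / B; }
    static size_t heap_first_child(size_t i) { return B * i + 1;   }
    static size_t heap_last_child(size_t i)  { return B * i + B;   }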

I'm guessing the problem isn't that academics don't understand these things so much as that practicing programmers don't always read very much.


> ..., not to mention wasting enormous amounts of hardware and electricity.

I wonder how legitimate this concern is from an environmental angle. On one hand, a lot of code in the world probably has much worse performance ratios than the one described in the article -- using Python alone is enough to cause an order-of-magnitude performance loss. On the other hand, data centers are using more renewable energy sources and reducing their carbon output.

Is this only a wasted-money problem? How worthwhile is it to write "green code"? Back-of-the-envelope calculations of how much developer comfort costs would be interesting.


I had a quick question about this part: """ Two details are worth noting:

* Once we leave a VM page through the bottom, it is important for performance that both child nodes live in the same VM page, because we are going to compare them both with their parent.

* Because of this, the tree fails to expand for one generation every time it enters a new VM page in order to use the first two elements in the page productively. """

Is the point just that you want to pack 8 elements into each page, but if the tree expands the first generation of a page, then you end up only being able to pack in 6?


Binary Heap was always my favorite data structure because of how it would improve my A* pathfinding performance. Guess I should have used B-Heap.


Despite the arrogance, I would be more interested if there were some more mathematically oriented explanation and/or a self-contained code sample, like for example the TinyQueue JavaScript implementation (any language is welcome, though).


Do I understand this correctly: to optimise for virtual memory, do the things you'd do to optimise for the CPU cache except on a larger scale?


Yes. Virtual memory is backed by a cache (the page cache), so optimizing algorithms to be "more cache friendly" works there too.


There is a joke in there about economics but I can't think of it right now.


Care to explain further? I am genuinely interested in the reference.



