Since this is the third recent item on HN featuring the impact of branch prediction on code performance, let's make sure we keep our heads on straight: a good algorithm will outperform a bad algorithm that's tuned for branch prediction.
Quicksort infamously abuses branch prediction, yet is still the standard against which all other sorting algorithms are compared. Some work is being done to improve Quicksort's branch prediction performance [1], but that's mostly focusing on choosing better pivots.
A perfect sorting algorithm, something like a very large sorting network, will necessarily make any processor's branch predictor give up and go home -- and that'll still be faster to execute than a naive algorithm tuned for branch prediction.
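To make the sorting-network point concrete, here's a minimal Python sketch (illustrative only - a real implementation would be branch-free machine code). Each compare-exchange executes the same instructions regardless of the data, and in C/C++ the min/max pair typically compiles to conditional moves, so the predictor has nothing to predict:

    def cswap(a, i, j):
        # Compare-exchange: data-independent control flow.
        a[i], a[j] = min(a[i], a[j]), max(a[i], a[j])

    def sort4(a):
        # The optimal 5-comparator sorting network for 4 elements.
        cswap(a, 0, 1); cswap(a, 2, 3)
        cswap(a, 0, 2); cswap(a, 1, 3)
        cswap(a, 1, 2)
        return a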
On the other hand, quicksort scans memory in-order. Random access to memory is much, much worse than bad branch prediction. Using a heap in merge sort, for example, screws with both branch prediction and memory access.
For sorting, if you're minimizing the number of comparisons, the result of each compare should be random and independent of all others. So the only way to have "good" branch prediction when sorting an array of random numbers is to do lots of extra, superfluous compares, e.g. using an O(n^2) algorithm instead of an O(n log n) one.
The base case for std::sort in GNU libstdc++ is insertion sort. It's used for small subranges (fewer than 16 elements). Nice and linear memory access patterns.
Agreed. The post wasn't meant to suggest anybody should go and optimize! Just that simple looking code can be tricky/interesting.
Also, even artificially skewing the pivot in quicksort as suggested in the paper you reference doesn't work except on /very/ long-pipeline machines, because it worsens cache performance and increases instruction count. And even then, it's expensive to randomly sample to generate a skewed pivot (a cost they neglect in that paper, assuming it is free).
Couple years ago I had a fairly simple program which collected some data into a linked list and displayed it at the end. It took about 2 seconds to load and process the data from disk. I realized it might be neat to optionally sort the output. Since I didn't feel like doing serious work, I just added a bubble sort to my linked list which took a comparison function and then bubble-sorted by pointer swapping.
Guess what time it added? Less than a tenth of a second. Completely negligible. It saved me a few hours (writing bubble sort for a linked list takes maybe 5 minutes and has no chance of complicated bugs), and I doubt anyone will ever notice any slowdown.
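For the curious, here's a sketch of what that looks like - bubble sort on a singly linked list, swapping by relinking nodes rather than copying data (this is my own reconstruction, not the original code):

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def bubble_sort(head, comp):
        # Sweep the list repeatedly, relinking adjacent out-of-order
        # nodes, until a sweep makes no swaps.
        swapped = True
        while swapped:
            swapped = False
            prev, cur = None, head
            while cur is not None and cur.next is not None:
                nxt = cur.next
                if comp(cur.value, nxt.value) > 0:
                    # Pointer swap: prev -> nxt -> cur -> rest
                    cur.next = nxt.next
                    nxt.next = cur
                    if prev is None:
                        head = nxt
                    else:
                        prev.next = nxt
                    prev = nxt
                    swapped = True
                else:
                    prev, cur = cur, cur.next
        return head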
So what's the lesson? Little performance tweaks hardly matter in the average program nowadays because everything's disk or network bound (IO bound). For the average program it's not worth writing more complex code that will be quicker (once you fix the hard-to-see bugs) until you actually know the slowness is an issue.
Well sure, but who writes their own sorting algorithms these days anyways? It's important to understand the characteristics of various sorting algorithms so you can choose the best one, but that "best one" is almost always going to be implemented by some standard library.
It makes sense to use the optimal version out of the gate in these cases.
> It's important to understand the characteristics of various sorting algorithms so you can choose the best one
I disagree - most languages offer a sort function which is nearly optimal in the average case. You very rarely choose unless you are doing something where sort performance is extremely important.
However, understanding various sort algorithms and their intricacies is a great way to demonstrate fundamental algorithm design as well as some data structures. This is why it's important - so when you do write code, you come with a huge amount of insight into how to go about it.
> Well sure, but who writes their own sorting algorithms these days anyways?
This was (is still?) in the gnu flex program source:
/* We sort the states in sns so we
* can compare it to oldsns quickly.
* We use bubble because there probably
* aren't very many states.
*/
bubble (sns, numstates);
Whoever wrote this should know better. Their mistake of not using the language's built-in sorting functionality would at least be forgivable if they had used insertion sort, but maybe we should blame academic institutions for exposing programmers to bubble sort in the first place. Want to show newbie programmers the difference between an O(N^2) and an O(N*logN) algorithm? Fine. But insertion sort is O(N^2) too, there is no practical case where bubble sort out-performs insertion sort, and you certainly wouldn't use bubble sort to speed up quicksort.
Understanding of the characteristics comes from understanding the implementation. I cannot explain why it takes longer to insert in the middle of a vector vs. a list without explaining at least a rudimentary cartoon of the implementation.
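The cartoon version, roughly (hand-rolled Python just for illustration):

    # Insert x at position i of an array-backed vector: everything
    # after i has to shift one slot right - O(n) moves.
    def vector_insert(v, i, x):
        v.append(None)
        for j in range(len(v) - 1, i, -1):
            v[j] = v[j - 1]
        v[i] = x

    # Insert after a known node of a linked list: two pointer
    # writes, O(1) regardless of list length.
    def list_insert_after(node, new_node):
        new_node.next = node.next
        node.next = new_node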
Insertion sort is significantly faster (especially in branch prediction; they have similar cache locality, but insertion sort has fewer comparisons, fewer moves, and fewer overall instructions), is almost the same to implement, and is actually the fastest possible way to sort arrays of size 4 (and probably remains fastest up to about 30). So it's actually useful to know insertion sort, whereas bubble sort is actually useless.
I don't know about insertion sort - you need to do a nested loop through the already-sorted part of the array to find where it should go (which requires knowing how much is already sorted), and move multiple values around when you insert it (easy in a linked list, harder in an array (another nested loop!)).
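Though writing it out, it's less code than I made it sound - the "find where it goes" and "shift things over" loops collapse into one:

    def insertion_sort(a):
        for i in range(1, len(a)):
            x = a[i]
            j = i - 1
            # Walk left through the sorted prefix, shifting larger
            # elements right until x's slot opens up.
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a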
Several years ago, I had an exam question asking what sort algorithm one should use in different situations, one of which was "automatically sorting a bibliography list when printing a research paper". I answered that while bubble sort is normally terrible, it's a great choice here if you're hand-rolling a sort instead of using a library. The data set couldn't possibly be large enough for the sort run-time to matter, particularly in comparison to printing time, and it's the simplest to correctly program.
There are circumstances in which bubble sort really is that bad, but on small data sets that are already disk/network/peripheral IO bound, it's just fine.
Without telling us the size of the data set, that anecdote is meaningless. Other algorithms are _asymptotically_ better than bubble sort, i.e., once you get past a certain amount of data, other algorithms perform significantly better. So this choice matters when the input can be arbitrarily large.
And it's false to say everything's disk/network bound these days. There will always be important data structures and other code that needs to scale, where the right algorithm makes the difference.
I'm not sure I buy this as a lesson. Choosing between two asymptotically (very) different algorithms (probably linear vs quadratic in your case) isn't really a performance tweak in my book, even though as you point out it wasn't dominant in your case.
Even the world's slowest disk will be faster than a CPU executing a sufficiently silly algorithm.
Why not use insertion sort - as an on-line algorithm, it can be used to sort the list as it's built.
The implementation is straight-forward: instead of appending new elements to the end of the list, insert them in the correct place.
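A sketch of that, borrowing the Node class from the bubble-sort example above - each new element is spliced into its sorted position as it arrives:

    def insert_sorted(head, node, comp):
        # New head if the node sorts before everything so far.
        if head is None or comp(node.value, head.value) < 0:
            node.next = head
            return node
        cur = head
        while cur.next is not None and comp(cur.next.value, node.value) <= 0:
            cur = cur.next
        node.next = cur.next
        cur.next = node
        return head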
My personal go-to algorithm for linked lists is an unstable merge-sort variant with explicit stack[1] - while not as simple as insertion sort, performance is far superior.
I agree. Sometimes I just need to sort a very small data set, like with 4-10 items. When built-in functions aren't available, the performance loss of using bubblesort is negligible, and the time I save not having to worry about implementation is not insignificant.
Something similar happened to me, with a linked list as well (on a linked list, bubble sort is much easier to implement than other common sorting methods). Everything worked fine, until data volume increased by 2 orders of magnitude.
My theory about bubble sort is that a lot of people remember it just because the name is so damn cute. You can't say bubble without smiling. So I say we should rename quicksort to something a bit nicer, something first-year comp sci students can remember instead of bubble sort. And, for good measure, bubble sort should be renamed "terrible sort" or something like that.
Perhaps "crap sort" is appropriate, in that the performance of the algorithm itself is crap, and the name also evokes similar intuitive parallels to "bubble sort", since (as we know) crap floats to the top just as bubbles do.
I'd imagine it's the very first standard algorithm most people learned as part of their CS education as well. It's easy to implement, easy to understand, and lays the basis for the more complex sorters coming later...
I've asked this as an interview question in the past. It's a great question, as it has so many layers. Very few folks ever grokked the role the branch predictor plays, but those who did have always gone on to do well.
Who actually uses bubble sort except as a class assignment? The whole advantage to the algorithm is it's easy to understand and implement. You learn that one first and then you move on to more practical stuff.
In commercial programming it's pretty rare to run across a situation where writing your own sorting routine makes sense.
Bubble sort is actually extremely useful when temporal locality matters more than absolute sort order.
Example: Games often bubble sort world objects by depth. There is a small performance gain to be had by drawing closer objects first, so that you can depth-cull expensive pixels behind them. Game engines can run several iterations of bubble sort, but stop before sorting the set completely, since the z-buffer will ensure correctness. Over the span of several frames, the incremental bubble sort will achieve a total sort, but the incremental approach bounds cost more tightly. Since the relative depths of objects only change when you look around, it's usually a pretty good approximation of "perfect" behavior.
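A sketch of that incremental pattern (the object/field names here are made up, not from any particular engine):

    def depth_sort_pass(objects, max_sweeps=1):
        # Run a bounded number of bubble-sort sweeps per frame,
        # front-to-back by depth. The list drifts toward sorted over
        # several frames; the z-buffer keeps rendering correct in
        # the meantime.
        for _ in range(max_sweeps):
            swapped = False
            for i in range(len(objects) - 1):
                if objects[i].depth > objects[i + 1].depth:
                    objects[i], objects[i + 1] = objects[i + 1], objects[i]
                    swapped = True
            if not swapped:
                break  # fully sorted, nothing left to do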
Is this very common? Because that would certainly explain the behavior I see in games like WoW where turning my camera drops the framerate, but once I stop turning the framerate jumps back up to normal. I always assumed this had something to do with loading textures or models or geometry or something, but I was never really satisfied with that.
Your assumption was right. The choice of sort is purely an optimization to allow for fast depth-culling, and shouldn't result in noticeable graphical glitches. However, pipelining textures and models based on viewing angles is common. This is certainly what you are seeing.
I wrote Bubble Sort into a very well known game a few years ago, for the case of sorting tiny numbers of view objects. It simply outperformed other sorts. Other sorts were employed for other cases.
This example is neat. Generalizing it a bit... it seems that if you have to visit data frequently in other processing anyway, and you need it ordered later (or ordering helps performance), throwing in a swap while doing those visits may help optimize later steps.
This of course assumes that the swap won't mess up the processing algorithm.
It also is an optimization, so best left for later stages of development.
Anyway, thanks for showing this example - it gives me another trick to try for a couple projects I'm working on. :)
I actually use bubble sort early in development when the items to be sorted are complex. Its inefficiency means that most items will be compared against most other items, which helps catch any problems in the complexity of the items and their ordering.
Far more efficient sorting algorithms do not behave very well if the comparison functions give wrong answers and this can be especially difficult to detect.
The bubble sort then helps with the test suite - you can compare the results from the brute force bubble sort against the optimised sort that your framework/language provides.
If you're worried about an incorrect comparison function, you could implement it two different ways.
You could also directly verify (probabilistically) the properties of a total ordering [1]. If your comparison function is "comp" and returns -1, 0, or 1, the tests might go like this:
    trials = 1000
    while trials > 0:
        a = rand_element()
        b = rand_element()
        c = rand_element()
        aa = comp(a, a)
        ab = comp(a, b)
        ac = comp(a, c)
        ba = comp(b, a)
        bb = comp(b, b)
        bc = comp(b, c)
        ca = comp(c, a)
        cb = comp(c, b)
        cc = comp(c, c)
        # reflexive
        assert(aa == 0)
        assert(bb == 0)
        assert(cc == 0)
        # antisymmetric
        assert(ab == -ba)
        assert(ac == -ca)
        assert(bc == -cb)
        # transitive
        assert((ab != bc) or (ab == ac))
        assert((ac != cb) or (ac == ab))
        assert((ba != ac) or (ba == bc))
        assert((bc != ca) or (bc == ba))
        assert((ca != ab) or (ca == cb))
        assert((cb != ba) or (cb == ca))
        assert((ab != 0) or (ac == bc))
        assert((ac != 0) or (ab == cb))
        assert((ba != 0) or (bc == ac))
        assert((bc != 0) or (ba == ca))
        assert((ca != 0) or (cb == ab))
        assert((cb != 0) or (ca == ba))
        trials -= 1
Usually the data is a combination of externally received information and internal calculations/annotations. I just use copious asserts, and one line:
assert bubble_sort(data) == sort(data)
Since this always runs during development it will automatically pick up internal and external changes.
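For completeness, a sketch of what the bubble_sort in that assert might look like (the real one takes whatever comparison the items use; this version just relies on their ordering operators):

    def bubble_sort(data):
        # Brute-force reference: compares most pairs of items, so a
        # buggy ordering tends to trip an assert here even when an
        # optimised sort would sail past it.
        a = list(data)
        for end in range(len(a) - 1, 0, -1):
            for i in range(end):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a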
What you laid out is explicitly testing everything against everything else and is 100% comprehensive. Bubble sort is less than 100% comprehensive but still gives good coverage, while optimised sorts do so few comparisons that a buggy comparison function can slip through on luck alone.
I have used bubble sort in a commercial program before: there are plenty of embedded microprocessors with a minimal libc that does not include qsort. For sorting a small list (5-20 items) bubble sort is a fine choice because it is dead simple to implement.
I deny that. It may be easy to understand the idea; the code is no more enlightening than any other encoded algorithm. A couple of loops, comparing pairs of objects, triple assignment with a temp - what's that all about? You may be able to explain it / teach it more easily because it's smaller, but the code is not transparent.
I'd suggest a recursive algorithm would code up far more transparently. E.g. Sort first half of the list; sort 2nd half of the list; merge the two half-lists. That's obvious at the top level; the details of merge are all that's left to explain. It also demonstrates dissecting a problem into smaller problems, so is a double lesson.
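That top-level description translates almost word for word (a minimal sketch):

    def merge_sort(xs):
        # Sort first half; sort second half; merge the two halves.
        if len(xs) <= 1:
            return xs
        mid = len(xs) // 2
        return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

    def merge(left, right):
        # Repeatedly take the smaller head of the two sorted lists.
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        out.extend(left[i:])
        out.extend(right[j:])
        return out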
1) Any comparison-based sorting algorithm that isn't branchless will mis-predict often.
2) The article completely ignores the possibility of conditional-move instructions, which would cut the mis-predicted branches in this case to zero (or one if we add the "are we done?" check and always predict "no").
I'm not sure you're quite right about 1. Depending on what you mean exactly. For example, insertion sort is O(n^2) but mispredicts O(n) times. That doesn't seem often to me.
There are (artificial) mergesorts that execute O(n log n) branches but also mispredict O(n) times.
See for example
"branch mispredictions don't affect mergesort:"
http://www.cphstl.dk/Paper/Quicksort/sea12.pdf
In the case of 2. I agree, although compiler support still isn't great -- I think. For almost all cases (integers, floats, strings) radix sorting would be even better -- fewer instructions, no comparison branches.
Hmm, you're right. My thinking was, "if we know enough about the structure of the problem that we can make sure we're guessing right much more than half the time, we can probably turn that into a better O()", which insertion sort as an example doesn't really violate per se - it is possible to trade off that knowledge for better asymptotic complexity, but clearly the claim wasn't quite right as broadly as I stated it.
Is there an analysis of some other sorting algorithms and how they behave with branch prediction? I did some Google searching and found one paper on merge sort and branch prediction... I'm specifically thinking about mergesort, quicksort, heapsort, and introsort.
The paper linked to, and the tech report it's based on, provide exactly that - an experimental analysis of sorting and branch predictors. Both are available at http://paulbiggar.com/research/#sorting.
Well, all the efficient comparison-based sorting algorithms (i.e. O(n log n) time) will cause Omega(n log n) mispredictions. So in a sense they're all the same*
At the same time, quicksort has the very nice property that if it accidentally picks a somewhat unbalanced pivot it naturally biases its branches making them easier to predict. This essentially cheapens the comparison branches it executes. I don't really know if that helps in practice though.
*Unless you jump through hoops/change the model of computation, one such way being conditional move instructions. There are more impractical ways too.
One good way is to use radix sort (it's generally faster for other reasons) and it never compares keys.
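For example, an LSD radix sort on non-negative integers - no key comparisons anywhere, hence no comparison branches to mispredict (a minimal sketch):

    def radix_sort(xs, key_bytes=4):
        # One stable bucketing pass per byte, least significant
        # byte first.
        for shift in range(0, key_bytes * 8, 8):
            buckets = [[] for _ in range(256)]
            for x in xs:
                buckets[(x >> shift) & 0xFF].append(x)
            xs = [x for b in buckets for x in b]
        return xs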
David Musser's (author of Introsort) paper "A Portable Cache Profiler Based on Source-Level Instrumentation"[1] looks at the cache behavior for Intro, Merge, and Heapsort in its examples section. Not directly related, but possibly interesting nonetheless.
I'm confused about one part of your analysis. You write that Q^k_l = Q^(k-1)_(l+1) * (1 - 1/(l+1)), but it's not obvious to me why the multiplier, (1 - 1/(l+1)), is correct in general (past the second sweep). I went so far as to write a little bit of code to calculate various values of Q^3_l, exactly, from values of Q^2_(l+1), for an 8-element array, and found that it did not hold. (I assume that you were assuming that the initial array permutations were picked uniformly from the set of all permutations). Have I done something wrong or missed something totally obvious?
I think it's right. One way to convince yourself is just to add counters to the implementation of bubble sort, run it for a larger input, and then verify the probabilities (including the conditional probabilities) match what I claim.
You're correct that every permutation of {1, ..., n} is assumed equally likely as an input.
This way of thinking about it might help:
All that "bubble_sweep" does to the input array is to "right-shift" the left-to-right maxima, and "left-shift" the elements between them. The left-shifted elements aren't re-ordered, so they still have the usual probability of being a left-to-right-maxima.
In a bit more detail:
Let L be the sequence of indices, from left to right, of left-to-right maxima in the array. Now in the s-th sweep, let M be the sequence L concatenated with n - s. Assume M now has r elements in total.
I think you could actually implement a (very, very) inefficient "bubble_sweep" that worked along these lines. I think that makes it a bit clearer that, apart from at left-to-right maxima from a previous iteration, we just have subsequences of the original (uniform random) permutation, so the 1 - 1/(l+1) probability holds.
I hope that's clear(er) and I didn't make too many mistakes. The sun is shining and I must go ride my bike now!
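If anyone wants to run the counter experiment mentioned above, something like this works (my own sketch; I'm reading Q^k_l as the probability that the maybe-swap at position l in sweep k actually swaps - adjust the indexing to match the article's convention):

    import random
    from collections import defaultdict

    def swap_frequencies(n, trials=100000):
        # Count how often the maybe-swap at position l during sweep k
        # actually swaps, over uniformly random permutations.
        counts = defaultdict(int)
        for _ in range(trials):
            a = list(range(n))
            random.shuffle(a)
            for k in range(n - 1):          # sweep number
                for l in range(n - 1 - k):  # maybe-swap position
                    if a[l] > a[l + 1]:
                        a[l], a[l + 1] = a[l + 1], a[l]
                        counts[(k, l)] += 1
        return {kl: c / trials for kl, c in counts.items()}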
If the largest number in the array starts at index 0, then in the first sweep all `n-sweep_number = n-1` "maybe-swaps" turn into `n-1` swaps, each moving the largest number one spot to the right. So it ends in position `n-1`, which is exactly where it belongs.
[1]: http://www.cs.auckland.ac.nz/~mcw/Teaching/refs/sorting/quic...