Java has switched from Mergesort to TimSort

jrockway · on Aug 10, 2009

I always get scared when I see a factorial term in big-O notation, but it's worth noting that this is better than n log n.

  n log n = log ( n ^ n ) = log ( n * n * n * ... )

while

  log ( n! ) = log ( n * (n-1) * (n-2) * ... )

n log n's repeating n does not get smaller, but log n!'s does. So it's slightly better.

(I had to think this through, so I thought posting it might help someone else.)

jbn92 · on Aug 10, 2009

http://en.wikipedia.org/wiki/Stirling%27s_approximation says that log(n!) is O(n log n) anyway...

philwelch · on Aug 10, 2009

Big O means an upper bound. n is also O(n log n),

In other words, while log(n!) is no worse than n log n, it is still better than n log n.

swolchok · on Aug 11, 2009

log(n!) = Theta(n log n)

Proof: let's take it as granted log(n!) = O(n log n). It remains to show that log(n!) = Omega(n log n).

log(n!) = sum of log i for i from 1 to n

>= sum of log (n/2) for i from n/2 to n (dropping half the terms (all nonnegative) and under-approximating the rest; technically, we need to floor or ceil n/2, but it doesn't matter)

= (n/2) log (n/2)

= (n/2) (log n - log 2)

= (1/2) n log n - n log 2 / 2

Hopefully, I don't need to bother proving that c(n log n) + kn = Theta(n log n) for constants c and k. We've shown that log n! is Omega(n log n) and O(n log n), so we have log n! = Theta(n log n) by definition.

So, the root of this thread is wrong (and so is the parent). Furthermore, the fact that log(n!) is smaller than n log n by a constant is irrelevant when comparing log(n!) to Theta(n log n). What would not be meaningless is comparing formulae for the exact number of comparisons performed by different algorithms.

jdoliner · on Aug 11, 2009

As the article mentions log( n! ) is the theoretical lower bound on the comparison sorting unstructured data. This is a neat little argument so I thought I'd share it with you. Given an unsorted list of data (of length n), sorting is equivalent the discovering the permutation that will return the elements to their correct order. Now there are a total of n! possible permutations that we need to sift through. Now each comparison can be thought of as a yes-no, since we've all played 20 questions, I assume we all have noticed that with m yes-no questions we can distinguish between 2^m possibilities. Thus we need 2^m = n! ie m = log2(n!). Giving us our bound.

callahad · on Aug 10, 2009

And here's the bug in Sun's tracker that suggested switching in the first place: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124

kingkilr · on Aug 10, 2009

... this same tripe was submitted to reddit, they haven't actually switched yet, this is just a random paste of the timsort.txt file that exists within the Python SVN repository.

callahad · on Aug 10, 2009

Agreed. The link is solely a description of timsort and has nothing to do with Java. There is, however, a bug in Sun's tracker suggesting that timsort be used for java.util.Arrays.sort. That bug has a state of "Accepted," which to my naive understanding indicates that Sun has consented to implementing the change. But then again, I don't have any experience with Java bug triage stages.

A better title for the submission would be "Timsort - Python's pragmatic merge sort" or something along those lines.

sketerpot · on Aug 10, 2009

Also, Android switched to timsort quite a while ago -- I think that's where the Java timsort code is coming from.

amichail · on Aug 10, 2009

How does TimSort improve (if at all) on what is published on the topic in academia?

mbrubeck · on Aug 10, 2009

Here's another link to the documentation (since svn.python.org is not responding for me): http://bugs.python.org/file4451/timsort.txt

TimSort is essentially merge sort plus optimizations. The major optimization is that it identifies already-sorted subsequences within the input, and sorts these "runs" plus some surrounding elements using a fast binary insertion sort. This means that for input that isn't completely random, it doesn't need to do as many comparisons as a pure mergesort. TimSort has also been tuned based on empirical testing on real-world data/hardware.

amichail · on Aug 10, 2009

Adaptive sorting has been studied in academia:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.8...

akeefer · on Aug 10, 2009

Academic work on subjects like sorting tends to focus more on things like big-O runtimes and less on implementation details that lead to reducing the various constants that big-O simplifies out, or that arise from real-world data patterns rather than totally randomized or idealized data.