Hacker News new | past | comments | ask | show | jobs | submit login
Java has switched from Mergesort to TimSort (pocoo.org)
58 points by fogus on Aug 10, 2009 | hide | past | favorite | 13 comments



I always get scared when I see a factorial term in big-O notation, but it's worth noting that this is better than n log n.

  n log n = log ( n ^ n ) = log ( n * n * n * ... )
while

  log ( n! ) = log ( n * (n-1) * (n-2) * ... )
n log n's repeating n does not get smaller, but log n!'s does. So it's slightly better.

(I had to think this through, so I thought posting it might help someone else.)


http://en.wikipedia.org/wiki/Stirling%27s_approximation says that log(n!) is O(n log n) anyway...


Big O means an upper bound. n is also O(n log n),

In other words, while log(n!) is no worse than n log n, it is still better than n log n.


log(n!) = Theta(n log n)

Proof: let's take it as granted log(n!) = O(n log n). It remains to show that log(n!) = Omega(n log n).

log(n!) = sum of log i for i from 1 to n

>= sum of log (n/2) for i from n/2 to n (dropping half the terms (all nonnegative) and under-approximating the rest; technically, we need to floor or ceil n/2, but it doesn't matter)

= (n/2) log (n/2)

= (n/2) (log n - log 2)

= (1/2) n log n - n log 2 / 2

Hopefully, I don't need to bother proving that c(n log n) + kn = Theta(n log n) for constants c and k. We've shown that log n! is Omega(n log n) and O(n log n), so we have log n! = Theta(n log n) by definition.

So, the root of this thread is wrong (and so is the parent). Furthermore, the fact that log(n!) is smaller than n log n by a constant is irrelevant when comparing log(n!) to Theta(n log n). What would not be meaningless is comparing formulae for the exact number of comparisons performed by different algorithms.


As the article mentions log( n! ) is the theoretical lower bound on the comparison sorting unstructured data. This is a neat little argument so I thought I'd share it with you. Given an unsorted list of data (of length n), sorting is equivalent the discovering the permutation that will return the elements to their correct order. Now there are a total of n! possible permutations that we need to sift through. Now each comparison can be thought of as a yes-no, since we've all played 20 questions, I assume we all have noticed that with m yes-no questions we can distinguish between 2^m possibilities. Thus we need 2^m = n! ie m = log2(n!). Giving us our bound.


And here's the bug in Sun's tracker that suggested switching in the first place: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124


... this same tripe was submitted to reddit, they haven't actually switched yet, this is just a random paste of the timsort.txt file that exists within the Python SVN repository.


Agreed. The link is solely a description of timsort and has nothing to do with Java. There is, however, a bug in Sun's tracker suggesting that timsort be used for java.util.Arrays.sort. That bug has a state of "Accepted," which to my naive understanding indicates that Sun has consented to implementing the change. But then again, I don't have any experience with Java bug triage stages.

A better title for the submission would be "Timsort - Python's pragmatic merge sort" or something along those lines.


Also, Android switched to timsort quite a while ago -- I think that's where the Java timsort code is coming from.


How does TimSort improve (if at all) on what is published on the topic in academia?


Here's another link to the documentation (since svn.python.org is not responding for me): http://bugs.python.org/file4451/timsort.txt

TimSort is essentially merge sort plus optimizations. The major optimization is that it identifies already-sorted subsequences within the input, and sorts these "runs" plus some surrounding elements using a fast binary insertion sort. This means that for input that isn't completely random, it doesn't need to do as many comparisons as a pure mergesort. TimSort has also been tuned based on empirical testing on real-world data/hardware.


Adaptive sorting has been studied in academia:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.8...


Academic work on subjects like sorting tends to focus more on things like big-O runtimes and less on implementation details that lead to reducing the various constants that big-O simplifies out, or that arise from real-world data patterns rather than totally randomized or idealized data.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: