On the Worst-Case Complexity of TimSort (dagstuhl.de)
204 points by pelario on Aug 31, 2018 | hide | past | favorite | 74 comments



I am surprised to see the author of TimSort (Tim Peters) does not have a Wikipedia entry. Seems to me he has enough claim to (wiki)fame.


It's probably just that nobody's gotten around to writing one.


The linked Java test file, http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java, still crashes the latest Java 10.0.2 with an 'Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 49'. Amazing! I wonder if this makes some web services vulnerable... if the user can submit a just-so array of ints to be sorted? But it does seem like it would require uploading a really huge array (>4GB?)


This looks like https://bugs.openjdk.java.net/browse/JDK-8203864, which has the following additional information:

"While working on a proper complexity analysis of the algorithm, we realised that there was an error in the last paper reporting such a bug (http://envisage-project.eu/wp-content/uploads/2015/02/sortin...). This implies that the correction implemented in the Java source code (changing Timsort stack size) is wrong and that it is still possible to make it break. This is explained in full details in our analysis: https://arxiv.org/pdf/1805.08612.pdf"


That is not additional information. That bug was created by the authors of this post, and the link is a reference to the exact paper linked to in this post.


Crashing out of execution doesn't really seem like an actionable attack vector, unless that exception bubbles all the way to the application invocation.


Causing a crash is an excellent vector for a denial-of-service attack.


If you can make your service allocate 10G of memory, it's already a denial of service attack.


Usually an unexpected exception results in an HTTP 500 response (or something similar) and a few lines in the log. It does not result in a stopped server.


It's an exception, not a crash.


Java http servers don't work this way. An exception bubbles up until something catches it; it only crashes the process if there is no handler. Every java http server wraps the request processing loop and catches exceptions so that it can provide a 500 Internal Server Error response.

There are likely a million ways malformed input can make a java server throw an exception; that's typically how java handles malformed input. It just gets caught and returned as an error to the client.
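The pattern is roughly the following (a hypothetical Python sketch of the idea, not any real framework's middleware API; the Java equivalent wraps the servlet dispatch the same way):

```python
def catch_all_middleware(handler):
    """Wrap a request handler so any uncaught exception becomes a 500
    response instead of killing the server process. A sketch of the
    pattern described above; names and shapes are made up."""
    def wrapped(request):
        try:
            return handler(request)
        except Exception:
            # Log and keep serving; one bad request doesn't stop the server.
            return (500, "Internal Server Error")
    return wrapped

def sort_endpoint(request):
    # A handler that sorts user-supplied data; malformed input raises.
    return (200, sorted(request["numbers"]))

safe_endpoint = catch_all_middleware(sort_endpoint)
```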


Chances are you'll get an OutOfMemoryException first :)


The Java test file has a comment asking the user to allocate more memory for the JVM.


Yes, but that hardly works when you're trying to exploit a system...


  arrayToSort[sum] = 1;
This is just blatant programmer error. The code is attempting to assign a value to an array slot that does not exist, past the end of a fixed-size array.

Use:

  Integer[] arrayToSort = new Integer[2000000000];
No error.


You are incorrect. Look at the error message more closely:

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 49
        at java.util.ComparableTimSort.pushRun(Unknown Source)
        at java.util.ComparableTimSort.sort(Unknown Source)
        at java.util.Arrays.sort(Unknown Source)
        at Test.main(Test.java:80)
The error occurs inside TimSort, not at

    arrayToSort[sum] = 1;
It's a bug with TimSort going out of bounds (which obviously shouldn't happen ever), not the Test.


The equivalent Python code works fine, though.

    >>> a=[0]*sum(rls)
    >>> sum=-1
    >>> for i in rls:
    ...   sum += i
    ...   a[sum] = 1
    ... 
    >>> a.sort()
    >>> 
Takes a real good while, too.


That's because the array to be sorted is packed and inflated, so as to be the worst possible input for that kind of sort.

Worst-case complexity.

Also, I needed to bump my JVM heap up to 16GB (not 9GB as recommended), just to run it.


Amazing there is still something to find in a sort algorithm.


Especially when targeting realistic machine models, there are a lot of things that are suboptimal about the classical sorting algorithms like quicksort or mergesort. For example, a quicksort with perfect choice of pivot will incur a branch miss with probability 50% for every element. That's not something classical complexity analysis measures, but on actual CPUs, branch misses have quite an impact (something like 10–15 cycles). Shameless plug for something I once built for a lecture to demonstrate this: https://github.com/lorenzhs/quicksort-pivot-imbalance

For sorting algorithms that take the behaviour of modern CPUs into account, check out ips4o (https://arxiv.org/abs/1705.02257, code: https://github.com/SaschaWitt/ips4o) or for a simpler algorithm that's still much faster than quicksort in most cases, blockquicksort (https://arxiv.org/abs/1604.06697, code: https://github.com/weissan/BlockQuicksort). Note that both papers were published in the last two years :)

Of course these algorithms are much more complex and error-prone to implement and use some additional memory, which may explain why they're not used in standard library implementations of popular languages.


You probably want to use pdqsort, since it detects and switches to heapsort for any killer inputs. https://drive.google.com/file/d/0B1-vl-dPgKm_T0Fxeno1a0lGT0E...


Thanks, I totally forgot about pdqsort! If I recall correctly, it's partially based on blockquicksort, but includes a lot of other improvements. It's a very interesting algorithm, much faster than introsort etc. However, I believe that IPS4o is still faster overall (I think I asked Sascha, one of the authors of IPS4o, about this a while ago, but I don't have any measurements at hand), even before accounting for IPS4o's parallel implementation. You should definitely check out both :)


I think I need to swear off of involving myself in order of complexity conversations and just get back to doing useful work. "The difference between theory and practice is that in theory there is no difference." and as processors evolve that just becomes more and more true for complexity calculations.

It's still a useful and helpful mental exercise but what matters is if I can do an operation cheaper and more reliably than you can. Order of complexity informs that decision but doesn't define it.

Even a moderately sized constant C is comparable to log(n) for quite a lot of the data sets most of us actually work with. Conversely, adding, subtracting or comparing two numbers of arbitrary precision (e.g. bignum) takes log(n) time, not O(1) time. Very few algorithms we call O(n) or less are implementable in less than O(n log n) as n -> ∞.

Complexity analysis is step 2. Step 1 being admitting you have a bottleneck. But there are a lot of other steps after those, with a lot of challenging, specialized work.
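Grade-school addition over digit lists makes the bignum cost explicit (a toy sketch; Python's own arbitrary-precision ints do the same thing internally in C):

```python
def add_digits(a, b):
    """Add two numbers stored as little-endian digit lists.
    The work is proportional to the number of digits, i.e. O(log n)
    in the magnitude n of the numbers -- not O(1)."""
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        carry, digit = divmod(da + db + carry, 10)
        result.append(digit)
    if carry:
        result.append(carry)
    return result
```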


It all depends on your machine model. You're right in that the RAM model doesn't model the performance characteristics of real-world computers very precisely. That's why it's a model :) Other models exist that can be used to analyse various aspects of algorithms, e.g. the External Memory model (which can be applied to any level of the memory hierarchy, e.g. to quantify data transfer between cache and RAM). Or you can count the number of hard-to-predict branches, etc.

We almost always assume that a machine word is large enough to describe the input size n. This usually implies constant-time operations on log(n) bits. Not doing so would clutter up notation and complicate analysis, and the whole point of using models is avoiding that where reasonably possible.

I mean you could take this thinking to the extreme and say that as we live in three-dimensional space, the wire length to access more and more memory has to grow at least with the cube root of the size of the memory, because that memory needs to physically be stored somewhere at the end of the day. That's not helpful for analysing algorithms, though :)


But for instance in a world where horizontal scalability is almost a given, we have to allow that the cost of sending a message between 2 arbitrary servers is going to be at least log(n) time, because if you build a star topology your entire server room would be wires.

So if you're doing a distributed hash, the base cost of fetching two values to compare them is not only nothing to sneeze at, it is probably fundamental to how you solve the problem.


Now you’re in some unspecified distributed model without even defining your variables. Again, models are there to allow us to reason about things without going into every detail. It might make more sense to treat the time to open a connection and the time to send one machine word over an established connection as variables in asymptotic analysis. Then you can reason about sequential processing time, communication volume, and latency in one term.


Out of curiosity, what do you read/follow that makes stuff like this discoverable?


I'm a PhD student in algorithmics / algorithm engineering, so I work with a lot of people who do stuff like this, even though my research isn't related to sorting. Super Scalar Sample Sort (which ips4o is based on) was co-authored by my advisor, Peter Sanders, and I re-implemented it a few years ago in modern C++ just for the fun of it. Turns out that was a lot nicer to read than the original code (which isn't public) and a bit faster, too. I put that on GitHub where the blockquicksort authors found it and contacted me and we traded some emails and made some improvements. Sometime later my advisor and two colleagues came up with an idea for in-place super-scalar sample sort, out of which eventually ips4o was born.

So, uh, I guess I just work with the right people :) Sorry that I can't give you anything concrete. All of these papers were presented at ESA (European Symposium on Algorithms), though, so that's a good venue to follow. But beware, ESA has a theory track that's a lot bigger than the experimental track, and papers published there can be somewhat unapproachable ;)


Ahh, that makes sense, thanks!

I'd recently asked elsewhere, 'I've always wondered: is there a good mathematical representation of an algorithm that is useful for algebraic manipulation? Like producing a quicksort from a bubble sort.' And got linked this paper [0]. This is the only time I've heard the word 'algorithmics' since then.

Any interesting 'entry level' reads you could send my way?

[0]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45....


Thanks for the link, that was fun to read (well, skim).

I don't think the term 'algorithmics' appears very often in publications, it's more of an umbrella term for many things. The stuff we work on in our group is sort of the practical side of theoretical computer science, in that our focus is on algorithms that can be efficiently implemented and don't just look good on paper. The methodology is called Algorithm Engineering, it's described quite well in https://en.wikipedia.org/wiki/Algorithm_engineering. Apart from ESA's track B, the major conferences there are Alenex and SEA (Symposium on Experimental Algorithms). All three are open access.

It's difficult to recommend anything in particular, not only because the scope is very broad, but also because most papers aren't written for a wide audience. Good writing is not something that academics optimize for (perversely, good writing can be seen as a negative point, in that if a paper is easy to understand, it may be rejected for being too simple), nor is it taught. Maybe something like https://github.com/papers-we-love/papers-we-love could serve as a starting point?


For people who aren't Ph.D. students, it may help to contact a local CS department and see what they have going on. There should at least be a grad student colloquium meeting once a week or so. I attended that at Penn a few times, although it was too far above me to really follow (lots of PL theory). More recently I've been going to a Database Reading Group at Portland State University (http://datalab.cs.pdx.edu/dbrg/index.php), which is a great way to get a taste of current research in that niche. Databases are less rarefied, too. :-)


This is awesome!


Example of the crazy optimizations that apply here: I once got a 20% speedup in a standard recursive search algorithm by abandoning the native function call stack and making my own search-specific stack data structure. Since these algorithms call the same function over and over, making the call stack do minimum work is actually significant.
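A minimal sketch of the idea in Python (the tree shape and function names are made up for illustration; the real win shows up in lower-level languages, where call frames carry comparatively expensive bookkeeping):

```python
def count_nodes_recursive(tree):
    # tree is a nested dict: {"children": [subtree, ...]} (hypothetical shape).
    return 1 + sum(count_nodes_recursive(c) for c in tree["children"])

def count_nodes_explicit(tree):
    """Same traversal with a hand-rolled stack: no call frames,
    no recursion-depth limit, and only the state we actually need
    is pushed per node."""
    count, stack = 0, [tree]
    while stack:
        node = stack.pop()
        count += 1
        stack.extend(node["children"])
    return count
```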


TimSort is an immensely complex sorting algorithm, full of magic numbers and difficult to analyse formally (e.g. the invariants it preserves are multi-line disjunctions). I'm not surprised there are still bugs to be found in it. I do think that issues like this make the case for using simpler, clearer algorithms even at a slight performance penalty.


> Amazing there is still something to find in a sort algorithm.

While more of an implementation detail, you might enjoy:

https://ai.googleblog.com/2006/06/extra-extra-read-all-about...

if you haven't seen it.

Discussed at the time and later, eg:

https://news.ycombinator.com/item?id=1130463

https://news.ycombinator.com/item?id=14906429


How is it that the abstract is talking about "Java version" and "Python version" when discussing computational complexity? Aren't algorithms algorithms, independent of the language you're implementing them in?


From TFA's introduction:

> there are actually not one, but two main versions of TimSort. The first version of the algorithm contained a flaw, which was spotted in [5]: while the input was correctly sorted, the algorithm did not behave as announced (because of a broken invariant). This was discovered by De Gouw and his co-authors while trying to prove formally the correctness of TimSort. They proposed a simple way to patch the algorithm, which was quickly adopted in Python, leading to what we consider to be the real TimSort. This is the one we analyze in Sections 3 and 4. On the contrary, Java developers chose to stick with the first version of TimSort, and adjusted some tuning values (which depend on the broken invariant; this is explained in Sections 2 and 5) to prevent the bug exposed by [5]. Motivated by its use in Java, we explain in Section 5 how, at the expense of very complicated technical details, the elegant proofs of the Python version can be twisted to prove the same results for this older version.

[5] Stijn De Gouw, Jurriaan Rot, Frank S de Boer, Richard Bubel, and Reiner Hähnle. OpenJDK's Java.utils.Collection.sort() is broken: The good, the bad and the worst case. In International Conference on Computer Aided Verification, pages 273–289. Springer, 2015.


It’s talking about the implementations in the respective standard libraries, which are apparently different.


From the article:

> In fact, there are two slightly different versions of TimSort that are currently implemented in Python and in Java respectively.


Yes. If the Java version has a different complexity, it is a different algorithm.

In the writers' defense, they have the algorithm in pseudocode in the article.


> If the Java version has a different complexity, it is a different algorithm.

It doesn't seem wrong to me to talk about different versions of the same algorithm when there are only minor differences.


Right, like how Quicksort can be pretty different depending on how you choose the pivot. It's still Quicksort, but there's different variants.


Yes, but they will share the same complexity.


You can make the average-case perform in O(n^2) by always pivoting on the smallest number in the array. Nobody would do _that_, but it shows that pivot choice can affect complexity.

Ergo, computer scientists researching the algorithm mathematically must consider the effect of choice of pivot.
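A toy Python sketch of that perverse pivot rule, counting comparisons to make the quadratic behaviour visible (not a real implementation; the cost of finding the minimum is itself part of the blow-up):

```python
def quicksort_worst_pivot(a):
    """Quicksort that always pivots on the smallest element, so every
    partition is maximally unbalanced. Returns (sorted list, number of
    comparisons); the count comes out to exactly n*(n-1), i.e. O(n^2)."""
    comparisons = 0
    def qs(xs):
        nonlocal comparisons
        if len(xs) <= 1:
            return xs
        pivot = min(xs)                 # worst possible pivot choice
        comparisons += len(xs) - 1      # cost of finding the min
        rest = list(xs)
        rest.remove(pivot)
        comparisons += len(rest)        # partition comparisons
        smaller = [x for x in rest if x < pivot]   # always empty here
        larger = [x for x in rest if x >= pivot]
        return qs(smaller) + [pivot] + qs(larger)
    return qs(a), comparisons
```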


Depending on how you choose the pivot, the worst-case complexity of Quicksort can be O(n^2) or O(n log n).


No, worst-case complexity of real quicksort is always O(n^2), regardless of pivot choice strategy (even with stochastic pivot choice, although you’d have to get very unlucky to hit that worst case). You can make the average case better or worse though.

The only way of making quicksort’s worst-case runtime O(n log n) is by limiting recursion depth, as done e.g. in introsort. But that’s no longer quicksort.
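A toy sketch of the introsort idea (the pivot rule and depth cap here are arbitrary choices for illustration, not the actual std::sort code):

```python
import heapq

def introsort_sketch(a, depth_limit=None):
    """Quicksort with a recursion-depth cap; past the cap, fall back to
    heapsort, which bounds the worst case at O(n log n) even on killer
    inputs. A sketch of the idea, not a tuned implementation."""
    if depth_limit is None:
        depth_limit = 2 * max(1, len(a)).bit_length()
    if len(a) <= 1:
        return a
    if depth_limit == 0:
        # Heapsort fallback: guaranteed O(n log n), no pivot games possible.
        heap = list(a)
        heapq.heapify(heap)
        return [heapq.heappop(heap) for _ in range(len(heap))]
    pivot = a[len(a) // 2]
    smaller = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]
    return (introsort_sketch(smaller, depth_limit - 1)
            + equal
            + introsort_sketch(larger, depth_limit - 1))
```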


Isn't there a linear-time median selection algorithm, which allows you to always select a pivot in the middle of the sorted order and create two equal halves? This produces a worst-case O(n log n) quicksort, which is no longer quick due to the big constant hidden in the O notation.


Correct, Quicksort with Quickselect for pivot choice.


"quickselect" is a selection algorithm that uses a partial quicksort in order to do a select. You're essentially saying to write quicksort using quicksort.

Quickselect requires a pivot choosing strategy; the problem is not only the same as quicksort's, it is the problem from quicksort.

According to Wikipedia, in the worst case, it is O(n²).[1] But that's not strictly correct, IMO. Regardless, it doesn't answer the OP's question of "is there a selection algorithm that operates in worst case O(n)"

[1]: https://en.wikipedia.org/wiki/Quickselect

[2]: https://news.ycombinator.com/item?id=17888755 and the parent comment; specifically, the median-of-medians algorithm is a worst-case O(n) selection algorithm.


This is wrong.

See https://en.m.wikipedia.org/wiki/Quicksort, section "Selection-based pivoting".


Specifically,

> A variant of quickselect, the median of medians algorithm, chooses pivots more carefully, ensuring that the pivots are near the middle of the data (between the 30th and 70th percentiles), and thus has guaranteed linear time – O(n). This same pivot strategy can be used to construct a variant of quicksort (median of medians quicksort) with O(n log n) time. However, the overhead of choosing the pivot is significant, so this is generally not used in practice.
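A compact (and deliberately non-optimized) Python sketch of that pivot rule, used as a worst-case O(n) selection routine:

```python
def median_of_medians_select(a, k):
    """Return the k-th smallest element of a (k counts from 0) in
    worst-case O(n) time, using the median-of-medians pivot rule
    quoted above. A sketch for illustration, not production code."""
    if len(a) <= 5:
        return sorted(a)[k]
    # Median of each group of 5, then recurse to find the pivot:
    # the median of those medians, guaranteed to be near the middle.
    groups = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    pivot = median_of_medians_select(medians, len(medians) // 2)
    smaller = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]
    if k < len(smaller):
        return median_of_medians_select(smaller, k)
    if k < len(smaller) + len(equal):
        return pivot
    return median_of_medians_select(larger, k - len(smaller) - len(equal))
```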



Given that rho can vary with the input and is a completely arbitrary value, shouldn't it also be called n?

My memories of the subject are not great, so I might be saying bullshit here.


In the worst case, rho is equal to n, and you get O(n log n). However, O(n + n log rho) gives a better description of how it performs on partially sorted arrays.


And in the best case (already sorted array), it's equal to 1 and the algorithm performs as O(n), which is nice to prove in one go.

In some other typical cases (an otherwise sorted array with one element inserted, or two sorted arrays appended to each other), rho is 3 and 2 respectively, so also O(n).


All sorting algorithms can trivially be made to perform in O(n) in the "already sorted" scenario without worsening the worst case complexity, so that isn't really helpful.
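For example, a one-line O(n) pre-check in Python (a trivial sketch, which helps only on exactly sorted input):

```python
def sort_with_fast_path(a):
    """One O(n) pre-check makes any sort O(n) on already-sorted input
    without touching its worst-case complexity. Unlike TimSort's
    adaptivity, though, the speed-up vanishes the moment a single
    element is out of place."""
    if all(a[i] <= a[i + 1] for i in range(len(a) - 1)):
        return list(a)
    return sorted(a)
```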


It can be done only by adding an initial check that does exactly that. But this means that the algorithm's speed doesn't gradually increase with partially sorted sequences.

For instance, TimSort is also very fast if only a single element is unsorted, or only two, or only three. These are not special cases explicitly handled; it's just the natural way the algorithm works.


Nitpick: \rho = n/2 in the worst case, if n > 1, but that still gives you O(n log n).


Why n/2? If the array is, for example, sorted in reverse order, then there is no monotonic run at all, in which case I believe the algorithm considers each element of the array to be a run in itself, giving n runs.


Runs can be increasing or decreasing, a reversed array is a single run. Worst case is n/2 because you can always split the array in pairs (if it alternates a high and a low value for example).
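A quick Python stand-in for the run count rho: a run is either non-decreasing or strictly decreasing, as in TimSort's run detection (strictness in the descending case is what keeps the sort stable).

```python
def count_runs(a):
    """Count maximal monotonic runs, TimSort-style: each run is either
    non-decreasing or strictly decreasing. Sorted and reversed arrays
    are 1 run; an alternating low/high array is n/2 runs (worst case)."""
    n = len(a)
    if n == 0:
        return 0
    runs, i = 0, 0
    while i < n:
        runs += 1
        i += 1
        if i == n:
            break
        if a[i] >= a[i - 1]:            # extend a non-decreasing run
            while i < n and a[i] >= a[i - 1]:
                i += 1
        else:                            # extend a strictly decreasing run
            while i < n and a[i] < a[i - 1]:
                i += 1
    return runs
```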


Ah I didn't know it also exploited reversed runs. Amazing.


Exactly. Sometimes those are called "adaptive algorithms", in the sense that the complexity depends on some properties of the input, so even though the worst-case complexity is still O(n log n), for many inputs it will do much better.


It shouldn't be called n, because then `n log n` and `n log m` convey different meanings. In the first case, `n` is one and the same variable, whereas in the second, `n` and `m` are independent of each other.

You can call it `m` or `rho` or whatever, just use a different variable.


Here, ρ is not independent of n, though


It sort of is, though. You can have a really large n with rho = 1, just like you can have a really small n with rho = n. They're orthogonal variables, and they both impact the run time.


Rho ranges from 1 to n as defined.

For randomized input with uniform statistics, it should be (n - log n) on average, which also gives a handle on Theta-notation complexity.


n varies with the input as well :) But yes, since ρ is bounded by n you can reduce the complexity to O(n log n) again. I think the important part here is to distinguish the complexity from that of other sorting algorithms: TimSort improves things for specific inputs, as they note.


Not good that Java's sort still has bugs.


I'd say it the other way around: 1/ it's amazing that code that is used in countless occurrences can still have a bug; 2/ (I've studied sorting algorithms for a while) finding such a bug is very clever... Kudo's to the authors.


What's amazing is how little understanding we seem to have over such critical widely-used code. Java opted for a complex and arguably unproven algorithm, so it was always a risk. We now have functional languages that are able to express provably correct sorting algorithms, so the bar is getting higher.


The java bug (https://bugs.openjdk.java.net/browse/JDK-8203864) has some more information on the history, including a reference to an earlier paper from 2015 relating to the same problem.

Apparently, the earlier paper had an error, so the implemented fix was not complete.

Still intriguing, though!


It looks deceptively like it, but 'kudos' is not the plural of a kudo (or a reference to some awesome thing Kudo once did).


… and even if it were a plural the apostrophe would be misplaced.


Is it really not? The dictionary says otherwise, albeit by back-formation (as is the English way).

https://www.merriam-webster.com/dictionary/kudo



