The linked Java test file, http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java, still crashes the latest Java 10.0.2 with an 'Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 49'. Amazing! I wonder if this makes some web services vulnerable... what if a user can submit a just-so array of ints to be sorted? But it does seem like it would require uploading a really huge array (>4GB?).
"While working on a proper complexity analysis of the algorithm, we realised that there was an error in the last paper reporting such a bug (http://envisage-project.eu/wp-content/uploads/2015/02/sortin...). This implies that the correction implemented in the Java source code (changing Timsort stack size) is wrong and that it is still possible to make it break. This is explained in full details in our analysis: https://arxiv.org/pdf/1805.08612.pdf"
That is not additional information. That bug was created by the authors of this post, and the link is a reference to the exact paper linked to in this post.
Crashing out of execution doesn't really seem like an actionable attack vector, unless that exception bubbles all the way to the application invocation.
Java HTTP servers don't work this way. An exception bubbles up until something catches it; it only crashes the process if there is no handler. Every Java HTTP server wraps the request-processing loop and catches exceptions so that it can return a 500 Internal Server Error response.
There are likely a million ways malformed input can make a Java server throw an exception; that's typically how Java handles malformed input. It just gets caught and returned as an error to the client.
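To make that concrete, here is a minimal sketch using the JDK's built-in com.sun.net.httpserver (parseBody and reply are hypothetical stubs, not any real service's code) of why the exception never reaches the top of the process:

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;
    import java.util.Arrays;

    public class SortServer {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/sort", (HttpExchange ex) -> {
                try {
                    Integer[] data = parseBody(ex); // hypothetical request parsing
                    Arrays.sort(data);              // Object[] overload -> TimSort
                    reply(ex, 200, "sorted");
                } catch (Exception e) {
                    // An ArrayIndexOutOfBoundsException thrown by TimSort lands
                    // here, becomes a 500, and the JVM keeps running.
                    reply(ex, 500, "Internal Server Error");
                }
            });
            server.start();
        }

        static Integer[] parseBody(HttpExchange ex) { return new Integer[0]; } // stub

        static void reply(HttpExchange ex, int code, String msg) {
            try {
                byte[] body = msg.getBytes();
                ex.sendResponseHeaders(code, body.length);
                ex.getResponseBody().write(body);
                ex.close();
            } catch (java.io.IOException ignored) {}
        }
    }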
You are incorrect. Look at the error message more closely:
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 49
        at java.util.ComparableTimSort.pushRun(Unknown Source)
        at java.util.ComparableTimSort.sort(Unknown Source)
        at java.util.Arrays.sort(Unknown Source)
        at Test.main(Test.java:80)
The error occurs inside TimSort, not at

    arrayToSort[sum] = 1;

It's a bug in TimSort going out of bounds (which obviously should never happen), not in the Test.
Especially when targeting realistic machine models, there are a lot of things that are suboptimal about the classical sorting algorithms like quicksort or mergesort. For example, a quicksort with perfect choice of pivot will incur a branch miss with probability 50% for every element. That's not something classical complexity analysis measures, but on actual CPUs, branch misses have quite an impact (something like 10–15 cycles). Shameless plug for something I once built for a lecture to demonstrate this: https://github.com/lorenzhs/quicksort-pivot-imbalance
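To illustrate with a toy example (my own, not from the repo above): in a textbook Lomuto partition the comparison against the pivot is a data-dependent branch, and a perfect pivot is exactly what makes it unpredictable:

    // Textbook Lomuto partition; names are illustrative.
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            // With a perfect (median) pivot, this condition holds for half the
            // elements of a random input, so the branch predictor is wrong
            // about 50% of the time -- roughly 10-15 wasted cycles per miss.
            if (a[j] < pivot) {
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
            }
        }
        int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;
        return i;
    }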
Of course these algorithms are much more complex and error-prone to implement and use some additional memory, which may explain why they're not used in standard library implementations of popular languages.
Thanks, I totally forgot about pdqsort! If I recall correctly, it's partially based on blockquicksort, but includes a lot of other improvements. It's a very interesting algorithm, much faster than introsort etc. However, I believe that IPS4o is still faster overall (I think I asked Sascha, one of the authors of IPS4o, about this a while ago, but I don't have any measurements at hand), even before accounting for IPS4o's parallel implementation. You should definitely check out both :)
I think I need to swear off involving myself in order-of-complexity conversations and just get back to doing useful work. "The difference between theory and practice is that in theory there is no difference." And as processors evolve, that just becomes more and more true for complexity calculations.
It's still a useful mental exercise, but what matters is whether I can do an operation cheaper and more reliably than you can. Order of complexity informs that decision but doesn't define it.
Even a moderately sized constant C is proportional to log(n) for quite a lot of the data sets most of us actually work with. Conversely, adding, subtracting, or comparing two numbers of arbitrary precision (e.g., bignums) takes log(n) time, not O(1) time. Very few algorithms we call O(n) are implementable in less than O(n log n) as n -> ∞.
Complexity analysis is step 2. Step 1 being admitting you have a bottleneck. But there are a lot of other steps after those, with a lot of challenging, specialized work.
It all depends on your machine model. You're right in that the RAM model doesn't model the performance characteristics of real-world computers very precisely. That's why it's a model :) Other models exist that can be used to analyse various aspects of algorithms, e.g. the External Memory model (which can be applied to any level of the memory hierarchy, e.g. to quantify data transfer between cache and RAM). Or you can count the number of hard-to-predict branches, etc.
We almost always assume that a machine word is large enough to describe the input size n. This usually implies constant-time operations on log(n) bits. Not doing so would clutter up notation and complicate analysis, and the whole point of using models is avoiding that where reasonably possible.
I mean you could take this thinking to the extreme and say that as we live in three-dimensional space, the wire length to access more and more memory has to grow at least with the cube root of the size of the memory, because that memory needs to physically be stored somewhere at the end of the day. That's not helpful for analysing algorithms, though :)
But for instance in a world where horizontal scalability is almost a given, we have to allow that the cost of sending a message between 2 arbitrary servers is going to be at least log(n) time, because if you build a star topology your entire server room would be wires.
So if you're doing a distributed hash, the base cost of fetching two values to compare them is not only nothing to sneeze at, it is probably fundamental to how you solve the problem.
Now you’re in some unspecified distributed model without even defining your variables. Again, models are there to allow us to reason about things without going into every detail. It might make more sense to treat the time to open a connection and the time to send one machine word over an established connection as variables in asymptotic analysis. Then you can reason about sequential processing time, communication volume, and latency in one term.
I'm a PhD student in algorithmics / algorithm engineering, so I work with a lot of people who do stuff like this, even though my research isn't related to sorting. Super Scalar Sample Sort (which ips4o is based on) was co-authored by my advisor, Peter Sanders, and I re-implemented it a few years ago in modern C++ just for the fun of it. Turns out that was a lot nicer to read than the original code (which isn't public) and a bit faster, too. I put that on GitHub where the blockquicksort authors found it and contacted me and we traded some emails and made some improvements. Sometime later my advisor and two colleagues came up with an idea for in-place super-scalar sample sort, out of which eventually ips4o was born.
So, uh, I guess I just work with the right people :) Sorry that I can't give you anything concrete. All of these papers were presented at ESA (European Symposium on Algorithms), though, so that's a good venue to follow. But beware, ESA has a theory track that's a lot bigger than the experimental track, and papers published there can be somewhat unapproachable ;)
I'd recently asked elsewhere, 'I've always wondered: is there a good mathematical representation of an algorithm that is useful for algebraic manipulation? Like producing a quicksort from a bubble sort.' And got linked this paper [0]. This is the only time I've heard the word 'algorithmics' since then.
Any interesting 'entry level' reads you could send my way?
Thanks for the link, that was fun to read (well, skim).
I don't think the term 'algorithmics' appears very often in publications; it's more of an umbrella term for many things. The stuff we work on in our group is sort of the practical side of theoretical computer science, in that our focus is on algorithms that can be efficiently implemented and don't just look good on paper. The methodology is called Algorithm Engineering, and it's described quite well in https://en.wikipedia.org/wiki/Algorithm_engineering. Apart from ESA's track B, the major conferences here are ALENEX and SEA (Symposium on Experimental Algorithms). All three are open access.
It's difficult to recommend anything in particular, not only because the scope is very broad, but also because most papers aren't written for a wide audience. Good writing is not something that academics optimize for (perversely, good writing can be seen as a negative point, in that if a paper is easy to understand, it may be rejected for being too simple), nor is it taught. Maybe something like https://github.com/papers-we-love/papers-we-love could serve as a starting point?
For people who aren't Ph.D. students, it may help to contact a local CS department and see what they have going on. There should at least be a grad student colloquium meeting once a week or so. I attended that at Penn a few times, although it was too far above me to really follow (lots of PL theory). More recently I've been going to a Database Reading Group at Portland State University (http://datalab.cs.pdx.edu/dbrg/index.php), which is a great way to get a taste of current research in that niche. Databases are less rarefied, too. :-)
Example of the crazy optimizations that apply here: I once got a 20% speedup in a standard recursive search algorithm by abandoning the native function call stack and making my own search-specific stack data structure. Since these algorithms call the same function over and over, making the call stack do minimum work is actually significant.
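The shape of the change was roughly this (a generic sketch of the technique, not the actual search code):

    import java.util.ArrayDeque;
    import java.util.Deque;

    class Node { Node left, right; }

    class Search {
        static void visit(Node n) { /* problem-specific work */ }

        // Recursive version: every call pays for a full JVM stack frame.
        static void searchRec(Node node) {
            if (node == null) return;
            visit(node);
            searchRec(node.left);
            searchRec(node.right);
        }

        // Hand-rolled stack: a "frame" is just one reference, so pushing and
        // popping does the minimum work the search actually needs.
        static void searchIter(Node root) {
            Deque<Node> stack = new ArrayDeque<>();
            if (root != null) stack.push(root);
            while (!stack.isEmpty()) {
                Node node = stack.pop();
                visit(node);
                if (node.right != null) stack.push(node.right);
                if (node.left != null) stack.push(node.left);
            }
        }
    }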
TimSort is an immensely complex sorting algorithm, full of magic numbers and difficult to analyse formally (e.g. the invariants it preserves are multi-line disjunctions). I'm not surprised there are still bugs to be found in it. I do think that issues like this make the case for using simpler, clearer algorithms even at a slight performance penalty.
How is it that the abstract is talking about "Java version" and "Python version" when discussing computational complexity? Aren't algorithms algorithms, independent of the language you're implementing them in?
> there are actually not one, but two main versions of TimSort. The first version of the algorithm contained a flaw, which was spotted in [5]: while the input was correctly sorted, the algorithm did not behave as announced (because of a broken invariant). This was discovered by De Gouw and his co-authors while trying to prove formally the correctness of TimSort. They proposed a simple way to patch the algorithm, which was quickly adopted in Python, leading to what we consider to be the real TimSort. This is the one we analyze in Sections 3 and 4. On the contrary, Java developers chose to stick with the first version of TimSort, and adjusted some tuning values (which depend on the broken invariant; this is explained in Sections 2 and 5) to prevent the bug exposed by [5]. Motivated by its use in Java, we explain in Section 5 how, at the expense of very complicated technical details, the elegant proofs of the Python version can be twisted to prove the same results for this older version.
[5] Stijn De Gouw, Jurriaan Rot, Frank S. de Boer, Richard Bubel, and Reiner Hähnle. OpenJDK's Java.utils.Collection.sort() is broken: The good, the bad and the worst case. In International Conference on Computer Aided Verification, pages 273–289. Springer, 2015.
You can make the average case perform in O(n^2) by always pivoting on the smallest number in the array. Nobody would do _that_, but it shows that pivot choice can affect complexity.
Ergo, computer scientists researching the algorithm mathematically must consider the effect of choice of pivot.
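A toy version, just to make the degenerate case concrete (nobody's real code, as noted):

    // Quicksort with a deliberately terrible pivot: always the minimum.
    // Every partition is maximally unbalanced, so the recursion goes n deep
    // and the total work is quadratic on *every* input, not just the worst one.
    static void badQuicksort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int min = lo;
        for (int i = lo + 1; i <= hi; i++) if (a[i] < a[min]) min = i;
        int t = a[lo]; a[lo] = a[min]; a[min] = t; // pivot = minimum, now at a[lo]
        // Nothing is smaller than the pivot, so the "left half" is always
        // empty and we recurse on n - 1 elements every time.
        badQuicksort(a, lo + 1, hi);
    }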
No, worst-case complexity of real quicksort is always O(n^2), regardless of pivot choice strategy (even with stochastic pivot choice, although you’d have to get very unlucky to hit that worst case). You can make the average case better or worse though.
The only way of making quicksort’s worst-case runtime O(n log n) is by limiting recursion depth, as done e.g. in introsort. But that’s no longer quicksort.
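For reference, the whole trick in introsort is a depth counter. A minimal sketch (plain Lomuto partition, heapsort as the fallback; not any library's actual implementation):

    class IntroSort {
        // Quicksort until the depth budget (~2*log2(n)) runs out, then heapsort,
        // capping the worst case at O(n log n).
        static void sort(int[] a) {
            int depth = 2 * (32 - Integer.numberOfLeadingZeros(Math.max(1, a.length)));
            introsort(a, 0, a.length - 1, depth);
        }

        static void introsort(int[] a, int lo, int hi, int depth) {
            if (lo >= hi) return;
            if (depth == 0) { heapsort(a, lo, hi); return; } // cap hit: bail out
            int p = partition(a, lo, hi);
            introsort(a, lo, p - 1, depth - 1);
            introsort(a, p + 1, hi, depth - 1);
        }

        static int partition(int[] a, int lo, int hi) {
            int pivot = a[hi], i = lo;
            for (int j = lo; j < hi; j++) if (a[j] < pivot) swap(a, j, i++);
            swap(a, i, hi);
            return i;
        }

        // Heapsort on a[lo..hi]: the O(n log n) escape hatch.
        static void heapsort(int[] a, int lo, int hi) {
            int n = hi - lo + 1;
            for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, lo, i, n);
            for (int end = n - 1; end > 0; end--) {
                swap(a, lo, lo + end);
                siftDown(a, lo, 0, end);
            }
        }

        static void siftDown(int[] a, int lo, int i, int n) {
            while (2 * i + 1 < n) {
                int c = 2 * i + 1;
                if (c + 1 < n && a[lo + c + 1] > a[lo + c]) c++;
                if (a[lo + i] >= a[lo + c]) break;
                swap(a, lo + i, lo + c);
                i = c;
            }
        }

        static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    }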
Isn't there a linear-time median selection algorithm, which allows you to always pick the pivot in the middle of the sorted order and create two equal halves? This produces a worst-case O(n log n) quicksort, which is no longer quick due to the big constant hidden in the O notation.
"quickselect" is a selection algorithm that uses a partial quicksort in order to do a select. You're essentially saying to write quicksort using quicksort.
Quickselect requires a pivot choosing strategy; the problem is not only the same as quicksort's, it is the problem from quicksort.
According to Wikipedia, in the worst case, it is O(n²).[1] But that's not strictly correct, IMO. Regardless, it doesn't answer the OP's question of "is there a selection algorithm that operates in worst case O(n)"
> A variant of quickselect, the median of medians algorithm, chooses pivots more carefully, ensuring that the pivots are near the middle of the data (between the 30th and 70th percentiles), and thus has guaranteed linear time – O(n). This same pivot strategy can be used to construct a variant of quicksort (median of medians quicksort) with O(n log n) time. However, the overhead of choosing the pivot is significant, so this is generally not used in practice.
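For the curious, that machinery fits in a few dozen lines. A sketch of the classical algorithm (my own illustration, not from any particular library):

    import java.util.Arrays;

    class MoMSelect {
        // k-th smallest (0-based) element of a[lo..hi]; worst-case O(n).
        static int select(int[] a, int lo, int hi, int k) {
            while (lo < hi) {
                int pivot = medianOfMedians(a, lo, hi);
                int p = partition(a, lo, hi, pivot);
                if (k == p) return a[k];
                if (k < p) hi = p - 1; else lo = p + 1;
            }
            return a[lo];
        }

        // Sort groups of 5, gather their medians at the front of the range,
        // then recursively select the true median of those medians. That value
        // is guaranteed to lie between the 30th and 70th percentiles.
        static int medianOfMedians(int[] a, int lo, int hi) {
            if (hi - lo < 5) {
                Arrays.sort(a, lo, hi + 1);
                return a[lo + (hi - lo) / 2];
            }
            int m = lo;
            for (int i = lo; i <= hi; i += 5) {
                int end = Math.min(i + 4, hi);
                Arrays.sort(a, i, end + 1);
                swap(a, m++, i + (end - i) / 2);
            }
            return select(a, lo, m - 1, lo + (m - 1 - lo) / 2);
        }

        // Lomuto partition around a pivot *value*; returns its final index.
        static int partition(int[] a, int lo, int hi, int pivot) {
            for (int i = lo; i <= hi; i++) {
                if (a[i] == pivot) { swap(a, i, hi); break; }
            }
            int store = lo;
            for (int i = lo; i < hi; i++) {
                if (a[i] < pivot) swap(a, i, store++);
            }
            swap(a, store, hi);
            return store;
        }

        static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    }

select(a, 0, a.length - 1, a.length / 2) yields the median; plugging medianOfMedians into quicksort's pivot choice gives the O(n log n) variant the quote mentions.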
In the worst case, rho is equal to n, and you get O(n log n). However, O(n + n log rho) gives a better description of how it performs on partially sorted arrays.
All sorting algorithms can trivially be made to perform in O(n) in the "already sorted" scenario without worsening the worst case complexity, so that isn't really helpful.
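Concretely, a wrapper like this trivial sketch is all it takes:

    // O(n) pre-check; if it fails, fall through to the normal sort,
    // leaving the worst case unchanged.
    static void sortWithCheck(int[] a) {
        boolean sorted = true;
        for (int i = 1; i < a.length && sorted; i++) sorted = a[i - 1] <= a[i];
        if (!sorted) java.util.Arrays.sort(a);
    }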
It can be done only by adding an initial check that does exactly that. But this means the algorithm's speed doesn't gradually improve as sequences become more nearly sorted.
For instance, timsort is also very fast if only a single element is unsorted, or only two elements, or only three. These are not special cases explicitly handled; it's just the natural way the algorithm works.
Why n/2? If the array is, for example, sorted in reverse order, then there is no monotonic run at all, in which case I believe the algorithm considers each element of the array to be a run in itself, giving n runs.
Runs can be increasing or decreasing, so a reversed array is a single run. The worst case is n/2 runs because any two adjacent elements form a run, so you can always split the array into pairs (if it alternates a high and a low value, for example).
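To make the run definition concrete, an illustrative sketch in the TimSort spirit (not the OpenJDK code):

    // Count maximal runs: non-decreasing, or strictly decreasing.
    // A reversed array yields 1; a high/low alternation yields n/2.
    static int countRuns(int[] a) {
        int runs = 0;
        for (int i = 0; i < a.length; ) {
            int j = i + 1;
            if (j < a.length && a[j] < a[i]) {
                while (j < a.length && a[j] < a[j - 1]) j++;  // strictly decreasing
            } else {
                while (j < a.length && a[j] >= a[j - 1]) j++; // non-decreasing
            }
            runs++;
            i = j;
        }
        return runs;
    }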
Exactly. Sometimes those are called "adaptive algorithms", in the sense that the complexity depends on some property of the input, so even though the worst-case complexity is still O(n log n), for many inputs it will do much better.
It shouldn't be called n, because then `n log n` and `n log m` convey different meanings. In the first case, `n` is one and the same variable, whereas in the second, `n` and `m` are independent of each other.
You can call it `m` or `rho` or whatever, just use a different variable.
It sort of is though. You can have a really large N with rho = 1, just like you can have a really small N with rho = N. They're orthogonal variables, and they both impact the run time.
n varies with the input as well :) But yes, since ρ is bounded by n, you can reduce the complexity to O(n log n) again. I think the important part here is to distinguish Timsort's complexity from that of other sorting algorithms: as the authors note, it improves things for specific classes of inputs.
I'd say it the other way around: 1/ it's amazing that code used in countless places can still have a bug; 2/ (I've studied sorting algorithms for a while) finding such a bug is very clever... Kudos to the authors.
What's amazing is how little understanding we seem to have over such critical widely-used code. Java opted for a complex and arguably unproven algorithm, so it was always a risk. We now have functional languages that are able to express provably correct sorting algorithms, so the bar is getting higher.