I never really understood why a standard library would insist on having a single "sort" function (or at best, a "sort" and "stable sort"). Why not just provide explicit quick sort, merge sort etc. functions and let users choose which they want? Sorting is one of the first things a budding computer scientist learns, and the research on sorting algorithms is well-trodden ground by now, so I don't see how it could be a usability issue. I'd like to be able to choose from these criteria without having to rely on external libraries:
- How does it perform on average?
- How does it perform on pathological inputs?
- Is it stable?
- Does it allocate?
Not to mention constraints that an application can take advantage of:
- What if the data is already mostly sorted? (insertion sort)
- What if writes are very expensive? (cycle sort)
- What if all the data are integers? (radix sort)
And so on. Trying to force a one-size-fits-all solution seems rather silly at this point.
I think only having "sort" and "stable sort" makes sense for the same reason there's only one hashmap or one vector in most standard libraries: 99% of the time you just want a high-quality implementation of one reasonable choice, without getting into the weeds of it.
Most of the time I don't care if it's timsort or quicksort, or if the hashmap uses Cuckoo hashing or buckets made of linked lists. Of course there are the 1% of cases where I notice that my sort or hashmap is a performance bottleneck and that a different algorithm might perform better. But standard libraries aren't good at serving that need because they have to be broadly applicable and conservative, if I already know that I want top performance for specific requirements I'm almost certainly better served by an external library.
There are such things as sensible defaults though.
Timsort being a somewhat famous example, which uses the fact that python datastructures are known length to figure out the best sorting method for the size.
If you need something more exotic then you likely know more about what you're doing, but most people writing software who need a sort function usually don't need more than timsort.
> uses the fact that python datastructures are known length to figure out the best sorting method for the size.
That... does not match with my knowledge of how timsort works. The sorting method is chosen based on the array length, and every sorting algorithm in every language knows that. It's extremely suboptimal for some types because Python's typing is so weak!
Timsort succeeds because it has many good heuristics to optimize common properties of its merge intervals.
You are thinking about the weeds of sorting now, but that is rarely relevant to the problem at hand.
You can (and most people do) look at it from the other angle. Sort(seq) is a function which returns a permutation of seq such that every element is less-than-or-equal-to the next element. Usually, we use this function because we need this property to hold in our following code. A lot of the time, this function is called for very short sequences, where even the most naive implementation would be fats enough for even the highest scale program. So, in the vast majority of code we only care about the semantics of std::sort; and all of the above have the same semantics.
In fact, the fact that we do often get a stable sort as well shows that semantics are the most important characteristic, as stable and unstable sorts do have slightly different semantics.
There are defaults for all sorts of functions. find() functions on an array in many languages use some default. Parameters often have defaults when not set. Strings often have a default encoding.
The vast majority of the time, the default works well enough. Why make things more complicated than they need to be?
It also has the added benefits of helping more junior engineers when reading the code. "Cycle sort? What is that doing?" "Insertion sort? I was taught that was inefficient! I should spend time updating this code to a better sorting algorithm! My tech lead would love my initiative."
Or we just use sort(), and we all move on with our lives.
while i agree with you to some extent, i don't think including a very large set of sorting functions in the standard library is a good idea. especially for dynamic languages, a larger standard library means more disk usage and worse cache utilization. it's also typically relatively easy to use your own sort function: in performance-critical cases, the sort function is usually called directly, not via intermediate libraries (which might need to be patched to call the new sort function); also, sort functions are self-contained, i.e. typically don't require system dependencies.
> A larger standard library means more disk usage and worse cache utilization
Does it? How? I suppose one has to store the standard library on the machine, which does notionally use up disk space, but I can’t see how it contributes by any reasonable performance-relevant meaning of ‘disk usage’ (presumably i.e. disk I/O during the runtime of the program). And for cache utilization I simply have no clue at all what you might be driving at.
I think this goes back to the whole idea of non-CS people being the primary users of standard libraries.
I agree though, there should be the ability to specify which one, and every function in the standard library should have big O notation in the function description. Still makes sense to have a general-purpose good one as the default.
- How does it perform on average?
- How does it perform on pathological inputs?
- Is it stable?
- Does it allocate?
Not to mention constraints that an application can take advantage of:
- What if the data is already mostly sorted? (insertion sort)
- What if writes are very expensive? (cycle sort)
- What if all the data are integers? (radix sort)
And so on. Trying to force a one-size-fits-all solution seems rather silly at this point.