Changing std::sort at Google’s scale and beyond (danlark.org)
566 points by ashvardanian on April 20, 2022 | 156 comments



Recent and related:

Go will use pdqsort in next release - https://news.ycombinator.com/item?id=31106157 - April 2022 (112 comments)


Hi all, I'm the author of pdqsort, which is credited in the post (to be clear: the authors acknowledged pdqsort at the bottom; I am not directly associated with the posted work). I recently gave a talk at my institute on efficient in-memory sorting and the ideas in pdqsort, in case you're interested in hearing some of the theory behind it all: https://www.youtube.com/watch?v=jz-PBiWwNjc

Next week I will give another talk at the Dutch seminar on data systems design (https://dsdsd.da.cwi.nl/) on glidesort, a new stable sorting algorithm I've been working on. It is a combination of an adaptive quicksort (like pdqsort, fully adaptive for many equal elements) and an adaptive mergesort (like timsort, fully adaptive for long pre-sorted runs). It is the first practical implementation I'm aware of that's fully adaptive for both. Like pdqsort it uses modern, architecture-aware branchless sorting, and it can use an arbitrary buffer size, becoming faster as you give it more memory (although with a constant buffer size it degrades to O(n (log n)^2) in theory; in practice, for realistic workloads, it's just a near-constant factor (c ~<= 3-5) slower).

The source code isn't publish-ready yet: I still have to do some extra correctness vetting and testing, and in particular exception safety is not yet fully implemented. This is important because I wrote it in Rust, where we must always give back a valid, initialized array, even if a comparison operator causes a panic. But I do have some performance numbers to quote that won't change significantly.

For sorting 2^24 randomly shuffled distinct u32s using a buffer of 2^23 elements (n/2), glidesort beats Rust's stdlib slice::sort (which is a classic timsort, also using a buffer of n/2) by a factor of 3.5. When stably sorting the same numbers comparing only their least significant 4 bits, it beats stdlib slice::sort by 12.5 times while using 6.5 times fewer comparisons; both numbers are from my Apple M1 MacBook. All of this is just single-threaded code with a generic comparison function. No SIMD, no threads, no type-specific optimizations.

Finally, glidesort with a buffer size of >= n/2 is faster than pdqsort.


Any chance you could comment on fluxsort[0], another fast quicksort? It's stable and uses a buffer about the size of the original array, which sounds like it puts it in a similar category as glidesort. Benchmarks against pdqsort at the end of that README; I can verify that it's faster on random data by 30% or so, and the stable partitioning should mean it's at least as adaptive (but the current implementation uses an initial analysis pass followed by adaptive mergesort rather than optimistic insertion sort to deal with nearly-sorted data, which IMO is fragile). There's an in-place effort called crumsort[1] along similar lines, but it's not stable.

I've been doing a lot of work on sorting[2], in particular working to hybridize various approaches better. Very much looking forward to seeing how glidesort works.

[0] https://github.com/scandum/fluxsort

[1] https://github.com/scandum/crumsort

[2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...


I have stolen a lot of ideas from scandum and extended his ideas in new ways. He is definitely a mad genius. Glidesort (on my M1 machine at least) matches fluxsort within a couple % for random data, but glidesort is robust in that it will always take advantage of pre-sorted runs and many equal elements (at least if it has buffer memory), no matter where they are in the array.

In particular, I was inspired by three things from scandum:

1. fluxsort's out-of-place stable partitioning. From this I got reminded that not only is out-of-place stable partitioning a thing, it's highly competitive. I've always had this as an idea in the back of my mind, but never went through with it because I kept getting discouraged by C++'s distinction of moving to uninitialized memory vs. moving into a moved-from value (which is why I implemented glidesort in Rust).

2. quadsort's "ping pong merge", which reduces unnecessary memcpys by merging both on the way out and on the way in the original array. I did have this idea before, but always dismissed it because I thought keeping track of what's where would be a massive pain. Simply waiting until there's 4 things to merge eliminates this problem and is just genius.

3. quadsort's "branchless parity merge", which merges from both ends of the array if the merge is perfectly balanced. I make no claim that I thought of this, it's just genius. I had two key takeaways from this: you can make some very fast small sorting algorithms with merges, and interleaving loops to reduce data dependencies is significantly faster. (A rough sketch of the branchless merge step is shown below.)
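
For readers who haven't seen the trick, here is a minimal sketch of a branchless merge step (illustration only, in C++; not quadsort's or glidesort's actual code): every iteration writes one element and advances exactly one of the two source pointers by the comparison result, so the unpredictable comparison never feeds a conditional branch. A parity merge runs one such loop from the front and an independent one from the back when the two runs have equal length.

    // Branchless merge (illustrative sketch): the comparison result selects the
    // element and advances the pointers without a hard-to-predict branch, so
    // independent copies of this loop can be interleaved for extra ILP.
    #include <cstddef>

    void branchless_merge(const int* left, std::size_t nl,
                          const int* right, std::size_t nr, int* out) {
        const int* lend = left + nl;
        const int* rend = right + nr;
        while (left != lend && right != rend) {
            bool take_left = !(*right < *left);   // stable: prefer left on ties
            *out++ = take_left ? *left : *right;
            left  += take_left;                   // exactly one of these advances,
            right += !take_left;                  // selected without branching
        }
        while (left != lend)  *out++ = *left++;   // copy whichever run remains
        while (right != rend) *out++ = *right++;
    }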

So I combined #1 & #3 into what I call bidirectional stable partitioning, where I partition from both sides of the array into an out-of-place buffer through interleaved loops.

I extended the adaptiveness and applicability of #2 heavily by replacing the 'merge' operation in powersort (https://arxiv.org/abs/1805.04154) with a 'virtual merge' operation that delays merges until necessary. This is also what allows me to use quicksort in a bottom-up adaptive mergesort, because I don't eagerly sort small runs! Instead I simply keep unsorted runs around, 'merging' unsorted runs simply by concatenating them - purely in bookkeeping.

I heavily extended 3 for the mergesort part by realizing we don't need perfectly balanced merges, we can just take the `min` of the two runs and start off with a merge from both sides, and then look further. I also did more interleaving by doing a binary search to compute independent parallel merges where sensible, and interleaving those loops.

As a quick preview, here is a visualization of glidesort using a buffer size of n/2, where I have artificially limited the concatenation of unsorted runs to n/8 so that it won't just look only like quicksort, and both the quicksort and mergesort aspects are shown: https://cdn.discordapp.com/attachments/273539705595756544/96...


Thanks for the discussion! Can't say I follow everything, but using parity merge for part of an unbalanced merge makes a lot of sense and that alone is worth it.

Stepped through the video a few times at 1/4 speed. The n/8 thing is a bit confusing, first because I didn't read it and second because it makes it hard to tell a partition result from the beginning of the next segment. I think I can follow what's going on, but I don't get the purpose of the bidirectional partition. It doesn't use less memory, does it? So is there something to do with fitting in with mergesort better? I'm not familiar with powersort; I'll read up on it.


> but I don't get the purpose of the bidirectional partition. It doesn't use less memory, does it? So is there something to do with fitting in with mergesort better

Nope, it does the same number of comparisons, same number of operations, same memory, etc. What it does do is allow you to interleave two independent loops, which is also what makes the parity merge fast (I think scandum misattributes this effect to loop unrolling for large arrays - you can loop-unroll merging of large arrays either way - for small constant-size merges, however, it does matter).

A modern CPU has a very long pipeline, and even though we like to pretend all instructions are one cycle with no latency, in reality there are real latencies to instructions and memory accesses, and multiple instructions can be executed in the same cycle. Since each iteration of the partition depends on the previous one (which pointer did we increment? we have to know before we can execute the next store), you can hide these latencies better if you interleave two independent loops. In addition you can use more instruction-level parallelism.
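
As a concrete illustration (my own sketch, not the actual fluxsort or glidesort code), a branchless out-of-place partition step looks something like the following; the bidirectional variant runs one such loop from the front of the input and an independent one from the back, so their steps can be interleaved and the CPU can overlap their latencies:

    // Branchless out-of-place partition (illustrative sketch). Every element is
    // written to both output cursors, and only the "correct" cursor advances, so
    // the comparison result never feeds a conditional branch. Scanning forward
    // keeps both groups in their original relative order, i.e. it is stable.
    // lt_out and ge_out must each have room for n elements.
    #include <cstddef>

    std::size_t branchless_partition(const int* in, std::size_t n, int pivot,
                                     int* lt_out, int* ge_out) {
        std::size_t lt = 0, ge = 0;
        for (std::size_t i = 0; i < n; ++i) {
            int v = in[i];
            bool is_lt = v < pivot;
            lt_out[lt] = v;   // both writes always happen...
            ge_out[ge] = v;
            lt += is_lt;      // ...but only one cursor moves
            ge += !is_lt;
        }
        return lt;            // number of elements smaller than the pivot
    }

Two instances of this loop walking toward each other touch disjoint inputs and disjoint cursors, so their iterations have no data dependencies on each other - which is exactly the instruction-level parallelism described above.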


Got it. I'd considered that but something about your explanation threw me off. A subtlety I missed was that you can't just do two forward partitions, because the second one begins exactly halfway through the array—one of the results would be placed there but it's probably not the correct start location for the larger half-partition.


Exactly, which is why I call it bidirectional partitioning: one forward, one backward. It's a very strange case where you can use parallelism (in this case instruction-level parallelism), but only get two independent instances without the ability to recurse further.

You can of course make partitioning embarrassingly parallel; look at IPS4o for that. But it is vastly more complicated, and involves the overhead of shuffling blocks after the partition.


I got that impression from your linked video -- the algorithm looks cache-friendly and pipeline friendly.


FWIW, I found scandum's notes on pdqsort here on HN: https://news.ycombinator.com/threads?id=scandum


Random question, but is this from the Google bare metal team? I remember interviewing with them and the work they did was fascinating.


Hi, the author here.

I work on MapReduce/data-pipelining/Dataproc efficiency internally and externally. On the cloud side you can check out Google Cloud Dataflow.

We work closely with the general efficiency and bare metal teams, yes :). They definitely do fascinating work. This one was my 20% project together with the Google efficiency team.


I'm not sure if you're asking me personally or about the original post, but to be clear: I'm not (currently) associated with Google, nor is my work on pdqsort or glidesort. I'm also not associated with the original post, other than that my work on pdqsort was credited at the bottom of the blog post, which is what my opening sentence was referring to.

I edited my comment to reflect this, I can see how it could've been misinterpreted. It was not my intention to deceive.


Is the buffer here extra memory in addition to the input? What about just an O(log N) size buffer?


> Is the buffer here extra memory in addition to the input?

Yes.

> What about just an O(log N) size buffer?

That would effectively just be a constant sized buffer. Honestly O(log N) factors where N is the size of the input are overrated and are probably best seen as a constant factor. Like, even if you're sorting 16 gigabytes of 32-bit floats, log2 N is still just 32.


Thanks. As a follow up, isn't regular quicksort just using O(log N) memory - for the recursion stack? Assuming random pivot and considering average performance. Seems like O(log N) is worth comparing to...?


Glidesort also - from an academic standpoint - always uses O(log N) memory, log2(N)*4 words to be precise.

But on 64-bit machines log2(N) is always <= 64, so I just allocate a 64-element array of 4 word structs on the stack.

I think it's just much more productive and predictive to say that "glidesort has performance X when using a 512-element buffer" than "glidesort has performance X when using a c log (n) buffer".


Will you take on radix sorting algorithms next? :)


> https://www.youtube.com/watch?v=jz-PBiWwNjc

FWIW, the 360p mp4 (youtube-dl -f18) encoding of this seems to be truncated ("Got server HTTP error: Downloaded 989757 bytes, expected 61794145 bytes."), so you might want to try and get youtube to reencode it. (-f22/720p works fine, so it's probably a transient error rather than a problem with your original upload.)


Fun safety problems.

Towards the end of this post, the author mentions a bunch of places where std::sort() blows up if your data or your sort rule doesn't exhibit strict weak ordering. In some cases you get a bad sort, which feels fair enough since you violated the rules the sort algorithm is relying on, but in others it literally has Undefined Behaviour in C++.

Notably Rust's sort() and unstable_sort() do not have this problem, because of course these are safe Rust, and safe Rust doesn't have Undefined Behaviour. But why can this work?

First up, one bullet which gets dodged is that Rust's floating point types (f32 and f64) are not Ord. Rust acknowledges that NaN != NaN and therefore we can't expect to go around sorting a slice of floating point values. In C++ that's on the programmer: the fact that you can write a < b means the type counts as ordered, even though it trivially isn't for several built-in types. So if you naively try to sort a [f32], Rust says you can't because it isn't Ord; you will need to write a lambda or whatever to Order these f32s, and then if some of them are NaN your lambda is responsible for deciding what to do about that.

But even if we insist on getting in harm's way, safe Rust promises to protect us anyway. For example if I make an array of a thousand misfortunate::OnewayGreater<u8>s they're only one byte each, yet they all insist they're Ordered greater than each other and even than themselves, Rust's sort() doesn't cause Undefined Behaviour on this input. It may never succeed, since all the elements to be "sorted" insist they're greater than each other, but it definitely won't segfault, trash unrelated memory or cause chaos.


> But even if we insist on getting in harm's way, safe Rust promises to protect us anyway. For example if I make an array of a thousand misfortunate::OnewayGreater<u8>s they're only one byte each, yet they all insist they're Ordered greater than each other and even than themselves, Rust's sort() doesn't cause Undefined Behaviour on this input. It may never succeed, since all the elements to be "sorted" insist they're greater than each other, but it definitely won't segfault, trash unrelated memory or cause chaos.

This argument feels like a difference in semantics, rather than any actual added safety.

Rust will happily let you run the sort with an unstable comparator, which then leads to undefined behavior in the sense that the final sorting order, or whether the sort completes at all, is not defined by the standard and is instead dependent on the internal implementation. The fact that this is not called "undefined behavior" in Rust is a difference in semantics only.

Added safety would be if Rust could detect the fact that the comparator is unstable and tell the user so they could fix the bug in their comparator.

What is perhaps more relevant here than any safety provided by the Rust language is that there is only one Rust compiler. Relying on internal implementation details of that compiler is much more accepted, and the distinction between "internal compiler implementation detail" and "behavior defined by the standard" is a lot less relevant than it is in C or C++, which have many different implementations.


"Undefined behaviour" has a very specific meaning (perhaps ironically) in programming language design. It means anything could happen, including crashing during the undefined operation (i.e. crashing during the sort() in this case), appearing to work but then crashing afterwards, or even crashing before the undefined operation starts! (Or doing other weird things, not just crashing - but a time travelling crash is already plenty weird enough for this discussion.)

If Rust is able to define its sort() as only ever affecting the array passed to it, with the possibility of looping forever, then that is very well defined behaviour in comparison.


Yup. If the only potential variation is in the resulting ordering, C/C++ would call this implementation-defined or unspecified rather than undefined.

One example of unspecified behaviour is the order in which function call parameters are evaluated (meaning any order is OK, and it could vary from call to call).

One example of implementation-defined behaviour is the value of sizeof(int).


> Rust will happily let you run the sort with an unstable comparator, which then leads to undefined behavior in the sense that the final sorting order, or whether the sort completes at all, is not defined by the standard and is instead dependent on the internal implementation. The fact that this is not called "undefined behavior" in Rust is a difference in semantics only.

You can't avoid allowing this in a general purpose sort. If Rust wanted to insist that the comparator must be "stable" in this sense, it would have to do that by defining an unsafe trait (maybe "ReallyReallyOrdered"), implementations of which promise they're correct (hence the "unsafe" requirement: a safe trait like Ord can't promise anything the compiler can't prove).

There's no difference in semantics at work. In C++ sort() with a bad comparator has Undefined Behaviour and so might do absolutely anything, in Rust its behaviour is defined in a way you simply don't like.

> Added safety would be if Rust could detect the fact that the comparator is unstable and tell the user so they could fix the bug in their comparator.

This could only be a best-efforts check, since if you were to try a.cmp(b) a million times, nothing stops a's comparator from counting and giving a different answer on the next try after your millionth check.

I always meant to get around to writing some sort of misfortunate::RandomOrder (but with a cleverer name, maybe Tombola or ManekiNeko ?) which would exhibit completely random ordering, again this isn't methodically detectable, a detector can only have probabilistic success.

Unlike C++, all Rust's built-in types have proper behaviour here, so if you run into trouble either you're using a third party type, or a third party comparator for the sort itself, and since there is never Undefined Behaviour you can track down the problem.


> or cause chaos

Running forever is pretty chaotic.


Running forever is pretty stable. Halting is chaotic.


A trivial infinite loop is Undefined Behaviour in C++, but it is a completely reasonable thing to write in Rust and works fine. In fact, Rust's underlying loop construct is an infinite loop unless you explicitly provide some way to break out of it, which you usually will.


> LLVM history: Back then we recognized some really interesting benchmarks and we didn’t recall anybody trying to really benchmark sorts on different data patterns for standard libraries.

Timsort https://en.wikipedia.org/wiki/Timsort :

> In the worst case, Timsort takes O(n log n) comparisons to sort an array of n elements. In the best case, which occurs when the input is already sorted, it runs in linear time, meaning that it is an adaptive sorting algorithm. [3]

> It is advantageous over Quicksort for sorting object references or pointers because these require expensive memory indirection to access data and perform comparisons and Quicksort's cache coherence benefits are greatly reduced. [...]

> Timsort has been Python's standard sorting algorithm since version 2.3 [~2002]. It is also used to sort arrays of non-primitive type in Java SE 7,[4] on the Android platform,[5] in GNU Octave,[6] on V8,[7] Swift,[8] and Rust.[9]

Sorting algorithm > Comparison of algorithms https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_o...

Schwartzian transform https://en.wikipedia.org/wiki/Schwartzian_transform :

> In Python 2.4 and above, both the sorted() function and the in-place list.sort() method take a key= parameter that allows the user to provide a "key function" (like foo in the examples above). In Python 3 and above, use of the key function is the only way to specify a custom sort order (the previously supported cmp= parameter that allowed the user to provide a "comparison function" was removed). Before Python 2.4, developers would use the lisp-originated decorate–sort–undecorate (DSU) idiom,[2] usually by wrapping the objects in a (sortkey, object) tuple

Big-O Cheatsheet https://www.bigocheatsheet.com/

Quicksort in Python 7 ways (and many other languages) on RosettaCode: https://rosettacode.org/wiki/Sorting_algorithms/Quicksort#Py...


Thanks, I'll remove the note about "recalling"


xeus-cling didn't exist back then: https://github.com/jupyter-xeus/xeus-cling#a-c-notebook

https://github.com/fffaraz/awesome-cpp#debug

A standard way to benchmark and chart {sorting algorithms, web framework benchmarks,} would be great.

The TechEmpower "framework overhead" benchmarks might have at least average case sorting in there somewhere: https://www.techempower.com/benchmarks/#section=data-r20&hw=...


awesome-algorithms https://github.com/tayllan/awesome-algorithms#github-librari...

awesome-theoretical-computer-science > Machine Learning Theory, Physics; Grover's; and surely something is faster than Timsort: https://github.com/mostafatouny/awesome-theoretical-computer...


Christ, the description on that second repo is so comically terribly spelled that I can’t figure out whether it’s a joke:

> The interdicplinary of Mathematics and Computer Science, Distinguisehed by its emphasis on mathemtical technique and rigour.

(Perhaps a witty satire on the reputation of mathmos for being godawful at anything literary...)


IIRC Rust and Swift have a modified timsort implementation to remove the "galloping" strategy.


Rust has "an adaptive, iterative merge sort inspired by timsort" to implement the stable sort() method if you have an allocator, and a pattern-defeating quicksort to implement unstable_sort() which is provided even if you don't have an allocator (no_std embedded environments).


I don't understand why you would need Timsort. If the pivot element is chosen randomly, Quicksort is always (or with probability bordering on certainty) O(n log n).

Update: I confused Timsort with IntroSort. I should have googled before posting ...


> I don't understand why you would need Timsort. If the pivot element is chosen randomly, Quicksort is always (or with probability bordering on certainty) O(n log n).

For starters, it's a stable sort, which excludes quicksort right out the gates.

Second, the primary hypothesis of timsort is that real-world data is non-random; timsort is biased toward partially or almost-sorted sequences, which it can sort in as little as O(n). That would be the point westurner was making: timsort's creation is very much predicated on "really benchmark sorts on different data patterns".

Third, for very short sequences (also a common situation) it's an insertion sort, which is extremely efficient due to the low overhead (even more so combined with the adaptivity which insertion sort also has).


> For starters, it's a stable sort, which excludes quicksort right out the gates.

Negative, it excludes in-place quicksort but out-of-place quicksort can be stable just fine.

> Third, for very short sequences (also a common situation) it's an insertion sort, which is extremely efficient due to the low overhead (even more so combined with the adaptivity which insertion sort also has).

Every modern quicksort implementation also does this (or a similarly efficient smallsort).


Hi, you are ignoring the second point, and for the other two points you are referring to particular extensions of quicksort. Quite possibly you can extend quicksort similar to how timsort extended mergesort, but if that was commonplace nobody would care about timsort.


> and for the other two points you are referring to particular extensions of quicksort

Maybe for the third point, but once again, every implementation does this. For the first point, absolutely not. In-place partitioning may be the only variant that's usually described, but out-of-place partitioning is still just quicksort, not an extension.

> Quite possibly you can extend quicksort similar to how timsort extended mergesort

Please see my top comment in this thread - I have done exactly that. Or rather, I have extended powersort (which is very similar to timsort) with quicksort.


I ported Timsort to the Windows kernel back in college. Good times!


The "use ASLR for random seed" trick in there is pretty cute, but I noticed it will just silently not work in non-ASLR settings, and in particular under Wasm.


Recent versions of Lua use this trick as well.


Is there any other way to generate entropy without an OS dependency?


timestamp counter?


Only works on x86.


As far as entropy generation goes, other architectures have equivalents, albeit taking a few more instructions than just rdtsc.


Not the end of the world though surely.


OP describes systems which seed tests with non-determinism to catch users who write tests relying on undefined behavior, such as order of equal elements. Writing tests against these systems is a game changer (once you figure out what’s going on and why there are flakes). I’m a huge fan of this subtle change which has spared me at least one bug. It also allows library authors to gain the advantage of not accidentally producing a behavior that induces Hyrum’s law. Kudos to the team for making this work at scale.
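
The mechanism described in the post is built into the sort itself, but the gist can be approximated with a test-only wrapper along these lines (a hypothetical sketch, not the actual implementation):

    // Debug/test-only sort wrapper (hypothetical): shuffling first makes the
    // relative order of equal elements random from run to run, so any test that
    // silently depends on that order becomes flaky and gets fixed.
    #include <algorithm>
    #include <random>
    #include <vector>

    template <typename T, typename Less>
    void randomized_sort_for_tests(std::vector<T>& v, Less less) {
        static std::mt19937 rng{std::random_device{}()};
        std::shuffle(v.begin(), v.end(), rng);
        std::sort(v.begin(), v.end(), less);
    }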


Depending on ASLR for randomization is ironically an extreme case of Hyrum's law.


> If you compare by a boolean variable, for example, partition data by existence of something, it’s very tempting to write std::sort call.

> We saw a pattern of sort+resize a lot.

These seem like good opportunities for clang-tidy checks.
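
For reference, these are the kinds of cheaper alternatives such a check might suggest (my own sketch, not actual clang-tidy output or code from the post):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void cheaper_alternatives(std::vector<int>& v, std::size_t k) {
        // Sorting by a boolean key is really just a partition:
        //   std::sort(v.begin(), v.end(), [](int a, int b) { return (a % 2) < (b % 2); });
        std::stable_partition(v.begin(), v.end(), [](int x) { return x % 2 == 0; });

        // sort + resize to keep the k smallest elements (assumes k <= v.size()):
        //   std::sort(v.begin(), v.end()); v.resize(k);
        std::nth_element(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(k), v.end());  // O(n) on average
        v.resize(k);
        // Follow with std::sort(v.begin(), v.end()) only if the k survivors must
        // themselves be ordered, or use std::partial_sort in the first place.
    }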


offtopic: Such a great engineer and yet their blog says: Website Powered by WordPress.com.

I find the pragmatism worthy of praise. I've been meaning to start blogging since forever but I keep planning and researching about blog platforms/tools and never get to the point.


Making use of existing tools and being able to manage their shortcomings is an important engineering skill.


WP is a never-ending source of security vulnerabilities. 30% of the internet runs on it.

I found Jekyll and never looked back.


That’s why you use managed Wordpress, aka wordpress.com.


The paradox of choice. I suffer from this way too often.

https://en.wikipedia.org/wiki/The_Paradox_of_Choice


[flagged]


In case you genuinely missed the point: he was saying that the person is a great engineer for not wasting their time optimising irrelevant details like their blog. Time allocation is an optimisation problem like any other.


Did one person design the SR-71...?


yes


He was the head of the engineering team. He didn't go into a cave and come out with completed designs...


Who gets credit for a really good movie?


…the people who made it? I know none of us sit through the credits, but I can assure you they do typically carry on rolling for quite some time.


Great fun read. I love this stuff; when I first learned C++ many years ago I didn't really know much about complexity and what I have learned over the years is that complexity costs depend a lot on the underlying data distributions.


I'm confused, so the comparison function can modify the list? I thought most sorting algorithms assumed the contents of the list was constant. Why would I want my comparator to ever modify the list under sort?


If you examine the modification more closely, it is just finding the list that would exhibit worst case performance - once created, that list would always exhibit the worst case as a constant list.

Once you have the resulting list, the indices are sorted; assigning each num_solid value to its index then gives you a fixed (constant) list, which can then be sorted normally to reproduce the worst case behavior.

I wish the article was more clear on this.


You don't, it's just not disallowed :)


"because it prevents the user from writing a comparison function that modifies the list while sorting" is not a reason I would expect to prefer rust, and yet here we are.


It's possible in Rust too, but only for a subset of element types: those that support shared mutation (also known as "interior mutability"). For example, a type containing an `AtomicU32` or a `Mutex<T>` or a `Cell<T>` could have an `Ord::cmp` implementation that alters its value, and the compiler will not prevent this.


Yes, I had a nasty bug in an initial version of glidesort where I (naively) assumed that I could create a temporary copy of an element by doing a memcpy and call the comparison function on this temporary copy a bunch of times before throwing it away, since I assumed the comparison function would not mutate the copy. Unfortunately interior mutability means this would lead to potential double drops, and it was just not sound.


The comparison operator is not actually mutating the list, which would not be allowed in C++ either. Instead it is dynamically generating the < comparison for the elements in the list based on the order in which they are evaluated. The order for two elements remains fixed once examined, and in the end it always ends up as a valid totally ordered comparison operator. The only thing required for this to work is that the comparison operator can have mutable state and that the same shared state is used for all comparisons in the sorting algorithm (aside from thread-safety, so you'd need adjustments for a parallel sort).
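
To make that concrete, here is a much-simplified C++ sketch (not the adversarial construction from the article): a comparator with shared mutable state that assigns each element a rank the first time it is examined, so the order is decided lazily during the sort but stays consistent.

    // Lazily-decided but consistent ordering via shared mutable comparator state
    // (simplified illustration only).
    #include <algorithm>
    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    int main() {
        std::vector<int> ids(1000);
        for (std::size_t i = 0; i < ids.size(); ++i) ids[i] = static_cast<int>(i);

        std::unordered_map<int, int> rank;  // shared state across all comparisons
        auto lazy_less = [&rank](int a, int b) {
            // Assign ranks in the order elements are first examined. Once
            // assigned, a rank never changes, so every answer is consistent
            // with one fixed total order and std::sort behaves correctly.
            if (!rank.count(a)) rank.emplace(a, static_cast<int>(rank.size()));
            if (!rank.count(b)) rank.emplace(b, static_cast<int>(rank.size()));
            return rank[a] < rank[b];
        };
        std::sort(ids.begin(), ids.end(), lazy_less);
    }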


Technically the list is not being modified but the comparison operator is dynamically determining the order of elements when the sorting algorithm checks them. This is done in a way so that the order of pairs that have already been checked previously remains fixed and it results in a valid totally ordered comparison operator in the end - but the order it ends up with depends on the sorting algorithm.


An "accessed at" field is a trivial use case.


Thanks, I'm going to write all this out on a whiteboard the next time someone asks me to sort a linked list in a technical interview.

Edit: just wanted to clarify that the point I'm trying to make here is that if you want to ask someone if they've memorized the answer to that question, go ahead and ask it in a technical interview, because the real well-considered answer looks like the linked article, and my honest non-smartass answer is that one should use the library function and not try to implement it themselves until they've done sufficient research into the problem to prove to a reasonable degree that they indeed do need to re-implement sort() and that it's in the best interests of the business to fund such a project.

Would love to read this over the phone to a recruiter and have them tell me that it's not what they have on their sheet.


If you need to write a sort from scratch it's because you're in some weird restricted environment without a standard library. Not because you need to get a few percent better performance sorting some gargantuan data set.

Being able to put together a sort from compares and swaps (or inserts, or whatever you have available) is at least evidence that you won't either panic or storm off when the tools you have are something less than ideal.


I'd wager that 99% of developers are not without a stdlib and yet it seems they get ~100% of the crazy algo interviews.

> [...] is at least evidence that you won't either panic or storm off when the tools you have are something less than ideal.

Maybe. Maybe not? Most of the time our tools are far from ideal anyway. Most of the time the important factors are a direct consequence of management/leadership/team. (Deadlines, feature creep, crazy engineering practices - eg. hand rolling sort, crazy interview practices - 8 hours of hardcore interview or whatnot, and so on.)


In my job:

* I always have access to a decent stdlib and would not need to implement sort itself

* But I do need to design and implement bespoke algorithms

The algorithms are nothing groundbreaking - they're things like how to grow and shrink a buffer when reading from a socket. But I'd struggle if I didn't know a few examples of different algorithms and how to think about their trade offs, and basic sorting algorithms (not like this article!) are ideal for that. Maybe I'm doing unusual work but it doesn't really feel like it.

Edit: To put it another way: in a past job, we had some unfortunate examples of people being hired whose ability to work with algorithms was sadly not up to scratch, and it had a serious practical impact on their ability to do their job. If they had been asked a few questions about sorting algorithms in their interview then that could have been avoided.


Then would a better interview technique be something like - here is a spec for an algorithm. Now implement it. I see very little value in having people studying algorithm implementations for interviews.

Of course, knowing about specific algorithms is handy, and adjusting the question to find an appropriate spec to implement for a given problem could be a nice touch.

I think that either approach above is strictly better than implementing algorithms from memory and would produce a far better signal.


I think algorithmization is a must for programming. (Like you said, transforming/filtering data, results, structures, etc.) No questions about that. But grilling candidates for hours about it seems "a bit" excessive. (Which is again qualitatively different from asking someone to write a few lines of pseudocode/python/whatever-they-want to map/reduce something or ask them how they would implement a calculator, a few pointed questions about knowing the basics, etc.)


Also, in such an environment (real world, not interview) you almost always have constraints on your data and/or requirements that allow you to write something faster or smaller (depending on your needs) that can beat the standard implementation because it doesn't have to handle corner cases (or, as this post shows, some major cases as well).


I once had occasion to implement a sort algorithm in anger, on a platform _with_ a standard library no less!

The project was written in Smalltalk, running on a very old implementation (last updated in 1999 I think), and we had a problem where clicking on column headers in the UI to sort would be very slow if most of the values were equal. The problem was that the standard library array sort was quicksort (which also meant that sorting on several columns broke, since quicksort isn't stable), but since Smalltalk makes these things possible when you need to, replacing the stdlib quicksort with mergesort was actually a fairly simple operation.


"You have a weird bespoke data structure" is an example of an "environment" where you might not have a sort that works for your data in your standard library. Of course, if you're in that situation, you're not using a weird bespoke data structure for a mere few percent better performance.


I'm starting to question my understanding of algorithmic complexity, the author says things like "complexity being worst case O(n \log n)" and seems to talk about big O notation used for average cases as well.

I've learned that big O notation is always and exclusively about the worst case; for average and best cases we use Theta (Θ) and Omega (Ω), respectively.

Do people not follow these definitions strictly? Has the author committed a mistake or am I missing something?

I did once have an interviewer say that my complexity analysis was wrong because the worst case would be extremely rare to hit. I tried to explain nicely that big O notation is always about the worst case, and that they were thinking about big Theta instead; now I wonder if I was wrong.


O, Theta, and Omega are independent from best, worst, and average case. You can mathematically establish upper, lower, or exact bounds on any of the average, best, or worst cases.

It is perfectly valid to say "the best case is O(n)" which means the best case scales no worse than linearly. The "no worse" here is not describing the best case, but rather the strictness of your bound. You could say a sort algorithm is O(n!) since, yes, it does scale no worse than n! but it's not particularly helpful information.

Big O notation is used imprecisely frequently (see also people saying a dataset is O(millions) to describe scale).


Big O, Theta, and Omega notations all ultimately denote sets of functions. So e.g. O(n) is the set of all functions which are eventually overtaken by some linear function.

When we say "X is O(f(n))" what we really mean is "the function that characterizes X is contained within the set of functions denoted by O(f(n))." Anything that can be described by a function hence can be given Big O, Theta, or Omega notation. So yes worst case time complexity could be described in this notation, but so could average time complexity.

But we could also describe things that have nothing to do with time or space complexity with big O notation as well. As long as we've turned it into a function, we can describe it with big O notation.


I think you are mistaken here. The concept of worst/best/amortized case is different from O/Theta/Omega. The latter is about the direction with which it is bounding.

Big O/Theta/Omega denotes sets of functions, and each of them can be applied to best/worst/amortized complexity analysis.


I like to use the metaphor of driving somewhere, and trying to estimate when you're going to get there. Maybe you have a hotel reservation, and you want to make sure you don't get there too early nor too late. Then (with some times filled in for example):

  +---------------------------------------+-------------------------------+--------------------------------+----------------------------------------+
  |                   .                   | Best Case (little/no traffic) | Average Case (average traffic) | Worst Case (car accident, snow-in etc) |
  +---------------------------------------+-------------------------------+--------------------------------+----------------------------------------+
  | Max time, upper bound (Big O)         | 30 minutes                    | 45 minutes                     | 1 hour 20 minutes                      |
  | Both min and max, tight bound (Theta) | 25 minutes                    | 35 minutes                     | 1 hour                                 |
  | Minimum time, lower bound (Omega)     | 15 minutes                    | 25 minutes                     | 50 minutes                             |
  +---------------------------------------+-------------------------------+--------------------------------+----------------------------------------+
So then you can start asking questions like - if I assume I'm in the average case, and I can show up earliest at 20 minutes from now, and show up latest at 50 minutes from now, can I guarantee that I will make it within that window? The answer in this case is yes, because our range is 25-45 minutes, so the earliest I can show up is in 25 minutes and the latest in 45 minutes.

This also helps explain why Big O is generally more useful - we can almost always be early to something, but we can almost never be late to something. So showing up 5 minutes early usually is not a problem.

It also shows why Theta is better than both - in this case, if we had this table we could completely ignore both the Big O row and Omega row, because Theta is a better bound for both min and max time.

Of course, in algorithmic complexity we replace the times with functions in terms of some variables, usually just n.


Your example is nice, but extremely informal, as these are all constant numbers, while O/Theta/Omega (and o, omega, theta) are about functions.

I think the most important thing that needs to be clarified though is that algorithmic complexity is first and foremost a function, `WorseCaseComplexity(size of input) = max number of steps for any input of that size`. This function is often then classed into a "complexity class" using O/Theta/Omega etc, but that is only done for the sake of easier comparison of complexities between algorithms.


>It also shows why Theta is better than both

For some algorithms there is no Big Theta bound at all on the overall running time. Take bubble sort, for example: it's O(n^2) and Ω(n), but there is no single function that asymptotically bounds it both above and below. In fact it is better if there isn't one, as it means fast cases exist. For bubble sort, the fast cases are ones where the data is almost sorted and only a couple of scans of the array are required. If you know this is true of your input data, you know bubble sort is more appropriate than something like merge sort, which is Ω(n log(n)). Now, the O(n^2) worst case can still be significant, so you may want to consider an alternate algorithm that doesn't break down as badly.

As the other reply pointed out we are talking about functions that bound some other function (in this context the amount of steps / operations / time an algorithm will take) when looking at how it behaves asymptotically.


Note that "the amount of steps / operations / time an algorithm will take" is not in general a mathematical function, as for the same input size, different inputs will give different numbers of steps. For bubble sort for example there doesn't exist a function f(n) = number of steps bubble sort will take for an input of size n. There does exist a function best_case_complexity(n) = number of steps bubble sort will take for the best input of size n; and a function worst_case_complexity(n) = number of steps bubble sort will take for the worst input of size n.

Then, we can say best_case_complexity(n) = Theta(n), and worst_case_complexity(n) = Theta(n^2). We can also define the set of all functions representing the number of steps bubble sort will take for input of a certain "shape" of size n. We can then indeed say that any function in this set will be <= worst_case_complexity(n) and >= best_case_complexity(n), so this set will indeed be a subset of Omega(n) and of O(n^2).


But what if I have the same table, but instead of time it lists the number of turns the car needs to take? That's not always useful.


In school, we used big O notation for all cases. I've never heard about theta or omega notation.

Wikipedia's bubble sort page -- as an example -- also uses it for all cases.

https://en.wikipedia.org/wiki/Bubble_sort



You are mixing two things together. The "average/expected case" and the "worst case" are two different functions, each with their own O, Θ, Ω.


Theta(X) is not "average case" behaviour, it means "both Omega(X) and O(X)".

Ex: merge sort is Theta(n log(n)), but quick sort is not.

The author is likely saying "worst case O(n log(n))" as redundancy to emphasize that this is worst case behaviour.


It's not redundant. You can have "average case O(n log(n))" but "worst case O(n^2)", like quicksort does. Depending on the pivot function, that worst case can be more or less common.


It is not redundant to say "average case O(n log(n))". It's not formally a correct usage of O(). But colloquially, it means that the average number of operations over all possible inputs of length n is X.

It is redundant to say "worst case O(n^2)". By definition, O(X) means that the maximum of the number of operations over all possible inputs of length n is at most X.


You are incorrect about this. Worst case and Big O are independent of one another. The O, Ω, Θ, etc... are about describing bounds where O(f) is used for the set of asymptotic upper bounds on f, Ω(f) is the set of asymptotic lower bounds on f, and Θ(f) is the intersection of O(f) and Ω(f).

Typically f is used to describe space or time complexity of an algorithm, but it can be used for any function that maps positive integers to the non-negative real numbers.

You can make any combination of best case/worst case/average case with O, Ω, Θ. In fact, you can even consider other cases as well such as amortized cases, 90th percentile cases, any case you want to investigate that can be defined as a function from positive integers to non-negative real numbers can be classified as having an asymptotic upper bound, lower bound, or tight bound.


Yeah, okay. You are correct here.

"Average-case of algorithm X" can be a function function (average number of operations for all possible inputs of length n). And asymptotic bounds can be applied to arbitrary functions from R -> R.


> It is not redundant to say "average case O(n log(n))". It's not formally a correct usage of O().

You are wrong. It is perfectly formally correct to say "average case complexity is 2n log(100n) + 78n + 90, which is O(n log (n))".

> By definition, O(X) means that the maximum of the number of operations over all possible inputs of length n is at most X.

No, that is, in fact, a colloquial use of O(X). O(X), formally, is simply the set of all functions that are bounded above by X up to some constant factor. It has nothing to do with the number of operations in itself.

Instead, we formally define complexity functions for an algorithm, which are the minimum/average/maximum number of operations the algorithm will execute over all inputs of size n; call these functions Cmin(n), Cavg(n), Cmax(n). Then, we can say that C_(n) is in O(X(n)) (or, as a shorthand, C_(n) = O(X(n))) if there exist numbers k and n0 such that C_(n) < k * X(n) for any n > n0.

So, we can say that the best case complexity of an algorithm, Cmin(n), is Omega(n^2); this means that even for the best possible input of length n, it will execute at least k * n^2 steps, for some k. Or, we can say that the worst case complexity of some algorithm, Cmax(n), is O(n^2); that is, that even for the worst case input of length n, it will execute at most k * n^2 steps, for some k.

In fact, the concept "number of steps that an algorithm will execute for any input of size n" is not a mathematical function at all (unlike min/max/avg number of steps), since there isn't, in general, a unique such number for any given size.

So, I would argue that saying "complexity O(n^2)" is in fact ill-defined formally, and that, if you want to be formal, you must specify "worst case complexity O(n^2)".


I think you're right on the formalism, and I shouldn't have brought that in. I've been out of grad school a while and should clearly brush up on my theory.

My main line of reasoning is:

1. If I say "this algorithm time-complexity is O(X)", I mean that all possible cases are in O(X).

2. If all possible cases are in O(X), then the worst case is in O(X).

3. If the worst case is in O(X), then any other case will also be in O(X). This follows from our definition of "worst case".

4. Therefore, "all possible cases are in O(X)" <=> "worst case is in O(X)"

5. Therefore, "this algorithm time-complexity is O(X)" <=> "this algorithm time-complexity is worst-case O(X)"

If your thought is that 1) isn't a formal definition, fair enough. It may have been a convention we adopted within the department. But aside from that, I think the reasoning is solid.


I would argue it isn't perfectly formally correct, because "average case complexity" isn't well defined. Is average here mean, median, or mode? It is typically used as if it meant 90th percentile, or all but a few degenerate cases.

But that is a bit nit picky.


I'm not sure I've ever seen anyone argue over which measure of central tendency to use to assess the "average case". You certainly could, but it doesn't strike me as a particularly productive argument; in common parlance, an average is essentially always either an arithmetic mean or a non-rigorous informal typical case unless otherwise specified; it also normally doesn't matter very much since most algorithms of interest do have their complexity distributed across inputs s.t. as input size goes to infinity one regime describes the behavior on almost all inputs. It's surprisingly hard to even come up with one without just having an approximately constant fraction of inputs just be trivial cases.

The much more important problem with "average case" complexity, particularly infamously for sorts, is just that practical "typical" inputs aren't distributed evenly over the space of possible inputs; pathological inputs have, after all, practically by definition, a nontrivial almost-computable property that informally gives them more of a reason to exist than the average input.


The claim was "perfectly formally correct". If the average is the arithmetic mean over all possible inputs, then average case O is worst case O because the worst case term dominates. Therefore they are using an informal definition of average, which doesn't jibe with "formally correct".


> then average case O is worst case O because the worst case term dominates

No, you're overgeneralizing. It's true that the worst case term dominates if e.g. a constant fraction of inputs exhibit the worst-case behavior, so that average complexity is lower bounded by a constant fraction of the worst-case behavior.

This isn't true for the kind of algorithms where it's customary to talk about average case behavior. We only normally talk about average cases when as input size increases, the fraction of inputs with the higher complexity bound shrinks faster. For example, we can imagine an algorithm which, on inputs of size n, requires n^2 operations on 1/n of possible inputs, and n operations on the other (n-1)/n inputs. Then the average runtime is, formally, (n^2)*(1/n)+n*(n-1)/n=2n-1, which is still O(n), even though the worst-case runtime is still O(n^2).

Where this goes wrong is, e.g., perhaps the O(n^2) runtime occurs on "trivial" inputs which are disproportionately likely to appear. For example, if the above hypothetical function is idempotent and runs in O(n^2) on all of its possible outputs, but O(n) on all other inputs, it'd be easy to go accidentally quadratic by calling it on its result repeatedly under the assumption that it's safe to do so.


There are variations, that's what we learned back at Uni. Expected O(f(n)), average O(f(n)), amortised O(f(n)), even expected amortised O(f(n)). "Expected" implied a randomised algorithm.


> "Expected" implied a randomised algorithm.

Not necessarily. It can be the "expected" number of steps over the range of unknown inputs even for a deterministic algorithm.


The other replies are great, but just to put it concisely: big O is an upper bound ("worst case") in terms of input size, while "best case big O" means "best case" in terms of the input contents ("shape").


Worst case, median case, and best case complexity have nothing to do with O/Θ/Ω.

When analyzing an algorithm, you want to compute the number of steps it has to execute in certain cases as a function of its input size(s). That function will be something like 2n^2 + 67n + 189 steps. Since this is too much information, you can then express it instead as O(n^2) (or O(n^100)) or Ω(n^2) (or Ω(25n) ) or Θ(n^2).

You can then also compute the best case running time of the algorithm as a function of the size of the input, and get 2n^2 + 67n. Then you can say that the best case running time is also O(n^2).

Finally, you can (sometimes) define an "average" running time.

Now, what do we mean by "shortest running time"? Is that not always for n=0 or something? No - the best case running time has to do with the running time in a case where the input has a certain shape. For some sorting algorithms for example, if the input is already sorted, they will only go through the list once and do nothing, so they have a best case running time that is O(n). Other sorting algorithms will still perform steps even if the input is already sorted, so even their best case running time may be O(n log(n)).

Now, you can say that, given any possible input, the running time of the algorithm is O(worse case running time) and Ω(best case running time). You definitely CAN'T say in general that it is Θ(average case running time) though, as in will sometimes run in time that is shorter than "average case running time" (for best case scenarios); or sometimes it will run in time that is longer than "average case running time" (for worse case scenarios).

Basically, the worst/best/average case running times are an analysis of how properties other than the size of the input affect its running time.

Big-O notation and its friends are not specific to complexity analysis; they are a general notation for the asymptotic behavior of mathematical functions. Big-O is usually understood as an upper bound on a function, big-Ω is a lower bound, and Θ is a tight bound (the function f is by definition always within a constant factor above or below any function in Θ(f)).

Edit: let's take an example. Say we have the following algorithm:

  isAllOdd(array x):
    if x is empty:
      return true
    if x[0] is even:
      return false
    return isAllOdd(x[1:])
In the best case, for an array of any size whose first element is even, this function finishes after executing 4 instructions: 2 comparisons, an array access and then a return. So its best case complexity is best(n) = 4, which is O(1), Θ(1), Ω(1).

In the worst case, all elements of the array are odd, so we have to go through all n elements to find that out. It will have to execute 6 instructions for each element in the array (2 comparisons, 2 array accesses, a call and a return), plus 2 instructions for the final call on the empty array (check size then return). So its worst case complexity function is worst(n) = 6n + 2, which is O(n), Θ(n), Ω(n).

In the average case, each element we examine is odd with probability 1/2 (a fair coin flip), so on average we only look at about 2 elements before an even one stops the recursion. So, our average case complexity is a constant number of steps (roughly 6*1 + 4 = 10), which is O(1), Θ(1), Ω(1). (I am ignoring cases where n < 2, which is anyway irrelevant for asymptotic behavior).

Now, what is the overall complexity of isAllOdd? We can generalize the average case calculation if we want to get an exact value; but whatever that function, overall(n), is, we know for sure best(n) < overall(n) < worst(n); so overall(n) is O(n) and Ω(1). However, we can't define any Θ for it - there is no function which it will differ from by at most a constant factor for any n.

Why do the detailed analysis instead of saying this is O(n) and being done with it? Well, perhaps we know that we are in fact calling this function with a large number of even numbers, so we would actually over-estimate its complexity. If we instead know we are calling it with a large number of odd numbers, we would also find out from the average case analysis that it would be better to define a reverse function, isAllEven, and compute isAllOdd = !isAllEven.


I always used big O for worst case (by default). I did think that was common.

Theta and Omega I use differently however.

I sometimes say, best case if X then it's O(n) or whatever.


You should look into amortized complexity. That's the formal name for bounding the average cost per operation over a worst-case sequence of operations.


I remember from college there was big-Oh `O(x)` and little-Oh `o(x)`. Is Omega/Theta the same as little-Oh?


No, big O is a kind of loose upper bound, while little-O is a very strict upper bound.

  f(x) = O(g(x)) <=> there exists k, x0 such that f(x) < k * g(x) for any x > x0

  f(x) = o(g(x)) <=> for any k > 0, there exists x0 s.t. f(x) < k * g(x) for any x > x0
For example:

  f(n) = 3*n^2 + 2n + 1

  f(n) = O(n^2)? 
     let's take k = 10
     3*n^2 + 2n + 1 < 10 * n^2 for any n > n0?
     n = 0: 3 * 0^2 + 2*0 + 1 < 10 * 0^2 <=> 1 < 0 false
     n = 1: 3 * 1^2 + 2*1 + 1 < 10 * 1^2 <=> 6 < 10 true
     n = 2: 3 * 2^2 + 2*2 + 1 < 10 * 2^2 <=> 17 < 40 true
     ...
     so, yes, f(n) = O(n^2) (n0 = 1)

  f(n) = o(n^2) ? 
    let's take k = 2.
    
     3*n^2 + 2n + 1 < 2 * n^2 for any n > n0?

     3 * n^2 + 2 * n + 1 < 2 * n^2      <=>
     3 * n^2 + 2 * n + 1 - 2 * n^2 < 0  <=>
     n^2 + 2 * n + 1 < 0 false for any n > 0.

     so no, f(n) != o(n^2). However, f(n) = o(n^3), and also f(n) = o(n^2 log n).

Big-Omega and small-Omega are the same idea for a lower bound.

Big-Theta is a tight bound around the function:

  f(n) = Theta(g(n)) <=> there exists k1, k2, n0 such that k1*g(n) < f(n) < k2*g(n), for any n > n0.
There is no equivalent "little-theta", as that would imply that the relation must hold for any k1 and k2, which would at best leave only the function itself as fulfilling the condition.


Computer scientists stole some notation from mathematicians, but then slightly changed and substantially slackened the definition...


No, not really. CS and other branches of maths are using the exact same definitions of O, o, Θ, and ω. It is true that Ω is used with a different definition in CS, but it's stricter, not more slack.

As explained elsewhere though, they are not notations for algorithmic complexity, they are notations for function categorization. It just happens that a common use for this notation in CS is to categorize the function "best/average/worse/... case complexity of an algorithm", but they are also used for other purposes.


In my experience, a lot of the time when programmers say big O, they actually mean big Theta...


Programmers need not be very good computer scientists. The actual scientists publishing papers on algorithms usually get it right.


Yes, I agree with that. However, f(n) = Theta(n) implies f(n) = O(n), so they are usually not technically wrong, just unnecessarily broad.


In my opinion, nothing about big-O and related omega, theta, etc, is at all strict.

Maybe I value "pure symbolic clarity" too much. I do see a very pragmatic usefulness to the complexity notation.


An interesting takeaway here is that custom comparators are really hard to write correctly. There are a lot of constraints that aren’t obvious.

I suspect there should be a new std::verify_weak_ordering to help people write tests that perform n^3 tests on potential data to ensure a comparator is safe to use.
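
Something along these lines is what I have in mind - to be clear, verify_weak_ordering is a made-up name, not an existing standard facility, and this is just a brute-force sketch:

    #include <vector>

    // Hypothetical helper (not in the standard library): brute-force check that
    // `comp` behaves as a strict weak ordering on the sampled values, O(n^3).
    template <typename T, typename Compare>
    bool verify_weak_ordering(const std::vector<T>& v, Compare comp) {
        auto equiv = [&](const T& a, const T& b) {
            return !comp(a, b) && !comp(b, a);
        };
        for (const T& a : v) {
            if (comp(a, a)) return false;                    // irreflexivity
            for (const T& b : v) {
                if (comp(a, b) && comp(b, a)) return false;  // asymmetry
                for (const T& c : v) {
                    // transitivity of the ordering itself...
                    if (comp(a, b) && comp(b, c) && !comp(a, c)) return false;
                    // ...and of the induced equivalence ("neither is less")
                    if (equiv(a, b) && equiv(b, c) && !equiv(a, c)) return false;
                }
            }
        }
        return true;
    }

For example, a comparator like [](float a, float b) { return a <= b; } fails the irreflexivity check on any non-empty sample, since a <= a is true.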


MSVC's and GCC's standard library implementations both have a debug mode which has been checking that for almost two decades.


Are you sure about that? I just ran an MSVC test that calls std::sort on a std::vector<float> containing a variety of values including infinity, negative infinity, and NaN. I compiled and ran under debug mode. There were no errors or exceptions. The end result was extremely incorrect.

Maybe there's some extra flag to turn on even more tests. I've always found MSVC STL to be completely impossible to use in debug mode for any real project.


I'll double check - maybe the checking code only gets called when a custom comparator is provided, not the default < which would be used there.


After double-checking:

    std::sort(<some floats with NaNs>); 
indeed does not assert; what does assert (with a nice "invalid comparator" message in MSVC) is

    std::sort(..., [] (float a, float b) { return a <= b; });
    std::sort(..., [] (float a, float b) { return true; }); 
or anything else that is not a correct < implementation.

I'd assume they don't want to introduce a check for the FP case as it would break pretty much all the code on earth which assumes that NaNs and Infs do not exist.
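
For what it's worth, if a range really may contain NaNs, one option (my sketch, not something from the article or the MSVC STL) is a comparator that gives NaNs a consistent place, e.g. pushing them all to the back:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    void sort_nan_last(std::vector<float>& v) {
        std::sort(v.begin(), v.end(), [](float a, float b) {
            if (std::isnan(a)) return false;  // a NaN is never "less than" anything
            if (std::isnan(b)) return true;   // every number is "less than" a NaN
            return a < b;                     // ordinary comparison otherwise
        });
    }

This is still a valid strict weak ordering, because all NaNs land in a single equivalence class at the end (assuming no fast-math-style flags that let the compiler discard the isnan checks). C++20's std::strong_order also provides a ready-made total order over floating-point values.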


> I'd assume they don't want to introduce a check for the FP case as it would break pretty much all the code on earth which assumes that NaNs and Infs do not exist.

There are even compile flags that let the compiler assume that as well - some of them as benign sounding as -ffast-math or -Ofast.


or the aptly-named -funsafe-math-optimizations... fun and safe math optimizations, what could go wrong :^p


Neat, TIL. Thanks!


wow. that was really impressive. scale is a different kind of monster. it is like an end game boss. Thank you for sharing it. I learned a lot.


I never really understood why a standard library would insist on having a single "sort" function (or at best, a "sort" and "stable sort"). Why not just provide explicit quick sort, merge sort etc. functions and let users choose which they want? Sorting is one of the first things a budding computer scientist learns, and the research on sorting algorithms is well-trodden ground by now, so I don't see how it could be a usability issue. I'd like to be able to choose from these criteria without having to rely on external libraries:

- How does it perform on average?

- How does it perform on pathological inputs?

- Is it stable?

- Does it allocate?

Not to mention constraints that an application can take advantage of:

- What if the data is already mostly sorted? (insertion sort)

- What if writes are very expensive? (cycle sort)

- What if all the data are integers? (radix sort)

And so on. Trying to force a one-size-fits-all solution seems rather silly at this point.


I think only having "sort" and "stable sort" makes sense for the same reason there's only one hashmap or one vector in most standard libraries: 99% of the time you just want a high-quality implementation of one reasonable choice, without getting into the weeds of it.

Most of the time I don't care if it's timsort or quicksort, or if the hashmap uses Cuckoo hashing or buckets made of linked lists. Of course there are the 1% of cases where I notice that my sort or hashmap is a performance bottleneck and that a different algorithm might perform better. But standard libraries aren't good at serving that need because they have to be broadly applicable and conservative, if I already know that I want top performance for specific requirements I'm almost certainly better served by an external library.


There are such things as sensible defaults though.

Timsort being a somewhat famous example, which uses the fact that Python data structures have a known length to figure out the best sorting method for the size.

If you need something more exotic then you likely know more about what you're doing, but most people writing software who need a sort function usually don't need more than timsort.


> uses the fact that Python data structures have a known length to figure out the best sorting method for the size.

That... does not match my knowledge of how timsort works. The sorting method is chosen based on the array length, and every sorting implementation in every language knows the length of its input. It's extremely suboptimal for some types because Python's typing is so weak!

Timsort succeeds because it has many good heuristics to optimize common properties of its merge intervals.


You are thinking about the weeds of sorting now, but that is rarely relevant to the problem at hand.

You can (and most people do) look at it from the other angle. Sort(seq) is a function which returns a permutation of seq such that every element is less-than-or-equal-to the next element. Usually, we use this function because we need this property to hold in the code that follows. A lot of the time, this function is called for very short sequences, where even the most naive implementation would be fast enough for even the highest-scale program. So, in the vast majority of code we only care about the semantics of std::sort; and all of the above have the same semantics.

Indeed, the fact that we do often get a stable sort as well shows that semantics are the most important characteristic, as stable and unstable sorts do have slightly different semantics.
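
As a made-up illustration of that semantic difference (the type and names are mine, not from the thread):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Person {
        std::string name;
        int age;
    };

    void sort_by_age(std::vector<Person>& people) {
        auto by_age = [](const Person& a, const Person& b) { return a.age < b.age; };
        // stable_sort: people with equal ages keep their original relative order.
        std::stable_sort(people.begin(), people.end(), by_age);
        // sort: people with equal ages may end up in any relative order.
        // std::sort(people.begin(), people.end(), by_age);
    }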


There are defaults for all sorts of functions. find() functions on an array in many languages use some default. Parameters often have defaults when not set. Strings often have a default encoding.

The vast majority of the time, the default works well enough. Why make things more complicated than they need to be?

It also has the added benefit of helping more junior engineers when reading the code. "Cycle sort? What is that doing?" "Insertion sort? I was taught that was inefficient! I should spend time updating this code to a better sorting algorithm! My tech lead would love my initiative."

Or we just use sort(), and we all move on with our lives.


while i agree with you to some extent, i don't think including a very large set of sorting functions in the standard library is a good idea. especially for dynamic languages, a larger standard library means more disk usage and worse cache utilization. it's also typically relatively easy to use your own sort function: in performance-critical cases, the sort function is usually called directly, not via intermediate libraries (which might need to be patched to call the new sort function); also, sort functions are self-contained, i.e. typically don't require system dependencies.


> A larger standard library means more disk usage and worse cache utilization

Does it? How? I suppose one has to store the standard library on the machine, which does notionally use up disk space, but I can’t see how it contributes by any reasonable performance-relevant meaning of ‘disk usage’ (presumably i.e. disk I/O during the runtime of the program). And for cache utilization I simply have no clue at all what you might be driving at.


I think this goes back to the whole idea of non-CS people being the primary users of standard libraries.

I agree though, there should be the ability to specify which one, and every function in the standard library should have big O notation in the function description. Still makes sense to have a general-purpose good one as the default.


Seems like a mostly pointless change, and it will break some software. Most developers never read the docs. There is always a question about how much rope you should give the users, but in these cases my experience tells me that even if the docs clearly state something, users will rely on real-world results, not the docs. Especially if the development is test-driven or if testing is done mainly with automated tests. I would expect better decisions from the Go team. This is a poor decision, not in line with the promises of Go. What would it have taken not to make a breaking change? Add something new and mark the old one obsolete?

Damn cool algorithm though.


Sorry if this seems pedantic in the grand scheme of the topic, but is the indentation in the code samples messed up, or is there some kind of logic to it? I've never seen anything like that, and I find it pretty difficult to read.


Sorry, it definitely was a little inconsistent, as I stored all the snippets without proper formatting and at the same time tried to align the text for phones, etc.

Failed at everything. Will improve


Looks normal to me (except it seems like some snippets indent with 2 spaces while others indent with 4).

The line spacing is weird though.


I'd like to hear more about the reinforcement learning part. The patch literally just says 'Reinforcement Learning'. Not very helpful!


Quite a long, interesting read. Unless I missed it, I recall that there are optimal solutions for sorting up to 7?? items. I can't find the reference, but unsurprisingly I read it in a dlang forum post by Andrei Alexandrescu. Anyone have a link to that aspect of sorting?


Under the section heading "Reinforcement learning for small sorts", there are diagrams of the small sorts, for 3, 4 and 5 items.

They also explain that down here you're at the mercy of the physical hardware. Emitting one fewer x86 instruction does not necessarily make your program faster; still, if it's not slower, then smaller programs are good too.
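
To give a flavour of what those small sorts look like, here is my own sketch of a classic 3-element network (not the exact code the article generates):

    #include <utility>

    // A 3-element sorting network: three fixed compare-exchange steps.
    // Compilers can often lower each step to branchless min/max or
    // conditional-move instructions, which is why instruction count and
    // instruction choice matter so much at this size.
    inline void sort3(int& a, int& b, int& c) {
        if (b < a) std::swap(a, b);  // compare-exchange (a, b)
        if (c < b) std::swap(b, c);  // compare-exchange (b, c)
        if (b < a) std::swap(a, b);  // compare-exchange (a, b)
    }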


In the first example of problematic calls to nth_element, there is another bug: if the container contains nothing, it will call nth_element with begin(), begin() - 1, ...

Also, typo: "compare functions much comply" -> "compare functions must comply"


The function is trying to get the median, which is not defined for an empty set. With this particular implementation, there is an assert for that:

https://github.com/shogun-toolbox/shogun/blob/9b8d85/src/sho...

Unrelated, but from the same section:

> Fixes are trivial, access the nth element only after the call being made. Be careful.

Wouldn't the proper fix be to do the nth_element for the larger n first (for those cases that don't do that already) and then adjust the end to be begin + larger_n for the second nth_element call? Otherwise the second call will check [begin + larger_n, end) again for no reason at all.
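
Roughly like this for an even-sized input (my own sketch of the suggestion, not code from the article or from shogun):

    #include <algorithm>
    #include <cassert>
    #include <vector>

    double median_of_even(std::vector<double> v) {
        assert(!v.empty() && v.size() % 2 == 0);
        auto upper = v.begin() + v.size() / 2;      // position n/2
        auto lower = v.begin() + v.size() / 2 - 1;  // position n/2 - 1
        // Place the upper-middle element first; everything left of it is <= it.
        std::nth_element(v.begin(), upper, v.end());
        // The lower-middle element must now be in [begin, upper), so the second
        // call only needs to look at that prefix.
        std::nth_element(v.begin(), lower, upper);
        return (*lower + *upper) / 2.0;
    }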


Huh. If I found that Reflection::ListFieldsMayFailOnStripped was a frequent caller of std::sort, to the extent that it was actually costing measurable CPU time, the first thing I would do is find and fix all callers of that.


"Fix" how exactly? I'm not quite sure what you're suggesting.


C++ proto reflection is a convenience. If it actually shows up at scale in production, that smells like a design error.


All things show up at scale in production at Google.


In one sense yes, but in another sense even though Google is large their threshold for something being worth fixing based on efficiency justifications is correspondingly large.


Frequently though, you're dealing with a death by a thousand paper cuts. Perhaps that call comes from some common library whose implementation requires reflection.


If you're measuring in terms of time, electricity, or hardware, then the threshold would tend to be lower for Google, not higher. A 0.1% savings on a million machines is ~1 million times more valuable than a 0.1% savings on a single machine.

OTOH if the measurement is risk of breaking something, then it's not nearly so clear, since larger organizations tend to have more software and hardware, and therefore more opportunities for things to get broken.


You're saying the opposite of what the other guy said, though. They said that at scale small things are big. I'm saying that at scale .001% is still .001%.


It isn't though. A typical developer will never care about 0.1% of unnecessary CPU usage in their application, but at Google, depending on what component it's in, that can easily be a million dollar find.


> In one sense yes, but in another sense even though Google is large their threshold for something being worth fixing based on efficiency justifications is correspondingly large.

not everything is a linear relationship


"It took us around 20 years to find something really appropriate, thanks to Andrei Alexandrescu."

The way this is written, it sounds as if Andrei Alexandrescu is responsible for the 20 year delay.


Do you need to have learned Computer Science to grasp the topic at hand here?



