On mobile at the moment but it will be interesting to see how this compares to G...

orlp · on Dec 2, 2022

Hello, author of glidesort (and for context, pdqsort) here.

No it's not released yet but it's getting real close. The code is virtually done, with some minor cleanup required. Besides some personal stuff, a large source of my delay has been the headache that is panic safety in Rust. Non-trivial sorting code becomes extra difficult if a forced stack unwinding can occur every time you compare two elements, where you are then required to fully restore the input array to a valid state. Doubly so if you can't even assume the comparison operator is valid and obeys the strict order semantics.

I am currently in the process of writing a paper I want to publish along glidesort which is also mostly done. Since the linked talk I've also had some more performance wins making it ~4.5 times faster than std::stable_sort for uniform random integers and much more than that for low cardinality data and input patterns.

Glidesort can use arbitrary amounts of auxiliary memory if necessary, but is fastest when given a fraction of the original input array worth of memory - I'm currently planning on releasing it with n / 8 as the default. I haven't looked at blitsort in detail but I am skeptical of the O(n log n) claim, I believe it is O(n (log n)^2) like glidesort is when given a constant amount of memory, as blitsort's partition and merge functions are recursive. Not that this really matters in practice - ultimately the real runtime is what matters. But I wouldn't be surprised to see blitsort slow down more relative to the competition for larger inputs.

scandum · on Dec 2, 2022

It should be O(n log n) comparisons and technically O(n (log n)^2) moves. The moves are reduced by a relatively large constant however, and blitsort might qualify as O(n log n) moves when given sqrt(n) aux.

In your Youtube presentation you do seem to skip mentioning that many of the performance innovations in glidesort were derived from quadsort and fluxsort. Some credit in your upcoming paper would be much appreciated. Feel free to email me if you have any questions, some things like my first publication of a "branchless" binary search in Aug 2014 may be hard to find, though there might be prior claim.

~4.5 times faster than std::stable_sort for uniform random integers is pretty impressive. Is this primarily from increasing the memory regions from 2 to 4 for parity merges / partitions? I'm benching on somewhat dated hardware and had mixed results (including slowdowns), so I never went further down that rabbit hole.

orlp · on Dec 2, 2022

Hi Igor,

> It should be O(n log n) comparisons and technically O(n (log n)^2) moves.

Yes, the same applies to glidesort. Looking at your repository again more carefully you do indeed only claim O(n log n) comparisons, I was not careful enough.

> blitsort might qualify as O(n log n) moves when given sqrt(n) aux

I have a similar conjecture for glidesort but I'll have to do some thinking to see if I can prove it.

> In your Youtube presentation you do seem to skip mentioning that many of the performance innovations in glidesort were derived from quadsort and fluxsort. Some credit in your upcoming paper would be much appreciated.

Good artists borrow, great artists steal. I do try to give credit where due though, so to clear my name...

...I do cite you in the presentation in the section on ping-pong merges (at 10 minutes in the presentation), and in the upcoming paper I do also cite you for what you call parity merges (something I did not use at the time of the presentation, but I do use now in the small sorting routine), where you don't have to do bounds checks when merging two equal-size arrays from both ends. As you could notice, I didn't have a large time slot for my presentation, so I could not go into more details regarding the branchless nature of merging and partitioning. I'll make sure to mention you in the paper for inspiring the out-of-place branchless partition as well. I do take full credit myself for making the bidirectional branchless out-of-place partition however: https://i.imgur.com/EiVi8Y2.png.

> Feel free to email me if you have any questions, some things like my first publication of a "branchless" binary search in Aug 2014 may be hard to find, though there might be prior claim.

I'll email you for sure, and I have a summary I posted on HN 7 months earlier as well: https://news.ycombinator.com/item?id=31101056. Hopefully this does show my good intent and that I have never tried to cover up the inspiration I took from your work. That said, glidesort does not use a branchless binary search.

> Is this primarily from increasing the memory regions from 2 to 4 for parity merges / partitions? I'm benching on somewhat dated hardware and had mixed results (including slowdowns), so I never went further down that rabbit hole.

I believe it is primarily from having multiple independent interleaved loops, which can use instruction-level parallelism. The speed-up is most significant on Apple M1, less so on my AMD Threadripper machine. I disagree with the term 'parity partition' entirely however, and when I say 'parity merge' I specifically refer to your trick where the optimization where bounds checks can be eliminated entirely when merging two equal-sized arrays. I call my partitioning method bidirectional partitioning. Similarly, how I see it is that every parity merge is a bidirectional merge, but not every bidirectional merge is a parity merge.

def-lkb · on Dec 2, 2022

I wrote a similar sorting algorithm around 14 years ago (fast, inplace, adaptive merge sort algorithm using rotation/swap, and I think, stable). This is all I remember from the time...

I was able to prove (on paper) that the whole algorithm was actually O(n log n). The moves seems to have a bad recursion pattern, but assuming my proof was right, it is possible to have a well behaved implementation in practice. However, the recursion pattern didn't fit the master theorem, I remember having to do a taylor expansion of some formula giving an upper bound on the number of computation steps to prove the O(n log n) bound.

And runtime measurements graph were showing the expected O(n log n) curve, as in your cases. So maybe they are not O(n log n ^ 2) as you feared... I did not keep much information from that time :P.

def-lkb · on Dec 2, 2022

Correction, technically, it uses O(log n) space for the recursion, so it is not stricly-speaking inplace, which would mean O(1) space.

scandum · on Dec 2, 2022

It's possible to implement a 64 item stack so you can call it O(1) and in-place.

But that's such an effort in futility that I refuse to do so. So it's theoretically and practically in-place, but not technically.

def-lkb · on Dec 2, 2022

Of course, this is nitpicking :). For all practical purpose, this O(log n) is O(1).

If you are interested, I can try to recover the proof that the rotation step can be done in O(n), thus allowing to apply the master theorem on the main recursion and getting the O(n log n) result.

scandum · on Dec 2, 2022

I'm not that interested as I'd prefer to count each move and prove it that way, but perhaps Peter (orlp) is interested?

slekker · on Dec 2, 2022

Oh HN, only here you can see these kind of discussions :ˆ) Amazing!

jqr- · on Dec 2, 2022

Right?! It made me feel like a was witnessing a letter exchange between great mathematicians of lore.

scandum · on Dec 2, 2022

Hoi Peter,

>I have a similar conjecture for glidesort but I'll have to do some thinking to see if I can prove it.

It probably is technically O(n (log n)^2) moves, but I suspect that when the rotations are fast enough (trinity / bridge rotations) the cost saving from localization nullify the additional moves.

>I do cite you in the presentation in the section on ping-pong merges

Wikisort does have prior claim to ping-pong merges, and it likely goes further back than that. I did independently implement them in quadsort.

>I do also cite you for what you call parity merges (something I did not use at the time of the presentation

This feels like a tricky claim to me, since I assume your bidirectional merge was fully inspired by my parity merge. If so you weren't the first to implement one, I did so while working on the parity merge, but I figured it was self-evident the 2x gain would apply for a more traditional merge, and I did state the gain was from accessing two memory regions in the quadsort article.

I've since made a quadsort release that uses a galloping 'bidirectional merge', though it still takes advantage of the parity principle, so properly speaking it's a bidirectional galloping parity merge hybrid that sorts 2 blocks of 8 elements at a time, without boundary checks for the 16 merge operations.

>As you could notice, I didn't have a large time slot for my presentation

I did take that into account, I guess it's human nature to get butthurt over stuff like this and I couldn't help myself from whining a little. It is better than bottling it up and I apologize if I caused you unease in turn.

>Hopefully this does show my good intent

It does, and I appreciate it.

>That said, glidesort does not use a branchless binary search.

I assume that's one of the advantages of building upon powersort? The blocky nature of quadsort is both a blessing and a curse in that regard.

>I call my partitioning method bidirectional partitioning.

I meant "parity merge" / "partition". If you think about it quicksort is naturally bidirectional, which is why branchless partitioning works without further work, and undoubtedly why it wasn't self-evident to most. I did try to write a bidirectional partition early on, but it's either my hardware or I messed up somehow.

I agree it's possible to write a bidirectional merge without utilizing the parity principle. One of the trickiest things in programming is properly naming things.

As for instruction-level parallelism, I do think it's actually almost entirely memory-level parallelism. I could be wrong though.

orlp · on Dec 2, 2022

> Hoi Peter,

A nitpick, but my name is Orson Peters, Peters is my last name :)

> I assume that's one of the advantages of building upon powersort? The blocky nature of quadsort is both a blessing and a curse in that regard.

I don't know, I only use a binary search when splitting up merges, and almost no time is spent in this routine. I use this for the low-memory case, as well as to create more parallelism to use for instruction-level parallelism.

> As for instruction-level parallelism, I do think it's actually almost entirely memory-level parallelism. I could be wrong though.

I didn't do specific research into which effect it is, when I say ILP I also mean the memory effects of that.

scandum · on Dec 2, 2022

>A nitpick, but my name is Orson Peters

I actually knew that, not sure what went wrong in my brain.

>I don't know, I only use a binary search when splitting up merges, and almost no time is spent in this routine.

You'll still get a definite speed-up, worth benching the difference just in case. I guess there's no easy way to avoid a binary search.

>I didn't do specific research into which effect it is, when I say ILP I also mean the memory effects of that.

They're indeed related. I did some empirical testing and I'm quite sure it's cache related. Thinking about it, one issue I may have had was not benching sorting 100M elements, which might be where a bidirectional partition might benefit on my system.