Parallel In-Place Merge Sort (drdobbs.com)
61 points by r4um on Oct 1, 2014 | 35 comments



It's really silly that they depend on the STL implementations, because I don't believe STL implementations actually perform their "in-place" merges in place. The actual problem of stable in-place merging is extremely hard, so I was surprised to see it show up in a Dr. Dobbs article.


Certainly g++ does have an in-place merge sort. It's tricky to activate, because it is only used when the attempt to allocate a spare buffer fails, and on most 64-bit machines malloc doesn't fail: your program just gets killed later when it tries to use the memory.
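
The user-visible entry point is std::inplace_merge: the standard requires exactly N-1 comparisons when enough extra memory is available and O(N log N) comparisons otherwise, and in libstdc++ the no-buffer path is the fallback described above. A minimal usage sketch (my own, not from the article):

    // Merging two already-sorted runs with std::inplace_merge.
    // libstdc++ buffers the merge if it can allocate scratch space;
    // if that allocation fails it falls back to its true in-place path.
    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> v{1, 4, 7, 9, 2, 3, 5, 8};   // two sorted runs
        auto mid = v.begin() + 4;                     // boundary between them
        std::inplace_merge(v.begin(), mid, v.end());
        for (int x : v) std::cout << x << ' ';        // 1 2 3 4 5 7 8 9
        std::cout << '\n';
    }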


Are you talking about this?

https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-...

Do you know what the time complexity of their merging algorithm is?

As a side note, an optimal merging algorithm is given in the paper "On Optimal and Efficient in Place Merging", and it's too complicated for me to expect it in a standard library implementation. (Although the STL doesn't need to be optimal, I think even the sub-optimal linear-time algorithms are complicated enough not to be used in a standard library, but correct me if I'm wrong.)


I know the complexity of stable_sort (which uses this) is O(n log n) with extra memory (the fairly obvious extra-buffer version), and the intention is that without extra memory the sort is O(n log^2 n). I assume that algorithm meets that target, but I'll admit I'm not positive it does.


See, that would violate the standard. The standard mandates O(n log n), but I don't think any implementation follows this. (Again, that complexity is indeed possible, but quite challenging and not something I've seen in standard library implementations.)


No, my previous message was a quote from the standard. O(n log^2 n) is fine when no extra memory is available. As you say, this could in principle now be tightened. The standard used to allow an O(n^2) std::sort to permit naive quicksort implementations; that has since been tightened to O(n log n). Here's the relevant quote for stable_sort:

> Complexity: It does at most N log^2(N) (where N == last - first) comparisons; if enough extra memory is available, it is N log(N).


Ohh sorry, I thought you were talking about the merge, not the sort. Yes O(n log n) is allowed for the merge, but what I'm saying is that the Dr. Dobbs article's in-place merge sort is actually not quite an in-place variant of merge sort: its complexity (as far as I can tell) is O(n log^2 n) rather than O(n log n).


A stable in-place merge sort is actually reasonably easy to implement if you happen to be using a linked list - http://www.chiark.greenend.org.uk/~sgtatham/algorithms/lists...

I think it's interesting how choice of data structure affects algorithm characteristics.
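
For anyone who wants the shape of it without clicking through, here's a rough sketch of the idea (my own code, not the linked page's): the merge just relinks nodes, so nothing is copied and no scratch array is allocated.

    #include <cstddef>

    struct Node {
        int value;
        Node* next;
    };

    // Merge two sorted lists by relinking nodes; stable because ties are
    // taken from the left list first. No nodes are copied or allocated.
    static Node* merge(Node* a, Node* b) {
        Node head{0, nullptr};
        Node* tail = &head;
        while (a && b) {
            if (b->value < a->value) { tail->next = b; b = b->next; }
            else                     { tail->next = a; a = a->next; }
            tail = tail->next;
        }
        tail->next = a ? a : b;
        return head.next;
    }

    // Top-down merge sort: split with slow/fast pointers, sort each half,
    // then merge. O(n log n) time, O(log n) recursion depth, no extra nodes.
    Node* merge_sort(Node* head) {
        if (!head || !head->next) return head;
        Node* slow = head;
        Node* fast = head->next;
        while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
        Node* right = slow->next;
        slow->next = nullptr;   // cut the list in two
        return merge(merge_sort(head), merge_sort(right));
    }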


Using a linked list is effectively equivalent to not doing things in place, except that the memory locality is worse and, for word-sized data like integers and floats, the overhead is a permanent 3x instead of just a temporary 2x. But yes, for linked lists, merge sort is a great choice since it doesn't depend on O(1) indexing.


Yeah, exactly. For those who didn't catch it: linked lists already carry O(log n) bits of space overhead per element, because distinguishing between n items requires log n bits for each pointer. They're fundamentally not in-place.


Hmm, I was thinking 'in-place' in the context of 'no extra memory needed'. Is this not correct?

For sure, if it doesn't make sense to use a linked list in the first place then linked list merge sort definitely doesn't make sense...


It's not that it's incorrect, it's that it's pointless. The entire point of a linked list is that you don't move around the data, you just re-link nodes. So any sane algorithm that works on linked lists is already "in-place" by default; an out-of-place merge on a linked list would be pure madness. So saying that an in-place merge can be performed on a data structure that already has some overhead and which already performs everything in place is not a very insightful or useful observation.


> an out-of-place merge on a linked list would be pure madness

Absolutely. I'm not even sure how you would do such a thing.

I personally have found it interesting to compare an array of pointers with a linked list: they have the same space overhead, but only one of them avoids allocating any extra memory for a merge sort.

I basically agree with what you say (except perhaps the part about it not being useful knowledge), but I don't think it's worth continuing to try to clarify my initial statement.


I don't think people have linked lists in mind when they talk about merge sort.


I was under the impression that the sorting of linked lists is exactly what merge sort is good for.


You're confusing two things:

1. Yes, merge sort is exactly the kind of algorithm you'd use for sorting linked lists

2. No, sorting linked lists is not the primary application of merge sort


Right, that's not the primary application of merge sort--I meant 1. English is hard sometimes. :(


Can you clarify what you mean by that?

The article said they got a 2.2x improvement using their parallel method compared to the STL. The trade-off was 100% CPU utilization vs. 12.5% with the STL.


That's not really a trade-off, if you're using the term with a negative connotation. Going from using 100% of one core to 100% of eight (well, 4, but hyperthreading) would normally be considered an advantage.


I'm probably wrong, but wouldn't it be a decrease in overall productivity? A 2.2x increase in speed (the higher the better) but a ~7x increase in load (the lower the better) for the same workload.

Or let x be the time we need to complete the task:

the first costs 12.5 * x (load over time), which is less than the second's 100 * x / 2.2, for the same x.
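
Putting numbers on it (using the article's 12.5% and 2.2x figures): the serial run costs about 12.5 * x, the parallel run about 100 * x / 2.2, i.e. roughly 45.5 * x, so the parallel version finishes ~2.2x sooner but spends roughly 3.6x as much total CPU time.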

Or is this an inaccurate comparison?


Yeah, it depends. How much memory bandwidth do you use with each? Are the results useful incrementally? What else do you have to do on the machine? If this is the only task and you don't care about power use, then using all resources for a quicker finish is better. If you care about power use, or have other things running on the machine, then it's a harder call. Being in place will of course matter for memory bandwidth and available RAM.

GPUs are way more parallel, and faster than this too: "clocks about 100x faster than calling std::stable_sort on an i7 Sandy Bridge" http://nvlabs.github.io/moderngpu/mergesort.html


> Can you clarify what you mean by that?

I'm saying their merge sort was likely not actually in-place.


Anyone have any quick comparisons of this to Batcher's sorting method? Quickly consulting Wikipedia, I see that those have gained traction in the GPGPU community. I was curious if/when those techniques would start to see wider use.


While the roughly 10x speedup of the parallel in-place merge sort over the non-parallel version shown in the benchmarks may be meaningful, it also shows that parallelism reduces constant factors, not exponents.


If you have a look at sorting networks, they actually do reduce an exponent: the parallel time goes from n lg(n) down to a network depth of lg^2(n). See [1], for example.

Having said that, I feel like I'm trying to mislead. It requires n lg^2(n) comparators, so you still have to pay for it somehow, and I doubt we are going to see that many comparators on any piece of kit any time soon. For large n, anyway. (Though, seeing this can be done on a GPU, it does seem that you can hit rather large arrays already.)

[1] http://en.wikipedia.org/wiki/Batcher_odd%E2%80%93even_merges...
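
For reference, here is a sequential emulation of Batcher's odd-even mergesort network, following the classic formulation (my own sketch; n must be a power of two). In hardware or on a GPU, all compare-exchanges at the same depth run in parallel.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    static void compare_exchange(std::vector<int>& a, std::size_t i, std::size_t j) {
        if (a[j] < a[i]) std::swap(a[i], a[j]);
    }

    // Merge the two sorted halves of a[lo .. lo+n) with comparators of stride r.
    static void odd_even_merge(std::vector<int>& a, std::size_t lo,
                               std::size_t n, std::size_t r) {
        std::size_t m = 2 * r;
        if (m < n) {
            odd_even_merge(a, lo, n, m);       // even subsequence
            odd_even_merge(a, lo + r, n, m);   // odd subsequence
            for (std::size_t i = lo + r; i + r < lo + n; i += m)
                compare_exchange(a, i, i + r);
        } else {
            compare_exchange(a, lo, lo + r);
        }
    }

    // Sort a[lo .. lo+n), n a power of two: sort both halves, then merge.
    void odd_even_mergesort(std::vector<int>& a, std::size_t lo, std::size_t n) {
        if (n > 1) {
            std::size_t m = n / 2;
            odd_even_mergesort(a, lo, m);
            odd_even_mergesort(a, lo + m, m);
            odd_even_merge(a, lo, n, 1);
        }
    }
    // Usage: odd_even_mergesort(v, 0, v.size());   // v.size() a power of two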


[Assumption: n is large enough to make the problem interesting]

If a sorting network (such as Batcher's) requires O(n (log n)^2)[Space], doesn't this sort of defeat the purpose of trying to sort in place?...since a traditional array-copying merge sort has O(n)[Space] requirements.


I think the point is that you could build these into hardware. That's the reason this is popular for GPU programming, if I understand correctly. Consider: if you can use this, you can now sort 1024ish elements in roughly 55 parallel steps (the depth of the network), instead of 10240ish sequential ones. That is rather impressive.

It is algorithms like this that get people very excited about the potential of FPGAs, if I understand correctly.

Unless, of course, I completely flubbed my understanding and/or math. (Very very possible.)

And then, of course, your point ultimately stands as to why these are not more widely known. For most of us, the available set of comparators is fixed and very small.


According to Knuth, Batcher's parallel sort was discovered in 1964 and published in 1968. [1] That it's been in Knuth for 35 years doesn't make the general unawareness of it any less surprising.

To be a bit snarky, maybe the best piece of hardware to throw first at a sorting problem is a good book on sorting algorithms.

[1] TAoCP: Volume 3 [1978], page 111


:) Indeed, that is where I just found out about it. Although, to be fair, many of the things in the latest revision of that book are newer than 1978; it was last updated in 1998. (Of course, many are also older...)

I have been going through what is undoubtedly a ridiculous phase of suggesting these books to everyone. I'm doing them in multiple passes where I am mostly skipping exercises for the first pass. Yes, the math is intense. No, they aren't as hard to approach as they have been made out to be.


I'm cheap and that's why I purchased a used older edition. Incidentally, though, I recently realized that all the obsolete material on searching tapes as secondary storage is pure Turing machine manipulation.

Anyway, I don't expect to ever master the material, but I learn something every time I open one up.


Still not exactly cheap, but the ebook versions are lower in price; that's what finally got me to read through them. (Well, mainly because they are easy to take with me now.) I did save up to buy physical copies, though, and plan to use those to go through the exercises.


It is impossible to get anything better than a constant improvement if you divide your work by a constant factor.


Not discussing that is a weakness of the article. The algorithm addresses performance issues caused by the IO bottleneck, not an algorithmic deficiency with common merge-sort implementations.

The merits and drawbacks of trading O(n)[Memory] growth for O((log n)^2)[Time] growth are worth discussing if it is anticipated that people will implement the algorithm in production code. There's a case for looking at it, certainly, on constrained systems, but such systems may not have processors with the execution-pipeline sophistication of an Intel i7 quad-core.

Which of course is just a roundabout way of saying that the article doesn't indicate the domains in which the algorithm may or may not be relevant.


That's just what brudgers said. But don't worry, CPU and GPU manufacturers are increasing the number of cores exponentially :p


Similar to the Java 8 fork/join feature.
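
For comparison, Java 8's Arrays.parallelSort runs this kind of recursive split/sort/merge on the fork/join framework. The same shape can be sketched in C++ with std::async (my own illustration, not the article's code; depth-limited so it doesn't oversubscribe the cores):

    // A fork/join-style parallel merge sort sketched with std::async
    // (illustration only; compile with -pthread on g++/clang).
    #include <algorithm>
    #include <future>
    #include <iterator>
    #include <vector>

    template <typename It>
    void parallel_merge_sort(It first, It last, int depth = 3) {
        auto n = std::distance(first, last);
        if (n < 2) return;
        It mid = std::next(first, n / 2);
        if (depth > 0) {
            // "Fork": sort the left half on another thread...
            auto left = std::async(std::launch::async, [=] {
                parallel_merge_sort(first, mid, depth - 1);
            });
            parallel_merge_sort(mid, last, depth - 1);  // ...right half here.
            left.wait();                                // "Join".
        } else {
            std::stable_sort(first, mid);               // small pieces stay serial
            std::stable_sort(mid, last);
        }
        std::inplace_merge(first, mid, last);           // buffered if memory allows,
    }                                                   // otherwise truly in place

    int main() {
        std::vector<int> v{5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
        parallel_merge_sort(v.begin(), v.end());
    }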



