Fast median search: an ANSI C implementation (ndevilla.free.fr)
72 points by objectivefs on May 18, 2013 | 15 comments



I've been using a variation of quickselect to find median rows of a matrix at certain columns.

There's a problem with quickselect: while it will find the median, it doesn't group duplicates of the median value around the median position if the median occurs multiple times. So you could end up with copies of the median sprinkled around both sides, which may or may not be a problem (it was for me).

One way to solve this is to take a second partitioning pass over the data -- which is essentially the core of QS, where you nibble from both ends and swap low/high values.

Another nice property of QS is that the array ends up approximately sorted, with values generally getting closer to the median towards the middle. This helps if you need to do repeated sub-selections on the partitions.
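
If it helps, the second pass can be a plain three-way ("Dutch flag") partition around the median value that quickselect already found. An untested sketch (names are just illustrative, not from the linked code):

    /* After quickselect has found the median value med, one extra
     * three-way partitioning pass groups every copy of med together,
     * so duplicates can no longer be sprinkled over both sides. */
    static void swap_d(double *x, double *y) { double t = *x; *x = *y; *y = t; }

    /* On return, a[*first .. *last - 1] hold all elements equal to med. */
    void group_median(double *a, int n, double med, int *first, int *last)
    {
        int lt = 0, gt = n - 1, i = 0;
        while (i <= gt) {
            if (a[i] < med)      swap_d(&a[i++], &a[lt++]);   /* grow < zone */
            else if (a[i] > med) swap_d(&a[i], &a[gt--]);     /* grow > zone */
            else                 i++;                         /* med stays   */
        }
        *first = lt;
        *last  = gt + 1;
    }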


There is actually a paper about exactly this:

"Optimal Pivot Selection in Fast Weighted Median Search"

ieeexplore.ieee.org/iel5/78/6236322/06193457.pdf

It gives a formula for the optimal size of the subset used to pick the pivot, and also does some optimization after the first partitioning step.


I like quickselect, because it is essentially quicksort. The difference is that one of the two partitions is ignored, since you know the median is in the other partition.
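
For concreteness, a bare-bones untested sketch of that idea (Lomuto-style partition, naive last-element pivot, iterating into whichever side holds rank k):

    /* k-th smallest (0-based) of a[0..n-1]; reorders a. Median: k = n/2. */
    double quickselect(double *a, int n, int k)
    {
        int lo = 0, hi = n - 1;
        while (lo < hi) {
            double pivot = a[hi];                 /* naive pivot choice */
            int store = lo, i;
            for (i = lo; i < hi; i++) {
                if (a[i] < pivot) {
                    double t = a[i]; a[i] = a[store]; a[store] = t;
                    store++;
                }
            }
            a[hi] = a[store]; a[store] = pivot;   /* pivot in final place */
            if (k == store) break;
            else if (k < store) hi = store - 1;   /* keep left partition  */
            else                lo = store + 1;   /* keep right partition */
        }
        return a[k];
    }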


The two-pivot variant has a smaller recursion depth in addition to avoiding the duplicate-median issue, but it does come at the cost of extra swaps. In the end, though, quickselect is what you use when you aren't worried about worst-case performance.


maybe this is included and i missed it, but it's asymptotically faster to maintain a sorted tree of the window's values plus a list of pointers to the nodes in the order they were added. if you store the "number of values to the right" in each tree node then finding the next median (moving the window one position) is O(log w) and the total cost for the whole array is O(n log w) iirc.

this is used to median filter images in the IRAF package. i don't know if the approach is published anywhere (it's pretty obvious once the idea of keeping points within the window in a sorted tree "clicks"), but frank valdes did test it against other approaches.

the main drawbacks are that the overhead/constant is pretty high, so you need fairly large datasets (more exactly, large windows) for it to be a win, and implementation in old fortran is a pain...
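
i don't have the tree code handy, but here is the same running-window update written with a plain sorted array instead of a tree -- O(w) per step instead of O(log w), though the drop-oldest / insert-newest / read-middle structure is the same (untested sketch, names illustrative):

    #include <string.h>

    /* Running median of x[0..n-1] with odd window width w into out[0..n-w].
     * Assumes n >= w and w <= 1024. win[] is kept sorted; each step removes
     * the outgoing sample and inserts the incoming one (the memmove is the
     * O(w) cost that the tree version avoids). */
    void running_median(const double *x, int n, int w, double *out)
    {
        double win[1024];
        int i, j;

        for (i = 0; i < w; i++) {          /* build the first sorted window */
            double v = x[i];
            for (j = i; j > 0 && win[j - 1] > v; j--) win[j] = win[j - 1];
            win[j] = v;
        }
        out[0] = win[w / 2];

        for (i = w; i < n; i++) {
            double old = x[i - w], v = x[i];
            /* remove the sample leaving the window */
            for (j = 0; win[j] != old; j++) ;
            memmove(win + j, win + j + 1, (w - 1 - j) * sizeof win[0]);
            /* insert the sample entering the window */
            for (j = w - 1; j > 0 && win[j - 1] > v; j--) win[j] = win[j - 1];
            win[j] = v;
            out[i - w + 1] = win[w / 2];
        }
    }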


If the window is the length of the data, that becomes O(n log n) -- essentially no better than a sort?

For images and other kernel-based filtering, there is an O(1) median algorithm available: http://nomis80.org/ctmf.html


Thanks for the link. Note, though, that the median algorithm used there uses a different parameter for n than the OP does. It is also bucketsort-based, and it is O(n), with O(1) only amortized (in both parameters).

Additional knowledge of your data is often useful, like exploiting the fact that all the data are 8-bit integers in the case above. Similarly, if you know your data is uniformly distributed, you could try Torben (in TFA) for pivot selection in quickselect.
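
For the 8-bit case the whole job collapses to a 256-bin counting pass -- a minimal sketch of that idea (not the ctmf code itself; the name is just illustrative):

    /* Median of n 8-bit samples via a 256-bin histogram: one O(n) counting
     * pass, then a walk over at most 256 bins. No comparisons, no sorting. */
    unsigned char median_u8(const unsigned char *data, int n)
    {
        unsigned int hist[256] = {0};
        unsigned int cum = 0, half = (unsigned int)(n + 1) / 2;
        int i, v;

        for (i = 0; i < n; i++) hist[data[i]]++;
        for (v = 0; v < 256; v++) {
            cum += hist[v];
            if (cum >= half) return (unsigned char)v;   /* lower median */
        }
        return 255;   /* unreachable for n > 0 */
    }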


maybe i have misunderstood the application - i am talking about "median smoothing" with a "running window". w is typically fixed.

but thanks for link to paper - that is new to me.


Look at jallmann's comment [0]. Quickselect partially sorts the input array. If you reuse that array, it will be faster on repeated runs.

[0] https://news.ycombinator.com/item?id=5728220


even w ordered data, sorting is O(n), isn't it?


Ordered data? Lookups into sorted data can use a binary search and are only O(log n).


but you need to move n/2 elements to "make room" - think of the whole process.

edit: you're not wrong, but what i am saying is that to make the filter practical you have to save the new point, too.

sorry for brief comments. on vacn w tablet only.


I would have liked to see a benchmark for the deterministic, worst-case linear O(n) BFPRT [1] algorithm as well. It does well in worst-case scenarios if you have a very large dataset.

[1] http://en.wikipedia.org/wiki/Selection_algorithm#Linear_gene...


That's funny. I just looked back at my old email, and I used this very quickselect.c implementation back in April 2008. It worked well and was wicked fast.

The algorithm by Blum, Floyd, Pratt, Rivest, and Tarjan takes 24n comparisons in the worst case. A description here:

http://www.ics.uci.edu/~eppstein/161/960130.html

However, this algorithm has an average case performance of 4n comparisons:

http://www.ics.uci.edu/~eppstein/161/960125.html
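
For reference, a compact (untested, untuned) sketch of median-of-medians selection, with the usual groups of five and a three-way partition -- the constant factors here are exactly why a benchmark against plain quickselect would be interesting:

    static void isort(double *a, int n)
    {
        int i, j;
        for (i = 1; i < n; i++) {
            double v = a[i];
            for (j = i; j > 0 && a[j - 1] > v; j--) a[j] = a[j - 1];
            a[j] = v;
        }
    }

    static void swp(double *x, double *y) { double t = *x; *x = *y; *y = t; }

    /* k-th smallest (0-based) of a[0..n-1], worst-case linear; reorders a. */
    double mom_select(double *a, int n, int k)
    {
        while (n > 5) {
            int i, g = 0, lt = 0, gt = n - 1;
            double pivot;

            /* medians of groups of five, collected at the front of a[] */
            for (i = 0; i < n; i += 5) {
                int len = (n - i < 5) ? n - i : 5;
                isort(a + i, len);
                swp(&a[g++], &a[i + len / 2]);
            }
            /* pivot = median of those group medians (recursive call) */
            pivot = mom_select(a, g, g / 2);

            /* three-way partition: < pivot | == pivot | > pivot */
            i = 0;
            while (i <= gt) {
                if (a[i] < pivot)      swp(&a[i++], &a[lt++]);
                else if (a[i] > pivot) swp(&a[i], &a[gt--]);
                else                   i++;
            }
            if (k < lt)      n = lt;                                   /* left  */
            else if (k > gt) { a += gt + 1; k -= gt + 1; n -= gt + 1; } /* right */
            else             return pivot;                             /* done  */
        }
        isort(a, n);
        return a[k];
    }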


Torben's method is very interesting. At each step it takes O(n) to halve the range of input values. That doesn't say much about how many numbers in the array it eliminates, but on average it's about half, leading to O(n log n) expected time. In the worst case it is quadratic, one example being [1, 2, 4, 8, 16, 32, 64, ...].
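
A sketch of that range-bisection idea on integer data (not the exact code from TFA; the name is illustrative) -- each O(n) pass halves the value range, which is why exponentially spaced inputs like the one above force roughly n passes:

    /* Lower median of m[0..n-1] by bisecting the *value* range [lo, hi].
     * The array is never modified; each pass just counts. */
    int torben_like(const int *m, int n)
    {
        int half = (n + 1) / 2;              /* 1-based rank of lower median */
        int lo = m[0], hi = m[0];
        int i;

        for (i = 1; i < n; i++) {
            if (m[i] < lo) lo = m[i];
            if (m[i] > hi) hi = m[i];
        }
        while (lo < hi) {
            int guess = lo + (hi - lo) / 2;
            int less = 0, equal = 0;
            int maxlt = lo, mingt = hi;      /* nearest values below/above guess */

            for (i = 0; i < n; i++) {
                if (m[i] < guess) {
                    less++;
                    if (m[i] > maxlt) maxlt = m[i];
                } else if (m[i] == guess) {
                    equal++;
                } else if (m[i] < mingt) {
                    mingt = m[i];
                }
            }
            if (less >= half)              hi = maxlt;   /* median is below guess */
            else if (less + equal >= half) return guess; /* median equals guess   */
            else                           lo = mingt;   /* median is above guess */
        }
        return lo;
    }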



