I've been using a variation of quickselect to find median rows of a matrix at certain columns.
There's a problem with quickselect: while it will find the median, it doesn't gather duplicates of the median around its final position. So if the median occurs multiple times, you could end up with copies of it sprinkled on both sides, which may or may not be a problem (it was for me).
One way to solve this is to take a second pass over the data to pivot -- which is essentially the core of QS where you nibble from both ends and swap low/high values.
Another nice property of QS is that it'll be approximately sorted, with values generally growing closer to the median towards the middle. This helps if you need to do repeated sub-selections on the partitions.
I like quickselect, because it is essentially quicksort. The difference is that one of the two partitions is ignored, since you know the median is in the other partition.
The two-pivot variant has a smaller recursion depth in addition to avoiding the duplicated-median issue, though it does come at the cost of extra swaps. In the end, quickselect is what you use when you don't worry about pessimal performance.
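To make the second pass concrete: after selection, a single three-way (Dutch national flag) sweep groups every copy of the median in the middle. A minimal Python sketch of that idea, with function names of my own choosing:

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element of a (0-based), partially
    partitioning a in place along the way (Hoare-style partition)."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if k <= j:
            hi = j
        elif k >= i:
            lo = i
        else:
            return a[k]  # k landed in the gap of pivot-equal elements
    return a[k]

def group_median(a):
    """Find the median, then make a second pass so that every copy of it
    is contiguous: all smaller values left, all larger values right."""
    m = quickselect(a, len(a) // 2)
    # Dutch-national-flag pass:  < m  |  == m  |  > m
    lo, i, hi = 0, 0, len(a) - 1
    while i <= hi:
        if a[i] < m:
            a[lo], a[i] = a[i], a[lo]
            lo += 1
            i += 1
        elif a[i] > m:
            a[i], a[hi] = a[hi], a[i]
            hi -= 1
        else:
            i += 1
    return m
```

The flag pass swaps more than the "nibble from both ends" version, but it guarantees the equal run is contiguous in one sweep.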
maybe this is included and i missed it, but it's asymptotically faster to maintain a sorted tree of values plus a list of pointers to nodes in order added. if you store "number of values to right" in each tree node then finding the next median (moving the window one position) is O(log w) and total cost for whole array is O(n log w) iirc.
this is used to median filter images in the IRAF package. i don't know if the approach is published anywhere (it's pretty obvious once the idea of keeping points within the window in a sorted tree "clicks"), but frank valdes did test it against other approaches.
the main drawbacks are that the overhead/constant is pretty high, so you need fairly large datasets (more exactly, large windows) for it to be a win, and implementation in old fortran is a pain...
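The bookkeeping of that approach is easy to show with a sorted container standing in for the balanced tree; a Python sketch (names mine) where each update is O(w) via `insort` rather than the tree's O(log w), but the delete-old / insert-new / read-middle structure is the same:

```python
from bisect import insort, bisect_left

def sliding_median(x, w):
    """Median filter: median of each length-w window of x.
    Uses a sorted list instead of a tree, so updates are O(w);
    picks the upper median when w is even."""
    window = sorted(x[:w])
    out = [window[w // 2]]
    for i in range(w, len(x)):
        # Drop the element that just left the window, insert the new one.
        window.pop(bisect_left(window, x[i - w]))
        insort(window, x[i])
        out.append(window[w // 2])
    return out
```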
Thanks for the link. Note, though, that the median algorithm used there uses a different parameter for n than OP. It is also bucketsort-based: O(n), and O(1) only in an amortized sense (in both parameters).
Additional knowledge of your data is often useful, like using the fact that all data are 8-bit integers above. Similarly, if you know your data is uniformly distributed, you could try Torben (in TFA) for pivot selection in quickselect.
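For the 8-bit case specifically, the whole median reduces to one counting pass over a 256-bin histogram plus a scan of the bins; a sketch (returns the lower median for even counts; function name is mine):

```python
def median_u8(data):
    """Median of 8-bit values via a histogram: one O(n) counting pass,
    then an O(256) scan for the bin containing the middle element."""
    hist = [0] * 256
    for v in data:
        hist[v] += 1
    half = (len(data) + 1) // 2  # rank of the (lower) median
    seen = 0
    for value, count in enumerate(hist):
        seen += count
        if seen >= half:
            return value
```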
I would have liked to see a benchmark for the deterministic, worst-case O(n) BFPRT [1] algorithm as well. It holds up well in worst-case scenarios if you have a very large dataset.
That's funny. I just looked back at my old email, and I used this very quickselect.c implementation back in April 2008. It worked well and was wicked fast.
The algorithm by Blum, Floyd, Pratt, Rivest, and Tarjan takes 24n comparisons in the worst case. A description here:
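The group-of-five construction is short enough to sketch; this is a simplified (non-in-place) Python version of the idea, not a tuned implementation:

```python
def median_of_medians(a, k):
    """Deterministic selection (BFPRT): k-th smallest of a, 0-based.
    Simplified sketch that builds new lists instead of partitioning
    in place."""
    if len(a) <= 5:
        return sorted(a)[k]
    # Median of each group of five elements, then recurse to get a pivot
    # guaranteed to be between the 30th and 70th percentile.
    groups = [sorted(a[i:i + 5]) for i in range(0, len(a), 5)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = median_of_medians(medians, len(medians) // 2)
    lows = [x for x in a if x < pivot]
    highs = [x for x in a if x > pivot]
    n_equal = len(a) - len(lows) - len(highs)
    if k < len(lows):
        return median_of_medians(lows, k)
    if k < len(lows) + n_equal:
        return pivot
    return median_of_medians(highs, k - len(lows) - n_equal)
```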
Torben's method is very interesting. At each step it takes O(n) to halve the range of inputs. That doesn't say much about how many numbers in the array it removes, but on average it's about half, leading to O(n log n) expected time. In the worst case it is quadratic, with one example being [1, 2, 4, 8, 16, 32, 64, ...]
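For reference, a Python transcription of the commonly circulated C implementation of Torben's method; integer inputs assumed (swap `//` for `/` to handle floats):

```python
def torben(a):
    """Torben's median: repeatedly halves the *value* range [lo, hi]
    rather than partitioning the array; never moves or copies the data.
    Returns the lower median for even-length input."""
    lo, hi = min(a), max(a)
    half = (len(a) + 1) // 2
    while True:
        guess = (lo + hi) // 2
        less = greater = equal = 0
        max_lt, min_gt = lo, hi  # largest value < guess, smallest > guess
        for x in a:
            if x < guess:
                less += 1
                if x > max_lt:
                    max_lt = x
            elif x > guess:
                greater += 1
                if x < min_gt:
                    min_gt = x
            else:
                equal += 1
        if less <= half and greater <= half:
            break
        # Narrow the value range toward the heavier side.
        if less > greater:
            hi = max_lt
        else:
            lo = min_gt
    # The median is whichever of the three candidates the counts select.
    if less >= half:
        return max_lt
    if less + equal >= half:
        return guess
    return min_gt
```

On [1, 2, 4, 8, 16, 32, 64, ...] each halving of the value range only discards about one element, which is where the quadratic worst case comes from.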