It was actually done to counter Elo based approaches so there's some references in the readme on how to prove who's better. I haven't run this code in 5 years and haven't developed on it in maybe 6, but I can probably fix any issues that come up. My co-author looks to have diverged a bit. Haven't checked out his code. https://github.com/FrankWSamuelson/merge-sort . There may also be a fork by the FDA itself, not sure. This work was done for the FDA's medical imaging device evaluation division