I'm in the middle of such a story right now. I'm doing research on a data set of COVID-19 hospital patients with multiple blood samples over time from each patient. The obvious thing we want to do with this data is to line up all the samples on a single timeline, so we can see how the data changes over the course of COVID-19 from infection to resolution. Unfortunately, as with most infectious diseases, no one knows exactly when each patient was actually infected, which means we can't just sort the samples by time since infection and be done with it.
So, we set out to find some way of inferring the timeline from the data itself (RNA-seq and other molecular assays from the blood, in this case). The first thing we tried was to apply some standard methods for "pseudo-time" analysis, but these methods are designed for a different kind of data (single-cell RNA-seq) and turned out not to work on our data: for any given patient, these methods were only slightly better than a coin flip at telling whether Sample 2 should come after Sample 1.
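To make the "coin flip" comparison concrete, here's a minimal sketch of the kind of evaluation I mean: for each patient, score what fraction of within-patient sample pairs the inferred pseudo-time puts in the same order as the true collection order. The function and the example values are hypothetical, not from our actual pipeline; 0.5 is chance level.

```python
from itertools import combinations

def pairwise_ordering_accuracy(true_times, pseudotimes):
    """Fraction of within-patient sample pairs whose inferred pseudo-time
    ordering agrees with the true collection order (0.5 ~= coin flip)."""
    pairs = list(combinations(range(len(true_times)), 2))
    correct = sum(
        (true_times[i] < true_times[j]) == (pseudotimes[i] < pseudotimes[j])
        for i, j in pairs
    )
    return correct / len(pairs)

# Hypothetical patient: samples collected on days 0, 3, 7, 14,
# with pseudo-times that get one pair backwards.
print(pairwise_ordering_accuracy([0, 3, 7, 14], [0.1, 0.4, 0.2, 0.9]))  # 5 of 6 pairs correct
```

The off-the-shelf pseudo-time methods we tried were only marginally above 0.5 on this kind of score.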
Eventually, we gave up on that and tried to come up with our own method. I can't give the details yet, since we're currently in the process of writing the paper, but suffice it to say that the method we landed on was the result of repeatedly applying the principle of "try the stupidest thing that works" at every step: assuming linearity, assuming independence, and so on, with no real justification. As an example, we wanted an unbiased estimate of a parameter, and we found one way that consistently overestimated it in simulations and another that consistently underestimated it. So what did we use as our final estimate? Well, the mean of the overestimate and the underestimate, obviously!
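As a toy illustration of why this trick isn't as crazy as it sounds (this is emphatically not our actual method or parameter), take estimating theta from samples drawn uniformly on [0, theta]. The sample maximum consistently underestimates theta, an inflated version consistently overestimates it, and for this particular pair, averaging the two happens to recover the classic unbiased estimator max * (n+1)/n:

```python
import random

random.seed(0)
theta, n, trials = 10.0, 5, 20000

under_sum = over_sum = avg_sum = 0.0
for _ in range(trials):
    xs = [random.uniform(0, theta) for _ in range(n)]
    m = max(xs)
    under = m                 # E[under] = n/(n+1) * theta   -- biased low
    over = m * (n + 2) / n    # E[over]  = (n+2)/(n+1) * theta -- biased high
    under_sum += under
    over_sum += over
    avg_sum += (under + over) / 2  # algebraically equals m * (n+1)/n, unbiased

print(f"underestimate: {under_sum / trials:.3f}")  # ~8.33
print(f"overestimate:  {over_sum / trials:.3f}")   # ~11.67
print(f"average:       {avg_sum / trials:.3f}")    # ~10.00 = theta
```

In general the mean of two biased estimators is only unbiased when the biases cancel, which is exactly the sort of thing you check in simulation rather than assume.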
All the while I was implementing this method, I was convinced it couldn't possibly work. My boss encouraged me to keep going, and I did. And it's a good thing he did, because this "stupidest possible" method has stood up to every test we've thrown at it. When I first saw the numbers, I was sure I had made an error somewhere, and I went bug hunting. But it works in extensive simulations. It works in in vitro data. It works in our COVID-19 data set. It works in other COVID-19 data sets. It works in data sets for other diseases. All the statisticians we've talked to agree that the results look solid. After slicing and dicing the simulation data, we even have some intuition for why it works (and when it doesn't).
And like I said, now we're preparing to publish it in the next few months. As far as we're aware (and we've done a lot of searching), there's no published method for doing what ours does: taking a bunch of short per-patient sample timelines and assembling them into one big timeline, so you can analyze your whole data set along a single axis of disease progression.
We will be posting it on a preprint server when we're ready to submit it to journal review, hopefully some time in February (but who knows, publishing timelines are murky at the best of times). The title will be something along the lines of "Reconstructing the temporal development of COVID-19 from sparse longitudinal molecular data", though that's likely to change somewhat.