1. They trained and tested on a balanced dataset, which is very unlike the data distribution this algorithm would see “in the wild”. Under real world prescreening conditions the data would likely be extremely unbalanced toward the negative class, and also be subject to drift over time.
2. They seem to have identified positive subjects through a questionnaire not via clinical chemistry diagnostics; so (a) it is unclear whether their training labels are correct, and (b) they may have completely missed the asymptomatic population.
3. As mentioned in another comment ca. 5000 patients and 250K samples is not a lot considering the size and diversity of the population(s) where this would be deployed.
Disclaimer: I gave the article the brief high level scan treatment so I could be wrong about any or all of these. Please correct me if I am mistaken.
1. Given that the real-world distribution of positive/negative COVID cases is hugely imbalanced, having a balanced dataset would seem to be a form of random undersampling from the majority class. (Undersampling potentially discards useful data from the majority class, unless we can somehow determine that the discarded data adds no new information. In this case there's a lack of homogeneity in the majority class, which the paper points out, i.e. "there are cultural and age differences in coughs, future work could focus on tailoring the model to different age groups and regions of the world".) A quick numerical sketch of what this means for predictive value follows after point 3.
2. In the abstract, the claim is:
"When validated with subjects diagnosed using an official test, the model achieves COVID-19 sensitivity of 98.5% with a specificity of 94.2% (AUC: 0.97). For asymptomatic subjects it achieves sensitivity of 100% with a specificity of 83.2%." [Reminder: sensitivity = True Positive Rate = TP/P, specificity = True Negative Rate = TN/N]
If you look at Table 1, the breakdown is 59% self-reported, 28% doctor's assessment, 13% official test.
3. 5320 patient data points is something (the train/test breakdown is 4256/1064, so the model was built on 4256 data points). It would depend on the assumptions, but on first glance (based on sample size calculators), it doesn't seem underpowered. That said, this assumes a homogeneous population. The dataset is likely (unintentionally but systematically) undersampling certain populations due to lack of reach.
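To make point 1 concrete (back-of-the-envelope only; the 5% and 1% prevalence figures below are assumptions, not numbers from the paper), here is what the quoted sensitivity/specificity would mean for the chance that a flagged person is actually positive:

    sensitivity = 0.985   # quoted sensitivity (TPR) for officially tested subjects
    specificity = 0.942   # quoted specificity (TNR) for officially tested subjects

    for prevalence in (0.50, 0.05, 0.01):   # balanced split vs. assumed real-world rates
        tp = sensitivity * prevalence              # P(flagged and truly positive)
        fp = (1 - specificity) * (1 - prevalence)  # P(flagged but truly negative)
        ppv = tp / (tp + fp)                       # P(truly positive | flagged)
        print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}")

Under those assumptions the positive predictive value falls from roughly 94% on a 50/50 split to under 50% at 5% prevalence and around 15% at 1%, which is why a balanced test split flatters the headline numbers if you imagine this as a mass prescreen.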
What I worry about with the undersampling are the “difficult” cases such as other types of respiratory conditions and infections. How many COPD, rhinitis, chronic bronchitis, etc. patients were there in the training data? It is precisely these patients the algorithm needs to perform well on, as they are higher risk and/or likely to be most prevalent among the people who seek out this app.
I think the other big question is what advantages / disadvantages does this have compared to a questionnaire administered to someone who is experiencing symptoms of an upper respiratory infection?
That being said, this study is a significant academic achievement. The authors should be very proud of what they have done. There are real challenges to doing something like this that impose hard limitations and they did as well as anyone could without infinite resources.
1. Isn't an issue. They make inference on a sample-by-sample basis. The network has no memory, so it won't expect a 50/50 distribution on the test set just because it's trained like that. Having a balanced distribution is the exact right thing to do, because you do not want the network to be biased toward one class or the other for any given sample. If it were unbalanced, the network could achieve almost 0 training error by just predicting negative all the time. This is not what you want.
My main concerns with the imbalance are undersampling of the negative class data distribution relative to the positive class, and overestimating performance on the test splits. I can buy that you may want to train on a balanced dataset, but the testing condition should reflect the true case distribution as closely as possible.
I agree that you would not want to use only the class priors for prediction. However, I do not think it is clear that you would want to throw that information out. Also not sure that I agree with the statement that neural network has “no memory” of the prior class distribution. That is a strong claim to make about something as opaque as a neural net model.
They could have used all negative samples for testing (and even for training, had they handled it better), yes. But once your test set is large enough, whatever that means, it's not that relevant anymore. They are "undersampling" anyway by not recording data from all humans who are negative right now.
And no, it's not a strong claim to make. Of course the network learns the distribution of your training set; that's why you want it balanced. But during successive applications of inference the weights do not change; it has no state. So it cannot, for example, store that it just predicted 90% negative and decide it is now time for a positive prediction.
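For what it's worth, the two positions aren't incompatible: you can train on a balanced set and still fold a deployment prior back in at inference time with the standard Bayes correction. A minimal sketch (the 1% deployment prevalence is just an assumed illustration, not anything the authors did):

    import numpy as np

    def adjust_for_prior(p_balanced, train_pos_rate=0.5, deploy_pos_rate=0.01):
        """Standard Bayes prior correction: rescale a posterior from a model trained
        on a balanced set to a different (here assumed 1%) deployment class prior."""
        pos = p_balanced * (deploy_pos_rate / train_pos_rate)
        neg = (1 - p_balanced) * ((1 - deploy_pos_rate) / (1 - train_pos_rate))
        return pos / (pos + neg)

    # A score of 0.9 from the 50/50-trained model is far less alarming at 1% prevalence.
    print(adjust_for_prior(np.array([0.5, 0.9, 0.99])))  # approx. [0.01, 0.08, 0.50]

Whether you then threshold the raw score or the corrected one is exactly the screening-versus-diagnosis trade-off being argued about here.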
An easy problem to solve. Have the app keep the coughs until a test result is obtained then feed it into the model. Intermittently update the model used by the apps, done.
Easy in theory, but not in practice. For good reasons there are very strong privacy protections in place for medical records, and significant administrative barriers. And this is not even getting into the technical / infrastructure challenges.
Maybe feasible for a VC funded company with several million dollars and >20 FTEs. Less so for an academic lab with a few grad students and postdocs being paid with pocket lint.
I was thinking it could be self reported in-app. After using the app and saving coughs, if you get a test result you have the option of adding it with the test date.
This was not a blinded clinical trial. The subjects all knew whether they had COVID-19 or not, and knowing how strong psychological effects can be, what's detectable in their cough might be their knowledge that they're sick. The researchers even acknowledge in the paper that "sentiment" is a big part of how a forced cough sounds.
What's worrying is also how little of the data was from a diagnostic test (over half of "positive" samples were "self-diagnosed" COVID-19, whatever that means).
I don't think FDA or any other regulatory body would accept such an app as a screening tool without a proper trial being done.
If it works, that would be the most practical and coolest application of ML I've seen - but it still feels like something from the "too good to be true" category at the moment.
It’s just junk science and the title is false. They used an ML model to detect if a person knows they have a diagnosis through a fake cough into a phone app. Even then their results could quite possibly just be overfitting, even with the verification data set separated.
Thanks for pulling out that false positive statistic. It is incredibly irritating when an article, even a press-release type one like this, makes it a point to give an exact number for the true positive / false negative rate and then fails to answer the obvious other half of the question. It made me sneer "oh yeah? Well, a magic 8 ball that only ever says 'yes you have covid you're gonna die' would catch that remaining 1.5%; why aren't you doing that?"
But as usual, the fault is in the summary, not the research.
Based on the title, my first question was "If they are asymptomatic, why are they coughing?" I know there are many reasons you can cough (acid reflux, allergies, etc.), but it still seems that if it's detectable, it's not 100% asymptomatic. I'm splitting hairs (possibly wrong?), yes, but to me asymptomatic means it's completely passive.
I had the same thought, but I think it still counts as asymptomatic. Symptomatic and detectable are two different things.
For example, it's possible to have early stage breast cancer or colon cancer but have no symptoms (yet). Which is why they do screenings to catch these early.
Asymptomatic doesn’t mean unaffected. It means there are no symptoms the patient is aware of and presents with. A large fraction of the asymptomatic cases on the Diamond Princess had pneumonia.
If this really works, this could be the biggest news of the year. Assuming they get FDA approval for this app, we would suddenly have a free, instantly scalable, instantly available test that catches the vast majority of true positives. You could very easily set these up at all public places and require people to check before entry. If that happens you don’t even have to require people to self-test at home, they’ll do it voluntarily to know in advance whether they’ll be let in at their destination.
The final and most difficult step would be effective quarantine of infected individuals, some of whom are likely to try and go to work anyway etc.
But even if you assume nothing more than voluntary self-quarantine etc, I would expect this to drive R0 below 1 very quickly, as the vast majority of infected would stay home and thus cease to spread the disease.
Finally, if all of the above were to come true, I think this could go down in history as the first truly life-changing AI discovery, and potentially one of the biggest watershed moments in recent history.
Obviously we’re not there yet, but I am very optimistic and excited after reading the story.
It has almost a 20% false positive rate. If you have a class of 30 students, approximately 6 of them will get a false positive on Monday and lose the rest of the week just in case. By Friday, you will have only 10 students...
The next week you will have the same problem ...
The next week, people will start to ignore the test.
And this is assuming the students only take one test per day. If they also get tested on the bus and in the cafeteria and at the supermarket, the number of people without a false positive will be much lower.
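Rough arithmetic behind those numbers, assuming a 20% per-test false positive rate and independent tests:

    # 30 students, an assumed 20% per-test false positive rate, one test per school day.
    students, fp_rate, days = 30, 0.20, 5

    clear_all_week = (1 - fp_rate) ** days                 # P(no false positive Mon-Fri)
    print(round(students * fp_rate))                       # ~6 flagged on Monday
    print(round(students * clear_all_week))                # ~10 still in class on Friday
    print(round(students * (1 - fp_rate) ** (3 * days)))   # ~1 left with 3 screenings a day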
The way to look at it is as a cheap instant filter, something like a bloom filter, that can protect the more expensive tests that take longer.
Your example assumes there's no hierarchy of available tests, and that this test is the only test there is.
What would really happen is those 6 false positives would be referred for a more accurate test. They might miss a day of school but not a week.
At the same time, your more accurate testing pipeline can now speed up thanks to Little's law. There's dramatically less pressure on the system and less backlog, so you have a second order effect that the more expensive slower tests also become cheaper and quicker.
But even if we gloss over all that, and we're only concerned about false positive rate, then this is still much better than no school at all, as in hard lockdown, which has a 100% false positive rate.
Finally, there's the lives saved because of earlier rapid detection and isolation, with corresponding relief for the health care system, leading to increased quality of care and resources available for more severe cases... and so on and so on.
A bloom filter can do wonders for a system, and if this test works it should do the same.
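To put rough numbers on the filter idea (the 1% community prevalence is an assumption; sensitivity/specificity are the figures quoted in the abstract):

    prevalence, sensitivity, specificity = 0.01, 0.985, 0.942  # prevalence is assumed

    referred = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    missed = prevalence * (1 - sensitivity)

    print(f"Sent on to confirmatory testing: {referred:.1%}")  # ~6.7% of everyone screened
    print(f"Infected but waved through:      {missed:.3%}")    # ~0.015% of everyone screened

Under those assumptions, confirmatory capacity is only needed for about 7% of everyone screened, which is where the Little's law argument about the slower pipeline decongesting comes from.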
Are the false positives IID, or are they correlated within the same individual, i.e. a false positive will likely test false positive again, and a true negative will likely never test false positive? This is the kind of critically important stuff in stochastic process theory that medical-field statistics never seem to care to report on.
If it is the former (IID), and let's say P_D = 1.0, P_FA = 0.2, it is an extremely easy problem to solve: Just have each student take 3 tests each day, which will reduce the overall P_FA from 0.2 to 0.008. Or 4 tests for 0.0016.
If it is the latter, you will only lose 6 students for the whole week; you'll have 24 students left on Friday, not 10.
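A toy simulation of the two extremes, reusing the classroom numbers from above (30 students, 20% false positive rate, 5 days), makes the difference concrete:

    import numpy as np

    rng = np.random.default_rng(0)
    trials, students, days, fp_rate = 10_000, 30, 5, 0.20  # same assumed classroom numbers

    # IID case: every test is an independent 20% coin flip.
    iid = (rng.random((trials, students, days)) < fp_rate).any(axis=2).sum(axis=1).mean()

    # Per-person case: a fixed ~20% of students trip the test every single time.
    sticky = (rng.random((trials, students)) < fp_rate).sum(axis=1).mean()

    print(f"IID:        ~{iid:.1f} of 30 flagged at least once this week")
    print(f"Per-person: ~{sticky:.1f} of 30 flagged at least once this week")

Under IID noise roughly 20 of 30 students get flagged at least once during the week (the ~10 left on Friday); under per-person noise it is the same ~6 students every day, hence 24 left on Friday.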
Do you think the false positives are per-recording? In that case, you could just do a few more test coughs with the same person to double-check, and get that false positive rate down.
If false positives are per-person, then your scenario won't happen. It'll be the same 6 kids for whom the test never works right.
So with all this in mind, you'll have to come up with appropriate norms around the results. You could call it "okay" vs "suspect" instead of negative and positive. Maybe there's a lowered-risk version of activities for people who are "suspect" that day. Maybe they don't go to the gym that day, maybe they sit in the isolated booth in the classroom, whatever. But then, they need to take a standard test that night to return to school the next day. Or as someone else mentioned, a rapid-test at the nurse's office.
Reading the headline, I got really nervous imagining how many people would assume that a negative on the cough app means they are COVID-free and then go out and infect everyone.
However, reading the article, it seems like the false negative rate is really low. It sounds like this could be an incredibly effective screening tool.
My wife has been following suspected COVID patients and has said since the beginning that their cough is different than normal. Makes sense that ML would work in this scenario.
This anecdote actually makes me a lot more confident the tool might work. The article sounded to me like they just threw something at the wall, and it's not that hard to get a model that looks good on paper when you take the liberties they did.
But if this study serves as a PoC to back up that real-world observation, then this is quite a promising approach!
There's a reason they published this in an engineering journal, not a medical one. It's very cool, but in terms of practical use, it's in the realm of "maybe it's worth doing a real medical study to see if this works" - a valuable contribution, no doubt, but not really a contribution to diagnostics in itself (yet).
That said, I like that they're thinking outside the box on this one. A free digital test with a low false negative rate would be a game changer.
I agree. Covid-related work that is disseminated in "non-medical" venues most likely presents toy examples, which is okay as long as the authors clearly state the limitations and do not try to push their research directly to the public without getting medical experts involved.
Researchers with really strong results will for sure publish in medical journals, which typically brings a massive boost in prestige (it impresses the administration and funding providers).
This sounds extremely useful, if it's as accurate as it sounds - scaling this up to test literally billions of people a day would be a major factor in controlling the pandemic, and far easier than scaling physical tests.
I tried to propose this idea at the beginning of the pandemic to a bunch of pulmonologists, some also very active in research, but received no interest at all.
> A user could log in daily, cough into their phone, and instantly get information...
While this model is indeed extremely useful and interesting work, this seemingly casual quote gives new meaning to how unsanitary our phones really are/can be.
I'm happy that so many phones are waterproof these days, because now I can clean the thing more thoroughly.
Of course there's the issue of water getting into the charging port, but I've found that the air blown out of a laptop compiling a Node.js project works brilliantly here.
Why? A false positive is actually not that big of a deal. You spend a day being really cautious, you get another test to confirm, and then it turns out that you didn’t actually have Covid and you feel amazingly relieved.
It’s a false negative they really need to worry about. Then you have people who are going around super spreading but telling everyone that it’s fine because they tested negative.
A lot of false positives would make this solution unreliable and would have the same effect as a lockdown. Also, a sample of 70k people could mean discrimination towards minorities; we aren’t all made the same.
People voting me down probably didn’t get my point. If a tool is inefficient, people won’t use it. A way to detect COVID that produces many false positives would undermine its adoption. Also, if the demographic sampling is not done properly, you could in theory risk having underrepresented people get the wrong end of the stick. Imagine what this would mean in countries like the USA!
Lockdown is literally the goal. This is an example of a way to target such a lockdown instead of a blanket quarantine. If you can rely on a negative result (i.e. the false negative rate is low), you can clear people to go out.
Now that you or your employer have agreed to the terms and conditions of facetime/zoom/teams/webex/skype/etc, we now have an opportunity to identify and detai... market to coronavirus sufferers.
The way this is reported doesn't give that much information. It says it has a high true positive rate, but what is the false-positive rate? A test isn't useful if there is a high rate of these.
This sounds like a bad usage of machine learning. Unless the Covid cough has a consistent, distinctive sound, there might just be too many false positives coming from this. Or even false negatives.
I remember back in March a lot of people (within the field) saying AI folk should stand back and let the medical/virology community deal with Covid. This sort of thing puts paid to that attitude, IMO.
I want an Amazon Echo skill to screen anyone before they come to my house :)
Seriously, this is incredible, and if it's verified to work it could be a game changer. The real thing we need to do is all get tested at once, on the same day at the same time. Then those who are positive need to isolate for 2-3 weeks until they are negative again. That would completely reset us back to nearly zero. Then do this again 2-3 times and we could shove this demon back into the bottle.
Yeah, we would need to do daily testing too. The main problem is that (a) it takes 3-5 days for the infection to become detectable and (b) it takes 1-5 days to get your test results back. So for 4-10 days you are contagious and don’t know it. Instant testing like this that works before symptoms means you could detect the infection as soon as it takes hold, significantly reducing the spread rate. If our traditional testing were all rapid we would be a lot better off. Well, that, and leadership that isn’t solely concerned with its own reelection in a pointless bid for staying out of prison. Both would be good.