1. They trained and tested on a balanced dataset, which is very unlike the data distribution this algorithm would see “in the wild”. Under real-world prescreening conditions the data would likely be extremely imbalanced toward the negative class, and also subject to drift over time.
2. They seem to have identified positive subjects through a questionnaire, not via clinical chemistry diagnostics; so (a) it is unclear whether their training labels are correct, and (b) they may have completely missed the asymptomatic population.
3. As mentioned in another comment, ca. 5,000 patients and 250K samples is not a lot considering the size and diversity of the population(s) where this would be deployed.
Disclaimer: I gave the article the brief high-level scan treatment, so I could be wrong about any or all of these. Please correct me if I am mistaken.
1. Given that the real-world distribution of positive/negative COVID cases is hugely imbalanced, having a balanced dataset would seem to be a form of random undersampling of the majority class. Undersampling potentially discards useful data from the majority class, unless we can somehow determine that the discarded data adds no new information. In this case there's a lack of homogeneity in the majority class, which the paper itself points out: "there are cultural and age differences in coughs, future work could focus on tailoring the model to different age groups and regions of the world".
2. In the abstract, the claim is:
"When validated with subjects diagnosed using an official test, the model achieves COVID-19 sensitivity of 98.5% with a specificity of 94.2% (AUC: 0.97). For asymptomatic subjects it achieves sensitivity of 100% with a specificity of 83.2%." [Reminder: sensitivity = True Positive Rate = TP/P, specificity = True Negative Rate = TN/N]
If you look at Table 1, the breakdown is 59% self-reported, 28% doctor's assessment, 13% official test.
3. 5320 patient data points is something (the train/test breakdown is 4256/1064, so the model was built on 4256 data points). It would depend on the assumptions, but at first glance (based on sample size calculators) it doesn't seem underpowered; a rough back-of-the-envelope version of that check is below. That said, this assumes a homogeneous population. The dataset is likely, unintentionally but systematically, undersampling certain populations due to lack of reach.
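For what it's worth, the back-of-the-envelope check I have in mind is just a binomial margin of error on the positive test examples. A minimal sketch in Python; the per-class count is my assumption (half of the 1064 test split) and the 0.95 sensitivity is only a placeholder, not a number from the paper:

    import math

    # Rough 95% confidence half-width (normal approximation) for an estimated
    # sensitivity, given the number of positive examples in the test split.
    def margin_of_error(p_hat, n_pos, z=1.96):
        return z * math.sqrt(p_hat * (1.0 - p_hat) / n_pos)

    n_pos = 1064 // 2                      # assumption: balanced test split
    print(margin_of_error(0.95, n_pos))    # ~0.019, i.e. roughly +/- 2 points

The caveat is that the normal approximation gets optimistic when the estimate is close to 1 or when you restrict to a small subset (e.g. only the officially tested subjects).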
What I worry about with the undersampling are the “difficult” cases, such as other types of respiratory conditions and infections. How many COPD, rhinitis, chronic bronchitis, etc. patients were there in the training data? It is precisely these patients the algorithm needs to perform well on, as they are higher-risk and/or likely to be most prevalent among the people who seek out this app.
I think the other big question is: what advantages/disadvantages does this have compared to a questionnaire administered to someone who is experiencing symptoms of an upper respiratory infection?
That being said, this study is a significant academic achievement. The authors should be very proud of what they have done. There are real challenges to doing something like this that impose hard limitations and they did as well as anyone could without infinite resources.
1. Isn't an issue. They make inferences on a sample-by-sample basis. The network has no memory, so it won't expect a 50/50 distribution on the test set just because it's trained like that. Having a balanced distribution is exactly the right thing to do, because you do not want the network to be biased toward one class or the other for any given sample. If it were unbalanced, the network could achieve almost zero training error by just predicting negative all the time. This is not what you want.
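To make that last point concrete with made-up numbers:

    import numpy as np

    # Toy illustration: on a heavily imbalanced set, the trivial
    # "always predict negative" model already looks great on error rate.
    rng = np.random.default_rng(0)
    y = (rng.random(10_000) < 0.01).astype(int)    # ~1% positives (made up)

    always_negative = np.zeros_like(y)
    error_rate = (always_negative != y).mean()     # ~0.01
    sensitivity = always_negative[y == 1].mean()   # 0.0, misses every positive
    print(error_rate, sensitivity)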
My main concerns with the imbalance are undersampling of the negative class data distribution relative to the positive class, and overestimating performance on the test splits. I can buy that you may want to train on a balanced dataset, but the testing condition should reflect the true case distribution as closely as possible.
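To spell out what I mean, a minimal sketch (the array shapes and the ~1% prevalence are made up, and this is not how the paper built its splits): balance the training half by undersampling negatives if you like, but leave the test half at something close to the prevalence you expect in deployment.

    import numpy as np

    def undersample_to_balance(X, y, rng):
        # Randomly drop negative (majority-class) rows until the classes are even.
        pos_idx = np.flatnonzero(y == 1)
        neg_idx = np.flatnonzero(y == 0)
        keep = np.concatenate([pos_idx, rng.choice(neg_idx, size=len(pos_idx), replace=False)])
        rng.shuffle(keep)
        return X[keep], y[keep]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50_000, 16))               # placeholder features
    y = (rng.random(50_000) < 0.01).astype(int)     # ~1% positives (made up)

    # Split first, then balance only the training half; the test half keeps
    # the (assumed) deployment prevalence for evaluation.
    X_train, X_test, y_train, y_test = X[:40_000], X[40_000:], y[:40_000], y[40_000:]
    X_train_bal, y_train_bal = undersample_to_balance(X_train, y_train, rng)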
I agree that you would not want to use only the class priors for prediction. However, I do not think it is clear that you would want to throw that information out. I am also not sure I agree with the statement that a neural network has “no memory” of the prior class distribution. That is a strong claim to make about something as opaque as a neural net model.
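For example (purely as an illustration of reusing that information, not something the paper does): if the model is trained on a 50/50 split, an assumed deployment prior can be folded back into its output at prediction time instead of being discarded. This is the standard two-class prior-shift correction; the 1% prevalence is a made-up number.

    def adjust_for_prior(p_pos, train_prior=0.5, deploy_prior=0.01):
        # Rescale the model's P(positive) from the training-set prior to an
        # assumed deployment prior by reweighting the odds.
        prior_ratio = (deploy_prior / (1 - deploy_prior)) / (train_prior / (1 - train_prior))
        odds = (p_pos / (1 - p_pos)) * prior_ratio
        return odds / (1 + odds)

    print(adjust_for_prior(0.90))   # a "90% positive" score drops to ~8% once
                                    # a 1% deployment prevalence is folded in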
They could have used all negative samples for testing (and even for training, had they handled it differently), yes. But once your test set is large enough, whatever that means, it's not that relevant anymore. They are in any case "undersampling" by not recording data from every human who is negative right now.
And no, it's not a strong claim to make. Of course the network learns the distribution of your training set; that's why you want it balanced. But during successive applications of inference the weights do not change, and it has no state. So it cannot, for example, store that it just predicted 90% negative and that now it would be time again for a positive prediction.
An easy problem to solve. Have the app keep the coughs until a test result is obtained, then feed them into the model. Intermittently update the model used by the apps, done.
Easy in theory, but not in practice. For good reasons there are very strong privacy protections in place for medical records, and significant administrative barriers. And this is not even getting into the technical / infrastructure challenges.
Maybe feasible for a VC funded company with several million dollars and >20 FTEs. Less so for an academic lab with a few grad students and postdocs being paid with pocket lint.
I was thinking it could be self-reported in-app. After using the app and saving coughs, if you get a test result you have the option of adding it along with the test date.
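Something like this on the client side would probably be enough (field names are just illustrative):

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class CoughRecording:
        # Hypothetical in-app record: keep the audio locally until the user
        # optionally attaches a later test result.
        audio_path: str
        recorded_on: date
        test_result: Optional[bool] = None   # True = positive; None = no result yet
        test_date: Optional[date] = None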