1. Given that the real-world distribution of positive/negative COVID cases is hugely imbalanced, having a balanced dataset would seem to be a form of random undersampling of the majority class. (Undersampling potentially discards useful data from the majority class, unless we can somehow determine that the discarded data adds no new information. In this case there is a lack of homogeneity in the majority class, which the paper itself points out: "there are cultural and age differences in coughs, future work could focus on tailoring the model to different age groups and regions of the world".)
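To make the point concrete, here is a minimal sketch of what "balancing" the dataset implies: randomly undersampling the majority (negative) class down to the minority count and discarding the rest. The class sizes below are purely illustrative, not the paper's actual counts.

```python
import random

random.seed(0)

# Illustrative class sizes only (not from the paper)
negatives = list(range(100_000))   # majority class: negative cases
positives = list(range(2_000))     # minority class: positive cases

# Random undersampling: keep only as many negatives as there are positives
balanced_neg = random.sample(negatives, len(positives))
discarded = len(negatives) - len(balanced_neg)

print(f"negatives kept: {len(balanced_neg)}, discarded: {discarded}")
# With these illustrative counts, 98,000 negative samples are thrown away
```

Whether those 98,000 discarded samples carried new information is exactly the question the quoted passage raises.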
2. In the abstract, the claim is:
"When validated with subjects diagnosed using an official test, the model achieves COVID-19 sensitivity of 98.5% with a specificity of 94.2% (AUC: 0.97). For asymptomatic subjects it achieves sensitivity of 100% with a specificity of 83.2%." [Reminder: sensitivity = True Positive Rate = TP/P, specificity = True Negative Rate = TN/N]
If you look at Table 1, though, the breakdown of diagnosis sources is 59% self-reported, 28% doctor's assessment, and only 13% official test — so the headline numbers come from a small subset of the data.
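For readers who want the metric definitions made explicit, here is a small sketch of sensitivity and specificity as used in the abstract. The confusion-matrix counts below are hypothetical, chosen only so the outputs match the quoted percentages.

```python
def sensitivity(tp, fn):
    """True Positive Rate: TP / (TP + FN) = TP / P."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True Negative Rate: TN / (TN + FP) = TN / N."""
    return tn / (tn + fp)

# Hypothetical counts for illustration only (not the paper's data):
tp, fn = 197, 3     # 200 true positives in total
tn, fp = 942, 58    # 1000 true negatives in total

print(f"sensitivity = {sensitivity(tp, fn):.3f}")  # 0.985
print(f"specificity = {specificity(tn, fp):.3f}")  # 0.942
```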
3. 5,320 patient data points is a reasonable amount (the train/test split is 4,256/1,064, so the model was built on 4,256 data points). It depends on the assumptions, but at first glance (based on sample size calculators) it doesn't seem underpowered. That said, this assumes a homogeneous population; the dataset is likely (unintentionally but) systematically undersampling certain populations due to lack of reach.
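For a rough sense of what a sample size calculator does here, the standard formula for estimating a proportion within a margin of error e at a given confidence level is n = z²·p(1−p)/e². This is a back-of-the-envelope sketch and, importantly, it assumes simple random sampling from a homogeneous population — exactly the assumption questioned above.

```python
from math import ceil

def required_n(p=0.5, e=0.02, z=1.96):
    """Sample size to estimate a proportion p within margin e
    at the confidence level implied by z (1.96 -> 95%).
    p = 0.5 is the conservative worst case."""
    return ceil(z**2 * p * (1 - p) / e**2)

print(required_n())  # 2401: +/-2% margin at 95% confidence, worst-case p
```

By this crude yardstick, 4,256 training points clears the bar — but only under the homogeneity assumption.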
What I worry about with the undersampling are the "difficult" cases, such as other types of respiratory conditions and infections. How many COPD, rhinitis, chronic bronchitis, etc. patients were there in the training data? It is precisely these patients the algorithm needs to perform well on, as they are higher risk and/or likely to be most prevalent among the people who seek out this app.
I think the other big question is: what advantages/disadvantages does this have compared to a questionnaire administered to someone experiencing symptoms of an upper respiratory infection?
That being said, this study is a significant academic achievement. The authors should be very proud of what they have done. There are real challenges to doing something like this that impose hard limitations and they did as well as anyone could without infinite resources.
It's a short 9-page paper, worth a read.