For anyone interested in playing around with these charts, the various assumptions that underpin them, etc., I've thrown together a Colab notebook as a starting point.
Observation: if you rank via true "skill" and assume that, for a particular instance, the predicted performance and observed performance are independent but both have the true skill as their mean, you don't observe the effect. CC of 0.00332755.
If you rank via observed performance and plot observed vs predicted, the effect is there. CC of -0.38085757.
This assumes very simple Gaussian noise, which is not going to be accurate, especially as most of these tasks have normalised scores.
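A minimal sketch of what I mean, in case it helps (the noise scales here are arbitrary choices of mine, so the exact CCs will differ; the gap predicted - observed is the quantity the DK-style plots show):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    skill = rng.normal(size=n)                          # latent "true" skill
    observed = skill + rng.normal(scale=1.0, size=n)    # noisy test score
    predicted = skill + rng.normal(scale=1.0, size=n)   # noisy self-assessment, independent of the test noise

    gap = predicted - observed  # overestimation, as plotted in DK-style charts

    # Rank/condition on true skill: no relationship with the gap.
    print(np.corrcoef(skill, gap)[0, 1])     # ~0

    # Rank/condition on observed performance: a negative relationship appears.
    print(np.corrcoef(observed, gap)[0, 1])  # clearly negative with these noise scales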
What your simulation includes, and the original article didn't (and I didn't touch at all in my article), is the statistical reliability of the tests they administered. Where you got a CC of -0.38, you used equal reliability (/unreliability) for the skill tests and the self-assessments. You can see that as you increase the test reliability, the CC shrinks and the effect disappears.
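For instance, reusing the Gaussian setup from the parent and shrinking only the test noise (a stand-in for higher test reliability; the specific values are made up), the CC heads towards zero:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    skill = rng.normal(size=n)

    # Smaller test noise = more reliable skill test; self-assessment noise held fixed.
    for test_noise in [1.0, 0.5, 0.25, 0.1]:
        observed = skill + rng.normal(scale=test_noise, size=n)
        predicted = skill + rng.normal(scale=1.0, size=n)
        gap = predicted - observed
        print(test_noise, round(np.corrcoef(observed, gap)[0, 1], 3))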
I have no idea what the actual reliability of the DK tests is; they do seem to consider that, but maybe not thoroughly enough. In my view it's very fair to criticize DK from that angle. But that would require looking at the actual tests and their data.
My point being that any purely random analysis is based on assumptions that can easily be tweaked to show the same effect, the opposite effect, or no effect at all.
That's a nice spot about the decreasing CC as we increase accuracy!
My hypothesis would be that some of the DK effect in the original paper may be down to an effect like this (as suggested in the original article), but asserting that it is completely incorrect because of this is premature. We'd need access to more data to verify that the level of reliability was sufficiently high.
Right. Just to be clear, "an effect like this" is (comparatively) unreliable tests, not some elusive statistical phenomenon as implied by the original article. I'd have no issue if the author had called the article "the DK effect is due to poor skill tests", spent 5 minutes showing that the DK results are consistent not only with their claims but also with unreliable tests (like you did), and then gone on to show data indicating that the tests are indeed not reliable enough to draw the conclusions that DK did. Instead, the author spends a lot of time digging under the wrong tree and no time at all saying anything about the reliability of the tests.
I agree; the article seems to imply that the plot in the original paper is always an incorrect thing to do, instead of something which can have issues when the tests are inaccurate.
I've gone back and updated the Colab notebook to use orderings exclusively instead of values, and you can see that the autocorrelation plot B from the first article appears when the noise is high enough but disappears when you reduce it, so it is definitely not a statistical law.
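Roughly the kind of thing I mean (a sketch, not the notebook's exact code; the noise levels here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    skill = rng.normal(size=n)

    def percentile(x):
        # Convert raw scores to percentile ranks (0..100), as in the DK-style plots.
        ranks = np.argsort(np.argsort(x))
        return 100.0 * ranks / (len(x) - 1)

    for noise in [1.0, 0.25, 0.05]:
        observed = percentile(skill + rng.normal(scale=noise, size=n))
        predicted = percentile(skill + rng.normal(scale=noise, size=n))
        # Plot-B style summary: mean (predicted - observed) within each observed-score quartile.
        quartile = np.digitize(observed, [25, 50, 75])
        gaps = [round(np.mean((predicted - observed)[quartile == q]), 1) for q in range(4)]
        print(noise, gaps)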
Edit: fixed, I had it the wrong way around.
https://colab.research.google.com/drive/1Vy7JjkywxwEP8nfR6oS...