I think the other piece that has been glossed over a bit is that DK are using quantiles (for both the test and the self-assessment). That means everything is bounded by 0 and 1, so you can't underestimate your performance if it was the worst in the group, or overestimate it if it was the best. Or conversely, if you're the most skilled person in the room, your (random) measured rank on the day of the test can only match or fall below your true rank, and vice versa for the least skilled. So e.g. we could simulate data with perfect self-assessment of overall skill, add a small amount of noise to actual performance on the day of the test, and get qualitatively the same pattern. The bottom quartile (grouped by actual test score) will be a mix of people who really are in the bottom quartile in skill and some from higher quartiles who had a bad day, so its average self-assessment sits above its average test score. The top quartile by actual test score will likewise be a mix of some from the top quartile in skill and some from lower quartiles, so its average self-assessment sits below its average test score.
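Here is a rough sketch of the simulation described above, just to make it concrete. Everything in it is an assumption on my part rather than anything from DK: skill is standard normal, test-day noise is additive normal with a hypothetical `noise_sd`, and "perfect self-assessment" means everyone reports their true skill percentile. The function name `simulate_quartiles` is made up for this example.

```python
import numpy as np

def simulate_quartiles(n=10_000, noise_sd=0.5, seed=0):
    """Mean (test percentile, self-assessed percentile) for each test-score quartile."""
    rng = np.random.default_rng(seed)
    skill = rng.normal(size=n)                          # true skill (assumed standard normal)
    score = skill + rng.normal(scale=noise_sd, size=n)  # test-day performance = skill + noise

    # 0-based ranks: smallest value gets rank 0, largest gets n - 1
    skill_rank = np.argsort(np.argsort(skill))
    score_rank = np.argsort(np.argsort(score))

    self_pct = (skill_rank + 1) / n  # perfect self-assessment of true skill
    test_pct = (score_rank + 1) / n  # percentile actually achieved on the test
    quartile = score_rank * 4 // n   # 0..3, grouped by test score as in DK

    return [
        (test_pct[quartile == q].mean(), self_pct[quartile == q].mean())
        for q in range(4)
    ]

results = simulate_quartiles()
for q, (test_mean, self_mean) in enumerate(results, start=1):
    print(f"Q{q}: test percentile {test_mean:.2f}, "
          f"self-assessed percentile {self_mean:.2f}")
```

Under these assumptions the bottom quartile's mean self-assessment comes out above its mean test percentile and the top quartile's comes out below, even though nobody misjudges their own skill. How big the gap is depends entirely on `noise_sd`.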
I agree in principle, although I think to get an effect size similar to what DK observed you'd need quite large noise, which again comes back to the reliability of the test.
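To get a feel for how much noise that takes, one could sweep the noise level and look at the bottom-quartile gap next to the implied reliability. This is only a sketch under my own assumptions: unit-variance normal skill, additive normal noise, perfect self-assessment, and reliability taken in the classical-test-theory sense as true-score variance over observed-score variance, i.e. 1 / (1 + noise_sd^2) here. The helper `bottom_quartile_gap` is a hypothetical name.

```python
import numpy as np

def bottom_quartile_gap(noise_sd, n=20_000, seed=1):
    """Mean self-assessed percentile minus mean test percentile, bottom test quartile."""
    rng = np.random.default_rng(seed)
    skill = rng.normal(size=n)                          # true skill, variance 1 (assumed)
    score = skill + rng.normal(scale=noise_sd, size=n)  # noisy test-day performance

    skill_rank = np.argsort(np.argsort(skill))  # 0-based ranks
    score_rank = np.argsort(np.argsort(score))
    bottom = (score_rank * 4 // n) == 0         # bottom quartile by test score

    # Perfect self-assessment: self-reported percentile equals true skill percentile
    return (skill_rank[bottom].mean() - score_rank[bottom].mean()) / n

for noise_sd in (0.25, 0.5, 1.0, 2.0):
    reliability = 1.0 / (1.0 + noise_sd ** 2)   # var(skill) / var(score)
    gap = bottom_quartile_gap(noise_sd)
    print(f"noise_sd={noise_sd:.2f}  reliability={reliability:.2f}  "
          f"bottom-quartile gap={gap:.3f}")
```

The gap grows as reliability drops, and it can never exceed roughly 0.375 (a group that is random with respect to skill averages the 50th percentile, against about the 12.5th for the bottom quartile by score). So a gap anywhere near that ceiling would need a very unreliable test, or something beyond noise.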