I think if you wanted to poke holes in the paper you'd start with the generic issues that are typical of much psychological research:
1. It uses a tiny sample size.
2. It assumes American psych undergrads are representative of the entire human race.
3. It uses stupid and incredibly subjective tests, then combines that with cherry picking:
"Thus, in Study 1 we presented participants with a series of jokes and asked them to rate the humor of each one. We then compared their ratings with those provided by a panel of experts, namely, professional comedians who make their living by recognizing what is funny and reporting it to their audiences. By comparing each participant's ratings with those of our expert panel, we could roughly assess participants' ability to spot humor ... we wanted to discover whether those who did poorly on our measure would recognize the low quality of their performance. Would they recognize it or would they be unaware?"
In other words, if you like the same humor as professors and their hand-picked "joke experts" then you will be assessed as "competent". If you don't, then you will be assessed as "incompetent".
Of course, we can already guess what happened next - their hand-picked experts didn't agree on which of their hand-picked jokes were funny. No problem. Rather than treat this as evidence that their study design might not be reliable, they just tossed the outliers:
"Although the ratings provided by the eight comedians were moderately reliable (a = .72), an
analysis of interrater correlations found that one (and only one) comedian's ratings failed to correlate positively with the others (mean r = -.09). We thus excluded this comedian's ratings in our calculation of the humor value of each joke"
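(For concreteness, here is roughly how the numbers in that quote - Cronbach's alpha and each rater's mean correlation with the others - are typically computed. This is a sketch with made-up ratings, not the paper's data, so treat it purely as an illustration of the mechanics.)

    # Illustrative only: hypothetical joke ratings, not the study's data.
    import numpy as np

    rng = np.random.default_rng(0)
    n_jokes, n_raters = 30, 8

    # Seven raters share a common "humor" signal; the eighth is pure noise.
    signal = rng.normal(size=n_jokes)
    ratings = np.column_stack(
        [signal + rng.normal(scale=1.0, size=n_jokes) for _ in range(n_raters - 1)]
        + [rng.normal(size=n_jokes)]
    )

    def cronbach_alpha(x):
        # Rows are jokes, columns are raters.
        k = x.shape[1]
        return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

    corr = np.corrcoef(ratings, rowvar=False)          # rater-by-rater correlation matrix
    mean_r = (corr.sum(axis=1) - 1) / (n_raters - 1)   # each rater's mean r with the others

    print("alpha:", round(cronbach_alpha(ratings), 2))
    print("mean interrater r:", np.round(mean_r, 2))

    # Dropping any rater whose mean r is not positive and averaging the rest
    # gives the "expert" humor value of each joke, as described in the quote.
    joke_values = ratings[:, mean_r > 0].mean(axis=1)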
The fact that this made it into the published paper at all, that peer reviewers didn't immediately reject it, and that the Dunning-Kruger effect became famous, is a great example of why people don't or shouldn't take the social sciences seriously.
> is a great example of why people don't or shouldn't take the social sciences seriously.
Oh the irony in your last statement. Somebody who hasn't done social science research professionally (this is an assumption, let me know if I'm wrong) has difficulty judging what social science research can (and can't) do ...
One does not need to have done social science research to be able to recognize obvious, philosophy-of-science-level problems with the methods used in much social science.
If we take your claim seriously then we have to disallow all critiques of the replicability crisis in the social sciences that don’t come from social scientists, but that would present an obvious new problem: conflict of interest. It’s also just an absurd requirement.
You are correct - I should have been more precise: I hypothesize parent has not done science research professionally (again, happy to be proven wrong).
Don't get me wrong, I'm not defending social science research per se (yes, there are questionable methods). I'm critiquing parent, who has high confidence in pointing out issues with the DK paper yet misses the real issues. Which, in the context of discussing whether the DK effect is more than just regression to the mean, is quite ironic (which I have worded quite strongly, agreed).
Parent's arguments lead to absurd conclusions like "two Cornell professors not being very logical people" or "an HN poster being better at peer review than experts in the field". If you want to see a state-of-the-art critique of whether the DK effect is explained by metacognition vs. regression to the mean, see [1].
Why is this relevant? From the article:
> I have no illusions that everything I read online should be correct, or about people’s susceptibility to a strong rhetoric cleverly bashing conventional science, even in great communities such as HN. But frankly, for the last few years, the world seems to be accelerating the rate at which it’s going crazy, and it feels to me a lot of that is related to people’s distrust in science (and statistics in particular).
I completely agree with the author here. Science is rarely black and white, and, arguably, there are more shades of grey in the social sciences. Just as an example, because you mentioned the replicability crisis: I still see many commenters here on HN believing that from the failure to replicate a result it follows that the result is wrong. It doesn't. But that's a whole other discussion.
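As a footnote on the regression-to-the-mean alternative mentioned above, here is a limiting-case toy sketch of the statistical concern (my own illustration, not something from [1] or from the paper): even self-estimates that carry no information about actual skill will reproduce the familiar DK quartile plot, once you add the usual better-than-average bias and bin people by measured performance.

    # Toy illustration: perceived ability is pure noise plus a
    # "better than average" bias, completely unrelated to actual skill.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    actual = rng.uniform(0, 100, n)                    # actual percentile
    perceived = rng.normal(65, 15, n).clip(0, 100)     # self-estimated percentile

    quartile = np.digitize(actual, np.percentile(actual, [25, 50, 75]))
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual {actual[m].mean():5.1f}  perceived {perceived[m].mean():5.1f}")

    # Output shows the bottom quartile "overestimating" (~65 vs ~12) and the top
    # quartile "underestimating" (~65 vs ~87), despite zero metacognitive signal.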
None of your points actually address the sample size and study design issues, which would be unacceptable even in the social sciences today. Generalizing results from a fistful of privileged undergrads is a well-known issue even in the community.
I totally agree, the first study with the jokes seems silly. Then again, I am not from the field; maybe it is not actually as silly as it seems to me. The other studies seem much better to me, and removing the first one would not change the conclusions.
Is there supplemental material I didn't notice? I only skimmed it after the joke section, but I can't find any mention of supplemental data anywhere. That's a problem because, although you say the other tests are better, no information appears to be provided on which we can judge that.
Let's look at the second test. It's advertised as a "logic test". The description is:
> Participants then completed a 20-item logical reasoning test that we created using questions taken from a Law School Admissions Test (LSAT) test preparation guide (Orton, 1993).
That's the entire description of their method. So immediately, we can see the following problems:
1. Just like the joke test, there's no way to replicate this given the description in the paper. Which questions did they take and why? In turn this throws all claims that the DK study has been replicated into question.
2. The citation is literally a Cliffs Notes exercise for students. It's about memorizing answers to pass law exams, not a test designed to verify logical reasoning ability. Why do they think this is a good source of questions for testing logic? Law is not a system of logic; there's even a famous saying about that: "the life of the law is not logic but experience". If you wanted to test logical reasoning, a more standard approach would be something like Raven's Matrices.
Putting my two posts together there's a third problem:
3. Putting aside the obvious problems with subjectivity, their joke test is defined in an illogical way. They define a test of expertise (working as a comedian), select some people who pass this test and define them as experts, then discover that one expert would have been ranked by their own test as "incompetent but doesn't know it". Yet this is a contradiction, because this person was selected specifically because the researchers defined them as competent. Rather than deal with this logical contradiction by reframing the question they simply ignore it by discarding that comedian from their expert pool.
This is good evidence that DK themselves weren't particularly logical people, yet they claim to have designed a test of logic - a bold claim at the best of times. Ironically, it appears DK may be suffering from their own effect: they believe themselves to be competent at designing tests, yet the evidence in their paper suggests they aren't.
In my experience prepping and taking the test, I found the LSAT logic questions to be pretty good at assessing deductive reasoning.
They’re 100% divorced from law and are closer to puzzles of the nature of, “Six people sit at a table, four of whom are wearing hats, three of which are red, …”
I could not quickly find the LSAT preparation guide but I found some LSAT sample questions [1] and they seem suitable to assess reasoning abilities. Also I do not think that it really matters which questions you choose as long as they span a wide enough difficulty range so that you are able to separate participants.
Hmm, do they? The logical reasoning test on that page is a question about lab rat studies on coffee and birth defects, and a hypothetical spokesperson's response that they wouldn't apply a warning label because the government would lose credibility if the study were to be refuted in future. You're then asked a multiple choice question:
1. Which of the following is most strongly suggested by the government’s statement above?
(A) A warning that applies to a small population is inappropriate.
(B) Very few people drink as many as six cups of coffee a day.
(C) There are doubts about the conclusive nature of studies on animals.
(D) Studies on rats provide little data about human birth defects.
(E) The seriousness of birth defects involving caffeine is not clear.
Given the structure of this question I assumed there'd be more than one right answer, but apparently the only "logical" answer is C.
Maybe the word logic is used differently in the legal profession, but this doesn't resemble the kind of logic test I'm used to. It's about the unstated/assumed implications of natural language statements, i.e. what a 'reasonable' person might read into something, rather than tight reasoning to which logical rules could be applied. I can see why that's relevant for lawyers, but it's not really about logic.
Still, let's roll with it. (A) and (B) are clearly irrelevant given the stated justification, strike those. But (C) and (D) appear to just be minor re-phrasings of each other. Why is C correct but D not? An implied assumption of the study is that rat studies provide a lot of data about human birth defects, and the government's position implies that they don't agree with that. D could easily be a reasonable subtext for that position. E could also be taken as a reasonable inference, that is, the government believes there's a risk the study authors are using an exaggerated definition of birth defect that voters wouldn't agree with, and that 'refutation' of the study would take the form of pointing out the definitional mismatch.
So if I was asked to score this question I'd accept C, D or E. The LSAT authors apparently wouldn't.
That said, the "analytical reasoning" sample question looks more like a logic test, and the logic test looks more like a test of analytical reasoning. But even their bus question is kind of bizarre. It's not really a logical reasoning test. It's more like a test to see if you can ignore irrelevant information. The moment they say rider C always takes bus 3, and then ask which bus {any combination + C} can take, the answer must be (C) 3 only. Which is the correct answer.
> I do not think that it really matters which questions you choose as long as they span a wide enough difficulty range so that you are able to separate participants.
The problems here are pointing at a fundamental difficulty: all claims about competence/expertise are relative to the person picking the definition of competent. In this case the tasks are all variants on "guess what the prof thinks the right answer is", which is certainly the definition of competence used in universities, but people outside academia often have rather different definitions.
So the questions really do matter. If the DK claim was more tightly scoped to their evidence - "people who think they're really good at guessing what DK believe actually aren't" - then nobody would care about their results at all. Because they generalized undergrads guessing what jokes Dunning & Kruger think are funny to every possible field of competence across the entire human race, they became famous.
> Given the structure of this question I assumed there'd be more than one right answer
I did not, since the question is explicit about there being one correct answer only:
“Which of the following is most strongly suggested by the government’s statement above?”
> but this doesn't resemble the kind of logic test I'm used to. It's about unstated/assumed implications of natural language statements
Agreed.
> But (C) and (D) appear to just be minor re-phrasings of each other.
I think the key here is “there are doubts”. The government’s position stems from doubts on the conclusive nature of the study, that’s it. The statement doesn’t say anything about how much data studies on rats provide about human birth defects. If we’re being logical, studies on rats provide “no data” on human birth defects. Across many studies with different substances there may be a correlation (p(human birth defect | rat birth defect) = x), but an observation of birth defects on rats for a particular substance gives us data about rat birth defects, not human ones.
Ah yes - is vs are. You're right. I think I assumed there'd have to be >1 right answer after reading the options.
It's a remarkably poor question, but option (C) isn't about doubts on the conclusive nature of this specific study, but rather the nature of all studies on all animals. You could credibly argue (and I'd hope a lawyer would!) that no government would base policy on doubting all animal studies and that their position in this case must therefore be due to something about this specific study, e.g. the usage of rats, or the topic of birth defects, or both. So they could argue that (D) is the most logical answer.
Not that it really matters. Pretty clearly the LSAT authors are using the word logical in the street sense of "makes sense" or "sounds plausible" rather than meaning "based on an inference process that's free of fallacies". If DK based their test of competence on questions like this then it doesn't mean much, in my view.
If the validity or significance of the paper depends on whether LSAT questions are fit for DK's purpose, we have entered a much more subjective realm than whether they mishandled the statistical analysis - but since we are there now, I feel that this particular question is not as bad as it is being portrayed.
Firstly, I think we should put aside the fact that it is labeled as a test of "logical reasoning": it is certainly not a test of formal logical reasoning, and an ambiguous or erroneous label does not necessarily make it a bad question (it is not necessary that it be characterized at all.)
Secondly, we are not logically obliged to accept that it has only one answer among the options presented, though if it has more or less than one while the people who posed it thought it had exactly one, that is a problem (I once was nearly expelled from a class for making this point at greater length than the instructor liked!) On the other hand, the question asks which of the candidate answers is most strongly suggested by the passage, which is not a statement that the others are false.
Here, however, it certainly has no more than one answer among the candidates: there is nothing in the passage that has any bearing on options A, B, D or E - this is perhaps most obvious in the case of B, but the others are like it. In particular, with respect to D, that specific issue is not raised, and furthermore, if it was the government's opinion now that D was the case to the extent of having a bearing on the decision, there would be no need to explain its position in terms of a potential future determination that the tests are inconclusive.
C, on the other hand, is suggested by the government's explanation: if the tests were conclusive, their future refutation would not be a worry.
As I said, this is not a test of formal logic, where the government's response would not imply the possibility of future refutation. Nevertheless, to explain something on the basis of a premise that is only formally possible would be almost as much an informal fallacy as begging the question, IMHO, and one might suspect it is being offered deceitfully (a concept that has no place at all in logic.)
The sort of analysis of natural language called for here (to see what are and are not the issues being considered) is useful and important, for lawyers and the rest of us, and it is, as I have set out above, more objective than "makes sense" or "sounds plausible." If people were more practiced in analytical reading, then corporations, governments and other organizations would less easily get away with blatant non-sequiturs in their explanations of their positions and actions ("there is no evidence the attackers took any personal or confidential information"...)
The second question, labeled analytical reasoning, is probably closer to what you consider a logical reasoning question; maybe they picked questions more like those?
That's the issue - we don't actually know what they did. Which means their claims would have to be taken on faith.
Now, maybe other researchers designed different more rigorous studies that are replicable and which show the same effect. That could be the case. The point I'm making here is that the DK paper isn't by itself capable of proving the effect it claims, and that you don't need a statistical argument to show that. Sanity checking the study design is a good enough basis on which to criticize it.
> It's about memorization of answers to pass law exams, not an actual test itself designed to verify logical reasoning ability. Why do they think this is a good source of questions for testing logic?
I'm not sure what that has to do with anything? The paper doesn't claim to have anything to do with testing logic. It's about people's self-perception in relation to a task at which they are, or are not, competent. That task could be juggling watermelons or strangling geese for all it matters.
> The paper doesn't claim to have anything to do with testing logic.
The paper reports on the results of a 'logic' test administered to undergrads and uses this to define competence. It's a key part of their evidence that the effect is real.
> It's about people's self-perception in relation to a task at which they are, or are not, competent. That task could be juggling watermelons or strangling geese for all it matters.
The specific tasks matter a great deal.
The whole paper relies very heavily on the following assumption: DK can accurately and precisely tell the difference between competence and lack of competence. In other words, that they know the right answers to the questions they're asking their undergrads.
In theory this isn't a difficult bar to meet. They work at a school and schools do standardized testing on a routine basis. There are lots of difficult tasks for which there are objectively correct and incorrect answers, like a maths test.
But when we read their paper, the first two tasks they chose aren't replicable, meaning we can't verify DK actually knew the right answers. Plus the first task is literally a joke. There isn't even a right answer to the question to begin with, so their definition of "competence" is meaningless. The other tasks might or might not have right answers that DK correctly selected, but we can't verify that for ourselves (OK, I didn't check their grammar test but given the other two are unverifiable why bother).
That's a problem because the DK effect could appear in another situation they didn't consider: what if DK don't actually know the right answers to their questions but their students do. If this occurs then what you'd see is this: some students would answer with the "wrong" (right) answers and rate their own confidence highly, because they know their answer is correct and don't realize the professors disagree. Other students might realize that the professors are expecting a different answer and put down the "right" (wrong) answer, but they'd know they were playing a dangerous game and so rate their confidence as lower. That's all it would take to create the DK effect without the underlying effect actually existing. To exclude this possibility we have to be able to check that DK's answers to their own test questions are correct, but we can't verify that. Nor should we take it on faith given their dubious approach to question design.
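To illustrate the mechanism I'm describing, here is a toy sketch with invented numbers (not a claim about what actually happened): suppose the graders' answer key is wrong on some items, some students answer what they believe is true while others answer what they think the graders want, and confidence tracks which game each student thinks they're playing. That alone produces a DK-shaped plot.

    # Hypothetical scenario: the official answer key is wrong on some items.
    import numpy as np

    rng = np.random.default_rng(2)
    n_students, n_items = 200, 20
    key_wrong = rng.random(n_items) < 0.3                  # items where the key is wrong

    guesses_key = rng.random(n_students) < 0.5             # students who answer what
                                                           # they think the graders want
    knows_item = rng.random((n_students, n_items)) < 0.8   # everyone is fairly competent

    # Scored against the (partly wrong) key: "straight" students lose the items
    # where their truly correct answer disagrees with the key.
    correct = np.where(guesses_key[:, None], knows_item,
                       knows_item & ~key_wrong[None, :])
    score = correct.mean(axis=1) * 100

    # Straight students are confident (they know the material); key-guessers
    # hedge because they know they are second-guessing the graders.
    confidence = np.where(guesses_key,
                          60 + rng.normal(0, 5, n_students),
                          85 + rng.normal(0, 5, n_students))

    for q, idx in enumerate(np.array_split(np.argsort(score), 4)):
        print(f"Q{q + 1}: score {score[idx].mean():5.1f}  confidence {confidence[idx].mean():5.1f}")

    # The lowest-scoring quartile (per the key) reports the highest confidence,
    # with no metacognitive failure anywhere in the model.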
> The paper reports on the results of a 'logic' test administered to undergrads and uses this to define competence.
Right, but my point is that 'logic' is simply being used as an example of 'a task'. It's immaterial whether it's actually a good test of logic. As long as you agree that whatever it is is a good example of 'a task', then it's equally probative for the purpose of their argument.
The tasks aren't arbitrary. They're meant to be a proxy for some universal concept of competence. That's why DK is a well known effect, it claims to hold true for anything even though they can't test every possible task.
> we presented participants with tests that assessed their ability in a domain in which knowledge, wisdom, or savvy was crucial: humor (Study 1), logical reasoning (Studies 2 and 4), and English grammar (Study 3).
They picked humor because they think it reflects "competence in a domain that requires sophisticated knowledge and wisdom". They then realized the obvious objection - it's subjective - and decided to do the logical reasoning task to try and rebut those complaints (but then why do the first experiment at all?):
> We conducted Study 2 with three goals in mind. First, we wanted to replicate the results of Study 1 in a different domain, one focusing on intellectual rather than social abilities. We chose logical reasoning, a skill central to the academic careers of the participants we tested and a skill that is called on frequently ... it may have been the tendency to define humor idiosyncratically, and in ways favorable to one's tastes and sensibilities, that produced the miscalibration we observed - not the tendency of the incompetent to miss their own failings. By examining logical reasoning skills, we could circumvent this problem by presenting students with questions for which there is a definitive right answer.
So logical reasoning was chosen because:
1. It's objective.
2. It's an important skill.
3. It's a general "intellectual" skill.
That makes it very important whether it's actually a good test of logical reasoning. If it were truly an arbitrary test like an egg-and-spoon race, then there would be no reason to believe the results would generalize to other areas of life, and nobody would care.
> The tasks aren't arbitrary. They're meant to be a proxy for some universal concept of competence.
I’ve seen absolutely nothing suggesting this. It’s explicitly about task competency; no particular task is specified nor needs to be specified.
> That's why DK is a well known effect, it claims to hold true for anything even though they can't test every possible task.
Yes, they claim it holds true for everything because it’s how human beings introspectively experience being poor at a task. It’s really not necessary to have some Platonic ideal of Task Competency … which is then specifically restricted to logical tasks for reasons known only to you.
> Logical reasoning was chosen because: It’s objective.
I think there’s a kernel of truth in this, albeit assuming by ‘objective’ you instead mean (as people often do) something like “people almost always agree in their evaluations of this quality”. You need that for a good experiment. I’m still not sure how it relates at all to your point here. Personally I would find it easier to just say “I was wrong, it’s not explicitly about logic, I just associated it with that because it’s commonly adduced in silly arguments about logic/intelligence on the internet” - but ah well, it’s an interesting theory so I’m happy to discuss it.
The reason no particular task needs to be specified to invoke DK is exactly that they argue their initial selection of experimental tasks is so general that the effect must apply to everything.
It feels like you and danbruc are inverting causality here. You start from the assumption that DK is a real effect and then say, because it's real and general, it doesn't matter what tasks they used to prove it. But that's backwards. We have to start from the null hypothesis of no effect existing, and then they have to present evidence that it does in fact exist. And because they claim it's both large and very general, they need to present evidence to support both these ideas.
That's why they explicitly argued that their tasks reflect general attributes like wisdom and intelligence: they wanted to be famous for discovering a general effect, not one that only applies in very specific situations.
But their tasks aren't great. The worst are ridiculous, the best are unverifiable. Thus either the evidence that DK is a real and general effect must be taken as insufficient, or you have to widen the argument to include studies by other psychologists that pursue the same finding via different means.
> And because they claim it's both large and very general, they need to present evidence to support both these ideas.
To me the claims in the paper do not really seem that strong, almost to the point that I am not sure if they claim anything at all. If you read through the conclusions, they mostly report the findings of their experiments. The closest thing to any claims about generality I can find is that they discuss in which scenarios their findings will not apply. You could maybe read into this that they claim that in all other scenarios their findings apply, but that is not what they actually do.
But I guess the better way to discuss this is that you just quote the claims from the paper that you consider too strong and unjustified instead of me trying to anticipate what you are referring to or me going over each claim in the paper.
> The tasks aren't arbitrary. They're meant to be a proxy for some universal concept of competence.
This seems at least somewhat wrong to me - the competence is not universal but task specific. They compare how your competence at task X relates to your ability to assess your performance of task X, both in absolute terms and relative to the other participants. They repeat this for different tasks and find that for all tested tasks the same pattern emerges - roughly, the better your performance, the better your ability to accurately assess your own performance and the performance of others.
So you can be competent at task X and provide accurate assessments of task X performances while at the same time being incompetent at task Y and less accurate in assessing task Y performances. This essentially means that you cannot be universally good at assessing performances of arbitrary tasks; you can only do this well for tasks at which you are yourself competent.
For completeness I would add that a good task must allow objectively rating the performance of participants without [much] room for debate. But given that, the whole setup is self-contained and task-independent. Let participants perform the task and establish their competence by rating their performance. Then let participants perform the meta-tasks of rating their performance in absolute and relative terms, and finally check how task and meta-task performances are related.
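A minimal sketch of that self-contained setup with simulated participants (every distribution here is my own assumption for illustration, including the built-in link between skill and self-assessment accuracy):

    # Simulated run of the task / meta-task setup described above.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    skill = rng.normal(size=n)

    # Task performance depends on skill; the self-estimated percentile also
    # depends on skill but is noisier for low-skill participants (assumption).
    score = 60 + 15 * skill + rng.normal(0, 5, n)
    est_noise = np.where(skill < 0, 20, 8)
    perceived_pct = (50 + 20 * skill + rng.normal(0, est_noise)).clip(1, 99)

    actual_pct = 100 * (np.argsort(np.argsort(score)) + 1) / n
    quartile = np.digitize(actual_pct, [25, 50, 75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual pct {actual_pct[m].mean():5.1f}  "
              f"perceived pct {perceived_pct[m].mean():5.1f}")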
I can't quite figure out from this post and the posts after if you have any background in social science or not (you have stated you didn't do social science professionally - but I get a nagging feeling you have studied it) - and I'll try to explain why I think it matters. For what it's worth - I wouldn't necessarily object to what you wrote here if you finished with "great example of why people don't take the social sciences seriously" and left it there. I do have a problem with "shouldn't", although in a different setting (i.e. amongst social science people) I would probably argue for "shouldn't".
Full disclosure - I was a sociological researcher before I started working in IT - and would (I can appreciate the irony, given all of this is about the DK effect) rate myself as very significantly above average in terms of methodological rigour and mathematical skill compared to other social researchers.
One thing that is taught to social researchers - although I've seen it much less with psychologists - is that social research is fundamentally different from natural sciences in that it is accepted as fundamentally subjective. Now, a radical such as myself will tell you that all research, including natural science, is not entirely objective due to very subjective navigation of selection bias, but putting that to the side - this is an extremely important point when evaluating social research.
Coming back to your original point - I would agree with the points you object to vis-a-vis the original DK effect paper. However, as a social researcher, I already come to that paper knowing that I'll have to take it with spoonfuls of salt. There is no need to write the paper in a way that puts in many of the disclaimers you might expect, because we are institutionally taught that these disclaimers apply.
Having said that - one of my peeves with social research, and why I ultimately left, is that a lot of garbage goes on and gets through peer review. There is almost no proper testing of quantitative instruments and methods. Which is why I agree with your point that it rightfully isn't taken seriously - but I would object to your assertion that it shouldn't be taken seriously, especially amongst IT professionals who are already going to have a bias against non-STEM. Point out the shortcomings and apply a different interpretive lens, rather than discounting the field completely - social science could be better and be taken seriously if it were held to a higher standard, even with the methodological shortcomings we have today - but it is very often discounted wholesale, which I don't think is going to incentivise the bubble that is forming around it to reform and get better.