This was submitted a couple hours ago by the post author but was deleted for some reason. Reposting my criticism:
1) Distribution of Resume Scores for Each Group box plot: the two distributions look essentially equivalent at a glance, and eyeballing the box plots alone doesn't support asserting a statistically significant difference between group scores.
2) Overall Accuracy histogram: You state that 64% of the resumes are strong, so a person who simply labels every resume as strong would score about 64% accuracy. The most common outcomes are 3/6 and 4/6 correct, which fits that baseline, and is why accuracy isn't always the best metric for judging an experiment, especially with relatively little data. (Also, that distribution is definitely not normal.)
3) "None of the differences between participant groups were statistically significant (p < 0.05). In other words, all groups did equally poorly." That's not what a statistical significance test determines: it determines whether the difference between two statistics (in this case, the accuracies of two participant groups) can be attributed to chance (i.e. if p < 0.05, a difference at least as large as the one observed would occur less than 5% of the time under the null hypothesis, so it is unlikely to be due to chance alone). A non-significant result doesn't show the groups performed equally; it only means the data can't rule chance out (a quick sketch of what such a test actually computes is at the end of this comment).
4) A helpful note: the study can be modeled as a binomial distribution with 6 trials and probability of success = 0.64 (assuming the participant always guesses "strong"): http://www.wolframalpha.com/input/?i=binomial+distribution%2...
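For concreteness, here's a minimal sketch of that model in Python; the only assumption beyond the post's numbers is the always-guess-"strong" strategy:

    from math import comb

    # Each participant rates 6 resumes. If 64% of resumes are "strong" and a
    # participant simply guesses "strong" every time, the number they get
    # right follows a Binomial(n=6, p=0.64) distribution.
    n, p = 6, 0.64

    for k in range(n + 1):
        pmf = comb(n, k) * p**k * (1 - p)**(n - k)
        print(f"P({k}/6 correct) = {pmf:.3f}")

The mass concentrates around 3-4 correct out of 6, which matches the histogram's most common outcomes, so the observed accuracies are consistent with people doing little better than that baseline.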
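And on point 3, a significance test only answers "could a gap this size plausibly arise by chance?" A minimal sketch of a two-proportion z-test with made-up per-group tallies (the post doesn't give them, so the numbers below are purely illustrative):

    from statistics import NormalDist

    # Hypothetical tallies: group A got 40/60 answers right, group B got 35/60.
    # A two-proportion z-test asks: if both groups had the same true accuracy,
    # how often would we see a gap at least this large?
    x_a, n_a = 40, 60
    x_b, n_b = 35, 60

    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                # pooled accuracy under the null
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided p-value

    print(f"z = {z:.2f}, p = {p_value:.3f}")
    # p > 0.05 here means the data can't rule chance out; it does NOT mean
    # the two groups performed equally well (or equally poorly).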