
From the "methodology" section of this post:

First, we randomly sampled 6,348 applications for 668 different users from TalentWorks. Then we extracted the qualifications from the original job postings and the users’ submitted resumes using proprietary algorithms. Finally, we grouped the results based on qualification match and regressed the interview rate using a Bagging ensemble of Random Forest regressors.
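
For concreteness, the modeling step they describe would look roughly like this in scikit-learn terms. A minimal sketch only, with invented file and column names, since the "proprietary" extraction step isn't reproducible:

    import pandas as pd
    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

    # Hypothetical per-application data; the real features came out of
    # their proprietary qualification-extraction step.
    df = pd.read_csv("applications.csv")
    X = df[["qual_match"]]      # fraction of listed qualifications met
    y = df["interview_rate"]    # interview rate per qualification-match group

    # "Bagging ensemble of Random Forest regressors", taken literally:
    model = BaggingRegressor(RandomForestRegressor(), n_estimators=10)
    model.fit(X, y)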

This is... not plausible. Effectively, they're trying to infer causality here, not merely do prediction. That has to be the case, because this is presented as useful advice: go ahead and apply even if you don't meet all the listed qualifications. But when you're trying to infer causality you're doing social science, not data science, and that means you need to worry about omitted variables.

Here's an example: what if less qualified people who nonetheless apply are more confident? And what if that confidence is associated with other good things that show up on resumes, like attending prestigious schools, having had prestigious prior jobs, having a record of success in some other fashion, or even just paying careful attention to formatting?
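
A toy simulation makes the problem concrete. Everything below is invented; it just shows that when confidence drives both applying-under-qualified and interviewing well, a model that only sees qualification match picks up a spurious signal:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6348
    confidence = rng.normal(size=n)

    # Confident people apply despite matching fewer listed qualifications...
    qual_match = np.clip(0.5 - 0.1 * confidence + rng.normal(0.0, 0.1, size=n), 0.0, 1.0)

    # ...and confidence also shows up elsewhere on the resume, which is
    # what actually wins the interview in this made-up world.
    interviewed = (0.5 * confidence + rng.normal(size=n)) > 1.0

    # Lower match "predicts" interviews, with zero causal effect behind it.
    print(np.corrcoef(qual_match, interviewed.astype(float))[0, 1])  # negative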

This is why social scientists use tried-and-true techniques like old-fashioned OLS regression with control variables, rather than throwing everything into a random forest and seeing whether the hypothesized association, standing on its own, predicts things.
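
A minimal sketch of what that looks like, assuming (hypothetically) you had columns for the confounders; the point is that the coefficient on qual_match only means something once the controls are in the model:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("applications.csv")  # hypothetical per-application data

    # Linear probability model: interviewed is 0/1, and school_rank /
    # years_experience stand in for the omitted variables discussed above.
    model = smf.ols(
        "interviewed ~ qual_match + school_rank + years_experience",
        data=df,
    ).fit()
    print(model.summary())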

(Insert remark here about how companies should be hiring data science people with science backgrounds rather than just pure CS backgrounds.)




Very well put; I agree completely. The absence of controls in their model for school, education, experience, etc. means we cannot draw definitive conclusions from their analysis.

Another commenter also pointed out the fit curve. Note the wide prediction intervals starting at ~40% match, and how few points there are across all the charts. It's hard to draw conclusions above that cutoff, since outcomes vary widely there and high matchers are scarce. This may also be a failure of the "proprietary" qualification-extracting algorithms.

Most applicants also seemingly interview at a <10% rate, and the data in general looks fishy. I know they sampled ~6,300 applications, but the joint distribution of matches and interviews seems bimodal: either you barely match and barely interview, or you match strongly and interview far more often.

Weird, weird post. It should be titled "We rarely see a >50% match with job requirements, which says something about either applicants or our proprietary algorithms."


All true, but I think even more fundamentally, they treat all requirements as equally important and report results based on the % of requirements met. The requirement to, for example, know at least one programming language is different from the requirement to have experience with a particular IDE. Someone who doesn't meet the latter is probably much more likely to nonetheless get a callback than someone who has no programming experience, even though each might count as one "requirement".

Secondly, the people applying with <100% of the requirements know this and are trying to guess whether they meet all of the _actually_ required criteria. The ones who apply anyway and get an interview might just be better at guessing which "requirements" are real.


Thank you! Finally some reason.



