I'm interviewing people for multiple DS positions (subtle recruiting thing there...) at the moment and it's not fun.
The number of people who can't work out what kind of solution a DS scenario needs is very disappointing. I'm not even talking about giving a "correct" solution: most can't even work out the class of problem!
Here's something to think about: Are you doing visualization? Building some kind of model to explain existing behavior? Building a predictive model? Is it supervised or unsupervised?
This is pretty basic stuff (surely it's close to the FizzBuzz of data science?), and yet it is borderline impossible to find people who just nail it.
It's because "data science" doesn't mean anything. People call themselves "data scientists" after they've set up a toy Hadoop cluster between their desktop and their laptop, and the meaning, expectations, and responsibilities associated with the title vary from organization to organization, much like "business analyst".
Some people think that anyone who uses a SQL RDBMS doesn't qualify as a "data scientist" and that the role is limited to people who have experience in "big data".
In general, if you want better applicants for this type of position, your job description should be explicit about the actual activities associated with the job, you should post it where people who know how to do those things hang out, and you should make sure it's apparent that you're willing to compensate well. You'll still get plenty of bad applicants, because every job posting does, but this should help narrow the pool a little bit and clue in some good people that you're worth applying to.
So the answer to your question is basically "Well, what is a 'data scientist'?"
I also hire data scientists & have a similar experience. As far as I can tell, many people are taught to start from a statistical/machine learning method and apply it to a problem, but very few are taught to start from a problem and figure out what techniques to use. Honestly, 95% of the time I solve my questions through iterative SQL queries in a few hours, while I see most people using laborious statistical methods the first chance they get.
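To make "iterative SQL queries" concrete, here is a minimal sketch of the pattern, not the actual queries behind the comment above. The `orders` table, its columns, and every row are invented for illustration, and Python's built-in `sqlite3` stands in for whatever database is really in use. The idea is that each query is shaped by what the previous one returned.

```python
import sqlite3

# Hypothetical data: the table name, columns, and rows are assumptions
# made up for this example, not taken from any real system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, channel TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, channel, amount) VALUES (?, ?, ?)",
    [
        ("north", "web", 120.0),
        ("north", "store", 80.0),
        ("south", "web", 40.0),
        ("south", "web", 30.0),
        ("south", "store", 200.0),
    ],
)

# Pass 1: start broad. Which region brings in the most revenue?
by_region = conn.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM orders GROUP BY region ORDER BY revenue DESC"
).fetchall()

# Pass 2: narrow down based on pass 1. Within the top region,
# which channel accounts for the revenue, and from how many orders?
top_region = by_region[0][0]
by_channel = conn.execute(
    "SELECT channel, SUM(amount) AS revenue, COUNT(*) AS n_orders "
    "FROM orders WHERE region = ? "
    "GROUP BY channel ORDER BY revenue DESC",
    (top_region,),
).fetchall()

print(by_region)
print(by_channel)
```

In this toy data the second query shows that a single large store order, rather than web volume, explains the top region. A couple of GROUP BYs can settle that kind of question before any modeling is considered.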
> 95% of the time I solve my questions through iterative SQL queries in a few hours, while I see most people using laborious statistical methods the first chance they get.
The issue seems to be a mismatch between your posting and your workload.
I do hiring for a data team, and explicitly don't advertise a data science role. While we do have projects that are advanced enough to fall under a data science moniker, the majority of candidates we got for that role had very... academic expectations. But a business isn't a static, cleanroom environment with everything already collected, cleaned, standardized, validated, and normalized for use.
Re-titling the job posting to Data Specialist or Data Analyst resulted in a lot more candidates who are perfectly well suited to the type of problem solving you mentioned. There's an endless number of business problems where this skillset can be applied, which makes these candidates very flexible and keeps them highly utilized. That includes getting to a "good enough" state on the few problems we have that could benefit from the more advanced statistical methods a data science candidate would bring to the table.
Yeah, to be clear, I pretty much totally revamped the hiring process once I became a manager, and was speaking mostly from previous experience. I've found that splitting up the job into different titles ("Data Analyst", "Data Scientist", and "Data Engineer"), depending on the actual role, works pretty well.
That said, even with the vast majority of analyst candidates, I find them very eager to apply known methods; flexibility and problem-first thinking are rare and extremely valuable.
Those titles. What type of roles do they cover? Is there a quick summary -- particularly between analyst and scientist. I expect engineer is source quality, repeatability, accuracy, precision, feature engineering, etc. In other words making the data stable and easily consumed, whether that is directly from the instrument or the charts for the final decision.
The nuance between analyst and scientist is less clear. Can you describe what type of candidates the two titles draw, or what you look for depending on the title?
My job title is currently "Data Engineer" and I work in an industrial plant. Here's my two cents:
My background is in Engineering (I'm a materials engineer by qualification). What differentiates me from a statistician, analyst etc. is my domain knowledge. I have almost 15 years' experience working with industrial processes. I have the background knowledge of chemistry, thermodynamics, mechanics etc. that someone with a stats background would be lacking. So when I am asked to optimize an industrial process I can utilize that expertise whilst developing models.
I would expect that a data scientist would know more about machine learning and would have a much stronger stats background than me. They'd also probably write much better code (I work in C/C++ and SAS; from what I have seen, data scientists tend to be Python/R focused).
Not the OP, but in my experience a "Data analyst" is mostly responsible for writing analytical SQL queries and generating reports. So they don't require a strong math background or programming skills (other than SQL).
Call it "the Kaggle effect" - once someone defines the problem and the metric you'll be graded on and gives you a relatively clean dataset, "solving the problem" is just as simple as importing xgboost and plowing away. But people without much job experience often under-appreciate how hard it is to get to that point. The OP article touched on it a bit, but really, the most difficult job a data scientist has is defining what problem they're trying to solve and getting buy-in from other business stakeholders. And frankly, no data science masters program or boot camp can teach those skills.
Thing I've noticed the most is that many people don't fully understand why we need to apply a specific test.
The core concept many people seem to miss is that the point of data science is to find meaning in large quantities of data, to recognize patterns, and to present them in a meaningful and easy-to-understand way. Really, it's there to allow for educated, data-driven decision making. Each approach is a tool for reaching an informed conclusion, but if you apply them the wrong way then... well, could be worrisome down the road.
Understanding the full problem and then finding the right tools or approaches to solve it is necessary instead of putting everything inside a black-box model.
I've had similar experiences. Updating the job posting can help to weed out some candidates. I've moved more toward giving short take-home problems that basically test whether the candidate has the basic skills. The goal might be to perform a simple classification task on toy data, but along the way you'll need to use git, access a database, use an API, etc. The candidates usually find out for themselves whether they'd be a good fit.
This is fine, but make sure the take-home test is truly respectful of candidate time. It shouldn't take more than 2 hours max for the minimally-acceptable candidate. The best candidates have a lot of options and therefore see little reason to commit a large quantity of their personal time to any particular prospective employer.
Some people are doing the take-home test but they make it into a multi-week ordeal, or say it should take "about a full work day", which always stretches out into 4-5 evenings after the real world is accounted for. I've never had enough interest in working somewhere to finish those long take-home tests; I always get like half to three-quarters of the way done before I decide I don't really want to work there that much anyway.
While the take-home approach has its drawbacks (biased towards candidates with more time, like those without families), I think it's great you test the basic tools needed to work in this era. Classification aside, the rest would be good checks for most non-DS software development roles.
Why is this?