The Data Science Process (springboard.com)
182 points by EternalData on Feb 13, 2017 | 64 comments



I dislike that "substantive domain knowledge" has here been replaced with "communications skills." Science stands on the shoulders of giants, and ignoring what has come before is no way to stand on them.

Being able to spin a good yarn isn't really enough here. If data science just becomes a code word for brogramming your way through a set of black-box ML algorithms, then I will welcome the inevitable crash of data science.

Classic applied statistics plus reproducibility feels like a much better story, at least if insight rather than "making it go" is the goal.


>If data science just becomes a code word for brogramming your way through a set of black-box ML algorithms, then I will welcome the inevitable crash of data science.

A fundamental challenge I see here is how bottom-heavy data science feels now. There are tons of people out there trying to "get into data science" from other fields, but the number of people with substantive domain knowledge, strong programming skills, and the math background to be able to understand the ML black boxes is quite small relative to the number of people calling themselves data scientists. In other words, real insight definitely is (or should be) the goal, but real insight is really hard, and scikit-learn is so easy.

My hope is that this improves over the next 5-10 years - the more mature data science becomes as a discipline/career, the better the education will be and the more experienced people there will be. There is a risk in the mean time, though, that a flood of relatively inexperienced people causes a collapse in expectations for data science, making businesses less eager to hire them in the future.


Strongly disagree. Maybe that's the case at a huge company, but most small organizations I've worked with are extremely top-heavy, filled with STEM PhDs who are very capable, but require 1-3 years to get a useful result and aren't often familiar with programming best practices or how to turn their results into a product. You need a larger team of engineers to make that happen, and if there's a large overlap between those engineers and people familiar with machine learning, that transition is much easier.

Furthermore, there are a number of practitioners who expect their data to be ready for them in some perfect state. Probably the majority of the task is creating a pipeline for acquiring data and labeling it appropriately if necessary, which may require developing some ontology or classification with rigid guidelines such that someone in India can delegate the task to a large team. Then the practitioner spends an inordinate amount of time optimizing some heuristic whose meaning drifts over time, or is completely inconsistent with the goals of the product. These are both problems outside the realm of domain knowledge or experience.


Sorry, I might not have been clear about what I meant by "bottom-heavy". I think we actually agree - as someone who's hiring for DS roles right now, I've seen a ton of exactly what you're talking about.

- Some candidates can write great code, but don't have the math background to understand what ML black boxes are doing.

- Then there are STEM PhDs who have never written non-research (i.e. maintainable) code or had to turn a qualitative business problem into a quantitative problem they can solve.

Both types of candidates need to come in at a "junior" level and do some on-the-job learning in order to be fully successful data scientists. IMO it appears to be easier to teach STEM PhDs how to code than programmers how to do math, but that might be personal bias (since I came from the former group).


Wonder if the finance roles of quants and quant devs will spread to other industries. Quant devs are math-heavy programmers who might not do original research but can still understand/calibrate/implement the models the pure quants produce. I.e., given an abstract paper with a shiny model (or a hacked-together spreadsheet...), the quant dev might need to analyze which Monte Carlo error correction strategies are relevant for the problem, or how a certain market's peculiarities might influence calibrations, etc.
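To give one toy illustration of the "error correction" part (antithetic variates, a single variance-reduction trick among many; made-up parameters, not production code):

  import numpy as np

  def mc_call_price(s0, k, r, sigma, t, n, antithetic=False, seed=0):
      rng = np.random.default_rng(seed)
      z = rng.standard_normal(n)
      payoff = lambda zz: np.exp(-r * t) * np.maximum(
          s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * zz) - k, 0.0)
      # pair each draw with its mirror image; the negative correlation lowers variance
      samples = 0.5 * (payoff(z) + payoff(-z)) if antithetic else payoff(z)
      return samples.mean(), samples.std(ddof=1) / np.sqrt(len(samples))

  # same number of path evaluations (100k) in both runs
  print("plain:      price=%.3f, std.err=%.4f" % mc_call_price(100, 100, 0.02, 0.2, 1.0, 100_000))
  print("antithetic: price=%.3f, std.err=%.4f" % mc_call_price(100, 100, 0.02, 0.2, 1.0, 50_000, antithetic=True))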

Also, quant devs are heavily involved in building the calculation engines that invoke the models. These engines handle real-time dataflow, calibrations, etc., and are often highly non-trivial.

My guess is that that type of role is relevant in a data science context. This is much more than data cleansing and piping data between databases.


Heck, when I was in school for my CS degree, some people with literature undergrad degrees went straight into CS graduate programs without too much pain.

Turns out programming never really required much of a math background; the level of brain-teaser that programming poses is about on par with a math education. So anyone who has survived an advanced math degree can pick up programming like it's a piece of cake, but that doesn't mean people from non-STEM backgrounds are hopeless at mastering data science.

Yet it's a joke to talk about data science without reference to advanced math concepts. Even granting the importance of domain knowledge, data science is not just business analysis aided by a spreadsheet. Modelling is an essential part of it.


> Both types of candidates need to come in at a "junior" level

Then what's the point of the Ph.D.? Why not just go straight from B.S. to junior data scientist then?


In theory, any programmer worth their salt would already know a massive amount of math (comparatively) and should be readily capable of learning more. If you program without a solid understanding of the underlying math, you're not programming. You're typing until it compiles.


Disagree. As someone who knows more mathematics and less programming than the average programmer I'd say the average programmer need not know all that much mathematics at all, if they're not working in a particular area that involves mathematics.


You must already know vector math or be capable of learning it in less than a day. If you don't have that aptitude, then I'd put you higher up the stack.


Vectors aren't terribly advanced mathematics.


You're joking, right? What math, algebra 2 math?


I come from a quantitative social sciences background. I can't math quite as well as your average STEM PhD, but I like to think those of us in the social sciences do a pretty good job of building good questions to parameterize squishy qualitative business objectives.


From my experience, the biggest hindrance to the future of data science is how crappy it is to learn statistics. And I think this is why a lot of data science courses stop at Z-tests and p-values or super basic Bayes' theorem. I think mathematicians and statisticians have a lot of work to do to make the more advanced parts of the field more accessible; otherwise we will end up with people ignoring important assumptions and using the tools like a black box.


To be fair, learning statistics is hard for the same reason that doing statistics is hard - any statistic involves assumptions, and the different assumptions underlying different models can be very subtle. There's a lot of disagreement among even professional, academic statisticians about fundamental concepts like p values [1] and how to quantify uncertainty under multiple hypothesis testing [2]. Unfortunately, I don't see any of this getting easier any time soon (although I would love to be proven wrong).

[1] http://www.tandfonline.com/doi/full/10.1080/00031305.2016.11...

[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1112991/


Moving away from Null Hypothesis Testing and towards a more Bayesian approach is a good first step. For me, and I'm sure many others, NHT is a very backwards way of approaching inference. I don't care about an imaginary distribution with mean 0, I have real data I can fit to a distribution directly--what can you tell me about it? Conditioning on the data itself rather than an unobservable parameter of interest is much more intuitive and makes it much easier to report results to non-statisticians.
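To make that concrete, here's a toy Python sketch (my own illustration, assuming a normal likelihood with known sigma and a vague prior so the conjugate update stays to two lines) of the same data analyzed both ways:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  data = rng.normal(loc=0.3, scale=1.0, size=50)       # stand-in for the real data

  # NHT: test against an imaginary distribution with mean 0
  t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)
  print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

  # Bayesian: condition on the data to get a posterior for the mean
  # (conjugate normal-normal update, known sigma = 1, vague N(0, 10^2) prior)
  sigma, prior_mu, prior_sd, n = 1.0, 0.0, 10.0, len(data)
  post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
  post_mu = post_var * (prior_mu / prior_sd**2 + data.sum() / sigma**2)
  post_sd = post_var ** 0.5

  # Statements a non-statistician can act on directly:
  print(f"P(mean > 0 | data) = {1 - stats.norm.cdf(0, post_mu, post_sd):.3f}")
  lo, hi = stats.norm.interval(0.95, loc=post_mu, scale=post_sd)
  print(f"95% credible interval for the mean: ({lo:.2f}, {hi:.2f})")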


I completely agree; I've found it much harder to self-learn the stats than the software side of things. Sibling post makes a good point, but I think the history of stats vs. comp sci bears weight here too; having many people want to learn stats outside academia is a much newer phenomenon than people doing the same with programming.

Anyone have any good resources for self-teaching stats? I have a BS in math but only took one stats course, and it was as terrible as all intro-stats classes are. I have a strong, proof-based understanding of probability theory, but haven't found a similar approach to stats. It all seems to be "if data looks like this, use this test, watch for these pitfalls" which is terrible for building intuition.


Try the Khan Academy stats resources - https://www.khanacademy.org/math/statistics-probability

Datacamp also launched a bunch of new stats courses recently. I haven't checked them out yet, but their courses are usually good quality. https://www.datacamp.com/courses/topic:probablity_and_statis...


If you like proofs and rigor, take a look at "Statistical Inference" by Casella and Berger.


It's very hard to find people with both deep domain knowledge and deep math/statistics knowledge, in the same way that it's often hard to find people with deep programming knowledge and deep business knowledge.

We solve the latter problem by having business analysts or product managers that "get" the technology enough to provide direction, even if they wouldn't be effective implementing it themselves. I think there's a next phase where, as we try to do data-science at scale, we look for a similar role that deeply understands the business and knows enough about the analytical techniques to define the problem and work with a team of specialists to figure out the best analytical approach.

People talk about data science teams being multifunctional - with programmers, data engineers, data scientists, and designers - but we always leave out the role for someone with deep business expertise and shallow but meaningful data science expertise.


As the author of the OP, I must say this is very well put. Part of the problem is that there is no fixed 'role' for the person with the 'deep business expertise and shallow but meaningful data science expertise'. In my experience, it could be a bunch of different people. When I was in a network security startup, this expert would typically be a malware analyst. In other companies, depending on the project, it could be someone from Product, Sales or Marketing. Similar to designers, a data scientist is expected to figure out who the main stakeholders are and get them engaged in the process, instead of the business stakeholder being part of the data science team per se.


I think another thing that could contribute to a crash is companies hiring for data scientist roles without really being sure what types of problems they need them to solve. "We have TBs of data, maybe they can turn it into money." However, if the 'data scientists' weren't/aren't even involved in deciding what data to collect to solve certain problems (or the problems to solve aren't even known), it makes for very ambiguously drawn-up goals and expectations.

Also, depending on the political priorities of the organization, data science may not even be really used. Executives/management may look for analysis results to support their ideas, and just throw out the ones that don't align with what 'they already knew to be right.' After all, who wants to be proven wrong?

EDIT: One anecdote -- I worked for a company and showed pretty plainly that the length of customer engagement had fallen since the previous year. My boss basically said "why did you point that out?" because it made them look bad to the owner of the business.


>If computer science becomes code word for brogramming your way through a set of black-box algorithms, then I will welcome the inevitable crash of computer science.

huh. Hasn't crashed yet.


I'm interviewing people for multiple DS positions (subtle recruiting thing there...) at the moment and it's not fun.

The number of people who can't work out what kind of solution a DS scenario needs is very disappointing. I'm not even talking about giving a "correct" solution: most can't even work out the class of problem!

Here's something to think about: Are you doing visualization? Building some kind of model to explain existing behavior? Building a predictive model? Is it supervised or unsupervised?

This is pretty basic stuff (surely it's close to the FizzBuzz of data science?), and yet it is borderline impossible to find people who just nail it.
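Roughly, I'm just asking candidates to map the scenario onto distinctions like these (toy scikit-learn sketch with made-up data):

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.linear_model import LinearRegression, LogisticRegression

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 3))

  # Predicting a known label -> supervised classification
  y_label = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
  clf = LogisticRegression().fit(X, y_label)

  # Explaining existing behavior -> the coefficients, not the predictions, are the deliverable
  y_cont = 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
  print(LinearRegression().fit(X, y_cont).coef_)

  # No target at all -> unsupervised, e.g. segmenting the observations
  segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)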

Why is this?


It's because "data science" doesn't mean anything. People call themselves "data scientists" after they've set up a toy Hadoop cluster between their desktop and their laptop, and the meaning, expectations, and responsibilities associated with the title vary from organization to organization, much like "business analyst".

Some people think that anyone who uses a SQL RDBMS doesn't qualify as a "data scientist" and that the role is limited to people who have experience in "big data".

In general, if you want better applicants for this type of position, your job description should be explicit about the actual activities associated with the job, you should post it where people who know how to do those things hang out, and you should make sure it's apparent that you're willing to compensate well. You'll still get plenty of bad applicants, because every job posting does, but this should help refine it a little bit and clue in some good people that you're worth applying for.

So the answer to your question is basically "Well, what is a 'data scientist'?"


I also hire data scientists & have a similar experience. As far as I can tell, many people are taught to start from a statistical/machine learning method and apply it to a problem, but very few are taught to start from a problem and figure out what techniques to use. Honestly, 95% of the time I solve my questions through iterative SQL queries in a few hours, while I see most people reaching for laborious statistical methods the first chance they get.


   > 95% of the time I solve my questions through iterative SQL queries in a few hours, while I see most people using laborious statistical methods the first chance they get.
The issue seems to be a mismatch between your posting and your workload.

I do hiring for a data team, and explicitly don't advertise a data science role. While we do have projects that are advanced enough to fall under a data science moniker, the majority of candidates we got for that role had very... academic expectations. But a business isn't a static, cleanroom environment with everything already collected, cleaned, standardized, validated, and normalized for use.

Re-titling the job posting to Data Specialist or Data Analyst resulted in a lot more candidates that are perfectly well suited to the type of problem solving you mentioned. There's an endless number of business problems where this skillset can be applied, making them very flexible and providing high labor utilization. Including getting to a "good enough" state for the few problems we have that could benefit from the more advanced statistical methods a data science candidate would bring to the table.


Yeah, to be clear, I pretty much totally revamped the hiring process once I became a manager, and was speaking mostly from previous experience. I've found splitting up the job into different titles ("Data Analyst", "Data Scientist", and "Data Engineer") depending on the actual role to work pretty well.

That said, even with the vast majority of analyst candidates, I find them very eager to apply known methods; flexibility and problem-first thinking is rare and extremely valuable.


Those titles: what types of roles do they cover? Is there a quick summary, particularly of the difference between analyst and scientist? I expect engineer covers source quality, repeatability, accuracy, precision, feature engineering, etc. In other words, making the data stable and easily consumed, whether that is directly from the instrument or in the charts for the final decision.

The nuance between analyst and scientist is less clear. Can you describe what types of candidates the two draw, or what you look for depending on the title?


My job title is currently "Data Engineer" and I work in an industrial plant. Here's my two cents:

My background is in Engineering (I'm a materials engineer by qualification). What differentiates me from a statistician, analyst, etc. is my domain knowledge. I have almost 15 years of experience working with industrial processes. I have the background knowledge of chemistry, thermodynamics, mechanics, etc., which someone with a stats background would be lacking. So when I am asked to optimize an industrial process, I can utilize that expertise whilst developing models.

I would expect that a data scientist would know more about machine learning and would have a much stronger stats background than me. They'd also probably write much better code (I work in C/C++ and SAS, from what I have seen data scientists tend to be Python/R focused).


Not the OP, but in my experience a "Data analyst" is mostly responsible for writing analytical SQL queries and generating reports. So they don't require a strong math background or programming skills (other than SQL).


Call it "the kaggle effect" - once someone defines the problem and the metric you'll be graded on and gives you a relatively clean dataset, "solving the problem" is just as simple as importing xgboost and plowing away. But there is often an under-appreciation among people without much job experience how hard it is to get to that point. The OP article touched on it a bit, but really, the most difficult job a data scientist has is defining what problem they're trying to solve and getting buy-in from other business stakeholders. And frankly, no data science masters program or boot camp can teach those skills.


The thing I've noticed the most is that many people don't fully understand why we need to apply a specific test.

The core concept many people seem to miss is that the point of data science is to find meaning in large quantities of data, to recognize patterns, and to present them in a meaningful and easy-to-understand way; really, to allow for educated, data-driven decision making. Each approach is a tool to help you form an informed view, but if you apply them the wrong way then... well, it could be worrisome down the road.

Understanding the full problem and then finding the right tools or approaches to solve it is necessary instead of putting everything inside a black-box model.


I've had similar experiences. Updating the job posting can help to weed out some candidates. I've moved more toward giving short take home problems that basically test whether the candidate can perform basic skills. The goal might be to perform a simple classification task on toy data, but along the way you'll need to use git, access a database, use an API, etc. The candidates usually find out for themselves whether they'd be a good fit.
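The core of the submission ends up being something like this (table and column names are made up for illustration; the real test supplies its own):

  import sqlite3
  import pandas as pd
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  conn = sqlite3.connect("takehome.db")   # hypothetical database provided with the test
  df = pd.read_sql_query("SELECT feature_1, feature_2, label FROM samples", conn)

  X, y = df[["feature_1", "feature_2"]], df["label"]
  print("mean CV accuracy:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())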


This is fine, but make sure the take-home test is truly respectful of candidate time. It shouldn't take more than 2 hours max for the minimally-acceptable candidate. The best candidates have a lot of options and therefore see little reason to commit a large chunk of their personal time to any particular prospective employer.

Some places do the take-home test but make it into a multi-week ordeal, or say it should take "about a full work day", which always stretches out into 4-5 evenings once the real world is accounted for. I've never had enough interest in working somewhere to finish those long take-home tests; I always get about half to three-quarters of the way done before I decide I don't really want to work there that much anyway.


While the take-home approach has its drawbacks (it's biased towards candidates with more time, like those without families), I think it's great you test the basic tools needed to work in this era. Classification aside, the rest would be good checks for most non-DS software development roles.


Probably because you are offering too low compensation?


This one isn't it!


Is this data science? This is the process for being a good BI/Marketing/Web Analyst. I use a variation of this process, but I've never really considered this to be what Data Scientists are meant to be - I always saw Data Scientists as being more specialized in statistics and algorithms, with less specialization in domain knowledge and stakeholder communication.

If someone needs to improve their conversion funnel and help with segmentation and reporting, they need an analyst. If you want to build an algorithm to determine what content is shown to each customer when they make a request, you need data scientists.


Agreed. Too many people confuse basic BI for data science.


This is missing an important component of the process. If you don't want to have to reinvent the wheel every time you are asked to do a certain type of analysis, you also have to set up some infrastructure to support your analytic pipeline. That involves understanding databases, writing scripts to automatically harvest data, possibly creating APIs for your data to support flexible analytic views, etc. The more time I spend in data science, the more time I find myself spending on these types of infrastructural tasks. It's great to work for a company that provides engineers that will do all of this for you, but those companies aren't super common.
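Even a bare-bones version of that harvesting side is its own piece of software (sketch with a placeholder endpoint; a real pipeline adds scheduling, retries, schema checks, and monitoring on top):

  import json
  import sqlite3
  from datetime import datetime, timezone

  import requests

  API_URL = "https://example.com/api/v1/events"        # placeholder, not a real endpoint

  def harvest(db_path="warehouse.db"):
      rows = requests.get(API_URL, timeout=30).json()  # assumes the API returns a JSON list
      conn = sqlite3.connect(db_path)
      conn.execute("CREATE TABLE IF NOT EXISTS raw_events (fetched_at TEXT, payload TEXT)")
      fetched_at = datetime.now(timezone.utc).isoformat()
      conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                       [(fetched_at, json.dumps(r)) for r in rows])
      conn.commit()
      conn.close()

  if __name__ == "__main__":
      harvest()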


BTW this is what we're hacking on at NStack. We're building an analytics platform that gives you a high-level language providing an abstraction over infrastructure. The aim is that data teams can productionize code without thinking about anything but business logic and without requiring an engineering team. So you can write things like..

  nstack start "Schedule { interval : 'Daily' } | DataWarehouse { sql : './request.sql' } | YourPythonClassifier | Postgres { insert_table : 'Results' }"
..which then gets distributed on your cloud-provider. You can kind of think about it like a type-safe, distributed cloud bash!

I'm clocking off for bed, but would love to give you a demo if you're interested: leo@nstack.com.


I think the problem is that the industry hasn't created a product management layer that interfaces between non-technical business folk and the technical data scientists. In software engineering, we don't expect the customers to speak directly to the engineers; that's what product managers are for (cf. the infamous Office Space scene).

However product managers aren't typically involved in solving data science related problems. This is primarily because most product managers don't have the math/stat/compsci background to be useful.

However I predict this will change in the next 5 years.


Agree with this. In fact, Lead Data Scientist roles often become de facto PM roles, where the LDS basically spends their time prioritizing the important research questions DS has to solve based on customer and business needs.

I've been hearing from multiple people that this is a gap that's really hard to fill right now -- PMs who can work with heavy DS and AI products. It's much easier to train experienced data scientists to be PMs than the other way round.


Agree with this. In fact, it's already changing in a couple of domains. (1) Teams that build data products often have product managers. (2) Data analytics consulting teams have project managers who interface between the clients and the analysts.


this looks suspiciously identical to what software engineering has always been about.

There's nothing new in "data science", as per this post, beyond what has always been true of building a piece of software for non-technical clients. It has always been true that having domain expertise provides a huge boost. It has always been true that requirements are moving targets, that objectives are fluid, and that clients don't talk computer science ("data science" in this case). Clients (internal or external) often don't know how to describe their own workflows, and especially edge cases, in rigorous ways. All deja vu. It has always been true that you need to "frame the problem", "clean the data", "design and apply the algos", and "communicate the results". We've been grappling with this for 50 years.


This article is a good overview of the why of data science and statistics-based decision making, but doesn't discuss much of the how, or the various warnings that come up during the process (i.e. data gathering/fidelity issues which invalidate models).

The article is marketing for a data science bootcamp which likely answers those questions. There has been a lot of discussion on HN about the merits of bootcamps for developers, but not much about the merits of bootcamp for statisticians, or even the entire hiring workflow in that field.

I've been looking into Data Analyst/Science jobs at companies in the San Francisco Bay Area and almost every position wants a Masters/PhD, either explicitly stated as a requirement or implied. If there is a high demand/low supply of data science jobs out there, I'm unsure how a data science boot camp/tutorial would be able to compete.


I've been researching various masters programs and bootcamps, including talking to graduates of both.

My sense is that while there's a huge variance in quality on both, the median bootcamp seems to be more in touch with industry and better at imparting real-world skills than the median master's program. I'm not sure if employers have started to recognize this yet (from your comment, it seems that they haven't). But once the feedback loop completes, I'd wager that they will.

Also, getting a graduate degree and attending a data science bootcamp doesn't seem to be mutually exclusive. For instance, there are data science bootcamps that specifically target PhDs.


Yeah, there are a bunch of bootcamps for people who have statistics or statistics-heavy PhDs and need to translate those skills from the academic to the tech company context. Tends to work out pretty well.


Point is, there is an explosion of ML usage in soft sciences and applied industries that treat the tools as a black box. On the other hand, extreme reliability is really not needed there; it's just a lot of non-math or basic-math trained people messing around with stats, R packages and novel jargon.


Another point is the overreaction among the cognoscenti: so many words about open-sourcing the tools and letting the masses focus on real-life problems, and then rage against those end-users, mocked as data-monkeys? Most industries do not need data science at all, and if they do, a very simple 80/20 approach solves all their problems. It happens that the 80/20 approach is within the reach of every data-monkey able to clean and normalize datasets, set up Anaconda with scikit-learn, Theano and xgboost, do some ensembling, and deploy to AWS for semi-intensive tasks. You as an industry wanted that for years, so what now?
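For what it's worth, the "some ensembling" step at that 80/20 level really is just a handful of lines (a soft-voting sketch over stock scikit-learn models on synthetic data):

  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

  ensemble = VotingClassifier(
      estimators=[
          ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
          ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
          ("gb", GradientBoostingClassifier(random_state=0)),
      ],
      voting="soft",
  )
  print("CV accuracy: %.3f" % cross_val_score(ensemble, X, y, cv=5).mean())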


"As an ethical data scientist concerned with both security and privacy, you are careful not to extract any personally identifiable information from the database. All the information in the CSV file is anonymized, and cannot be traced back to any specific customer."

Honest question - is this really necessary and applicable in this scenario? We're talking about a full time employee accessing company data, presumably with any necessary permissions, to generate insights for internal consumption within the company about its customers?


Yes, it is. Full time employees usually work on their laptops, which can be stolen or hacked especially when they're outside work. Ultimately, people and culture are usually the weakest links in security.


Two differences between a data scientist at the capability level of someone reading this article in earnest at an SV startup and one at an enterprise company in most of the rest of the country are the cost of living and the assumption that moving past step two is a question of one person's independent ethical decisions.


I understand this is a marketing piece but as a data scientist the narrative doesn't resonate with me at all. Are there any data scientists that have actually had an experience like this?


This seems mostly like the "data analyst" job. Jobs with the "data scientist" title that I know of are usually basically programming jobs with a focus around machine learning and large scale data.


Those programming jobs exist because someone at some point discovered a problem that could be solved through statistical techniques. The difference is between moving quickly to explore new problem spaces versus hammering away at well-defined problems whose solutions need to be implemented in a way that looks a lot like normal software engineering.


So, are all "data science" jobs related to marketing?


In my experience most are, but there is also some work in operations management and optimisation, product design (e.g. financial products, digital products) and other areas.

I think marketing is the obvious first use case, but in large organisations there are often gains to be made looking at operational data.


So this is a question for those of you in the comments.

I'm finishing up a Ph.D. in engineering (heavy into climate change research, so tons of programming + mathematical + statistical knowledge in addition to combing through TBs of data with R and other languages).

What kinds of problems are frequently present in the data science industry that differ from academic research?


I realize this isn't a proper answer to your question, but it reminds me of a tweet from Monica Rogati:

  "A decade in academia taught me a bunch of sophisticated algorithms; a decade in industry taught me when not to use them."
Source: https://twitter.com/mrogati/status/726115691703619584


Welp that's really fair. Thanks for the quote.


Great article, this is the exact process I've observed multiple times.


Down for me, but the main springboard page is up.



