Unfortunately it seemed pretty clear from the start that this is what data scien...

jghn · on Nov 29, 2022

I worked adjacent to the data science field when it was in its infancy. As in I remember people who are now household names in the field debating what it should be called.

At the time I considered going down that path, but decided I did not have anywhere near the statistics & math knowledge to get very far. So I stuck with the path I had been on. Over time I saw a lot of acquaintances jumping into the data science game. I couldn't figure out how they were learning this stuff so fast. At some point I realized that most of them knew less than I did when I decided I didn't know enough to even begin that journey.

Of course, I was comparing myself against the giants of the field and not the long tail of foot soldiers. But it made for a great example to me of how with just about everything there's a small handful of people who are the primary movers, and then everybody else.

codeulike · on Nov 29, 2022

> Data science effectively rebranded statistics but removed the requirement of deep statistical knowledge to allow people to get by with a cursory understanding of how to get some python library to spit out a result.

I don't know anything about Data Science, but as a bystander with a mathematical background that's what I assumed was going on, so it's kind of interesting to see it spelt out like that. Like you've put words to a preconception that I didn't even know I had.

rafaelero · on Nov 29, 2022

That's because businesses don't require a deep level of math knowledge.

ShredKazoo · on Nov 30, 2022

>Data science effectively rebranded statistics but removed the requirement of deep statistical knowledge

An important thing people miss is that shallow statistical knowledge can cause subtle failures, but shallow software engineering knowledge can cause subtle failures too.

A junior frontend developer will write buggy code, notice that the UI is glitched, and fix the bug. A junior data analyst will write buggy code and fix any bugs which cause the results to be obviously way off, but bugs which cause subtler problems will go unfixed.

Writing correct code without the benefit of knowing when there is a bug is challenging enough for senior developers. I don't trust newbie devs to do it at all.

Context here is I used to work in email marketing and at one point I was reading some SQL that one of the data scientists wrote and observed that it was triple-counting our conversions from marketing email. Triple-counting conversions means the numbers were way off, but not so far off as to be utterly absurd. If I hadn't happened to do a careful read of that code, we would've just kept believing that our email marketing was 3x as effective as it actually was.

So, it's impossible to know how much of a problem this is. But there is every reason to believe it is a significant problem, and lots of code written by data scientists is plagued by bugs which undermine the analysis. (When's the last time you wrote a program which ran correctly on the first try?) Any serious data science effort would enforce stern practices around code review, assertions, TDD, etc. to make the analysis as correct as possible -- but my impression is it is much more common for data analysis to be low-quality throwaway code.
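To make the kind of assertion I mean concrete, here is a minimal sketch. It assumes the inflation comes from a join fan-out (each conversion matched against every email sent to that user), which is only one way a count can end up tripled, not necessarily what happened in the SQL I read. The table names, columns, and the `checked_conversions` helper are all made up for illustration.

```python
import sqlite3

def build_db():
    # Hypothetical schema: two conversions, and three marketing emails per user.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE conversions (user_id INTEGER, order_id INTEGER PRIMARY KEY);
        CREATE TABLE email_sends (user_id INTEGER, campaign TEXT);
        INSERT INTO conversions VALUES (1, 100), (2, 101);
        INSERT INTO email_sends VALUES
            (1, 'a'), (1, 'b'), (1, 'c'),
            (2, 'a'), (2, 'b'), (2, 'c');
    """)
    return conn

# Each conversion row is repeated once per email sent to that user.
NAIVE = """
    SELECT COUNT(*) FROM conversions c
    JOIN email_sends s ON s.user_id = c.user_id
"""

# Counting distinct orders collapses the repeats introduced by the join.
DEDUPED = """
    SELECT COUNT(DISTINCT c.order_id) FROM conversions c
    JOIN email_sends s ON s.user_id = c.user_id
"""

def count(conn, sql):
    return conn.execute(sql).fetchone()[0]

def checked_conversions(conn, sql):
    # Sanity check: attributed conversions can never exceed all conversions.
    n = count(conn, sql)
    total = count(conn, "SELECT COUNT(*) FROM conversions")
    assert n <= total, f"attributed {n} conversions but only {total} exist"
    return n

conn = build_db()
naive_count = count(conn, NAIVE)
deduped_count = count(conn, DEDUPED)
```

With this toy data the naive query reports three times the real number of conversions, and `checked_conversions(conn, NAIVE)` fails its assertion instead of quietly returning the inflated figure. The check is cheap and knows nothing about marketing; it only encodes an invariant the result has to satisfy.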

Breza · on Dec 9, 2022

This is an important point. I used to work in adtech. It's amazing how terrible the modeling is in that space. You can generate a model that identifies a given target audience and simply assert that it works without any real validation.

ShredKazoo · on Dec 11, 2022

Surely adtech companies like Google and FB do OK though?

adamsmith143 · on Nov 29, 2022

On the flip side you used to have statisticians writing code that is frankly unusable in a Production environment. You would weep at the R code I've seen and had to turn into something to actually produce business value.

fnands · on Nov 29, 2022

There is a bit of a joke that a data scientist is someone who can do better stats than the average SWE and can write better code than the average statistician. Both of those are relatively low bars to clear though.

ketzo · on Nov 29, 2022

The way I heard the joke was "a data scientist is someone who's not good enough at math to be a statistician, and not good enough at programming to be a software engineer."

Maybe a little harsh...

disgruntledphd2 · on Nov 29, 2022

That's much better. Consider that stolen.

fnands · on Nov 29, 2022

Harsh, but funnier than how I phrased it.

drgiggles · on Nov 29, 2022

This is exactly my point. Let subject matter experts in their respective disciplines handle what they know and communicate through the lingua franca of R. Most data scientists/statisticians probably shouldn't be writing production code, and I think that's ok. It's a failing of management to think that coding is coding and not understand the value of true engineering ability.

numbsafari · on Nov 29, 2022

My first job basically consisted of taking code in FORTRAN and translating it into C++ with robust testing and engineering, and then frontending that code into a ton of spreadsheet packages. So you had quants doing quant work, software engineers doing software engineering, and analysts and traders being analysts and traders, instead of having quants fail at all three, which is more or less what data science is.

esparrohack · on Nov 29, 2022

Yeah but in the end it’s just code. And even better, just R.

The business value comes from the stats guy.

adamsmith143 · on Nov 29, 2022

When the R/stats guy quits and you have to figure out which of his 7 notebooks to run in which order, which local files need to be in which local directories to run correctly, which versions of each package are now broken, and which code you need to rewrite to fix it, you start to realize the value he produced was clicking a lot of buttons in the right order, and that overall this doesn't scale at all.

esparrohack · on Nov 30, 2022

Yeah, but I meant that because the business value is in the stats, and there is such low quality of stats in the field to begin with, it’s borked no matter what.

There’s no point in fixing it. You can just pretend like you did. But if the stat work is quality, then it’s worth the effort to optimize.

mellavora · on Nov 29, 2022

That sounds more like a Jupyter notebook/Python problem than an R problem.

But otherwise, yes, I see the problem.

adamsmith143 · on Nov 30, 2022

The hours I have spent debugging package problems in R would disagree.

esparrohack · on Nov 30, 2022

I know that pain. That’s why I’m saying avoid it if you can do so.

layman51 · on Nov 30, 2022

> Data science effectively rebranded statistics but removed the requirement of deep statistical knowledge to allow people to get by with a cursory understanding of how to get some python library to spit out a result.

That's a good way of putting it. I remember in my first calculus-based probability+statistics class in college, I felt incredibly challenged by the theory. I wondered why there are so many probability distributions out there, why the standard stats formulas look like they do, what "kernel density estimation" even is, etc.

On the other hand, my data science course did include some theory, but a big part of it was also learning how to type the right commands in R to perform the "featured analysis of the week" on a sample data set. Something about these lab exercises felt off because it felt more like training rather than education. The professor expressed something along the lines that if we wanted to go far with this in the future, he would expect us to design the algorithms behind the function calls. I think the analogy he used was "baking a cake from scratch rather than buying a ready made one at the store."

kxc42 · on Nov 29, 2022

That answer somehow reminds me of an article in logicmag: An Interview with an Anonymous Data Scientist [1].

[1]: https://logicmag.io/intelligence/interview-with-an-anonymous...

manicennui · on Nov 30, 2022

I don't know many software engineers who have the ability to design and implement robust production systems.