This is a completely, utterly worthless list: the only things that belong on it ...

ims · on July 24, 2020

Seconding this comment. Based on experience in hiring data scientists and comparing notes with many others that hire data scientists, the most frequent gaps in knowledge are (1) statistics specifically and scientific computing in general and (2) disciplined software engineering.

People good at (1) and bad at (2) write "PhD code" that may or may not be right but you can't tell because it's too disorganized. People good at (2) but bad at (1) get fine-ish looking numbers out of their good looking code but you can't tell whether it's right because they may have ignored or misunderstood fundamental assumptions and correctness of the underlying methods.

There are also seemingly tens of thousands of people on the market who have little experience in either but have adapted projects from examples online into their GitHub portfolio and put all of the relevant terms into their resume anyway.

I think most aspiring data scientists would be better served going with more introductory texts and really understanding them. Maybe Blitzstein and Hwang's "Introduction to Probability" and then McElreath's "Statistical Rethinking" or Wasserman's "All of Statistics" for people who need more stats.

I'm not even sure what to recommend for developing good software judgment and habits. There doesn't seem to be a shortcut for that. Maybe "Fluent Python" or "Effective Python" for Python people? No idea for the R ecosystem.

skwb · on July 24, 2020

Perspective for (2): it's because no one in graduate training really cares about code quality. Your PI focuses more of your attention on scientific writing, and so there's little to no time to polish your work. The incentives just don't support this work at the graduate training level.

LetThereBeNick · on July 24, 2020

I agree. The only incentive to write neat code is to save yourself the pain and suffering of having to go back through it yourself to fix or add things. I am currently doing data analysis in MATLAB for my PhD, and I know nobody will ever use my code besides me.

I’d like to learn to do my due diligence, but without someone training me, it just takes so much time to learn things like git. I’d rather be recording more data and submitting my paper so I can get the hell out of here.

scared2 · on July 26, 2020

Actually the PI doesn't even know about the concept of code quality. Or even about your code at all.

arthurjj · on July 24, 2020

Interesting. I've never seen any issue with (1) among data scientists with a Master's or PhD; more often I've seen it with software engineers who end up having to do data science work. Does that track with your experience?

As a software engineer, (2) is the worst part of working with data scientists. The number of times a 'professional' data scientist wants to launch a non-code-reviewed, non-source-controlled, works-on-my-notebook model or analysis is shocking.

Eugeleo · on Aug 7, 2020

Which books would you recommend to solve the first problem? I’m a CS student, but would love to work with data one day, so I’m looking for some probability/statistics/data science books to read through. I’ve heard great things about Elements of Statistical Learning, for example, but I’m not sure whether it’s too DS oriented, giving poor foundations in statistics and probability.

scottlocklin · on July 24, 2020

"R Inferno" will help with the software engineering bits I suppose. R is sort of designed to do this sort of work, and it assumes the end user is more of a statistician than a programmer. Lots of foot-guns. On the other hand, Python has a lot of them as well, and it's NOT designed for this kind of work. It's a sort of mixed bag: R core is vastly better than Python for this sort of task. There's a subset of R packages which are as good as scikit-learn (which is very good indeed), but there is also a pile of total shit. R's package manager is also better than anything in the Python universe, but node bros manage to screw it up. I loathe Python from long experience, so I polarized on R, but Python is definitely winning.

I just assume anyone who calls themselves a data scientist is going to be a shit tier programmer who needs to improve over time at this point. The exceptions kind of prove the rule. Imposing test-driven discipline will cure some of the worst tendencies.
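To make "test-driven discipline" concrete, here is a minimal sketch of what it could look like for analysis code. The `standardize` helper and both tests are hypothetical illustrations (not from any library): the tests state the properties the function must satisfy, and they can be run by any test runner that collects `test_*` functions, or simply called directly.

```python
def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation.

    Hypothetical helper used only to illustrate writing tests alongside code.
    """
    n = len(values)
    if n == 0:
        raise ValueError("empty input")
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        # A constant column cannot be scaled; fail loudly instead of dividing by zero.
        raise ValueError("constant input has zero variance")
    sd = var ** 0.5
    return [(v - mean) / sd for v in values]


def test_standardize_has_zero_mean_and_unit_variance():
    z = standardize([2.0, 4.0, 6.0, 8.0])
    assert abs(sum(z) / len(z)) < 1e-12
    assert abs(sum(v * v for v in z) / len(z) - 1.0) < 1e-12


def test_standardize_rejects_constant_input():
    try:
        standardize([3.0, 3.0, 3.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for constant input")
```

The point is less the function than the habit: the edge cases (empty input, zero variance) are written down as checks rather than discovered later in a notebook.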

I don't have good references on stats and linear regression tier data science, but I'll take someone who understands the basics (I dunno, calculating useful moments from empirical distributions, feature selection in linreg) over some weenie who has some cribbed ipython file in his githubs who claims to understand Hastie.
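For readers unsure what "calculating useful moments from empirical distributions" means in practice, a rough sketch in plain Python follows. The function name `empirical_moments` is made up for illustration, and it uses the population (divide by n) convention; other conventions with bias corrections exist and give slightly different numbers.

```python
def empirical_moments(xs):
    """Return mean, variance, skewness and excess kurtosis of a sample.

    Uses central moments m_k = (1/n) * sum((x - mean)**k), i.e. the
    population convention with no small-sample bias correction.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,           # 0 for symmetric data
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
    }


moments = empirical_moments([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the symmetric sample above, the skewness comes out as zero and the excess kurtosis is negative, since evenly spaced data has lighter tails than a normal distribution.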

jointpdf · on July 24, 2020

Honest question: what skills should a data scientist possess to graduate out of “shit tier”? Should we have all of the skills of statisticians, ML engineers, data engineers, software engineers, visualization designers, and domain/communication experts? Can it not be valuable to have some but not all of the above skill sets? Does it matter that software engineers are often “shit-tier statisticians” that understand just enough ML lingo to dismiss it as marketing hype?

I’ve gone out of my way over the years to make learning data science skills as approachable as possible for the uninitiated (giving trainings, providing customized learning paths based on someone’s background, offering encouragement), and yet this is almost never reciprocated by engineer types. It’s always just, “data scientists can’t write production quality code”, with no explanation of what production quality entails, or without consideration of the fact that notebook-based data science can have advantages over perfectly modularized code with a battery of tests. See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.” It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.

When curious and open-minded data scientists and software engineers work together, it can be magic. When people snipe at others for their “shitty” skills, it creates a petty and toxic environment.

This comment comes off as a bit of an admonition, but I would greatly appreciate a list like TFA for data scientists looking to shore up their fundamental CS and software development skills.

(PS — The first book I read when teaching myself R was R Inferno, so that ain’t it.)

ims · on July 25, 2020

> See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.” It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.

Hey, it seems like you took this as gatekeeping or something. These skills can definitely be taught or self-learned, I've done it and seen it done many times.

My point was only that I don't know resources that can act as a shortcut (my actual word above), i.e. ways to skip over the longer path of gaining experience through long engagement with the topic. So maybe more like a chess coach saying they don't know any books that let a beginner jump ahead to being a more experienced player?

There are hundreds of past threads on HN about books to level up in software, so clearly some people have thoughts about this. I just don't know what to recommend a data scientist who needs these skills immediately.

jointpdf · on July 25, 2020

What you said wasn’t egregious or anything, no worries. I’ve seen some incomprehensible code from data scientists with PhDs, stuff that has no excuse. I also don’t know of a single resource for essential coding skills specific to data scientists either.

Sometimes a rant on a topic brews in my head for weeks or months, and I will uncork it on a random passerby that brings up the subject—which happened to be you this time.

But I’ve had coworkers who, like clockwork, sneer at anything a data scientist wrote. “Why did you do it that way?” When asked for advice on how to improve it, they huffily say never mind. It’s grating as hell.

scared2 · on July 26, 2020

R should be killed.

melling · on July 24, 2020

This book by Wes McKinney.

https://www.amazon.com/Python-Data-Analysis-Wrangling-IPytho...

scared2 · on July 24, 2020

I have put the list per "expert" to make you feel better.

  Herman:
  - ISL
  - Hands on 
  - Chollet
  - spark+desk

  Miller:
  - Grus
  - think stats
  - la done right
  - bishop
  - data intensive

gxqoz · on July 24, 2020

I personally found the Kleppmann book to be great. I wouldn't characterize it as essential for data science per se, but anyone designing data-intensive applications ought to give it a read.

umeshunni · on July 24, 2020

Whenever an article tries to appeal to authority (e.g. "according to the experts"), you know it is trash. It's part of my personal click-bait detection heuristic.

danso · on July 24, 2020

Thanks for reaffirming my habit of checking the comments before clicking through. Learned about a lot of useful resources from this and other comments!

scared2 · on July 24, 2020

I agree the title is annoyingly misleading.