
This is a completely, utterly worthless list: the only things that belong on it are Grus (good for Python), maybe Bishop (maybe; it's woefully out of date and light on details) and Hastie (his other book is vastly better). The "General interest" books are all horse shit. If you're a Pythonista, you should buy Wes McKinney's book. If you're not, you should buy John Mount and Nina Zumel's "Practical Data Science with R", which is comparable to Grus.

If you don't know linear algebra: the "done right" book is absolutely not done right for people who work with data in the real world. Strang is infinitely better, and Trefethen and Bau, and Golub, if you need to go deep.

As someone pointed out below: Kevin Murphy's ML book is actually very good for teaching you how the things work. What's more, you can download Python or MATLAB code for every algorithm in his book (as you can for Bishop at this point).

Nobody needs the Deep Learning books listed. In spite of the hype, it's just not that important compared to naive Bayes, LDA and GBM; most people don't have access to the hardware or data sets that make DL useful, and those who do probably studied DL in grad school.

Gads; what trash! Those experts: only one of them appears to actually be an expert; pedants and post-docs don't count.




Seconding this comment. Based on experience in hiring data scientists and comparing notes with many others that hire data scientists, the most frequent gaps in knowledge are (1) statistics specifically and scientific computing in general and (2) disciplined software engineering.

People good at (1) and bad at (2) write "PhD code" that may or may not be right but you can't tell because it's too disorganized. People good at (2) but bad at (1) get fine-ish looking numbers out of their good looking code but you can't tell whether it's right because they may have ignored or misunderstood fundamental assumptions and correctness of the underlying methods.

There are also seemingly tens of thousands of people on the market who have little experience in either but have adapted projects from examples online into their GitHub portfolio and put all of the relevant terms into their resume anyway.

I think most aspiring data scientists would be better served going with more introductory texts and really understanding them. Maybe Blitzstein and Hwang's "Introduction to Probability" and then McElreath's "Statistical Rethinking" or Wasserman's "All of Statistics" for people who need more stats.

I'm not even sure what to recommend for developing good software judgment and habits. There doesn't seem to be a shortcut for that. Maybe "Fluent Python" or "Effective Python" for Python people? No idea for the R ecosystem.


Perspective for (2): it's because no one in graduate training really cares about code quality. Your PI focuses more of your attention on scientific writing, and so there's little to no time to polish your work. The incentives just don't support this work at the graduate training level.


I agree. The only incentive to write neat code is to save yourself the pain and suffering of having to go back through it yourself to fix or add things. I am currently doing data analysis in MATLAB for my PhD, and I know nobody will ever use my code besides me.

I’d like to learn to do my due diligence, but without someone training me, it just takes so much time to learn things like git. I’d rather be recording more data and submitting my paper so I can get the hell out of here.


Actually the PI doesn't even know about the concept of code quality. Or even about your code at all.


Interesting. I've never seen any issue with (1) among data scientists with a Master's or PhD; more often I've seen it with software engineers who end up having to do data scientist work. Does that track with your experience?

As a software engineer, (2) is the worst part of working with data scientists. The number of times a 'professional' data scientist wants to launch a non-code-reviewed, non-source-controlled, works-on-my-notebook model or analysis is shocking.


Which books would you recommend to solve the first problem? I’m a CS student, but would love to work with data one day, so I’m looking for some probability/statistics/data science books to read through. I’ve heard great things about Elements of Statistical Learning, for example, but I’m not sure whether it’s too DS-oriented, giving poor foundations in statistics and probability.


"The R Inferno" will help with the software engineering bits, I suppose. R is sort of designed to do this sort of work, and it assumes the end user is more of a statistician than a programmer. Lots of foot-guns. On the other hand, Python has a lot of them as well, and it's NOT designed for this kind of work. It's a mixed bag: R core is vastly better than Python for this sort of task. There's a subset of R packages which are as good as scikit-learn (which is very good indeed), but there is also a pile of total shit. R's package manager is also better than anything in the Python universe, but node bros manage to screw it up. I loathe Python from long experience, so I polarized on R, but Python is definitely winning.

I just assume anyone who calls themselves a data scientist is going to be a shit tier programmer who needs to improve over time at this point. The exceptions kind of prove the rule. Imposing test-driven discipline will cure some of the worst tendencies.

I don't have good references on stats and linear-regression-tier data science, but I'll take someone who understands the basics (I dunno, calculating useful moments from empirical distributions, feature selection in linreg) over some weenie who has some cribbed IPython file in his GitHub and claims to understand Hastie.
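
To make "calculating useful moments from empirical distributions" concrete, here is a minimal sketch in plain Python. The function name `empirical_moments` and the returned dictionary keys are made up for illustration, and it assumes the simple divide-by-n (population) formulas rather than bias-corrected estimators.

```python
import math


def empirical_moments(xs):
    """Return mean, variance, skewness and excess kurtosis of a sample.

    Assumption: plain divide-by-n formulas, no bias correction.
    """
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")

    mean = sum(xs) / n

    # Central moments: m_k = (1/n) * sum((x - mean) ** k)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n

    if m2 == 0:
        # All values identical: skewness and kurtosis are undefined.
        return {
            "mean": mean,
            "variance": 0.0,
            "skewness": math.nan,
            "excess_kurtosis": math.nan,
        }

    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,           # standardized third moment
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
    }


if __name__ == "__main__":
    print(empirical_moments([1, 2, 3, 4, 5]))
```

Being able to write something like this from the definitions, and say what each number tells you about the data, is roughly the level of "basics" meant here.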


Honest question: what skills should a data scientist possess to graduate out of “shit tier”? Should we have all of the skills of statisticians, ML engineers, data engineers, software engineers, visualization designers, and domain/communication experts? Can it not be valuable to have some but not all of the above skill sets? Does it matter that software engineers are often “shit-tier statisticians” that understand just enough ML lingo to dismiss it as marketing hype?

I’ve gone out of my way over the years to make learning data science skills as approachable as possible for the uninitiated (giving trainings, providing customized learning paths based on someone’s background, offering encouragement), and yet this is almost never reciprocated by engineer types. It’s always just “data scientists can’t write production quality code”, with no explanation of what production quality entails, and without consideration of the fact that notebook-based data science can have advantages over perfectly modularized code with a battery of tests. See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.” It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.

When curious and open-minded data scientists and software engineers work together, it can be magic. When people snipe at others for their “shitty” skills, it creates a petty and toxic environment.

This comment comes off as a bit of an admonition, but I would greatly appreciate a list like TFA for data scientists looking to shore up their fundamental CS and software development skills.

(PS — The first book I read when teaching myself R was R Inferno, so that ain’t it.)


> See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.” It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.

Hey, it seems like you took this as gatekeeping or something. These skills can definitely be taught or self-learned; I've done it and seen it done many times.

My point was only that I don't know resources that can act as a shortcut (my actual word above), i.e. ways to skip over the longer path of gaining experience through long engagement with the topic. So maybe more like a chess coach saying they don't know any books that let a beginner jump ahead to being a more experienced player?

There are hundreds of past threads on HN about books to level up in software, so clearly some people have thoughts about this. I just don't know what to recommend a data scientist who needs these skills immediately.


What you said wasn’t egregious or anything, no worries. I’ve seen some incomprehensible code from data scientists with PhDs, stuff that has no excuse. I also don’t know of a single resource for essential coding skills specific to data scientists either.

Sometimes a rant on a topic brews in my head for weeks or months, and I will uncork it on a random passerby that brings up the subject—which happened to be you this time.

But I’ve had coworkers who, like clockwork, sneer at anything a data scientist wrote. “Why did you do it that way?” When asked for advice on how to improve it, they huffily say never mind. It’s infuriating as hell.


R should be killed.



I have put together a list per "expert" to make you feel better.

  Herman:
  - ISL
  - Hands on 
  - Chollet
  - spark+desk

  Miller:
  - Grus
  - Think Stats
  - LA Done Right
  - Bishop
  - data intensive


I personally found the Kleppmann book to be great. I wouldn't characterize it as essential for data science per se, but anyone designing data-intensive applications ought to give it a read.


Whenever an article tries to appeal to authority (e.g. "according to the experts"), you know it is trash. It's part of my personal click-bait detection heuristic.


Thanks for reaffirming my habit of checking the comments before clicking through. Learned about a lot of useful resources from this and other comments!


I agree the title is annoyingly misleading.



