"R Inferno" will help with the software engineering bits, I suppose. R is designed for this sort of work, and it assumes the end user is more of a statistician than a programmer. Lots of foot-guns. On the other hand, Python has a lot of them as well, and it's NOT designed for this kind of work. It's a mixed bag: R core is vastly better than Python for this sort of task. There's a subset of R packages which are as good as scikit-learn (which is very good indeed), but there is also a pile of total shit. R's package manager is also better than anything in the Python universe, but node bros manage to screw it up. I loathe Python from long experience, so I polarized on R, but Python is definitely winning.
At this point I just assume anyone who calls themselves a data scientist is going to be a shit tier programmer who needs to improve over time. The exceptions kind of prove the rule. Imposing test-driven discipline will cure some of the worst tendencies.
I don't have good references on stats and linear regression tier data science, but I'll take someone who understands the basics (I dunno, calculating useful moments from empirical distributions, feature selection in linreg) over some weenie who has some cribbed IPython file in his githubs who claims to understand Hastie.
Honest question: what skills should a data scientist possess to graduate out of “shit tier”? Should we have all of the skills of statisticians, ML engineers, data engineers, software engineers, visualization designers, and domain/communication experts? Can it not be valuable to have some but not all of the above skill sets? Does it matter that software engineers are often “shit-tier statisticians” that understand just enough ML lingo to dismiss it as marketing hype?
I’ve gone out of my way over the years to make learning data science skills as approachable as possible for the uninitiated (giving trainings, providing customized learning paths based on someone’s background, offering encouragement), and yet this is almost never reciprocated by engineer types. It’s always just, “data scientists can’t write production quality code”, with no explanation of what production quality entails, and no consideration of the fact that notebook-based data science can have advantages over perfectly modularized code with a battery of tests. See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.” It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.
When curious and open-minded data scientists and software engineers work together, it can be magic. When people snipe at others for their “shitty” skills, it creates a petty and toxic environment.
This comment comes off as a bit of an admonition, but I would greatly appreciate a list like TFA for data scientists looking to shore up their fundamental CS and software development skills.
(PS — The first book I read when teaching myself R was R Inferno, so that ain’t it.)
> See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.” It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.
Hey, it seems like you took this as gatekeeping or something. These skills can definitely be taught or self-learned, I've done it and seen it done many times.
My point was only that I don't know resources that can act as a shortcut (my actual word above), i.e. ways to skip over the longer path of gaining experience through long engagement with the topic. So maybe more like a chess coach saying they don't know any books that let a beginner jump ahead to being a more experienced player?
There are hundreds of past threads on HN about books to level up in software, so clearly some people have thoughts about this. I just don't know what to recommend to a data scientist who needs these skills immediately.
What you said wasn’t egregious or anything, no worries. I’ve seen some incomprehensible code from data scientists with PhDs, stuff that has no excuse. I also don't know of a single resource for essential coding skills specific to data scientists either.
Sometimes a rant on a topic brews in my head for weeks or months, and I will uncork it on a random passerby that brings up the subject—which happened to be you this time.
But I’ve had coworkers who, like clockwork, sneer at anything a data scientist wrote. “Why did you do it that way?” When asked for advice on how to improve it, they huffily say never mind. It’s grating as hell.