I was going to ask, "Why Python 2.x"? But then I just bought the book. Hope you don't mind if I post this excerpt:
> As I write this, the latest version of Python is 3.4. At DataSciencester, however, we use old, reliable Python 2.7. Python 3 is not backward-compatible with Python 2, and many important libraries only work well with 2.7. The data science community is still firmly stuck on 2.7, which means we will be, too. Make sure to get that version.
I use the more popular scientific libraries, e.g. numpy, scikit, nltk... and the bigger ones seem to have been ported over to 3.x. A few libs that haven't been ported come to mind: mechanize and opencv. Has anyone here had success using 3.x as a data science professional, or is there some massive gaping hole that I'm missing? (I agree that "well, this is what the company has been using" is a decent enough excuse to stay on 2.x in most situations.)
Even some projects that claim to have been ported will often have bugs, because the port is new code. Then it becomes a question of whether I have the time, or the desire, to test the port on my production system. I just watch the issue or commit stream and wait for Python 3-related issues to slow down a bit.
I just finished the ML class from Georgia Tech as part of the OMSCS program. I used scikit-learn for most of the assignments, which involved NNs, DTs, KNN, k-means, and EM. This might be a naive question as I'm not a Python guy, but is there a reason this book is Python-based yet doesn't cover scikit-learn? For example, what need did you see to write code for k-means [1] rather than use an implementation that's already available [2]?
The title of the book includes "from scratch" for a reason -- it's from "first principles" where you learn about something by building it up from scratch rather than using an implementation. At the end of each chapter, Joel points out the existing resources you can use after learning about the topic.
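To give a flavor of what "from first principles" looks like, here is a minimal from-scratch k-means in plain Python. This is my own sketch, not the book's actual code; the function names and test data are made up for illustration:

```python
import random

def mean(points):
    # component-wise mean of a list of points
    return [sum(xs) / len(xs) for xs in zip(*points)]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, num_iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(num_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[best].append(p)
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return centroids

# Two well-separated blobs; the centroids end up at their means.
points = [[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]]
centroids = kmeans(points, k=2)
```

Writing those ~25 lines yourself is the point of the book; once you understand them, you reach for `sklearn.cluster.KMeans` in real work.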
makes sense. I took "from scratch" from an understanding perspective rather than implementation. Thanks for clarifying it. Looks like it'll be a great resource.
Looks great, Joel! Definitely going to check this out and start working through it. I've noticed the huge bifurcation between extremely applied data science and the almost entirely mathematical. I was always wary of 'learning' data science through applications alone, but as you alluded to, it's significantly more exciting. Likewise, most introductory statistics classes are so poorly delivered that many people have a deeply ingrained fear of the underlying concepts.
As a side note, do you attend any data events in Seattle? I'm moving there in June after graduation and would love to talk with somebody doing my dream job.
Does anyone have any recommendations of books that might pair well with this one in the math/data/statistics space? Thought I might pick up a few books and score some free shipping.
Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran is a bit old now but is excellent, 4.5 stars on Amazon from 100+ reviews.[1] A bit of overlap with this one, but there are some great explanations.
Not specifically as a pair with that book, but I found the following books help to build some background to start on Data Science and Machine Learning:
This book looks very close to what my girlfriend is looking for. She's interested in learning bioinformatics and it's been difficult to find a good book that introduces topics in data science in a digestible manner.
If anyone knows the book, can you give a quick overview of how much math, stats, programming, and comp sci you'd need to read it? Thank you.
Most of the math is vector space arithmetic. There are a few sections that use matrix multiplication. The probability and stats is stuff like understanding probability distributions and Bayes's Theorem. (It's all covered in the book, but you'd need to be comfortable picking it up and using it.)
In terms of programming, not much. Someone who's never programmed before would probably have a tough time, but the goal is that someone who is bright and hardworking and who can write fairly simple Python programs should not have a problem. Very little CS background required. Maybe basic data structures like list vs dict and so on.
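To make the level concrete, here's roughly the kind of math and Python involved. This is my own illustration with hypothetical numbers, not an example taken from the book:

```python
def dot(v, w):
    # dot product of two equal-length vectors,
    # the workhorse of from-scratch vector arithmetic
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# Bayes's Theorem with made-up numbers: a test that is 99% sensitive,
# with a 5% false-positive rate, for a condition with 1% prevalence.
p_d = 0.01                   # P(disease)
p_pos_given_d = 0.99         # P(positive | disease)
p_pos_given_healthy = 0.05   # P(positive | healthy)

# total probability of a positive test
p_pos = p_pos_given_d * p_d + p_pos_given_healthy * (1 - p_d)

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_d_given_pos = p_pos_given_d * p_d / p_pos
```

If you can follow both of those comfortably, you have the background the book assumes.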
Reading this comment for some reason makes me curious how much time the book spends addressing computer science fundamentals like cpu and memory. My guess is that it's included in bits and pieces along the way but I didn't see anything explicit in the table of contents.
I'm thinking about it in terms of running computation in production environments where you may be constrained by available compute resources or budget. Some people have an intuitive grasp of cpu/memory/bandwidth and can do performance tuning as necessary, but those who don't can find themselves in situations where they waste a lot of resources, such as running a million parallel jobs that each have less than 1 second of CPU time, getting stuck after failing to request or provision nodes with sufficient memory, or performing unnecessary reads and writes.
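On the job-granularity point: sub-second tasks are usually better chunked into batches so per-job scheduling overhead is amortized. A minimal sketch (the function and numbers here are my own, purely illustrative):

```python
def chunked(items, size):
    # split a flat list of tiny tasks into batches of `size`,
    # so each dispatched job does meaningful work
    # instead of burning <1s of CPU per scheduling round-trip
    return [items[i:i + size] for i in range(0, len(items), size)]

tasks = list(range(10))
batches = chunked(tasks, 4)   # dispatch 3 jobs instead of 10
```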
It's more focused on how to analyze existing biological data with the shell, R, and how to use git.
Personally, I've rarely seen advanced machine learning being used outside of genome-wide association studies, and even there most people just use PLINK's logistic regression without understanding what's being done and call it a day.
Another really good book on how to understand statistics is Motulsky's Intuitive Biostatistics - it introduces all common "tests" and methodologies people working in the life sciences use, but without the formulas (you use R for that anyway). It's more about the caveats of each test, in which situation you'd use it, what can go wrong, how to interpret the results etc., all written in a very lively style.
I can't help but wonder about those recommender systems. With so little material on statistics I have to assume it's only about observational data, which is the best way to make the millionth+1 useless recommendation engine.
And why is it that the "data science" books never discuss DoE?
Most data scientists, particularly those coming from the CS department, lack most probability and statistics fundamentals. I doubt many of them have even heard of an ANOVA, or of sampling distributions, F tests, chi-squared, etc.
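Those fundamentals fit in a few lines, which makes the gap more striking. Here is a one-way ANOVA F statistic computed from scratch; this is my own sketch and test data (in practice `scipy.stats.f_oneway` does the same and also returns a p-value):

```python
def f_statistic(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    k = len(groups)                         # number of groups
    n = sum(len(g) for g in groups)         # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    # mean squares use k-1 and n-k degrees of freedom
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three small groups, the third clearly shifted upward.
F = f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F says the between-group spread is big relative to the within-group noise; you'd then compare it against the F distribution with (k-1, n-k) degrees of freedom.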
To be fair, ml tends to focus very heavily on prediction, not inference / interpretation of betas. In many tree models how to even understand coefs is an open question.
Most of the big ML books are heavily Bayesian, and these subjects are less discussed (though IIRC Gelman's book has a "Bayesian ANOVA"). Even The Elements of Statistical Learning, which is very frequentist in its approach, only references ANOVA in passing. Do you have any books to recommend on these fundamentals?
basic level, very approachable, filled with case studies (and with R code to run them easily found), but stupidly expensive: _statistical sleuth_ by ramsey (but, you know, pdfs can be found on the internets)
intermediate level, covers some blocking IIRC: _Statistics for Experimenters_ by Box et al
advanced: I thought it quite good, but my classmates did not universally love it. Unfortunately it does not come with case studies or R code to run them; I have a bunch, but (very unfortunately) printed instead of computerized and, in any case, probably copyrighted by my professors. _Experiments: Planning, Analysis, and Optimization_ by Wu and Hamada. The math is not complex but can be involved for the various blocking designs.