Data Science from Scratch: First Principles with Python

philliproso · on April 26, 2015

The book should add a subtitle: "Includes a 116-line implementation of an in-memory database" https://github.com/joelgrus/data-science-from-scratch/blob/m... plus a 90-line implementation of neural networks. Wow, that is some beautiful code.

joelgrus · on April 26, 2015

Thank you! I worked very hard to make the book's code clear and beautiful.

tux · on April 27, 2015

Please add some summary comments in each file, like what it does.

spot · on April 27, 2015

he wrote a whole book of comments...

klibertp · on April 27, 2015

Normally I'd agree, but in this case the code is just one part of the offering. You want docs, you buy the book. I know I will.

danso · on April 27, 2015

I was going to ask, "Why Python 2.x"? But then I just bought the book. Hope you don't mind if I post this excerpt:

> As I write this, the latest version of Python is 3.4. At DataSciencester, however, we use old, reliable Python 2.7. Python 3 is not backward-compatible with Python 2, and many important libraries only work well with 2.7. The data science community is still firmly stuck on 2.7, which means we will be, too. Make sure to get that version.

I use the more popular scientific libraries, e.g. numpy, scikit, nltk... and the bigger ones seem to have been ported over to 3.x. A few libs that come to mind that haven't: mechanize and opencv. Has anyone here had success with using 3.x as a data science professional, or is there some massive gaping hole that I'm missing? (I agree that "Well, this is what the company has been using" is a decent enough excuse to stay on 2.x in most situations.)

rdtsc · on April 27, 2015

Even projects that claim to have been ported will often have bugs, because the port is new code. Then the question is whether I have the time, or the desire, to test the port on my production system. I just kind of look at the issue tracker or commit stream and see when the Python 3-related issues start to slow down a bit.

Omnipresent · on April 27, 2015

I just finished the ML class from Georgia Tech as part of the OMSCS program. I used SciKit for most of the assignments as they involved NNs, DT, KNN, K-means, EM. This might be a naive question as I'm not a Python guy, but is there a reason this book is Python-based but doesn't cover scikit-learn? For example, what need did you see to write code for k-means[1] rather than use an implementation that is already available[2]?

[1] https://github.com/joelgrus/data-science-from-scratch/blob/m...

[2] http://scikit-learn.org/stable/modules/generated/sklearn.clu...

treycausey · on April 27, 2015

The title of the book includes "from scratch" for a reason -- it's from "first principles" where you learn about something by building it up from scratch rather than using an implementation. At the end of each chapter, Joel points out the existing resources you can use after learning about the topic.
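
To make the "from scratch" idea concrete, here is a minimal k-means sketch in plain Python. It is not the book's code: the function names (`k_means`, `closest_index`, and so on), the fixed random seed, and the iteration cap are all assumptions made for illustration.

```python
import random

def squared_distance(v, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w))

def vector_mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def closest_index(point, means):
    """Index of the mean that is nearest to the given point."""
    return min(range(len(means)),
               key=lambda i: squared_distance(point, means[i]))

def k_means(points, k, seed=0, max_iterations=100):
    """Return (means, assignments) after iterating until assignments settle."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    means = rng.sample(points, k)      # start from k distinct input points
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [closest_index(p, means) for p in points]
        if new_assignments == assignments:
            break                      # nothing moved, so we have converged
        assignments = new_assignments
        for i in range(k):
            cluster = [p for p, a in zip(points, assignments) if a == i]
            if cluster:                # leave the mean alone if a cluster is empty
                means[i] = vector_mean(cluster)
    return means, assignments

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
means, assignments = k_means(points, k=2)
```

Once the mechanics are clear, a library version such as `sklearn.cluster.KMeans` is usually the better choice for real work, since it handles initialization and edge cases more carefully than this sketch does.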

Omnipresent · on April 27, 2015

makes sense. I took "from scratch" from an understanding perspective rather than implementation. Thanks for clarifying it. Looks like it'll be a great resource.

amarsahinovic · on April 27, 2015

Sorry, I downvoted you by accident, I have a habit of clicking randomly while reading :/

jplahn · on April 26, 2015

Looks great Joel! Definitely going to check this out and start working through it. I've noticed the huge bifurcation between extremely applied data science and the almost entirely math-based kind. I was always wary of 'learning' data science through applications only, but as you alluded to, it's significantly more exciting. Likewise, most introductory statistics classes are so poorly delivered that many people have a deeply ingrained fear of the underlying concepts.

As a side note, do you attend any data events in Seattle? I'm moving there in June after graduation and would love to talk with somebody doing my dream job.

joelgrus · on April 26, 2015

I attend a lot of data events in Seattle. Especially Data Science Happy Hours.

sputknick · on April 26, 2015

Any chance there is a discount code to encourage early readers?

joelgrus · on April 26, 2015

AUTHD

(And I didn't know that until you asked, I'm going to edit the blog post.)

arthurcolle · on April 27, 2015

Thanks so much! Just got the ebook. 16 definitely beats 33, pushed me over the edge on my student budget. :)

sireat · on April 27, 2015

This is probably an O'Reilly glitch, but the discount code only works for the ebook (50% off) and print (40% off) versions separately.

For the combined ebook/print package the code does not work. "You did not meet the criteria for this discount"

mparrett · on April 28, 2015

Thanks for the sample and discount. Nicely done! Appreciate the clean code and occasional snarky bits. :)

TorKlingberg · on April 27, 2015

Thanks! Note that the code is for the O'Reilly shop, not for Amazon.

sputknick · on April 26, 2015

Thanks brother. Ordered!

Abundnce10 · on April 27, 2015

Thanks, Joel! Just ordered a hard copy.

barely_stubbell · on April 26, 2015

Does anyone have any recommendations of books that might pair well with this one in the math/data/statistics space? Thought I might pick up a few books and score some free shipping.

jkldotio · on April 27, 2015

Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran is a bit old now but is excellent, 4.5 stars on Amazon from 100+ reviews.[1] A bit of overlap with this one, but there are some great explanations.

[1]http://www.amazon.com/Programming-Collective-Intelligence-Bu...

playing_colours · on April 27, 2015

Not specifically as a pair with that book, but I found the following books help to build some background to start on Data Science and Machine Learning:

1. Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach http://www.amazon.com/Vector-Calculus-Linear-Algebra-Differe...

2. OpenIntro Statistics 2nd Edition https://www.openintro.org/stat/textbook.php

thehoff · on April 26, 2015

Looks interesting, I'll probably pick this up.

Have you posted to DataTau?

jsnk · on April 26, 2015

This book looks very close to what my girlfriend is looking for. She's interested in learning bioinformatics and it's been difficult to find a good book that introduces topics in data science in a digestible manner.

If anyone knows the book, can you give a quick overview of how much math, stats, programming, and comp sci you'd need to read this book? Thank you.

joelgrus · on April 26, 2015

I know the book, I wrote it!

Most of the math is vector space arithmetic. There are a few sections that use matrix multiplication. The probability and stats is stuff like understanding probability distributions and Bayes's Theorem. (It's all covered in the book, but you'd need to be comfortable picking it up and using it.)

In terms of programming, not much. Someone who's never programmed before would probably have a tough time, but the goal is that someone who is bright and hardworking and who can write fairly simple Python programs should not have a problem. Very little CS background required. Maybe basic data structures like list vs dict and so on.
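
As a rough gauge of the level described above, here is a small sketch of vector arithmetic on plain lists plus Bayes's Theorem. It is illustrative only and not taken from the book; the function names and the example numbers are assumptions.

```python
def vector_add(v, w):
    """Add two equal-length vectors component by component."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    """Multiply every component of v by the number c."""
    return [c * v_i for v_i in v]

def dot(v, w):
    """Sum of the component-wise products of v and w."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E), given P(H), P(E | H), and P(E | not H)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Made-up example: 1% of people have a condition, the test catches 99% of
# true cases, and it wrongly flags 5% of healthy people.
posterior = bayes_posterior(prior=0.01, likelihood=0.99, false_positive_rate=0.05)
```

With those made-up numbers the posterior works out to 1/6, so a positive result still leaves the condition fairly unlikely. If code like this reads comfortably, the prerequisites above are probably met.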

Goladus · on April 27, 2015

Reading this comment for some reason makes me curious how much time the book spends addressing computer science fundamentals like cpu and memory. My guess is that it's included in bits and pieces along the way but I didn't see anything explicit in the table of contents.

I'm thinking about it in terms of running computation in production environments where you may be constrained by available compute resources or budget. Some people have an intuitive grasp of cpu/memory/bandwidth and can do performance tuning as necessary, but those who don't can find themselves in situations where they waste a lot of resources, such as running a million parallel jobs that each have less than 1 second of CPU time, getting stuck after failing to request or provision nodes with sufficient memory, or performing unnecessary reads and writes.
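
One common way around the million-tiny-jobs problem is to batch the work so each job runs long enough to amortize scheduling overhead. A rough sketch, where `process_batch` is a hypothetical stand-in for the real computation and the batch size of 10,000 is arbitrary:

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def process_batch(batch):
    """Hypothetical placeholder for the real per-item work."""
    return [x * x for x in batch]

tasks = list(range(1000000))

# Submit 100 jobs of 10,000 items each instead of 1,000,000 one-item jobs.
batches = list(chunked(tasks, 10000))
results = [process_batch(batch) for batch in batches]
```

The right batch size depends on the scheduler and on how long each item takes, so it is worth measuring rather than guessing.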

a_bonobo · on April 26, 2015

She may also enjoy Vince Buffalo's Bioinformatics Data Skills

http://shop.oreilly.com/product/0636920030157.do

It's more focused on how to analyze existing biological data with the shell, R, and how to use git.

Personally, I've rarely seen advanced machine learning being used outside of genome-wide association studies, and even there most people just use PLINK's logistic regression without understanding what's being done and call it a day.

Another really good book on how to understand statistics is Motulsky's Intuitive Biostatistics - it introduces all common "tests" and methodologies people working in the life sciences use, but without the formulas (you use R for that anyway). It's more about the caveats of each test, in which situation you'd use it, what can go wrong, how to interpret the results etc., all written in a very lively style.

sonabinu · on April 26, 2015

There is a sampler for a taste of what's in there http://cdn.oreillystatic.com/oreilly/booksamplers/9781491901...

mdesq · on April 27, 2015

Thanks Joel. I just purchased both the print and ebook copies from O'Reilly. This is exactly what I've been looking for.

kunjaan2 · on April 26, 2015

Could you please post this on /r/machinelearning as well with an AMA? Thanks.

blumkvist · on April 26, 2015

Good that it discusses overfitting.

I can't help but wonder about those recommender systems. With so little material on statistics I have to assume it's only about observational data, which is the best way to make the millionth+1 useless recommendation engine.

And why is it that the "data science" books never discuss DoE?

x0x0 · on April 26, 2015

Most data scientists, particularly those coming from the cs department, lack most probability and statistics fundamentals. I doubt many of them have even heard of an anova, or sampling distributions, F tests, chi2, etc.

To be fair, ml tends to focus very heavily on prediction, not inference / interpretation of betas. In many tree models how to even understand coefs is an open question.

ced · on April 27, 2015

> anova, or sampling distributions, F tests, chi2

Most of the big ML books are heavily Bayesian, and these subjects are less discussed (though IIRC Gelman's book has a "Bayesian ANOVA"). Even Elements of Statistical Learning, which is very frequentist in its approach, only references ANOVA in passing. Do you have any books to recommend about these fundamentals?

x0x0 · on April 27, 2015

basic level, very approachable, filled with case studies (and with R code to run them easily found), but stupidly expensive: _statistical sleuth_ by ramsey (but, you know, pdfs can be found on the internets)

intermediate level, covers some blocking IIRC: _Statistics for Experimenters_ by Box et al

advanced: I thought quite good, but classmates did not universally love. Unfortunately does not come with case studies or R code to run them; I have a bunch but (very unfortunately) printed instead of computerized and, in any case, probably copyrighted by my professors. _Experiments: Planning, Analysis, Operation_ by Wu and Hamada. The math is not complex but can be involved for various types of blocking designs.

ced · on April 27, 2015

Thanks a lot, statistical sleuth looks very readable and interesting!

How do you feel about the Bayesian approach to these questions? (cf. Gelman's Bayesian Data Analysis)

Goladus · on April 26, 2015

What is DoE? I hear DoE and I think "Department of Education" or "Department of Energy" which is also what comes up in a web search...

christopheraden · on April 26, 2015

Most likely http://en.wikipedia.org/wiki/Design_of_experiments

ced · on April 26, 2015

Do you have any reading recommendations about DoE?

memilanuk · on April 27, 2015

I've been wondering just about exactly the same thing - why DoE gets no love, while everyone is crazy for ML.