Foundations of Data Science by John Hopcroft [pdf]

luisjgomez · on April 25, 2015

"Please do not put solutions to exercises online as it is important for students to work out solutions for themselves rather than copy them from the internet"

Why exactly?

I've found that being able to walk through a solution can be quite illuminating. There's also an exploration/efficiency tradeoff. At some point you can't spend any more time thinking through a problem (because life) and being able to work through a solution still brings many (if not maximal) benefits.

Homunculiheaded · on April 26, 2015

Robert Ash, imho one of the best writers for mathematical self-study, lists "Include Solutions to Exercises" as #2 piece of advice in 'Remarks on Expository Writing in Mathematics'[0] His quote is better than any summary I could come up with:

"There is an enormous increase in content when solutions are included. I trust my readers to decide which barriers they will attempt to leap over and which obstacles they will walk around. This often invites the objection that I am spoon-feeding my readers. My reply is that I would love to be spoon-fed class field theory, if only it were possible. Abstract mathematics is difficult enough without introducing gratuitous roadblocks."

[0]http://www.math.uiuc.edu/~r-ash/Remarks.pdf

jules · on April 25, 2015

Perhaps because if the answers are online then students have to spend willpower to resist using them. I heard of a professor who let his students choose individually whether their class attendance would be mandatory or not, for the same reason.

That said, most likely it's just to enable homework grading.

rohunati · on April 25, 2015

I agree. I have found that I've done best in courses where the instructor post solutions to all problems, homework, practice, and exams.

bootload · on April 26, 2015

"Why exactly?"

Solutions are one way to solve a problem. Providing an answer dissuades students finding novel answers for themselves.

Hobotron1 · on April 25, 2015

Same- in the event I do get stuck in a textbook, it's usually on one of the "challenge" problems that tend to require a clever insight that you can be led to if you have access to a professor with office hours. I don't, so following a solution up until the clever insight (and no further) is the best I can really do.

mrcactu5 · on April 25, 2015

i know this book is not for everybody but I really like the math approach here. software engineers don't like it when I say "data science has been around 500 years since Kepler and Newton"

here they illustrate a nice connection between large deviations and convex geometry - and have beautiful pictures.

so what are examples of high dimensional vector spaces? the set of all your customers (hopefully in the thousands!!) and the vectors include all of their transactions. there are many other examples

formulaT · on April 25, 2015

Skimming the chapters, it seemed to be taking various fields of mathematics that are used in data science, and presenting the foundations of those fields as the foundations of data science.

I'm biased, but I think that data science is statistics, and therefore the foundations of data science is statistics and probability theory. If you want to understand data science at a fundamental level, I would suggest taking courses in these areas.

cbgb · on April 25, 2015

Just for the record, from the first page: "Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems."

The appendix covers Probability and Linear Algebra.

formulaT · on April 26, 2015

My point was that a book with the title "Foundations of Data Science" should be mostly probability and statistics. Undergrad probability and linear algebra is not a solid foundation in statistics.

Statistics is a powerful lens through which to view all data science. E.g. supervised learning is building a model of the conditional probability P(y|x). Again, I am biased, but I think that methods that do not have some statistical interpretation are unlikely to be useful. E.g. if we take the graph of Facebook users and apply some matrix decomposition algorithm, who cares? What can we do with this decomposition? What does it predict?

jules · on April 26, 2015

While I would like to agree with this based on aesthetics, it didn't work that way in practice. A lot of the most successful machine learning algorithms do not have a statistical grounding or did not when they were invented. E.g. neural networks, SVM, low rank matrix approximation, k-means, decision trees/forests.

formulaT · on April 26, 2015

Yes, the core algorithms do have this tendency (which is the exciting thing about machine learning per se), but statistics provides the context to understand them.

E.g. people used to say "neural networks are a simple, flexible functional form for y = f(X,theta)". This turned out to be wrong: SGD training of neural networks has more advantages than the flexibility of the functional form. But it was a good hypothesis and starting point.

SVMs and decision trees have no statistical justification I know of. Low rank matrix approximation and k-means are justified by latent variables and non-parametric kernel methods respectively. I agree these justifications came after the fact, but they do give a way to understand how these models work.

Most importantly, all of the small tasks surrounding training a model are purely statistical, e.g. cross validation, different measures of accuracy, handling endogenous variables, etc.

mbq · on April 26, 2015

There are two camps, one for which it is all about points, and second for which it is all about odds. When failing, first one fails at validation, while the second at performance. Obviously both mostly use the same tools and tricks, though with different ideologies, and overall produce comparable results.

brational · on April 26, 2015

> Undergrad probability and linear algebra is not a solid foundation in statistics.

Are you trying to argue that statistics is its own field and not built upon mathematics?

formulaT · on April 26, 2015

Clarification: probability and linear algebra are the foundations of statistics. Statistics is the foundation of machine learning.

If I have not studied statistics I will not have a solid foundation in statistics, even if I have studied probability and linear algebra.

IndianAstronaut · on April 25, 2015

Machine learning does take from other fields besides statistics, such as signal processing and optimization.

brational · on April 26, 2015

To be pedantic, those all build on linear algebra & functional analysis. At the end of the day it's all just math... just depends how far you want to zoom out the abstract lens.

tristanz · on April 25, 2015

Can somebody explain to me the underlying theory of this type of book?

The only books that have ever felt coherent to me start with p(data, unknown) as being an approximate model of some domain. Everything then follows smoothly as inference, modeling, and computational methods or shortcuts.

jules · on April 25, 2015

There are two approaches. One approach starts with the data and asks what algorithms can we find to infer something about the data, and the other approach starts with an algorithm and tries to find a dataset on which the algorithm infers something good. This is even more general than machine learning; in many cases you either have a problem and are searching for a solution, or you have a solution and you are searching for a problem. Often you need a mix of both (a bidirectional search if you will). It looks like this book is a mix of both.

In the end though the only thing that matters is whether a particular procedure has predictive value. Even with the most principled Bayesian analysis the model or prior are strictly speaking wrong (usually both), because the real world is complicated, so even there you're only left with testing predictive performance in practice. Careful probabilistic modeling is only valuable insofar as it lets us focus our search for inference procedures on those that are more likely to work in practice. A good counterexample is deep learning, which does not (or did not) have a solid probabilistic justification, but works extremely well in practice.

hyperbovine · on April 25, 2015

I used a previous version of this manuscript in a class, and it was more descriptively called 'Computer Science Theory for the Information Age'. It's a collection of stuff computer scientists ought to know about probability, (high dimensional) statistics, ML, randomized algorithms, etc. They seem to have sexed up the title a bit; at least it does not contain "Big".

jey · on April 25, 2015

I agree, but it's also nice to have a toolbox of actually tractable algorithms. Principled practical data science should approach the theoretical ideal (i.e. p(unknown|theta)), but often has to use some approximations that we actually know how to implement efficiently.

jey · on April 26, 2015

Heh, I meant s/theta/data/.

hyperbovine · on April 25, 2015

*And Ravi Kannan

SpaceManNabs · on April 26, 2015

This seems like a nice book. It covers a lot of math that is bound to be useful in some context in machine learning. I find that books like Machine Learning: A Probabilistic Perspective by Murphy serve better for beginners that have some math background as the beginning chapters introduce a lot of the useful math. It is not as in depth as this book in the math details, but these chapters show which mathematic tools are most used, and students can always find literature for what they don't understand.