Probabilistic Machine Learning: An Introduction (probml.github.io)
310 points by joaorico on Dec 31, 2020 | 56 comments



The new edition has been split into two parts. The PDF draft (921 pages) and Python code [1] of the first part are now available. The table of contents of the second part is here [2].

From the preface:

"By Spring 2020, my draft of the second edition had swollen to about 1600 pages, and I was still not done. At this point, 3 major events happened. First, the COVID-19 pandemic struck, so I decided to “pivot” so I could spend most of my time on COVID-19 modeling. Second, MIT Press told me they could not publish a 1600 page book, and that I would need to split it into two volumes. Third, I decided to recruit several colleagues to help me finish the last ∼ 15% of “missing content”. (See acknowledgements below.)

The result is two new books, “Probabilistic Machine Learning: An Introduction”, which you are currently reading, and “Probabilistic Machine Learning: Advanced Topics”, which is the sequel to this book [Mur22].

Together these two books attempt to present a fairly broad coverage of the field of ML c. 2020, using the same unifying lens of probabilistic modeling and Bayesian decision theory that I used in the first book. Most of the content from the first book has been reused, but it is now split fairly evenly between the two new books. In addition, each book has lots of new material, covering some topics from deep learning, but also advances in other parts of the field, such as generative models, variational inference and reinforcement learning. To make the book more self-contained and useful for students, I have also added some more background content, on topics such as optimization and linear algebra, that was omitted from the first book due to lack of space.

Another major change is that nearly all of the software now uses Python instead of Matlab."

[1] https://github.com/probml/pyprobml

[2] https://probml.github.io/pml-book/book2.html


It's very encouraging to see Matlab losing ground in the educational space. I don't know why so many engineers let their foundational skills be locked behind a proprietary ecosystem like that.


>I don't know why so many engineers let their foundational skills to be locked behind a proprietary ecosystem like that.

Because no open source toolkit can do what Matlab can do.

The same is true of a lot of high end software: Photoshop, pretty much any serious parametric CAD modeling system (say, SolidWorks), DaVinci Resolve, Ableton Live, etc. When a professional costs $100K+ to employ, paying a few grand to make them vastly more productive is a no brainer. If open source truly offered a replacement, then these costly programs would die. But there just isn't anything close for most work.

Matlab is used for massive amounts of precise numerical engineering design, modeling, and running systems. So while Python is good for some tasks, for the places Matlab shines Python is nowhere near as usable. And before Python catches up in this space, I'd expect Julia to get there faster.


As someone who helped migrate a university course from Matlab to Python, I must say the proprietary features of Matlab had nothing to do with why it lasted so long.

Basically, it was mainly inertia: older professors who liked it and rarely used anything else, plus the fact that generally no one gets rewarded for rewriting parts of an existing, functioning course.

As an instructor, you basically create more work for yourself the first time you migrate a course's programming language. (And you also annoy some senior staff by forcing them to learn new things.)


I work at a government r&d/systems engineering center, and it's the same case here. The engineers who went through college with Matlab use that as their default (i.e., when the project doesn't call for something else from the start), while newer engineers don't. As that generation ages out, it'll be more and more sidelined. It's their inertia keeping it around at all.

Proprietary features don't matter here the way they might elsewhere. We get MathWorks employees here at least a couple times a year hawking their latest (paid) libraries, but at this point they're always 5+ years too late, offering something that already exists in the preferred languages--often for free.

Since our clients never deploy Matlab, it doesn't matter if its libraries are fractionally faster in any case besides mockups and experimentation in R&D, and I've never met anyone who chooses it for speed there. Besides, in a day when even laptops are fast and cloud instances spin up in a few seconds, there's no point. It's also easier for a dev complaining about not having enough RAM to get a better machine than to take the time to learn a new language for a specific use case. Likewise, the project manager will prefer the quicker solution: buying.

The one item close to a "tie" with Python here is probably migration. Matlab always, and Python most of the time, gets rewritten into something else; Java in my department.


I guess I'll rephrase - if you can't understand a transfer function or a probability distribution without opening Matlab, then you've allowed your own expertise to be held hostage. Unfortunately, I know a large number of professionals for whom this is true.

If you're more productive in Matlab, that's fine. But if you're at a loss without it, that's not.

It doesn't belong in the education system or in educational books.


Conversely, if at every step of learning, you're hindered by inferior tools, you'll learn less, and be at a permanent disadvantage to those using superior tools.

If your job will use tool X, learning it well has value. Those not learning it will be at a disadvantage.

Again, no open source software can do what Matlab can. Why ignore this?


> Again, no open source software can do what Matlab can. Why ignore this?

Can you list (or point to a list of) some of Matlab's features that are absent from other software?


One big feature is the massive amount of built-in functionality [1]. You don't have to find various packages, install them, spend a day fighting version issues, discover that some author hasn't upgraded to a recent language version or used a non-standard logging facility, or hit any of a zillion other time-sinks you face daily when gluing open source packages together. As soon as a professional has been paid to fight open source integration for 1-2 days, it would have been better if the employer had simply bought Matlab.

And here is by far the biggest issue with open source: the numerical accuracy of lots of it is crap. Matlab (and Mathematica, etc.) have employed professional numerical analysts to create numerically stable, robust algorithms, and have had decades (Matlab started in the late 1970s, under academic numerical analyst Cleve Moler) of refinement to weed out bugs. It's the difference between using BLAS and writing your own linear algebra package - one is likely far more robust.

Sure, some numerical open source packages are decent, and a few are excellent (BLAS and related). But when you need to glue some together, you end up far too often with stuff that's just flakey for production work.

If you've ever coded the quadratic formula as written in high school textbooks without knowing the mess you just made, then you are like most open source developers. Taking almost any formula from a paper and just typing it in is usually the wrong way to implement it numerically, but this is what open source projects do. A robust engineering platform should have every such formula analyzed for the proper form(s) to maintain numerical robustness, and it should also avoid giving users easy ways to do things that are not robust. This is the biggest difference between tools like Matlab and Mathematica and open source projects.
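To make the quadratic-formula point concrete, here is a minimal Python sketch (the coefficients are mine, chosen to trigger the failure): the textbook form subtracts two nearly equal numbers when b^2 >> 4ac, while the standard stable form avoids the cancellation.

    import math

    def quadratic_naive(a, b, c):
        # Textbook form: x = (-b +/- sqrt(b^2 - 4ac)) / (2a).
        d = math.sqrt(b * b - 4 * a * c)
        return (-b + d) / (2 * a), (-b - d) / (2 * a)

    def quadratic_stable(a, b, c):
        # Compute the large-magnitude root first, so two nearly equal
        # numbers are never subtracted; recover the other root from
        # the product of roots, x1 * x2 = c / a.
        d = math.sqrt(b * b - 4 * a * c)
        q = -0.5 * (b + math.copysign(d, b))
        return q / a, c / q

    # With b huge relative to a and c, the '+' branch of the naive form
    # cancels catastrophically:
    print(quadratic_naive(1.0, 1e8, 1.0))   # small root off by ~25%
    print(quadratic_stable(1.0, 1e8, 1.0))  # (-1e8, -1e-8), both accurate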

And, like the time spent fiddling with getting open source to work, as soon as you have one engineering task or design fail due to numerical problems, it would have been vastly cheaper to simply use the better tool - Matlab.

Sure, most people don't use it very much and rarely run into such problems. But people using it for serious work in engineering toolchains or production systems cannot risk the instability of open source.

And those reasons are why things like Matlab still exist, have incredible revenue, and are growing in use.

For example, want to do some work in Python? Well, soon you need numpy. Then you might want pytorch - but crap, it's numpy-ish, but not numpy. So you learn some more nuances to get the two to play nicely and produce consistent error messages... Then you need some visualization - again, another package (with a host of dependencies), with different conventions, syntax, and uses, and god forbid these packages get a little out of sync between releases - then you get to spend a day chasing that down. Now you want some optimization stuff - pull in scikit, but it's not quite consistent with the other libs... so you spend more time writing glue functions between the pieces you want to build on. Next you need some finite element analysis - oops, the options are pretty much dead compared to the massive number of toolkits already in Matlab.

Take a moment and look through the list(s) of functions and toolkits standard in Matlab [1]. For an incredible amount of engineering work, what you need is there - you spend less time trying to assemble enough pieces to start working and instead get on with the parts you actually want to build.

There's a reason matplotlib borrowed a lot of its ideas from Matlab - they're quite useful.

[1] https://www.mathworks.com/help/referencelist.html?type=funct...


> As soon as a professional has been paid to fight open source integration for 1-2 days, it would have been better if the employer had simply bought matlab.

I'm a licensed professional, and in my experience it takes 1-2 hours to set up a conda virtualenv with all the packages I need. Whereas if I want Matlab, it takes about a week to talk through the budgeting and licensing options with my employer, find the right number of seats to purchase (other departments might decide to get in on the purchase, so we need to consult broadly), choose which toolboxes we'll pay for, go back and forth on the quotes and POs, and make sure all the licensing really works.

But your mileage may vary.


>I'm a licensed professional, and in my experience it takes 1-2 hours to set up a conda virtualenv with all the packages I need.

Yes, there are problems where Python is an easy solution. And many where it is not. And some where it cannot solve the problem without extreme effort.

Having been in dev a long time, I can say this is the simplest, best-case path, where the naive approach works. If this were how setting up Python worked for everyone, there would not be an incredible number of forum posts, GitHub issues, and setup problems easily found on the internet. If you've never had to change underlying code in some Python package, or even worse recompile underlying C libraries, then you have not faced the kinds of problems many (me included) have.

Ever solve a problem like the one I listed? That is not a simple conda install (and I use conda vastly more than Matlab/Mathematica, so I'm pretty aware of its use and features). Many problems I can solve in Mathematica (my preferred tool for certain work) cannot be approached in Python at all (or with any open source tools I am aware of, and I have tried pretty much all of the things listed as MMA replacements).

>find the right number of seats to purchase (other departments might decide to get in on the purchase

So you're no longer making an apples-to-apples comparison - you just solved a bigger problem on the Matlab side.


But isn't Octave supposed to include the same built-in functionality as Matlab?


No. Octave claims to support Matlab syntax, and they largely do, but not completely. And they most certainly don't provide all the packages Matlab has, which is where a lot of the use is.

Octave is also unstable, and I doubt any company needing heavy use of a tool like this in production would trust Octave not to puke. It's simply cheaper to use the polished and vastly more feature-rich tool. Download Octave, go find some decently complex Matlab code on GitHub, and try to run it. Do that a bit and see how much of it works as it should.

Octave lists the places where it differs [1], some of which are core pieces that don't work the same. So if you want to replace some engineering tasks with Octave, it's going to be a mess, in the same way OpenOffice is close to MS Office right up until the day you send a proposal with a deadline and it breaks because the other end used MS Word instead of an almost-clone.

I've used Octave - it's decent. If you cannot afford Matlab, or your school doesn't have it, or you want to learn "matlab" to get marketable skills, then you can learn on Octave. Most serious engineering will not be done on Octave, though.

[1] https://wiki.octave.org/Differences_between_Octave_and_Matla...


In the specific case of ML courses - many of which I have TA'd or attended - this reason does not ring true at all. Libraries for most standard algorithms are available in some form with a Python interface (or, for the more statistical stuff, R). It's almost always the inertia from the initial design of the course.

It is also not true today that not knowing Matlab harms your industry productivity in ML. This might have been true around a decade ago, but most teams outside academia have also moved to non-Matlab resources. And if anything, this has been further reinforced by deep learning libraries, the current crop of MLOps tools, and cloud-based frameworks.

Matlab might be good for specific areas, but ML has not been a stronghold for a while. It is also important to remember that in the context of numerical accuracy or computation speed, Python is almost always just the user-facing layer. You might (correctly) argue that the Python language is slower than X, but this is not a useful metric for comparing libraries and frameworks, where the compute-heavy code is probably in C/C++: numpy, tensorflow, and pytorch are good examples of this.


Professionals cost $100k+ to employ partially because only those able to afford those tools for training get into the field.


Those fields require work to get done, so they use tools that make people as productive as possible. There are simply no open source packages with the wide range of numerical capability that Matlab has.


It's a regression as far as code readability goes, for a fairly straightforward reason: almost everything in Matlab is a matrix. Matrices are not first-class citizens in Python, and it matters. I use Python a hell of a lot more than Matlab, but for examining how an algorithm works (say, for implementing it in another language or modifying it to do tricks), Matlab wins. Go look at these PRML collections in Python and Matlab and see if you disagree:

https://github.com/ctgk/PRML

https://github.com/PRML/PRMLT


I used to feel the same, but three years after making the switch, I've changed my mind. Matlab code has brevity, but sometimes at the expense of clarity. For example, sum(x, axis=1) is clearer than sum(x, 1), especially when Matlab has functions like diff() where the second argument is not the axis.

Broadcasting in Python is a lot cleaner than the "bsxfun(@plus, ...)" abomination in Matlab. If you think all the "np." is too wordy, then just do "from numpy import *". For matrix multiplication you can use "@". Numpy code can be dense, but most people choose clarity over brevity.
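For anyone who hasn't made the switch, a small numpy sketch of the points above (the array values are just for illustration):

    import numpy as np

    X = np.arange(12.0).reshape(3, 4)  # a 3x4 matrix

    row_sums = X.sum(axis=1)           # explicit axis keyword, vs Matlab's sum(X, 2)

    # Broadcasting: center each column, no bsxfun(@minus, X, mean(X, 1)) needed
    Xc = X - X.mean(axis=0)

    # '@' is the matrix-multiplication operator, so the algebra stays readable
    G = Xc.T @ Xc                      # 4x4 Gram matrix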


I'd rather write Python than Matlab any day (I made this choice, literally, in '98); the point here is about reading. Matlab is closer to math notation, and Python is a clunky programming language. I'd never in a million years write new code in Matlab, but I prefer it for didactics.


The only thing I find nice in what Mathworks offers nowadays is their caps & T-shirts at conferences. MATLAB is on Medicare in deep learning times.


This is probably my favorite introductory machine learning book. The fact that he places almost everything in the language of graphical models is such a good common ground to build off.

This really sets you up to realize that there is (and should be) a lot more to doing a good job in machine learning than simply minimizing an objective function. The answers you get depend on the model you create, as do the questions you can hope to answer.

I don't see a clear list of differences between this new edition and the original. Does anyone know what's new?


Agree with you. But none of this is useful for practical (applied) machine learning. I don't want to disappoint you, but you can read it as machine learning porn; otherwise, don't waste time on it.


I mean, as a graduate student, I found it incredibly useful. As a practicing data scientist, I'd have to say that it's also incredibly useful.

I’ve used this stuff, and more often, the ideas taught, to break down a problem into a tackle-able set of pieces more times than I can count.

Never underestimate the fundamentals. Too many of my colleagues use models without actually understanding any of it. I’ve debugged so many problems by looking at the technical details in original papers and textbooks.


Are you saying the book itself is ML porn?


Yes, unless you are among the 20 top researchers working on the frontier of ML. Bayesian probabilistic techniques don't work, or are very slow, for any practical purpose.


Oh crumbs! There I was thinking that by obtaining an estimate of the probabilities of the responses of different groups to an employee survey, I was applying a Bayesian probabilistic approach.

I'm going to have to rethink everything now, since it worked and was quite quick (I didn't even sample using MCMC, just brute-force pulled permutations), so it was clearly not a Bayesian approach, and I am very, very far from being one of the top 20 (or 200, or 2000, or 20000, maybe 200000?) researchers...


This may be true for whatever small corner of the data science world you inhabit but it isn’t true in general.

To choose just one example, the analysis of the new UK COVID variant relies on Bayesian modelling, both for the government analysis and the Imperial paper. (https://www.imperial.ac.uk/media/imperial-college/medicine/m...)


Turing.jl[1] is quite usable and isn't slow[2].

[1] - https://turing.ml/dev/ [2] - https://arxiv.org/abs/2002.02702


Can it handle 1000 predictors with 1 million data points?


What is the advantage of placing everything in the language of graphical models? How do other ML books do it?


Graphical models are just a way to encode relationships between different variables in a probabilistic model. Directed acyclic graphs (DAGs) allow you to specify (most of) the conditional independence structures that you can have between things like parameters and random variables.

This is really useful information because it can help you identify what information is truly relevant for the estimation of certain parameters (so sufficient statistics) or help you crystallize your understanding of the implications of the model you’ve created. In other words, it helps show you the ways in which your model says different aspects of your data should influence others.

This creates testable implications of the model. If your model says that two variables should be conditionally independent given a third, but they’re not, you have an avenue for refinement. You can also clearly identify your assumptions or the implications of your assumptions.
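As a toy illustration of that kind of testable implication (my own example, not from the book): a linear-Gaussian chain x -> y -> z implies x and z are independent given y, which you can check on data.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)    # x -> y
    z = -1.5 * y + rng.normal(size=n)   # y -> z

    def partial_corr(a, b, given):
        # Correlate the residuals after linearly regressing out 'given'.
        ra = a - np.polyval(np.polyfit(given, a, 1), given)
        rb = b - np.polyval(np.polyfit(given, b, 1), given)
        return np.corrcoef(ra, rb)[0, 1]

    print(np.corrcoef(x, z)[0, 1])  # far from 0: x and z marginally dependent
    print(partial_corr(x, z, y))    # ~0: conditioning on y breaks the dependence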

Another useful thing about them is that it is known for which structures exact inference is computationally infeasible (which turns out to be most of them), and the graph tells you where your model falls. There are a lot of different inference schemes available that offer approximations with various drawbacks/advantages, heuristics that sort of work, or even ways of drawing samples from the true distribution if you can identify the structures. See belief propagation, loopy belief propagation, sequential Monte Carlo, and Markov chain Monte Carlo methods.
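To make the feasibility point concrete, here's a tiny sketch (with made-up potentials) comparing brute-force marginalization against a sum-product message pass on a chain, where the structure makes exact inference cheap:

    import itertools
    import numpy as np

    # Chain x1 - x2 - x3 - x4 of binary variables, one pairwise potential
    # per edge; p(x) is proportional to the product of the potentials.
    psi = [np.array([[1.0, 0.5],
                     [0.5, 2.0]]) for _ in range(3)]

    def brute_marginal_x4():
        # Enumerate all 2^4 configurations: exponential in chain length.
        p = np.zeros(2)
        for cfg in itertools.product([0, 1], repeat=4):
            w = 1.0
            for i in range(3):
                w *= psi[i][cfg[i], cfg[i + 1]]
            p[cfg[3]] += w
        return p / p.sum()

    def sumproduct_marginal_x4():
        # Pass one message along the chain: linear in chain length.
        msg = np.ones(2)
        for edge in psi:
            msg = msg @ edge  # sum out the earlier variable
        return msg / msg.sum()

    print(brute_marginal_x4())       # same answer both ways,
    print(sumproduct_marginal_x4())  # but wildly different scaling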

On top of this it helps you see everything in a general framework. Lots of the fundamental pieces of ML models are really just slight tweaks to other things. For instance, SVMs are linear models on kernel spaces with a specific structural prior. Same with splines; it’s just a different basis function. All of this helps you see the pieces of different methods that are actually identical. This helps you make connections and learn more effectively, in my opinion.


Absolutely. I very strongly agree with the last paragraph, and that's how I aspire to learn in general. Can you point me to some resources (books or otherwise) that go over all these relationships in a general framework?


Unfortunately, it's just something you start to notice once you become more familiar with the fundamental math underlying all of this.

The book, Machine Learning: A Probabilistic Perspective by Kevin Murphy (the original book everyone in this thread is talking about) is probably the closest thing I can think of. Its goal is to frame everything around graphical models and probability. It's quite a tome. Still, despite its breadth, it can't possibly cover everything.


For anybody truly serious about this field, I recommend the book below. It has some poor reviews on Amazon, which I was shocked to see, but it is my favourite book, and it taught me the core of probability theory and statistics in a way most books don't. Your understanding of machine learning will be better than 90% of those out there if you can get through the principles in this book.

I topped statistics at the most prestigious university in my country at both the undergrad and postgrad level, and had no problem discussing advanced concepts with senior PhDs in quantitative fields, and I thank this book the most for starting my journey. But, and this is important, make sure to do all the exercises!

https://www.amazon.com/John-Freunds-Mathematical-Statistics-...


> it is my favourite book

Is it your favourite book because of how much your personal history is tied to it and the time you devoted to it, or are you comparing it against other books based on an analytic review and comparison of several books that you did at some point?

Nothing wrong with the former case; I also have favourites that I recommend. But if it's the latter, the recommendation is more helpful; in that case it would be awesome to detail why this one over others.

This idea of comparative review is used here:

https://fivebooks.com/

And here:

https://www.lesswrong.com/posts/xg3hXCYQPJkwHyik2/the-best-t...


This is a very good post and I agree with your comments.

The book for me was great partly because of its contents and partly because I worked through every problem and realized how much it taught me. I should do a more factual write-up on why it's a great book; I'll try to when I get some time.


Looking at the table of contents (as someone who is not familiar with the term "probabilistic machine learning"): is this just covering typical ML methods through the lens of probability?


The answer is not so black and white, since everything in ML has to use probability. You can ignore this book unless you are among the 20 top researchers working on the frontier of ML. Bayesian probabilistic techniques don't work, or are very slow, for any practical purpose.


But does it aid in understanding regular models, given that they might have a Bayesian interpretation?


Yeah, for sure, but it's overkill. It's like reading quantum mechanics to understand Newtonian mechanics. If you want to get a feel for Bayesian ML, here is an easier book, "Regression and Other Stories": https://avehtari.github.io/ROS-Examples/


Thanks. How does that relate to Gelman's BDA or Statistical Rethinking?


BDA is a more advanced version of this book. Statistical Rethinking is also good, and one can start from scratch with it, as it has more code and less math.


Ty


Why should ML books be so big? In many cases the books are several sub-books pasted together as one, or extensive bibliography reviews that just list progress without any pedagogy. I would suggest splitting them into 200-250 page parts that can stand on their own.


While book 1 looks great, it appears that book 2 is still very rough: https://probml.github.io/pml-book/book2.html


Excited to check it out; this was a game changer for me. It turned me on to Gaussian processes, which I think are a really fun tool.


Also recommended: Probabilistic Programming & Bayesian Methods for Hackers, another resource for exploring this space:

https://camdavidsonpilon.github.io/Probabilistic-Programming...


Thanks, that looks like a good resource for my learning style. I usually learn new things by playing around with code before diving into theory (I know that this sounds backwards).


Yeah, this is a more practical and useful book.


The original 2012 book was awesome!

Would love to get my hands on the draft for "Probabilistic Machine Learning: Advanced Topics".

https://probml.github.io/pml-book/book2.html


In my opinion (having read both books and TA'd courses using both), Murphy is a significantly better book than Bishop's Machine Learning. I'm very excited about a sequel!


Quite excited to read this. Murphy does a great job of explaining concepts from first principles.


I think it would be extremely helpful to map the math into code. Nobody has done this as far as I've seen. I mean, you can find GitHub repositories for some papers, but a set of explicit tutorials going from the math to the code would be really helpful.

To say something is "machine learning", I think, means that you should show the code, not just equations and derivations.

I mean if you only show math and derivations, what's the point? To show off what you know? How is that helpful?


This actually exists; it's not complete yet (I think?), but it covers a lot of the material in the book:

https://github.com/probml/pyprobml


Yeah, at least he gives a full derivation for every equation :)


I helped with this project last year. It's very cool to see it taking shape.



