Fast implementation of DeepMind's AlphaZero algorithm in Julia

cgreerrun · on June 22, 2020

I've been working on a Python implementation that uses Gradient Boosted Decision Trees (LightGBM/Treelite) instead of using a neural network for the value/policy models:

https://github.com/cgreer/alpha-zero-boosted

It's mostly to understand how AlphaZero&Friends work. I'm also curious about how well a GBDT could do, and if there are self-play techniques that can accelerate training.

The nice thing about a GBDT is that, unlike when using a NN, you can do thousands of value/policy lookups per second on a single core. So it should be cheaper to scale self-play and run a lot of self-play experiments (assuming the self-play learnings when using the GBDT model transfer to when you use the more-powerful NN in these environments).

If you're curious about accelerating self-play training, check out David Wu's work (https://arxiv.org/pdf/1902.10565.pdf). He's the creator of KataGo. I implemented his "Playout Cap Randomization" technique in my implementation above and, sure enough, it's much more efficient: https://imgur.com/a/epaKtDY. It seems like it's still early days in terms of how efficient self-play training is.

jonath_laurent · on June 22, 2020

This is very interesting! If your experiments work out, I would be interested in adding "Gradient Boosted Decision Trees" support to AlphaZero.jl.

I saw the work on KataGo and implementing "Playout Cap Randomization" is indeed on my TODO list.

cgreerrun · on June 23, 2020

I'll be sure to let you know how it goes. I have a few self-play experiments that I'm trying out once I get the testing setup finished.

Do you know if there is a discord channel for all of us AlphaZero nerds that I'm missing? If not, we should make one. It'd be great to bounce ideas off each other.

jonath_laurent · on June 23, 2020

Thanks! You can easily find my email address from my github. I am not aware of a canonical discord channel for discussing AlphaZero but the Lc0 community has a discord channel on which I had some interesting conversations.

psandersen · on June 22, 2020

This is really interesting, thanks for sharing!

I've been thinking about extensions to decision tree models that could get the benefits of NNs and it seems like there are a few ideas floating around.

For example; Probabilistic Random Forests have some really interesting properties for noisy datasets, e.g. "The PRF accuracy decreased by less then 5% for a dataset with as many as 45% misclassified objects, compared to a clean dataset." - https://arxiv.org/abs/1811.05994

PRF's might be a natural fit for RL, especially methods using monte carlo tree search.

Speculating here as I'm not adequately familiar with stochastic calculus, but intuitively it seems like probabilistic decision trees could be made differentiable since the hard decision threshold in a tree could be a turned continuous (i.e. every split is logistic regression), which might enable some really interesting applications. I personally dream of being able to cleanly integrate decision trees and tools from NNs in something like pyro for a fully Bayesian model.

cgreerrun · on June 23, 2020

A fast, powerful bayesian model seems like it would be a game-changer. PUCT (the heart of the AlphaZero MCTS that decides which action to choose) really seems setup to model the action choices as a multinomial bayesian inference problem (it already updates the action priors with Dir noise).

Thanks for the link! I don't really know anything about the world of probabilistic trees. I'll check it out.

The only bayesian approach to decision trees I'm familiar with is BART (https://projecteuclid.org/download/pdfview_1/euclid.aoas/127...). I haven't used them, but I'm guessing because it uses MCMC to update the params it's not super fast. I've seen them used in causality applications for partial dependency plots where it's convenient to convey the certainty of a variable's effect.

gbrown · on June 23, 2020

Check out the SoftBART method, I think there are interesting optimizations possible there. XBart also looks promising as an approach.

credit_guy · on June 22, 2020

There appears to be a LogitBoost tree that does what you say, if I understand you correctly.

[1] https://en.wikipedia.org/wiki/LogitBoost

psandersen · on June 22, 2020

Thanks for the reference, from a quick read it seems LogitBoost is a booster for ensembling models under a logistic loss.

I meant that the splitting point in a node in the decision trees that make up a random forest is itself a random variable that follows a distribution. Because its a smooth function (i.e. probability of splitting is 50% at the point and rises/falls smoothly) it should in principle be differentiable* so that the whole model can be trained by SGD and/or fit into an end to end learning pipeline with convolutional layers etc.

*Where I'm hazy, is how can a smooth probability function be differentiable when sampling. I'm brainstorming in the open here, will do some reading on stochastic neural networks.

algo_trader · on June 22, 2020

Its great when you can indeed iterate without 100s of GPU/hrs.

Are there any papers/comparisons/tradeoffs on when GBDT predictive-power plateaus compared to a NN?

EDIT: with self play you can trade-off a cpu budget for both the GBDT depth, a NN depth, and the roll-out depth - which is super interesting

cgreerrun · on June 23, 2020

> Are there any papers/comparisons/tradeoffs on when GBDT predictive-power plateaus compared to a NN?

None specifically that I know of, but I haven't searched.

"Shallow learning" GBDTs can do pretty well on MNIST (https://www.kaggle.com/c/digit-recognizer/discussion/61480), getting 98%+ accuracy compared to the 99%+ of NNs. So I figured if they can handle MNIST, they can probably handle connect 4, and would be useful to explore the self-play training efficiency aspects of AlphaZero (at orders of magnitude less compute time/cost)

> with self play you can trade-off a cpu budget for both the GBDT depth, a NN depth, and the roll-out depth - which is super interesting.

Definitely. It'll be interesting to see if a deeper MCTS search with a less powerful model can do pretty well. I'm still fairly ignorant about the MCTS literature, but I've definitely seen MCTS married to other value/policy models (linear regressions, for e.g.) that used large numbers of playouts years before Alpha Go came out. Those didn't work out, so seems like the DL aspect of Alpha Zero is somewhat essential to be able to learn games as complex as Go.

jonath_laurent · on June 23, 2020

I am wondering if your idea of using GBDTs in combination with AlphaZero might not be most influential in areas where no neural network architecture is known to provide the right inductive bias for the problem at hand.

I think neural models are pretty unbeatable in many classic RL environments because convolutional neural networks are REALLY good at learning visual representations. In some sense, I suspect that the great success of AlphaGo Zero comes in big part from the fact that it really makes sense to analyze a Go board as a 2D image using convolutional networks: convolutional networks provide the right inductive bias for the problem of learning to play Go.

However, there are tasks where neural network are not as good, such as symbolic manipulation tasks (I am in a good position to know this as I'm doing research in the area of automated theorem proving). I would be very curious to see how your approach fares for those tasks.

cgreerrun · on June 23, 2020

> because convolutional networks provide the right inductive bias for the problem of learning to play Go.

Agreed, seems like there will be a set of environments that GBDT won't work on and a CNN will. I'm only mildly familiar with CNNs, but one thing I wanted to try was to add convolutional features to a GBDT some way. I'm sure it's been tried and worked/failed before, but it would be an interesting exercise (at least for me) to learn why/if it works.

It'd be valuable if there was a way to get 95% of the benefits of CNNs (scale/position invariance, etc.) with less compute, even if the accuracy was lower. Right now the GBDT is probably making a ton of redundant split points for connect 4 to re-detect the same position-shifted pattern. There's gotta be a somewhat generalizable way to at least make that more efficient.

> such as symbolic manipulation tasks (I am in a good position to know this as I'm doing research in the area of automated theorem proving). I would be very curious to see how your approach fares for those tasks.

Definitely! I really want to see a non-game application. I think most people think of AlphaZero as "a cool technique to make SOTA board game agents", but to me it excites me because it seems like a very generalizable technique that'll be applied to other important types of problems.

I know as much as Jon Snow about symbolic manipulation, but if you have an example symbolic manipulation problem that can be shoehorned into an environment (state, actions, transitions, rewards) I'd be down to code it up and see how it does.

jorgemf · on June 22, 2020

how good is your AI so far?

cgreerrun · on June 23, 2020

I'm trying to answer that right now, actually.

For connect 4, once it's trained a bit it seems to do really well. At 800 MCTS playouts (<.4s), it goes from "I can beat it and sometimes we draw" before training to "I pray for a draw, it almost always beats me" after ~4 hours of training.

Connect 4 is a solved game, so it should be possible to sample the space of the trillions of (position, who should win?, what are the best move(s)?) tuples and compare the answers to your value/policy models to get some kind of objective error. I haven't had time to do that, but having that benchmark is nice to have so you don't have to do a "ladder tournament" against some reference bot(s) like you do for Go where you don't know what ideal play is.

After training it for 10 hours on Quoridor (using my personal laptop), it still can't beat me, but it doesn't seem anywhere close to plateauing. It goes from the agents aimlessly wandering around the board looking for victory row and randomly placing walls, to putting walls that thwart the opponent and navigating to the victory row.

I decided to implement PCR and try out some self-play techniques on Connect Four before I give it another go for Quoridor; a few days of self-play improvements can speedup training 10x. That's where I'm at now...

Once I test a few strategies I was thinking of firing up a c5a24x, 96-core box on AWS and giving it another go. It's ~1-2$/hr at the spot price so I can probably do a lot of damage for 50$ or so.

tromp · on June 22, 2020

The implementation includes Connect Four as an example application. While the standard board size of 7x6 is indeed solved, as they note, and in fact all sizes up to 8x8 are [1], they could have picked 9x8 or 9x9 which are currently unsolved. The latter is the new standard size on Little Golem which upgraded from 8x8 when that was solved.

[1] https://tromp.github.io/c4/c4.html

[2] http://www.littlegolem.net/jsp/games/gamedetail.jsp? gtid=fir

[3] http://www.littlegolem.net/jsp/forum/topic2.jsp?forum=80&top...

jonath_laurent · on June 22, 2020

I completely agree with you. Let me just add two remarks. First, although picking 9x9 boards makes connect-four intractable for bruteforce search indeed, I would be suprised if it made it much more difficult for AlphaZero, which relies on the generalization capabilities of the network anyway. Second, using a solved game for the tutorial is a feature, not a bug. This allows precise benchmarking of the resulting agent as a ground truth is known.

dnautics · on June 22, 2020

That's really cool and I didn't think of that. I just wanted clarification: that means you train the agent without the deterministic solution and your "validation/test" (I'm not sure what those phases are called in unsupervised learning) sets are done without the deterministic solution.

jonath_laurent · on June 22, 2020

Yes, the agent is trained without access to the deterministic solution.

tromp · on June 22, 2020

I did not see an evaluation of how close to perfection the agent becomes. Did you compute any sort of error rate (by finding moves that turn a won position into a non-won one or a drawn position into a lost one) ? And how this error rate drops over time as learning advances? That would indeed be very interesting to see.

vishvananda · on June 22, 2020

My team did an implementation of alpha zero connect four a couple of years ago. Our findings are in a series of blog posts starting at https://medium.com/oracledevs/lessons-from-implementing-alph.... We didn't manage to get to perfection either on policy, but got pretty close. You can play against some versions of the network here: https://azfour.com

jonath_laurent · on June 22, 2020

Your series of blog articles has been an important source of inspiration in writing AlphaZero.jl and I cite it frequently in the documentation. Thanks to you and your team!

jonath_laurent · on June 22, 2020

Such an evaluation is available in the tutorial: https://jonathan-laurent.github.io/AlphaZero.jl/dev/tutorial...

Admittedly, the connect four agent is still far from perfect but there is a lot of margin for improvement as I have done very little hyperparameters tuning so far.

jonath_laurent · on June 22, 2020

Author here: I am happy to answer any question you may have about AlphaZero.jl. :-)

master_yoda_1 · on June 22, 2020

I am confused about the FAST part, it is faster than all the other implementation (some of them are in c++) or it is just julia implementation and you think it is fast? I am asking because if julia is faster than c++ for ml/dl I would prefer to use it for production use cases.

jonath_laurent · on June 22, 2020

This needs clarification indeed. As I explain in the documentation, the aim of AlphaZero.jl is not to compete with hyper-specialized and hyper-optimized implementations such as LC0 or ELF OpenGO. These implementations are written in C++ with custom CUDA kernels and they are optimized for highly distributed computing environments. They are also very complex and therefore pretty inaccessible to students and researchers.

The philosophy of AlphaZero.jl is to provide an implementation of AlphaZero that is simple enough to be widely accessible for students and researchers, while also being sufficiently powerful and fast to enable meaningful experiments on limited computing resources. It has the simplicity of the many existing python implementations, while being consistently between one and two orders of magnitude faster.

More generally, the AlphaZero algorithm is extremely general and I think it can find applications in many research domains (including automated theorem proving, which is my own research area). I have been surprised to see that, despite the general excitement around AlphaZero, very few people actually tried to build on it. One explanation, I think, is the lack of accessible open-source implementations. I am trying to bridge this gap with AlphaZero.jl.

O_H_E · on June 23, 2020

Always great to see someone who finds something broken, and fixes it for others to move forward.

Thank you.

dklend122 · on June 23, 2020

In addition to the excellent answer by the author below, I'd like to say that Julia can get within spitting distance (or even sometimes exceed) C++ speeds (and even BLAS). So if a comparable amount of work went into optimizing specific paths through generic code (or even the generic code itself), it could be as fast. Also, one can write CUDA kernels in pure Julia.

master_yoda_1 · on June 23, 2020

Looks like I need to start learning julia

vishvananda · on June 22, 2020

Nice work on this! I was behind the implementation at oracle which you referenced in the tutorial. I still keep tabs on the lc0 crowd which seems to be pushing into new ideas. Did you pull anything else from the leela crowd besides prior-temperature? It looks like maybe you also tried a WLD output head as well?

jonath_laurent · on June 22, 2020

What do you mean by WLD output head?

So far, the main idea I have pulled from the Lc0 crowd is to have a prior temperature indeed. The next thing I am planning to add is the possibility to batch inference requests across game simulations instead of relying on asynchronous MCTS. In your blog series, you anticipate the problem of the virtual loss introducing some exploration bias in the search but ultimately concludes that it does not change much:

[Citation from your blog series]: "Technically, virtual loss adds some degree of exploration to game playouts, as it forces move selection down paths that MCTS may not naturally be inclined to visit, but we never measured any detrimental (or beneficial) effect due to its use."

Interestingly, it seems that the LC0 team had a different experience here. I myself ran some tests and going from 32 to 4 workers (for 600 MCTS simulations per turn) on my connect-four agent results in a significant increase in performances. This may be due to the fact that I use a much smaller neural network than yours, which is ultimately not as strong.

Related to this, there is a question I have wanted to ask you since I found your blog article series: did you make experiments with smaller networks and what were the results? What is the smallest architecture you tried and how did it perform?

vishvananda · on June 22, 2020

The lc0 group has switched the result prediction to predict win, loss, and draw probabilities instead of just win/loss. Some information can be found in https://lczero.org/blog/2020/04/wdl-head/

vishvananda · on June 22, 2020

we did a lot of our early experimentation with small networks. I don't think we went any smaller than 5 layers of 64 filters as we mentioned here: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5...

jonath_laurent · on June 22, 2020

And what were the results of these experiments? What error rate can you reach with the smallest network architecture you tried for example?

vishvananda · on June 22, 2020

Unfortunately I don't remember the exact numbers, but I think it was a couple percentage points worse than we were able to get with the large models.

jonath_laurent · on June 22, 2020

This is interesting, thanks! Is there anything else you can tell me about the results of your experiments with small networks? I am really interested in this.

For example: did you notice than increasing or decreasing network size required significant changes in other hyperparameters? Are small networks learning faster at the beginning of training before they start to plateau?

MaxBarraclough · on June 22, 2020

Hi, thanks for this great project.

Connect Four was used as a demonstration. I presume this is because it's much easier/cheaper to train a Connect Four AI, compared to Go?

jonath_laurent · on June 22, 2020

Yes. Go 19x19 would be completely intractable on a single machine (one comment is citing a $25 million cost estimate in computing power to train AlphaGo Zero). A more reasonable target would be Go 9x9 but even this would be an extreme challenge on a single machine.

There is an Oracle blog article series about training a close-to-perfect Connect Four player using AlphaZero. Even here, they had to rely on multiple GPUs.

You have to keep in mind that AlphaZero is an extremely sample-inefficient learning technique, even for simple problems. Rather, the strengths of this algorithm is that 1) it is pretty generic and 2) it can leverage huge amounts of computation.

patagurbon · on June 22, 2020

Do you have any thoughts about multi GPU training? I haven't seen many options for Flux previously, but didn't dig very much.

jonath_laurent · on June 22, 2020

Multiple GPUs support definitely belongs to the TODO list. However, I am currently limited by the state of CUDA.jl on this, as it does not have a device-aware memory pool yet.

I am also looking forward to CUDA.jl supporting f16 and int8 computations, which may enable another big speedup.

doublesCs · on June 22, 2020

No questions. Just wanted to thank you for sharing. People like you make the world better one tiny bit at a time.

jonath_laurent · on June 22, 2020

Thanks for your kind message.

dandanua · on June 23, 2020

This is great! But what are your thoughts about MuZero? :)

jonath_laurent · on June 23, 2020

I think that MuZero is a fascinating algorithm, but that a lot of news articles are misleading when they present it as a new, superior substitute for AlphaZero.

MuZero is solving a harder problem, in which the learning agent does not have a model of the environment from the start (e.g. it does not know the rules of the game a priori). This makes it potentially applicable to a larger number of real-world challenges.

However, I haven't seen any evidence that it is any better than AlphaZero at learning games such as Chess or Go. Although DeepMind reports that their MuZero agent "slightly exceeds the performances of AlphaZero on Go", they say nothing about the training time and tuning effort spent on each.

As far as I understand and in the absence of further data, I think AlphaZero is still the superior choice to solve games with known rules, especially if you don't have DeepMind's level of computing resources.

If anyone knows better about this, I would be happy to be proven wrong though.

FiberBundle · on June 22, 2020

Does anybody know how long it would take to train an alphazero go version using one gpu? In [1] they claim that it took 13 hours until the model was able to beat the original alphago version, but they don't state what hardware they used.

[1] https://deepmind.com/blog/article/alphazero-shedding-new-lig...

newswasboring · on June 22, 2020

From an offline chat with the original author,

The ELF OpenGo paper[1], which is an open implementation of AlphaGo Zero developed by Facebook AI:

"First, we train a superhuman model for ELF OpenGo. Af-ter running our AlphaZero-style training software on 2,000GPUs for 9 days, our 20-block model has achieved super-human performance that is arguably comparable to the 20-block models described in Silver et al. (2017) and Silveret al. (2018)."

[1]: https://arxiv.org/pdf/1902.04522.pdf

arijun · on June 22, 2020

I can’t find it now but iirc there was a blog post on HN about a month ago that estimated their training costs at $25 million, using many TPU pods.

cgreerrun · on June 22, 2020

Here was the guestimation: https://www.yuzeh.com/data/agz-cost.html

jonath_laurent · on June 22, 2020

I agree with the quoted numbers. As I mentioned in another comment, you have to keep in mind that AlphaZero is an extremely sample-inefficient learning technique, even for simple problems. However, it has two major strengths: 1) it is pretty generic and 2) it can leverage huge amounts of computing power.

klipt · on June 23, 2020

What would be an example of a more sample efficient algorithm?

wrsh07 · on June 22, 2020

That was with at least one or more tpu pods, iirc

https://cloud.google.com/tpu/docs/system-architecture

tbenst · on June 22, 2020

First of all this is very cool. Dunno if author is on here, but I’m curious why both Flux and Knet are used rather than just one of them (Flux seems the most Julianic?).

Also, is this really faster than PyTorch/TF? Last time I benchmarked Flux for non-trivial networks, the speed was quite good with small models but memory usage was ~5x higher than pytorch, and I couldn’t fit my models on the GPU for flux. For large models, I had to compromise on batch size in Julia, although maybe with Zygote.jl the memory issues have been resolved?

jonath_laurent · on June 22, 2020

I suspect FLux/Knet are still slightly slower and less memory efficient than PyTorch/TF, although things are moving very fast here!

This is not relevant in understanding AlphaZero.jl speed though. The reason it is much faster than Python implementations is because tree search is also a bottleneck, and Julia shines here!

tbenst · on June 22, 2020

Ah, I hadn’t appreciated this. Thanks for making & sharing your code!

jonath_laurent · on June 22, 2020

Author here. AlphaZero.jl supports both Flux and Knet indeed and users can choose whatever framework they want to use.

As far as I understand, Flux and Knet have different strengths. I think Knet is a bit more stable and mature for large-scale Deep Learning, but Flux shines for "scientific-ML" usecases where low AD overhead is crucial.

ViralBShah · on June 22, 2020

While some may be addressed and others are being addressed, what would really help us if people file issues when they don't find performance to be adequate. If you still have the code handy, please do open some issues.

tbenst · on June 25, 2020

I ran the test 15 months ago using example code from Metalhead vs PyTorch examples repo. Unfortunately my test consisted of staring at nvidia-smi, so don’t have code handy. I believe I benchmarked Resnet.

Edit: I also had an issue with VGG and opened an issue: https://github.com/FluxML/Metalhead.jl/issues/42. Perhaps this has since been resolved

dunefox · on June 22, 2020

> 5x higher than pytorch, and I couldn’t fit my models on the GPU for flux. For large models, I had to compromise on batch size in Julia

I had the exact same experience. While I like Julia and Flux I can't use it in this state for my models.

dklend122 · on June 22, 2020

Would you mind opening corresponding issues on the repo? That would help guide the ongoing compiler work.

metalwhale · on June 22, 2020

Disclaimer: I'm not the author. Just want to share this awesome project.

likeaj6 · on June 22, 2020

This is awesome! I worked on a similar project in the past for the game Hex

Did a writeup here about it: https://notes.jasonljin.com/projects/2018/05/20/Training-Alp...

https://github.com/likeaj6/alphazero-hex

jonath_laurent · on June 22, 2020

Actually, I found your blog article when I was reading about AlphaZero and I found it useful!

mtgp1000 · on June 22, 2020

I don't know anything about Julia...how hard would this be to port to python or a c-style language?

Edit: I was mainly asking because I was curious about the relative expressiveness Julia...

ViralBShah · on June 22, 2020

I was going through this project over the weekend. And while I can't recall where exactly in the docs I read this, I am quite sure the author mentioned that there are various python projects but they are quite slow. Other implementations such as leela chess zero have a lot of C++ and are difficult to follow.

In fact, one of the things we want to do is maximize the performance of the Julia implementation. We hope to co-develop the compiler and ML stack to address these issues as they come up.

newswasboring · on June 22, 2020

I don't know if you saw it here, but a similar point is made in the readme section "Why should I care about this implementation".[1]

https://github.com/jonathan-laurent/AlphaZero.jl#why-should-...

doublesCs · on June 22, 2020

Truly truly thank you for your work <3

newswasboring · on June 22, 2020

Not sure you are thanking jonath_laurent (original author of package of discussion) or ViralBShah (co-creator of Julia). But I concur on both accounts :D

doublesCs · on June 22, 2020

Viral. I thanked Jonath in a different post :-P

When I see projects like this (I mean especially Julia, but also people sharing their work on packages like this) I feel very fortunate that elements of the free software movement are still alive.

instance · on June 22, 2020

There are already implementations out there in Python. [1]

The point of that project is to be a very fast alternative to those implementations while being more accessible than a C++ implementation.

[1] https://github.com/suragnair/alpha-zero-general

adamnemecek · on June 22, 2020

Why would you? Julia shines in this use case.