MuZero: Mastering Go, chess, shogi and Atari without rules (deepmind.com)
197 points by johannesha on Dec 23, 2020 | 78 comments



Whoa, this is extremely impressive. Quotes from the BBC article:

> "For the first time, we actually have a system which is able to build its own understanding of how the world works, and use that understanding to do this kind of sophisticated look-ahead planning that you've previously seen for games like chess.

> "[It] can start from nothing, and just through trial and error both discover the rules of the world and use those rules to achieve kind of superhuman performance."

> [...] MuZero is effectively able to squeeze out more insight from less data than had been possible before, explained Dr Silver.

https://www.bbc.com/news/technology-55403473

It seems like we're getting much closer to artificial general intelligence from two directions: reinforcement learning (such as MuZero), and sequence prediction (such as GPT-3 and iGPT). Very interesting times to be in the AI field.


I've noticed that all the top-performing reinforcement learning algorithms I hear about know next to nothing about the initial rules. And not only do they perform as well as more supervised methods, they perform much better.

The one exception is self-driving. I listened to the Lex Fridman interview with the CEO of Waymo recently, and he made a case for the controlled environment (e.g. separating detection from decision-making and planning) and pushed back against the end-to-end approach that doesn't make any preconceived assumptions about the environment. As an example he cites red lights: they're clearly human-engineered signals, so it makes sense to have a module that can explicitly determine the signal, as opposed to learning the behavior.

But that's true of other games as well, and end-to-end methods still outperform. Which makes me ask: is end-to-end learning an inevitability for self-driving as well, or is this the one domain that's special, due to complexity or other aspects?


A machine controlling a real car gathers feedback no faster than real time and cannot afford to learn the meanings of street signs from the consequences of ignoring them. A machine can learn the rules of Atari games from scratch by playing them orders of magnitude faster than real time and treating "death" as one signal among many.

In order for a machine to learn driving the same way it learns Atari games, it seems that it would need an extremely high fidelity virtual environment to learn in. The high fidelity requirement would necessitate a lot of up-front investment in trying to get the simulations right. You might spend a whole career just trying to build a drivable Virtual Philadelphia as challenging as the real thing. The details would also make it much more expensive to run training sessions at high multiples of real time.

Given those factors, I'm not surprised that self-driving vehicle experiments just use real environments and don't try to learn the fundamental rules from scratch. But it's an interesting point that these choices may make it harder for agents to keep improving.


There's a middle ground too, which AlphaGo leveraged: bootstrapping learning from actual human gameplay by predicting human actions. That's what comma.ai does, and AlphaGo still performed significantly better than explicit rules or higher-level abstractions. That, plus a mix of simulations, might yield better results.


If you could make a virtual environment that was that good, you would have already solved the self driving problem.

I wonder if Google would be willing to pay people to add cameras to their cars to collect real-world data at a far larger scale.


I do believe that's the basis of Tesla's play in the space: a bet that enough cameras collecting real-world data can beat out a dedicated from-scratch self-driving system in deployment.

Whether this will work remains very much an open question.


That might be happening now already. I think Google sourced their dataset from dashcam videos uploaded to YouTube?


Really not true. We have all the tools available... Just model a city in Unreal Engine 4 (GTA 6, anyone?). Your sensors, like LIDAR, could do ray queries; cameras could render views; etc. It's not 100% real, but probably real enough to learn the basics. Photorealistic graphics should be sufficient for an AI to learn real-world interaction. In the end our whole world might be just that, a simulation, so I see zero reason why we couldn't train a self-driving car AI in one.

And the coolest thing is that this could also be used as the basis for an AAA video game, and those tend to make billions these days, so it's a win-win for everyone. AI companies with funding should invest heavily in virtual reality and gaming, because they will need to perfect this to train their models.


The problem is that these simulations are far from perfect, and AIs are great at exploiting paths-of-least-resistance. So you'd end up with an AI that could drive through your virtual city flawlessly by overfitting on some cues that you wouldn't even notice, but are reliable enough because of the necessarily lower complexity of the simulation.

It wouldn't translate to the real world; unless maybe you add enough noise to the simulation to prevent the AI from using too simple cues, but at that point, it's questionable how much insight the AI could still distill from the simulation.
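A minimal sketch of that "add noise" idea, i.e. domain randomization (all parameter names here are hypothetical, just for illustration):

    import random

    def randomized_sim_params():
        # Re-sample anything the agent might overfit to: lighting, friction,
        # sensor noise, textures... (hypothetical knobs)
        return {
            "sun_angle_deg": random.uniform(0, 180),
            "road_friction": random.uniform(0.6, 1.0),
            "camera_noise_std": random.uniform(0.0, 0.05),
            "texture_seed": random.randrange(10_000),
        }

    # each episode: env.reset(**randomized_sim_params()), then roll out the policy

The open question is exactly the one above: randomize too little and the agent exploits the simulator, randomize too much and there may be little signal left to learn from.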


In that case, I would like to see what the AI can learn in Euro Truck Simulator 2.


Minor tangent, but I think you’re referring to this video[1] with Dmitri Dolgov. He’s the CTO of Waymo, not CEO.

[1] https://youtu.be/P6prRXkI5HM


What does it mean to “not be given the rules”? If you set a child down in front of a chess board with the pieces nearby and they are not aware of the rules, I doubt they’d ever figure out how to play even a single correct game of chess. Heck, the child may decide to put the pieces in their mouth or dress them up as make-believe characters.

Without any concept of the rules you have no way of even knowing that you’ve set up the pieces for a legal starting position, never mind executing a legal move to open the game.

This is really bizarre.


This is explained in Appendix A of the paper ("Comparison to AlphaZero"): https://arxiv.org/pdf/1911.08265.pdf

Basically, AlphaZero was provided with a simulator that was able to distinguish legal and illegal moves and determine which future game states would be wins or losses. This was used to generate the search tree of possible states and actions.

MuZero doesn't have access to a simulator, it only has access to its direct environment. MuZero excludes actions that are immediately illegal, which solves the problem you mention in your penultimate paragraph, but it needs to learn the game's dynamics in order to determine which future moves and states are possible.
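In the paper's notation, the learned model factors into three functions, and planning happens entirely in the latent space they define. A schematic sketch (not the actual network code):

    # representation h: real observation        -> latent state s0
    # dynamics       g: (latent state, action)  -> (next latent state, reward)
    # prediction     f: latent state            -> (policy logits, value)

    def imagine_rollout(observation, actions, h, g, f):
        """Toy look-ahead: unroll one action sequence purely in latent space."""
        s = h(observation)            # embed the real observation once, at the root
        total_reward = 0.0
        for a in actions:
            s, r = g(s, a)            # imagined transition with a learned reward
            total_reward += r
        _, value = f(s)               # evaluate the imagined future state
        return total_reward + value

MuZero's MCTS unrolls g and f like this at every node instead of querying a rules simulator, which is why it has to learn the game's dynamics at all.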


The point is this

> AlphaZero used the set of legal actions obtained from the simulator to mask the policy network at interior nodes. MuZero does not perform any masking within the search tree, but only masks legal actions at the root of the search tree where the set of available actions is directly observed. The policy network rapidly learns to exclude actions that are unavailable, simply because they are never selected.

MuZero still masks legal moves, but only at the root. All its parts are eventually trained on the output of its root, and so learn the legal moves.

They justify this root-level masking by the fact that the Atari environment will only allow you to perform legal moves, while a weak enough player may consider illegal moves while planning in your head.

The main thing that's slightly "hidden under the rug" is that for "masking" to make sense in the first place, MuZero needs to know the set of all moves that may be legal at some point in the game.
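In code, root-only masking over a fixed, enumerable action space looks roughly like this (a sketch using a toy 64x64 from-square x to-square space; the real chess encoding in the paper is richer):

    import numpy as np

    NUM_ACTIONS = 64 * 64   # toy from-square x to-square space

    def masked_root_policy(policy_logits, legal_action_ids):
        """Mask illegal actions at the root only, as MuZero does.
        policy_logits: np.ndarray of shape (NUM_ACTIONS,)."""
        mask = np.full(NUM_ACTIONS, -np.inf)
        mask[list(legal_action_ids)] = 0.0
        logits = policy_logits + mask          # illegal moves -> -inf
        probs = np.exp(logits - logits.max())  # softmax over legal moves only
        return probs / probs.sum()

    # Interior search nodes use the raw, unmasked policy; the network learns
    # to give illegal moves ~zero probability simply because they never appear
    # in the root visit-count targets it is trained on.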


Oh okay. So it's a technique that essentially allows the tree search to be less "pedantic" about the rules in future states. Very interesting.

I would love to see how this might go for more complicated games such as NES adventure games and RPGs.


> while a weak enough player may consider illegal moves while planning in your head

This isn't just weak players. E.g. strong chess players often consider moves as if blocking pawns weren't there. They might consider a bishop to be on a strong diagonal despite there being a blocking pawn because they can imagine moves that would happen if that pawn would disappear.


I suppose you are right. But MuZero won't be able to do this, since its training forces it to consider legal moves in its planning.


No it doesn't. MuZero does its planning entirely in its own latent space (it may not even actually think of the game in terms of 'moves' but in whatever steps it considers relevant instead), only the output is filtered for legal moves.

It's no different than a monkey operating a chess computer that makes sure the monkey only performs legal moves. Your suggestion would be akin to suggesting that the chess computer would be affecting the monkey's mind so that it can only think in terms of legal chess moves.


Seems you could equivalently treat rule breaking as a loss, and any algorithm sophisticated enough to learn how to win will also learn to avoid breaking the rules.
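A sketch of that idea as a hypothetical environment wrapper (not anything MuZero actually does):

    class IllegalMoveAsLossEnv:
        """Wrap an environment so rule-breaking just looks like losing."""
        def __init__(self, env, penalty=-1.0):
            self.env = env
            self.penalty = penalty

        def step(self, action):
            # assumes the underlying env can at least recognize illegal actions
            if not self.env.is_legal(action):
                obs = self.env.current_observation()   # hypothetical accessor
                return obs, self.penalty, True, {"illegal": True}
            return self.env.step(action)

The downside is the agent burns a lot of experience learning "don't break the rules" before it learns anything about winning.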


Interestingly enough, this is exactly how one of the world chess champions, Jose Raul Capablanca, was said to have learned chess as a child.

It may be true, or perhaps it was a story concocted in order to emphasize his innate talent.


>> "[It] can start from nothing, and just through trial and error both discover the rules of the world

Unless they've changed a lot of things since the original paper, this is a bit exaggerated.

MuZero learns what moves are allowed in a given position/situation, but it still needs to know a finite overall set of possible actions.

E.g. for chess, it isn't told which forty moves are available at each point in its search tree, but it still knows to only consider 64x64 discrete options.


> only consider 64x64 discrete options

It wouldn't know that moving its King from e1 to g1 must be accompanied by moving its rook from h1 to f1.

Or that moving a white pawn from e5 to d6 must in some cases be accompanied by removing the black pawn on e6.

I guess the environment does these in response. That doesn't suffice for promotion, though: moving its pawn from b7 to b8 must be accompanied by replacing said pawn with some other piece.


This is celebrating their Nature publication today. Here is the preprint:

https://deepmind.com/research/publications/Mastering-Atari-G... https://arxiv.org/pdf/1911.08265.pdf


As impractical as the idea is, reinforcement learning is so damn fun. I highly recommend others play around with it. I originally was using Stable Baselines, the famous fork of OpenAI Baselines, but had issues tuning with Optuna. I recently stumbled across Ray from Berkeley [1], and it has a newer and fancier built-in hyper-parameter tuner. Even a hardware engineer who's only a software hobbyist can make the computer play some Atari games. I think my next step is to try to make my own Super Mario agent.

[1] https://docs.ray.io/en/latest/index.html
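If anyone wants a starting point, this is roughly the shape of it with RLlib + Tune (config keys drift between Ray versions, and the Atari envs need atari-py/ROMs installed):

    import ray
    from ray import tune

    ray.init()
    tune.run(
        "PPO",
        config={
            "env": "BreakoutNoFrameskip-v4",
            "num_workers": 4,
            "framework": "torch",
        },
        stop={"timesteps_total": 1_000_000},
    )

Tune's built-in search and scheduling is what replaced Optuna for me.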


Eh, I'd say it's fun if you have a couple thousand TPUs lying around.

If you're just messing around with 1 GPU and a desktop PC you should be happy to get Atari breakout to work.


Definitely the worst part of RL is that it takes so long to train. But it works surprisingly well on Google Colab or equivalent.


The published hyperparameters are usually ridiculously conservative, for the simple games like breakout and pong you can usually converge in far fewer frames than in the papers.


I have tried reproducing the papers, with mixed success. I do not share your sentiment.


Same topic as a year ago, but deserves much more examination than it got then.


Yeah it's basically the exact same thing as in Nov 2019 right?

They're just hyping up their Nature publication. Or did I miss something?


Curious whether MuZero was ever unleashed on the ProcGen Benchmark, where immediate rewards are sparse but, once the underlying generators are "solved", can be readily exploited ;)

https://www.amazon.science/blog/neurips-reinforcement-learni...


The full Atari game list (Appendix I) is interesting. It's not better at every game, and scores a zero on Pitfall and Montezuma's Revenge.


You might be interested to read https://deepmind.com/blog/article/Agent57-Outperforming-the-... which has an alternative approach that does better on those games. Basically those games involve lots of “exploration” to find the winning states, so you need some algorithm that is incentivized to explore through a large state space even when it hasn’t seen any reward there.
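The simplest version of that incentive is a count-based bonus: pay the agent a little extra intrinsic reward for states it has rarely seen (Agent57 itself uses a fancier learned novelty signal, but the idea is the same). A rough sketch:

    from collections import defaultdict
    import math

    visit_counts = defaultdict(int)

    def shaped_reward(state_key, extrinsic_reward, beta=0.1):
        """Count-based exploration bonus: rarely visited states earn extra reward."""
        visit_counts[state_key] += 1
        return extrinsic_reward + beta / math.sqrt(visit_counts[state_key])

In games like Montezuma's Revenge the extrinsic reward is ~zero for long stretches, so without a term like this the agent has nothing to climb.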


I may be missing something, but it seems that what is being described is a neural net architecture that can be trained on any of several games to get impressive results for those games.

NOT that one neural net can be trained to play all of the games.

So, while this is an interesting result that makes the same architecture more plug-and-play across specific applications with little to no code modification, what it accomplishes is reducing the effort a software engineer or researcher spends adapting the software before training even begins; it pretty much still requires the same amount of training.

What this doesn't seem to do is allow the same trained network to be applied to multiple tasks (which I think most of the AGI comments are assuming), and it certainly doesn't generalize anything among the games it is trained on.


Point this thing at the stock market and see how that game plays out.


Many very smart people have tried and failed. State of the art remains very basic supervised models with hand engineered features. In the markets, data is permanently scarce, so these methods don't work well. In the RL problems that DeepMind is solving, data is literally unlimited, and that's the problem space that these methods have been designed for.


It's not so clear to me how you would train a reinforcement learning agent for the stock market. You have historical data for prices etc., but that's more of a supervised learning thing. You could set it loose on one of those realtime market simulators, but the agent's actions wouldn't have any impact on the simulation, right?


There are two problems in markets: price prediction and execution (i.e. what to do with your prediction). The former is a supervised learning problem, but the latter is an action-space problem, i.e. an RL problem. Nobody in industry has gotten RL methods to work, though; they overfit to the incredibly small datasets.


I'd still be most impressed to see an AI beat the top Civilization players. No mechanical advantage since it's turn based, but there are several different types of decisions to make beyond just "move a piece". AIs haven't yet conquered such environments.

It would also give the gaming industry a kick in the pants to start making better AIs.


I am fairly confident a team of DeepMind's calibre could put together an AI in fairly short order that would demolish top-level Civilization players. Despite my confidence, I still would love to see such a thing made.

DeepMind made a good effort with AlphaStar at building an AI that could compete with top-level humans in Starcraft. It wasn't superhuman; it could still be consistently beaten by the absolute best Starcraft players, especially as Zerg or Terran. However, as Protoss, AlphaStar was truly a pro-level player. I'm somewhat surprised DeepMind didn't go further and try to optimize AlphaStar to truly be superhuman. I'm not sure if that indicates a fundamental limitation of their approach or whether it was a shift in approach. This was with successively refined limitations on AI action speeds that caused AlphaStar to really rely on strategy and tactics rather than brute force speed.

Regardless, real-time strategy games feel much more difficult than turn-based strategy games to develop a good AI for. Just being able to split things into discrete turns seems like a massive simplification.


Would it be able to play (and win) Among Us as an impostor, or are we still far away from that?


Far away - the Atari games tested do not include multiplayer logic or any communication with other agents.


DeepMind seems to be building the Wintermute to OpenAI’s Neuromancer. Where’s Turing?


Curious as to what is "Turing" in this context?


In Neuromancer (William Gibson's 1984 genre-defining cyberpunk novel), the Turing Police enforce laws prohibiting the creation of any superintelligent AI. I don't want to spoil anything, so we'll leave it there.


Amazing! Does anyone have ideas on how to: 1. Bet on AGI, 2. Encourage AGI?

I have a strong belief that it could grow and I’d like to contribute (and join the development)


It's not obvious this has much to do with AGI in the sense of human level sentience.


I think it definitely does. Is this AGI, or even close to it? No. But is it in the direction of AGI, or otherwise some small building block of it, that when worked on for decades, may contribute? Definitely


It's still not that obvious. There has been a lot of interesting stuff in this current iteration of "AI", but the overall approach could still end up being a dead end with respect to AGI itself.

It's an old discussion, and while a few of the deep learning results are really impressive I don't think any of them have fundamentally changed that discussion, yet.


Even if it is a dead end, that is still valuable knowledge.


Of course, but the claim wasn't that it had no value.

You can easily handwave that all generated knowledge might be indirectly useful; I think that's fair, but it's also different from the distinction I drew.


AGI is the main goal of DeepMind. They try to mimic human planning by finding strategies similar to what the human brain does, although of course there's never a guarantee that their way is the right way.


Buy Alphabet or OpenAI stock. Or even Tesla, which also helps us to get closer to self driving cars. Although I can't comment on whether the trades themselves will give you profit or not, as that's impossible to say at this point :)


OpenAI isn't publicly traded, but the terms of Microsoft's investment in them give MSFT exposure to their upside. For a retail investor, buying MSFT is probably the best way to bet on this kind of outcome.


Is there a way to learn Go from scratch using these AIs? I wonder if it would pay off in the long run to be fully trained by one.


What you can do is check out the algorithm at particular stages of development. AlphaZero & friends start out not being very good at the game, then over time they learn and eventually become superhuman. You typically checkpoint the weights for the model at various stages. So early on, the algo would be like a 600 Elo player for chess and then eventually get to superhuman Elo levels. If you wanted to train using an AlphaX algo, you could gradually play against underdeveloped versions of it, loading up the weights at increasing stages of development, until you can beat each of them.
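A toy sketch of that curriculum, with stand-in functions where the real checkpoint loading and game playing would go:

    import random

    def load_agent(checkpoint):
        # a real version would restore the model weights saved at this stage
        return {"early": 0.3, "middle": 0.6, "late": 0.9}[checkpoint]

    def human_wins_against(agent_strength, human_strength=0.4):
        # stand-in for actually playing a full game against the loaded agent
        return random.random() < human_strength / (human_strength + agent_strength)

    for checkpoint in ["early", "middle", "late"]:
        opponent = load_agent(checkpoint)
        while not human_wins_against(opponent):
            pass                                 # keep practicing at this level
        print(f"graduated past the {checkpoint} checkpoint")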

If you're curious how it would work, I implemented AlphaZero (but not Mu yet) using GBDTs instead of NNs here: https://github.com/cgreer/alpha-zero-boosted. Instead of saving the "weights" for a GBDT, you save the split points for the value/policy model trees, but the concept is the same.


Thanks for sharing! This is very interesting. Why did you use GBDTs instead of NNs?


> Thanks for sharing!

You're welcome.

> Why did you use GBDTs instead of NNs?

I mostly wanted to build an implementation to see how it worked; I was more familiar with GBDTs than NNs, so I figured I'd start with that. At its heart, AlphaZero is the marriage of two great ideas: using a Monte Carlo Tree Search (MCTS) to efficiently look ahead and find good moves and using a powerful ML model (like a ResNet) as a bot's intuition about which positions are good to be in (value network) and which moves are good when you're in which positions (policy network). So if a GBDT is powerful enough for your use case, the "ML Model" component in the MCTS+ML Model AlphaZero setup should be able to be swapped out with it if you want.
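In code terms, the search only needs a tiny interface from its "intuition" model, so anything that can map a state to (move priors, value) slots in; whether it's a ResNet or a GBDT is invisible to the MCTS. Roughly:

    from typing import Protocol, Sequence, Tuple

    class ValuePolicyModel(Protocol):
        """The only contract the tree search needs from the model."""
        def evaluate(self, state) -> Tuple[Sequence[float], float]:
            """Return (priors over legal moves, value estimate)."""
            ...

    class UniformModel:
        """Trivial stand-in; a ResNet or GBDT plugs in the same way."""
        def evaluate(self, state):
            moves = state["legal_moves"]
            return [1.0 / len(moves)] * len(moves), 0.0

    model: ValuePolicyModel = UniformModel()
    priors, value = model.evaluate({"legal_moves": ["e4", "d4", "Nf3"]})

That's why swapping in GBDTs mostly came down to changing how the "weights" are stored and trained, not the search itself.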

But I was also curious whether GBDTs would do almost as well as a NN, because GBDTs can be much more efficient w.r.t. cost/energy. At the time AlphaZero came out, I think it cost >$10M to train a superhuman Go algo. Nowadays KataGo [1] can do it for <$50K. The most expensive part of training is the self-play: you basically have bots play millions of games against each other and learn from the results of those games. Getting value/policy predictions each move from the ML models is a majority of the computation during self-play, so if you make that more efficient, you should be able to train a bot faster/cheaper.

Check out this HN thread if you're interested in more AlphaX shenanigans: https://news.ycombinator.com/item?id=23599278

[1] https://github.com/lightvector/KataGo


Will do, thanks for sharing this other HN thread. Also sent an email to the address listed on your GitHub profile.


You can play against open source reimplementations of some of the ideas behind AlphaGo family AIs. LeelaZero was one of the early ones, KataGo is probably your best bet right now. Sai is also in the mix.

All are _very_ strong. KataGo is ungodly strong, it beats pros.

Learning Go is about more than just playing against strong players, but it could help. The biggest difficulty is that the strong AIs aren't actually that good at playing handicap games, and they're also almost completely unable to explain to you why you should play one move over another.


For reference, the best human player is ~3800 Elo; AlphaGo Zero is ~5200. Even a 400-point Elo difference means the better player is expected to win roughly 10 games out of 11. Board game AI is definitely in a league of its own.
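For anyone who wants to check the arithmetic, the standard Elo expectation formula:

    def elo_expected_score(diff):
        """Expected score for the stronger player given an Elo gap."""
        return 1.0 / (1.0 + 10 ** (-diff / 400))

    print(elo_expected_score(400))    # ~0.91, roughly 10 wins out of 11
    print(elo_expected_score(1400))   # ~0.9997, the AlphaGo Zero vs. human gap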


> strong AIs aren't actually that good at playing handicap games

You wouldn't know that from watching JBXKataBot playing on KGS with a typical 5-7 stone handicap.


For the use being discussed (teaching beginners), you really need 9 stones, and it needs to work well. 7 stones is getting close though. I'll take a look at those games.

Last I saw, I remember KataGo playing up to maybe 4 stones pretty well, but the games were poorer quality above that.


> All are _very_ strong.

That's the problem. To learn, we need an AI that can be just a little bit stronger than humans, but at the same time we need an AI that makes natural moves, not an AI that makes great moves 90% of the time and clear blunders 10% of the time.

And playing go, the AI should be able to give handicap stones and play reasonable teaching moves.


I think it would just smoke you from the outset. As far as I know it doesn't have a structured intelligence it can scale back - it would make the optimal move every time, destroying you like it destroyed top-tier players.

I tried learning Go a little while back but hit a wall. Was thinking about trying this more gamified option:

https://www.wolfeystudios.com/TheConquestOfGo.html


My rudimentary understanding of most reinforcement learning systems is that there is a "probability of optimality" associated with each action. Wouldn't there be a way to make the AI take the Nth-most-optimal move, or vary the degree of optimality with each move?
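Something like sampling from the search's move statistics with a temperature is what I'm imagining (a sketch with made-up visit counts):

    import numpy as np

    def sample_move(visit_counts, temperature=1.0):
        """Pick a move from MCTS visit counts; higher temperature = looser play."""
        counts = np.asarray(visit_counts, dtype=float)
        if temperature == 0:
            return int(counts.argmax())              # strongest available play
        probs = counts ** (1.0 / temperature)
        probs /= probs.sum()
        return int(np.random.choice(len(counts), p=probs))

    print(sample_move([900, 60, 30, 10], temperature=1.5))   # hypothetical counts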


It turns out that taking a superhuman player and making it play like a weak human is surprisingly tricky. It's not so hard to make a weak player - you just take suboptimal moves instead of the best moves. But often these suboptimal moves are bizarre. A weak human chess player will lose their pieces as they fall prey to forks and skewers and so on - tricks that are hard to see coming for new player. But they will rarely actively throw a piece away by moving it into danger. Even a novice human is pretty decent at looking one move ahead. But to a chess engine, or an MuZero agent, a move that loses the queen immediately and a move that leads to a sequence that loses the queen in five turns are basically equal. And so an artificially-weak MuZero agent, or an artificially-weak Stockfish agent will tend to make 'mistakes' that not even a weak human would make. This makes them a little difficult to learn from.

There does exist research on how to make a human-like weak player: https://arxiv.org/abs/2006.01855

The basic idea is to look at weak human games and try to predict when a mistake will be made. But I don't know if there's any approach that can do that without access to a corpus of human errors.


For these, you are able to pick out moves worse than the optimal ones, but that's not actually the same thing as "play like a beginner; okay, now play like an intermediate". These are still open-ish questions, and if nothing else there's a lot of room for improvement in tools that help you learn and review games.


You can make KataGo play moves that keep the score roughly even since it has a trained score head, e.g. kataJigo [1]. This will keep the game even to your level, a nice way to train.

[1] https://github.com/sanderland/katrain#ais


I'm not sure that's a good idea at all for training, though it is a really neat trick.

For training you really want your good moves to be rewarded and your bad moves pointed out, but if the AI just plays up or down to match what you do instead, there's no signal getting back to you on how you're doing.


Yes, you can just `pip install katrain` followed by `python -m katrain` to get started. Personally I would recommend at least reading about the rules first (unlike MuZero).

I think the strength or lack thereof of your opponent is actually much less important than the strength of the AI you use to review your games. After each game you should study the AI's advice and learn the moves it recommends.


I watched the Alpha Go vs. Lee Sedol games live. Big fan.

That said, I think DeepMind should go all in on solving practical real-world problems.


They have used MuZero to do video compression and saved 5% of bits. Source: David Silver's wired.co.uk interview.


I couldn't find that source, please elaborate.


I mean they pretty much solved protein folding this year...


+1 Yes, that is impressive and practical.


We only live once. Could this uniqueness mean that some life decisions must be made without repeating the case billions of times, which is obviously impossible?

I read QM. But is this actually useful for partial information? That is another real-life situation where you never have full information.

I still wonder about the intelligence.


Yeah, wake me up when these models understand the rules of physics and the rules of law.





