The state space here looks pretty small; it seems to me that with so much training it's just a case of brute-force search. When I think of "tool use" with regard to the intelligence of early humans, I imagine something more like [0], where the state space is enormous and it takes a good deal of reasoning and planning to get to a desired result.
It's unclear to me that we navigated the state space so discretely. My guess would be that we did both rock throwing and stick hitting before eventually deciding that combining the two might be fruitful.
After the idea is polished, it looks clever, but it may have been invented through a series of mostly random steps.
Amazing. Very cool to see this sort of emergent behavior.
I also very much enjoyed this section:
"We propose using a suite of domain-specific intelligence tests that target capabilities we believe agents may eventually acquire. Transfer performance in these settings can act as a quantitative measure of representation quality or skill, and we compare against pretraining with count-based exploration as well as a trained from scratch baseline."
Along with the videos, I can't help but get a very 'Portal' vibe from it all.
"Thank you for helping us help you help us all." - GLaDOS
Looks awesome. I tried coding up a multi-agent system for my CS degree and it was incredibly complicated. I was trying to implement an algorithm I found to give each agent emotions of fear, anger, happiness and sadness in order to change their behaviours... it was way more difficult than I expected but you can read more about it here if you're also interested in this stuff. The 3D graphics in this example are way cooler than my 2D shapes.
The animations are nice, compared to a default visualization with dots and lines moving around. Was this done just for the public release, or was it worth it to researchers to have an eye-pleasing visualization while doing the experiments?
The environment was actually an important part of the project. It does physics simulation. Having such a 'realistic' environment allowed the agents to discover all sorts of cheats (they appear at the end of the article).
Right, but to fully understand what's going on you need to also visualize the physics in a 3D world; just dots, lines and squares wouldn't fully show what's happening. This may be close to the simplest visualization that made sense.
An accurate 3D visualization could have been a lot simpler than this. The actors are most likely modeled physically as simple cylinders; all the character animation is extraneous. And there are plenty of subtle extra effects: the seekers' vision cones, the reflective floor, the uneven box landscape outside, etc.
So, as tlb said, I'm curious if all of that was added for the public release, or if the researchers set it up while running the experiments. It seems like it would be fun.
The visualizations look great, but wouldn't run on an N64, which had many physics games. I'm wondering the same thing as the OP--was this advanced level of graphics used during the research, or was the styling added after the fact for readers? A low res visualization seems like it would do the job equally well, but maybe not. Curious what they are finding and whether there are benefits to having a great looking visualization during the EDA phase.
Researchers have much better graphics tools available to them today than they did in the N64 era. Basic familiarity with e.g. Unity would be enough to run these sorts of simulations.
One plausible, perhaps optimal strategy in the second arena is for the hiders to build a shelter around the seekers and lock them in place, circumventing the whole cat-and-mouse over ramps and ramp surfing (the seekers would never be able to reach the ramps in the first place). I wonder why this strategy is never arrived at.
That's always a good question! One thing to remember is that in RL you are searching through very large solution spaces. You probably aren't going to find a global optimum (if one even exists!). What will happen is that a local optimum is found, just one that _works_. This is why having a feature-rich space is important, because it helps you escape the locality, but also remember that we don't even know what the solution space looks like or what an optimal solution is.
It is also entirely possible that you retrain something from scratch and find a different local optimum. Self-play can help with this, as can multi-agent setups, but we're still not guaranteed to find every solution, nor the solutions that appear obvious to us. RL just tries things (often randomly) until they start working, then biases towards what worked.
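To make that "try things, keep what works" loop concrete, here is the textbook version in its most stripped-down tabular, epsilon-greedy form. To be clear, this is just an illustration of the idea; the hide-and-seek agents themselves are trained with PPO, neural networks and self-play, and the state/action names below are made up:

```python
import random
from collections import defaultdict

# Toy tabular Q-learning with epsilon-greedy exploration. This is NOT what
# the hide-and-seek agents use (they're trained with PPO and neural nets);
# it's just the simplest illustration of "try random things, keep what works".

ACTIONS = ["forward", "back", "grab", "lock"]
q = defaultdict(float)                 # (state, action) -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def choose_action(state):
    if random.random() < epsilon:      # explore: try something at random
        return random.choice(ACTIONS)
    # exploit: bias towards whatever has worked so far
    return max(ACTIONS, key=lambda a: q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    # Nudge the estimate towards the observed reward plus discounted future value.
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
```

Whether a loop like this ever stumbles onto something like "wall the seekers in before they unfreeze" depends on whether exploration happens to reach that region of the solution space before a merely-decent local optimum hardens.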
At one point in the video, it looked like a hider moving an object past a frozen seeker jostled the seeker along with it. I wonder if it's possible to use the objects to push seekers together, then "jail" them.
This is incredible. The various emergent behaviors are fascinating. I remember being amazed a decade ago by the primitive graphics in artificial life simulators like Polyworld:
It seems that OpenAI has a great little game simulated for their agents to play in. The next step to make this even cooler would be to use physical, robotic agents learning to overcome challenges in real meatspace!
I'm doing something like this as a hobby but only single agent. The input is camera images and reward is based on a stopped/moving flag determined by changes between successive images as well as favoring going forward over turning. So far, it can learn to avoid crashing into walls, which is about all I'd expect. Trying to find good automated rewards without building too much special hardware is difficult. It's a vanilla DQN.
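In rough pseudocode, the reward logic is along these lines (a simplified sketch; the threshold and weights here are illustrative rather than the exact values I use):

```python
import numpy as np

# Rough sketch of the reward logic described above: a stopped/moving flag from
# the change between successive camera frames, plus a preference for going
# forward over turning. The threshold and weights are illustrative only.

def frame_change(prev_frame: np.ndarray, curr_frame: np.ndarray) -> float:
    """Mean absolute pixel difference between two successive grayscale frames."""
    return float(np.mean(np.abs(curr_frame.astype(np.float32)
                                - prev_frame.astype(np.float32))))

def reward(prev_frame, curr_frame, action: str, moving_threshold: float = 2.0) -> float:
    if frame_change(prev_frame, curr_frame) < moving_threshold:
        return -1.0    # "stopped" flag: most likely crashed into a wall
    return 1.0 if action == "forward" else 0.1   # moving, but prefer forward over turning
```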
Hmm, yes: in the story I'm envisioning, the AIs don't wipe out humanity because they have achieved sentience, but just because it turns out that killing all humans is an optimal step in solving some other problem.
I think we humans have already solved this problem you describe... we call them laws. We use these laws to prevent people from doing bad things, and I see no reason why they can't be described to an AI to drive its behavior towards one that isn't going to end humanity.
I think you're misunderstanding the problem. Expressing complex rules in a machine-readable format is the least of the issues here. The main problem is that training AIs to optimize certain behaviors within constraints very frequently leads to them accidentally discovering "loopholes" that would never have occurred to a human (as with "box surfing" here). The AI doesn't know it's "cheating"; its behavior may be emergently complex, but its model of our desires is only what we tell it.
A naive and unlikely example would be telling an AI to maximize human happiness and prevent human harm, so it immobilizes everyone and sticks wires into their pleasure centers. Everyone is as happy as it is possible for a human to be, and no one is doing anything remotely dangerous!
The actual dangers will be stranger and harder to predict. I'm not saying we can't find a way to make strong AI safe. I'm just saying that it's a much trickier task than you imply.
I've always been very skeptical that it would be possible to take something sufficiently complex to be considered an AGI and hard-code anything like the 3 rules into it.
By the same token, I’m extremely suspicious of the idea that such a sufficiently complex AGI could also be dumb enough to optimize for paper clip production at the expense of all life on earth (or w/e example).
...and many would say that’s because us humans are bad at imagining optimizing agents without anthropomorphizing them. This is a reasonable, even typical suspicion that many people share! The best explanation I know of why it’s unfortunately wrong is by Robert Miles in a video, but if you prefer a more thorough treatment, you could also read about “instrumental convergence” directly. If you find a flaw in this idea, I’d be interested to hear about it! :)
Sorry, just saw this. I think it’s his assumption that an AGI will act strictly as an agent that’s flawed. It requires imagining an agent that can make inferences from context, evaluate new and unfamiliar information, form original plans, execute them with all the complexity implied by interaction with the real world, reprogram itself, essentially do anything... except evaluate its own terminal goal. That’s written in stone, gotta make more paperclips. The argument assumes almost unlimited power and potential on the one hand, and bizarre, arbitrary constraints on the other.
If you assume an AGI is incapable of asking “why” about its terminal goal, you have to assume it’s incapable of asking “why” in any context. Miles’ AGI has no power of metacognition, but is still somehow able to reprogram itself. This really isn’t compatible with “general intelligence” or the powers that get ascribed to imaginary AGIs.
I’m certainly no expert, but I expect there will turn out to be something like the idea of Turing-completeness for AI. Just like any general computing machine is a computer, any true AGI will be sapient. You can’t just arbitrarily pluck a part out, like “it can’t reason about its objective”, and expect it to still function as an AGI, just like you can’t say “it’s Turing complete, except it can’t do any kind of conditional branching.” EDIT better example: “it’s Turing complete, but it can’t do bubble sort.”
This intuition may be wrong, but it’s just as much an assumption as Miles’ argument.
I’m also not ascribing morality to it: we have our share of psychopaths, and intelligence doesn’t imply empathy. AGI may very well be dangerous, just probably not the “mindlessly make paperclips” kind.
This is visually very impressive, of course, but what is the significance of this work? I am not very familiar with intelligent-agents research, so I don't understand to what extent learning cooperative tool use in an adversarial environment (if I understand correctly what is shown) represents an important advancement of the state of the art, or not.
In any case this is a simulation, so it's basically impossible to take the learned model and use it immediately in a real-world environment with true physics and arbitrary elements, let alone with unrestricted dimensions (the agents in the article are for the most part restricted to a limited play area). So if I understand this correctly, the trained model is only good for the specific simulated environment and would not work as well under even slightly different conditions.
I love how the 3D visualization and game selection make their research immediately relatable - right down to the cute little avatars!
"We’ve shown that agents can learn sophisticated tool use in a high fidelity physics simulator"
I always suspected to evolve intelligence you need an environment rich in complexity. Intelligence we're familiar with (e.g. humans) evolved in a primordial soup packed with possibilities and building blocks (e.g. elaborate rules of physics, amino acids, etc). It's great to see this concept being explored.
It reminds me of Adrian Thompson's experiments in the 90's running generational genetic algorithms on a real FPGA instead of mere simulations [1].
After 5000 generations he coaxed out a perfect tone recognizer. He was able to prune 70% of the circuit (lingering remnants of earlier mutations?) to find it still worked with only 32 gates - an unimaginable feat! Engineers were baffled when they reverse-engineered what remained: if I recall correctly, transistors were run outside of saturation mode, and EM effects were being exploited between adjacent components. In short, the system took a bunch of components designed for digital logic but optimized them using the full range of analog quirks they exhibited.
More recent attempts to recreate his work have reportedly been hampered by modern FPGAs, which make it harder to exploit those effects since they don't allow reconfiguration at the raw wiring level [2].
In Thompson's own words:
"Evolution has been free to explore the full repertoire of behaviours available from the silicon resources provided, even being able to exploit the subtle interactions between adjacent components that are not directly connected.... A 'primordial soup' of reconfigurable electronic components has been manipulated according to the overall behavior it exhibits"
Thanks! I like that section on how batch size affects convergence. I wonder how parameter size limits would similarly affect which Stages could be reached. I would not be surprised if you could hit those stages with 100x or fewer params.
You mean what is actually stored in "memory" to play those actions? It's usually the trained model, which is a network of many layers, each containing many nodes connected to the next layer. Depending on the model size, it can take anywhere from a few MB to tens or hundreds of GB, but usually the smaller the better (a model that is too large tends to over-fit, meaning it has enough capacity to memorize the strategies needed for the specific training environment rather than learning ones that generalize to similar problems).
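As a rough sense of scale, you can estimate the footprint from the layer sizes alone. The sizes below are invented for illustration (a small fully-connected policy), not the actual hide-and-seek network:

```python
# Back-of-the-envelope: memory footprint of a small fully-connected policy.
# The layer sizes are invented for illustration, not the real hide-and-seek policy.

layer_sizes = [112, 256, 256, 256, 5]          # observations -> hidden -> actions

params = sum(n_in * n_out + n_out              # weights + biases per layer
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(f"{params:,} parameters, about {params * 4 / 1e6:.2f} MB at 4 bytes (fp32) each")
# -> ~160k parameters, well under a megabyte; the "tens or hundreds of GB" end of
#    the range is for models with billions of parameters.
```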
Yup, this is what I’m talking about. I’ve been following Google's WANNs and other methods for reducing the size of these “strategies” drastically. In the WANN paper they have an example of going from about 2,500 params to 40. I’ve got my own hunch as to where this is going, but I’m wondering if OpenAI is studying this at all.
Just as there are simple primitives in civil engineering, like the lever and the pulley, I expect we'll find a zoo of tiny primitives that are the things these agents are learning.
Instead of teaching the "AI" intelligent rules, or rules for creating rules to maximise its goals, they teach it nothing, which means it has zero usable high-level knowledge.
The "AI" then pure brute-forces its way to the empirically best solutions for this ridiculously simple universe.
How is that advancing research?
This is just a showcase of what modern hardware can do, and also a showcase of how far we are from teaching intelligence.
My brain understands the semantics of this universe and would have been able to find most of the strategies without simulating the game more than once in my head.
So this is really a showcase of how far we (or at least OpenAI) are from making AGI; brute force is more like step 0.
Some AI researchers believe that using learning methods with no built-in prior knowledge and throwing a bunch of compute at them is the path to building effective AI. I'm thinking of Richard Sutton in particular:
I personally don't agree with his emphasis on model-free learning, but it's not the case that people are building model-free RL agents because they don't understand the trade-off that they're making.
How do you know your own brain isn't running thousands of parallel simulations in your head, even though you perceive it only once? How did your brain learn to reason about physics in the first place if not by repeatedly finding objects in your environment and randomly manipulating them?
I wonder if it’s possible to incorporate a monkey-see-monkey-do aspect into the learning algorithm, so that it could observe humans playing the game and incorporate that information into its models?
Yes, it's called imitation learning and it's a subfield of reinforcement learning. The problem is that even small errors gradually accumulate and cause the sequence of actions to drift away from anything seen in the demonstrations. RL agents learn not just how to act in a given situation but also how to evaluate possible actions and situations, and even to model the environment, so they can adapt dynamically instead of diverging from the optimal actions.
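The simplest variant, behavioral cloning, is literally just supervised learning on (observation, action) pairs recorded from human play. A minimal sketch (the observation/action sizes and network shape are placeholders, not taken from the paper):

```python
import torch
import torch.nn as nn

# Behavioral cloning: supervised learning on (observation, action) pairs from
# human demonstrations. Sizes here are placeholders for illustration only.
policy = nn.Sequential(
    nn.Linear(112, 128), nn.ReLU(),
    nn.Linear(128, 5),                      # logits over 5 discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(observations: torch.Tensor, human_actions: torch.Tensor) -> float:
    """observations: (batch, 112) floats, human_actions: (batch,) action indices."""
    logits = policy(observations)
    loss = loss_fn(logits, human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The compounding-error problem is exactly why methods like DAgger interleave this with letting the learned policy act and having the expert relabel the states it actually visits.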
Interesting. Ideally it would use the observed human behaviors to seed/inform its own attempts, as a shortcut to advanced behavior without the many millions of generations otherwise needed.
Great viz, design & structure! But for the first time, I had the impression that you didn't report anything new or different. All the takeaways of this work were pretty obvious given the last couple of years of research. Am I missing anything?
I have a friend who observed similar emergent behavior in an a-life (gene-based from what I understand) simulation he created, in an environment of "tanks in a maze" (or something like that).
The "genes" consisted of a simplified assembler (run on a VM) that could describe a program the tank would use to control itself - it could sense other tanks within line-of-sight to a certain degree, it could sense walls, it could fire its cannon, move in a particular direction, sense when another tank had a bearing (cannon pointed) on itself, etc.
He set up 100 random tanks (with random "genes"/programs) and let the simulation run. Top scorers (who had the most kills) would be used to seed the next "generation", using a form of sexual "mating" and (pseudo-)random mutation (the basic loop is sketched below). Then that generation would run.
He said he ran the simulation for days at a time. One day he noticed something odd. He started to notice that certain tanks had "evolved" the means to "teleport" from location to location on the map. He didn't design this possibility in - what had happened was (he later determined) that a bug he had left in the VM was being exploited to allow the tanks to instantaneously move within their environment. He thought it was interesting, so he left it as-is and let the simulation continue.
After a long period of running, my friend then noticed something very odd. Some tanks were "wiggling" their turrets - other tanks would "wiggle" in a similar fashion. After a while all he could deduce was that in some manner, they were communicating with each other, similar to "bee dancing", and starting to form factions against each other...
...it was at that point he decided things were getting much too strange, and he stopped the experiment.
Sadly, he no longer has a copy of this software, but I believe his story, simply because I have seen quite a bit of his other code and have worked closely enough with him on various projects since (as an adult) to know that such a system was well within his capability to create.
At the time, he was probably only 16 or 17 years old, the computer was a 386, and this was sometime in the early 1990s. I believe the software was likely a combination of QuickBasic 4.5 and 8086 assembler running under DOS, as that was his preferred environment at the time.
I've often considered recreating the experiment, using today's technology, just to see what would happen (at the time he related this to me, as an adult, he asked me how difficult it would be to make a more physical version of this "game"; I'm still not sure if he meant scale model tanks, or full-sized - knowing him, though, he would have loved to play with the latter).
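For reference, the generational loop he described boils down to something like the sketch below (the genome encoding, fitness function, selection fraction and mutation rate here are stand-ins of mine, not his original design):

```python
import random

# Rough sketch of the generational loop described above. The "genome" is just a
# list of byte values standing in for the tank's assembler program; population
# size, selection fraction, and mutation rate are all made up for illustration.

POP_SIZE, GENOME_LEN, MUTATION_RATE = 100, 64, 0.02

def random_genome():
    return [random.randint(0, 255) for _ in range(GENOME_LEN)]

def crossover(a, b):
    cut = random.randrange(GENOME_LEN)          # single-point "mating"
    return a[:cut] + b[cut:]

def mutate(genome):
    return [random.randint(0, 255) if random.random() < MUTATION_RATE else g
            for g in genome]

def evolve(fitness, generations=1000):
    """fitness(genome) -> score, e.g. kills scored in a simulated battle."""
    population = [random_genome() for _ in range(POP_SIZE)]
    for _ in range(generations):
        top = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 5]
        population = [mutate(crossover(random.choice(top), random.choice(top)))
                      for _ in range(POP_SIZE)]
    return max(population, key=fitness)
```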
Does anyone know if there are some accessible GitHub projects that can do something similar to this? Would like to set up a new project with my nephew :)
You might want to make it clearer that the agents don't actually receive any visual observations, but rather directly the xy positions of all other agents and objects.
This also seems very similar to "Capture the Flag: the emergence of complex cooperative agents" (https://deepmind.com/blog/article/capture-the-flag-science)?
Regarding the conclusion:
> We’ve provided evidence that human-relevant strategies and skills, far more complex than the seed game dynamics and environment, can emerge from multi-agent competition and standard reinforcement learning algorithms at scale. These results inspire confidence that in a more open-ended and diverse environment, multi-agent dynamics could lead to extremely complex and human-relevant behavior.
This has been well established for a while already, e.g. the DeepMind Capture the Flag paper above, AlphaGo discovering the history of Go openings and techniques as it learns from playing itself, AlphaZero doing the same for chess, etc.