Entropy Maximization and intelligent behaviour

RangerScience · on July 6, 2017

Okay, TL;DR:

"Causal Entropic Forcing" is something like an AI's utility function, where the agent attempts to maximize future possibilities. Since this is meaningless (all possible futures are possible), what you actually want to do is make it as easy as possible to get to those futures - aka, their entropic adjacency, hence the name, causal entropic forcing.

However, CEF requires that the agent can actually predict possible future states of the system, which comes with some serious issues. In the original paper, this is covered by access to perfect simulators, but those aren't available in real-world situations.

This post discusses how to (possibly) use recurrent neural networks to make such predictions, how to do so effectively, and how to take into account the NN's confidence in its predictions.

It's pretty cool!
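
For anyone who wants the formal statement behind this summary, here is the definition as I recall it from the original causal entropic forces paper (Wissner-Gross and Freer, 2013). The symbols are that paper's, not this thread's, so treat it as a sketch to check against the source rather than a quotation:

```latex
% Causal path entropy of macrostate X over a time horizon \tau:
% the entropy of the distribution over paths x(t) that start in X.
S_c(X, \tau) = -k_B \int_{x(t)} \Pr\big(x(t) \mid x(0)\big)
               \ln \Pr\big(x(t) \mid x(0)\big) \, \mathcal{D}x(t)

% Causal entropic force at the current macrostate X_0,
% scaled by a "causal path temperature" T_c.
F(X_0, \tau) = T_c \, \nabla_X S_c(X, \tau) \big|_{X_0}
```

The force points toward macrostates from which more distinct future paths are reachable within the horizon, which is the "make it easy to get to those futures" idea above.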

clickok · on July 6, 2017

There are a lot of details that need to be filled in to get a working algorithm from this idea, though. Like how to properly explore enough of the state space that you can estimate the ensuing entropy, and whether it's possible to learn the utility and its variance in a sample-efficient manner using an RNN. It might be better to start off with an environment with unknown dynamics but an exact representation before going full nonlinear function approximation.

Nonetheless it's an interesting post; I like the idea of coming up with abstract "goals" that can be applied to any environment (without having to construct a reward function) that yield complex behavior. Even if it doesn't do precisely what you want it's useful for exploration and perhaps a good stepping stone towards the desired behavior.

On a related note, I believe you can learn to predict the entropy of a Markov process using reinforcement learning, so it might be possible to extend it towards control.

I wrote up the basic idea: http://rl.ai/posts/generalized-returns-entropy.html which argues that if you had some sort of state transition model you could construct a reward function from it, and then learning the value of a state is also the "expected entropy" starting from that state. The state transition model can itself be learned, so no simulator is required. The reason I say "I believe" is that this is the product of original research, that is, procrastinating on my thesis. So there's a risk that I've made an error somewhere or missed prior work.
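
A minimal sketch of the idea in the comment above, assuming a small, fully known Markov chain: take the one-step entropy of each state's next-state distribution as the "reward", and the discounted value of a state then reads as the expected entropy accumulated from that state. The transition matrix `P`, the discount `gamma`, and the function names are illustrative assumptions, and the linked write-up may define the reward differently.

```python
import math

def state_entropy(row):
    """Entropy (in nats) of one state's next-state distribution."""
    return -sum(p * math.log(p) for p in row if p > 0)

def expected_entropy_values(P, gamma=0.9, sweeps=500):
    """Solve v = h + gamma * P v by value iteration, where h is the per-state entropy."""
    n = len(P)
    h = [state_entropy(row) for row in P]
    v = [0.0] * n
    for _ in range(sweeps):
        v = [h[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]
    return v

# Hypothetical 3-state chain: state 2 is absorbing, so it contributes no entropy.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
v = expected_entropy_values(P)
print(v)
```

States further from the absorbing state get a higher value because more uncertain transitions still lie ahead of them. With a learned transition model, `P` would be an estimate rather than a given.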

highd · on July 6, 2017

In that case, aren't the big benchmark gains being claimed mostly the result of changing the problem to allow a perfect simulator in the system? The original author is claiming benchmark results either with a perfect simulator or with a pre-trained neural network mimicking a simulator. It seems like a massive change to the original problem. Otherwise the utility function is very similar to Q learning, just optimizing for future "flexibility" instead of future "reward".

Basically we should be considering the current posted results versus Q learning with an equally accurate pre-trained forward simulator, which I don't think anyone has done.

RangerScience · on July 6, 2017

> gains being claimed

AFAIK, the gains being claimed are that they didn't have to supply the system with any goals, and it "figured out" the basic tests, including tool use and cooperation.

> with an equally accurate pre-trained forward simulator

AFAIK, that's exactly what the OP is discussing: how would this system perform when you replace the perfect simulator with an RNN trained to predict?

highd · on July 6, 2017

I'm referring to posts like this: https://entropicai.blogspot.fr/2017/06/openai-first-record.h...

RangerScience · on July 6, 2017

I'm not familiar with these works. Reading...

[edit]

Okay, I don't really understand what they're doing, so my guess is that they have a component that predicts future states of the game, and they use something inspired by fractals to determine which future states to sample. Then, they're either using the score as the metric for the "value" of that future state (in which case it's not CEF), or they're ignoring the score and measuring something corresponding to entropy (or future-possibilities-remaining since it's pacman), and then they are using CEF.

It's like Data playing that weird chess game; the CEF aspect doesn't help you play any better, but it gives you a different win condition that turns out to "win" better than directly trying to win.

If, that is, my bad understanding is in any way accurate.

felippee · on July 6, 2017

Yes, this is very cool stuff and I've been advocating for it in previous posts. There is indeed the problem of prediction. Prediction seems to be an omnipresent paradigm in many aspects of perception, even in early vision.

This, contrary to beating deep learning models to death, may hold the key to artificial intelligence.

I have my own little model that I propose for prediction, which I call the Predictive Vision Model. More info here: http://blog.piekniewski.info

I'm looking for more people who see the potential of this.

RangerScience · on July 6, 2017

This (causal entropic forcing) is one of the coolest ideas I've ever come across; it's one of my go-to stories about machine learning and philosophy.

It combines well with Jeremy England's theory that life is entropically inevitable: https://www.scientificamerican.com/article/a-new-physics-the...

...and I wonder sometimes if you could make a religion out of all this; morality and existence based on entropic math. Consider that the goal of CEF is maximized possibilities, smoke a bowl, and think about fractals and holograms.

One of the things I find really interesting about CEF is that it doesn't specifically help with understanding or predicting the world around you; it just gives a very effective way to determine what possible actions you should actually do. Given that (AFAIK) the human brain/mind is itself a combination of many systems, it seems to me to be very elegant that a CEF agent is also a combination of systems, each of which has limitations and issues.

eli_gottlieb · on July 6, 2017

>One of the things I find really interesting about CEF is that it doesn't specifically help with understanding or predicting the world around you; it just gives a very effective way to determine what possible actions you should actually do.

Well, it gives one way to prescribe actions, but no way to prescribe actions we actually care about. Maximizing possible futures rightfully ought be a mere subgoal or consequence of the actual prescriptions we care about.

Personally I like free-energy theory best, but it really still needs some work to distinguish which "predictions" change to accommodate prediction-errors and which drive action. The original equations basically claim they both change at the same time to minimize the free-energy, but by then the generative models and recognition densities themselves approach tautology.

In a certain sense, you could view anything which behaves "teleologically", which self-organizes and moves itself preferentially into some states over others, as engaging in active inference on some generative density. The problem is to describe or prescribe what the generative and recognition densities actually are, lest the theory just be mere philosophy.

MrQuincle · on July 7, 2017

Isn't free-energy nothing more than a maximization approximation to Bayesian inference?

p(u|x)=p(x|u)p(u)/p(x)

+ max in log space: max ln p(x|u) + ln p(u)

+ use variational approximation, e.g. Kullback-Leibler: min ln KL(q(u)|p(x|u)) + ln p(u)

+ define free energy: F = ln p(u) - KL, so it can be maximized

Hence, rather than with KL where we minimize over a ratio with conditional p(x|u), we maximize over the joint 1/p(x,u). So we optimize for both likelihood and prior.

Sounds to me like a sloppy Bayesian approach. :-) Normally, the prior is intended to be used as a full distribution, not as something to pull a most probable value from.

In this approach we maximize for both prior and likelihood. Seems logical that we get all kinds of possible trade-offs. Which should we choose? And why is free-energy so perfect?

eli_gottlieb · on July 7, 2017

The "free-energy principle" in this case isn't just variational inference via a free-energy cost derived from the KL divergence. It involves treating action (control signals) as a variational parameter to the recognition density. The agent thus treats their likelihood function as a descriptive model of the world, and their prior as a prescriptive model: they update their beliefs (recognition density) to be accurate Bayesian inferences, but then act so as to reduce prior improbability (despite accumulation of likelihood data).
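
For reference, the plain variational-inference part that both comments refer to can be written in the parent comment's notation (x for observations, u for hidden causes, q(u) for the recognition density). This is only the standard identity, written as a quantity to be minimized (the opposite sign convention to the parent comment); the extension that treats action as a variational parameter is not shown.

```latex
% Variational free energy of a recognition density q(u), given data x
F[q] = \mathbb{E}_{q(u)}\big[\ln q(u) - \ln p(x, u)\big]
     = \mathrm{KL}\big(q(u) \,\|\, p(u \mid x)\big) - \ln p(x)

% Since KL >= 0, F[q] is an upper bound on the surprise -\ln p(x),
% and the bound is tight exactly when q(u) = p(u \mid x).
F[q] \ge -\ln p(x)
```

Minimizing F over q therefore pushes q toward the full posterior p(u|x), not just toward a single most probable value of u.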

smallnamespace · on July 7, 2017

> Maximizing possible futures rightfully ought be a mere subgoal or consequence of the actual prescriptions we care about.

To play devil's advocate though, it may be the case that our normal prescriptive preferences have the consequence of maximizing possible futures. E.g. 'attempt to survive' -> you will have future possibilities. It's possible that these are equivalent goals.

eli_gottlieb · on July 7, 2017

Well the case is that our normal prescriptive preferences don't maximize possible futures. That's actually what we mean by prescriptive: they remove weight from futures we don't actually like, and give it to those we do. We call them "preferences" because they prescribe a non-maximum entropy (constrained, nonuniform) distribution over future events.

canjobear · on July 7, 2017

> I wonder sometimes if you could make a religion out of all this

https://www.amazon.com/Our-Mathematical-Universe-Ultimate-Re...

highd · on July 6, 2017

I've been trying to parse this body of work - there doesn't seem to be a writeup on the exact implementation, just that it's using "Causal Entropic Forces". They have this writeup on the optimization implementation here: https://arxiv.org/pdf/1705.08691.pdf

One red flag for me is that they're simultaneously claiming that there's no training during the OpenAI Gym while also claiming that the optimization approach is relevant. In that case, what is being optimized? It seems like they might be optimizing over previous simulations - there's frequent reference to having access to a "simulator". In that case, that should effectively count as training, right? I was under the impression that the OpenAI Gym was supposed to benchmark untrained approaches so they could be compared by learning time. Hence the gradually increasing training curves in the other approaches.

gabrielgoh · on July 6, 2017

I think an analogy can be made with Bayesian statistics. In principle, Bayesian statistics requires no training, just a way of sampling from the posterior, usually done with expensive MCMC methods.

Here, we do not need training of any kind either, just a Monte Carlo simulation of the environment and an approximation of which path has the greatest path entropy. Basically, given a state, you:

- Compute the path entropy for all states you can move to

- Move into the state with greatest path entropy

The tradeoff here is that all the work occurs in inference: every decision requires a complex simulation. In training-based approaches the heavy lifting is done during training, and inference is easy.
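
The two steps above can be sketched as follows, under some loud assumptions: the "perfect simulator" is a toy corridor with walls at both ends, and path entropy is approximated by the entropy of the end-state histogram of random rollouts (a cheap proxy, not the path-distribution entropy of the original paper). `step`, `estimate_path_entropy`, `choose_action`, and all constants are hypothetical names chosen for illustration.

```python
import math
import random
from collections import Counter

ACTIONS = (-1, 0, 1)  # hypothetical action set: step left, stay, step right
SIZE = 11             # hypothetical corridor with cells 0..10, walls at both ends

def step(state, action):
    """Toy stand-in for the perfect simulator: moves are clamped at the walls."""
    return min(SIZE - 1, max(0, state + action))

def estimate_path_entropy(state, horizon, rollouts, rng):
    """Entropy (in nats) of the end-state histogram of random rollouts from `state`."""
    ends = Counter()
    for _ in range(rollouts):
        s = state
        for _ in range(horizon):
            s = step(s, rng.choice(ACTIONS))
        ends[s] += 1
    return -sum((c / rollouts) * math.log(c / rollouts) for c in ends.values())

def choose_action(state, horizon=4, rollouts=2000, rng=None):
    """Score every reachable next state, then move to the highest-entropy one."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    scores = {
        a: estimate_path_entropy(step(state, a), horizon, rollouts, rng)
        for a in ACTIONS
    }
    return max(scores, key=scores.get)

print(choose_action(0))         # agent starts pressed against the left wall
print(choose_action(SIZE - 1))  # agent starts pressed against the right wall
```

Note that every call to `choose_action` runs thousands of simulated steps, which is the inference-time cost described above.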

highd · on July 6, 2017

Yes - the issue is that the work is currently presented as requiring "no training", but it has simply relocated that problem to constructing a perfect simulation of the environment. It then uses the fact that current benchmarking systems have available simulations to "cheat" rather than learning that function itself. One of the most difficult and interesting parts of reinforcement learning is constructing the function that determines the evolution of the system. If you know the evolution function a priori the problem is mostly trivial - e.g. alpha-beta search, graph searching, etc.

It's interesting that this merit function works in the absence of a real reward signal, but there's no fair comparison against systems using a reward signal due to this huge alteration to the problem that is providing a perfect simulation.

gabrielgoh · on July 6, 2017

I agree completely that what's happening is nothing more than brute-force search. Though I do think this is still interesting, as the reward here is potentially much more well-conditioned than the rewards in RL.

Having said that, there are situations where this will fail completely, e.g. in maze solving, where the goal is not to play to keep playing but to play to reach the end.

highd · on July 6, 2017

It seems like a more comparable reinforcement learning thing to do would be to combine the entropy criterion with a known reward when available in some way and then do Q learning on that without the simulation requirement. Then in cases where reward is uncertain or infrequent you fall back to a flexibility heuristic.

robertsdionne · on July 7, 2017

Maybe like https://pathak22.github.io/noreward-rl/

RangerScience · on July 6, 2017

So, I'm a little confused by what you're asking - are you asking more about the original ArXiv paper, or the Paulispace post?

I think I can explain what you're confused about, if I can understand what it is you're confused about better :)

highd · on July 6, 2017

The original work. What is the optimization problem being solved precisely? What exactly is done prior to submission to the OpenAI gym? What data does the system have access to prior to submission and during runtime?

RangerScience · on July 6, 2017

Okay, so, from my understanding:

The system has access to available actuators (AFAIK, the X or X+Y position of the agent), a perfect simulator (given this action, that position is the result), and an equation to measure the energy of the system (in a physics / entropy sense).

The first example is the inverted pendulum (segway). The agent can move along X, and it takes more energy to go from the down / fallen position to the upright position, than vice versa. Thus, the upright position has better entropy (I never get the +- right with entropy, so I don't know if that means more or less entropy).

Since the system knows the entropy present in all possible future states of the system (via the perfect simulator plus the entropy math), it can make a sort of "map", and plot a path from where it is to the global max.

In simpler terms, it's optimizing how much energy it takes to get from the current state to all possible future states of the system: it's way easier (literally, takes less energy) to let the segway fall down than to stand it up in the first place.

Does that help?

highd · on July 6, 2017

I understand. It appears that constructing the problem this way is a very unfair way to measure if this idea works compared to other reinforcement learning approaches. If you can simulate the system perfectly you can always just simulate k steps for all possible inputs and pick the one that works best.
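
A minimal sketch of that brute-force baseline, assuming a deterministic simulator and a hand-written score: enumerate every action sequence of length k, simulate it, and take the first action of the best sequence. `simulate`, `score`, and the corridor itself are hypothetical stand-ins, not anything from the original work.

```python
from itertools import product

def simulate(state, action):
    """Hypothetical perfect simulator: a corridor of cells 0..10, clamped at the walls."""
    return min(10, max(0, state + action))

def score(state, goal=5):
    """Hypothetical hand-written measure of "what works": closeness to a goal cell."""
    return -abs(state - goal)

def best_first_action(state, k, actions=(-1, 0, 1)):
    """Simulate all len(actions)**k sequences and return the first action of the best one."""
    best_seq, best_total = None, float("-inf")
    for seq in product(actions, repeat=k):
        s, total = state, 0
        for a in seq:
            s = simulate(s, a)
            total += score(s)
        if total > best_total:
            best_seq, best_total = seq, total
    return best_seq[0]

print(best_first_action(2, k=3))
```

The search is exponential in k, and the `score` function is the part that has to be supplied by hand in this baseline.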

RangerScience · on July 6, 2017

Right, but - how do you measure "what works best"?

CEF is an answer to "what works best" that's (theoretically?) applicable to all systems.

pizza · on July 7, 2017

I remember this -- entropica -- from, well must have been like 5 or 6 years ago now

http://www.entropica.com/

pzone · on July 6, 2017

This blog post seems to be a comment or response aimed at people who already understand the paper, not an exposition for someone encountering it for the first time. I think I'm moderately well versed in probability and information theory and couldn't make heads or tails of it.

RangerScience · on July 6, 2017

I think I understand it all well enough to explain! Want to ask some questions?

mehwoot · on July 7, 2017

> Maximizing your number of future options is not always a good idea. Sometimes fewer options are better provided that these are more useful options

I guess I'm missing something, because this seems to negate the entire point... isn't the point that number of future options is a good measure of "more useful options"?

TuringTest · on July 7, 2017

I think that sentence is meant to highlight one weakness of that measure. It may be a good heuristic in many circumstances, but if you have direct knowledge about the problem domain (like in the football players example), applying this specific heuristic may give better results than using the generic one. To solve difficult problems with approximate methods, you usually need to combine several heuristics anyway.

mrdrozdov · on July 7, 2017

At a quick glance, this work seems related to Information Maximization, as is done in the papers for InfoGAN, VIME, and Intrinsic Motivation (for automatic goal-setting in RL).

canjobear · on July 6, 2017

How does this relate to concepts like AIXI and Solomonoff induction?

chriswarbo · on July 7, 2017

Solomonoff induction is an (uncomputable) method which takes a sequence of inputs and predicts the subsequent inputs. If that sequence comes from some sensor, like a camera, then it can be used to predict what that sensor will detect in the future (and hence, indirectly, what the future state of the world will be). Solomonoff induction is completely passive: it doesn't say anything about how to choose an action to take.

AIXI applies Solomonoff induction to a reinforcement learning (RL) setting: the sequence is split into three parts: "observations" (passive input, e.g. from a camera), "actions" (which are under the agent's control) and "rewards" (which are numbers). AIXI uses Solomonoff induction to calculate what the total future rewards will be, if the sequence so far were followed by action A; or by action B; etc. and then performs whichever of those actions gave the largest predicted reward. This does tell us which action to take (at least, computable approximations do), but it relies on there being a source of reward; all sorts of "AI safety" research (e.g. intelligence.org ) is based around what such a reward should look like, and ways that an AI might achieve high reward whilst subverting our intentions.

This 'causal entropic force' is a sort of implicit reward: the system is rewarded when it is able to efficiently reach other states; so it ends up 'putting itself in a good position', whatever that might mean in a particular situation.

It hand-waves away a few key points: it needs a good predictor (e.g. Solomonoff induction, or something computable), and it also seems to need a world model which tells it what the "states" are. Solomonoff and AIXI don't need to be given a model: they build their own implicitly. They do need their input to be hooked up, e.g. to take pixels from a camera or whatever, but that's a known property of the implementation (e.g. the hardware available on a robot), whereas there's usually a bunch of ways we could model the world, with no "obvious" right answer, and that can directly affect how the system behaves.