I can only barely follow what's being said, given my unfamiliarity with the field, but it seems to say: "agents used to be able to choose an optimal strategy only if they were competing against agents with less information (fewer possible strategies known)[1], but now it has been discovered that they can formally reach a Nash equilibrium through the use of something called a reflective oracle."
Basically, two equal level participants should result in a Nash Equilibrium?
1: Except for cases where the set of possible strategies is small, such as the Prisoner's Dilemma. See footnote 1 in the article.
> Basically, two equal level participants should result in a Nash Equilibrium?
No. Whether you have a Nash equilibrium depends on the strategies used; you can always have agents following strategies that don't form a Nash equilibrium, independent of their relative strengths.
The problem they solved is that agents were unable to reason about themselves: what will my opponent do? He is just like me, so he will ask himself what his opponent - that is, me - will do. But I will of course try to figure out what my opponent will do, but he is just like me, and so he will... Simplified, of course, but here you have an infinite regress from which you have to break free. The solution they came up with and analyzed is to just pick an arbitrary, possibly randomized answer in certain situations.
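A caricature of that regress and the randomized escape hatch, as a sketch (the names and structure here are made up for illustration, not the paper's construction):

```python
import random

# Naive mutual prediction: each agent simulates the other, which simulates
# it back, so the recursion never bottoms out.
def naive_predict(depth=0, limit=100):
    if depth > limit:
        raise RecursionError("infinite regress: opponent models opponent models ...")
    return naive_predict(depth + 1, limit)

# The reflective-oracle fix, caricatured: on self-referential queries the
# oracle is allowed to answer with an arbitrary (here: fair) coin flip,
# which cuts the regress off immediately.
def oracle_predict():
    return random.choice([0, 1])  # e.g. 1 = "cooperate", 0 = "defect"
```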
Thanks for the clarification. So I take it that means I should read the following:
> The key feature of reflective oracles is that they avoid diagonalization and paradoxes by randomizing in the relevant cases.[2] This allows agents with access to a reflective oracle to consistently reason about the behavior of arbitrary agents that also have access to a reflective oracle, which in turn makes it possible to model agents that converge to Nash equilibria by their own faculties (rather than by fiat or assumption).
as saying not that the agents will converge on a Nash equilibrium, but that it is now possible to model agents that do converge on a Nash equilibrium, which previously we could not (or at least not definitively)? That's what it seems to say outright, which I missed previously; I just want to make sure I'm not approaching it from the wrong context.
Don't quote me on that, I have really nothing to do with game theory, but as I understand it the paper shows that reflective-oracle-computable policies are optimal in the discussed setup (Theorem 25) and that they will yield a Nash equilibrium if all policies are asymptotically optimal (Theorem 28), which is possible because a limit-computable reflective oracle exists (Theorem 6) and Thompson sampling is asymptotically optimal. So achieving a Nash equilibrium still requires that all agents play along: they have to have asymptotically optimal policies. If other agents show erratic behavior you are unable to make sense of, you cannot respond optimally.
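For readers unfamiliar with Thompson sampling: the classic Bernoulli-bandit version (not the paper's setup, which applies it to general environments) is just "sample from the posterior, then act greedily on the sample". A minimal sketch, with all names my own:

```python
import random

def thompson_sampling(arm_probs, steps=5000, rng=random.Random(0)):
    """Bernoulli bandit with Beta(1, 1) priors on each arm's success rate."""
    n = len(arm_probs)
    wins = [1] * n     # Beta posterior parameters (successes + 1)
    losses = [1] * n   # (failures + 1)
    pulls = [0] * n
    for _ in range(steps):
        # Draw a plausible success probability for each arm from its posterior.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n)]
        # Play the arm that looks best under this draw.
        arm = max(range(n), key=lambda i: samples[i])
        reward = 1 if rng.random() < arm_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Run on three arms with true success rates 0.2, 0.5, 0.8, the pull counts concentrate on the best arm; that concentration is the bandit analogue of the asymptotic optimality the comment mentions.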
You can still respond optimally in a local sense, that is, derive a dominant strategy. It will take longer than for an agent that is not reflective, though, at least by one sampling step.
> He is just like me so he will ask himself what will my opponent - that is me - do. But I will of course try to figure out what my opponent will do but he is just like me and so he will...
In other words, exactly what Vizzini was trying to reason out in the Princess Bride...
Sounds like minimax with a depth limit. Lots of notation in the paper, but it looks like min and max are both used.
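For comparison, depth-limited minimax is only a few lines. This sketch has nothing to do with the paper's construction; it just illustrates what the parent is alluding to (all names are generic):

```python
def minimax(state, depth, maximizing, children, evaluate):
    """Depth-limited minimax over an explicit game tree."""
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state)  # cutoff: fall back to the heuristic value
    values = [minimax(k, depth - 1, not maximizing, children, evaluate)
              for k in kids]
    return max(values) if maximizing else min(values)

# Tiny example: node 'a' branches to leaves 'b' (worth 3) and 'c' (worth 5).
tree = {'a': ['b', 'c'], 'b': [], 'c': []}
leaf_value = {'a': 0, 'b': 3, 'c': 5}
best = minimax('a', 2, True, tree.get, leaf_value.get)  # maximizer picks 5
```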
I thought this was an interesting snippet:
> At the core of the following construction is a dogmatic prior [LH15a, Sec. 3.2]. A dogmatic prior assigns very high probability to going to hell (reward 0 forever) if the agent deviates from a given computable policy π.
> just pick an arbitrary, possibly randomized answer in certain situations.
Based on the explanation in Footnote 2, it sounds like the "certain situations" are ones that "almost never" occur, in the measure-theoretic sense. Is that reading correct? If so, that's very interesting to me--the intractability caused by self-reflection (which is similar to the intractability caused by Turing-completeness) is actually confined to a small region of the problem space, and if you just paper those over then you can proceed.
I think this is not correct. If the oracle is queried about something it cannot know, it will always (have to) give a random answer; the thing about hitting the probability p exactly is misleading. The paper gives the example of a Turing machine querying the oracle whether the probability of itself halting is smaller or larger than 50% and then doing exactly the opposite. So the oracle cannot tell, assigns a 50% probability to both halting and not halting, and therefore always randomizes the answer. It is really more about the nature of the query; the probability thing is more of a technicality.
As I understand it, they are essentially talking about running the oracle for some time and then asking for the current probability of the result being one or zero. If the thing asked about is computable at all, and in less time than the oracle ran, then the answer will be certain; otherwise the oracle may have accumulated some evidence and assign different probabilities to the two results, or it may still have no idea after the given time and remain indifferent.
The question is then: does the oracle believe that the result is X with probability larger than p? If the oracle does not know, it will think the result is X with 50% probability. If you set p above 50%, you will always get zero; if you set it below 50%, you will always get one; and if you set p to exactly 50%, the oracle will randomize. No matter what p is, you won't get any information out of the oracle in that case, and that is of course by design, because otherwise you could let the oracle solve the halting problem, which would imply such an oracle does not exist.
If you asked the oracle about a Turing machine simulating a die - these are probabilistic Turing machines, with access to randomness - you would have to set p to 1/6 and ask whether the oracle believes the next roll will yield, for example, a six. But if the outcome is really generated from true randomness, the oracle cannot know what the next roll will yield, has to assign a probability of 1/6 to rolling a six itself, and will again randomize every time.
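A toy version of the query semantics described above, with everything hypothetical (the real reflective oracle is defined over probabilistic Turing machines, not over a belief given as a float):

```python
import random

def oracle_query(belief, p, rng=random):
    """Answer 'is the probability of outcome X larger than p?' given the
    oracle's (converged) belief. At belief == p the oracle may say either."""
    if belief > p:
        return 1
    if belief < p:
        return 0
    return rng.choice([0, 1])  # exactly at the threshold: randomize

# Diagonalizing machine: it asks about itself with p = 1/2 and does the
# opposite of the answer. Its output is then forced to be 50/50, so the
# oracle's belief sits exactly at the threshold and every answer is a coin
# flip; no information leaks and no contradiction arises.
# Die example from above: belief = p = 1/6, so the answer is randomized too.
```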
I don't think the arbitrary solution can apply to real-world applications. A logarithmic debit from a player's preferred payoff, as a computation cost, might make more sense applied to the real world.
At some point the cost of calculating an opponent's preferred payoff becomes too high relative to the perceived payoff itself. By way of example, this is actually how many litigation negotiations are resolved.
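The suggestion above (the commenter's hypothesis, not anything from the paper) could be sketched as a stopping rule: keep deepening the opponent model only while the raw payoff estimate grows faster than a logarithmic computation cost. All names and the exact cost form here are assumptions:

```python
import math

def net_payoff(raw_payoff, steps, cost_coeff=1.0):
    """Raw payoff minus a logarithmic debit for the computation spent."""
    return raw_payoff - cost_coeff * math.log(1 + steps)

def best_depth(payoff_at_depth, max_depth=50, cost_coeff=1.0):
    """Pick the modeling depth where net payoff peaks.

    payoff_at_depth(d): estimated payoff after d simulation steps.
    """
    return max(range(max_depth + 1),
               key=lambda d: net_payoff(payoff_at_depth(d), d, cost_coeff))
```

With a payoff estimate that saturates (say, grows linearly up to 10 and then flattens), the rule stops deepening exactly where further computation stops paying for itself, which matches the litigation intuition: settle once modeling the other side costs more than the expected gain.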
I think these limits on individual computation, if I may mix disciplines, manifest as emotions or, beyond that, detachment. And I think there's an opportunity to model real-world games more accurately from that perspective.