Here's an example. Suppose there are two buttons, A and B. If you press A for the nth time, then you get reward n. If you press B for the nth time, then you get reward 0 if n is not a power of 2, or reward omega (the first infinite ordinal number) if n is a power of 2.
If the above rewards are shoehorned into real numbers---for example, by replacing omega with 9999 or something---then an RL agent would misunderstand the environment and would eventually be misled into thinking that pressing A yields more average reward.
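Here's a rough simulation of that (just a sketch, with 9999 standing in for omega and "average reward" meaning total reward divided by number of presses):

    # Sketch: replace omega with the finite stand-in 9999 and compare averages.
    def reward_A(n):
        return n  # the nth press of A pays n

    def reward_B(n, stand_in=9999):
        # the nth press of B pays the stand-in iff n is a power of 2
        return stand_in if n & (n - 1) == 0 else 0

    N = 100_000
    avg_A = sum(reward_A(n) for n in range(1, N + 1)) / N
    avg_B = sum(reward_B(n) for n in range(1, N + 1)) / N
    print(avg_A, avg_B)  # ~50000.5 vs ~1.7: the finite stand-in makes A look better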
There are no infinite rewards in biology and yet mathematicians seem to do just fine answering these sorts of questions.
I don’t think you want to encode your problem domain in your reward system. It’d be like asking a logic gate to add when you really should be reaching for an FPU. Maybe I’m missing something though?
>There are no infinite rewards in biology and yet mathematicians seem to do just fine answering these sorts of questions
This is only a problem if you're already assuming we do everything based on our biological reward systems, and in the current context that would be circular reasoning.
Imagine the treasury creates a "superdollar", a product which, if you have one, you can use to create any number of dollars you want, whenever you want, as many times as you want. Obviously a superdollar is more valuable than any finite number of dollars, and humans/mathematicians/AGIs would treat it accordingly, regardless of the finiteness of our biological reward systems.
> This is only a problem if you're already assuming we do everything based on our biological reward systems
Is there some other way that we do it besides our biological reward system? It sure looks like we get an apple, not an infinite reward, when we pick the right answer of selecting button B. I understand that might not satisfy you.
>Is there some other way that we do it besides our biological reward system?
Seems to me that's what this whole paper we're discussing is about. If you're already convinced that there is no other way, then you're basically already agreeing with the paper, "Reward is Enough".
What's the behavior you're trying to get the AI to do in this example? Learn how to compute powers of 2? This is a task that can be accomplished much more simply with a different reward system: for example, have A always equal 1, and have B equal 2 if n is a power of 2 and 0 otherwise.
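In code, that alternative scheme would be something like this (a sketch; the function name is mine):

    def reward(button, n):
        # A always pays 1; B pays 2 on the nth press iff n is a power of 2
        if button == "A":
            return 1
        return 2 if n >= 1 and n & (n - 1) == 0 else 0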
I understand you can use non-real numbers; that's not what I was asking. I'm asking what's a behaviour you can't replicate using a reward system based on real numbers.
>I'm asking what's a behaviour you can't replicate using a reward system based on real numbers
So glad you asked! I can give an answer which people who take the necessary time to understand it will love. It's complicated; you might have to re-read it a few times and really ponder it. It's about automatic code generation (though it might not look like it at first).
Definition 1: Define the "Intuitive Ordinal Notations" (IONs) to be the smallest set P of computer programs such that, for every computer program p, if everything p outputs is in P, then p is in P.
Definition 2: Inductively associate an ordinal |p| with every ION p as follows: |p| is the smallest ordinal bigger than every ordinal |q| such that q is an output of p. Say that p "notates" |p|.
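To make the definitions concrete, here's one possible encoding (my own, purely illustrative): take a "program" to be a Python generator function whose outputs are other such functions.

    def zero():
        return      # outputs nothing, so it is vacuously an ION; |zero| = 0
        yield

    def one():
        yield zero  # its only output notates 0, so |one| = 1

    def finite(n):
        # returns a program that outputs programs notating 0, 1, ..., n-1,
        # so the returned program notates n
        def p():
            for k in range(n):
                yield finite(k)
        return p

    def omega():
        # outputs programs notating 0, 1, 2, ...; the smallest ordinal above
        # all of them is the first infinite ordinal, so |omega| = omega
        n = 0
        while True:
            yield finite(n)
            n += 1

The reward scheme below then asks the AGI to hand over programs like omega (and far beyond it), together with arguments that they really belong to P.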
Finally, to answer your question, I want the AGI to write programs which are IONs notating large ordinals, accompanied by arguments convincing me they really are IONs. An easy way to incentivize this with RL would be as follows. If the AGI writes an ION p and an argument that convinces me it's an ION, I will grant the AGI reward |p|. If the AGI does anything else (including if its argument does not convince me), then I'll give it reward 0.
You can't correctly incentivize this behavior using reals. The computable ordinals are too non-Archimedean to do so.
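Here's the obstruction in miniature (a sketch; R is whatever finite real you try to substitute for an ordinal reward):

    def presses_of_A_to_beat(R):
        # the nth press of A pays n, so the running total n(n+1)/2 passes any
        # fixed real R after finitely many presses -- reals are Archimedean
        n, total = 0, 0
        while total <= R:
            n += 1
            total += n
        return n

    print(presses_of_A_to_beat(9999))  # 141
    # ...whereas no finite number of presses of A ever reaches omega, so any
    # real-valued substitute flips which policy the agent should prefer.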