I don't think it does, because judging from this paper, this kind of backfeeding is apparently quite difficult to train.
I've said it before, but I think it's just something like Quiet-STaR, only simplified. They have a bunch of question-answer pairs, many of which are difficult. They generate a lot of tokens from the question (say, 3x the length of the expected answer), summarise whatever is generated, and reinforce whenever the model produces the right answer.
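To make that concrete, here's a minimal sketch of the loop I'm describing. To be clear, all of it is my guess: the `Model` interface, the 3x token-budget heuristic, and the exact-match reward are assumptions of mine, not anything confirmed by OpenAI or taken verbatim from the Quiet-STaR paper.

```python
from typing import Protocol


class Model(Protocol):
    # Hypothetical interface (names are mine): a real implementation
    # would wrap an LM plus a policy-gradient optimiser.
    def generate(self, prompt: str, max_tokens: int) -> str: ...
    def summarise(self, question: str, reasoning: str) -> str: ...
    def reinforce(self, question: str, reasoning: str, reward: float) -> None: ...


def training_step(model: Model, question: str, expected_answer: str) -> None:
    # Let the model "think": sample a generous reasoning budget,
    # roughly 3x the length of the expected answer.
    budget = 3 * len(expected_answer.split())
    reasoning = model.generate(question, max_tokens=budget)

    # Condense whatever was generated into a short final answer.
    answer = model.summarise(question, reasoning)

    # Reward (REINFORCE-style) only when the summarised answer matches
    # the reference, upweighting the reasoning traces that led to it.
    if answer.strip() == expected_answer.strip():
        model.reinforce(question, reasoning, reward=1.0)
```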
o1 is most likely just 4o optimized for CoT, either with some fine-tuning or perhaps merely with a dedicated system prompt (which is probably why they don't let you access it in the API) and enforced structured output. In fact, you can recreate something very similar using 4o with the right system prompt plus structured outputs.
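Here's roughly what that recreation could look like with the OpenAI Python SDK. The system prompt wording and the JSON schema are my own inventions; the point is just that forcing a "reasoning" field before an "answer" field via structured outputs gets you o1-ish behaviour out of plain 4o.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical system prompt; any "think first, answer second" phrasing works.
SYSTEM_PROMPT = (
    "Work through the problem step by step in the 'reasoning' field, "
    "then give only your final answer in the 'answer' field."
)

# Structured-outputs schema: the model must emit reasoning before the answer.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "reasoned_answer",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "reasoning": {"type": "string"},
                "answer": {"type": "string"},
            },
            "required": ["reasoning", "answer"],
            "additionalProperties": False,
        },
    },
}

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
    response_format=RESPONSE_FORMAT,
)
print(resp.choices[0].message.content)  # JSON with "reasoning" and "answer"
```

You'd then show the user only the "answer" field and keep "reasoning" hidden, which is essentially what the o1 UI does with its thinking tokens.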
I don't think o1 is anything complicated.