So is GRPO that much better because it assigns feedback relative to a whole tight band of on-policy answers (each answer scored against its group's quality range, with the band trending upward in aggregate), or is it simply that a faster algorithm means more updates in a given training duration?
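For context on what that "band" is: GRPO samples a group of answers per prompt and scores each one relative to the group's mean reward, rather than against a learned value baseline. A minimal sketch of that group-relative advantage idea (the exact normalization, e.g. population vs. sample std, plus clipping and KL terms, varies by implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """Score each sampled answer relative to its group's mean reward,
    normalized by the group's reward spread."""
    mean = statistics.mean(rewards)
    # Guard against a zero std when all answers in the group tie.
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# e.g. four on-policy answers to one prompt, scored by a reward model
advs = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

Answers above the group mean get positive advantage and are reinforced; answers below it are penalized, so the whole band shifts upward even without an absolute notion of quality.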