What type of prompts were you feeding it? My limited understanding is that reasoning models will outperform standard LLMs like GPT-4/Claude at certain tasks but not others. Prompts whose answers are fuzzier and less deterministic (i.e., soft sciences) will see reasoning models underperform, because their training revolves around RL with verifiable rewards — which favors domains where answers can be checked.