My team has measured ~94% accuracy for our LLM feature across extensive, reliable tests. I'm fairly confident in that number, though I'm speaking as an SWE, not a DS or ML engineer.
Yeah, I've had similar results. Even with GPT-o1, I find almost all errors at this point come from the web search functionality and the model treating some random source as an authority. Interestingly, my human intelligence in the process is most useful for hand-collecting the sources and data to analyze -- and, of course, for directing the process across multiple LLM queries.