Yup. Exactly this. As soon as enough people get screwed by the ~80% accuracy rat...

agentcoops · 2025-07-12T14:24:07 1752330247

Charitably, I don’t understand what those like you mean by the “whole facade” and why you use these old machine learning metrics like “accuracy rate” to assess what’s going on. Facade implies that the unprecedented and still exponential organic uptake of GPT (again, see the actual data I linked earlier from Mary Meeker) is just a hype-generated fad, rather than people finding it actually useful to whatever end. Indeed, the main issue with the “facade” argument is that it’s actually what dominates the media (Marcus et al.) much more than any hyperbolic pro-AI “hype.”

This “80-20” framing, moreover, implies we’re just trying to asymptotically optimize a classification model or some information retrieval system… If you’ve worked with LLMs daily on hard problems (non-trivial programming and scholarly research, for example), the progress over even just the last year is phenomenal — and even with the presently existing models I find most problems arise from failures of context management and the integration of LLMs with IR systems.

daveguy · 2025-07-12T17:11:23 1752340283

Time will tell.

12345hn6789 · 2025-07-12T21:17:08 1752355028

My team has measurably gotten our LLM feature to ~94% accuracy in widespread, reliable tests. Seems fairly confident, speaking as an SWE, not a DS or ML engineer, though.

agentcoops · 2025-07-16T16:12:56 1752682376

Yeah, I've had similar results. Even with GPT-o1, I find almost all errors at this point come from the web search functionality and the model taking some random source X as an authority. It's interesting that I find my human intelligence in the process is most useful for hand-collecting the sources and data to analyze -- and, of course, for directing the process across multiple LLM queries.