So I think this is an excellent post. Indeed, LLM maximalism is pretty dumb. They're awesome at specific things and mediocre at others. In particular, I get most frustrated when I see people try to use them for tasks that need deterministic outputs, where the thing you need to create is already known statically. My hope is that it's just people being super excited by the tech.
I wanted to call this out, though, as it makes the case that to improve any component (and really make it production-worthy), you need an evaluation system:
> Intrinsic evaluation is like a unit test, while extrinsic evaluation is like an integration test. You do need both. It’s very common to start building an evaluation set, and find that your ideas about how you expect the component to behave are much vaguer than you realized. You need a clear specification of the component to improve it, and to improve the system as a whole. Otherwise, you’ll end up in a local maximum: changes to one component will seem to make sense in themselves, but you’ll see worse results overall, because the previous behavior was compensating for problems elsewhere. Systems like that are very difficult to improve.
I think this makes sense from the perspective of a team with deeper ML expertise.
What it doesn't mention is that this is an enormous effort, made even larger when you don't have existing ML expertise. I've been finding this one out the hard way.
I've found that if you have "hard criteria" to evaluate (e.g., getting the LLM to produce a given structure rather than open-ended output for a chat app), you can quantify improvements by using observability tools (SLOs!) and iterating in production. Ship changes daily, track versions of what you're doing, and keep on top of behavior over a period of time. It's arguably a lot less "clean", but it's way faster, and because it's working on real-world usage data, it's really effective. An ML engineer might call that some form of "online test", but I don't think the term really applies.
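To make that concrete, here's a minimal sketch of what I mean, assuming OpenTelemetry for instrumentation; `call_llm`, the attribute keys, and the required fields are all made up for illustration, not any real product's API:

```python
# Treat "did the LLM return the structure we asked for?" as a pass/fail signal,
# attach it to a span along with the prompt version, and build an SLO on that
# attribute. All names here are hypothetical.
import json
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

PROMPT_VERSION = "2023-09-14.3"  # bump whenever the prompt or model changes
REQUIRED_KEYS = {"title", "query", "columns"}

def call_llm(user_input: str) -> str:
    """Placeholder for whatever client you use to get a completion."""
    raise NotImplementedError

def generate_structured_output(user_input: str):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt_version", PROMPT_VERSION)
        raw = call_llm(user_input)
        try:
            parsed = json.loads(raw)
            ok = isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()
        except json.JSONDecodeError:
            parsed, ok = None, False
        # The SLO is defined over this attribute, e.g. "99% of llm.generate
        # spans have llm.output_valid = true over a rolling 30 days".
        span.set_attribute("llm.output_valid", ok)
        return parsed if ok else None
```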
At any rate, there are other use cases where you really do need evaluations. The more important correct output is, the more it's worth investing in evals. I would argue that if bad outputs have high consequences, then maybe LLMs aren't the right tech for the job in the first place, but that'll probably change in a few years. And hopefully building evaluations will get easier too.
It's true that getting something going end-to-end is more important than being perfectionist about individual steps -- that's a good practical perspective. We hope good evaluation won't always be such an enormous effort. Most of what we're trying to do at Explosion can be summarised as trying to make the right thing easy. For instance, our annotation tool Prodigy is designed to scale down to smaller use-cases ( https://prodigy.ai ). I admit it's still effort, though, and depending on the task, it may indeed still take expertise.
Yeah, there's a little bit of flex there for sure. An example that recently came up for me at work was being able to take request:response pairs from networking events and turn them into a distributed trace. You can absolutely get an LLM to do that, but it's very slow and can mess up sometimes. But you can also do this 100% programmatically! The LLM route feels a little easier at first but it's arguably a bad application of the tech to the problem. I tried it out just for fun, but it's not something I'd ever want to do for real.
(separately, synthesizing a trace from this kind of data is impossible to get 100% correct for other reasons, but hey, it's a fun thing to try)
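For a sense of what the programmatic route looks like, here's a rough sketch: group request/response events that share a correlation id, then nest spans by time containment. All the field names (`name`, `start`, `end`) are hypothetical, since real event schemas vary:

```python
# Turn request/response pairs into a span tree purely by timestamps:
# a span whose interval contains another span's interval is its parent.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start: float
    end: float
    children: list["Span"] = field(default_factory=list)

def build_trace(events: list[dict]):
    """events: request/response pairs that share a correlation id."""
    # Sort by start time; for ties, longer spans first so parents precede children.
    spans = sorted(
        (Span(e["name"], e["start"], e["end"]) for e in events),
        key=lambda s: (s.start, -s.end),
    )
    if not spans:
        return None
    root, *rest = spans
    stack = [root]
    for span in rest:
        # Pop until the top of the stack fully contains this span; that's the parent.
        while stack and not (stack[-1].start <= span.start and span.end <= stack[-1].end):
            stack.pop()
        (stack[-1] if stack else root).children.append(span)
        stack.append(span)
    return root
```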
The first one also compares GPT-4 to the researchers themselves. Smaller specialized models don't beat humans at these tasks. That's why Mechanical Turk is used here in the first place (it's certainly not cheaper), and why GPT beating them is worthy of a paper on its own.
Well it really depends on the task. If it can be done with a regex, use a regex. We can’t make categorical statements about LLMs being better. It depends.
You can also probably distill a large model into a smaller one while maintaining a lot of the performance. DistilBERT is almost as good as BERT at a fraction of the inference cost.
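The core of that distillation recipe is small: train the student to match the teacher's temperature-softened output distribution alongside the usual label loss. This is the generic Hinton-style sketch in PyTorch, not Hugging Face's actual DistilBERT training code (which adds further loss terms):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```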
GPT-3.5 and 4 also currently aren’t deterministic even with temperature zero, which is a nightmare for debugging.
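It's easy to check for yourself: run the same prompt twice at temperature zero and diff the results. A sketch assuming the pre-1.0 `openai` Python client (and an `OPENAI_API_KEY` in the environment); adjust for whatever client version you're on:

```python
import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

a = complete("Explain in two sentences why the sky is blue.")
b = complete("Explain in two sentences why the sky is blue.")
print("identical" if a == b else "outputs differ even at temperature 0")
```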
The gold standard they're comparing against was done by humans though. And a task-specific model trained on that data will be better at that task than GPT-4.
What's definitely true is that getting decent data often takes some care, especially in how you define the task. And Mechanical Turk is often especially tricky to use well.