So I think this is an excellent post. Indeed, LLM maximalism is pretty dumb. They're awesome at specific things and mediocre at others. In particular, I get most frustrated when I see people try to use them for tasks that need deterministic outputs, where the thing you need to create is already known statically. My hope is that it's just people being super excited by the tech.
I wanted to call this out, though, as it makes the case that to improve any component (and really make it production-worthy), you need an evaluation system:
> Intrinsic evaluation is like a unit test, while extrinsic evaluation is like an integration test. You do need both. It’s very common to start building an evaluation set, and find that your ideas about how you expect the component to behave are much vaguer than you realized. You need a clear specification of the component to improve it, and to improve the system as a whole. Otherwise, you’ll end up in a local maximum: changes to one component will seem to make sense in themselves, but you’ll see worse results overall, because the previous behavior was compensating for problems elsewhere. Systems like that are very difficult to improve.
I think this makes sense from the perspective of a team with deeper ML expertise.
What it doesn't mention is that this is an enormous effort, made even larger when you don't have existing ML expertise. I've been finding this one out the hard way.
I've found that if you have "hard criteria" to evaluate (e.g., getting the LLM to produce a given structure rather than open-ended output for a chat app), you can quantify improvements by using observability tools (SLOs!) and iterating in production. Ship changes daily, track versions of what you're doing, and keep on top of behavior over a period of time. It's arguably a lot less "clean", but it's way faster, and because it's working on real-world usage data, it's really effective. An ML engineer might call that some form of "online test", but I don't think the term really applies.
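To make that concrete, here's a minimal sketch of what I mean, assuming OpenTelemetry for instrumentation; `call_llm`, the attribute keys, and the required fields are all made up for illustration, not any real product's API:

```python
# Treat "did the LLM return the structure we asked for?" as a pass/fail signal,
# attach it to a span along with the prompt version, and build an SLO on that
# attribute. All names here are hypothetical.
import json
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

PROMPT_VERSION = "2023-09-14.3"  # bump whenever the prompt or model changes
REQUIRED_KEYS = {"title", "query", "columns"}

def call_llm(user_input: str) -> str:
    """Placeholder for whatever client you use to get a completion."""
    raise NotImplementedError

def generate_structured_output(user_input: str):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt_version", PROMPT_VERSION)
        raw = call_llm(user_input)
        try:
            parsed = json.loads(raw)
            ok = isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()
        except json.JSONDecodeError:
            parsed, ok = None, False
        # The SLO is defined over this attribute, e.g. "99% of llm.generate
        # spans have llm.output_valid = true over a rolling 30 days".
        span.set_attribute("llm.output_valid", ok)
        return parsed if ok else None
```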
At any rate, there are other use cases where you really do need evaluations. The more important correct output is, the more it's worth investing in evals. I would argue that if bad outputs have high consequences, then maybe LLMs aren't the right tech for the job in the first place, but that'll probably change in a few years. And hopefully building evaluations will get easier too.
It's true that getting something going end-to-end is more important than being perfectionist about individual steps -- that's a good practical perspective. We hope good evaluation won't always be such an enormous effort. Most of what we're trying to do at Explosion can be summarised as trying to make the right thing easy. For instance, our annotation tool Prodigy is designed to scale down to smaller use-cases ( https://prodigy.ai ). I admit it's still effort, though, and depending on the task, it may indeed still take expertise.
Yeah, there's a little bit of flex there for sure. An example that recently came up for me at work was being able to take request:response pairs from networking events and turn them into a distributed trace. You can absolutely get an LLM to do that, but it's very slow and can mess up sometimes. But you can also do this 100% programmatically! The LLM route feels a little easier at first but it's arguably a bad application of the tech to the problem. I tried it out just for fun, but it's not something I'd ever want to do for real.
(separately, synthesizing a trace from this kind of data is impossible to get 100% correct for other reasons, but hey, it's a fun thing to try)
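For a sense of what the programmatic route looks like, here's a rough sketch: group request/response events that share a correlation id, then nest spans by time containment. All the field names (`name`, `start`, `end`) are hypothetical, since real event schemas vary:

```python
# Turn request/response pairs into a span tree purely by timestamps:
# a span whose interval contains another span's interval is its parent.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start: float
    end: float
    children: list["Span"] = field(default_factory=list)

def build_trace(events: list[dict]):
    """events: request/response pairs that share a correlation id."""
    # Sort by start time; for ties, longer spans first so parents precede children.
    spans = sorted(
        (Span(e["name"], e["start"], e["end"]) for e in events),
        key=lambda s: (s.start, -s.end),
    )
    if not spans:
        return None
    root, *rest = spans
    stack = [root]
    for span in rest:
        # Pop until the top of the stack fully contains this span; that's the parent.
        while stack and not (stack[-1].start <= span.start and span.end <= stack[-1].end):
            stack.pop()
        (stack[-1] if stack else root).children.append(span)
        stack.append(span)
    return root
```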
The first one also compares GPT-4 to the researchers themselves. Smaller specialized models don't beat humans at these tasks. That's why Mechanical Turk is used here in the first place (it's certainly not cheaper), and why GPT beating them is worthy of a paper on its own.
Well it really depends on the task. If it can be done with a regex, use a regex. We can’t make categorical statements about LLMs being better. It depends.
You can also probably distill a large model into a smaller one while maintaining a lot of the performance. DistilBERT is almost as good as BERT at a fraction of the inference cost.
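The core of that distillation recipe is small: train the student to match the teacher's temperature-softened output distribution alongside the usual label loss. This is the generic Hinton-style sketch in PyTorch, not Hugging Face's actual DistilBERT training code (which adds further loss terms):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```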
GPT-3.5 and 4 also currently aren’t deterministic even with temperature zero, which is a nightmare for debugging.
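It's easy to check for yourself: run the same prompt twice at temperature zero and diff the results. A sketch assuming the pre-1.0 `openai` Python client (and an `OPENAI_API_KEY` in the environment); adjust for whatever client version you're on:

```python
import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

a = complete("Explain in two sentences why the sky is blue.")
b = complete("Explain in two sentences why the sky is blue.")
print("identical" if a == b else "outputs differ even at temperature 0")
```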
The gold standard they're comparing against was done by humans though. And a task-specific model trained on that data will be better at that task than GPT-4.
What's definitely true is that getting decent data often takes some care, especially in how you define the task. And Mechanical Turk is often especially tricky to use well.