
> tasks that need deterministic outputs and the thing you need to create is already known statically

Wow, interesting. Do you have any examples of this?

I've noticed that LLMs are fairly good at string-processing tasks that a really complex regex could also handle, so I can see the point in those cases.
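For instance (a made-up example), pulling timestamps out of log lines is the kind of task you could prompt an LLM for, but a short regex does it deterministically and for free:

    import re

    line = "job=backup status=ok finished_at=2023-11-05T14:32:07Z retries=0"
    # ISO-8601 timestamp, UTC ("Z") only, for this example
    match = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", line)
    print(match.group(0) if match else "no timestamp found")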




Yeah, there's a little bit of flex there for sure. An example that recently came up for me at work was taking request:response pairs from networking events and turning them into a distributed trace. You can absolutely get an LLM to do that, but it's very slow and sometimes messes up. You can also do it 100% programmatically. The LLM route feels a little easier at first, but it's arguably a bad application of the tech to the problem. I tried it out just for fun, but it's not something I'd ever want to do for real.

(separately, synthesizing a trace from this kind of data is impossible to get 100% correct for other reasons, but hey, it's a fun thing to try)
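For anyone curious what the programmatic route looks like, here's a rough sketch. The field names (request_id, parent_id, request_ts, response_ts) are made up for illustration, not what my real events look like: each request/response pair becomes a span, and children get hung off their parents to form the trace tree.

    from collections import defaultdict

    def build_trace(events):
        spans = {}
        children = defaultdict(list)
        for e in events:
            # one span per request/response pair
            spans[e["request_id"]] = {
                "span_id": e["request_id"],
                "parent_id": e.get("parent_id"),
                "start": e["request_ts"],
                "end": e["response_ts"],
                "children": [],
            }
            if e.get("parent_id") is not None:
                children[e["parent_id"]].append(e["request_id"])
        # attach children to their parents
        for parent_id, child_ids in children.items():
            if parent_id in spans:
                spans[parent_id]["children"] = [spans[c] for c in child_ids]
        # roots are spans whose parent we never saw
        return [s for s in spans.values() if s["parent_id"] not in spans]

    events = [
        {"request_id": "a", "parent_id": None, "request_ts": 0, "response_ts": 30},
        {"request_id": "b", "parent_id": "a", "request_ts": 5, "response_ts": 12},
    ]
    print(build_trace(events))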


Classification tasks come to mind


LLMs are better at that, though. Sure, you may not require them, but it certainly wouldn't be for lack of accuracy.

https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...

https://arxiv.org/abs/2303.15056


That compares ChatGPT to Mechanical Turk, not to a smaller, more specialized model. Mechanical Turk is just crowdsourcing.


The first one also compares GPT-4 to the researchers themselves. Smaller specialized models don't beat humans at these tasks. That's why Mechanical Turk is used here in the first place (it's certainly not cheaper), and why GPT beating them is worthy of a paper on its own.


Well, it really depends on the task. If it can be done with a regex, use a regex. We can't make categorical statements about LLMs being better.

You can also probably distill a large model into a smaller one while keeping most of the performance. DistilBERT is almost as good as BERT at a fraction of the inference cost.
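As a rough illustration of that point (assuming the Hugging Face transformers library and the stock SST-2 DistilBERT checkpoint, just as an example), the distilled model runs locally and cheaply:

    from transformers import pipeline

    # distilled model fine-tuned for sentiment; no API call, much cheaper to run
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(classifier(["great product, would buy again", "arrived broken"]))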

GPT-3.5 and GPT-4 also currently aren't deterministic even at temperature zero, which is a nightmare for debugging.
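You can see the non-determinism for yourself with something like this sketch (needs the openai package and an API key; the prompt is arbitrary):

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def ask(prompt):
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    a = ask("List three primes between 10 and 30.")
    b = ask("List three primes between 10 and 30.")
    print("identical" if a == b else "different outputs despite temperature=0")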


The gold standard they're comparing against was done by humans, though. And a task-specific model trained on that data will be better at that task than GPT-4.

What's definitely true is that getting decent data often takes some care, especially in how you define the task. And Mechanical Turk in particular is tricky to use well.
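A sketch of what I mean by a task-specific model, with placeholder texts and labels standing in for the gold data (any small classifier trained on the human labels would do; scikit-learn here just for brevity):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # tiny placeholder "gold" dataset; real labels come from careful annotation
    texts = ["refund my order now", "thanks, the issue is resolved",
             "I want my money back", "everything works, closing the ticket"]
    labels = ["refund_request", "resolved", "refund_request", "resolved"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["please give me a refund"]))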



