
> tasks that need deterministic outputs and the thing you need to create is already known statically

Wow, interesting. Do you have any examples of this?

I've noticed that LLMs are fairly good at string-processing tasks that a really complex regex could also handle, so I can see the point in those cases.
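For instance (a made-up example), pulling timestamps out of log lines is the kind of task you could prompt an LLM for, but a short regex does it deterministically and for free:

    import re

    line = "job=backup status=ok finished_at=2023-11-05T14:32:07Z retries=0"
    # ISO-8601 timestamp, UTC ("Z") only, for this example
    match = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", line)
    print(match.group(0) if match else "no timestamp found")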




Yeah, there's a little bit of flex there for sure. An example that recently came up for me at work was taking request:response pairs from networking events and turning them into a distributed trace. You can absolutely get an LLM to do that, but it's very slow and sometimes messes up. You can also do it 100% programmatically. The LLM route feels a little easier at first, but it's arguably a bad application of the tech to the problem. I tried it out just for fun, but it's not something I'd ever want to do for real.

(separately, synthesizing a trace from this kind of data is impossible to get 100% correct for other reasons, but hey, it's a fun thing to try)
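For anyone curious what the programmatic route looks like, here's a rough sketch. The field names (request_id, parent_id, request_ts, response_ts) are made up for illustration, not what my real events look like: each request/response pair becomes a span, and children get hung off their parents to form the trace tree.

    from collections import defaultdict

    def build_trace(events):
        spans = {}
        children = defaultdict(list)
        for e in events:
            # one span per request/response pair
            spans[e["request_id"]] = {
                "span_id": e["request_id"],
                "parent_id": e.get("parent_id"),
                "start": e["request_ts"],
                "end": e["response_ts"],
                "children": [],
            }
            if e.get("parent_id") is not None:
                children[e["parent_id"]].append(e["request_id"])
        # attach children to their parents
        for parent_id, child_ids in children.items():
            if parent_id in spans:
                spans[parent_id]["children"] = [spans[c] for c in child_ids]
        # roots are spans whose parent we never saw
        return [s for s in spans.values() if s["parent_id"] not in spans]

    events = [
        {"request_id": "a", "parent_id": None, "request_ts": 0, "response_ts": 30},
        {"request_id": "b", "parent_id": "a", "request_ts": 5, "response_ts": 12},
    ]
    print(build_trace(events))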


Classification tasks come to mind


LLMs are better at that, though. Sure, you may not require them, but it certainly wouldn't be for lack of accuracy.

https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...

https://arxiv.org/abs/2303.15056


That compares ChatGPT to Mechanical Turk, not to a smaller, more specialized model. Mechanical Turk is just crowdsourcing.


The first one also compares GPT-4 to the researchers themselves. Smaller specialized models don't beat humans at these tasks. That's why Mechanical Turk is used here in the first place (it's certainly not cheaper), and why GPT beating them is worthy of a paper on its own.


Well, it really depends on the task. If it can be done with a regex, use a regex. We can't make categorical statements about LLMs being better.

You can also probably distill a large model into a smaller one while keeping most of the performance. DistilBERT is almost as good as BERT at a fraction of the inference cost.
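As a rough illustration of that point (assuming the Hugging Face transformers library and the stock SST-2 DistilBERT checkpoint, just as an example), the distilled model runs locally and cheaply:

    from transformers import pipeline

    # distilled model fine-tuned for sentiment; no API call, much cheaper to run
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(classifier(["great product, would buy again", "arrived broken"]))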

GPT-3.5 and GPT-4 also currently aren't deterministic even at temperature zero, which is a nightmare for debugging.
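You can see the non-determinism for yourself with something like this sketch (needs the openai package and an API key; the prompt is arbitrary):

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def ask(prompt):
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    a = ask("List three primes between 10 and 30.")
    b = ask("List three primes between 10 and 30.")
    print("identical" if a == b else "different outputs despite temperature=0")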


The gold standard they're comparing against was done by humans, though. And a task-specific model trained on that data will be better at that task than GPT-4.

What's definitely true is that getting decent data often takes some care, especially in how you define the task. And Mechanical Turk in particular is tricky to use well.
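A sketch of what I mean by a task-specific model, with placeholder texts and labels standing in for the gold data (any small classifier trained on the human labels would do; scikit-learn here just for brevity):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # tiny placeholder "gold" dataset; real labels come from careful annotation
    texts = ["refund my order now", "thanks, the issue is resolved",
             "I want my money back", "everything works, closing the ticket"]
    labels = ["refund_request", "resolved", "refund_request", "resolved"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["please give me a refund"]))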



