It's been known for a while that most of these models frequently hallucinate research articles; perplexity.ai seems to do quite well in that regard. Not sure why that's your specific metric, though, when LLMs seem to be improving across a large class of other metrics.
I strongly agree. Not long ago I came across some evaluation data on Perplexity.ai, and unfortunately its multilingual performance isn't as good as you'd expect.
The data in the "AI Search Engine Multilingual Evaluation Report (v1.0) | Search.Glarity.ai" suggests that generative search engines still have a long road ahead, which I think matters a great deal.
This wasn't a "metric." It was a test to see whether or not this LLM might actually be useful to me. Just like every other LLM, the answer is a hard no: using this chatbot for real work is at best a huge waste of time, and at worst unconscionably reckless. For my specific question, I would have been much better off with a plain Google Scholar search.
If your everyday work consists of looking up academic citations then yeah, LLMs are not going to be useful for that - you'll get hallucinations every time. That's absolutely not a task they are useful for.
There are plenty of other tasks that they ARE useful for, but you have to actively seek those out.
(Hi Simon, I am laughing as I write - I just submitted an article from your blog minutes ago. Then stumbled into this submission, and just before writing this reply, I checked the profile of "simonw"... I did not know it was your username here.)
Well, assuming one normally queries it for information: if the server gives false information, you get failure and risk.
If one were instead in search of supplemental reasoning (e.g. a "briefing", not just big decision-making or assessing), the server should be certified as deterministically trustworthy in its reasoning.
It may not be really clear what those «plenty [of] other tasks that they ARE useful for» could be... Apart from, say, "Brian Eno's pack of cards with generic suggestions as a creativity aid". One possibility could be a calculator-to-human "natural language" interface... Which I am not sure is a frequent implementation.
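(To illustrate what I mean, a very rough Python sketch - the model would only translate the question into an arithmetic expression, and the arithmetic itself would be done deterministically; translate_to_expression here is a hypothetical model call, not any real API:)

    import ast
    import operator

    # Only plain arithmetic operators are allowed.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def safe_eval(node):
        # Evaluate only numbers and +, -, *, / nodes; reject anything else.
        if isinstance(node, ast.Expression):
            return safe_eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")

    def ask_calculator(question, translate_to_expression):
        # translate_to_expression is a hypothetical model call, e.g. returning "365 * 24"
        expr = translate_to_expression(question)
        return safe_eval(ast.parse(expr, mode="eval"))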
Many of the most interesting uses of LLMs occur when you move away from using them as a source of information lookup - by which I mean pulling directly from information encoded into their opaque model weights.
Anything where you feed information into the model as part of your prompt is much less likely to produce hallucinations and mistakes - that's why RAG question answering works pretty well; see also summarization, fact extraction, structured data conversion and many forms of tool usage.
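A rough Python sketch of what I mean by the RAG pattern - you paste the retrieved text into the prompt and tell the model to answer only from that text (call_llm here is a hypothetical stand-in for whatever chat-completion client you use):

    def answer_with_context(question, retrieved_passages, call_llm):
        # Ground the answer in the supplied passages rather than the model's weights.
        context = "\n\n".join(retrieved_passages)
        prompt = (
            "Answer the question using only the context below. "
            "If the context does not contain the answer, say so.\n\n"
            "Context:\n" + context + "\n\nQuestion: " + question
        )
        return call_llm(prompt)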
Uses that involve generating code are very effective too, because code has a form of fact-checking built in: if the model hallucinates an API detail that doesn't exist, you'll find out the moment you (or the model itself via tools like ChatGPT Code Interpreter) execute that code.
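You can even automate that check: run whatever the model produced and see if it blows up. A rough sketch, not tied to any particular API - the generated snippet just gets executed in a subprocess:

    import subprocess
    import sys
    import tempfile

    def run_generated_code(code_text):
        # Write the model's snippet to a temp file and execute it in a subprocess.
        # A hallucinated import or method surfaces as a non-zero exit plus a traceback.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code_text)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=30)
        return result.returncode == 0, result.stderr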