Chatbot Arena Leaderboard (lmsys.org)
118 points by tosh on May 25, 2023 | 39 comments


I've been working extensively with LLMs on a generative storytelling side project (named www.generativestorytelling.ai because I am terrible at naming things), and once prompts start getting complex, ChatGPT wins by a landslide. I can throw all sorts of complicated prompts at ChatGPT[0] and it will, by and large, come up with great output.

Meanwhile, Bard gets confused by basic things such as "after this message I will send another one, do not reply until the second message is sent" and instead tries to immediately reply.

IMHO not very many people doing reviews of chatbots are really pushing the bots to their limits, and those who are pushing the bots really hard are often too busy to take the time to make their work public (which is the reason I am developing in the open!)

[0] https://github.com/devlinb/arcadia/blob/main/backend/src/rou...


> (named www.generativestorytelling.ai because I am terrible at naming things)

A bit beside the point, but it struck me as odd given the context; ChatGPT is pretty good at naming things:

Model: GPT-4

User

Suggest a better name for: "working extensively with LLMs on a generative storytelling side project (named www.generativestorytelling.ai"

ChatGPT

Certainly, here are a few suggestions for a name for your project:

    "NarrativeAI: Automated Generative Storytelling"
    "TellTalesAI: Collaborative Story Crafting"
    "StoryForge.ai: LLM-driven Story Creation"
    "ScriptSmith.ai: Mastering Narrative Generation"
    "StoryGenix.ai: Redefining Generative Storytelling"
    "PlotBender.ai: The Future of Digital Storytelling"
    "NarrativeNexus.ai: Blending LLMs and Storytelling"
    "TaleSpin.ai: Spinning Stories with AI"
    "ProsePrism.ai: Reflecting Imagination"
    "EpicInk.ai: Digitalizing Generative Storytelling"
Please note that it's also crucial to check for the availability of these domains before you decide on one.

User

Be more creative

ChatGPT

Absolutely! Here are some more creative and less conventional suggestions:

    "AInkWoven.com: Stories Threaded in Code"
    "FableFabric.ai: Weaving Narratives with AI"
    "MythMechanics.ai: Engineering Epic Tales"
    "FantasiaForge.ai: Crafting Dreams with LLMs"
    "EpicEther.ai: Conjuring Stories from the Cloud"
    "NeuralNovelettes.ai: Intelligent Design, Intelligent Stories"
    "ChronicleCodex.ai: Archiving AI's Imagination"
    "QuantumQuill.ai: Penning Possibilities with LLMs"
    "SagaSilicon.ai: Where Stories Meet Tech"
    "PapyrusPixel.ai: Marrying Old Traditions with New Possibilities"
Remember, the availability of these domain names is also an important factor to consider.


Hah! A brilliant use of ChatGPT, thank you for the idea!

Maybe I'll go through and have GPT start naming my variables as well. :-D


Have you tried Claude on stories? My goodness, it seemed out-of-this-world amazing a couple of months back.


ChatGPT 3.5 has been improving in its creative output pretty much every other week since I started looking at it. The improvements don't get announced; things just sort of change. It breaks prompts all the time though! For my usage, sometime around 3 weeks ago was probably the peak; a few days ago it started getting really wordy, so much so that it now runs into the 4k token limit before it can complete stories!


Obligatory follow-up question: I'm guessing you have a reason for only mentioning and using GPT-3.5 in this project, but in terms of more general evaluation of complex prompt handling, did you have a chance to test GPT-4, and if so, how does it fare in comparison?

I predict that, if you tested it, the answer will be "GPT-4 is much better than GPT-3.5 at this" - but the part I'm interested in is hearing your opinion on how much better GPT-4 is, and in exactly what ways, at handling these kinds of prompts/interactions.

(If my prediction is wrong and GPT-4 isn't much better than GPT-3.5 here, that would be even more interesting to hear about.)


I'm using GPT-3.5 because it is a public-facing site running off of my personal credit card. :)

Back when I first started out, GPT-4 was generating more creative stories for sure, and it obeyed instructions better. I have had to do a lot of prompt engineering to make sure GPT-3.5 keeps to the format I have specified. If you take a look at the source I linked to, I basically taught GPT-3.5 a syntax that uses emojis, which is super cool because, in addition to the emojis I gave examples of, GPT-3.5 has come up with its own use cases. For example, when a princess got exiled, it used a door emoji! Sadly I have had to turn the temperature down to keep things on track; at higher temperatures GPT-3.5 tends to forget the syntax.

GPT-4 is easier to direct, it follows instructions much better, and the stories are possibly better, but I haven't used it nearly as much for this particular project.
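
For the curious, the emoji syntax boils down to a system prompt along these lines. This is a simplified sketch in TypeScript, not the actual prompt from the arcadia repo; the specific emoji-to-meaning assignments here are illustrative:

    // Sketch of an emoji-tagged output format: the model is told to
    // prefix every line with a tag, which keeps replies parseable
    // without spending precious tokens on JSON punctuation.
    const systemPrompt = [
      'You are a storyteller. Prefix every line of your reply:',
      '📖 for narration',
      "💬 for dialogue, followed by the speaker's name and a colon",
      '🎭 for a character entering or leaving the scene',
      'One tagged item per line. Do not use any other format.',
    ].join('\n');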


Significantly better. For a more comprehensive answer than I could ever provide, see the Sparks of AGI paper.



I completely agree. For simple or short conversations, the models are similar enough, even the low-parameter ones. But as the conversation gets more complicated, the difference between the models becomes more obvious. The benchmarks used today only give a basic understanding of a language model's abilities; you have to use it yourself to see how capable it really is.


I wonder how to measure this?


Build something. Play with it. Use it every day and watch the models evolve.

Not everything can be (easily) measured, especially when it comes to creative output. Heck, even something like "which UI API is better to use" is incredibly subjective: people working at a high level building CRUD apps will look at VB6 and think it is amazing, while others want low-level graphics primitives.

Programmatically using an AI is almost like having a partner you have to negotiate with. Miscommunications are common, and the AI occasionally decides to change its answer format or just improvise brand new things. For the https://www.generativestorytelling.ai project I had to come up with a syntax and parser that is adaptable to changes while staying under a token limit. The token limit is one reason I didn't use JSON, but the other reason is that I've had friends tell me that GPT will sometimes throw brand new syntax in with its answers, so I wanted a response format loose enough to let the AI be creative without breaking my parser!
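
A minimal sketch of that kind of forgiving parser (TypeScript; the tags and type names are hypothetical, matching the sketch above rather than the actual arcadia code):

    // Lenient line-based parser: known tags are dispatched, and anything
    // unrecognized falls through as narration instead of throwing.
    type StoryEvent =
      | { kind: 'narration'; text: string }
      | { kind: 'dialogue'; speaker: string; text: string };

    function parseStory(raw: string): StoryEvent[] {
      const events: StoryEvent[] = [];
      for (const line of raw.split('\n')) {
        const trimmed = line.trim();
        if (trimmed === '') continue;
        if (trimmed.startsWith('💬')) {
          // "💬 Alice: Hello there" -> speaker "Alice", text "Hello there"
          const rest = trimmed.slice('💬'.length).trim();
          const colon = rest.indexOf(':');
          if (colon > 0) {
            events.push({
              kind: 'dialogue',
              speaker: rest.slice(0, colon).trim(),
              text: rest.slice(colon + 1).trim(),
            });
            continue;
          }
        }
        // 📖 narration, or improvised tags like the door emoji, are kept
        // as plain narration so a creative model never breaks the parser.
        events.push({ kind: 'narration', text: trimmed.replace(/^📖\s*/u, '') });
      }
      return events;
    }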

It is an incredibly different experience than any other system I've ever worked with.


Unfortunately the Arena is missing some of the strongest "open" models, such as WizardLM Uncensored 30B. In fact they don't have any LLaMA 30B/65B-based models, just 13B models.


Are there metrics for WizardLM Uncensored? I can’t find anything showing it to be a strong contender, and it seems like the primary focus of its training was to remove censoring, which seems orthogonal to high performance.


Maybe there are some small tests (GPT-4 comparing it to other contenders), but it's mostly based on other people's experience trying a lot of these models.

Also, being 30B already makes it considerably stronger than the models on the leaderboard.

And by the way, the censoring does seem to hurt the performance a bit.

Edit: Here are some scores https://www.reddit.com/r/LocalLLaMA/comments/13shhgf/gotzman...


Yes, considering GPT models are huge, they should probably compare them against the huge local models (as well as the 13B ones, which are shockingly close).


Only the bottom 2 out of the top 10 are open source and available for commercial use. So if you want to use an open-source LLM for your commercial product, be aware that your competitors who use proprietary LLMs through APIs will outperform you dramatically. Or am I missing something?


Note that the FLAN-T5 variant in 10th place is 3B parameters; there is another FLAN-T5 variant with 11B parameters that should perform better (and the fact that a 3B model is able to compete with Alpaca-13B is impressive by itself).


What you're missing is not reflected on the leaderboard right now: Guanaco 65B.


Guanaco is a LLaMA tune and thus irrelevant for commercial use, isn't it?


Ah true! It isn't for commercial use.


I'd say you're missing the importance of not being bound to a proprietary model, and of not having to explain to your customers why you send their data to a third party. It's still early days - admittedly, if you need SOTA performance this second, you don't have any options. But in the fairly near term, I see no evidence that the proprietary generic models will keep their leads in a way that's meaningful for commercial products. Do you?


Very interesting! Quite surprised to see PaLM 2 ranked even lower than the open-source Vicuna.


The "win matrix" (dissimilarity matrix) seems very interesting; it looks, e.g., like Vicuna-13B paired against GPT-4 wins 20% of the time. A larger difference than I'd have guessed based on the scores.


Yeah, the win matrix is what you want to look at if you haven't internalized or memorized what various Elo differences mean.
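
For reference, the standard Elo expected-score formula translates rating gaps into win probabilities (assuming the Arena uses the usual logistic Elo with a 400-point scale factor):

    // Expected win probability for A, given both players' Elo ratings.
    function expectedWinRate(ratingA: number, ratingB: number): number {
      return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
    }

    expectedWinRate(1200, 1100); // 100-point gap -> ~0.64
    expectedWinRate(1300, 1100); // 200-point gap -> ~0.76

By that formula, a 20% win rate for the weaker model corresponds to a gap of roughly 240 points.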


Reading the evaluation of PaLM 2 there, Google's really shooting themselves in the foot. "Why is the sky blue?" "I'm sorry, as a language model..."


All this is pure speculation:

I have absolutely no inside info or actual knowledge to back up this hunch, but part of me thinks Google released PaLM 2 just to get something out the door that was somewhat comparable to OpenAI's models.

That way they can say they put something out and that they're "competing", but internally, maybe they're really looking forward to (and working on) their next-gen AI models. Something to compete with a hypothetical future GPT-5, but maybe they're looking to get their version out faster than OpenAI can. I feel like Google is really going to give it their all on the next-gen model they release (not PaLM 2, but whatever they're working on after it).

Curious to hear other people's thoughts on this.


Honestly, it confuses me. I’m sure PaLM 2 does well on certain metrics, but the experience of actually using it can be bafflingly bad. (My favorite was someone asking for a table comparing agricultural businesses and being given instructions on how to build a physical table.)

I don’t know if this was an oversight or them just throwing up their hands knowing they needed to release something. Rumors I’ve seen claim that Gemini will be far more powerful, so I guess they’re just biding their time.

Idly, I do sometimes think back to Blake Lemoine. Is this seriously what he thought was a conscious AGI? Or have they heavily gimped it in the process of aligning it?


I love this push to provide Arenas!

I am also a big fan of the human-rating style of evaluation [0].

I think they are complementary approaches.

[0] https://colab.research.google.com/drive/1kK6xasHiav9nhiRUJjP...


Has anyone done a price comparison on running all of these models for common use cases? For example, to make a chatbot that searches a vector DB to answer questions.
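
For a rough sense of scale, here's the back-of-envelope I've been using (TypeScript; the per-1K-token prices below are the published API rates as of May 2023 and will change, so treat them as placeholders):

    // Rough monthly API cost for a retrieval chatbot: each answer stuffs
    // the question plus retrieved chunks into the prompt, then completes.
    const PRICE_PER_1K = { prompt: 0.002, completion: 0.002 }; // gpt-3.5-turbo
    // const PRICE_PER_1K = { prompt: 0.03, completion: 0.06 }; // gpt-4 8k

    function monthlyCost(
      queriesPerDay: number,
      promptTokens: number,
      completionTokens: number,
    ): number {
      const perQuery =
        (promptTokens / 1000) * PRICE_PER_1K.prompt +
        (completionTokens / 1000) * PRICE_PER_1K.completion;
      return perQuery * queriesPerDay * 30;
    }

    // 1,000 queries/day, ~2k prompt tokens (question + retrieved chunks),
    // ~500 completion tokens: about $150/mo on gpt-3.5-turbo, ~$2,700/mo
    // on gpt-4, before embedding and vector DB costs.
    monthlyCost(1000, 2000, 500);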


When do you (HN readers) think that we'll have an open source model that scores 1150 or higher, and where do you think it'll come from?


Compute is the major bottleneck.

Preface: I do not consider LLaMA or any of its fine-tuned derivatives to be truly open-source, since they can't be used for commercial purposes and have highly restrictive licenses. If it weren't for the leaked weights, models like Vicuna wouldn't exist.

I think it's somewhat unlikely that a purely open-source model can catch up in the near term without one or a combination of the following happening:

    a) significant funding for the compute resources required, potentially through massive donations by one or more wealthy open-source advocates, with the expectation of nothing in return since it wouldn't be proprietarily valuable
    b) breakthroughs in design or architecture that significantly reduce the compute required for initial training and/or fine-tuning
    c) experts in cutting-edge AI research (the best of the best) being willing and legally allowed to contribute their unique knowledge to open-source projects, without restriction
    d) another company or well-funded organization intentionally and transparently releasing an in-house-trained foundational model similar to LLaMA or GPT-4 to the public, along with weights, full source code, plus permissive licensing terms that allow commercial use and further modification

I'd say the odds are slim in the near term, but honestly it's anyone's guess.


There are already efforts to recreate the LLaMA weights under open-source licenses (ETA: days/nowish).

https://github.com/openlm-research/open_llama


A Manticore 30B based on RedPajama is probably only a few weeks away. So we might be pretty close.


We need this for code-understanding models like StarCoder.


Why is there no Claude+? It seems like their competitor to GPT-4.


It’s there; the verbiage used is "Claude" (vs. "Claude Instant"). I assume it’s an artifact of their names in the API, but Anthropic hasn’t given me access yet :(


Claude+ is only called that by Poe -- the models are actually Claude v1.3 and Claude Instant v1 (as reflected on the site).


If you want to compare chatbots in real life, try https://ChatHub.gg, an all-in-one chatbot client I built.



