I've been working extensively with LLMs on a generative storytelling side project (named www.generativestorytelling.ai because I am terrible at naming things) and once prompts start getting complex, ChatGPT wins by a landslide. I can do all sorts of complicated prompts to ChatGPT[0] and it will, by and large, come up with great output.
Meanwhile, Bard gets confused by basic things such as "after this message I will send another one, do not reply until the second message is sent" and instead tries to immediately reply.
IMHO not very many people doing reviews of chatbots are really pushing the bots to their limits, and those who are pushing the bots really hard are often too busy to take the time to make their work public (which is the reason I am developing in the open!)
> (named www.generativestorytelling.ai because I am terrible at naming things)
A bit beside the point - but it struck me as odd given the context; ChatGPT is pretty good at naming things:
Model: GPT-4
User
Suggest a better name for: "working extensively with LLMs on a generative storytelling side project (named www.generativestorytelling.ai"
ChatGPT
Certainly, here are a few suggestions for a name for your project:
"NarrativeAI: Automated Generative Storytelling"
"TellTalesAI: Collaborative Story Crafting"
"StoryForge.ai: LLM-driven Story Creation"
"ScriptSmith.ai: Mastering Narrative Generation"
"StoryGenix.ai: Redefining Generative Storytelling"
"PlotBender.ai: The Future of Digital Storytelling"
"NarrativeNexus.ai: Blending LLMs and Storytelling"
"TaleSpin.ai: Spinning Stories with AI"
"ProsePrism.ai: Reflecting Imagination"
"EpicInk.ai: Digitalizing Generative Storytelling"
Please note that it's also crucial to check for the availability of these domains before you decide on one.
User
Be more creative
ChatGPT
Absolutely! Here are some more creative and less conventional suggestions:
"AInkWoven.com: Stories Threaded in Code"
"FableFabric.ai: Weaving Narratives with AI"
"MythMechanics.ai: Engineering Epic Tales"
"FantasiaForge.ai: Crafting Dreams with LLMs"
"EpicEther.ai: Conjuring Stories from the Cloud"
"NeuralNovelettes.ai: Intelligent Design, Intelligent Stories"
"ChronicleCodex.ai: Archiving AI's Imagination"
"QuantumQuill.ai: Penning Possibilities with LLMs"
"SagaSilicon.ai: Where Stories Meet Tech"
"PapyrusPixel.ai: Marrying Old Traditions with New Possibilities"
Remember, the availability of these domain names is also an important factor to consider.
ChatGPT 3.5 has been improving in its creative output pretty much every other week since I started looking at it. The improvements don't get announced; things just sort of change. It breaks prompts all the time though! For my usage, sometime around 3 weeks ago was probably the peak; a few days ago it started getting really wordy, so much so that it runs into the 4k token limit before it can complete stories now!
Obligatory follow-up question: I'm guessing you have a reason for only mentioning and using GPT-3.5 in this project, but in terms of more general evaluation of complex prompt handling, did you have a chance to test GPT-4, and if so, how does it fare in comparison?
I predict that, if you tested it, the answer will be "GPT-4 is much better than GPT-3.5 at this" - but the part I'm interested in is hearing your opinion on how much and how exactly GPT-4 is better at handling these kinds of prompts/interactions.
(If my prediction is wrong and GPT-4 isn't much better than GPT-3.5 here, that would be even more interesting to hear about.)
I'm using GPT-3.5 because it is a public-facing site running off of my personal credit card. :)
Back when I first started out, GPT-4 was generating more creative stories for sure, and it obeyed instructions better. I have had to do a lot of prompt engineering to make sure GPT-3.5 keeps to the format I have specified. If you take a look at the source I linked to, I basically taught GPT-3.5 a syntax that uses emojis, which is super cool because in addition to the emojis I gave examples of, GPT-3.5 has come up with its own use cases. For example, when a princess got exiled, it used a door emoji! Sadly I have had to turn the temperature down to keep things on track; at higher temperatures GPT-3.5 tends to forget the syntax.
GPT4 is easier to direct, it follows instructions much better, and the stories are possibly better, but I haven't used it nearly as much for this particular project.
I completely agree. For simple or short conversations, the models are similar enough, even for the low parameter models. But as the conversation gets more complicated, the difference between the models becomes more obvious. The benchmarks used today only give a basic understanding of the language model's abilities and you have to use it yourself to see how capable it really is.
Build something. Play with it. Use it everyday and watch the models evolve.
Not everything can be (easily) measured, especially when it comes to creative output. Heck even things like "which UI API is better to use" is incredibly subjective, people working at a high level building CRUD apps will look at VB6 and think it is amazing, while others want low level graphics primitives.
Programmatically using an AI is almost like having a partner you have to negotiate with. Miscommunications are common; the AI occasionally decides to change its answer format, or just improvises brand new things. For the https://www.generativestorytelling.ai project I had to come up with a syntax and parser that is adaptable to changes, while staying under a token limit. The token limit is one reason I didn't use JSON, but the other reason is that I've had friends tell me that GPT will sometimes throw brand new syntax in with its answers, so I wanted a response format that was loose enough to allow the AI to be creative without breaking my parser!
It is an incredibly different experience than any other system I've ever worked with.
Unfortunately the Arena is missing some of the strongest "open" models, such as WizardLM Uncensored 30B. In fact they don't have any Llama 30B/65B based models, just 13B models.
Are there metrics for WizardLM Uncensored? I can’t find anything showing it to be a strong contender, and it seems like the primary focus of its training was to remove censoring, which seems orthogonal to high performance.
Yes, considering GPT models are huge, they should probably compare them against the huge local models (as well as the 13B ones which are shockingly close).
Only the bottom 2 out of top 10 are open-source and available for commercial use. So if you want to use an open-source LLM for your commercial product, be aware that your competitors who use proprietary LLMs through APIs will outperform you dramatically. Or am I missing something?
Note that the FLAN-T5 variant in 10th place has 3B parameters; there is another FLAN-T5 variant with 11B parameters that should perform better (and the fact that a 3B model is able to compete with Alpaca-13B is impressive in itself).
I'd say you're missing the importance of not being bound to a proprietary model, and of not having to explain to your customers why you send their data to a third party. It's still early days - definitely, if you need SOTA performance this second, you don't have any other options. But in the fairly near term, I see no evidence that the proprietary generic models will keep their leads in a way that's meaningful for commercial products. Do you?
The "win matrix" (dissimilarity matrix) seems very interesting; it looks like, e.g., Vicuna-13B paired against GPT-4 wins 20% of the time. A larger share than I'd have guessed based on the scores.
I have absolutely no inside info or actual knowledge to back up this hunch, but part of me thinks Google released PaLM2 just to get something out the door that was somewhat comparable to OpenAI's models.
That way they can say they put something out and that they're "competing", but internally, maybe they're really looking forward to (and working on) their next-gen AI models. Something to compete with a hypothetical future GPT-5, but maybe they're looking to get their version out faster than OpenAI can. I feel like Google is really going to give it their all on the next-gen model they release (not PaLM 2, but whatever they're working on after it).
Honestly, it confuses me. I’m sure PaLM2 does well on certain metrics but the experience of actually using it can be bafflingly bad. (My favorite was someone asking for a table comparing agricultural businesses and being given instructions on how to build a physical table.)
I don’t know if this was an oversight or them just throwing up their hands knowing they needed to release something. Rumors I’ve seen claim that Gemini will be far more powerful, so I guess they’re just biding their time.
Idly, I do sometimes think back to Blake Lemoine. Is this seriously what he thought was a conscious AGI? Or have they heavily gimped it in the process of aligning it?
Has anyone done a price comparison on running all of these models for common use cases? For example, to make a chatbot that searches a vector DB to answer questions.
Preface: I do not consider LLaMA or any of its fine-tuned derivatives to be truly open-source, since they can't be used for commercial purposes and have highly restrictive licenses. If it weren't for the leaked weights, models like Vicuna wouldn't exist.
I think it's somewhat unlikely that a purely open-source model can catch up in the near-term without one or a combination of the following happening:
a) significant funding for the compute resources required, potentially through massive donations by one or more wealthy open-source advocates, with the expectation of nothing in return since it wouldn't be proprietarily valuable
b) breakthroughs in design or architecture that significantly reduce necessary compute resources for initial training and/or fine-tuning
c) experts in cutting-edge AI research (the best of the best) being willing and legally allowed to contribute their unique knowledge to open-source projects, without restriction
d) another company or well-funded organization intentionally and transparently releasing an in-house trained foundational model similar to LLaMA or GPT-4 to the public, along with weights, full source code, plus permissible licensing terms that allow for commercial use and further modification
I'd say the odds are slim in the near-term future, but honestly it's anyone's guess.
It’s there; the verbiage used is "Claude" (vs. "Claude Instant" - I assume it’s an artifact of their names in the API, but Anthropic hasn’t given me access yet :( )
[0] https://github.com/devlinb/arcadia/blob/main/backend/src/rou...