Mistral 7B Fine-Tune Optimized (openpipe.ai)
234 points by tosh 10 months ago | 103 comments



>Model merging is, to me, one of the most counterintuitive empirical results in modern deep learning. It turns out that you can actually more-or-less naively merge the weights of two different models and produce a new one that captures some or all of the abilities of its parents!

I would hope the article gives some more details on model merging. Is it merging two different fine-tuned models, one fine-tuned on dogs, another fine-tuned on cats, and the merge of the two is good on cats and dogs as if by magic?

Like fine-tune one model just on Python and test it thoroughly, fine-tune one on Java and test it thoroughly, and then if the need arises for a project that uses both Java and Python, merge the two together and use that. If there is no need for Java, use the one fine-tuned just on Python.

Pretty magical indeed! Let alone the fact that a separate smaller model of half a billion parameters could figure out how to merge the two together. If the cost of LMs could be reduced by a factor of 100, why not reduce it by a factor of 1000?


This is not so surprising if you consider the fact that finetuning is extremely sparse and barely imparts any new knowledge to the model. The paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch"[1] made this clear:

> We initially demonstrate that SFT LM (either encoder- or decoder-based) always tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE, when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank structures akin to LoRA [25]

Insofar as those adaptations are mostly distinct, you can just preserve both sets and that's what explains successes of merging, I guess.

1. https://arxiv.org/abs/2311.03099
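For concreteness, a minimal sketch of the DARE procedure quoted above, assuming PyTorch state dicts (illustrative only, not the paper's reference implementation):

    import torch

    def dare(base_sd, finetuned_sd, p=0.9):
        # Drop And REscale: zero out a random fraction p of each delta parameter,
        # then rescale the survivors by 1/(1 - p) so the expected update is unchanged.
        merged = {}
        for name, base_w in base_sd.items():
            delta = finetuned_sd[name] - base_w
            keep = (torch.rand_like(delta) > p).to(delta.dtype)
            merged[name] = base_w + keep * delta / (1.0 - p)
        return merged

To merge several fine-tunes, you sparsify each one's deltas this way and add them all back onto the same base.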


Funnily enough, and not so coincidentally, this has been well known in practice by...drumroll please...America's greatest innovators, the Adult Entertainment Hobbyists.

It doesn't yield order-of-magnitude (or, I'd wager, even 50%) benefits in enabling smaller models. But you nailed it exactly: fine-tune on dogs, fine-tune on cats, then...just...average the weights. And you have something better than the original, with minimal loss from fine-tuning.
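For anyone wondering what "average the weights" means in practice, it's literally this (a sketch, assuming two checkpoints of the same architecture loaded as state dicts):

    def average_merge(sd_cats, sd_dogs, alpha=0.5):
        # Element-wise linear interpolation of the two fine-tunes' weights.
        return {k: alpha * sd_cats[k] + (1 - alpha) * sd_dogs[k] for k in sd_cats}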

LoRAs end up being more popular for that use case because they're easier to combine, mix and match, and scale. Model merging is still a key technique for a successful base model.


Not a bad model. It becomes incoherent above 8k tokens, and it's not helped by the fact that it's very verbose, but it seems very coherent and stays closely on topic until then: https://chat.openai.com/share/089d1b8c-3467-4c01-af9f-6568c0...

Fails at math of course, even if the problem is very easy, like all Mistrals. Good for generation, probably not the best for RAG; there are Mistral tunes that stay coherent to 16k tokens, and that cuts down chunking significantly.


~Is the 'k' in your token sizes a typo?~

Edit: mistook tokens for parameters for a moment there. Keeping up with AI jargon is exhausting for an idiot like me.


No, it's the sequence length, i.e. roughly how long the string in the prompt is. At 8192 tokens it starts losing coherence, and by 10,000 tokens it was emitting gibberish, like empty lines and half-words. I didn't put the worst part into the link. What do you mean by ELII?


Explain like I'm an idiot :D


Ah, hope my answer was clear enough :D

What you see in the link is a copy-paste of a discussion between me and the model in question, which I pasted into GPT-4 with instructions to evaluate it. The answer with the scores out of 10 is GPT evaluating the chat between me and the smaller model. The smaller model produces the text after ASSISTANT; the questions I ask as USER are part of a fixed script that I run with every new model, so that I have a sort of validation set before doing more rigorous testing.


Yes, indeed. Thank you for the additional context!


> fails at math of course

What did OpenAI do for the LLM to know "if given a math question, write Python for it, and run the code to get the result" instead of trying to do the math itself?


They trained the model with a lot of data to write code instead (probably sandwiched between some special tokens like [run-python]). The LLM runner then takes the code, runs it in a sandbox, and feeds the output back into the prompt so GPT can continue inferencing. TL;DR: they trained the model to write code for math problems instead of trying to solve them itself.
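Nobody outside OpenAI knows the exact mechanics, but conceptually the runner loop is something like this sketch (the [run-python] tags, model.generate() and sandbox.execute() are hypothetical placeholders):

    import re

    def answer_with_tools(model, prompt, sandbox):
        transcript = prompt
        while True:
            completion = model.generate(transcript)    # hypothetical LLM call
            transcript += completion
            match = re.search(r"\[run-python\](.*?)\[/run-python\]", completion, re.DOTALL)
            if not match:
                return transcript                      # no tool call: the model answered directly
            result = sandbox.execute(match.group(1))   # run the emitted code in isolation
            transcript += "\n[output]" + str(result) + "[/output]\n"  # feed the result back in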


It also has some training on problem decomposition. Many smaller models fail before writing the code; they fail when parsing the question.

You can ask them to serialize a problem in Prolog, and see exactly where their understanding breaks - this is OpenHermes 2.5: https://pastebin.com/raw/kr62Hybq


If anyone wants to finetune their own Mistral 7b model 2.2x faster and use 62% less memory - give our open source package Unsloth a try! https://github.com/unslothai/unsloth thanks! :)


We've tried to sell variants of the open source models to our existing enterprise customers.

I think the adage about "a solution needs to be 10x other solutions to make someone switch" applies here.

Saying something performs slightly better than the industry standard offerings (OpenAI) means that OpenAI is going to laugh all the way to the bank. Everyone will just use their APIs over anything else.

I'm excited about the LLM space and I can barely keep up with the model names, much less all the techniques for fine tuning. A customer is going to have an even worse time.

No one will ever get fired for buying OpenAI (now that IBM is dead, and probably sad that Watson never made a dent).

I do use Mistral for all my personal projects but I'm not sure that is going to have the same effect on the industry as open source software did in the past.


> I think the adage about "a solution needs to be 10x other solutions to make someone switch" applies here.

It's already superior to OpenAI because it doesn't require an API. You can run the model on your own hardware, in your own datacenter, and your data is guaranteed to remain confidential. Creating a one-off fine-tune is a different story than permanently joining your company at the hip to OpenAI.

I know in our bubble, in the era of Cloud, it's easy to send confidential company data to some random API on the Internet and not worry about it, but that's absolutely not the case for anyone in Healthcare, Government, or even normal companies that are security conscious. For them, OpenAI was never a valid consideration in the first place.


> It's already superior to OpenAI because it doesn't require an API.

But the quality is not superior to OpenAI's. I run Mistral 7B in LM Studio, and I can't get far before it starts giving me wrong answers.

ChatGPT-4 on the other hand is correct most of the time (and knows to trigger Python code evaluation or RAG to answer questions). This makes it useful.


what is the most prominent use case for private LLMs, doctor notes?


Proprietary and sensitive information. Personally, I use a self-hosted LLM because I don't trust how my conversations with hosted generative AI services will be used.


This. I also use open source self hosted LLMs for exactly this reason.

Sure, I use OpenAI APIs for certain heavy lifting tasks that don't involve sensitive information, but for anything sensitive it's self hosted LLMs all the way.


Great answers above, but long term: Personal assistants. I truly think that’s a privacy line people won’t cross, even after seeing Alexa and Google Maps enter into our lives; I think people would rather have nothing than a robot that knows every detail of their health, schedule, feelings, plans, etc. in some vaguely defined server somewhere.


Don’t Google already have that information from your searches, emails, calendar, etc? Obviously you have to trust they don’t misuse it, but it’s basically the same thing as some personal assistant having it to me.


Yeah, but I think this is less of a technical line than an emotional one.

For example: I wanted my personal assistant to track hygiene, which is a natural use case. But then you arrive at the natural conclusion that either a) the user needs to enter the data themselves (“I brushed my teeth and washed my face and took X medications at Y time”), or b) you need some sort of sensor in the bathroom, ranging from mics or radio sensors up to a tasteful camera. And a million subtle versions of (b) is where I see people going “no, that’s weird, it’s too much info all together”


It's not about Google anymore; that ship has sailed for most people by now. It's about giving all this data to yet another company. Also, it's not the same data at all.

Some data might never travel across a Google account, but might very well go through ChatGPT.

If you're processing another person's personal data, then you don't really have a choice in the matter: either gain permission from them to transfer their data to a third party, or self-host the model.



Definitely healthcare, or for certain industries (HFT/Finance/...) where for various reasons _everything_ must be run on prem.


As long as you meet your regulatory requirements, that's not actually true.


You could use it to query against any kind of B2B customer information and provide insight, citations and context without any of the data leaving your private server.

When building something similar powered by OpenAI, anonymizing the data and then de-anonymizing the answers before showing them to the customer was a real pain in the ass.

Also in my example, I'm sure using a string like "Pineapple Cave Inc." instead of the real business name hurt the AI's ability to contextualize the information and data, right?


Anything related to the business side of medium and large enterprises, and government.


Personalized metaspaces, game worlds, content without paying a rent-seeking copyright holder.

Education and research without gatekeepers in academia and industry complaining about their book sales or prestige titles being obsoleted.

A whole lot of use cases that break us out of having to kowtow to experts who were merely born before us trying to monopolize exploration of science and technology.

To that end I'm working on a GPU-accelerated client backed by local AI, with NeRFs and Gaussian splatting built in.

The upside to being an EE with MSc in math; most of my money comes from engineering real things. I don’t have skin in the cloud CRUD app/API game and don’t see a reason to spend money propping up middle men who, given my skills and abilities, don’t add value

Programmers can go explore syntax art in their parent’s basement again. Tired of 1970s semantics and everyone with a DSL thinking that’s the best thing to happen to computing as a field of inquiry ever.

Like all industries big tech is monopolized by aging rent seekers. Disrupt by divesting from it is my play now.


This translates to "right now, porn" and aspirations. (n.b. NeRFs that can be rendered client-side take O(days) to train with multiple A100s)


Forgot re-creation/preservation of existing content I paid for, by translating footage into physics, color, and geometry models and mapping them to my client's render pipeline. Level 1-1 of New Super Mario Bros is pretty much completely translated. No copyright problems if I don't distribute it :)

Like I said, most of my money is wfh design of branded gadgets. Not really the sort to care about the reach of others; if the content industry collapses because people don't need to spend money on it, meh. More interested in advancing computing. Pour money into R&D of organic computers, rather than web apps running on the same old gear with more HP under the hood. Yawn.

I want bioengineered kaiju sized dogs and drug glands that stoke hallucination I’m on another planet.

Humanity is a generational cup and string. Time to snip the 1900s loose.


Hey, I'm the post author. This is a totally fair point! I do think though that depending on your specific requirements open-source models can be a 10x+ improvement. For example, we serve Mistral 7B for less than 1/10th the cost of GPT-4-Turbo, which is the model most of our users are comparing us to.


I serve ~300tk/s of Mistral 7B for $0.60/hr by renting a cloud 3090. That's a lot cheaper than GPT-4-Turbo, though the quality is closer to GPT-3.5.

Mixtral 8x7B is closer to GPT-4 quality though, and only about 2x the compute requirement of Mistral 7B.
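Not sure what stack the parent is running, but as one point of reference, serving Mistral 7B on a single rented GPU with vLLM looks roughly like this (model id and sampling parameters are illustrative):

    from vllm import LLM, SamplingParams

    # Loads the weights onto the local GPU and serves with continuous batching.
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", max_model_len=8192)
    params = SamplingParams(temperature=0.7, max_tokens=256)
    out = llm.generate(["Summarize the plot of Hamlet in two sentences."], params)
    print(out[0].outputs[0].text)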


This is the 10x I was looking for. Great post by the way!


> I think the adage about "a solution needs to be 10x other solutions to make someone switch" applies here.

Cheaper and faster is also better. The cheapest version of GPT-4 costs $0.01/$0.03 per 1K input/output tokens [1], i.e. $10/$30 per million. Mistral AI is charging €0.14/€0.42 per ONE MILLION input/output tokens for their 7B model [2], roughly 60-70x less. It's night and day.

If people can start fine-tuning a 7B model to do the same work they were doing with GPT-4, they will 100% switch.

[1]: https://help.openai.com/en/articles/7127956-how-much-does-gp...

[2]: https://docs.mistral.ai/platform/pricing/


OpenAI is nothing like IBM in its heyday. I bet a very healthy proportion of companies will not share their data with OpenAI; I saw some numbers on this a while back but don't have the link handy. Trust has to be earned.


Actually, I think Microsoft is going to laugh all the way to the bank, because most enterprises will probably use the Azure OpenAI service instead of buying OpenAI's offerings directly.


All they need is an API-compatible client library so there is no actual switching cost between models other than configuration. There's a reason OpenAI is adding all sorts of add-on features like assistants and file upload: they know the models themselves are going to become a commodity, and they need something to lock developers into their platform.


Code execution and RAG are not going to lock people in. They are 1000x easier to replicate than the model, which as you say, is already becoming a commodity.

My pet theory is that OpenAI are cooking high quality user data by empowering GPT with all these toys + human-in-the-loop. The purpose is to use this data as a sort of continual evaluation sifting for weak points and enhancing their fine-tuning datasets.

Every human response can carry positive or negative connotation. The model can use that as a reward signal. They claimed to have 100M users; at, say, 10K tokens per user per month, that makes 1T synthetic tokens a month. In a whole year they generate about as much text as the original training dataset, ~13T tokens. And we know that LLMs can benefit a lot from synthetic data when it is filtered/engineered for quality.

So I think OpenAI's moat is the data they generate.


I have the opposite problem.

We are besieged by vendors promising the earth from their amazing AI tools, and when we peel back one surface layer they are just shoving things wholesale into GPT-4. When I ask "can we please deploy this on a local model" they run off scared. I can't get any vendor to give us anything except OpenAI.


There's a lot of truth to this, but I have seen clients get really interested in local models—mostly due to cost and/or confidentiality. For example, some healthcare clients will never upload medical records to OpenAI, regardless of the enterprise agreement.


The problem here is that the platform offering is overly complicated to get started with and quite limited. 2,000 dataset entries for $50 a month, when I can do 10x that on Colab for free with axolotl or unsloth? Yeah, no thanks.


I've read all the comments here, many of which contradict my points. I used to agree with those ideas but then tried to sell LLMs to customers. My takeaway is that customers will pretend they care about privacy and the need for on-prem installations, but right now, at least, they'll go with a vendor that tells them their data is protected and never investigate whether that's true.

Zoom got away with it and still does and no one got fired for using zoom.

I'm happy to have a debate with someone that has successfully sold those ideas to a customer, but I'm skeptical until then.


I think at this point the "10x other solutions" should be measured against cost. If I can process, in perpetuity, hundreds of millions of tokens for what OpenAI charges to process tens of millions of tokens once, that is already past the threshold.


The real thing is the switching cost. Sure, we start with OpenAI. But at some hackathon in 9 months somebody will try Mistral, and if that saves real money and still works, it feels like an easy swap.


how are you using it for your project?


One thing that most people don't realize is that (full-parameter) fine-tuned models are costly unless you run them in batched mode. Which means that unless the request rate is very high and consistent, it is better to use prompts with GPT-3.5: e.g. at a batch size of 1, Mistral is more expensive than GPT-4 [1].

[1]: https://docs.mystic.ai/docs/mistral-ai-7b-vllm-fast-inferenc...


I cloud host Mistral 7B for 20x cheaper than GPT-4-Turbo.

And Mistral 7B API is $0.00/1M tokens, i.e. free : https://openrouter.ai/models/mistralai/mistral-7b-instruct


But this doesn't apply to self-hosted, no?


It does. LLMs are most efficient when running large batches, so the GPU cost is super high if you're underutilizing it. It will cost more than a cloud provider like OpenAI, who has the volume to keep their GPUs saturated.


Yup. It’s also important to mention that OpenAI enjoys the luxury of having large clusters of H100s (the last time I checked).


Doesn't really follow instructions too well. If you ask it to list 10 songs or 5 things, it'll give you way more. I'm not sure why some models do this well, like Mistral Instruct v1 or ChatGPT 3.5/4, but this one is extremely verbose and outputs text like a short-circuited robot.


They released a base model. It is not instruction-tuned, so it won't really follow instructions unless you fine-tune it to do that.

"There are lots of Mistral fine-tunes. Why another one?

A very healthy ecosystem of Mistral fine-tunes already exists, but they’re typically optimized for direct use. We wanted something different — a model optimized to be the strongest base model for further fine-tunes to be built on."


Then how come the base model can somewhat follow instructions, just not very well? Or rather, why won't a base model follow instructions well?


Base models are just trying to autocomplete the input text. The most logical completion for an instruction is something approximately like what you asked, but base models are raw. They have not been taught to follow instructions, so they generally do a poor job. They're especially bad at knowing when to stop, and they will often generate their own questions to answer, which they will then answer, followed by more questions and more answers.

When chat models are trained, they are first pre-trained (the "P" in "GPT"), which creates a base model, then they are "fine tuned" (RLHF, aligned, whatever you want to call it).

A base model can be fine tuned with an instruction dataset (like OpenOrca[0]) to learn how to follow instructions or how to chat. It can also be fine-tuned with a collection of any inputs and the expected outputs, and learn how to do that specific task.

OpenPipe appears to specialize in fine-tuning base models for specific applications. They wanted a better base model. If you want it instruction-tuned, I'm sure they would be happy to help with that, or you can wait for someone in the community to make one of those from their base model... but I believe the whole point of the article is that a small, specialized model can outperform a large, general model. Their goal does not seem to be to build a tiny, general, chat-tuned model that outperforms GPT-4 in everything. They want you to train the base model on a very specific task, with the expectation that it will outperform GPT-4 and be tremendously cheaper to run at the same time. Many LLM tasks are centered around summarization, extraction, or classification, which have nothing to do with chatting.

[0]: https://huggingface.co/datasets/Open-Orca/OpenOrca
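As a rough illustration of that last paragraph, instruction-tuning a base model on OpenOrca with a LoRA adapter looks something like this sketch (assuming the Hugging Face transformers/peft/trl stack; the base model id, prompt format, and hyperparameters are placeholders, not OpenPipe's recipe):

    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    base = "mistralai/Mistral-7B-v0.1"  # swap in whichever base model you're tuning
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    def to_text(row):
        # Collapse OpenOrca's fields into one training string; real chat templates are stricter.
        return {"text": f"{row['system_prompt']}\nUSER: {row['question']}\nASSISTANT: {row['response']}"}

    dataset = load_dataset("Open-Orca/OpenOrca", split="train[:1%]").map(to_text)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
    )
    trainer.train()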


>> Doesn’t really follow instructions too well,

This is the biggest problem we're having swapping LLMs. While Langchain allows an easy swap, and while we don't care as much about quality during integration testing, etc., the bigger problem is following directions. OpenAI does well at outputting JSON if I ask for it. Unfortunately, our software has now come to expect JSON output in such cases. Swap it to, say, Llama 2 and you don't get JSON even if you ask for it. This makes swapping not just a quality decision but an integration challenge.


I haven't used the Llama 2 models much in quite a while, because they just aren't very good compared to other options that exist at this point. The instruction-tuned variants of Mistral and Mixtral seem to have very little trouble responding in JSON when I ask for it. However, with LLMs that you run yourself, you can also enforce a grammar for the response if you want to, guaranteeing that it will respond with valid JSON (that matches your schema!) and no extraneous text.

Something potentially helpful here: https://github.com/ggerganov/llama.cpp/discussions/2494

If you fine-tuned a base model (like the one in the article) on various inputs and the expected JSON output for each input, it would probably do even better.
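For example, with the llama-cpp-python bindings a grammar constraint looks roughly like this (the grammar is a stripped-down sketch rather than the full json.gbnf that ships with llama.cpp, and the model path is illustrative):

    from llama_cpp import Llama, LlamaGrammar

    # A toy GBNF grammar that only admits a flat JSON object with string values.
    grammar = LlamaGrammar.from_string(r'''
    root   ::= "{" ws pair ("," ws pair)* ws "}"
    pair   ::= string ws ":" ws string
    string ::= "\"" [a-zA-Z0-9 ]* "\""
    ws     ::= [ \t\n]*
    ''')

    llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")
    out = llm("Return the user's name and city as JSON. Input: Jane Doe, Berlin. Output: ",
              grammar=grammar, max_tokens=128)
    print(out["choices"][0]["text"])  # the decoder can only emit tokens the grammar allows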


In my experience, Llama 2 (70B) can semi-reliably provide JSON output when provided with clear instructions and various distinct but similarly structured examples. It goes from “semi-reliably” to “consistently” when fine-tuned.

The primary issue I’ve run into is exhausting the context window much sooner than I’d like. Fine-tuning tends to mostly fix this issue though.


At least ChatGPT 3.5 also has that problem. Ask it to summarize in X sentences, and chances are it gives the wrong number.


I wonder if model merging allows for peer-to-peer collaborative learning, a la Folding@home. I also wonder if it can have protective effects against catastrophic forgetting, similar to the way DNA is merged in sexual reproduction.


Not sure what the point is?

It's well known that small fine-tunes outperform big models for specific tasks.

But unless my task happens to be something similar to what was tested and fine-tuned here, it doesn't really help?


I’m really struggling to find a use case for these local models when even ChatGPT 3.5 can do it as good as any of them so far.


The article shows (fine tuned) Mistral 7B outperforming GPT-4, never mind GPT-3.5.


This model is not even close to 3.5 from when I used it. First of all, it does not follow instructions properly, and it just runs on and on.


What you're describing is the behavior you get from any base model that has not been instruction-tuned. The article is clear that this model is not for "direct use". It needs tuning for a specific application.


How does one fine-tune it to follow instructions? I would have thought there are open-source training sets for these instruction-following fine-tunes?


Not everyone wants to send all their data to OpenAI or Microsoft. Sometimes it isn't legally possible even if you want to. And not every use-case is blessed with a permanent internet connection.

And for some use-cases, the "alignment" work on GPT 3.5 and 4 gets more in the way than it helps (even OpenAI admits that alignment makes the model perform worse, even on generic benchmarks).


Is it possible to fine-tune something like Mistral 7B on a large PDF (I'm thinking multi-hundred-page spec/standard docs) and ask it questions on the topic?


I believe RAG is more appropriate for this! While you can certainly fine-tune on a PDF, you are essentially fine-tuning with batch size == 1, so you should not expect good results! Also, you need a label (for example, a summary) in order to fine-tune!
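A bare-bones sketch of that RAG approach, assuming sentence-transformers for the embeddings; "llm" stands in for whatever local model you call for the final answer:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def build_index(doc_text, chunk_size=1000):
        # Split the already-extracted PDF text into fixed-size chunks and embed each one.
        chunks = [doc_text[i:i + chunk_size] for i in range(0, len(doc_text), chunk_size)]
        return chunks, embedder.encode(chunks, normalize_embeddings=True)

    def answer(question, chunks, embeddings, llm, k=4):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(embeddings @ q)[-k:]   # cosine similarity via normalized dot product
        context = "\n\n".join(chunks[i] for i in top)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
        return llm(prompt)                      # any local text-generation call works here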


Any time I see a claim that "our 7B models are better than GPT-4", I basically stop reading. If you are going to make that claim, give me several easily digestible examples of it actually happening.


Anecdotally, I finetuned Mistral 7B for a specific (and slightly unusual) natural language processing task just a few days ago. GPT-4 can do the task, but it needs a long complex prompt and only gets it right about 80-90% of the time - the finetuned model performs significantly better with fewer tokens. (In fact it does so well that I suspect I could get good results with an even smaller model.)


I have a fine tuned version of Mistral doing a really simple task and spitting out some JSON. I'm getting equivalent performance to GPT-4 on that specialized task. It's lower latency, it's outputting more tokens/sec., more reliable, private, and completely free.

I don't think we will have an Open Source GPT4 for a long time so this is sorta clickbait, but for the small, specialized tasks, tuned on high quality data, we are already in the "Linux" era of OSS models. They can do real, practical work.


Been my thought for a while now.

Can you recommend where I can learn more about hardware requirements for running Mistral/Mixtral?


> completely free

Not according to my calculations. For a low request rate, it is likely more expensive than GPT-4.


How are you guys fine tuning?


Can you please point me in the direction of the guide you used for fine tuning? Did you use QLoRA?


What I think they’re claiming is that it’s a base model aimed for further fine tuning, that when further tuned might perform better than GPT-4 on certain tasks.

It’s an argument they make at least as much to market fine tuning as their own model.

This is not a generic model that outperforms another generic model (GPT-4).

That can of course have useful applications because the resource/cost is then comparatively minuscule for certain business use cases.


IDK about GPT4 specifically, but I have recently witnessed a case where small finetuned 7Bs greatly outperformed larger models (Mixtral Instruct, Llama 70B finetunes) in a few very specific tasks.

There is nothing unreasonable about this. However I do dislike it when that information is presented in a fishy way, implying that it "outperforms GPT4" without any qualification.


(Post author here). Totally fair concern. I'll find some representative examples on a sample task we've done some fine-tuning on and add them to the post.

EDIT: Ok so the prompt and outputs are long enough that adding them to the post directly would be kind of onerous. But I didn't want to leave you waiting, so I copied an example into a Notion doc you can see here: https://opipe.notion.site/PII-Redaction-Example-ebfd29939d25...


They can absolutely outperform gpt4 for specific use cases.


Yeah, a 7B foundation model is of course going to be worse when expected to perform on every task.

But finetuning on just a few tasks?

Depending on the task, it's totally reasonable to expect that a 7B model might eke out a win against stock GPT4. Especially if there's domain knowledge in the finetune, and the given task is light on demand for logical skills.


I am very open to believing that. I'd love to see some examples.


I agree, I think they need an example or two on that blog post to back up the claim. I'm ready to believe it, but I need something more than "diverse customer tasks" to understand what we're talking about.


You can fine-tune a small model yourself and see. GPT-4 is an amazing general model, but won’t perform the best at every task you throw at it, out of the box. I have a fine-tuned Mistral 7B model that outperforms GPT 4 on a specific type of structured data extraction. Maybe if I fine-tuned GPT-4 it could beat it, but that costs a lot of money for what I can now do locally for the cost of electricity.


Well it's pretty easy to find examples online, this one using Llama 2, not even Mistral or fancy techniques: https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...


They're quite close in arena format: https://chat.lmsys.org/?arena


To be clear, Mixtral is very competitive; Mistral, while certainly way better than most 7B models, performs far worse than ChatGPT 3.5 Turbo.


Apologies, that's what I get for skimming through the thread.


Not for translations. I did a lot of experimenting with different local models. None come even a bit close to the capabilities of ChatGPT; most local models just output plain wrong information. I am still hoping one day it will be possible. For our business it's a huge opportunity.


For translation, you're probably better off with a model that's specifically designed for translation, like MADLAD-400 or DeepL's services.


Looks like they utilized the Bradley-Terry model, but that's not one I'm super familiar with.

https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model


The BTL model is just a way to infer 'true' skill levels given some list of head-to-head comparisons. The head-to-head comparisons/rankings are the most important part!!!! And in this case, the rankings come from GPT-4 itself, so take any subsequent score with all the grains of salt you can muster.

Their methodology also appears to be 'try 12 different models and hope 1 of them wins out.' Multiple hypothesis adjustments come to mind here :)
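For anyone curious, fitting Bradley-Terry strengths from pairwise results takes only a few lines; this is a sketch of the standard iterative (MM) update, not lmsys's exact pipeline:

    import numpy as np

    def bradley_terry(wins, iters=200):
        # wins[i][j] = number of times model i beat model j; returns relative strengths.
        n = len(wins)
        p = np.ones(n)
        for _ in range(iters):
            for i in range(n):
                total_wins = sum(wins[i])
                denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                            for j in range(n) if j != i)
                p[i] = total_wins / denom if denom else p[i]
            p /= p.sum()  # only the ratios are identifiable, so normalize each pass
        return p

    # Toy example: model 0 beats model 1 in 7 of 10 head-to-head matchups.
    print(bradley_terry([[0, 7], [3, 0]]))  # ~[0.7, 0.3]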


https://chat.lmsys.org/?arena

Try a few blind tests; Mixtral 8x7B Instruct and GPT-4 are 50-50 for me, and it outperforms 3.5 almost every time, and you can run inference on it with a modern CPU and 64 GB of RAM on a personal device lmfao. And the instruct fine-tuning has had nowhere near the $$$ and RLHF that OpenAI has. It's not a done deal, but people will be able to run models better than today's SOTA on <$1000 hardware in <3 months; I hope for their own sake that OpenAI is moving fast.


Some things to note about gpt4:

>Sometimes it will spit out terrible horrid answers. I believe this might be due to time of the day/too many users. They limit tokens.

>Sometimes it will lie because it has alignment

>Sometimes I feel like it tests things on me

So, yes you are right, gpt4 is overall better, but I find myself using local models because I stopped trusting gpt4.


Don't forget that ChatGPT 4 also has seasonal depression [1].

[1]: https://twitter.com/RobLynch99/status/1734278713762549970

(Though with that said, the seasonal issue might be common to any LLM with training data annotated by time of year.)


How are local models better in terms of trust? GPT 4 is the only model I've seen actually tuned to say no when it doesn't have the information being asked for. Though I do agree it used to run better earlier this year.

The best open source has to offer is Mixtral, which will confidently make up a biography of a person it's never heard of before or write a script with nonexistent libraries.


I once asked Llama whether it’d heard of me. It came back with such a startlingly detailed and convincing biography of someone almost but not quite entirely unlike me that I began to wonder if there was some kind of Sliding Doors alternate reality thing going on.

Some of the things it said I’d done were genuinely good ideas, and I might actually go and do them at some point.

ChatGPT just said no.


To be clear, the comparison was originally with GPT3 and ChatGPT3. ChatGPT3 would lie about anti-vaxx books never existing. GPT3 would answer facts.


“… with my definition of better” should be the default interpretation whenever you see the word better anywhere.


In their second sentence they have the most honest response I've seen so far at least: " averaged across 4 diverse customer tasks, fine-tunes based on our new model are _slightly_ stronger than GPT-4, as measured by GPT-4 itself."


I want an interactive prompt box with some example prompts and answers from the model and a comparison with GPT-4. My random guess is that this fine-tuned Mistral 7B is better than GPT-4 at nothing, or almost nothing, and that's why, instead of the above, we got a table with a bunch of irrelevant metrics.


Of course Mistral 7B is worse than GPT-4, but I can run Mistral 7B at home.


The point is that the article states "averaged across 4 diverse customer tasks, fine-tunes based on our new model are slightly stronger than GPT-4, as measured by GPT-4 itself" and then proves it with nothing tangible, just the 4 selected metrics where it performs the best. I mean, obviously a fine-tuned 7B LLM could perform, let's say, text summarization well. The question is what happens if that text contains code, or domain-specific knowledge where some facts are less relevant than others, etc., and that isn't going to be answered by any metric alone. Fundamentally, with enough diverse metrics, each based on a different dataset, the one with the biggest overlap with the fine-tuning dataset will perform really well, and the rest, well, not so well.

Basically, the statistic means that there's a set of data for which that particular (fine-tuned) network performs slightly better than GPT-4, and everywhere else, pretty badly. It's just not generalizable to everything, while GPT-4 is. It's as good as saying "calculators outperform GPT-4 at counting". Like, yes, they probably do, but I would like to see: is it applicable and practical, or did you just train an LLM to write all the names in Polish alphabetically really well? And that's why a qualitative approach to evaluating LLMs is just better.



