Hacker News

CodeLlama is trained specifically on code, so it's probably not a useful comparison. GPT-4 is also very different from Mistral, with a huge step up in parameter count, and it seems to be using a multi-agent approach too.

Since Mistral is just a 7B-parameter model, it's no surprise that it can't straight up write accurate code; it's simply too small to accomplish something like that unless the model is trained specifically for writing code up front.

I guess if all you're looking for is a model to write code for you, that makes sense as a "hello world" test, but then you're looking at the wrong model here.

If you're looking for a good generalized model, what you really want to do is run a bunch of different tests against it, from different authors, average or aggregate those results into a single score, and then rank all the models by that score.
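The averaging-and-ranking idea above can be sketched in a few lines. The model names, benchmark names, and scores below are made up for illustration; real leaderboards also normalize scores per benchmark before averaging.

```python
def rank_models(scores):
    """scores: {model_name: {benchmark: score}} -> [(model, mean_score)], best first."""
    means = {
        model: sum(bench.values()) / len(bench)
        for model, bench in scores.items()
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical benchmark results for two hypothetical models.
scores = {
    "model-a": {"arc": 60.0, "hellaswag": 80.0, "mmlu": 55.0},
    "model-b": {"arc": 55.0, "hellaswag": 82.0, "mmlu": 60.0},
}

for model, mean in rank_models(scores):
    print(f"{model}: {mean:.2f}")
```

With these numbers, model-b edges out model-a on the average even though it loses on one benchmark, which is exactly why a single test is a poor basis for ranking.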

Luckily, Hugging Face has already put all of this in place, and it can be seen here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

This Mistral 7B model seems to earn itself third place among the 7B models on the leaderboard.

Edit: As another commenter mentioned, this also seems to be a base model, not one fine-tuned for request<>reply chat or instruction following. The authors (or someone else) are meant to fine-tune this model for that, if they want to.
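To make the base-vs-instruct distinction concrete: a base model only continues text, while instruction-tuned variants are trained on prompts wrapped in a fixed template. Here is a minimal sketch of what such wrapping looks like; the template below is purely illustrative, not Mistral's actual chat format, and each instruction-tuned model defines its own special tokens.

```python
def wrap_instruction(user_message: str) -> str:
    """Wrap a user message in a simple instruction template.

    Illustrative only: fine-tuning teaches the model to produce an
    answer after the "Response" marker instead of just continuing
    the user's text.
    """
    return f"### Instruction:\n{user_message}\n\n### Response:\n"

print(wrap_instruction("Write a haiku about autumn."))
```

Fed the raw message, a base model might just ramble on in the same style; fed the wrapped prompt, a model fine-tuned on this template learns to answer.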




> and seems to be using multi-agent approach too.

What do you mean by this? MoE?
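For anyone unfamiliar with the acronym: MoE (mixture of experts) routes each input through only a few "expert" sub-networks chosen by a learned gate, then mixes their outputs by the gate weights. A toy numpy sketch, with arbitrary shapes and linear maps standing in for real expert networks:

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Toy mixture-of-experts layer: pick the top_k experts by a
    softmax gate and return their gate-weighted output mix."""
    logits = x @ gate_w                       # one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over experts
    top = np.argsort(probs)[-top_k:]          # indices of the top_k experts
    weights = probs[top] / probs[top].sum()   # renormalize over chosen experts
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
dim, num_experts = 4, 3
# Each "expert" here is just a linear map; real experts are MLPs.
mats = [rng.normal(size=(dim, dim)) for _ in range(num_experts)]
experts = [lambda x, m=m: x @ m for m in mats]
gate_w = rng.normal(size=(dim, num_experts))

x = rng.normal(size=dim)
y = moe_forward(x, experts, gate_w)
```

Only top_k of the experts run per input, which is how MoE models get large total parameter counts without a matching increase in per-token compute.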



