
How many tokens / sec are you getting on an M1 Air? Curious since I'm at work and can't try this on my Air yet hah.



~13 tokens per second. I asked for a story about Goldilocks, and these were the timings on my M1 Air using `ollama run mistral --verbose`:

  total duration:       33.964492834s
  load duration:        1.471584ms
  prompt eval count:    8 token(s)
  prompt eval duration: 385.334ms
  prompt eval rate:     20.76 tokens/s
  eval count:           418 token(s)
  eval duration:        31.887033s
  eval rate:            13.11 tokens/s
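
The eval rate is just the eval count divided by the eval duration, so you can sanity-check the reported number yourself (values copied from the output above):

  # Reproducing ollama's reported eval rate from the figures above.
  eval_count = 418             # tokens generated
  eval_duration = 31.887033    # seconds spent generating
  print(eval_count / eval_duration)   # ~13.11 tokens/s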


I'm getting >30 tokens/sec with ollama on an M2 Pro. That might be a little slow, though, because I have a fine-tuning job running in the background.


Bit of a tangential question here, but any recommendations on how to get started fine-tuning this model (or ones like it)? There seem to be a million different tutorials and ways of doing it when I Google.


https://github.com/OpenAccess-AI-Collective/axolotl

This is a wrapper around many training methods, and it has yielded many excellent community finetunes already.


Take a look at the QLoRA repo (https://github.com/artidoro/qlora/), which has an example of fine-tuning Llama. It's made by the authors of the QLoRA paper.
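
If you'd rather see the shape of it in plain Hugging Face code, here's a minimal QLoRA-style sketch using the peft + bitsandbytes stack rather than the qlora repo's own script. The model name, LoRA rank, and target modules are illustrative, not the paper's exact settings:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  model_id = "mistralai/Mistral-7B-v0.1"  # illustrative choice of base model

  # Load the frozen base model quantized to 4-bit (the "Q" in QLoRA).
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      model_id, quantization_config=bnb_config, device_map="auto"
  )
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # Attach small trainable LoRA adapters; only these get gradient updates.
  model = prepare_model_for_kbit_training(model)
  lora_config = LoraConfig(
      r=16,
      lora_alpha=32,
      target_modules=["q_proj", "v_proj"],  # the qlora paper targets all linear layers
      lora_dropout=0.05,
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()
  # From here, train with transformers.Trainer or trl's SFTTrainer on your dataset.

Tools like axolotl essentially wrap this same recipe behind a YAML config, which is why it's the usual recommendation for getting started.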


Feels roughly the same speed as GPT-3.5 in the browser UI.


It's the same speed as Llama 7B, so very quick.





