
How many tokens / sec are you getting on an M1 Air? Curious since I'm at work and can't try this on my Air yet hah.



~13 tokens per second. I asked for a story about Goldilocks, and these were the timings on my M1 Air using `ollama run mistral --verbose`:

  total duration:       33.964492834s
  load duration:        1.471584ms
  prompt eval count:    8 token(s)
  prompt eval duration: 385.334ms
  prompt eval rate:     20.76 tokens/s
  eval count:           418 token(s)
  eval duration:        31.887033s
  eval rate:            13.11 tokens/s
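
The eval rate is just the eval count divided by the eval duration, so you can sanity-check the reported number yourself (values copied from the output above):

  # Reproducing ollama's reported eval rate from the figures above.
  eval_count = 418             # tokens generated
  eval_duration = 31.887033    # seconds spent generating
  print(eval_count / eval_duration)   # ~13.11 tokens/s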


I'm getting >30 tokens/sec with ollama on an M2 Pro. That might be a little slow, though, because I have a fine-tuning job running in the background.


Bit of a tangential question here, but any recommendations on how to get started fine-tuning this model (or ones like it)? There seem to be a million different tutorials and ways of doing it when I Google.


https://github.com/OpenAccess-AI-Collective/axolotl

This is a wrapper around many training methods, and it has yielded many excellent community finetunes already.


Take a look at the QLoRA repo (https://github.com/artidoro/qlora/), which has an example of fine-tuning Llama. It's made by the authors of the QLoRA paper.
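
If you'd rather see the shape of it in plain Hugging Face code, here's a minimal QLoRA-style sketch using the peft + bitsandbytes stack rather than the qlora repo's own script. The model name, LoRA rank, and target modules are illustrative, not the paper's exact settings:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  model_id = "mistralai/Mistral-7B-v0.1"  # illustrative choice of base model

  # Load the frozen base model quantized to 4-bit (the "Q" in QLoRA).
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      model_id, quantization_config=bnb_config, device_map="auto"
  )
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # Attach small trainable LoRA adapters; only these get gradient updates.
  model = prepare_model_for_kbit_training(model)
  lora_config = LoraConfig(
      r=16,
      lora_alpha=32,
      target_modules=["q_proj", "v_proj"],  # the qlora paper targets all linear layers
      lora_dropout=0.05,
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()
  # From here, train with transformers.Trainer or trl's SFTTrainer on your dataset.

Tools like axolotl essentially wrap this same recipe behind a YAML config, which is why it's the usual recommendation for getting started.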


Feels roughly the same speed as GPT-3.5 in the browser UI.


It's the same speed as Llama 7B, so very quick.





