
Here [1] is a reference for the tokens/sec of Llama 3 on different Apple hardware. You can evaluate whether this is acceptable performance for your agents. I would assume the tokens/sec would be much lower if the LLM agent is running alongside the game, since the game would also be using a portion of the CPU and GPU. I think this is something you need to test on your own to determine its usability.

You can also look into lower-parameter models (3B, for example) to determine whether the balance between accuracy and performance fits your use case.

> Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free, right?

I don't have any knowledge of game dev, so I can't comment on that part, but yes, packaging the model locally would make inference free.

[1] https://github.com/ggerganov/llama.cpp/discussions/4167
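
For a rough sense of what local inference looks like in practice, here's a minimal sketch using the llama-cpp-python bindings. The model path and sampling parameters are placeholders; I haven't tried this inside a game loop:

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Load a quantized GGUF model shipped alongside the game assets
    # (path is a placeholder).
    llm = Llama(model_path="assets/models/llama-3-8b-q4.gguf", n_ctx=2048)

    # One short completion per NPC turn; keep max_tokens small
    # to bound latency.
    out = llm(
        "You are a shopkeeper NPC. Greet the player in one sentence.",
        max_tokens=48,
        temperature=0.7,
    )
    print(out["choices"][0]["text"])

Since everything runs on the player's machine, there's no per-token API cost; the real budget is the CPU/GPU time you're willing to give up.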




Thanks! This is helpful. I was thinking about the Phi models; those might be useful for this task. Will look into how those can be run locally as well.
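
From a quick look at the docs, the ollama Python client seems like the simplest route. An untested sketch, assuming the Ollama daemon is running and I have the model tag right:

    # pip install ollama  (assumes the Ollama daemon is running locally)
    import ollama

    # Pull the model once, then chat with it; 'phi3:mini' is the
    # ~3.8B-parameter tag.
    ollama.pull("phi3:mini")
    resp = ollama.chat(
        model="phi3:mini",
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp["message"]["content"])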


I just ran phi3:mini [1] with Ollama on an Apple M3 Max laptop, on battery set to "Low power" (mentioned because that makes some things run more slowly). phi3:mini output roughly 15-25 words/second. The token rate is higher, but I don't have an easy way to measure it.

Then llama3:8b [2]. It output 28 words/second. This is higher despite the larger model, perhaps because llama3 obeyed my request to use short words.

Then mixtral:8x7b [3]. That output 10.5 words/second. It looked like about 2 tokens/word, as the pattern was quite repetitive and visible, but again I have no easy way to measure it.

That was on battery, set to "Low power" mode, and I was impressed that even with mixtral:8x7b, the fans didn't come on at all for the first 2 minutes of continuous output. Total system power usage peaked at 44W, of which about 38W was attributable to the GPU.
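
For the token rate: Ollama's REST API does report eval_count and eval_duration in its final response, so tokens/sec can be computed. A quick sketch, assuming the default localhost:11434 endpoint:

    # Rough tokens/sec measurement via Ollama's REST API.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "phi3:mini",
            "prompt": "Tell me a short story.",
            "stream": False,
        },
    ).json()

    # eval_duration is in nanoseconds; eval_count is generated tokens.
    tps = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"{tps:.1f} tokens/sec")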

[1] https://ollama.com/library/phi3

[2] https://ollama.com/library/llama3

[3] https://ollama.com/library/mixtral


Well, since OP doesn't seem to want to: thank you for your response.

I came across this thread while doing some research, and it's been helpful.

(I hate how common the tragedy of the commons is. =/)


What? Chill out, buddy - there's such a thing as time zones. I was just sleeping.



