> and it only uses ~22GB (via Ollama) or ~15GB (MLX)

Why is the memory use different? Are you using a different context size in the two setups?



No idea. MLX is its own thing, optimized for Apple Silicon. Ollama uses GGUFs.

https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?
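For a rough sanity check, here's a back-of-the-envelope sketch (Python) of the effective bits per weight for the two formats. It assumes Q4_0's layout of one fp16 scale per 32-weight block and MLX's default 4-bit grouping of 64 weights with an fp16 scale and bias per group; the parameter count and layouts are approximations, not measured values.

    # Back-of-the-envelope: effective bits per weight and resulting weight size
    # for a ~27B-parameter model. Block/group layouts are assumed, not measured.

    PARAMS = 27e9  # rough parameter count for Gemma 3 27B

    def gguf_q4_0_bits_per_weight(block_size=32):
        # Q4_0: 4-bit quants plus one fp16 scale per block of 32 weights (assumed layout)
        return 4 + 16 / block_size

    def mlx_4bit_bits_per_weight(group_size=64):
        # MLX 4-bit: 4-bit quants plus fp16 scale and bias per group of 64 weights (assumed layout)
        return 4 + (16 + 16) / group_size

    for name, bpw in [("Q4_0 GGUF", gguf_q4_0_bits_per_weight()),
                      ("MLX 4-bit", mlx_4bit_bits_per_weight())]:
        size_gb = PARAMS * bpw / 8 / 1e9
        print(f"{name}: ~{bpw:.2f} bits/weight, ~{size_gb:.1f} GB of weights")

Both come out to ~4.5 bits/weight, i.e. roughly 15GB of weights for a 27B model, which lines up with the MLX number but not the Ollama one.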


Those are the same quant, but this is a good example of why you shouldn't use ollama. Either directly use llama.cpp, or use something like LM Studio if you want something with a GUI/easier user experience.

The Gemma 3 27B QAT GGUF should be taking up ~15GB, not 22GB.
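If you want to double-check the numbers outside Ollama without building llama.cpp yourself, here's a minimal sketch using the llama-cpp-python bindings to load the QAT GGUF directly; the model filename, context size, and prompt are illustrative, not actual paths.

    # Minimal sketch: load the Gemma 3 27B QAT GGUF directly via llama-cpp-python
    # and watch resident memory. Filename, context size, and prompt are illustrative.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./gemma-3-27b-it-qat-q4_0.gguf",  # hypothetical local filename
        n_ctx=8192,        # smaller context -> smaller KV cache
        n_gpu_layers=-1,   # offload all layers (Metal on Apple Silicon)
    )

    out = llm("Explain quantization-aware training in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

A smaller n_ctx shrinks the KV cache, so keep the context size identical in both setups when comparing memory.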


The vision tower is 7GB, so I was wondering whether you were loading it without vision.
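If that 7GB figure is right, the arithmetic roughly accounts for the gap; a quick sketch using only numbers quoted in this thread (all approximate):

    # Rough accounting for the observed gap, using figures from this thread
    # (the 7 GB vision-tower number is from the comment above, not measured here).
    weights_4bit_gb = 27e9 * 4.5 / 8 / 1e9   # ~15.2 GB of quantized text weights
    vision_tower_gb = 7.0                    # figure quoted above
    print(f"text weights only: ~{weights_4bit_gb:.1f} GB")                     # close to the MLX number
    print(f"weights + vision:  ~{weights_4bit_gb + vision_tower_gb:.1f} GB")   # close to the Ollama number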



