- ExUI with EXL2 files on recent NVIDIA GPUs with decent VRAM.
- KoboldCpp with GGUF files for small GPUs and Apple silicon.
There are many reasons, but in a nutshell they're the fastest and most VRAM-efficient options.
I can fit a 34B with about 75K of context on a single 24GB 3090 before the quality drop from quantization really starts to get dramatic.
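For a rough sense of why that works, here's a back-of-envelope VRAM estimate, as a sketch only: the architecture numbers (60 layers, 8 GQA KV heads, head dim 128) are assumptions based on a typical 34B like Yi-34B, and the 4-bit cache figure assumes a quantized KV cache like ExLlamaV2's Q4 cache. Check your own model's config before trusting the totals.

```python
# Back-of-envelope VRAM estimate for a quantized 34B + long context.
# Architecture numbers are ASSUMPTIONS (roughly Yi-34B-shaped); adjust
# to your model's config.json before drawing conclusions.

def weights_gb(params_b: float, bpw: float) -> float:
    """Quantized weight size in GB: params * bits-per-weight / 8."""
    return params_b * 1e9 * bpw / 8 / 1e9

def kv_cache_gb(ctx: int, layers: int = 60, kv_heads: int = 8,
                head_dim: int = 128, cache_bpw: float = 4.0) -> float:
    """KV cache in GB: 2 tensors (K and V) per layer, kv_heads * head_dim
    elements each, at cache_bpw bits per element, per token of context."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * cache_bpw / 8
    return ctx * per_token_bytes / 1e9

ctx = 75_000
for bpw in (3.0, 3.5, 4.0):
    total = weights_gb(34, bpw) + kv_cache_gb(ctx)
    print(f"{bpw} bpw weights + 4-bit cache at {ctx} ctx: ~{total:.1f} GB")
```

Under those assumptions, ~4 bpw weights plus the cache already crowd 24GB, so pushing context that far means dropping toward ~3 bpw, which is right where quantization quality starts falling off a cliff.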
EDIT: just saw this: “Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.”