Cool service. It's worth noting that, with quantization/QLoRA, models as big as llama2-70b can be run on consumer hardware (2x RTX 3090) at acceptable speeds (~20 t/s) using frameworks like llama.cpp. Doing this avoids the significant latency of parallelism schemes that span different servers.
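Roughly what that looks like with the llama-cpp-python bindings, as a minimal sketch (the model file and split values are placeholders; actual speed depends on the quant type and context length):

```python
# Minimal sketch: load a 4-bit GGUF of llama2-70b fully offloaded across two GPUs.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # placeholder path to a 4-bit quant
    n_gpu_layers=-1,          # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],  # spread the weights evenly across the two 3090s
    n_ctx=4096,
)

out = llm("Explain speculative decoding in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```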
p.s. from experience instruct-finetuning falcon180b, it's not worth using over llama2-70b as it's significantly undertrained.
Hi, a Petals dev here. You're right, there's no point in using Petals if your machine has enough GPU memory to fit the model and you're okay with the quantization quality.
We developed Petals for people who don't have enough GPU memory to run the model locally. Also, there's still a chance that larger open models will be released in the future.
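For context, the client side looks roughly like this (the model name is just an example; any model currently served by the public swarm works):

```python
# Rough sketch of a Petals client: the transformer blocks run on remote GPUs
# in the swarm, while tokenization, embeddings and sampling run locally.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # example Llama-2-70B derivative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```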
AFAIK you cannot train a 70B model on 2x 3090, even with GPTQ/QLoRA.
And the inference is pretty inefficient. Pooling the hardware would achieve much better GPU utilization and (theoretically) faster responses for the host's requests.
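Rough back-of-envelope for the training side (the adapter size is an assumption for illustration, and activations/CUDA overhead aren't counted):

```python
# Very rough VRAM estimate for QLoRA on a 70B model.
GiB = 2**30
params = 70e9
base_4bit = params * 0.5 / GiB        # ~33 GiB of 4-bit base weights
lora_params = 0.5e9                   # assumed adapter size, for illustration
adapters = lora_params * 2 / GiB      # fp16 adapter weights
adam_states = lora_params * 8 / GiB   # fp32 m and v states for the adapters
print(f"base weights: {base_4bit:.1f} GiB")
print(f"adapters + optimizer: {adapters + adam_states:.1f} GiB")
# Gradients, activations, dequant buffers and the CUDA context come on top,
# so 2x24 GB leaves almost no headroom for a useful sequence length.
```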
For training you would need more memory. As for the pooling: theoretically yes, but wouldn't latency play as large a part, if not a greater one, in the response time here? Imagine a tensor-parallel gather where the other nodes are in different parts of the country.
Here I'm assuming that Petals uses a large number of small, heterogeneous nodes like consumer GPUs. It might well be something much simpler.
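A toy model of the point (the numbers are made up, and it ignores overlapping compute with communication): every generated token has to cross every partition boundary at least once, whether that boundary is a pipeline hop or a tensor-parallel gather.

```python
# Toy latency floor for autoregressive generation over WAN-connected nodes.
def tokens_per_second(num_hops: int, rtt_ms: float, compute_ms: float) -> float:
    per_token_ms = num_hops * rtt_ms + compute_ms  # network floor + compute
    return 1000.0 / per_token_ms

# Assumed: 4 hops between consumer GPUs ~60 ms apart, 50 ms of total compute.
print(f"{tokens_per_second(4, 60, 50):.1f} t/s")  # ~3.4 t/s
```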
> Theoretically yes, but wouldn't latency play as large a part, if not a greater one, in the response time here?
For inference? Yeah, but it's still better than nothing if your hardware can't run the full model at all, or can only run it extremely slowly.
I think frameworks like MLC-LLM and llama.cpp kinda throw a wrench in this though, as you can get very acceptable throughput on an IGP or split across a CPU/dGPU, without that huge networking penalty. And pooling complete hosts (like AI Horde) is much cheaper.
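E.g. the CPU/dGPU split case is just partial offload; a quick sketch (layer count, thread count and model file are placeholders):

```python
# Sketch of partial offload: some layers on the dGPU, the rest on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # placeholder 4-bit quant
    n_gpu_layers=40,   # offload roughly half the layers to the dGPU
    n_threads=16,      # CPU threads for the layers left in system RAM
    n_ctx=2048,
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```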
I'm not sure what the training requirements are, but ultimately throughput is all that matters for training, especially if you can "buy" training time with otherwise idle GPU time.