These two phases have pretty different performance characteristics - prefill is compute heavy, and for long contexts it can also max out GPU memory. It can be nigh impossible to do it all in a single pass, which is why frameworks like vLLM use a technique called "chunked prefill".
The decode phase, by contrast, is memory bandwidth intensive and tends not to saturate GPU compute.
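To make the chunking idea concrete, here is a toy sketch (not vLLM's actual implementation) of what chunked prefill does: the prompt is processed in fixed-size chunks, each one attending to the KV cache built so far, instead of one giant pass whose activation memory scales with the full prompt length. The `kv_cache` list here is just a stand-in for real K/V tensors.

```python
# Toy illustration of chunked prefill. A real engine would run one forward
# pass per chunk (queries from the chunk attend to kv_cache + the chunk),
# then append the chunk's K/V to the cache.

def chunked_prefill(prompt_tokens, chunk_size):
    kv_cache = []
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # forward pass over `chunk` would happen here
        kv_cache.extend(chunk)
    return kv_cache

cache = chunked_prefill(list(range(10)), chunk_size=4)
print(len(cache))  # 10 - the whole prompt ends up cached, chunk by chunk
```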
If you are serving these models, you really want larger batch sizes during inference, which only really comes with scale - for a smaller app, you won't want to make the user wait that long.
So, long contexts only have to be processed _once_ per request, which makes this basically a scheduling problem.
But the number of decode passes scales linearly with the output length. If it were unlimited, some requests could just be _always_ present in an inference batch, reducing throughput for everyone.
Decode speed is generally memory bandwidth bound. Prefill is typically arithmetic bound. This is the reason for mixed batches (both decode and prefill) - it lets you saturate both memory bandwidth and arithmetic.
Chunked prefill is for minimizing latency for decode entries in the same batch. It's not needed if you have only one request - in that case it's fastest to just prefill in one chunk.
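The memory-bound vs. compute-bound split above can be checked with a back-of-envelope arithmetic-intensity calculation: a matmul against a weight matrix does ~2 FLOPs per parameter per token, while the weights are read from HBM once per batch. The hardware numbers below are rough H100 SXM specs (dense bf16 throughput, HBM3 bandwidth) used only for illustration.

```python
# FLOPs per byte of weights read, for one weight matrix applied to a batch.
# Below the hardware's ridge point you are bandwidth-bound, above it
# compute-bound - which is why decode (tiny batches) wants to be merged
# with prefill (thousands of prompt tokens) in mixed batches.

def arithmetic_intensity(tokens_in_batch, bytes_per_param=2):
    # ~2 FLOPs per parameter per token; weights streamed once per batch
    return 2 * tokens_in_batch / bytes_per_param

H100_FLOPS = 989e12   # approx dense bf16 throughput, FLOP/s
H100_BW = 3.35e12     # approx HBM3 bandwidth, bytes/s
ridge = H100_FLOPS / H100_BW  # ~295 FLOP/byte

for batch in (1, 8, 512, 4096):
    ai = arithmetic_intensity(batch)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"{batch:5d} tokens: {ai:7.0f} FLOP/byte -> {regime}")
```

With batch size 1 (pure decode) the intensity is ~1 FLOP/byte, hundreds of times below the ridge point; a prefill of a few thousand tokens sits comfortably above it.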
I'm pretty sure the sibling comment is right about the different length limits - it comes down to training, and the model starting to talk nonsense if you let outputs run too long.
Chunked prefill or some similar technique is also necessary for serving long context requests where there is not enough GPU memory available, regardless of concerns about latency.
For example, consider a prompt sent to Llama 3.1 405B that uses 128k input tokens.
The KV cache will be 123GB. No matter how many GPUs you shard the model across, you are not fitting that KV cache in GPU memory (an H100 has 80GB).
You can do tensor parallelism 8 ways (8 KV heads). You can also do pipeline parallelism (there are 126 layers). Either way would work. A million tokens is possible, just very slow.
Also, 405B has 8 KV heads of size 128 (hidden_size / num_attention_heads), times 126 layers [0], times 2 (K and V), times 2 bytes (bf16) - that's 504 KiB per token. At FP8 it's 252 KiB.
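Spelling out that per-token arithmetic (using exactly the figures in the comment above - 8 KV heads, head dim 128, 126 layers, bf16):

```python
# KV-cache size per token for Llama 3.1 405B, from the config numbers
# quoted above: kv_heads * head_dim * layers * 2 (K and V) * bytes/element.

def kv_bytes_per_token(kv_heads=8, head_dim=128, layers=126, bytes_per_el=2):
    return kv_heads * head_dim * layers * 2 * bytes_per_el

per_token = kv_bytes_per_token()
print(per_token // 1024, "KiB/token (bf16)")                       # 504
print(kv_bytes_per_token(bytes_per_el=1) // 1024, "KiB/token (fp8)")  # 252

# Scale to the 128k-token prompt discussed upthread:
total = 128 * 1024 * per_token
print(f"{total / 2**30:.1f} GiB for a 128k-token KV cache")        # 63.0
```

Note that 504 KiB/token works out to ~63 GiB for 128k tokens, somewhat less than the 123GB figure upthread - presumably that number assumed a different precision or configuration.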
It is also a training issue. The model has to be trained to reinforce longer outputs, which has a quadratic train-time cost and requires suitable long-context response training data.
They definitely have to be trained to reinforce longer outputs, but I do not believe this adequately explains the low-ish generation limits.
We are starting to see models with longer and longer generation limits (gpt-4o-mini having 16k, the o1 models going up to 64k), as well as longer and longer context limits (often 128k, google offering a million).
I find it very unlikely they are actually training with inputs or outputs near these maximums.
If you want to convince yourself, do the attention calculation math for these sequence lengths.
You can also see how OpenAI restricts the sequence length for fine-tuning to 64k - almost certainly bound by available GPU memory.
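If you do want to do that attention math yourself: self-attention FLOPs per layer grow as O(S² · d) in the sequence length S. The config below is a Llama-70B-ish shape (80 layers, 64 heads, head dim 128) chosen purely for scale - the point is the quadratic growth, not the exact model.

```python
# Rough attention-cost scaling, to see why training near 128k-1M token
# sequence lengths is expensive: QK^T and attn@V are each ~2 * S^2 * head_dim
# FLOPs per head per layer, so cost grows quadratically in S.

def attn_flops(seq_len, layers=80, heads=64, head_dim=128):
    per_layer = heads * 2 * (2 * seq_len**2 * head_dim)
    return layers * per_layer

for s in (4_096, 65_536, 1_048_576):
    print(f"{s:>9} tokens: {attn_flops(s):.2e} attention FLOPs per forward pass")
```

Going from 4k to 64k tokens is a 16x longer sequence but 256x the attention compute (and the naive S×S score matrices blow up memory similarly), which is consistent with labs not training routinely at these maximum lengths.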
I suspect the 4096 limits have been set as a "reasonable" limit for a myriad of reasons.
70B+ models typically run great with my MacBook's 96GB of (V)RAM. I want a Mac Studio to run e.g. llama-405B, but I can't justify the marginal model quality ROI for like $7k or whatever. (But I waaant iiit!)
405B even at low quants would have very low token generation speed, so even if you got the 192GB it would probably not be a good experience. I think 405B is the kind of model that only makes sense to run on clusters of A100s/H100s.
IMO it is not worth it; 70B models at q8 are already pretty darn good, and 128GB is more than enough for those.
Exactly! Have you tried the Phi models? To me, they indicate that we can get much more efficient models. In a few years, 70b on gold standard synthetic data + RL might run circles around SotA. It's such an exciting time to be alive.
For reference, Qwen 2.5 32B on CPU (5950X) with GPU offloading (to an RTX 3090 Ti) gets about 8.5 tokens/s, while 14B (fully on GPU) gets about 64 tokens/s.
For 70B models, I usually get 15-25 t/s on my laptop. Obviously that heavily depends on which quant, context length, etc. I usually roll with q5s, since the loss is so minuscule.
It would be nice to have a comparison to Claude 3.5 for the coder model; only comparing to open-source models isn't super helpful, because I want to compare against the model I'm currently using for development work.
Oof. I'm really not sure why companies keep releasing these mini coding models; 57.1% is worse than gpt-3.5-turbo, and running it locally will be slower than OpenAI's API. I guess you could use it if you took your laptop into the woods, but with such poor coding ability, would you even want to?
The Qwen2.5-72B model seems to do pretty well on coding benchmarks, though no word about Aider yet.
Here is a comparison of the prompt "I want to create a basic Flight simulator in Bevy and Rust. Help me figure out the core properties I need for take off, in air flight and landing" between Claude Sonnet 3.5 and Qwen2.5-14B-Instruct-Q4_K_M.gguf:
Comparable, I guess. But the result is a lot worse than Sonnet's for sure. Parts of the example code don't make much sense. Meanwhile Sonnet seems to take Bevy's latest API into account, and mostly makes sense.
I'm impressed by the scope of this drop.
The raw intelligence of open models seems to be falling behind closed ones. But I think that's because frontier models from OpenAI and Anthropic are not just raw models - they probably include stuff like CoT, best-of-N sampling, or control vectors.
> we are inspired by the recent advancements in reinforcement learning (e.g., o1)
It will be interesting to see what the future brings as models incorporate chain-of-thought approaches, and whether o1 gets outperformed by open-source models.