> ~7 Gbps per core seems really slow for an actual quality production implementation
Not really? With a 128-core machine, that's in the ballpark of 900 Gbps; you're hitting other bottlenecks far earlier than that. And in practice, we're talking about 256 hardware threads for a dual-socket Epyc Milan server, which has been the machine-of-the-day at Google for years now. AMD server processors are so big these days that you could spend a quarter of your compute on QUIC without blinking an eye, serve 450 Gbps per machine on paper, and ultimately hit bottlenecks anyway because disk I/O can't feed the NIC that fast for YouTube serving.
One of the biggest things holding QUIC performance back is a chicken-and-egg problem: vendors don't want to implement NIC offload because there aren't enough companies that want it, and companies don't want to use QUIC because it represents such a big performance drop relative to TCP's decades of tuning with the Linux kernel (in part because of the lack of NIC offload).
You are arguing: “Why should Google care about wasting 25% of their compute costs?” I do not know how much that is, but presumably it is in the billions per year. A 1% saving would be tens of millions per year and that would only require a 4% implementation improvement.
Having done the majority of a QUIC implementation myself, achieving (on the non-encryption portion) 10 Gbps (1.5x faster) seems trivial, 30 Gbps per core (4x faster) seems straightforward, and 100 Gbps per core (15x faster) looks possible.
I was looking for benchmarks of professional implementations to see the limits of the protocol, but all I see are rates in the single-digit Gbps, which I had assumed came from toy re-implementations based on my analysis of what should be possible. But apparently these are state-of-the-art implementations, so now I am trying to figure out whether anybody knows the specific reasons for the performance disparity.
> You are arguing: “Why should Google care about wasting 25% of their compute costs?”
So your estimate is that 25% of Google's compute costs are spent on terminating QUIC connections? I'd be very curious to hear how you arrived at that estimate.
Is that estimate excluding time spent on encryption and networking, as per your other posts?
> Having done the majority of a QUIC implementation myself, achieving (on the non-encryption portion) 10 Gbps (1.5x faster) seems trivial, 30 Gbps per core (4x faster) seems straightforward, and 100 Gbps per core (15x faster) looks possible.
100 Gbps goodput with typical internet MTU sizes will mean about 10M ingress packets per second on the receiving side. That gives you a time budget of about 100 nanoseconds per packet, i.e. a single cache miss takes up the entire budget. Just computing a hash for the 5-tuple to look up the connection in the socket table will be like 10 ns.
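Back of the envelope (a rough sketch, assuming the 100 Gbps goodput and ~1500-byte MTU figures above and ignoring header overhead):

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures from the comment above, not measurements. */
    double goodput_bps = 100e9;                   /* 100 Gbps goodput        */
    double mtu_bits    = 1500.0 * 8;              /* typical internet MTU    */
    double pkts_per_s  = goodput_bps / mtu_bits;  /* ~8.3M full-size pkts/s  */
    double ns_per_pkt  = 1e9 / pkts_per_s;        /* ~120 ns per packet      */

    /* Smaller effective payloads (UDP/IP/QUIC headers, short packets) push
       this toward the ~10M packets/s and ~100 ns figures quoted above. */
    printf("%.1fM packets/s -> %.0f ns budget per packet\n",
           pkts_per_s / 1e6, ns_per_pkt);
    return 0;
}
```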
> But apparently these are state of the art implementations so now I am trying to figure out if anybody knows the specific reasons for the performance disparity.
It would be a performance disparity if you had a working implementation that was as fast as you claim, but as far as I can tell you don't have one yet?
The person I responded to stated: "AMD server processors are so big these days that you could spend a quarter of your compute on QUIC without blinking an eye". I was pointing out that that seems ridiculous, since even minor performance improvements would be valuable if you were, in fact, burning 25% of your compute. Even just restricting that to YouTube video delivery would be significant.
I am aware of the performance characteristics demanded by 100 Gbps. That is why I said "possible", not "trivial". I am also talking about the protocol implementation itself, not the entire network stack. I doubt ~7 Gbps per core is hitting their UDP stack's bottlenecks, so I doubt that is what is actually limiting their performance. And if they were hitting syscall or OS network stack limits, then this entire line of questioning is easily answered by saying so. But then I would wonder why their network stacks are so slow, since UDP handling is even more trivial than managing the QUIC data plane and so should not constitute a bottleneck in any sanely designed full stack.
It is a performance disparity if people complain about "copies" and push for zero-copy over 1-copy transport stacks when you can do a full payload copy at ~7% overhead. Copies are basically irrelevant at that protocol overhead. I also happen to have enough technical ability to evaluate a problem, observe that the implementations seem to fall extremely short of what should be possible, and ask for technical clarification on what the practical limitations are.
But sure, I do not have a full implementation at the speeds I theorize should be possible. I only have a data plane implementation that I consider a toy, going at ~5 Gbps with all optimizations turned off and with zero of the planned performance work done. I would frankly be shocked if I could not get a 4-6x improvement, but that is still only speculative, so maybe I will end up surprised.
> The person I responded to stated: "AMD server processors are so big these days that you could spend a quarter of your compute on QUIC without blinking an eye".
Ah, gotcha. Sorry about missing that context. I agree that nobody would be wasting that kind of compute on protocol overhead.
> I doubt ~7 Gbps per core is hitting their UDP stack bottlenecks, so I doubt that is what is actually limiting their performance.
Everything will limit the performance, and those limits will add up. Amdahl's Law is a harsh mistress.
> But then I would wonder why their network stacks are so slow since UDP handling is even more trivial than managing the QUIC data plane, so should not constitute a bottleneck in any sanely designed full stack.
A single cheap system call will likely cost you around 200 ns. A typical server would need like three system calls per packet (a poll, a read, and a write to ack it).
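To make that concrete, a minimal sketch of that naive per-packet pattern (illustrative only, not from any particular implementation; error handling elided):

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* One datagram serviced with three system calls: poll, read, write. */
void serve_one_packet(int fd) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char buf[1500];
    struct sockaddr_storage peer;
    socklen_t peerlen = sizeof(peer);

    poll(&pfd, 1, -1);                                         /* syscall #1: wait     */
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                         (struct sockaddr *)&peer, &peerlen);  /* syscall #2: read     */
    if (n > 0) {
        /* ... decrypt, process, build an ack-bearing packet (elided) ... */
        sendto(fd, buf, (size_t)n, 0,
               (struct sockaddr *)&peer, peerlen);             /* syscall #3: send ack */
    }
}
```

At ~200 ns per call, that's roughly 600 ns per packet before doing any QUIC work at all.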
You can get much lower overhead with various kinds of kernel-bypass networking, but then the deployment story gets a lot harder.
> I only have a data plane implementation that I consider a toy going at ~5 Gbps with all optimizations turned off and having done zero of the planned performance work.
Right, but it sounds like both the current results and the projected ones are with neither encryption nor network I/O? I'm pretty sure that nobody else is publishing benchmark results from that kind of setup. They'd be sending the traffic over a network (at least looping back over a network card) using standard operating system functionality, as well as doing encryption. And still doing it at 7 Gbps.
Three system calls per packet is not a serious design.
First of all, QUIC supports ack ranges which allow ack coalescing, so you do not need to average one ack per packet.
Second of all, QUIC supports frame packing, so you can piggyback on the opposing flow, though in these one-way benchmarks that should not matter.
Third of all, even one syscall per packet is not a serious design. You should be doing batched packet reads and writes to amortize that overhead, and APIs for this (recvmmsg/sendmmsg on Linux) have existed for over a decade.
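For concreteness, a minimal sketch of what the batched receive looks like with recvmmsg(2) on Linux (batch size and buffer layout are arbitrary; sendmmsg(2) is the mirror image on the transmit side):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH   64
#define PKT_MAX 1500

/* Pull up to BATCH datagrams out of the socket with a single system call,
   amortizing the per-syscall cost across the whole batch. */
int read_batch(int fd, char bufs[BATCH][PKT_MAX]) {
    struct mmsghdr msgs[BATCH];
    struct iovec   iovs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = PKT_MAX;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return recvmmsg(fd, msgs, BATCH, 0, NULL);
}
```

On return, each msgs[i].msg_len tells you how many bytes landed in the corresponding buffer.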
And again, these other aspects should have zero meaningful performance impact when comparing against ~7 Gbps unless they, themselves, are outrageously slow. But then these would not be QUIC benchmarks; they would be “my outrageously slow encryption and UDP stack” benchmarks. And it still falls back to my overarching point, which is that ~7 Gbps per core for your end-to-end networking seems awfully slow. If QUIC is not your bottleneck, then why not benchmark non-bottlenecked systems? If QUIC is your bottleneck, then that seems really slow.
GGGGP here. I just don’t think throughput is a current design goal. To me, it’s critical. But to Google and many others, it was more about latency and connectivity when moving across networks. If you have a search or LLM application, bandwidth isn’t gonna be a constraint.
I am not saying that we are wasting 25% of our compute costs on QUIC, but rather that we hit many other bottlenecks far before we hit that point.
> Having done the majority of a QUIC implementation myself, achieving (on the non-encryption portion) 10 Gbps (1.5x faster) seems trivial
Encryption is a huge part of the cost, especially with the general lack of NIC offload as I mentioned. Because the packet number changes across retransmissions, lossy environments exacerbate the problem: you need to re-encrypt retransmitted packets (which is not a problem you have in TCP). If you're not doing any sender-side pacing, then you're likely to overwhelm your NIC with microbursts.
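To illustrate why the retransmission has to be re-encrypted: RFC 9001 derives the AEAD nonce from the packet number, and a retransmitted frame goes out under a new packet number, so the old ciphertext cannot be reused. A rough sketch of the nonce construction (illustrative, assuming the 12-byte IVs used by the standard AEAD suites):

```c
#include <stddef.h>
#include <stdint.h>

/* Per RFC 9001: left-pad the packet number to the IV length and XOR it with
   the IV. A new packet number therefore means a new nonce, and hence a fresh
   encryption pass over the retransmitted payload. */
void quic_aead_nonce(uint8_t nonce[12], const uint8_t iv[12], uint64_t packet_number) {
    for (size_t i = 0; i < 12; i++)
        nonce[i] = iv[i];
    for (size_t i = 0; i < 8; i++)
        nonce[11 - i] ^= (uint8_t)(packet_number >> (8 * i));
}
```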
100 Gbps per core is quite frankly unrealistic; the only way you get that even in TCP is if you have your smart NIC talking directly to your SSDs and bypassing main memory and the kernel TCP stack entirely, and all of your encryption is done by the NIC. But as mentioned, we don't have that level of NIC support.
Even 30 Gbps per core doesn't leave you much wiggle room if you have to perform AES via your CPU.