The hardware selection and test setup are questionable, yeah. But downclocking the CPU is pretty standard for benchmarks like this.
It would have been better to evaluate the effects of CPU frequency changes separately. But there's nothing wrong with downclocking in general if you want to evaluate the effects of a CPU bottleneck when you don't have high-speed links available.
Thanks to the cloud, everyone has access to 10G/25G/100G.
It shouldn't be taken for granted that a CPU bottleneck exists, although from Netflix's work we know that serving >100G from one server is CPU-intensive.
But is there any point in profiling on a laptop? Thermal throttling is a bigger issue than on a desktop, where even there you have to faff about with C-states if you want somewhat believable stats.
If the intention was to artificially reduce performance for testing their code, it would have made sense to also run it with just one memory module (a 50% bandwidth loss) and to underclock the memory.
Disabling TCP offload to hardware makes this comparison much less useful. TCP has hardware offload right now, and QUIC will take years to get to that level.
I'm much less worried about TCP segmentation offload being disabled than about a single-core, single-threaded test. QUIC is going to hurt performance the most on multicore, multithreaded loads, where with TCP you're getting the NIC to hash packets and deliver them to per-CPU rx queues, (hopefully) avoiding a lot of cross-CPU locking. That's going to be painful with QUIC unless/until you can get NICs to hash on the connection IDs.
The NIC can hash on IPs and UDP ports just fine, nothing to worry about here.
Sure, there's the rare case of a connection migrating to a different L3/L4 flow identifier, landing on a different queue and taking a cache miss, but that doesn't happen often enough to be a relevant bottleneck.
Having a receive queue on each CPU is usually counterproductive in my experience ... one per longest-latency-cache domain is better in most cases. Anyway, these days many NICs' multiqueue resources have been outstripped by the number of cores in the box. We have NICs with 32 or 64 queues in boxes having 88 or 128 CPUs.
RFS flow tables are installed by sendmsg and recvmsg, so I fail to see why RFS doesn't benefit UDP just as much as TCP. If you are using RFS, your QUIC packets should still arrive on the correct CPU.
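For what it's worth, you can get sane per-CPU steering for QUIC today with nothing more exotic than socket options: the NIC's RSS hash on the UDP 4-tuple spreads flows across queues, and SO_REUSEPORT plus SO_INCOMING_CPU (settable since roughly Linux 4.4) lets each worker thread pick up the flows landing on its CPU. A minimal sketch, assuming one worker per CPU and port 443, with error handling omitted:

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* One UDP socket per worker; the NIC hashes the UDP 4-tuple across RSS
     * queues, and SO_REUSEPORT + SO_INCOMING_CPU let the kernel deliver each
     * flow to the socket owned by the CPU that received it. */
    int open_worker_socket(int worker_cpu)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int one = 1;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &worker_cpu,
                   sizeof(worker_cpu));

        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(443),
            .sin_addr   = { .s_addr = INADDR_ANY },
        };
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        return fd;
    }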
> We have NICs with 32 or 64 queues in boxes having 88 or 128 CPUs.
Doesn't that mean that you should have more NICs?
From my recollection, the main thing that's made a computer a server or mainframe for the last 50 years has been that such systems have enough IO channels to saturate the CPU (i.e. they're intended to be CPU-bound, not IO-bound, under heavily-concurrent workloads.) If you take all the IO-offload cards out of a mainframe, it's just a regular computer.
PCIe Gen 3 is becoming a real problem here. There are only so many lanes you get per socket; it's entirely possible today to end up IO-bound just due to PCIe bandwidth, not network speeds or CPU processing power.
Not really my area of expertise by any means, but I got a bit excited by this talk[1]. The idea is that a PCIe switch can be used to send data directly from one PCIe device to another without hitting the host controller. NICs are mentioned as a use case.
While the talk is centered around RISC-V, there's nothing RISC-V specific about this, just that it enables wimpy cores to control multi-Gbps traffic.
Maybe, but I don't see why you'd want so many queues. If you are receiving 1 million frames per second, each 10 kbit long (10 Gbit/s), and you spread that over 64 receive queues, you will process exactly one packet per interrupt, maximizing the cost. If you have only four receive queues (one for every cache domain on a Zen 2 CPU socket, as an example) and you set your interrupt mitigation to something modest like 10 microseconds, then you will receive several frames per interrupt, and additionally you will not be polluting the L1i cache of the other 60 CPU cores.
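If anyone wants to try that, the mitigation window is the same knob `ethtool -C <dev> rx-usecs 10` sets; here's a rough sketch of the underlying ioctl, assuming an interface name and skipping error handling:

    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>

    /* Set RX interrupt mitigation on ifname, e.g. 10 microseconds. */
    int set_rx_coalesce_usecs(const char *ifname, unsigned int usecs)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for SIOCETHTOOL */
        struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&ec;

        ioctl(fd, SIOCETHTOOL, &ifr);              /* read current coalescing settings */
        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = usecs;              /* only touch the RX usecs field */
        return ioctl(fd, SIOCETHTOOL, &ifr);       /* write them back */
    }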
With Intel DPDK now in a fairly good state, it's entirely possible to get very attractive performance in userspace. You can also leverage CPU instructions for crypto primitives, when available, to speed things up.
The amount of TCP hardware offload in Linux right now is actually fairly limited. Kernel developers have largely rejected true in-hardware TCP stacks for a while now (although you can get this functionality with high-end NICs and custom kernel builds). That said, since QUIC generally reduces the ACK frequency, it should be possible to match or exceed TCP with hardware offload, especially when you look at total throughput with many streams open.
If I understand correctly, the whole point of QUIC is not to be implemented at the hardware level. That's why QUIC exists: TCP can't be changed because it is baked in at such low, fundamental levels.
It's not that TCP can't be changed, it's that it's the only option that works.
TCP is baked into firmware on edge routers and firewalls and wifi devices that are beyond the period of support and the companies that created them aren't interested in supporting them or have gone out of business.
There are other layer-4 protocols that could have been used that are better than TCP on some workloads, but TCP and UDP were the only things configured/supported; many firewall default configurations filter out things they don't recognize, so anything new is a non-starter. And because of the perception of responsibility when it comes to networks, if I support a new protocol and you can't connect to it because there are older devices between you and me, it becomes my problem.
So much this. It's ossified so far that if your DHCP packet isn't padded in just the right way, some firewalls will block it. Somebody misread the RFC and put in some rule, and now many enterprise firewalls do this. Sigh.
Anyway, anything not 'normal' is being filtered out somewhere. We even (at one startup) considered tunneling RTP through HTTP! My god.
QUIC is built on UDP, not sure if you were aware based on your 3rd paragraph?
> One concern about the move from TCP to UDP is that TCP is widely adopted and many of the "middle-boxes" in the internet infrastructure are tuned for TCP and rate-limit or even block UDP. Google carried out a number of exploratory experiments to characterize this and found that only a small number of connections were blocked in this manner.[3] This led to the use of a rapid fallback-to-TCP system; Chromium's network stack opens both a QUIC and traditional TCP connection at the same time, which allows it to fallback with zero latency.[17]
The whole point is to improve network traffic. The only avenue to do that is in user space until new strategies make their way down to the OS and hardware. If QUIC is good enough eventually someone will make hardware for it.
No, the point of QUIC is to shave a few % of traffic off of the loads the likes of FAANG have, saving millions, while having a marginal benefit to pretty much everyone else.
QUIC can be 'a bit better' in 'some scenarios', but most of the time it is totally irrelevant, except for the fact that it requires TLS 1.3 which is a good requirement overall.
QUIC in the datacenter is hugely superior to Linux kernel TCP. Kernel TCP has been inappropriate for datacenter networking for decades and this has been a known issue mentioned many times in the literature (start with [1] if this is new to you). Datacenter operators have been forced to either patch the kernel or bypass it to get decent TCP performance. QUIC is a rational response to this situation. Instead of waiting a quarter century or more for Linux to get its TCP stack in order, switch to UDP and innovate in user space.
> Kernel TCP has been inappropriate for datacenter networking for decades
I'm not sure your conclusion can be drawn from the cited paper alone, which is now 11 years old. The paper describes a phenomenon that occurs when switch hardware buffers overflow on high-throughput networks -- a problem that has since been resolved in many datacenters that use software-defined networking and more modern hardware.
All of the people I've met in my career who believed their frames were not being dropped in practice were people who simply hadn't bothered measuring it. Perhaps my experience was warped by working at Google where they build the dumbest, cheapest switches they can imagine, where frame drops even for the highest priority flows are rampant, but after Google I've seen the same thing in multiple production networks so I'm pretty sure at this point that frame drops are endemic.
Keep in mind that there are only two outcomes to networking: the frame is either forwarded or dropped. Whenever you encounter software-defined networks you need to ask which frames the network is dropping, because that's the only action an SDN can take.
Incast is a stochastic phenomenon and the point of the paper is there is some probability, which is not zero, that too many frames will arrive at the same point in the network at the same time, forcing some to be dropped. This probability cannot be zero regardless of the buffer depth. Eric Dumazet has written a lot about how adding entropy in packet scheduling at the host level helps avoid frame drops. The fact is the problem cannot be solved by the network, it has to be solved by the hosts at the edge (and this is basic nethead-vs-bellhead philosophy going back 40 years).
I no longer have access to stats, but at my last job, we did measurements of packetloss on the backend tcp connections between our hosts. The vast majority of the time, there was either no loss, or some very small amount of loss (maybe < 10 packets lost in a minute).
Generally, when we saw losses, we were able to work with the datacenter to track down bad cables, or overloaded paths, or overloaded fabrics (which was usually in their 'legacy' datacenter). I expect the datacenter networking was overprovisioned (separate backend network, dual NICs on hosts, multiple uplinks on switches, etc), but it was darn reliable. Of course, it would have been nice if the people who owned the datacenter had been monitoring all the stuff their customer (us) was able to find was broken. And some sort of multistream connectivity would have helped avoid the paths with loss; occasionally the loss would be bad enough to impact service, and we'd need to force reconnects until the TCP stream went on a different path.
I can't speak for the parent, but I think what he's saying is that these are phenomena that start to become apparent at extremely high aggregate throughput rates with extremely high device density - i.e., Google scale.
By saying it's "inappropriate for datacenters," I think he painted with too broad a brush - obviously not all are the same, and not all run at Google scale. It depends on the datacenter and the workload.
Sort of. These things are relevant whenever you have congestion. If you have no congestion then the congestion-control features of your protocol aren't worth discussing. On a perfect network any protocol will do the job. All real networks are imperfect, which is why the congestion-control features of TCP and QUIC are worth discussing.
So now the question is, in the presence of dropped frames/packets, is this a problem that needs a wholesale L4 protocol replacement (a la SCTP or QUIC), or is this something that only requires some improvements to TCP or maybe tuning some knobs?
I've been hearing arguments that "TCP sucks and is irredeemable" for decades, too, yet the world still turns and our devices that use it continue to work reasonably well at increasing bandwidth (2.4kbps to 40Gbps+) - at least as far as the public is concerned.
A number of the knobs are cast in stone in the RFCs and kernel maintainers have been resistant to exposing them. These slides mention several of the TCP parameters that are off by 2-4 orders of magnitude and some ways to remediate this. Note the last slide: Google has been patching around Linux TCP brain damage for 15+ years.
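To make that concrete: the per-socket knobs Linux does expose are things like the ones below, while the constants the slides complain about (the ~200 ms minimum RTO, the 1 s initial RTO, delayed-ACK timing) are compiled in or only reachable via `ip route ... rto_min`. A rough sketch with arbitrary example values:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    void tune_tcp_socket(int fd)
    {
        int lowat = 16 * 1024;                /* cap unsent bytes buffered in the kernel */
        setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));

        unsigned int user_timeout_ms = 5000;  /* abort if data stays unacked for 5 s */
        setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout_ms,
                   sizeof(user_timeout_ms));

        int nodelay = 1;                      /* disable Nagle for latency-sensitive traffic */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));
    }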
Not sure these are opposing topics. My understanding of SDN is simply that a program may decide to drop certain frames, and this can run on top of ordinary Ethernet.
In theory, if SDN is on top of it, or below it (or both), it would be better to use QUIC than IP-in-TCP or EoIP (or GRE, PPP, etc.), because the connections within the SDN would already assume the medium might be lossy, and stacked retransmission will really blow performance (hence VXLAN and the like are used more than the classical encapsulations, AFAIK).
So again: it's for when you do large scale data center communication. Perhaps not just FAANG level, but say, having a few private cages before it starts delivering much of an improvement?
I don't think you need large scale to benefit. Any time you are latency-sensitive, have several machines that are microseconds apart, and there is a chance of dropping frames, you will suffer from Linux TCP bogosity. It can't comprehend microsecond RTTs. If you drop a frame it will be a minimum of 15ms before Linux notices (with TLP, 200ms otherwise). That's orders of magnitude worse than it needs to be.
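You can see the mismatch directly by asking the kernel what it thinks the timers are on a connected socket; a small sketch (the TCP_INFO fields here are reported in microseconds):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /* On a microsecond-RTT datacenter link you'll typically see an srtt of a
     * few dozen microseconds next to an RTO of hundreds of milliseconds. */
    void dump_tcp_timers(int fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
            printf("srtt=%u us rttvar=%u us rto=%u us\n",
                   ti.tcpi_rtt, ti.tcpi_rttvar, ti.tcpi_rto);
    }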
So again: large scale or real time, which is not something most setups do. If you process video and audio streams, that makes sense, or if retransmissions are costly at scale. But does it matter for anything else?
I'm trying to find the actual general benefit, but so far I'm not finding it. There is a technical benefit, but not on the scale of say h2 multiplexing or binary vs. plain-text protocols.
Sure, if you're not trying to serve with low latency then this doesn't matter to you. But if you're happy with hundreds of milliseconds of tail latency you also don't need SSDs, 64-core CPUs, 100g ethernet, or anything else newer than ten years old. It's only interesting to evaluate the software on the frontier of performance.
How much does TCP offload help in practice, in particular for TLS connections? Is it a 20% improvement, a 2X improvement, etc.? That would be good to know to evaluate the performance of QUIC in the article.
CPU utilization is a zero-sum game. If you need to use the CPU to calculate checksums or do packet segmentation then it’s that much less CPU available for TLS.
So not quite... a lot of modern CPUs have crypto primitives implemented in hardware that allow other instructions to run simultaneously on the same core (taking advantage of hyperthreading and out of order execution).
Another way is if it takes a couple CPUs to run my application, a CPU or two to drive the storage, and a CPU to do TLS, why would I also spend a CPU on TCP if my network card already has one that does that?
That's a little bit like what programmers are talking about when they are talking about "multi-core programming", except those are general-purpose "cores", and (presumably) these "cores" can only handle doing TCP.
I think the question is: given that you're going to be performing pretty heavy symmetric encryption on every byte anyway, how much can TCP offload really help when all it's handling is the relatively trivial checksum and sliding window computation?
Isn't a large part of the performance gain from TSO offloading on transmit due to avoiding working with way too many skb's in the kernel?
Shouldn't GSO for UDP with proper NIC support give you most of the performance gain?
I'm not sure about the implementation details of GSO and TSO in Linux. But I know a thing or two about the hardware: the Intel Niantic/82599 NICs support UDP transmit segmentation offloading, I've checked the datasheet real quick and it says "Note that current UDP segmentation offload is not supported by any standard OS."
Not sure if this is just an outdated comment in a 10-year-old datasheet or if there's some performance to be gained by using this UDP segmentation offload for QUIC.
Edit: Should have clicked on the link in the article [1] before posting here; UDP segmentation offloading is coming to lots of NIC drivers in 5.5.
Always interesting how the hardware often supports way more features than the drivers expose and you suddenly get new hardware features on a 10+ year old NIC model :)
(Similar story with IPsec offloading on the Niantic/82599)
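For reference, the software half of this has been available since Linux 4.18 as the UDP_SEGMENT socket option (the 5.5 work linked from the article is about pushing the same thing down into NIC drivers). A minimal hedged sketch; the 1452-byte segment size is just an example:

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103   /* from linux/udp.h, in case the libc headers predate it */
    #endif

    /* Hand the kernel one large buffer and let GSO (or the NIC, where the
     * driver supports hardware offload) slice it into 1452-byte datagrams,
     * instead of paying the full sendmsg() cost for every packet. */
    void send_gso_burst(int fd, const char *buf, size_t len)
    {
        int gso_size = 1452;   /* one QUIC packet per segment, example value */
        setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));
        send(fd, buf, len, 0); /* len may cover many segments, up to ~64 KB */
    }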
Yes, quite. The i219 chipset in my laptop "includes advanced interrupt handling features to reduce CPU overhead. Other performance-enhancing features include offloading TCP/UDP (for both IPv4 and IPv6) checksum calculations and performing TCP segmentation. Advanced features such as Jumbo Frame support for extra-large packets and Receive Side Scaling (RSS) are also supported."
It's also supported on the old chipset from the motherboard I bought more than half a decade ago, and every chipset I've used in the server space in as long as I can remember.
Their implementation of congestion control (Reno) is 85 lines, including a big copyright block comment. That's cute, but it's completely incomparable to the transmission intelligence in the Linux TCP stack, especially for the line of business Fastly is engaged in (maybe the technical staff doesn't even know what that is, due to poor leadership?): trying to get goodput to devices across congested and lossy links. This article is highly comical. I've yet to be impressed by Fastly aside from offloading itself to retail investors with another nonsensical IPO.
It sounds like you know a lot about this field, but can you please post more substantively, and in particular replace name-calling ("highly comical" "nonsensical") with actual details? If you know more than others, that's great, but the thing to do on HN is to share some of what you know so the rest of us can learn.
A phrase like "cute but completely incomparable" doesn't teach the reader anything. It would be much better to say what the relevant aspects of "transmission intelligence in the Linux TCP stack" are, and what makes them relevant.
It teaches readers not to trust a brand's PR jobs, and brands not to mislead. Comedy is intricate, and yes, I find it quite funny that, opining here during my lunch break, I'm held to higher standards than a $2B company.
A 747 is not comparable to a Cessna. The latter is cute but not comparable, and it doesn't require a dissertation to comment on the quality among other computing professionals.
Attempts to make comedy by putting others and their work down on the internet are rarely funny, and in any case the HN guidelines specifically ask commenters not to do that. If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and sticking to the rules when posting here, we'd be grateful. Note "Don't be snarky." and "Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
"A 747 is not comparable to a Cessna" doesn't help because it's really still just calling names. What would help is an explanation of why you're saying that. What are the significant differences and why do they matter? Lots of readers here would find that interesting. If you don't want to spend your lunch break writing an explanation to help people learn, that's totally fine. But at least please don't post empty comments that make HN worse and break the rules. Not posting is always an option.
You may be overestimating the extent to which readers here will already understand the details of what you mean. No doubt some do, but I'm pretty sure the majority don't. Even most computing professionals aren't specialists in this area. That's why the opportunity to share information that readers find interesting and can learn from is such a good one. You can certainly (indeed, you should) make a strong critique of inflated and misleading claims, but do it with solid information, not shallow putdowns that any internet snarker can write whether they know anything or not.
Thanks. It's true that the term 'name calling' is used a bit differently in the HN guidelines than in general circulation, so I can see why that sounds a bit weird.
> I've yet to be impressed by fastly aside from offloading itself to retail investors with another nonsensical IPO.
Wow, that's quite a strong sentiment, especially given CDNs are your area of expertise [0].
I was (am?) of the opinion that Fastly is a genuine competitor to Cloudflare given their product offering, and that their engineering was more than a match (though, to be honest, I'm mostly going by the presentations their engineers have given over the years). Your statement, to me, makes it seem like Fastly is smokes and mirrors. I'm honestly surprised.
Cloudflare and Fastly are in somewhat different industries even though they can do some similar things. Fastly has a small number of large customers, Cloudflare has an astronomical number of customers which makes what they do even harder. From the outside looking in I would rate Cloudflare as one of the most technical companies in systems software industry at the moment. It's probably because CF has a highly technical CTO.
They use Reno? I don't know what's their business, but hopefully it doesn't involve lossy links like mobile data, which often drop packets even when uncongested...
Reno can work nicely for low packet loss links, though.
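For anyone who hasn't looked at it, the core of Reno really is tiny, which is why minimal QUIC stacks reach for it first. A rough illustrative sketch (the struct and names are made up; fast recovery and RTO handling are glossed over):

    #include <stddef.h>

    struct reno {
        size_t cwnd;      /* congestion window, bytes */
        size_t ssthresh;  /* slow-start threshold, bytes */
        size_t mss;       /* maximum segment size, bytes */
    };

    static void reno_on_ack(struct reno *c, size_t acked)
    {
        if (c->cwnd < c->ssthresh)
            c->cwnd += acked;                    /* slow start: roughly doubles per RTT */
        else
            c->cwnd += c->mss * acked / c->cwnd; /* congestion avoidance: ~1 MSS per RTT */
    }

    static void reno_on_loss(struct reno *c)
    {
        c->ssthresh = c->cwnd / 2;  /* multiplicative decrease */
        c->cwnd = c->ssthresh;      /* an actual RTO would drop cwnd to 1 MSS instead */
    }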
Just pertinent to this article, take a look at the links to their QUIC implementation. The reason I bring this up is that it's pretty easy to make a simplified fast-path TCP stack that doesn't have all the branch-heavy code necessary to do what the Linux TCP stack does well. So this is like a Phoronix-style "benchmark". Even an undergrad-level understanding of systems tells you that you can get equal performance in userspace or kernel space with sufficiently holistic thinking and accounting of data movement. I would caution that very few people are qualified to do this in userspace, though, which is paradoxical because you will find a lot more people showing up claiming they know how.
So basically this isn't a competent test; it isn't the production-grade QUIC from Google, it's a toy, and the result is complete nonsense.
I'm no TCP (or even congestion control) expert, but I wouldn't have needed to test their code in real life to tell that exponentially reducing the send window on a lost packet is just not going to work in real life in 2020.
They should have at least tested their code with simulated jittery latencies and the packet loss typical of modern 4G networks. I bet Linux TCP would have left their implementation in the dust.
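A back-of-envelope check with the classic Mathis et al. approximation (throughput ≈ MSS/RTT × 1.22/sqrt(p)) already shows how badly a pure loss-based controller does on such a link; the RTT and loss figures below are just plausible assumptions:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double mss  = 1460;  /* bytes per segment */
        double rtt  = 0.05;  /* 50 ms, a plausible 4G RTT (assumption) */
        double loss = 0.01;  /* 1% non-congestion packet loss (assumption) */

        /* Mathis et al. steady-state estimate for loss-based congestion control */
        double bytes_per_sec = (mss / rtt) * (1.22 / sqrt(loss));
        printf("~%.1f Mbit/s\n", bytes_per_sec * 8 / 1e6);  /* roughly 2.8 Mbit/s */
        return 0;
    }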
Another thing that caught my eye was their assumption that the MTU is the same on the return path... probably often true, but I definitely wouldn't count on it!
Huge respect for them for putting their necks out and releasing code! Maybe this is their intention as well, to gain understanding through criticism.
Also, perhaps this was just their MVP? It would certainly explain the weird choice of congestion control.
It's just marketing bullshit, I don't feel bad at all for offering damning criticism. Companies hire people to do this kind of stuff on purpose. Many unfortunate software trends of the last decade have been started by equally flimsy PR jobs.
Interestingly, I've found TCP_Illinois[0] to do the best on lossy links. Case in point was a cable modem ISP I used for about a year. The upstream was rated at 10 Mbps, but it would bounce between 1-5 Mbps. Using a UDP test, I noticed about 0.2-1% packet loss on the upstream only. I tried all the TCP algorithms in Linux, and Illinois did the best. And by "best" I mean it would hold a steady speed between 8-9 Mbps while others would get quite jumpy at times (or just never go above 5 Mbps). I'm limiting upstream to 9 Mbps via fq-codel as well. I've always found it strange Illinois is never brought up anywhere. It even has tunable parameters too.
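If anyone wants to repeat that experiment, switching the algorithm per connection is one setsockopt away (system-wide it's the net.ipv4.tcp_congestion_control sysctl, and the tcp_illinois module has to be loaded); a quick sketch:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    void use_illinois(int fd)
    {
        const char algo[] = "illinois";
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, sizeof(algo) - 1) != 0)
            perror("TCP_CONGESTION");   /* e.g. module not loaded / not permitted */

        char current[16];
        socklen_t len = sizeof(current);
        getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, current, &len);
        printf("congestion control: %.*s\n", (int)len, current);
    }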
It's a hybrid, meaning it may struggle in direct competition with pure loss-based algorithms. CUBIC, the Linux default, is somewhat notorious for causing bufferbloat issues, but these things tend to play out like an arms race. BBR is another hybrid and has tons of heuristics and tuning to try to adapt to these arms-race conditions.
Not questioning your conclusion; out of curiosity, did you also measure latency?
You could get very close to maximum possible just by massively delaying & coalescing ACKs. In other words nearing blind transmission without caring about packet loss at all.
I believe congestion control would be at the application layer with QUIC; that is, is it left up to L7 protocols like HTTP/3 to implement their own congestion control and loss recovery?
For HTTP/3, then, would fq-codel + BBR work nicely for foreground traffic like voice or video over WebRTC and, say, LEDBAT for background traffic like HTTP downloads or, say, bit-torrent over WebRTC?
Not really, I was just looking at the maximum average upload speed. The test was done on my laptop; my firewall uses fq-codel, so the max bufferbloat delay would be below 50ms in most cases.
Sorry, should have been more specific. I wasn't talking about latency bufferbloat induces in other streams. But extra latency in any stream with packet loss.
I meant stream of the receiving party would effectively block on any lost packet. You can't read past what you don't have yet.
Imagine a stream that consists of 10 packets, and for simplicity that each packet takes a second to send. (Real life scenario would be about several orders of magnitude more packets and similarly shorter packet transmission times.)
You receive packets 1 and 3-10. Thus in one second the application can see data from packet 1. But the application's read is going to block until you also receive packet 2. Reception of packet 2 can't happen until 8s is spent receiving packets 3-10 and an additional 1s is spent resending packet 2 (plus whatever it took to send the ACK).
So data from packet 2 would be 9s late from receiving application's point of view.
Bulk transfers would approach the maximum possible transfer rate because the sender lost no time waiting for ACKs, but this would be pretty bad for something like remote desktop or streaming video.
No sane general purpose network stream protocol would implement this policy for obvious reasons. But you'd sure get high bandwidth for file transfers!
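Putting the 10-packet example above into a few lines makes the head-of-line penalty explicit (timings follow the simplified one-second-per-packet assumption, with packet 2's retransmission arriving last):

    #include <stdio.h>

    int main(void)
    {
        /* Arrival times in seconds; packet 2's first copy is lost and its
         * retransmission only lands at t=11, after packets 3-10. */
        double arrival[11] = { 0, 1, 11, 3, 4, 5, 6, 7, 8, 9, 10 };
        double readable = 0;  /* when the application can actually read each packet */

        for (int p = 1; p <= 10; p++) {
            if (arrival[p] > readable)
                readable = arrival[p];
            /* "late" compares against the loss-free schedule, where packet p
             * would be readable at second p; packet 2 comes out 9s late and
             * drags packets 3-10 with it. */
            printf("packet %2d: arrives %2.0fs, readable %2.0fs (%2.0fs late)\n",
                   p, arrival[p], readable, readable - p);
        }
        return 0;
    }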
All well and good, but TCP has been around for decades and there are top quality network stacks providing TCP sockets on top of almost anything that can run on a CPU. Is there a (preferably BSD-licensed) QUIC piece of code that can be retrofitted where the TCP stack of an OS used to be? Or are we going to start linking a piece of the network stack into the applications?
>Is there a (preferably BSD-licensed) QUIC piece of code that can be retrofitted where the TCP stack of an OS used to be?
No, and that's a current drawback of QUIC. But of course it's just a current drawback; the investment cost of integrating this into the operating system rather than into applications will certainly be eclipsed by the benefits as QUIC is deployed more widely throughout the internet.
What you call "ossification" others would call "stabilization" and "standardization". Not to say that the inertia of (some) kernels is not excessive sometimes but I'm not a fan of the "fuck it, we'll do it in userland" attitude. That creates big, opaque application blobs that nobody understands and knows how to debug with standard tools.
I feel like Google in particular loves to do that stuff because while they control the internet and the browser, they don't (yet) control the OS so they have a very strong incentive to push stuff up in order to retain control. They don't have to ask anybody to add things into Chrome, they have to get third parties on board to extend the kernels (outside of Android at least).
What's extra frustrating is that Google does control the OS for Android, but they don't use that to make TCP better.
Things they could do:
a) enable MTU blackhole probing, so people behind dumb networks can talk to servers that don't artificially reduce MSS. Apple has a pretty aggressive implementation of MTU probing, and it's very effective.
b) put in hooks for userspace congestion control. Make sending faster (or not) with updates from Google Play services (or whatever)
c) the article mentions sending fewer acks. That could probably be done in TCP too --- it makes sense to make that a setting that they could roll out.
d) some sort of name-and-shame program for bad networks / bad network devices. It's 2020 and people running PPPoE and sending MSS 1480 when it's really 1472 should be ashamed of themselves, but big companies just set their servers to MSS 1460 and call it a day (a rough sketch of that per-socket clamp is below). :(
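Regarding (a) and (d): the server-side clamp those big companies apply looks roughly like this per socket. The 1400 is just an example value, and the cleaner fix is enabling blackhole probing with `sysctl net.ipv4.tcp_mtu_probing=1`:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Clamp the MSS this listener advertises/uses so segments squeeze through
     * paths with PPPoE or tunnel overhead even when ICMP "fragmentation
     * needed" messages are filtered. Set before listen()/connect(). */
    void clamp_mss(int listen_fd)
    {
        int mss = 1400;  /* conservative example value */
        setsockopt(listen_fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
    }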
QUIC is a standard; it will be stabilised in time.
TCP can't be improved on much at this point, and there are some pretty massive flaws in it that QUIC solves.
TCP isn't going away any time soon, or being "replaced" by QUIC, but for certain applications QUIC is going to be a massive improvement, albeit at an increased complexity and application-size cost.
Given that performance is one of the motivations for QUIC, benchmarking needs to be done at every stage of development. Also, QUIC is close to being finalized so I don't think it's too premature.
HTTP/1.1 is somewhat broken. For example, pipelining and keeping the connection open are unreliable, and there are data-integrity issues like a missing content length followed by unexpected connection loss, etc.
It also allows ISP shenanigans like content modification to add unwanted advertising and unlimited tracking, even spying.
Can you back that up? HTTPS is a requirement for doing anything securely. It seems like a lazy troll to just dismiss it with no explanation. Similarly, even a cursory amount of research into WebSockets or HTTP/2 will turn up real engineering problems which are solved by those protocols. Can you address any of them?
I hash the password with a one-time server salt in the client (instead of HTTPS) and I use simple HTTP comet-stream for real-time data (instead of WebSockets):
HTTP/2 has the TCP head of line problem which makes it completely useless.
HTTP/3 is trying to be more nimble, but it's a protocol! Ossification is a _feature_ of protocols; for TCP it allows the backbone to evolve without everyone re-implementing everything over and over again at the edges!
The solution is to make TCP like UDP instead, by allowing sockets to not resend missed packets.
Another real solution is to move the network stack out of the kernel on Linux.
Danger, non-server hardware detected.
> we limited the CPU’s clock from 2.2GHz down to 400MHz
Danger danger, system imbalance approaching critical levels; please evacuate to a safe distance.