Been working on something similar as a sub-sub-sub contractor for Alibaba's DCs.
The physical latency is nowhere near as scary as latency of a shitty NIC and software... 20 metre roundtrip is only 200 nanoseconds, or about the same as pulling something from DRAM.
The Linux network stack needs to get its shit in order. Kernel bypass libraries like EVFI transmit UDP in 250ns whereas the kernel typically takes several usec.
Transitioning into kernel space shouldn't take that long, so a lot of needless instructions must be getting executed.
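To reproduce the "several usec" kernel-path number yourself, here's a minimal sketch (loopback UDP, no receiver needed since unconnected UDP sends don't surface ICMP errors; numbers will vary with machine, NIC and kernel config):

    /* Rough measurement of the kernel UDP send path: time N sendto() calls
     * to a loopback address and report the average. Not a rigorous benchmark,
     * just enough to see "a few microseconds per send" on a typical box. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9000);                /* arbitrary unused port */
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        char payload[64] = {0};
        const int iters = 100000;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            sendto(fd, payload, sizeof(payload), 0,
                   (struct sockaddr *)&dst, sizeof(dst));
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg kernel UDP send: %.0f ns\n", ns / iters);
        return 0;
    }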
For any meaningful elastic deployment at scale, Linux's (or really any OS's) TCP/IP stack will have to be bypassed by user-space programs: bypass has nearly all of the upsides and few of the downsides.
there are a couple pretty core assumptions that make that hard:
read() interfaces expect the kernel to put the data where you asked. it's hard but not impossible to coordinate with the NIC to demultiplex the packet and get it into the right place without a copy. almost all of the time you just want to see the data, you don't care that it lands in that particular spot.
getting control flow in and out of the kernel is expensive. target addresses need to be checked, registers often need to be saved. (a quick way to measure just that entry/exit cost is sketched at the end of this comment.)
things like the kernel firewall take a certain amount of work per packet to determine its status.
there are lots of mitigations, but at some point you run into a conflict between general purpose multi-process functions and performance.
if you ran your application on a dedicated ARM core next to the NIC I bet you'd see a difference. maybe that shouldn't be so hard.
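a quick way to see the bare kernel entry/exit cost mentioned above in isolation (a sketch, not a benchmark; numbers swing a lot with the CPU and with spectre/meltdown mitigations):

    /* Time a near-no-op syscall to isolate bare kernel entry/exit cost.
     * getppid() does almost no work and, unlike getpid() in some older glibc
     * versions, isn't cached in userspace, so every call crosses the boundary. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const long iters = 1000000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            (void)getppid();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg syscall round trip: %.1f ns\n", ns / iters);
        return 0;
    }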
That still doesn't add up though. System call overhead is something like 500 ns to 1 usec depending on the call. Copying a full 1500-byte UDP MTU from RAM is pessimistically along the lines of 2.5 usec (1500 bytes / 64-byte cache line, roughly 24 lines at ~100 ns L3 access each).
So a simple UDP recv() + copy from network card buffer to userspace should be about 3-4usec. In the real world, it's more like 6 or 7usec the last time I timed it.
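Spelling that estimate out (the serial cache-line fetch is the pessimistic assumption; the syscall figure below is just an assumed midpoint of the 0.5-1 usec range above):

    /* Back-of-the-envelope for a kernel UDP recv(): copy cost assuming one
     * serial ~100 ns access per 64-byte cache line of a 1500-byte MTU, plus
     * an assumed ~750 ns of syscall overhead. Real hardware overlaps misses,
     * so this is an upper bound on the copy, not an expectation. */
    #include <stdio.h>

    int main(void) {
        const double mtu_bytes       = 1500.0;
        const double cacheline_bytes = 64.0;
        const double ns_per_line     = 100.0;   /* pessimistic: fully serial */
        const double syscall_ns      = 750.0;   /* assumed midpoint of 0.5-1 usec */

        double copy_ns = mtu_bytes / cacheline_bytes * ns_per_line;   /* ~2.3 usec */
        printf("copy  : %.1f usec\n", copy_ns / 1000.0);
        printf("recv(): %.1f usec total\n", (copy_ns + syscall_ns) / 1000.0);
        return 0;
    }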
Admittedly, the unix API could use a refresh. A lot of unix calls like getaddrinfo are perfectly happy to allocate and give you a buffer. A "fast_recv" could easily do the same and just hand you a page from a kernel page pool directly that you then have to free.
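A hypothetical shape for that, purely for illustration (no such syscall exists today; the names and semantics here are made up, loosely modeled on how getaddrinfo hands back library-allocated memory you later free):

    /* Hypothetical zero-copy receive API, declaration-only sketch: the kernel
     * lends the application a page from its own pool instead of copying into
     * a caller-supplied buffer. Nothing like this exists in Linux today. */
    #include <stddef.h>

    struct fast_buf {
        void   *data;   /* points into a kernel-owned page mapped read-only */
        size_t  len;    /* bytes of payload actually received */
    };

    /* Receive without copying: on success, *buf refers to kernel pool memory. */
    int fast_recv(int sockfd, struct fast_buf *buf, int flags);

    /* Hand the page back to the kernel pool once the application is done. */
    int fast_recv_done(int sockfd, struct fast_buf *buf);

The closest existing analogues I know of are PACKET_MMAP rings and AF_XDP, which do share buffers between the kernel and userspace, but neither slots into the plain sockets API like this.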
By avoiding the buffer copying, you could get something in the same ballpark as kernel bypass but without all the bespoke code and frameworks in every application. And in the end, that's kind of the point of having an "Operating System" in the first place.
> So a simple UDP recv() + copy from network card buffer to userspace should be about 3-4usec. In the real world, it's more like 6 or 7usec the last time I timed it.
Not sure exactly where the Linux kernel lags (maybe it tails off at scale?), but Google's paper on their network load balancer, Maglev, is instructive: section 5.2.1 specifically calls out a 30% performance degradation without kernel bypass at smaller packet sizes.
> By avoiding the buffer copying, you could get something in the same ballpark as kernel bypass but without all the bespoke code and frameworks in every application.
Maybe this article from Cloudflare helps paint a proper picture of why certain kinds of applications (e.g. a virtual router/switch, or a load balancer) may need to keep bypassing the kernel forever because their requirements are very specific, not because the kernel can't be made to go faster: https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...
(not an expert, just someone keeping abreast with development in the networking world)
There are inherent problems with relying on the OS in an elastic environment. The kernel's TCP/IP stack could be made to expose very specific APIs to help with Network Function Virtualization (NFV) and Software-Defined Networking (SDN), but the advantages are not clear:
1. Vendors have to rely on kernel devs to do their bidding for their NFV/SDN deployments, and wrestle with whatever other priorities the kernel might already have.
2. They have to build and deploy the updated kernel fleet-wide, and those changes drag along other kernel changes the vendor may or may not want.
3. They have to cater to strict kernel community guidelines and be driven by established processes governed by designated gatekeepers, costing them time and speed to market.
With a userspace solution, especially since it lets vendors get performance simply by interfacing with the NICs, vendors have a measure of control over their own destiny:
1. Deploy at will, at pace.
2. Ability to do custom packet/flow handling (virtualization), even on the fly (SDN), without requiring assistance from any part of the system outside their control.
3. Reduce external factors that might influence their timelines or priorities.
The generic problem of the kernel's TCP/IP stack needing improvement is orthogonal to this, however. That effort will have little to no bearing on the direction the networking industry is moving, imo.
Everything should just be fast, but it usually isn't. So how are you going to deal with that as a business? Hire kernel hackers and spend expensive time and money to fine-tune one version of a kernel that'll never go upstream? Or pay a vendor for a really fast NIC and integrate it into your app?
(If you're Google, you hire the kernel hackers, but if you're trying to make money off a product and ship it next quarter, you buy the NIC)
Well right now everyone who cares about latency (including me) is spending time integrating some kind of kernel bypass library into their application that's custom and/or proprietary to a single card vendor and often pretty finicky to configure and integrate correctly.
It's a massive waste of resources. If the kernel could figure out DRI for graphics then they can figure out universal passthrough networking. All the bypass libraries seem to use the same buffer pool + spinlock slot architecture already.
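For reference, the shared structure those libraries converge on looks roughly like this (a simplified single-producer/single-consumer sketch without the spinlock; not any particular vendor's API):

    /* Simplified sketch of the descriptor-ring pattern common to kernel-bypass
     * libraries: a pool of fixed-size packet buffers plus a ring of slots that
     * the NIC/driver (producer) fills and the application (consumer) drains.
     * Real implementations add doorbells, batching and cache-line padding. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS 1024          /* power of two so index masking works */
    #define BUF_SIZE   2048          /* room for a full MTU frame */

    struct pkt_buf { uint8_t data[BUF_SIZE]; };

    struct rx_slot {
        uint32_t buf_index;          /* which pkt_buf holds this frame */
        uint32_t len;                /* frame length written by the producer */
    };

    struct rx_ring {
        struct rx_slot   slots[RING_SLOTS];
        struct pkt_buf   pool[RING_SLOTS];
        _Atomic uint32_t head;       /* advanced by the producer (NIC/driver) */
        _Atomic uint32_t tail;       /* advanced by the consumer (application) */
    };

    /* Consumer: poll for the next filled slot; returns NULL if the ring is empty. */
    static struct rx_slot *rx_ring_next(struct rx_ring *r)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                           /* nothing new from the NIC */
        return &r->slots[tail & (RING_SLOTS - 1)];
    }

    /* Consumer: hand the slot (and its buffer) back for reuse. */
    static void rx_ring_release(struct rx_ring *r)
    {
        atomic_fetch_add_explicit(&r->tail, 1, memory_order_release);
    }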
And by your logic, how does anything ever get into the kernel? Nobody should ever contribute anything because "upstream".
I'm just saying that fixing the kernel is unlikely to happen unless it's in someone's business interest, and it's usually not. Hell, I bet there's kernel forks out there where people have already solved this, but it won't go upstream, because lawyers.
Nice, although it needs to go into production to prove its worth.
This type of innovation appears on a pretty much daily basis in every senior engineer's head. Not dismissing it, that's just a matter of fact in real-world systems.
The more interesting thing in this paper is the gulf between state of the art and mainstream practice. Most of you people are happy getting ~100K QPS in a 32-core box and this guy is getting 10M QPS per core.
That is hardly the state of the art: they are basically sacrificing, in the name of speed, all the abstractions that are rightly required. This is no different than using vanilla DPDK with no congestion or flow control and being able to process 20 million packets per second per core (better than the numbers in the paper). Getting 75 Gbps per core using RDMA is hardly hard or new.
You can't really use this abstraction for anything unless you are on a completely lossless fabric that has enough capacity to avoid congestion.
Yes, Google/MS/FB are going towards such fabrics but the hard part is not building the RPC abstraction suggested in this paper---the hard part is getting to that fabric.
Just to put things into perspective: if you had a quantum computer you could do all sorts of crazy stuff with it. "You could parallelize loops! and be super fast" is what this paper is effectively suggesting (I specifically chose parallelizing loops because there is nothing new in it).
I'm the main author of eRPC, and I wanted to clarify some things.
Your comment suggests (please correct me if you meant differently) that (a) eRPC does not perform congestion control, and (b) eRPC requires a lossless fabric. In fact, eRPC implements congestion control, and it works well in a lossy network. Those are the two main contributions of the paper.
We get 75 Gbps with only UDP/Ethernet packet I/O, without RDMA support.
eRPC implements transport-layer functionality atop a fast packet I/O engine like DPDK, so comparing eRPC to DPDK isn't apples-to-apples.
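To make the layering concrete, here is roughly what the packet I/O layer underneath looks like with DPDK (a minimal single-queue receive loop, error handling omitted; not eRPC's actual code). Everything above this (reassembly, retransmission, congestion control, request/response matching) is what eRPC adds:

    /* Minimal DPDK receive loop: "fast packet I/O" hands you a burst of raw
     * Ethernet frames and nothing else -- no reassembly, no retransmission,
     * no congestion control. Error handling omitted for brevity. */
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    #define BURST 32

    int main(int argc, char **argv)
    {
        rte_eal_init(argc, argv);

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "rx_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        uint16_t port = 0;                          /* first detected NIC port */
        struct rte_eth_conf conf = {0};
        rte_eth_dev_configure(port, 1, 0, &conf);   /* 1 RX queue, 0 TX queues */
        rte_eth_rx_queue_setup(port, 0, 512, rte_eth_dev_socket_id(port),
                               NULL, pool);
        rte_eth_dev_start(port);

        struct rte_mbuf *bufs[BURST];
        for (;;) {
            /* Poll the NIC: returns up to BURST raw frames, zero if idle. */
            uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST);
            for (uint16_t i = 0; i < n; i++) {
                /* ... the transport/application layer would parse bufs[i] here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }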
Your repo is really nice for an academic paper. Thank you for that. It's rare to see a "networked systems" repository that has readable code. I mainly checked the large-tput example.
A few questions—
1) For your 75Gbps, what percentage of the payload of the RPC do you touch? I.e., what portion of the message is used on that core?
More directly: say you have a service that can sustain 100K QPS; if it switches to eRPC, what can it expect? Asked differently, what is the base overhead of today's RPC libraries, especially the ones that bypass the kernel?
2) The efficacy of the congestion and flow control is up for debate, especially in a DC setting. Can you claim that eRPC would work for any type of workload in a DC setting? How would it play with other connections? At the end of the day, if you are forced to play nice, you may end up adding branches to your code; your fast path gets split depending on the connection type, etc. Is that something you think is preventable?
3) How do you distribute the load across different cores at 75 Gbps? How do the CPU ring, contention, etc. come into play? I.e., can you do useful work with that 75 Gbps, or should I just read it as a "wow" number? Asking a different way: if I have a for loop that can do 10 billion iterations per second, and just adding a function call drops it to 10K iterations per second, why would I care about those 10 billion iterations?
4) You claim that it works well in a lossy network, yet your goodput drops to ~18 Gbps / ~2.5 Gbps at 10^-4 / 10^-3 packet loss, and I'm assuming the library is still flooding the network at 75 Gbps. How does this play out at scale?
All in all, I do appreciate your work. My issue is that academic people like to make big claims, especially in an academic setting. People in industry are aware of fast paths. The kernel networking stack uses fast paths rigorously. Sure, it is heavy and comes with a lot of bulk, but you can just as easily cut it down.
1. Our throughput benchmark is designed to measure data transfer bandwidth, and the comparison point is RDMA writes. In the benchmark, eRPC at the server internally re-assembles request UDP frames into a buffer that is handed to the server in the request handler. The request handler does not re-touch this buffer, similarly to RDMA writes.
We haven't compared against RPC libraries that use fast userspace TCP. Userspace TCP is known to be a fair bit slower than RDMA, whereas eRPC aims for RDMA-like performance.
2. eRPC uses congestion control protocols (Timely or DCQCN) that have been deployed at large scale. The assumption is that other applications are also using some congestion control to keep switch queueing low, but we haven't tested co-existence with TCP yet. (A toy sketch of a Timely-style rate update is at the end of this comment.)
3. 75 Gbps is achieved with one core, so there's no need to distribute load. We could insert this data into an in-memory key-value store, or persist it to NVM, and still get several tens of Gbps. The performance depends on the computation-to-communication ratio, and there are tons of communication-intensive applications.
4. Packet loss in real datacenters is rare, and we can make it rarer with BDP flow control. Congestion control kicks in during packet loss, so we don't flood the network. Our packet loss experiment uses one connection, which is the worst case. An eRPC endpoint likely participates in many connections, most of which are uncongested.
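Regarding the congestion control in (2): the gist of a Timely-style RTT-gradient rate update, as a toy sketch (grossly simplified relative to both the Timely paper and eRPC's actual implementation; constants are made up):

    /* Grossly simplified Timely-style update: additive increase while the RTT
     * gradient is flat or falling, multiplicative decrease proportional to the
     * gradient while RTT is rising. Constants are illustrative only. */
    #include <stdio.h>

    static double rate_gbps   = 10.0;   /* current sending rate */
    static double prev_rtt_us = 0.0;

    static void timely_update(double new_rtt_us)
    {
        const double min_rtt_us = 10.0;  /* assumed base fabric RTT */
        const double add_step   = 0.5;   /* additive increase, in Gbps */
        const double beta       = 0.8;   /* multiplicative decrease factor */

        if (prev_rtt_us == 0.0) {        /* first sample: just record it */
            prev_rtt_us = new_rtt_us;
            return;
        }
        double gradient = (new_rtt_us - prev_rtt_us) / min_rtt_us;
        prev_rtt_us = new_rtt_us;

        if (gradient <= 0)
            rate_gbps += add_step;                /* queues draining: speed up */
        else
            rate_gbps *= 1.0 - beta * gradient;   /* queues building: back off */
        if (rate_gbps < 0.1)
            rate_gbps = 0.1;                      /* keep a sane floor */
    }

    int main(void) {
        double rtts[] = {12.0, 15.0, 14.0, 25.0, 18.0};   /* fake RTT samples (us) */
        for (int i = 0; i < 5; i++) {
            timely_update(rtts[i]);
            printf("rtt %5.1f us -> rate %6.2f Gbps\n", rtts[i], rate_gbps);
        }
        return 0;
    }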
1) Are you using or relying on DMA or SPDK to copy packet data? A single core, to my understanding (assuming 10 concurrent cache lines in flight and 70~90 ns of memory access time), doesn't have the bandwidth to copy that much data from the NIC to memory if the CPU is in the middle; rough arithmetic at the end of this comment. If so, RDMA and the copy methodology are not so different in how they operate.
I didn't look at the part of the paper where you explain how you perform the copying.
Also, IMHO, RDMA itself is a pet project of a particular someone at somewhere that is looking for promotions :) . I don't really know if it's a good baseline. It could be more reasonable to look at the benchmarks of other RPC libraries and compare against the feature set they are providing.
2) As far as I remember from talking with random people from Microsoft, Google, and Facebook, none of them use Timely or DCQCN in production. Microsoft may be using RDMA for storage-like workloads and relying heavily on isolating that traffic, but nothing outside that(?). I could be wrong.
3) There definitely is a need to distribute the load unless you are assuming that a single core can "process" the data. That may work for key/value-store workloads, but what percentage of the workloads in a DC have that characteristic? You say there are tons of communication-intensive applications; care to name a few? I can think of KV stores. Maybe machine learning workloads, but the computational model is very different there and you rely on tailored ASICs(?). What else? Big data workloads aren't bottlenecked by network BW.
4) I won't dive into the CC/BDP discussion because it's very hard to judge without actually deploying it. Sure, a lot of older work made a lot of claims about how X and Y are better for Z and W, but once people tested them they fell flat for various reasons.
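And the arithmetic behind the bandwidth doubt in 1), spelled out under the assumptions stated there (roughly ten cache-line misses in flight at ~80 ns each):

    /* Per-core copy bandwidth under the assumptions in point 1): about ten
     * 64-byte cache-line misses in flight, ~80 ns each. That comes out to
     * roughly 64 Gbps, which is where the doubt about a pure CPU copy at
     * 75 Gbps comes from. */
    #include <stdio.h>

    int main(void) {
        const double lines_in_flight = 10.0;
        const double line_bytes      = 64.0;
        const double miss_ns         = 80.0;   /* midpoint of the 70~90 ns guess */

        double bytes_per_ns = lines_in_flight * line_bytes / miss_ns;  /* ~8 B/ns */
        double gbps = bytes_per_ns * 8.0;                              /* ~64 Gbps */
        printf("estimated single-core copy bandwidth: %.0f Gbps\n", gbps);
        return 0;
    }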
Well this sets a lower bound on the state of the art that is in any case orders of magnitude higher than what many people suffer from in industry practice.
Just to restate what I said---you cannot use this in its current state in production (or industry), ever. And by the time it becomes useful, it becomes a natural solution because the infrastructure supports it.
Every person that works on the kernel knows the overhead that comes with abstractions. Everybody that has worked with DPDK knows you can get 20 Mpps+ on a single core.
All they have done is "frame" the usage and the term RPC differently... i.e., it's all storytelling and no real meat :)
Look at the previous set of publications by the same author, e.g., Achieve a Billion Requests Per Second Throughput on a Single Key-Value Store Server Platform, etc. They are all based on a single assumption: if you don't implement X and Y in your stack, you can get better performance. Of course you can. If you only ever use your calculator to compute 2+2, you might as well hardcode 4 as its output.
They are not setting a lower bound. They are hardcoding and bypassing the parts they don't find useful.
I agree that our code is not production-ready, but an industry team at Intel is trying eRPC out: https://github.com/daq-db.
I am curious to know what features you believe we omitted in our ISCA 15 paper. Our key-value store (https://github.com/efficient/mica2) supports a memcached-like API over UDP.