
Hey, Anuj.

Your repo is really nice for an academic paper. Thank you for that. It's rare to see a "networked systems" repository that has readable code. I mainly checked the large-throughput example.

A few questions—

1) For your 75 Gbps, what percentage of the RPC payload do you touch? I.e., what portion of the message is actually used on that core?

More directly, say you have a service that can sustain 100k QPS; if it switches to eRPC, what can it expect? Asked differently, what is the base overhead of today's RPC libraries, especially ones that bypass the kernel?

2) The value of congestion and flow control is debatable, and their efficacy is an open question, especially in a DC setting. Can you claim that eRPC would work for any type of workload in a DC? How would it play out alongside other connections? At the end of the day, if you are forced to play nice, you may eventually add branches to your code: your fast path gets split depending on the connection type, and so on. Is that something you think is preventable?

3) How do you distribute the load across different cores at 75 Gbps? How do the CPU ring, contention, etc. come into play? I.e., can you do useful work with that 75 Gbps, or should I just read it as a "wow" number? To ask a different question: if I have a for loop that can do 10 billion iterations per second, and merely adding a function call drops it to 10k iterations per second, why would I care about those 10 billion iterations? (A quick back-of-envelope on this follows the questions below.)

4) You claim that it works well in a lossy network, yet your goodput drops to 18 and 2.5 Gbps at 10^-4 and 10^-3 packet loss respectively, and I assume the library is still flooding the network at 75 Gbps. How does this play out at scale?
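To make the point in question 3 concrete, here is the back-of-envelope I have in mind; the rates are just the hypothetical ones from the question, not measurements of anything.

    // Back-of-envelope for question 3: once every iteration includes a slow call,
    // the fast loop body stops mattering. Rates are the hypothetical ones above.
    #include <cstdio>

    int main() {
      const double fast_iter_s = 1.0 / 1e10;  // 10 billion iterations/s, ~0.1 ns each
      const double slow_call_s = 1.0 / 1e4;   // 10k calls/s, ~100 us each
      const double combined_s  = fast_iter_s + slow_call_s;

      std::printf("combined rate: ~%.0f iterations/s\n", 1.0 / combined_s);
      std::printf("fast part's share of each iteration: %.4f%%\n",
                  100.0 * fast_iter_s / combined_s);
      return 0;
    }

The combined rate is essentially 10k per second: the fast part contributes about 0.0001% of each iteration's cost.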

All in all, I do appreciate your work. My issue is that academics like to make big claims, especially in an academic setting. People in industry are aware of fast paths; the kernel networking stack uses fast paths rigorously. Sure, it is heavy and comes with a lot of bulk, but you can just as easily cut it down.



Thank you for the questions.

1. Our throughput benchmark is designed to measure data transfer bandwidth, and the comparison point is RDMA writes. In the benchmark, eRPC at the server internally reassembles the request's UDP frames into a buffer that is handed to the application in the request handler. The request handler does not touch this buffer again, similar to how RDMA writes work. (A sketch of this follows the answers below.)

We haven't compared against RPC libraries that use fast userspace TCP. Userspace TCP is known to be a fair bit slower than RDMA, whereas eRPC aims for RDMA-like performance.

2. eRPC uses congestion control protocols (Timely or DCQCN) that have been deployed at large scale. The assumption is that other applications are also using some form of congestion control to keep switch queueing low, but we haven't tested coexistence with TCP yet. (A rough sketch of Timely's rate update also follows the answers below.)

3. 75 Gbps is achieved with one core, so there's no need to distribute the load. We could insert this data into an in-memory key-value store, or persist it to NVM, and still get several tens of Gbps. The performance depends on the computation-to-communication ratio, and there are plenty of communication-intensive applications.

4. Packet loss in real datacenters is rare, and we can make it rarer with BDP flow control (rough arithmetic below). Congestion control kicks in when packets are lost, so we don't flood the network. Our packet-loss experiment uses one connection, which is the worst case; an eRPC endpoint likely participates in many connections, most of which are uncongested.
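For question 1, here is a sketch of what the benchmark's server side looks like conceptually, modeled on the style of eRPC's public examples. Exact type, field, and function names differ across eRPC versions, and AppContext / kRespSize are placeholders, so treat this as illustrative rather than the paper's code.

    // Rough sketch of the throughput benchmark's server side, in the style of
    // eRPC's public examples. Names are approximate; AppContext and kRespSize
    // are hypothetical.
    #include "rpc.h"

    static constexpr size_t kRespSize = 32;  // hypothetical small response

    struct AppContext {                      // hypothetical per-thread context
      erpc::Rpc<erpc::CTransport> *rpc = nullptr;
    };

    void req_handler(erpc::ReqHandle *req_handle, void *context) {
      auto *c = static_cast<AppContext *>(context);

      // By the time this runs, eRPC has reassembled the request's UDP frames
      // into one contiguous message buffer. The benchmark's handler never reads
      // that payload, mirroring how the target memory of an RDMA write isn't
      // touched by the receiving CPU.
      const erpc::MsgBuffer *req = req_handle->get_req_msgbuf();
      (void)req;  // payload intentionally left untouched

      // Enqueue a small pre-allocated response and return to the event loop.
      erpc::MsgBuffer &resp = req_handle->pre_resp_msgbuf_;
      c->rpc->resize_msg_buffer(&resp, kRespSize);
      c->rpc->enqueue_response(req_handle, &resp);
    }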
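For question 2, this is roughly the rate-update rule Timely uses, following the SIGCOMM 2015 paper in simplified form (e.g., without hyper-active increase). It is included only to show why senders back off before switch queues grow; the constants are illustrative placeholders, and this is not eRPC's implementation.

    // Simplified sketch of Timely's RTT-gradient rate update (Mittal et al.,
    // SIGCOMM 2015). Constants are placeholders, not eRPC's or the paper's values.
    struct TimelyState {
      double rate_gbps = 40.0;   // current sending rate
      double prev_rtt_us = 0.0;  // last measured RTT
      double rtt_diff_us = 0.0;  // EWMA of the RTT gradient
    };

    double timely_update(TimelyState &s, double rtt_us) {
      constexpr double kAlpha = 0.46, kBeta = 0.26;    // EWMA and decrease factors
      constexpr double kTLowUs = 50, kTHighUs = 500;   // RTT thresholds
      constexpr double kMinRttUs = 10, kAddStepGbps = 0.5;

      const double new_diff = rtt_us - s.prev_rtt_us;
      s.prev_rtt_us = rtt_us;
      s.rtt_diff_us = (1 - kAlpha) * s.rtt_diff_us + kAlpha * new_diff;
      const double gradient = s.rtt_diff_us / kMinRttUs;

      if (rtt_us < kTLowUs) {
        s.rate_gbps += kAddStepGbps;                        // RTT low: additive increase
      } else if (rtt_us > kTHighUs) {
        s.rate_gbps *= 1 - kBeta * (1 - kTHighUs / rtt_us); // RTT high: multiplicative decrease
      } else if (gradient <= 0) {
        s.rate_gbps += kAddStepGbps;                        // queues draining: increase
      } else {
        s.rate_gbps *= 1 - kBeta * gradient;                // queues building: back off
      }
      return s.rate_gbps;
    }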
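And for question 4, the arithmetic behind the BDP flow-control claim: each flow may keep at most one bandwidth-delay product of data unacknowledged in the network, which bounds how much it can ever queue at a switch. The link speed and RTT below are illustrative placeholders, not our cluster's measured values.

    // Why a BDP-sized flow-control window keeps switch queues (and loss) small.
    // Link speed and RTT are illustrative, not measurements from the paper.
    #include <cstdio>

    int main() {
      const double link_gbps = 40.0;  // illustrative fabric speed
      const double rtt_us = 4.0;      // illustrative unloaded fabric RTT
      const double bdp_bytes = link_gbps * 1e9 / 8 * rtt_us * 1e-6;

      // With at most one BDP outstanding per flow, a flow contributes at most
      // this many bytes to in-network buffering, regardless of message size.
      std::printf("BDP window: %.0f bytes (~%.0f KB)\n", bdp_bytes, bdp_bytes / 1024);
      return 0;
    }

With these example numbers the window is about 20 KB per flow, so even many concurrent flows keep switch buffer occupancy bounded.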


Thanks for the answers:

1) Are you using or relying on DMA or SPDK to copy packet data? A single core, to my understanding (assuming 10 concurrent cache lines in flight and 70~90 ns memory access time), doesn't have the bandwidth to copy that much data from the NIC to memory (assuming the CPU is in the middle); there's a back-of-envelope for this after these questions. If so, RDMA and your copy methodology are not so different in how they operate.

I haven't looked at the part of the paper where you explain how you perform the copying.

Also, IMHO, RDMA itself is a pet project of a particular someone, somewhere, who is looking for a promotion :). I don't really know if it's a good baseline. It might be more reasonable to look at the benchmarks of other RPC libraries and compare against the feature set they provide.

2) As far as I remember from talking with random people at Microsoft, Google, and Facebook, none of them use Timely or DCQCN in production. Microsoft may be using RDMA for storage-like workloads and relying heavily on isolating that traffic, but nothing beyond that (?). I could be wrong.

3) There definitely is a need to distribute the load, unless you are assuming that a single core can "process" the data. That may work for key-value-store workloads, but what percentage of workloads in a DC have that characteristic? You say there are tons of communication-intensive applications; care to name a few? I can think of KV stores. Maybe machine learning workloads, but the computational model is very different there and you rely on tailored ASICs (?). What else? Big-data workloads aren't bottlenecked by network bandwidth.

4) I won't dive into the CC/BDP discussion because it's very hard to judge without actually deploying it. Sure, a lot of older work made claims about how X and Y are better for Z and W, but once people tested them, they fell flat for various reasons.
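To spell out the copy-bandwidth concern in question 1, here is the back-of-envelope I'm using; the memory-level parallelism and latency numbers are my assumptions, not measurements of any specific CPU.

    // Back-of-envelope for a single core's copy bandwidth, under the assumptions
    // in question 1: ~10 cache-line misses in flight, 70~90 ns memory latency.
    #include <cstdio>

    int main() {
      const double cache_line_bytes = 64.0;
      const double lines_in_flight = 10.0;  // assumed memory-level parallelism per core
      const double mem_latency_ns = 80.0;   // assumed, middle of the 70~90 ns range

      const double bytes_per_ns = lines_in_flight * cache_line_bytes / mem_latency_ns;
      const double gbps = bytes_per_ns * 8;  // 1 byte/ns equals 8 Gbit/s

      // A copy also has to write the data back, so the effective copy rate is
      // lower still than this load-only figure.
      std::printf("~%.0f GB/s, i.e. ~%.0f Gbps of load bandwidth per core\n",
                  bytes_per_ns, gbps);
      return 0;
    }

That works out to roughly 8 GB/s, or about 64 Gbps, which is why I'm asking how a single core sustains 75 Gbps if it is actually copying the data.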



