Hey, Anuj. Your repo is really nice for an academic paper. Thank you for that. I...

anujkaliaitd · on March 27, 2019

Thank you for the questions.

1. Our throughput benchmark is designed to measure data transfer bandwidth, and the comparison point is RDMA writes. In the benchmark, eRPC at the server internally re-assembles request UDP frames into a buffer that is handed to the server in the request handler. The request handler does not re-touch this buffer, similarly to RDMA writes.

We haven't compared against RPC libraries that use fast userspace TCP. Userspace TCP is known to be a fair bit slower than RDMA, whereas eRPC aims for performance RDMA-like performance.

2. eRPC uses congestion control protocols (Timely or DCQCN) that have been deployed at large scale. The assumption is that other applications are also using some congestion control to keep switch queueing low, but we haven't tested co-existence with TCP yet.

3. 75 Gbps is achieved with one core, so there's no need to distribute load. We could insert this data into an in-memory key-value store, or persist it to NVM, and still get several tens of Gbps. The performance depends on computation-communication ratio, and we have tons of communication-intensive application

4. Packet loss in real datacenters is rare, and we can make it rarer with BDP flow control. Congestion control kicks in during packet loss, so we don't flood the network. Our packet loss experiment uses one connection, which is the worst case. An eRPC endpoint likely participates in many connections, most of which are uncongested.

diffserv · on March 29, 2019

Thanks for the answers:

1) Are you using or relying on DMA or SPDK to copy packet data? A single core, to my understanding, (assuming 10 concurrent cache lines in flight and 70~90ns of memory access time) doesn't have the bandwidth to copy that much data from the NIC to memory (assuming the CPU is in the middle). If so, RDMA and the copy methodology are not so different in how they operate.

I didn't look at the paper where you explained how you perform the copying.

Also, IMHO, RDMA itself is a pet project of a particular someone at somewhere that is looking for promotions :) . I don't really know if it's a good baseline. It could be more reasonable to look at the benchmarks of other RPC libraries and compare against the feature set they are providing.

2) As far as I remember talking with random people from Microsoft, Google, and Facebook, none of them use Timely or DCQCN in production. Microsoft may be using RDMA for storage like workloads and relying heavily on isolating that traffic but nothing outside that (?) . I could be wrong.

3) There definitely is a need to distribute the load unless you are assuming that the single core can "process" the data. That may work for Key/value store workloads but what percentage of the workloads in a DC have that characteristic? You say there are tons of communication-intensive applications, care to name a few? I can think of KV stores. Maybe machine learning workload but the computational model is very different there and you rely on tailored ASICs (?) What else? Big data workloads aren't bottlenecked by the network BW.

4) I won't dive into the CC/BDP discussion cause it's very hard to judge without actually deploying it. Sure, a lot of older works made a lot of claims about how X and Y are better for Z and W, but once people tested them they would fall flat for various reasons.