Thank you for the questions. 1. Our throughput benchmark is designed to measure ...

diffserv · on March 29, 2019

Thanks for the answers:

1) Are you using or relying on DMA or SPDK to copy packet data? A single core, to my understanding, (assuming 10 concurrent cache lines in flight and 70~90ns of memory access time) doesn't have the bandwidth to copy that much data from the NIC to memory (assuming the CPU is in the middle). If so, RDMA and the copy methodology are not so different in how they operate.

I didn't look at the paper where you explained how you perform the copying.

Also, IMHO, RDMA itself is a pet project of a particular someone at somewhere that is looking for promotions :) . I don't really know if it's a good baseline. It could be more reasonable to look at the benchmarks of other RPC libraries and compare against the feature set they are providing.

2) As far as I remember talking with random people from Microsoft, Google, and Facebook, none of them use Timely or DCQCN in production. Microsoft may be using RDMA for storage like workloads and relying heavily on isolating that traffic but nothing outside that (?) . I could be wrong.

3) There definitely is a need to distribute the load unless you are assuming that the single core can "process" the data. That may work for Key/value store workloads but what percentage of the workloads in a DC have that characteristic? You say there are tons of communication-intensive applications, care to name a few? I can think of KV stores. Maybe machine learning workload but the computational model is very different there and you rely on tailored ASICs (?) What else? Big data workloads aren't bottlenecked by the network BW.

4) I won't dive into the CC/BDP discussion cause it's very hard to judge without actually deploying it. Sure, a lot of older works made a lot of claims about how X and Y are better for Z and W, but once people tested them they would fall flat for various reasons.