Filtering millions of packets per second on commodity NICs (cloudflare.com)
145 points by jgrahamc on Oct 9, 2015 | 27 comments




Thanks! Every single post in that series is very detailed, yet the average software engineer can understand them without much trouble.

I really enjoy the effort you guys put into those detailed blog posts.


What is the aggregate PPS Cloudflare handles now? What's the goal (aside from infinite)?


We regularly see many millions of pps per server.

You might find this interesting:

https://youtu.be/UcAygzNSxlI?t=7980

https://indico.dns-oarc.net/event/21/contribution/5/material...


Awesome links, thanks!


We would also like to thank Luigi Rizzo, for his Netmap work and great feedback on our patches.

Clearly a useful contribution! The linked pull request looks more like a finished product; I appreciate it even more when companies include the details of the sausage-making.


This gives a very interesting data point for the "open-source" vs. "free software" debate. Normally free/libre software zealots would point to BSD/ISC/Apache licenses as a way to never get any downstream changes back. And yet Cloudflare did contribute back nicely to a BSD-licensed project, in a situation where they were under absolutely no obligation to do so.

In fact, even GPLv2 would not have imposed an obligation to publish changes here; only a super-strict GPLv3 would.

One data point, of course, hardly warrants a far-reaching conclusion; still, that is something very nice to see.


We open source stuff because it's a virtuous circle. We think other people will look at our code and make it better!


What exactly is process() doing in the sample? Or is that also commented out in the test?

Because if the only processing here is throwaway, this still screams for an FPGA in front of the NIC. Someone mentioned the higher R&D cost of an FPGA solution, but clearly there is massive R&D here too, just in making sure evil packets don't hit a slow code path.


You can get FPGA-based switches, e.g. from Arista. They're not cheap, but you can do whatever you like with the packets as the bytes arrive. But for most applications you'd stick with commodity cards for the cost.


FPGA-based switches from Arista are a gimmick of that particular vendor. 10G Ethernet and beyond is absolutely commodity in the FPGA world; every dev kit has one.


An FPGA dev kit probably costs more than a NIC and is harder to program.


Does anyone know about the current state of IP routing on commodity NICs and Linux? Is 14M pps on 500'000 routes possible?


You want to check out Brocade, Intel DPDK and 6WIND. Brocade's Vyatta router has DPDK support, as does Juniper's vMX.

http://www.slideshare.net/shemminger/dpdk-performance


Nice, DPDK even has a library for longest-prefix matching [1], but sadly there are no published performance results.

[1] http://dpdk.org/doc/guides/prog_guide/lpm_lib.html#lpm-api-o...
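
For a sense of the API, a minimal rte_lpm round trip might look something like this (a hedged sketch: the exact signatures and macro names have shifted across DPDK releases, and the table sizes are illustrative):

    #include <stdio.h>
    #include <rte_eal.h>
    #include <rte_ip.h>
    #include <rte_lpm.h>

    int main(int argc, char **argv)
    {
        /* Bring up the DPDK environment abstraction layer. */
        if (rte_eal_init(argc, argv) < 0)
            return 1;

        /* Table sized with headroom for ~500'000 IPv4 routes (illustrative). */
        struct rte_lpm_config cfg = {
            .max_rules = 1 << 20,
            .number_tbl8s = 1 << 8,
            .flags = 0,
        };
        struct rte_lpm *lpm = rte_lpm_create("routes", SOCKET_ID_ANY, &cfg);
        if (lpm == NULL)
            return 1;

        /* Insert 10.0.0.0/8 -> next-hop id 1 (addresses in host byte order). */
        rte_lpm_add(lpm, RTE_IPV4(10, 0, 0, 0), 8, 1);

        /* Longest-prefix lookup; returns 0 on a hit. */
        uint32_t next_hop;
        if (rte_lpm_lookup(lpm, RTE_IPV4(10, 1, 2, 3), &next_hop) == 0)
            printf("next hop: %u\n", next_hop);

        rte_lpm_free(lpm);
        return 0;
    }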


We're doing 5Mpps (routed, not bridged) on a single core with a project we're calling netmap-fwd.

I'll blog about it on blog.pfsense.org in a few days when I return from Brazil.


Subscribed to the RSS feed :)


I don't have any figures for you right now, but routing is certainly a different problem from what is described in this article, as routed packets don't need to be passed to userspace for any processing.

Makes a huge difference for performance!


Depends on the commodity NIC. But yes, an E5 with an Intel 10G card can almost do 14Mpps "out of the box".
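
(Presumably the 14Mpps figure is the 10GbE minimum-frame line rate: a 64-byte frame plus 8 bytes of preamble and a 12-byte inter-frame gap occupies 84 bytes on the wire, and 10 Gbit/s ÷ (84 bytes × 8 bits/byte) ≈ 14.88 Mpps.)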


When you have to bypass your operating system to get your hardware to perform, perhaps it is time to re-assess your choice of operating system.


This isn't so much bypassing the OS as it is redefining the boundary of the privileged space to exclude the network traffic. That lets your filtering application get at the packets directly, without having to copy them out of kernel space into user space. This is exactly the technique all high-performance network devices follow at present. The ones that aren't doing it in userspace are doing it in some sort of RTOS that doesn't even have protected memory spaces.


In HPC circles this scheme is called OS bypass:

http://blogs.cisco.com/performance/mpi-newbie-what-is-operat...


(netmap author here) I prefer to define netmap as a "network stack bypass" scheme, because we use as much of the OS as possible -- all the things it does well, we do not want to reinvent. Device drivers, system calls, synchronization support etc. are part of the kernel. Native netmap support for a NIC only involves 300-400 lines of code, or about 10% of a typical device driver.

Processes do ioctl(), mmap() and poll() for I/O -- all standard system calls implemented by the OS; there is no NIC-specific code in the application. NICs can be switched in and out of netmap mode without reloading modules (and, with the Cloudflare patch, they can even share the two modes). There are no custom memory pools or hugepages to reserve. Device configuration relies on ethtool, ifconfig etc.
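
To make that pattern concrete, a minimal netmap receive loop might look roughly like this (a sketch based on the documented user API; error handling is omitted, "eth2" is a placeholder, and process() is a hypothetical packet handler):

    #include <fcntl.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    void process(const char *buf, unsigned int len);  /* hypothetical handler */

    int main(void)
    {
        /* Put the interface into netmap mode via the control device. */
        int fd = open("/dev/netmap", O_RDWR);
        struct nmreq req;
        memset(&req, 0, sizeof(req));
        strncpy(req.nr_name, "eth2", sizeof(req.nr_name) - 1);
        req.nr_version = NETMAP_API;
        ioctl(fd, NIOCREGIF, &req);

        /* Map the shared rings and packet buffers into the process. */
        char *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
        struct netmap_ring *rx = NETMAP_RXRING(nifp, 0);

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        for (;;) {
            poll(&pfd, 1, -1);                 /* block until packets arrive */
            while (!nm_ring_empty(rx)) {
                unsigned int i = rx->cur;
                struct netmap_slot *slot = &rx->slot[i];
                process(NETMAP_BUF(rx, slot->buf_idx), slot->len);
                rx->head = rx->cur = nm_ring_next(rx, i);  /* release the slot */
            }
        }
    }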

This approach is what let the Cloudflare folks implement their traffic steering with zero new code, just a couple of ethtool lines; the change they contributed back to support the split mode is completely agnostic of the specific NIC being used.
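
For flavor, that steering setup could be as simple as something along these lines (hypothetical interface name, port and queue number):

    # enable flow steering, then pin UDP/53 traffic to RX queue 1;
    # open only that queue in netmap mode and leave the rest to the kernel
    ethtool -K eth2 ntuple on
    ethtool -N eth2 flow-type udp4 dst-port 53 action 1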


The greatest problem is copying data back and forth between kernel space and user space.

This causes huge traffic to RAM and thrashes caches all along the way.


> The greatest problem is copying data back and forth between kernel space and user space.

Even if you could do away with the copying, e.g. with PACKET_MMAP (on Linux), the context switches will kill you...
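
To illustrate: PACKET_MMAP removes the copy but not the syscall, since the process still has to poll() into the kernel whenever the ring runs dry. A rough sketch (TPACKET_V1, error handling omitted):

    #include <poll.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        /* A small RX ring shared between kernel and userspace. */
        struct tpacket_req req = {
            .tp_block_size = 4096,
            .tp_block_nr   = 64,
            .tp_frame_size = 2048,
            .tp_frame_nr   = 128,   /* (4096 / 2048) * 64 */
        };
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
        char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        for (unsigned int i = 0;; i = (i + 1) % req.tp_frame_nr) {
            struct tpacket_hdr *hdr =
                (struct tpacket_hdr *)(ring + (size_t)i * req.tp_frame_size);
            while (!(hdr->tp_status & TP_STATUS_USER)) {
                /* Ring empty: here is the context switch in question. */
                struct pollfd pfd = { .fd = fd, .events = POLLIN };
                poll(&pfd, 1, -1);
            }
            /* Frame data sits at (char *)hdr + hdr->tp_mac, length
               hdr->tp_snaplen -- readable in place, no copy. */
            hdr->tp_status = TP_STATUS_KERNEL;   /* hand the slot back */
        }
    }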


Which OS would handle the load he speaks of?


HPC also uses OS bypass, so the Plan 9 team, who develop for Blue Gene and other large clusters, worked on currying system calls to maximize throughput:

http://4e.iwp9.org/papers/usecsys.pdf



