Corundum is an open-source, high-performance FPGA-based NIC. Features include a high-performance datapath, 10G/25G/100G Ethernet, PCI Express Gen 3, a custom high-performance tightly-integrated PCIe DMA engine, many (1000+) transmit, receive, completion, and event queues, MSI interrupts, multiple interfaces, multiple ports per interface, per-port transmit scheduling including high-precision TDMA, flow hashing, RSS, checksum offloading, and native IEEE 1588 PTP timestamping. A Linux driver is included that integrates with the Linux networking stack. Development and debugging are facilitated by an extensive simulation framework that covers the entire system, from a simulation model of the driver and PCI Express interface on one side to the Ethernet interfaces on the other.
Corundum has several unique architectural features. First, transmit, receive, completion, and event queue states are stored efficiently in block RAM or ultra RAM, enabling support for thousands of individually-controllable queues. These queues are associated with interfaces, and each interface can have multiple ports, each with its own independent scheduler. This enables extremely fine-grained control over packet transmission. Coupled with PTP time synchronization, this enables high precision TDMA.
FPGA = field-programmable gate array. NIC = network interface card. PTP = Precision Time Protocol. TDMA = time-division multiple access. I’m trying to understand what it is and what makes it special.
Some detail from Reddit on what's unique about it:
"Corundum is being developed to facilitate optical networking research and as such has some unique architectural features. First, all hardware queue state is stored in block RAM or ultra RAM, enabling support for thousands of independent, hardware controllable transmit, receive, completion, and event queues. This enables fine-grained hardware control over packet emission on a per-destination or per-flow basis. Additionally, the NIC supports multiple ethernet ports per interface that have separate schedulers but share the same hardware queues, enabling functionality such as striping packets across ports or rapidly migrating flows from port to port. The port schedulers can be made aware of PTP time, enabling high-precision TDMA that's synchronized across a large network."
Regarding TDMA only: it's part of IEEE time-sensitive networking (TSN), which is intended to make Ethernet suitable for industrial applications where short latencies and deterministic behavior are critical and not guaranteed with stock Ethernet.
Supporting critical traffic with TSN is a two-step process. First, you synchronize all the participating network nodes. For this you can use PTP (IEEE 1588), which is like an Ethernet-level NTP (grossly oversimplified, but you get the idea). Once all the nodes are in sync, they can use time-aware scheduling (TAS), where a TDM frame is overlaid across the whole LAN and Ethernet traffic classes (TCs) are assigned to specific ranges. In other words, you define a repeating pattern, split into different sequential zones, and TCs are assigned to some of those zones. The goal is to define repeating ranges dedicated to specific traffic classes, where one can control the load and make sure there is no contention, so traffic goes through with deterministic latency.
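To make the idea concrete, here is a rough sketch (in Python, not taken from any TSN implementation) of what a repeating TAS cycle looks like; the cycle length, window boundaries, and traffic class numbers are all made up for illustration:

```python
# Hypothetical TAS-style cycle, for illustration only; all values are made up.
# A repeating cycle is split into windows, and each window is open to a set of
# Ethernet traffic classes (TCs).

CYCLE_NS = 1_000_000  # 1 ms repeating cycle

# (window start offset in ns, window length in ns, TCs allowed to transmit)
WINDOWS = [
    (0,       200_000, {7}),        # first 200 us reserved for the critical TC
    (200_000, 800_000, {0, 1, 2}),  # remainder open to best-effort TCs
]

def allowed_classes(ptp_time_ns):
    """Return the traffic classes allowed to transmit at the given PTP time."""
    offset = ptp_time_ns % CYCLE_NS
    for start, length, tcs in WINDOWS:
        if start <= offset < start + length:
            return tcs
    return set()

print(allowed_classes(123_456))  # inside the reserved window -> {7}
print(allowed_classes(654_321))  # inside the best-effort window -> {0, 1, 2}
```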
All this could be used in a plant to support both best-effort traffic and sensitive real-time traffic for automation, while protecting the latter.
TSN started out for media applications (broadcasting) over Ethernet, but is getting into industrial applications (see https://opcfoundation.org/).
Support for TSN is planned for 5G (NR) release 16, to support industrial applications.
All this area is in flux, so having a flexible programmable platform can be interesting.
So, since Corundum is open source, you can plug in whatever arbitrary transmit scheduler you want. Corundum also supports 10,000+ transmit queues (I have synthesis-tested up to 32,768 transmit queues on the UltraScale+). This is super interesting for all sorts of networking applications, as that's a large enough number of queues to give individual flows or connections their own hardware queue, so the scheduler on the NIC can directly control the flow of information leaving the NIC.
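As a rough illustration (not Corundum code), here is how a flow's 5-tuple might be hashed to its own hardware transmit queue; the hash function and queue count are just assumptions for the sketch:

```python
# Hypothetical illustration: with enough hardware queues, each flow's 5-tuple
# can hash to its own transmit queue, so a NIC-side scheduler can pace flows
# individually. Not Corundum code; constants are assumptions.
import zlib

NUM_TX_QUEUES = 32_768  # the synthesis-tested figure mentioned above

def queue_for_flow(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the 5-tuple to pick a transmit queue index."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % NUM_TX_QUEUES

print(queue_for_flow("10.0.0.1", "10.0.0.2", 6, 51000, 443))
```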
TDMA is basically a simple demonstration scheduler that enables and disables queues on microsecond timescales, based on PTP time. One of the original reasons for building Corundum was to enable optical switching research, where data transmission into the switch must be precisely coordinated with the configuration of the switch itself. We have tried to do this in software, but the precision is limited and the CPU overhead is high. With Corundum, the schedule is enforced in hardware, so it is extremely precise and does not add any CPU overhead.
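A minimal software model of that kind of PTP-driven TDMA gating might look like the following (slot size and queue assignment are hypothetical; Corundum does this in hardware):

```python
# Hypothetical software model of PTP-driven TDMA queue gating.
# Corundum enforces the schedule in hardware; this only illustrates the math.
SLOT_NS = 10_000      # 10 us slots (assumed)
SLOTS_PER_PERIOD = 8  # repeating period of 8 slots (assumed)

# Which queues are enabled in each slot (assumed one queue per slot)
SLOT_TO_QUEUES = {s: {s} for s in range(SLOTS_PER_PERIOD)}

def enabled_queues(ptp_time_ns):
    """Return the set of transmit queues enabled at the given PTP time."""
    slot = (ptp_time_ns // SLOT_NS) % SLOTS_PER_PERIOD
    return SLOT_TO_QUEUES[slot]

print(enabled_queues(35_000))  # 35 us falls in slot 3 -> {3}
```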
TDMA usually refers to legacy time-division-based communications standards such as E1/T1 (G.704) or SONET/SDH.
Emulating such services in a packet-switched network requires very precise timing, hence the reference to IEEE 1588 / Precision Time Protocol.
PTP can be useful for packet capture, since that lets you associate a high-precision timestamp with each packet that you receive. Dedicated capture cards typically have some combination of PTP, IRIG, and PPS inputs.
NetFPGA is a toolbox for network-based packet processing. It is not a NIC, and their NIC reference designs leave a lot to be desired. Corundum is specifically a NIC.
Well, we're planning on porting Corundum to the NetFPGA SUME hardware at some point in the near future. Should be relatively straightforward as the PCIe interface on the Virtex 7 is the same as on the Ultrascale parts.
NetFPGA does have a NIC reference design, but AFAIK it's just the Xilinx XDMA core connected to a Xilinx 10G MAC. No accessible transmit scheduler, no offloading of any kind, etc. Just about as spartan as you can get, and it's built from completely closed components so you can't really make many modifications to it.
For what we're doing, we can't use any existing commercial NICs or smart NICs because they can't provide the precision we need in terms of controlling transmit timing. We don't care about eBPF, P4, etc. We care about PTP-synchronized packet transmission with microsecond precision.
Really interesting project. Couldn't find the motivation explained. Is this just for research? Is it usable in production running in an FPGA? Are there plans to produce hardware?
It's from a group at UCSD, so yes, this is research.
The applications for these kinds of things range from SDN (software-defined networking), where low latency is a concern, to applications in network monitoring. One could, for example, put together a system that performs line-rate TLS decryption at 10 Gbps. You need an FPGA (a big one) for something like that.
There are commercial vendors for this kind of stuff (selling closed-source IP and hardware). It is not yet in Open Compute networking projects, but I expect that's coming soon.
You can now buy "whitebox" switches that run Open Network Linux and put your own applications on them. In the not-too-distant future those "applications" will also extend to stuff that can run on FPGA hardware.
It's hard for me to see the use case of an FPGA NIC. The reasons outlined above don't seem compelling when a commodity NIC like Mellanox already does so much more.
Mellanox NICs (and basically all commercial NICs) do not do what we want. Software is not precise enough, and is on the wrong side of the NIC hardware queues. The whole point of Corundum is to get control of the hardware transmit scheduler on the NIC itself.
Corundum was originally geared more towards optical circuit switching applications, but it's certainly not limited to that. Since it's open source, the transmit scheduler can be swapped out for all sorts of NIC and protocol related research.
Yeah, I suppose that's a valid use case. Things like Ixia need to be FPGA-based to measure absolute latency without any uncertainty. You cannot currently get that with enough flexibility in commodity cards.
I saw it was from UCSD but that in itself wasn't an answer. There are plenty of things that are usable in production that are built by research groups initially.
SDN isn't the answer either unless these FPGAs can be used directly in production, so that there's a path for network cards to no longer be built on dedicated hardware. So to clarify my question, I could see this being:
1) A pure research effort on network card hardware design. Useful to test things in a lab and publish papers.
2) Something that can be pushed into production by actually shipping an FPGA in the router, perhaps in specialized situations where the fixed hardware isn't flexible enough.
3) A step before actual hardware can be manufactured, and network cards themselves become a whitebox style business where multiple generic vendors show up because the designs are open-source.
Original motivation is to support optical switching research for datacenter networking applications. The research group web page is here: https://circuit-switching.sysnet.ucsd.edu/ . It is also mentioned in these slides: https://arpa-e.energy.gov/sites/default/files/UCSD_Papen_ENL.... However, the design is very generic and should be interesting to applications outside optical switching. The main point was to get control over the transmit scheduler, coupled with a very large number of hardware transmit queues. There are a number of experimental protocols and similar that could benefit from this vs. implementation in DPDK.
It is still in development; not sure if I would trust it yet for production workloads. We will not be producing hardware; the design runs on pretty much any board that has the correct interfaces, including many FPGA dev boards and commercially available FPGA-based NICs such as the Exablaze X10 and X25.
Might be a silly question, but is there a technique to rapidly program FPGAs without interrupting other processes? Say I have multiple soft CPUs and only want to use my gates to enable Ethernet once it's needed or the user plugs in?
Not a silly question and actually a very powerful, seldom used feature: partial reconfiguration.
There are, however, several limitations to it. The clock cannot be changed, for example, and usually neither can I/O, especially high-speed transceivers. This has been improving (Xilinx UltraScale parts allow for reconfiguring I/O), but you still have to reserve area for reconfiguration (meaning literal area, as in a geographic region of the FPGA).
However, the I/O versatility you suggested has very few advantages. You need the reserved logic for Ethernet to be programmed when you plug in, so why would you leave it unprogrammed? If it is simply disabled, it won't use any extra power, and your soft CPUs won't be able to take advantage of those resources while you are working. Maybe you could use the area for new soft CPUs, but then you'll hit the problem of over-segmenting your design and allowing for less optimization. This would inevitably impact timing constraints and area usage.
Also, FPGA programming may take minutes to finish, and always at least a few seconds. This will be very noticeable to a user and not very efficient if it has to be done frequently.
There are, of course, good uses for this. But there is also a lot of effort in doing it right, and you always risk overdoing it.
Is programming speed really that bad for the ultra high-end devices? Minutes? I don't remember it being that bad for the Amazon F1 when I ported a Xilinx build to use the F1 SDK (I didn't spend lots of time with our prior one, so I wouldn't know.) Of course, their programming strategy is extremely customized, but even for very high-utilization images, it was only ever on the order of seconds. Vivado is absolutely terribly slow though, no matter what you do, or what device you use. (Not to mention if you want to use the ILA support over the internet...)
Also, for some designs you can mitigate the reconfiguration time issue by having two regions and draining requests to one of them, before doing an update. Most of the Xilinx tooling for OpenCL does this kind of thing by default (4-6 "opencl kernel" regions.) But of course it's not always an option to give up that much space...
It depends on the programming interface. JTAG is bit serial and rather slow, so it can take quite a while to load a large FPGA via JTAG. However, there are several other interfaces that can be used, including QSPI, dual QSPI, parallel flash, and a simple parallel interface from some other controller. These can run at many MHz and can load a configuration into a large FPGA in less than a second.
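A quick back-of-envelope check, assuming a bitstream on the order of a few hundred megabits (the exact size and clock rates depend on the device and board, so these numbers are only illustrative):

```python
# Back-of-envelope configuration load times; bitstream size and clock rates
# are assumptions, not measurements.
BITSTREAM_BITS = 400e6  # assume a ~400 Mbit bitstream for a large device

def load_time_s(bits_per_clock, clock_hz):
    return BITSTREAM_BITS / (bits_per_clock * clock_hz)

print(f"JTAG, 1 bit/clk @ 10 MHz:        {load_time_s(1, 10e6):.1f} s")   # ~40 s
print(f"QSPI, 4 bits/clk @ 100 MHz:      {load_time_s(4, 100e6):.2f} s")  # ~1 s
print(f"Dual QSPI, 8 bits/clk @ 100 MHz: {load_time_s(8, 100e6):.2f} s")  # ~0.5 s
```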
Tabula had an FPGA that could switch between configurations in a single cycle (it could be time-sliced among up to 8 configurations). Not sure if their actual product would have supported what you propose -- it was more a way of trading off design size against clock speed. A large design could use 8X the logic and run at 1/8th the clock speed of a small design.
The hobbyist in me is disappointed: even a used ExaNIC X10 seems to go for US$350, and forget about the four-figure UltraScale boards :( Of course, the FPGA can do a lot of other stuff as well, so for HPC this might be really nice to offload application-specific work really early (because sustained 2x10G traffic is bound to go somewhere, e.g. the CPU[s]).
$350 is cheap for such a powerful NIC. You could pay $10k for a Napatech, or $2k for a Solarflare, and not get this level of programmability.
libexanic is a remarkably clean user-space kernel-bypass library that allows you to do processing on early fragments of the packet while the rest are still being received.
Hence the hobbyist. As a professional engineer I mostly care if the investment into expensive hardware (and developing/adapting a software stack) is worth the gains. And I can totally see that adding a $2k card to a bunch of $5k servers can be cheaper than throwing more servers at a problem (especially after saving on power, rack space and cooling).
But in that context I'd only consider a used $350 card for dev/eval work. I don't want to tell the customer that his infrastructure was down for a few days because I convinced him to cheap out on a NIC.
> But in that context I'd only consider a used $350 card for dev/eval work. I don't want to tell the customer that his infrastructure was down for a few days because I convinced him to cheap out on a NIC.
I would argue that it strongly depends on the price differential and how resilient your system is. If your system is properly failure tolerant, and you can buy twice as much hardware for the same price by accepting a 20% failure rate (say), then it would be strongly advantageous to buy all used hardware.
Yes, I am aware of those. However, the Kintex PCIe interface is a bit of a pain, as it has a TLP straddling mode that can't be disabled, so it will be some time before it's supported; it will require some significant reworking of the PCIe interface modules. I am planning on supporting straddling eventually, as this will improve PCIe link utilization on the UltraScale and UltraScale+ parts. If someone wants to donate a board, I can look into supporting it.
Straddling is an artifact of very wide interfaces. On the UltraScale+ parts, the PCIe Gen 3 x16 interface comes out as a 512-bit-wide interface. Every cycle of the 250 MHz PCIe user clock transfers 64 bytes of data. The issue has to do with how packets are moved over this type of interface. If your packets are all a multiple of 64 bytes, no problem, you get 100% throughput. However, if your packets are NOT a multiple of 64 bytes in length, you have a problem: which byte lane do packets start and end in? The simplest implementation is to always start packets in byte lane 0. The interface logic for this is the simplest - the packets always start in the same place, so the fields always end up in the same place. However, if your packet is 65 bytes long, the utilization is horrible: it doesn't fit in one cycle, so you have to add an extra cycle for every packet, and bus utilization falls to about 50% as you have 63 empty byte lanes after every packet.
Straddling is an attempt to mitigate this issue. Instead of only starting packets in lane 0, the interface is adjusted to support starting packets in several places. Say, byte lanes 0 and 32. Or 0, 16, 32, and 48. Now, when a packet ends partway through a cycle, you can start the next packet in the same clock cycle, but in byte lane 16 or 32. This increases the interface utilization. The trade-off is that the logic now has to deal with parts of two packets in the same clock cycle, and it has to handle multiple possible packet offsets.
The specific annoyance with PCIe packets is that the max payload size is usually 256 bytes, but every packet has a 12 or 16 byte TLP header attached, which really screws things up when combined with the small max payload size.
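A quick worked example of why straddling matters, using the 64-byte bus and the 256-byte max payload plus 16-byte TLP header mentioned above (numbers only, not actual interface code):

```python
# Rough utilization estimate for a 64-byte-wide PCIe user interface.
# Assumes a 256-byte max payload and a 16-byte TLP header, as discussed above.
BUS_BYTES = 64  # 512-bit interface: 64 bytes per clock

def cycles_without_straddle(tlp_bytes):
    # Every TLP must start in byte lane 0, so round up to whole bus cycles.
    return -(-tlp_bytes // BUS_BYTES)

tlp = 256 + 16                            # 272-byte TLP
cycles = cycles_without_straddle(tlp)     # 5 cycles
utilization = tlp / (cycles * BUS_BYTES)  # 272 / 320 = 85%
print(f"{cycles} cycles, {utilization:.0%} utilization without straddling")
# With straddling, the next TLP can begin in the partially-filled last cycle,
# pushing utilization back toward 100%.
```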
No free 40G MAC/PHY. Unfortunately, the Xilinx CMAC is 100G only, and the Xilinx soft 40G MAC/PHY is $$$$. I have looked in to building a 40G/100G switchable MAC/PHY, but it's going to be a serious pain in the rear.
Funny you mention that switch, we bought one of those off of eBay for our testbed as it supports PTP.
Also, for optical switching applications, one of the most important factors is how long it takes to bring up the link after switching. Because of this, we have no interest in spending time on 40G and 100G interfaces: inter-lane deskew takes hundreds of microseconds, and 100G also requires FEC, which takes hundreds of microseconds to lock. So we're focused on 10G and 25G and running multiple links in parallel, which also provides more architectural flexibility. I added 100G support for three main reasons: the CMAC license is free, so why not; supporting 100G makes the project a whole lot more interesting than only 10G or 25G; and it provides a simple way of testing the core NIC datapath.
oh... now I understand the purpose of the TDMA. Using actual optical switching to interconnect. Very interesting!
Got any pointers to the sort of optical switching components you're using?
[I've been out of the networking business professionally for almost a decade now, so I'm a bit out of touch with the state of the art in optical stuff--- I was somewhat surprised recently to learn of the existence and low cost of LR4 40gb optics. :P]
The current generation of switches that we're working on uses diffraction gratings patterned onto glass hard drive platters, installed in a modified hard drive, spun by a custom motor controller that's synchronized to the NICs via PTP.
Ha. I was going to guess it would be an AOM, I wouldn't have guessed a diffraction grating on a hard drive platter. That's awesome, and must be incredibly energy efficient.
The cost of switch ports and interconnects could all be dumped into making host interfaces faster, allowing for the switching time to be reduced.
Crosstalk is better than 30 dB, and double pass loss between ports is 5-8 dB. The switch is basically cycling through three or four different interconnection patterns that are defined by looped back fiber connections, so the signal has to pass through the switch twice.
This looks very interesting. If this were included in the Linux kernel, would it read and utilize existing sysctl memory and qlen values as well as have its own sysctl settings, or would all the settings be set at module load via modprobe parameters? The reason I ask is that I currently disable TOE (TCP offload engine) on all my NICs, as they have their own buffer and retry settings and ignore the OS network settings.
It's still in development at the moment. We'll see about the interface. But there are no plans to implement any segmentation offloads or TOE in Corundum, that will be left up to the network stack. However, scatter/gather DMA support is planned so that software GSO will work. Right now, most of the low-level twiddling is done from a user space app that directly accesses device registers. For a research device, that's fine, but that would obviously have to be improved for a commercial product.
Although this is cool, the downside is that it only works on specific FPGAs. Are hardware independent designs possible so that one can run it on any sufficiently large FPGA?
Making RTL portable, so it can be used with different toolchains, is certainly possible -- and how/why you do this is, like everything, an engineering tradeoff. Sometimes when you write C code it's easier to just use Linux features and not care about portability! Sometimes it's very easy (or very desirable) to keep your code portable, which might be easy (use the standard library) or hard (use lots of #ifdef or whatever.)
But, unlike portable C code, to run designs like this on real hardware, you need to do things like describe how the physical pins on the FPGA are connected to the board peripherals (for instance, describing which pin might be connected to an LED, vs a UART). This generally requires a small amount of glue, and depending on how the project is structured, some amount of Verilog/VHDL code, as well. It's not like saying "cc -O2 foo.c" with your ported C compiler that has a POSIX standard library.
This is just the case where you're using the same base FPGA, but with different board layouts. Using different FPGAs (for example, a current-gen FPGA by vendor XYZ vs. XYZ gen N-1), or especially porting between vendors -- the details can become vastly more complex very quickly.
Corundum should work on any UltraScale or UltraScale+ FPGA with the necessary interfaces, and probably also Virtex 7. We're considering porting to Intel/Altera parts at some point as well... mainly the PCIe interface will need work, but the rest should be directly portable.
I really wish people would try making original names for their products. Not only am I now going to have to sift through Steven Universe stuff when I look up corundum for people, now I'll have to filter this out as well.
I vaguely remember a corundum project related to the ruby (and rust?) languages, but search engines fail because the terms are all related to mineralogy.
FWIW: this appears to be different / unrelated to ruby or rust.