Corundum is an open-source, high-performance FPGA-based NIC. Features include a high-performance datapath, 10G/25G/100G Ethernet, PCI Express Gen 3, a custom high-performance tightly-integrated PCIe DMA engine, many (1000+) transmit, receive, completion, and event queues, MSI interrupts, multiple interfaces, multiple ports per interface, per-port transmit scheduling including high-precision TDMA, flow hashing, RSS, checksum offloading, and native IEEE 1588 PTP timestamping. A Linux driver is included that integrates with the Linux networking stack. Development and debugging are facilitated by an extensive simulation framework that covers the entire system, from a simulation model of the driver and PCI Express interface on one side to the Ethernet interfaces on the other.
Corundum has several unique architectural features. First, transmit, receive, completion, and event queue states are stored efficiently in block RAM or ultra RAM, enabling support for thousands of individually-controllable queues. These queues are associated with interfaces, and each interface can have multiple ports, each with its own independent scheduler. This enables extremely fine-grained control over packet transmission. Coupled with PTP time synchronization, this enables high precision TDMA.
FPGA = field-programmable gate array. NIC = network interface card. PTP = Precision Time Protocol. TDMA = time-division multiple access. I’m trying to understand what it is and what makes it special.
Some detail from Reddit on what's unique about it:
"Corundum is being developed to facilitate optical networking research and as such has some unique architectural features. First, all hardware queue state is stored in block RAM or ultra RAM, enabling support for thousands of independent, hardware controllable transmit, receive, completion, and event queues. This enables fine-grained hardware control over packet emission on a per-destination or per-flow basis. Additionally, the NIC supports multiple ethernet ports per interface that have separate schedulers but share the same hardware queues, enabling functionality such as striping packets across ports or rapidly migrating flows from port to port. The port schedulers can be made aware of PTP time, enabling high-precision TDMA that's synchronized across a large network."
Regarding TDMA only: it's part of IEEE time-sensitive networking (TSN), which is intended to make Ethernet suitable for industrial applications where short latencies and deterministic behavior are critical and not guaranteed with stock Ethernet.
Supporting critical traffic with TSN is a two-step process. First, you synchronize all the participating network nodes. For this you can use PTP (IEEE 1588), which is like an Ethernet-level NTP (grossly oversimplified, but you get the idea). Once all the nodes are in sync, they can use time-aware scheduling (TAS), where a TDM frame is overlaid across the whole LAN and Ethernet traffic classes (TCs) are assigned to specific ranges. In other words, you define a repeating pattern, split into different sequential zones, and TCs are assigned to some of those zones. The goal is to define repeating ranges dedicated to specific traffic classes, where one can control the load and make sure there is no contention, so traffic goes through with deterministic latency.
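To make the idea concrete, here is a rough sketch (in Python, not taken from any TSN implementation) of what a repeating TAS cycle looks like; the cycle length, window boundaries, and traffic class numbers are all made up for illustration:

```python
# Hypothetical TAS-style cycle, for illustration only; all values are made up.
# A repeating cycle is split into windows, and each window is open to a set of
# Ethernet traffic classes (TCs).

CYCLE_NS = 1_000_000  # 1 ms repeating cycle

# (window start offset in ns, window length in ns, TCs allowed to transmit)
WINDOWS = [
    (0,       200_000, {7}),        # first 200 us reserved for the critical TC
    (200_000, 800_000, {0, 1, 2}),  # remainder open to best-effort TCs
]

def allowed_classes(ptp_time_ns):
    """Return the traffic classes allowed to transmit at the given PTP time."""
    offset = ptp_time_ns % CYCLE_NS
    for start, length, tcs in WINDOWS:
        if start <= offset < start + length:
            return tcs
    return set()

print(allowed_classes(123_456))  # inside the reserved window -> {7}
print(allowed_classes(654_321))  # inside the best-effort window -> {0, 1, 2}
```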
All this could be used in a plant to support both best-effort traffic and sensitive real-time traffic for automation, while protecting the latter.
TSN started out for media applications (broadcasting) over Ethernet, but is getting into industrial applications (see https://opcfoundation.org/).
Support for TSN is planned for 5G (NR) release 16, to support industrial applications.
All this area is in flux, so having a flexible programmable platform can be interesting.
So, since Corundum is open source, you can plug in whatever arbitrary transmit scheduler you want. Corundum also supports 10,000+ transmit queues (I have synthesis-tested up to 32,768 transmit queues on the UltraScale+). This is super interesting for all sorts of networking applications, as that's a large enough number of queues to give individual flows or connections their own hardware queue, so the scheduler on the NIC can directly control the flow of information leaving the NIC.
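As a rough illustration (not Corundum code), here is how a flow's 5-tuple might be hashed to its own hardware transmit queue; the hash function and queue count are just assumptions for the sketch:

```python
# Hypothetical illustration: with enough hardware queues, each flow's 5-tuple
# can hash to its own transmit queue, so a NIC-side scheduler can pace flows
# individually. Not Corundum code; constants are assumptions.
import zlib

NUM_TX_QUEUES = 32_768  # the synthesis-tested figure mentioned above

def queue_for_flow(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the 5-tuple to pick a transmit queue index."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % NUM_TX_QUEUES

print(queue_for_flow("10.0.0.1", "10.0.0.2", 6, 51000, 443))
```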
TDMA is basically a simple demonstration scheduler that enables and disables queues on microsecond timescales, based on PTP time. One of the original reasons for building Corundum was to enable optical switching research, where data transmission into the switch must be precisely coordinated with the configuration of the switch itself. We have tried to do this in software, but the precision is limited and the CPU overhead is high. With Corundum, the schedule is enforced in hardware, so it is extremely precise and does not add any CPU overhead.
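A minimal software model of that kind of PTP-driven TDMA gating might look like the following (slot size and queue assignment are hypothetical; Corundum does this in hardware):

```python
# Hypothetical software model of PTP-driven TDMA queue gating.
# Corundum enforces the schedule in hardware; this only illustrates the math.
SLOT_NS = 10_000      # 10 us slots (assumed)
SLOTS_PER_PERIOD = 8  # repeating period of 8 slots (assumed)

# Which queues are enabled in each slot (assumed one queue per slot)
SLOT_TO_QUEUES = {s: {s} for s in range(SLOTS_PER_PERIOD)}

def enabled_queues(ptp_time_ns):
    """Return the set of transmit queues enabled at the given PTP time."""
    slot = (ptp_time_ns // SLOT_NS) % SLOTS_PER_PERIOD
    return SLOT_TO_QUEUES[slot]

print(enabled_queues(35_000))  # 35 us falls in slot 3 -> {3}
```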
TDMA usually refers to legacy time-division-based communications standards such as E1/T1 (G.704) or SONET/SDH.
Emulating such services in a packet-switched network requires very precise timing, hence the reference to IEEE 1588 / Precision Time Protocol.
PTP can be useful for packet capture, since that lets you associate a high-precision timestamp with each packet that you receive. Dedicated capture cards typically have some combination of PTP, IRIG, and PPS inputs.
NetFPGA is a toolbox for network-based packet processing. It is not a NIC, and their NIC reference designs leave a lot to be desired. Corundum is specifically a NIC.
Well, we're planning on porting Corundum to the NetFPGA SUME hardware at some point in the near future. Should be relatively straightforward as the PCIe interface on the Virtex 7 is the same as on the Ultrascale parts.
NetFPGA does have a NIC reference design, but AFAIK it's just the Xilinx XDMA core connected to a Xilinx 10G MAC. No accessible transmit scheduler, no offloading of any kind, etc. Just about as spartan as you can get, and it's built from completely closed components so you can't really make many modifications to it.
For what we're doing, we can't use any existing commercial NICs or smart NICs because they can't provide the precision we need in terms of controlling transmit timing. We don't care about eBPF, P4, etc. We care about PTP-synchronized packet transmission with microsecond precision.
Really interesting project. Couldn't find the motivation explained. Is this just for research? Is it usable in production running in an FPGA? Are there plans to produce hardware?
It's from a group at UCSD, so yes, this is research.
The applications for these kinds of things range from SDN (software-defined networking), where low latency is a concern, to applications in network monitoring. One could, for example, put together a system that performs line-rate TLS decryption at 10 Gbps. You need an FPGA (a big one) for something like that.
There are commercial vendors for this kind of stuff (selling closed-source IP and hardware). It is not yet in Open Compute networking projects, but I expect that's coming soon.
You can now buy "whitebox" switches that run Open Network Linux and put your own applications on them. In the not-too-distant future those "applications" will also extend to stuff that can run on FPGA hardware.
It's hard for me to see the use case of an FPGA NIC. The reasons outlined above don't seem compelling when a commodity NIC like Mellanox already does so much more.
Mellanox NICs (and basically all commercial NICs) do not do what we want. Software is not precise enough, and is on the wrong side of the NIC hardware queues. The whole point of Corundum is to get control of the hardware transmit scheduler on the NIC itself.
Corundum was originally geared more towards optical circuit switching applications, but it's certainly not limited to that. Since it's open source, the transmit scheduler can be swapped out for all sorts of NIC and protocol related research.
Yeah, I suppose that's a valid use case. Things like Ixia need to be FPGA-based to measure absolute latency without any uncertainty. You cannot currently get that with enough flexibility in commodity cards.
I saw it was from UCSD but that in itself wasn't an answer. There are plenty of things that are usable in production that are built by research groups initially.
SDN isn't the answer either unless these FPGAs can be used directly in production, so that there's a path for network cards to no longer be built on dedicated hardware. So to clarify my question, I could see this being:
1) A pure research effort on network card hardware design. Useful to test things in a lab and publish papers.
2) Something that can be pushed into production by actually shipping an FPGA in the router, perhaps in specialized situations where the fixed hardware isn't flexible enough.
3) A step before actual hardware can be manufactured, and network cards themselves become a whitebox style business where multiple generic vendors show up because the designs are open-source.
Original motivation is to support optical switching research for datacenter networking applications. The research group web page is here: https://circuit-switching.sysnet.ucsd.edu/ . It is also mentioned in these slides: https://arpa-e.energy.gov/sites/default/files/UCSD_Papen_ENL.... However, the design is very generic and should be interesting to applications outside optical switching. The main point was to get control over the transmit scheduler, coupled with a very large number of hardware transmit queues. There are a number of experimental protocols and similar that could benefit from this vs. implementation in DPDK.
It is still in development; not sure if I would trust it yet for production workloads. We will not be producing hardware; the design runs on pretty much any board that has the correct interfaces, including many FPGA dev boards and commercially available FPGA-based NICs such as the Exablaze X10 and X25.
Might be a silly question, but is there a technique to rapidly program FPGAs without interrupting other processes? Say I have multiple soft CPUs and only want to use my gates to enable Ethernet once it's needed or the user plugs in?
Not a silly question and actually a very powerful, seldom used feature: partial reconfiguration.
There are, however, several limitations to it. The clock cannot be changed, for example, and usually neither can I/O, especially high-speed transceivers. This has been improving (Xilinx UltraScale parts allow for reconfiguring I/O), but you still have to reserve area for reconfiguration (meaning literal area, as in a geographic region of the FPGA).
However, the I/O versatility you suggested has very few advantages. You need the reserved logic for Ethernet to be programmed when you plug in, so why would you leave it unprogrammed? If it is simply disabled, it won't use any extra power, and your soft CPUs won't be able to take advantage of those resources while you are working. Maybe you could use the area for new soft CPUs, but then you'll hit the problem of over-segmenting your design and allowing for less optimization. This would inevitably impact timing constraints and area usage.
Also, FPGA programming may take minutes to finish, and always at least a few seconds. This will be very noticeable to a user and not very efficient if it has to be done frequently.
There are, of course, good uses for this. But there is also a lot of effort in doing it right, and you always risk overdoing it.
Is programming speed really that bad for the ultra high-end devices? Minutes? I don't remember it being that bad for the Amazon F1 when I ported a Xilinx build to use the F1 SDK (I didn't spend lots of time with our prior one, so I wouldn't know.) Of course, their programming strategy is extremely customized, but even for very high-utilization images, it was only ever on the order of seconds. Vivado is absolutely terribly slow though, no matter what you do, or what device you use. (Not to mention if you want to use the ILA support over the internet...)
Also, for some designs you can mitigate the reconfiguration time issue by having two regions and draining requests to one of them, before doing an update. Most of the Xilinx tooling for OpenCL does this kind of thing by default (4-6 "opencl kernel" regions.) But of course it's not always an option to give up that much space...
It depends on the programming interface. JTAG is bit serial and rather slow, so it can take quite a while to load a large FPGA via JTAG. However, there are several other interfaces that can be used, including QSPI, dual QSPI, parallel flash, and a simple parallel interface from some other controller. These can run at many MHz and can load a configuration into a large FPGA in less than a second.
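A quick back-of-envelope check, assuming a bitstream on the order of a few hundred megabits (the exact size and clock rates depend on the device and board, so these numbers are only illustrative):

```python
# Back-of-envelope configuration load times; bitstream size and clock rates
# are assumptions, not measurements.
BITSTREAM_BITS = 400e6  # assume a ~400 Mbit bitstream for a large device

def load_time_s(bits_per_clock, clock_hz):
    return BITSTREAM_BITS / (bits_per_clock * clock_hz)

print(f"JTAG, 1 bit/clk @ 10 MHz:        {load_time_s(1, 10e6):.1f} s")   # ~40 s
print(f"QSPI, 4 bits/clk @ 100 MHz:      {load_time_s(4, 100e6):.2f} s")  # ~1 s
print(f"Dual QSPI, 8 bits/clk @ 100 MHz: {load_time_s(8, 100e6):.2f} s")  # ~0.5 s
```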
Tabula had an FPGA that could switch between configurations in a single cycle (it could be time-sliced among up to 8 configurations). Not sure if their actual product would have supported what you propose -- it was more a way of trading off design size against clock speed. A large design could use 8X the logic and run at 1/8th the clock speed of a small design.
The hobbyist in me is disappointed: even a used ExaNIC X10 seems to go for US$350, and forget about the four-figure UltraScale boards :( Of course, the FPGA can do a lot of other stuff as well, so for HPC this might be really nice to offload application-specific work really early (because sustained 2x10G traffic is bound to go somewhere, e.g. the CPU[s]).
$350 is cheap for such a powerful NIC. You could pay $10k for a Napatech, or $2k for a Solarflare, and not get this level of programmability.
libexanic is a remarkably clean user-space kernel-bypass library that allows you to do processing on early fragments of the packet while the rest are still being received.
Hence the hobbyist. As a professional engineer I mostly care if the investment into expensive hardware (and developing/adapting a software stack) is worth the gains. And I can totally see that adding a $2k card to a bunch of $5k servers can be cheaper than throwing more servers at a problem (especially after saving on power, rack space and cooling).
But in that context I'd only consider a used $350 card for dev/eval work. I don't want to tell the customer that his infrastructure was down for a few days because I convinced him to cheap out on a NIC.
> But in that context I'd only consider a used $350 card for dev/eval work. I don't want to tell the customer that his infrastructure was down for a few days because I convinced him to cheap out on a NIC.
I would argue that it strongly depends on the price differential and how resilient your system is. If your system is properly failure tolerant, and you can buy twice as much hardware for the same price by accepting a 20% failure rate (say), then it would be strongly advantageous to buy all used hardware.
Yes, I am aware of those. However, the Kintex PCIe interface is a bit of a pain, as it has a TLP straddling mode that can't be disabled, so it will be some time before it's supported; it will require some significant reworking of the PCIe interface modules. I am planning on supporting straddling eventually, as this will improve PCIe link utilization on the UltraScale and UltraScale+ parts. If someone wants to donate a board, I can look into supporting it.
Straddling is an artifact of very wide interfaces. On the UltraScale+ parts, the PCIe Gen 3 x16 interface comes out as a 512-bit-wide interface. Every cycle of the 250 MHz PCIe user clock transfers 64 bytes of data. The issue has to do with how packets are moved over this type of interface. If your packets are all a multiple of 64 bytes, no problem, you get 100% throughput. However, if your packets are NOT a multiple of 64 bytes in length, you have a problem: which byte lane do packets start and end in? The simplest implementation is to always start packets in byte lane 0. The interface logic for this is the simplest - the packets always start in the same place, so the fields always end up in the same place. However, if your packet is 65 bytes long, the utilization is horrible: it doesn't fit in one cycle, so you have to add an extra cycle for every packet, and bus utilization falls to about 50% as you have 63 empty byte lanes after every packet.
Straddling is an attempt to mitigate this issue. Instead of only starting packets in lane 0, the interface is adjusted to support starting packets in several places. Say, byte lanes 0 and 32. Or 0, 16, 32, and 48. Now, when a packet ends partway through a cycle, you can start the next packet in the same clock cycle, but in byte lane 16 or 32. This increases the interface utilization. The trade-off is that the logic now has to deal with parts of two packets in the same clock cycle, and it has to handle multiple possible packet offsets.
The specific annoyance with PCIe packets is that the max payload size is usually 256 bytes, but every packet has a 12 or 16 byte TLP header attached, which really screws things up when combined with the small max payload size.
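A quick worked example of why straddling matters, using the 64-byte bus and the 256-byte max payload plus 16-byte TLP header mentioned above (numbers only, not actual interface code):

```python
# Rough utilization estimate for a 64-byte-wide PCIe user interface.
# Assumes a 256-byte max payload and a 16-byte TLP header, as discussed above.
BUS_BYTES = 64  # 512-bit interface: 64 bytes per clock

def cycles_without_straddle(tlp_bytes):
    # Every TLP must start in byte lane 0, so round up to whole bus cycles.
    return -(-tlp_bytes // BUS_BYTES)

tlp = 256 + 16                            # 272-byte TLP
cycles = cycles_without_straddle(tlp)     # 5 cycles
utilization = tlp / (cycles * BUS_BYTES)  # 272 / 320 = 85%
print(f"{cycles} cycles, {utilization:.0%} utilization without straddling")
# With straddling, the next TLP can begin in the partially-filled last cycle,
# pushing utilization back toward 100%.
```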
No free 40G MAC/PHY. Unfortunately, the Xilinx CMAC is 100G only, and the Xilinx soft 40G MAC/PHY is $$$$. I have looked in to building a 40G/100G switchable MAC/PHY, but it's going to be a serious pain in the rear.
Funny you mention that switch, we bought one of those off of eBay for our testbed as it supports PTP.
Also, for optical switching applications, one of the most important factors is how long it takes to bring up the link after switching. Because of this, we have no interest in spending time on 40G and 100G interfaces: inter-lane deskew takes hundreds of microseconds, and 100G also requires FEC, which takes hundreds of microseconds to lock. So we're focused on 10G and 25G and running multiple links in parallel, which also provides more architectural flexibility. I added 100G support for three main reasons: the CMAC license is free, so why not; supporting 100G makes the project a whole lot more interesting than only 10G or 25G; and it provides a simple way of testing the core NIC datapath.
oh... now I understand the purpose of the TDMA. Using actual optical switching to interconnect. Very interesting!
Got any pointers to the sort of optical switching components you're using?
[I've been out of the networking business professionally for almost a decade now, so I'm a bit out of touch with the state of the art in optical stuff--- I was somewhat surprised recently to learn of the existence and low cost of LR4 40gb optics. :P]
The current generation of switches that we're working on uses diffraction gratings patterned onto glass hard drive platters, installed in a modified hard drive, spun by a custom motor controller that's synchronized to the NICs via PTP.
Ha. I was going to guess it would be an AOM, I wouldn't have guessed a diffraction grating on a hard drive platter. That's awesome, and must be incredibly energy efficient.
The cost of switch ports and interconnects could all be dumped into making host interfaces faster, allowing for the switching time to be reduced.
Crosstalk is better than 30 dB, and double pass loss between ports is 5-8 dB. The switch is basically cycling through three or four different interconnection patterns that are defined by looped back fiber connections, so the signal has to pass through the switch twice.
This looks very interesting. If this were included in the Linux kernel, would it read and utilize existing sysctl memory and qlen values as well as have its own sysctl settings, or would all the settings be set at module load via modprobe parameters? The reason I ask is that I currently disable TOE (TCP offload engine) on all my NICs, as they have their own buffer and retry settings and ignore the OS network settings.
It's still in development at the moment. We'll see about the interface. But there are no plans to implement any segmentation offloads or TOE in Corundum, that will be left up to the network stack. However, scatter/gather DMA support is planned so that software GSO will work. Right now, most of the low-level twiddling is done from a user space app that directly accesses device registers. For a research device, that's fine, but that would obviously have to be improved for a commercial product.
Although this is cool, the downside is that it only works on specific FPGAs. Are hardware independent designs possible so that one can run it on any sufficiently large FPGA?
Making RTL portable, so it can be used with different toolchains, is certainly possible -- and how/why you do this is, like everything, an engineering tradeoff. Sometimes when you write C code it's easier to just use Linux features and not care about portability! Sometimes it's very easy (or very desirable) to keep your code portable, which might be easy (use the standard library) or hard (use lots of #ifdef or whatever.)
But, unlike portable C code, to run designs like this on real hardware, you need to do things like describe how the physical pins on the FPGA are connected to the board peripherals (for instance, describing which pin might be connected to an LED, vs a UART). This generally requires a small amount of glue, and depending on how the project is structured, some amount of Verilog/VHDL code, as well. It's not like saying "cc -O2 foo.c" with your ported C compiler that has a POSIX standard library.
This is just the case where you're using the same base FPGA, but with different board layouts. Using different FPGAs (for example, a current-gen FPGA by vendor XYZ vs. XYZ gen N-1), or especially porting between vendors -- the details can become vastly more complex very quickly.
Corundum should work on any UltraScale or UltraScale+ FPGA with the necessary interfaces, and probably also Virtex 7. We're considering porting to Intel/Altera parts at some point as well... mainly the PCIe interface will need work, but the rest should be directly portable.
I really wish people would try making original names for their products. Not only am I now going to have to sift through Steven Universe stuff when I look up corundum for people, now I'll have to filter this out as well.
I vaguely remember a corundum project related to the ruby (and rust?) languages, but search engines fail because the terms are all related to mineralogy.
FWIW: this appears to be different / unrelated to ruby or rust.