A problem on the hardware side is that Intel's IOMMU TLB is tiny (64 entries), so using huge pages for all DMA-accessible memory is absolutely required to get good performance out of it.
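Concretely, a userspace driver along the lines of the one in the paper would back all DMA-visible memory with 2 MiB pages. A rough Linux sketch (illustrative only; it assumes huge pages are reserved on the system and leaves out the actual IOMMU/IOVA mapping step):

    #include <stdio.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2UL << 20)  /* 2 MiB */

    /* One 2 MiB page needs a single IOMMU TLB entry where 4 KiB pages
     * would need 512, so the 64-entry TLB goes a lot further. */
    static void *alloc_dma_buffer(size_t size)
    {
        /* round up to a whole huge page */
        size = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return NULL;
        }
        mlock(buf, size);  /* keep it resident; the device DMAs into it */
        return buf;
    }

Map that memory into the IOMMU once at startup and the device only ever needs a handful of TLB entries.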

We've done some benchmarks here: https://www.net.in.tum.de/fileadmin/bibtex/publications/pape... (Figure 9 on page 10)

Only a very basic benchmark, working on more...




Nice paper, and thanks for the reminder of how small the IOMMU TLB is. We never hit this because we were testing full-sized packets (really larger, because of TSO) and hit host IOMMU management overheads at ~100k to 200k TSO sends/sec.


Interesting, did you use huge pages?

I think ~100k to 200k TSO "packets" per second should be doable with the IOMMU. But I guess it depends on where the data is coming from. Could it be one of the odd cases where copying data is faster than zero-copy, e.g., just copying everything into the same small set of small-ish buffers to keep the number of pages that need to be resident in the IOMMU small?
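Roughly what I mean, as a sketch (buffer count/size and names are made up, and the pool is assumed to be mapped into the IOMMU once at init):

    #include <stdint.h>
    #include <string.h>

    #define POOL_BUFS 64
    #define BUF_SIZE  2048  /* small-ish; one MTU-sized packet */

    /* All buffers sit in one or two huge pages that are mapped into the
     * IOMMU once, so copies hit warm TLB entries instead of forcing the
     * IOMMU to walk page tables for arbitrary source pages. */
    struct tx_pool {
        uint8_t  bufs[POOL_BUFS][BUF_SIZE];
        uint64_t iova[POOL_BUFS];  /* bus addresses, filled in at init */
        unsigned next;
    };

    /* Copy a payload into the next pool buffer and return the address
     * the NIC should DMA from. */
    static uint64_t tx_copy(struct tx_pool *p, const void *data, size_t len)
    {
        unsigned i = p->next++ % POOL_BUFS;
        memcpy(p->bufs[i], data, len < BUF_SIZE ? len : BUF_SIZE);
        return p->iova[i];
    }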


This sounds like a driver bug / misconfiguration. Once the IOMMU is set up, there isn't any host-side management to be done (other than managing the IOMMU TLB, but unless it's thrashing, that's a no-op on the fast path).

At most, each kernel driver has to do an extra addition to map its physical I/O offset to the one exposed to the bus by the IOMMU. With huge pages, there’s approximately one offset per driver, so it lives in cache, probably next to other driver state.
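I.e., the fast-path translation is a single addition against a per-driver constant, something like (struct and names purely illustrative):

    #include <stdint.h>

    /* Per-driver mapping state: one contiguous IOVA window per driver. */
    struct dma_region {
        uintptr_t cpu_base;   /* address the driver uses */
        uint64_t  iova_base;  /* address the device sees via the IOMMU */
    };

    /* Fast path: bus address = one addition against a cached constant. */
    static inline uint64_t to_bus_addr(const struct dma_region *r, const void *p)
    {
        return r->iova_base + ((uintptr_t)p - r->cpu_base);
    }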


FWIW: 64 entries isn't particularly small for a dTLB; IIRC that's exactly the size on current Intel cores. The real problem is that device DMA, unlike software behavior, is distressingly non-local. The device will stream out a packet or storage block and then never touch that memory again (or not for a very long time -- memory buffers are huge relative to bandwidth on these devices). The TLB just doesn't do you much good.


There's only one level of TLBs in the IOMMU. And that's 64 entries.

Yeah, I think the dTLB is only 64 entries on Intel CPUs as well, but there's a second larger layer behind that, and an even larger third layer. IIRC it's a total of 4096 entries on recent Intel CPUs.


Could P2P DMA for zero-copy (e.g. shoveling data from NVMe to the NIC) avoid a large fraction of the IOMMU overhead?



