A problem on the hardware side is that Intel's IOMMU TLB is tiny (64 entries), so using huge pages for all DMA-accessible memory is absolutely required to get good performance out of it.
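Concretely, a userspace driver along the lines of the one in the paper would back all DMA-visible memory with 2 MiB pages. A rough Linux sketch (illustrative only; it assumes huge pages are reserved on the system and leaves out the actual IOMMU/IOVA mapping step):

    #include <stdio.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2UL << 20)  /* 2 MiB */

    /* One 2 MiB page needs a single IOMMU TLB entry where 4 KiB pages
     * would need 512, so the 64-entry TLB goes a lot further. */
    static void *alloc_dma_buffer(size_t size)
    {
        /* round up to a whole huge page */
        size = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return NULL;
        }
        mlock(buf, size);  /* keep it resident; the device DMAs into it */
        return buf;
    }

Map that memory into the IOMMU once at startup and the device only ever needs a handful of TLB entries.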

We've done some benchmarks here: https://www.net.in.tum.de/fileadmin/bibtex/publications/pape... (Figure 9 on page 10)

Only a very basic benchmark, working on more...




Nice paper, and thanks for the reminder of how small the IOMMU TLB is. We never hit this because we were testing full-sized packets (really larger, because of TSO) and hit host IOMMU management overheads at ~100k to 200k TSO sends/sec.


Interesting, did you use huge pages?

I think ~100k to 200k TSO "packets" per second should be doable with the IOMMU. But I guess it depends on where the data is coming from. Could it be one of the odd cases where copying data is faster than zero-copy, e.g., just copying everything into the same small set of small-ish buffers to keep the number of pages that need to be resident in the IOMMU small?
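Roughly what I mean, as a sketch (buffer count/size and names are made up, and the pool is assumed to be mapped into the IOMMU once at init):

    #include <stdint.h>
    #include <string.h>

    #define POOL_BUFS 64
    #define BUF_SIZE  2048  /* small-ish; one MTU-sized packet */

    /* All buffers sit in one or two huge pages that are mapped into the
     * IOMMU once, so copies hit warm TLB entries instead of forcing the
     * IOMMU to walk page tables for arbitrary source pages. */
    struct tx_pool {
        uint8_t  bufs[POOL_BUFS][BUF_SIZE];
        uint64_t iova[POOL_BUFS];  /* bus addresses, filled in at init */
        unsigned next;
    };

    /* Copy a payload into the next pool buffer and return the address
     * the NIC should DMA from. */
    static uint64_t tx_copy(struct tx_pool *p, const void *data, size_t len)
    {
        unsigned i = p->next++ % POOL_BUFS;
        memcpy(p->bufs[i], data, len < BUF_SIZE ? len : BUF_SIZE);
        return p->iova[i];
    }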


This sounds like a driver bug / misconfiguration. Once the IOMMU is set up, there isn't any host-side management to be done (other than managing the IOMMU TLB, but unless it's thrashing, that's a no-op on the fast path).

At most, each kernel driver has to do an extra addition to map its physical I/O offset to the one exposed to the bus by the IOMMU. With huge pages, there’s approximately one offset per driver, so it lives in cache, probably next to other driver state.
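I.e., the fast-path translation is a single addition against a per-driver constant, something like (struct and names purely illustrative):

    #include <stdint.h>

    /* Per-driver mapping state: one contiguous IOVA window per driver. */
    struct dma_region {
        uintptr_t cpu_base;   /* address the driver uses */
        uint64_t  iova_base;  /* address the device sees via the IOMMU */
    };

    /* Fast path: bus address = one addition against a cached constant. */
    static inline uint64_t to_bus_addr(const struct dma_region *r, const void *p)
    {
        return r->iova_base + ((uintptr_t)p - r->cpu_base);
    }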


FWIW: 64 entries isn't particularly small for a dTLB; IIRC that's exactly the size on current Intel cores. The real problem is that device DMA, unlike software behavior, is distressingly non-local. The device will stream out a packet or storage block and then never touch that memory again (or not for a very long time -- memory buffers are huge relative to bandwidth on these devices). The TLB just doesn't do you much good.


There's only one level of TLBs in the IOMMU. And that's 64 entries.

Yeah, I think the dTLB is only 64 entries on Intel CPUs as well, but there's a second larger layer behind that, and an even larger third layer. IIRC it's a total of 4096 entries on recent Intel CPUs.


Could P2P DMA for zero-copy (e.g. shoveling data from NVMe to the NIC) avoid a large fraction of the IOMMU overhead?



