Could reduced MMU overhead be significant on a server?
I work in the embedded space, so things are very different with small CPUs. Using an MMU instead of a simple MPU can easily waste 25 to 30% of CPU cycles. I wouldn't expect that much overhead on a big server CPU with a much bigger TLB (but also bigger working sets?). Still, a unikernel could use a very simple mapping, in some cases a single large page, and see very little MMU management overhead. I have no idea what the possible gain on a server would be, but I'm curious. Any insight welcome.
This was on an ARM926 running a fairly large code base. The ARM926 TLB has only 64 entries, which is not much when running an OS with 4 kB pages. So that's quite an extreme (and old...) case, but an interesting one when dealing with a rather small CPU and a not-so-small context. The MMU has a performance cost to keep in mind.
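To put a number on "not much": a rough way to look at it is TLB reach, i.e. entries times page size. The sketch below is just that arithmetic. The 64 entries and 4 kB pages are the figures above; the 1 MiB line assumes every entry holds an ARM 1 MiB section mapping, which is a best case I'm adding for comparison, not something I measured.

```python
KIB = 1024
MIB = 1024 * KIB

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory the TLB can cover before the next access must miss."""
    return entries * page_size

# Figures from the comment above: 64 entries, 4 kB pages.
small_pages = tlb_reach(64, 4 * KIB)

# Assumption for comparison only: all 64 entries hold 1 MiB section mappings.
sections = tlb_reach(64, 1 * MIB)

print(f"4 KiB pages:    {small_pages // KIB} KiB of reach")
print(f"1 MiB sections: {sections // MIB} MiB of reach")
```

So with 4 kB pages the whole TLB covers only 256 KiB, and any working set beyond that keeps paying for page table walks.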
A server is very different. Still, a blog post on transparent huge pages on Linux [1] shows an example where a heavily loaded JVM application server spends over 10% of its time doing page walks: "Yes, you see it right! More than 10% of CPU cycles were spent doing the page table walking.". So there's a massively bigger TLB, but the server also runs much larger applications with a large footprint. In the end, for some apps the MMU overhead can be significant in the server space.
In a unikernel, there's a single address space and no isolation. You could map it with only huge pages and dramatically reduce TLB misses, even for a large-footprint application (assuming a server running mostly such unikernels, since regular OS apps sharing the machine could otherwise thrash the TLB, I guess). Now, I'm not sure there's any such large-footprint application running on a unikernel yet, but it may be a possible gain for unikernels in principle.
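As a back-of-the-envelope illustration of why huge pages help here, this counts how many mappings it takes to cover a given footprint at each page size. The 8 GiB footprint is a made-up example, and the page sizes are the usual x86-64 ones (4 KiB, 2 MiB, 1 GiB); none of this comes from a measurement.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def pages_needed(footprint: int, page_size: int) -> int:
    """Number of pages needed to map `footprint` bytes (rounded up)."""
    return -(-footprint // page_size)  # ceiling division on integers

# Hypothetical application footprint, chosen only for illustration.
footprint = 8 * GIB

# Common x86-64 page sizes.
for name, size in [("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", 1 * GIB)]:
    print(f"{name} pages: {pages_needed(footprint, size)} mappings")
```

Fewer distinct mappings means far fewer translations competing for the same TLB entries, which is where the reduction in misses would come from.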