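One of my previous jobs involved coding on a media processor. That processor had a direct-mapped cache, so code size and layout mattered. In a direct-mapped cache, each address can live in exactly one cache line, so two hot routines whose addresses map to the same line keep evicting each other. Ideally, you wanted the performance-critical code to fit in the cache and to map to different cache lines to avoid thrashing.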
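To make the thrashing point concrete, here is a rough sketch of how you might check whether two hot routines collide in a direct-mapped cache. The cache geometry (32-byte lines, 16 KiB total), the routine names, and the addresses are all made up for illustration; they are not the real numbers from that media processor.

```python
# Hypothetical cache geometry, chosen only for illustration.
LINE_SIZE = 32                       # bytes per cache line
NUM_LINES = 512                      # 512 lines * 32 B = 16 KiB cache
CACHE_SIZE = LINE_SIZE * NUM_LINES


def cache_line_index(addr):
    """Return the one line an address can occupy in a direct-mapped cache."""
    return (addr // LINE_SIZE) % NUM_LINES


def lines_touched(start, size):
    """Return the set of line indices covered by `size` bytes of code at `start`."""
    first = start // LINE_SIZE
    last = (start + size - 1) // LINE_SIZE
    return {block % NUM_LINES for block in range(first, last + 1)}


def conflicts(a, b):
    """Line indices that two (start, size) code regions compete for."""
    return lines_touched(*a) & lines_touched(*b)


if __name__ == "__main__":
    decode_loop = (0x1000, 256)    # hypothetical hot routine
    filter_loop = (0x5000, 256)    # placed exactly CACHE_SIZE bytes later
    moved_filter = (0x5100, 256)   # same routine, shifted by 256 bytes

    # Both routines span 8 lines; at these addresses all 8 collide.
    print(len(conflicts(decode_loop, filter_loop)))
    # After shifting the second routine, no lines are shared.
    print(len(conflicts(decode_loop, moved_filter)))
```

The takeaway is that two routines exactly one cache size apart fight over the same lines even though both would easily fit in the cache, and moving one of them by a few hundred bytes removes the conflict. On real hardware you would get this effect through linker scripts or section placement rather than a script like this.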
As a concrete example, Carlos Bueno's Mature Optimization Handbook[0] describes how the HHVM team got substantial performance wins by reducing instruction cache misses in rarely executed code.
Personally, I see the value in exploring the limits of what systems are capable of, and in finding ways to use them outside the parameters for which they were designed.
I would also generally like to avoid being on-call for a system that is being pushed to its limits or used outside the parameters it was designed for.
I am very curious whether anyone is shipping Cosmopolitan Libc / Actually Portable Executable binaries, either internally or for consumption by end users. I would love to hear more about the experience!