I'm waiting for the day we finally adopt the obvious best model for pipelined SMP applications: scheduling the next process in the pipeline onto the core where its data are already cache-local.
'Work-stealing' schedulers already do this: jobs are scheduled onto the core that created them, and presumably touched their data last, unless there is load imbalance, in which case other cores take jobs. I don't know the internals of Erlang's scheduler, but I'd be surprised if it weren't already work-stealing, as that's the usual technique.
As far as I'm aware, most work-stealing schedulers still aren't cache-aware. One really naive (but possibly effective) way to do this would be a per-core (or per-L2, or per-NUMA-node) work LIFO that gets consulted before looking to other cores for work. If a task enqueues more work on your core/L2/NUMA node right before it finishes, the next task you pick up is more likely to be local, with its data still warm (rough sketch below). This, of course, doesn't work if you're more concerned about jitter or latency under load.
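To make that concrete, here's a minimal C++ sketch of the "local LIFO first, then steal" loop. All the names (Task, Worker, worker_loop) are made up for illustration, and a real scheduler would use lock-free deques (a la Chase-Lev) rather than a mutex per queue.

    // Hypothetical sketch: per-core LIFO consulted before stealing from other cores.
    // Task, Worker and worker_loop are illustrative names, not from any real library.
    #include <atomic>
    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <utility>
    #include <vector>

    using Task = std::function<void()>;

    struct Worker {
        std::mutex m;
        std::deque<Task> q;  // back = most recently pushed, data most likely still cached

        void push(Task t) {  // called when a task on this core spawns more work
            std::lock_guard<std::mutex> lk(m);
            q.push_back(std::move(t));
        }
        std::optional<Task> pop_local() {  // LIFO end: newest, cache-hot work
            std::lock_guard<std::mutex> lk(m);
            if (q.empty()) return std::nullopt;
            Task t = std::move(q.back());
            q.pop_back();
            return t;
        }
        std::optional<Task> steal() {  // FIFO end: oldest, probably cache-cold work
            std::lock_guard<std::mutex> lk(m);
            if (q.empty()) return std::nullopt;
            Task t = std::move(q.front());
            q.pop_front();
            return t;
        }
    };

    void worker_loop(std::vector<Worker>& workers, std::size_t self, std::atomic<bool>& stop) {
        while (!stop.load(std::memory_order_relaxed)) {
            // 1. Prefer our own LIFO: the newest local task probably has its data in our cache.
            if (auto t = workers[self].pop_local()) { (*t)(); continue; }
            // 2. Only when the local queue is empty do we go looking at other cores.
            for (std::size_t i = 0; i < workers.size(); ++i) {
                if (i == self) continue;
                if (auto t = workers[i].steal()) { (*t)(); break; }
            }
        }
    }

Stealing from the opposite (FIFO) end is the usual trick: the thief takes the coldest work and leaves the cache-hot tasks where their data already live.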
I noticed a paper about a cache-aware work-stealing scheduler which I have not yet read [0].
Frankly, I believe Intel could sell processors today with ten times more cache per core, and the queue for them at $50,000 a socket would be immense.
I'm probably underestimating the likely cost several times over, and the cooling would look like a science-fiction film set built to 1:12 scale, but I certainly know businesses with a real desire for a product like that.
Am I missing a showstopper that makes this impossible? I'm not going to be persuaded it couldn't be done by mere impracticalities. I'm quite prepared to accept heatsinks the size of cantaloupes...
The problem with huge caches is that access latency grows with the physical distance of the cache cells from the execution pipelines.
This is why you typically see vendors add new cache levels rather than drastically expanding existing ones, especially the lower-level caches (L1 and L2): a bigger cache is necessarily a slower cache.
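You can see that trade-off from software with a simple pointer-chasing loop: once the working set outgrows a cache level, the average dependent-load latency steps up. The buffer sizes below (16 KiB through 64 MiB) are just assumptions about a typical part, not measurements of any specific CPU.

    // Rough pointer-chasing microbenchmark: average latency of a dependent load
    // versus working-set size. Expect visible steps as the buffer falls out of
    // L1, L2, L3 and finally into DRAM. Compile with optimizations (e.g. -O2).
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        for (std::size_t kb : {16, 256, 4096, 65536}) {  // assumed rough L1/L2/L3/DRAM sizes
            const std::size_t n = kb * 1024 / sizeof(std::size_t);
            // Build a random cyclic permutation so the hardware prefetcher can't predict the chase.
            std::vector<std::size_t> idx(n), next(n);
            std::iota(idx.begin(), idx.end(), std::size_t{0});
            std::shuffle(idx.begin(), idx.end(), std::mt19937_64{42});
            for (std::size_t i = 0; i + 1 < n; ++i) next[idx[i]] = idx[i + 1];
            next[idx[n - 1]] = idx[0];

            const std::size_t steps = 10'000'000;
            std::size_t p = idx[0];
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < steps; ++i) p = next[p];  // each load depends on the previous one
            auto t1 = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
            std::printf("%8zu KiB: ~%.1f ns per load (p=%zu)\n", kb, ns, p);  // print p so the chase isn't optimized away
        }
    }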