Back in the nineties, 3DFX had a synthetic rendering benchmark that relied on keeping the entire benchmark in L1. The secret was taking over the entire machine so that no interrupt or other mechanism could pollute the cache.
A bit of ignorance on my part, but would the L1 be holding the data and the instructions? In which case we would be trying to fit our entire 6502 emulator in less than 64K of memory alongside the emulated RAM/ROM?
“The high-performance cores have an unusually large 192 KB of L1 instruction cache and 128 KB of L1 data cache and share a 12 MB L2 cache; the energy-efficient cores have a 128 KB L1 instruction cache, 64 KB L1 data cache, and a shared 4 MB L2 cache.”
⇒ chances are you don’t even have to try to fit an emulator of an 8-bit machine and its memory into the L1 cache.
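To put rough numbers on that: a 6502 has a 16-bit address bus, so the whole emulated RAM/ROM is at most 64 KB, while the emulator's own code lives in the separate instruction cache. A minimal sketch of the arithmetic in C, assuming 1 KB = 1024 bytes and using the data cache sizes from the quote above (`fits_in_cache` and the constant names are made up for illustration, and this only compares capacity, not associativity or whatever else is competing for the cache):

```c
#include <stdbool.h>
#include <stddef.h>

/* L1 data cache sizes from the quote above, assuming 1 KB = 1024 bytes. */
#define KIB 1024u
static const size_t PERF_L1D = 128 * KIB; /* high-performance cores */
static const size_t EFF_L1D  = 64 * KIB;  /* energy-efficient cores */

/* A 6502 has a 16-bit address bus: 2^16 = 65536 addressable bytes. */
static const size_t ADDRESS_SPACE_6502 = (size_t)1 << 16;

/* Hypothetical helper: pure capacity check, nothing more. */
static bool fits_in_cache(size_t bytes, size_t cache_bytes) {
    return bytes <= cache_bytes;
}
```

On the high-performance cores the full 64 KB address space uses half of the L1 data cache. On the efficiency cores it matches the capacity exactly, so any extra emulator state would spill into L2.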
Real question about L1 caches. For a long time, x86 (Intel & AMD) L1 caches have been pretty much pegged at 32KB. Do you know why they didn't make them larger? My guess: There is a trade-off between economics and performance.
Depends on the precise architecture, but ARM (and other RISC designs) usually have separate data and instruction L1 caches. You may need to be aware of this if writing self-modifying code, because coherence between the instruction and data caches may not be automatic.
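For what it's worth, with GCC or Clang the usual way to deal with this is the `__builtin___clear_cache` builtin, called on the byte range you just wrote. A rough sketch, assuming one of those compilers; `publish_code` is a hypothetical helper name, and getting `dst` into memory that is actually executable is left out entirely:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy freshly generated machine code into dst,
 * then let the compiler emit whatever cache maintenance the target
 * needs so that instruction fetches see the new bytes. */
static void publish_code(unsigned char *dst, const unsigned char *src, size_t len) {
    memcpy(dst, src, len);
    /* GCC/Clang builtin taking the begin and end of the modified range. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

On x86 the builtin typically compiles to nothing, since the hardware keeps the two caches coherent. On ARM it expands to the required cache maintenance, which is exactly the step that is easy to forget.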