Well, there is a major architectural reason why the entire M-series appears to be "so fast", and that is unified memory. The CPU and GPU share one memory pool, which eliminates the buffer-to-buffer copying that is probably over half of what a non-unified-memory chip is doing at any given time. Instead of copying data, you just reference it where it is, and you're done.
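To make the copy-versus-reference point concrete, here is a rough analogy in plain Python. It is only a sketch: ordinary host buffers stand in for CPU and GPU memory, no real GPU API is involved, and the names `frame`, `copied`, and `shared` are made up for illustration.

```python
# Analogy only: plain Python buffers standing in for CPU/GPU memory.
frame = bytearray(b"pixels-v1")   # data produced by one "device"

copied = bytes(frame)             # non-unified style: duplicate the buffer
shared = memoryview(frame)        # unified style: reference the data in place

frame[-1] = ord("2")              # the producer updates its data

print(bytes(shared))              # b'pixels-v2' (sees the update, no copy made)
print(copied)                     # b'pixels-v1' (stale until copied again)
```

The copy costs time and memory proportional to the buffer size every time the data changes hands, while the view costs the same no matter how large the buffer is.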
I really like the principles behind AMD's chiplet design. Of course they had different design goals for it (easier diversification of their product portfolio), but the fact remains that you can slap a not-so-terrible GPU right next to the CPU cores.
There's probably a lot still missing: Apple integrated the memory into the same package as the chip, and built Metal so software can take direct advantage of that design. That's the competitive advantage of vertical integration.
I think UMA is the secret sauce for faster PCs/laptops that people tend to overlook, since You Can Never Has Enough RAM (TM).
I'm planning to buy an HP ZBook Ultra G1a laptop with an AMD Ryzen Strix Halo chip, and it seems to be a very good alternative to Apple's M-series laptops [1]. It supports up to 128 GB of RAM (with up to 96 GB of that usable as VRAM) and should be able to run the GPT-OSS 120B model.
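A quick back-of-the-envelope check of the "should be able to run" part, counting weights only. The assumptions are mine: a nominal 120 billion parameters taken from the model name, and roughly 4.25 bits per weight as a stand-in for a 4-bit block format with scaling overhead. KV cache and runtime overhead are ignored, and the 96 GB figure is compared loosely against GiB.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

VRAM_LIMIT = 96  # GPU-addressable limit quoted above, treated loosely as GiB

# Bits per weight are assumptions for illustration, not measured values.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("~4-bit", 4.25)]:
    size = weights_gib(120, bits)
    verdict = "fits" if size <= VRAM_LIMIT else "does not fit"
    print(f"{label:>7}: {size:6.1f} GiB -> {verdict}")
```

Under those assumptions only the roughly 4-bit case comes in under the 96 GB limit, with some headroom left for context. So whether it runs depends entirely on the quantization you pick.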
It's not just GPU memory, it's also I/O memory. That speeds up a lot: you just update a pointer to where the data already is, with no copying out of I/O memory.