As I said, in most scenarios this doesn't matter, so I'm not sure why you're pointing at an average scenario as some sort of counter-argument?
There's a single memory controller on the M1. The CPU, GPU, Neural Engine, etc. all share that single controller (that's how "unified memory" is achieved). Given that the theoretical maximum throughput of the M1's memory controller is 68GB/s, and that the CPU alone can hit around 68GB/s in a memcpy, I'm not sure what you're expecting? If you hammer the GPU at the same time, it must by design share that single 68GB/s pipe to memory; there's no secondary dedicated pipe for it to use. So the bandwidth must necessarily be split in a multi-use scenario. There's no other option here.
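For reference, the kind of memcpy figure being quoted here comes from a benchmark along these lines. This is just a minimal sketch in C; the buffer size, iteration count, and the read+write accounting (counting 2x the bytes copied, since a copy both reads and writes) are my choices, not anything specific to how the M1 numbers were measured:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t size = 1UL << 28;   // 256 MiB, large enough to defeat caches
    const int iters = 32;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);            // touch pages so they're actually mapped
    memset(dst, 0, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    // A memcpy reads and writes every byte, so count 2x the copied volume.
    double gbps = 2.0 * size * iters / secs / 1e9;
    // Printing a byte of dst keeps the compiler from eliding the copies.
    printf("%.1f GB/s (dst[0]=%d)\n", gbps, dst[0]);

    free(src);
    free(dst);
    return 0;
}
```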
Think of it like a network switch. You can put five computers behind a gigabit switch and 99% of the time nobody will ever have an issue. At any given point you can run a speed test on one of them and see the full 1Gbit/s. But if you run speed tests on all of them simultaneously, of course each one won't see 1Gbit/s, since there's only a single 1Gbit/s connection upstream. Same thing here, just with a 68GB/s 8x16-bit-channel LPDDR4X memory controller.
The only way you can get full-throughput CPU and GPU memory performance simultaneously is to physically dedicate memory modules to the CPU and GPU independently (as in a typical desktop PC with discrete graphics). Otherwise they must compete for the shared resource, by definition of it being shared.
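You can actually see the "shared pipe" effect without involving the GPU at all: saturate the controller from several CPU threads and the aggregate throughput levels off near the controller's limit instead of scaling with thread count, so per-thread bandwidth drops. A rough sketch (thread count, buffer sizes, and the 2x read+write accounting are assumptions on my part; build with -lpthread):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NTHREADS 4
#define BUF_SIZE (1UL << 27)  // 128 MiB per thread
#define ITERS 16

static volatile char sink;    // keeps the copies from being optimized away

static void *copy_worker(void *arg) {
    char *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
    if (!src || !dst) return NULL;
    memset(src, 1, BUF_SIZE);
    memset(dst, 0, BUF_SIZE);
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_SIZE);
    sink = dst[BUF_SIZE - 1];
    free(src);
    free(dst);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, copy_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double total = 2.0 * BUF_SIZE * ITERS * NTHREADS / secs / 1e9;
    printf("aggregate: %.1f GB/s, per thread: %.1f GB/s\n",
           total, total / NTHREADS);
    return 0;
}
```

Compare the single-threaded number from the earlier sketch against the per-thread number here: the aggregate stays pinned near the controller's ceiling, which is exactly the split being described.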
The point I was originally responding to was, "Unified memory is not a good thing, makes the CPU and GPU fight for access." There is no evidence here that this is a significant issue for the M1.
So even if the memory bandwidth is split under load, there's no evidence here that the split actually causes a problem in practice.