What if you split up the training down to the literal vector math, and treated every macbook like a thread in a gpu, with just one big computer acting as the orchestrator?
You would need each MacBook to have an internet connection capable of multiple terabytes per second, with sub millisecond latency to every other MacBook.