Hacker News

While there are newer algorithms (almost all first order, btw), one of the most commonly used for deep convnets, SGD with momentum, dates from the 80s. It really is mostly about computing power: one GTX 1080 has more TFLOPS than the world's fastest supercomputer had until 2001 [0]. The actual speed difference is probably an order of magnitude larger still, given the absence of the communication overhead and latency inherent in sharing work across thousands of separate CPUs and lots of slow RAM. That would make one GTX 1080 equivalent, for neural net training purposes, to a supercomputer from around 2004.

[0] https://en.wikipedia.org/wiki/History_of_supercomputing#Hist...
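For reference, the 80s-era algorithm in question really is just a few lines. Here is a minimal NumPy sketch of SGD with (heavy-ball) momentum; the function name, hyperparameters, and toy objective are illustrative, not taken from any particular framework:

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, steps=300):
    """SGD with momentum: v <- mu*v - lr*grad(w); w <- w + v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = momentum * v - lr * grad_fn(w)
        w = w + v
    return w

# Toy example: minimize f(w) = ||w||^2, whose gradient is 2w.
# The iterates should converge toward the minimizer at the origin.
w0 = np.array([3.0, -4.0])
w_final = sgd_momentum(lambda w: 2 * w, w0)
```

The velocity term `v` accumulates past gradients, which damps oscillation across steep directions and speeds progress along shallow ones; per-step cost is the same as plain SGD, which is why it parallelizes so well on GPU hardware.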



