Is pretraining really that much of a failure? I haven't found an authoritative answer on whether it's still worth doing these days. Hinton's 2012 Coursera course (Neural Networks for Machine Learning) still spends a lot of time on generative, layer-by-layer pretraining with RBMs, but I'm not sure whether that has fallen by the wayside, or whether it's still useful only in specific circumstances.
See Saxe, McClelland, and Ganguli (2013) on deep linear nets and orthogonal initialization. Then read Li, Jiao, Han, and Weissman (2017, arXiv preprint), "Demystifying ResNet", which makes a nice argument that the benefit comes down to the conditioning of the Hessian at initialization.
tl;dr: pretraining is a good conditioner, but you can do better ab initio.
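For concreteness, here's a minimal NumPy sketch of the kind of orthogonal initialization Saxe et al. analyze; the function name and the QR-based construction are my own illustration, not code from either paper (frameworks ship equivalents, e.g. `torch.nn.init.orthogonal_` in PyTorch).

```python
import numpy as np

def orthogonal_init(fan_out, fan_in, gain=1.0, rng=None):
    """Orthogonal weight init in the spirit of Saxe et al. (2013):
    draw a Gaussian matrix, take the Q factor of its QR decomposition,
    and scale by `gain`. The result has orthonormal rows or columns,
    so the layer preserves norms and stays well-conditioned at init."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((fan_out, fan_in))
    # QR of the taller orientation gives orthonormal columns.
    q, r = np.linalg.qr(a if fan_out >= fan_in else a.T)
    # Fix the sign ambiguity of QR so the distribution isn't biased.
    q *= np.sign(np.diag(r))
    if fan_out < fan_in:
        q = q.T
    return gain * q

# Quick sanity check: the smaller Gram matrix should be (near) identity.
W = orthogonal_init(256, 512)
print(np.allclose(W @ W.T, np.eye(256), atol=1e-6))  # True
```

The point of the exercise: instead of burning compute on layer-wise pretraining just to land in a well-conditioned region, you pick an initialization whose conditioning is good by construction.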