Part of the reason this is interesting is that deep learning has to have some kind of inductive bias to perform as well as we've seen (given a learning problem, it's biased toward learning in particular ways). In general though, a neural network can approximate _any_ function, so reining in that complexity and uncovering which functions are actually learnable (or efficiently learnable) by deep learning is an important research direction. This paper says that the functions learned by deep learning (with caveats) are precisely those which are close to functions represented by a different learning technique, which is notable because this new class of functions is not "all functions" and because the characterization is explainable in some sense, giving insight into how deep learning works. That connection also winds up having a bunch of other interesting implications that somebody else can cover if they'd like.
If you look at how they actually "translate" the fancy model to the simple one, it requires fully training the original model (and keeping track of how the gradients evolve over the course of training). So it wouldn't make training more efficient, but perhaps it would be useful for inference or for probing the characteristics of the original model.
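To make that concrete, here's a rough sketch of my own (not the paper's code, and it drops the loss-weighting bookkeeping the actual construction uses): the "translated" kernel compares two inputs by summing gradient inner products along the recorded training trajectory, which is why you have to finish training and keep checkpoints first.

```python
# Simplified path-kernel-style sketch on a toy linear model (illustrative only).
import numpy as np

def model(w, x):
    return w @ x                       # toy linear model

def grad_wrt_params(w, x):
    return x                           # d(w.x)/dw = x for the toy model

def path_kernel(checkpoints, x1, x2):
    # Sum of gradient inner products along the recorded training path.
    return sum(grad_wrt_params(w, x1) @ grad_wrt_params(w, x2)
               for w in checkpoints)

# Training loop that records the parameter trajectory (the expensive part).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w, lr, checkpoints = np.zeros(4), 0.1, []
for _ in range(100):                   # many passes, as usual for gradient descent
    checkpoints.append(w.copy())
    grad = sum((model(w, xi) - yi) * grad_wrt_params(w, xi)
               for xi, yi in zip(X, y)) / len(X)
    w = w - lr * grad

print(path_kernel(checkpoints, X[0], X[1]))
```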
Anecdotally, this has always appeared to be the case when investigating the predictions of neural nets, particularly when it comes time to answer the question "what does this model not handle?"
No, they require a full pass over the support vectors, which are potentially a much smaller set. (That's part of why everyone was so excited about SVMs when they were invented.) The support vectors are the training points with nonzero dual coefficients: roughly, the points on or inside the margin, i.e., the ones sufficiently close to the decision boundary.
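In textbook form (my own minimal sketch, not anything from the paper), kernel SVM inference only sums over the support vectors, never the whole training set:

```python
# Minimal sketch: the decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b,
# where i ranges over support vectors only.  alphas, sv_X, sv_y, b come from training.
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def svm_predict(x, sv_X, sv_y, alphas, b):
    score = sum(a * yi * rbf_kernel(xi, x)
                for a, yi, xi in zip(alphas, sv_y, sv_X))
    return np.sign(score + b)
```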
Fair enough, but the number of support vectors for non-trivial problems is still pretty large (as I understand it, though I could be wrong), e.g. 20-30% of the dataset. Having to iterate over 30% of, say, ImageNet for each batch of predictions seems infeasible.
Neural nets, meanwhile, require multiple passes through the data (epochs).
If we can train a model in one epoch instead of 10,000 epochs, that's a breakthrough!
True, but it sounds like you're just shifting computation from training to inference. And I'm not sure that's a very good trade-off to make: you're likely to predict on much more data than you trained on (e.g. ranking models at Google, FB, etc.).
Not sure I get your point; both DNNs and SVMs require one forward pass for inference, so there is no difference there.
If an SVM can converge in one epoch, how is that not more efficient than the status quo with DNNs?
For kernel SVMs, one needs to keep around part of the training data (the support vectors), right? With DNNs, after training, all you need are the model parameters. For very large datasets, keeping around even a small part of your training data may not be feasible.
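A quick sklearn toy (my own illustration, toy data, nothing from the paper) of what each model has to keep around after training:

```python
# Kernel SVM stores its support vectors (grows with data); an MLP stores only
# its weight matrices (fixed by architecture).
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

svc = SVC(kernel="rbf").fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)

print("support vectors kept:", svc.support_vectors_.shape)
print("MLP parameters:",
      sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_))
```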
Furthermore, the number of parameters does not (necessarily) grow with the size of the training data; the parameters can be reused if you get more data, and they can be quantized, pruned, etc. There's not really an easy way to do these things with SVMs, as far as I understand.