Part of the reason this is interesting is that deep learning has to have some kind of inductive bias to perform as well as we've seen (given a learning problem, it's biased toward learning in particular ways). In general though, a neural network can approximate _any_ function, so reining in that complexity and uncovering which functions are actually learnable (or efficiently learnable) by deep learning is an important research direction. This paper says that the functions learned by deep learning (with caveats) are precisely those which are close to functions represented by a different learning technique, which is notable because this new class of functions is not "all functions" and because the characterization is explainable in some sense, giving insight into how deep learning works. That connection also winds up having a bunch of other interesting implications that somebody else can cover if they'd like.
If you look at how they actually "translate" the fancy model to the simple one, it requires fully training the original model (and keeping track of how the gradients evolve over the course of training). So it wouldn't make training more efficient, but perhaps it would be useful for inference or for probing the characteristics of the original model.
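To make that concrete, here's a rough sketch of my own (not the paper's code, and it drops the loss-weighting bookkeeping the actual construction uses): the "translated" kernel compares two inputs by summing gradient inner products along the recorded training trajectory, which is why you have to finish training and keep checkpoints first.

```python
# Simplified path-kernel-style sketch on a toy linear model (illustrative only).
import numpy as np

def model(w, x):
    return w @ x                       # toy linear model

def grad_wrt_params(w, x):
    return x                           # d(w.x)/dw = x for the toy model

def path_kernel(checkpoints, x1, x2):
    # Sum of gradient inner products along the recorded training path.
    return sum(grad_wrt_params(w, x1) @ grad_wrt_params(w, x2)
               for w in checkpoints)

# Training loop that records the parameter trajectory (the expensive part).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w, lr, checkpoints = np.zeros(4), 0.1, []
for _ in range(100):                   # many passes, as usual for gradient descent
    checkpoints.append(w.copy())
    grad = sum((model(w, xi) - yi) * grad_wrt_params(w, xi)
               for xi, yi in zip(X, y)) / len(X)
    w = w - lr * grad

print(path_kernel(checkpoints, X[0], X[1]))
```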
Anecdotally, this has always appeared to be the case when investigating the predictions of neural nets, particularly when it comes time to answer the question "what does this model not handle?"
No, they require a full pass over the support vectors, which are potentially a much smaller set. (That's part of why everyone was so excited about SVMs when they were invented.) The support vectors are the training points with nonzero dual coefficients: roughly, the points on or inside the margin, i.e., the ones sufficiently close to the decision boundary.
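In textbook form (my own minimal sketch, not anything from the paper), kernel SVM inference only sums over the support vectors, never the whole training set:

```python
# Minimal sketch: the decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b,
# where i ranges over support vectors only.  alphas, sv_X, sv_y, b come from training.
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def svm_predict(x, sv_X, sv_y, alphas, b):
    score = sum(a * yi * rbf_kernel(xi, x)
                for a, yi, xi in zip(alphas, sv_y, sv_X))
    return np.sign(score + b)
```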
Fair enough, but the number of support vectors for non-trivial problems is still pretty large (as I understand it, though I could be wrong), e.g. 20-30% of the dataset. Having to iterate over 30% of, say, ImageNet for each batch of predictions seems infeasible.
Neural nets, meanwhile, require multiple passes through the data (epochs).
If we can train a model in one epoch instead of 10,000 epochs, that's a breakthrough!
True, but it sounds like you're just shifting computation from training to inference. And I'm not sure that's a very good trade-off to make: you're likely to predict on much more data than you trained on (e.g. ranking models at Google, FB, etc.).
Not sure I get your point; both DNNs and SVMs require one forward pass for inference, so there is no difference there.
If an SVM can converge in one epoch, how is that not more efficient than the status quo with DNNs?
For kernel SVMs, one needs to keep around part of the training data (the support vectors), right? With DNNs, after training, all you need are the model parameters. For very large datasets, keeping around even a small part of your training data may not be feasible.
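A quick sklearn toy (my own illustration, toy data, nothing from the paper) of what each model has to keep around after training:

```python
# Kernel SVM stores its support vectors (grows with data); an MLP stores only
# its weight matrices (fixed by architecture).
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

svc = SVC(kernel="rbf").fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)

print("support vectors kept:", svc.support_vectors_.shape)
print("MLP parameters:",
      sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_))
```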
Furthermore, the number of parameters does not (necessarily) grow with the size of the training data; the parameters can be reused if you get more data, and they can be quantized, pruned, etc. There's not really an easy way to do these things with SVMs, as far as I understand.