50/50 I can shed some light or am way over my head:

The point of the paper is that if you train a deep learning model with gradient descent then the resulting model is effectively a kernel machine, regardless of model architecture.

The nice thing about a kernel machine is that it is simple (just one hidden layer), so we can analyze it more effectively and conveniently.

So, I think the contribution here isn't "these sets of universal approximators are equivalent" but rather "we have this effective but opaque deep learning thing; it turns out it's actually a kernel machine in retrospect, so we can bring 'kernel tooling' to bear when analyzing the deep learning model."
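
To make the comparison concrete, here's a toy sketch (mine, not the paper's) of what a kernel machine looks like; the paper's claim is that a gradient-descent-trained model ends up in this form, with a particular kernel derived from training rather than this made-up RBF one:

    import numpy as np

    def rbf_kernel(x, x_i, gamma=1.0):
        # Similarity between the query point x and a stored training point x_i.
        return np.exp(-gamma * np.sum((x - x_i) ** 2))

    def kernel_machine(x, train_X, a, b, kernel=rbf_kernel):
        # y(x) = sum_i a_i * K(x, x_i) + b
        # The form the paper says gradient-descent-trained models approximate,
        # with a "path kernel" standing in for this toy RBF kernel.
        return sum(a_i * kernel(x, x_i) for a_i, x_i in zip(a, train_X)) + b

Everything the machine "knows" lives in the kernel K and the per-training-point coefficients a_i, which is what makes it comparatively easy to analyze.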




Does this essentially mean that any multi-layer RNN can be reasonably approximated by a 1-layer network (something like a perceptron) for "playback" purposes, that is, for recognition / transformation, not learning?

This may have colossal practical implications, as long as the approximation stays good enough.


Hmmm, I think that's not precise and my use of "architecture" was misleading.

If we're thinking in terms of "universal approximators", an RNN is a way to make a sequence of approximate functions for a sequence of inputs.

But it's still a sequence of functions, not a single function.

For a 1-layer network to have the same ability as an RNN (taking an unbounded amount of context), it would need infinite width, which is a no-go.
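
A toy illustration of the difference (my own sketch, with made-up shapes): an RNN reuses one cell across timesteps, so it handles any sequence length, while a single layer has to commit to a fixed input width up front.

    import numpy as np

    rng = np.random.default_rng(0)
    W_h = rng.normal(size=(8, 8))   # recurrent weights, reused every step
    W_x = rng.normal(size=(8, 3))   # input weights, reused every step

    def rnn(seq):
        # The same cell is applied at every timestep, so sequences of any
        # length are fine; the hidden state h carries the context forward.
        h = np.zeros(8)
        for x_t in seq:
            h = np.tanh(W_h @ h + W_x @ x_t)
        return h

    W_dense = rng.normal(size=(8, 5 * 3))

    def one_layer(seq):
        # A single dense layer sees a flattened input of fixed size:
        # exactly 5 timesteps of 3 features here, no more and no less.
        return np.tanh(W_dense @ np.concatenate(seq))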


I would be skeptical about thinking of networks this way without empirically verifying it yourself.

The only useful trick I've found like that is that a stack of linear layers with no activation function in between is equivalent to a single linear layer. Sometimes it enables some clever optimizations on TPUs, since you want one of the dimensions to be a multiple of 128. (I haven't actually used that trick, but it's in my back pocket.)
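
(Concretely, that collapse is just associativity of matrix multiplication; a rough sketch with arbitrary shapes:)

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(256, 512))
    W2 = rng.normal(size=(128, 256))
    x = rng.normal(size=512)

    # Two stacked linear layers with no activation in between...
    stacked = W2 @ (W1 @ x)

    # ...compute the same map as one layer whose weight matrix is the product.
    collapsed = (W2 @ W1) @ x

    assert np.allclose(stacked, collapsed)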

But thinking of an entire model as a single layer seems strange. A single layer has to have some kind of meaning. To me, it means “a linear mapping followed by a nonlinear activation function.” So is the claim that there exists a sufficiently complicated activation function that approximates any given model? Because that sounds an awful lot like the activation function itself might be “the model”. Except that makes no sense, because activation functions don’t use model weights; the linear multiply before the activation does that.
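
(For concreteness, the "single layer" I have in mind is just the sketch below: all the learned weights sit in the linear part, and the activation is a fixed function with no parameters of its own.)

    import numpy as np

    def single_layer(x, W, b):
        # Learned linear map (W, b), then a fixed elementwise nonlinearity.
        return np.maximum(0.0, W @ x + b)   # ReLU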

So it quickly takes me in circles. I don’t have a good intuition for models yet though.


Wouldn't this one-layer network be a lot less "compressive" than the multi-layer net, and in some sense "duplicate" subnetworks in earlier layers?


Does that mean that, in theory, we can uncover an underlying model, effectively a theorem, that the network is approximating, and so remove some of the uncertainty?


If I understand the paper (which is questionable), that's what the author is aiming for. E.g. he's saying: 1) we can make these amazing black boxes; 2) we don't really understand them; 3) but when we make them with gradient descent, they end up being almost kernel machines; 4) we know a lot about kernel machines, so we can use that to "remove some uncertainty".
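
If I'm reading it right, the kernel in question is the "path kernel": the usual tangent kernel, integrated along the trajectory c(t) that gradient descent traces through weight space. In rough notation (my paraphrase, not a quote from the paper):

    K^f_w(x, x')  = \nabla_w f_w(x) \cdot \nabla_w f_w(x')   % tangent kernel at weights w
    K^f_c(x, x')  = \int_{c(t)} K^f_{w(t)}(x, x') \, dt      % path kernel along the training trajectory
    f_w(x)       \approx \sum_i a_i \, K(x, x_i) + b         % "almost a kernel machine"

where, as I understand it, the a_i come from the loss derivatives accumulated along the path, b is the initial model's prediction, and the approximation becomes exact only in the gradient-flow (infinitesimal step size) limit.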



