50/50 I can shed some light or am way over my head:

The point of the paper is that if you train a deep learning model with gradient descent then the resulting model is effectively a kernel machine, regardless of model architecture.

The nice thing about a kernel machine is that it is simple (just one hidden layer), so we can analyze it more effectively and conveniently.

So, I think the contribution here isn't "these sets of universal approximators are equivalent" but rather "we have this effective but opaque deep learning thing; it turns out it's actually a kernel machine in retrospect, so we can bring 'kernel tooling' to bear when analyzing the deep learning model."
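
To make the comparison concrete, here's a toy sketch (mine, not the paper's) of what a kernel machine looks like; the paper's claim is that a gradient-descent-trained model ends up in this form, with a particular kernel derived from training rather than this made-up RBF one:

    import numpy as np

    def rbf_kernel(x, x_i, gamma=1.0):
        # Similarity between the query point x and a stored training point x_i.
        return np.exp(-gamma * np.sum((x - x_i) ** 2))

    def kernel_machine(x, train_X, a, b, kernel=rbf_kernel):
        # y(x) = sum_i a_i * K(x, x_i) + b
        # The form the paper says gradient-descent-trained models approximate,
        # with a "path kernel" standing in for this toy RBF kernel.
        return sum(a_i * kernel(x, x_i) for a_i, x_i in zip(a, train_X)) + b

Everything the machine "knows" lives in the kernel K and the per-training-point coefficients a_i, which is what makes it comparatively easy to analyze.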




Does this essentially mean that any multi-layer RNN can be reasonably approximated by a 1-layer network (something like a perceptron) for "playback" purposes, that is, for recognition / transformation, not learning?

This may have colossal practical implications, as long as the approximation stays good enough.


Hmmm, I think that's not precise and my use of "architecture" was misleading.

If we're thinking in terms of "universal approximators", an RNN is a way to make a sequence of approximate functions for a sequence of inputs.

But it's still a sequence of functions, not a single function.

For a 1-layer network to have the same ability as an RNN (taking an unbounded amount of context), it would need infinite width, which is a no-go.
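
A toy illustration of the difference (my own sketch, with made-up shapes): an RNN reuses one cell across timesteps, so it handles any sequence length, while a single layer has to commit to a fixed input width up front.

    import numpy as np

    rng = np.random.default_rng(0)
    W_h = rng.normal(size=(8, 8))   # recurrent weights, reused every step
    W_x = rng.normal(size=(8, 3))   # input weights, reused every step

    def rnn(seq):
        # The same cell is applied at every timestep, so sequences of any
        # length are fine; the hidden state h carries the context forward.
        h = np.zeros(8)
        for x_t in seq:
            h = np.tanh(W_h @ h + W_x @ x_t)
        return h

    W_dense = rng.normal(size=(8, 5 * 3))

    def one_layer(seq):
        # A single dense layer sees a flattened input of fixed size:
        # exactly 5 timesteps of 3 features here, no more and no less.
        return np.tanh(W_dense @ np.concatenate(seq))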


I would be skeptical about thinking of networks this way without empirically verifying it yourself.

The only useful trick I've found like that is that a stack of linear layers with no activation function in between is equivalent to a single linear layer. Sometimes it enables some clever optimizations on TPUs, since you want one of the dimensions to be a multiple of 128. (I haven't actually used that trick, but it's in my back pocket.)
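
(Concretely, that collapse is just associativity of matrix multiplication; a rough sketch with arbitrary shapes:)

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(256, 512))
    W2 = rng.normal(size=(128, 256))
    x = rng.normal(size=512)

    # Two stacked linear layers with no activation in between...
    stacked = W2 @ (W1 @ x)

    # ...compute the same map as one layer whose weight matrix is the product.
    collapsed = (W2 @ W1) @ x

    assert np.allclose(stacked, collapsed)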

But thinking of an entire model as a single layer seems strange. A single layer has to have some kind of meaning. To me, it means “a linear mapping followed by a nonlinear activation function.” So is the claim that there exists a sufficiently complicated activation function that approximates any given model? Because that sounds an awful lot like the activation function itself might be “the model”. Except that makes no sense, because activation functions don’t use model weights; the linear multiply before the activation does that.
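
(For concreteness, the "single layer" I have in mind is just the sketch below: all the learned weights sit in the linear part, and the activation is a fixed function with no parameters of its own.)

    import numpy as np

    def single_layer(x, W, b):
        # Learned linear map (W, b), then a fixed elementwise nonlinearity.
        return np.maximum(0.0, W @ x + b)   # ReLU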

So it quickly takes me in circles. I don’t have a good intuition for models yet though.


Wouldn't this one-layer network be a lot less "compressive" than the multi-layer net, and in some sense "duplicate" subnetworks in earlier layers?


Does that mean that, in theory, we can uncover an underlying model, effectively a theorem, that the network is approximating, and so remove some of the uncertainty?


If I understand the paper (which is questionable), that's what the author is aiming for. E.g. he's saying: 1) we can make these amazing black boxes; 2) we don't really understand them; 3) but when we make them with gradient descent, they end up being almost kernel machines; 4) we know a lot about kernel machines, so we can use that to "remove some uncertainty".
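
If I'm reading it right, the kernel in question is the "path kernel": the usual tangent kernel, integrated along the trajectory c(t) that gradient descent traces through weight space. In rough notation (my paraphrase, not a quote from the paper):

    K^f_w(x, x')  = \nabla_w f_w(x) \cdot \nabla_w f_w(x')   % tangent kernel at weights w
    K^f_c(x, x')  = \int_{c(t)} K^f_{w(t)}(x, x') \, dt      % path kernel along the training trajectory
    f_w(x)       \approx \sum_i a_i \, K(x, x_i) + b         % "almost a kernel machine"

where, as I understand it, the a_i come from the loss derivatives accumulated along the path, b is the initial model's prediction, and the approximation becomes exact only in the gradient-flow (infinitesimal step size) limit.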



