Gated Linear Networks

fxtentacle · on June 15, 2020

That is an amazing paper and a great result, and new neural architectures are long overdue.

But I don't believe that this has any significance in practice.

GPU memory is the limiting factor for most current AI approaches. And that's where the typical convolutional architectures shine, because they effectively compress the input data, then work on the compressed representation, then decompress the results. With gated linear networks, I'm required to always work on the full input data, because it's a one-step prediction. As a result, I'll run out of GPU memory before I reach a learning capacity that is comparable to conv nets.

friendly_aixi · on June 15, 2020

Convolution is a linear operation; in the case of images, you can view it as a multiplication with a doubly block circulant matrix. I can't see any barriers to hybrid approaches here, though it seems difficult to avoid using backpropagation for credit assignment within the convolutional layers.
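To make the "convolution is a matrix multiplication" point concrete, here is a minimal NumPy sketch for the 1D case with circular boundary conditions. The helper names `circulant_matrix` and `circular_conv` are made up for this illustration. The 2D image case mentioned above applies the same idea, with a doubly block circulant matrix in place of a plain circulant one.

```python
import numpy as np

def circulant_matrix(kernel, n):
    """Build the n x n circulant matrix C such that C @ x equals the
    circular convolution of x with `kernel` (zero-padded to length n)."""
    k = np.zeros(n)
    k[:len(kernel)] = kernel
    # Column j is the padded kernel cyclically shifted down by j positions,
    # so C[i, j] = k[(i - j) mod n].
    return np.stack([np.roll(k, j) for j in range(n)], axis=1)

def circular_conv(x, kernel):
    """Reference circular convolution computed via the FFT."""
    n = len(x)
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel, n)))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, -1.0, 0.5])

C = circulant_matrix(kernel, len(x))
via_matrix = C @ x                     # convolution as a plain linear map
via_fft = circular_conv(x, kernel)     # same result from the FFT route

print(np.allclose(via_matrix, via_fft))
```

Because the whole operation is a single matrix, a convolutional layer (before its nonlinearity) fits into any framework that composes linear maps.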

Re: significance, how about their application in regression: https://arxiv.org/abs/2006.05964 ? Or in contextual bandits: https://arxiv.org/abs/2002.11611 ?

Disclaimer: I am one of the authors (Joel).

fxtentacle · on June 16, 2020

Wow, that G-GLN paper is really new. Thanks for sharing it :)

My first impression is that this will be very challenging to use for many of the people currently using AI in practice, because of the requirement to have a convex and gaussian-distributed result.

For example, I currently work on optical flow and the loss functions are usually very jumpy and usually only convex within a few pixels around the correct result. I have seen plenty of modeling errors in optical flow SOTA papers, for example casting a boolean occlusion term so that it can be added to the loss (Won't work, no gradient). I have also seen how strongly authors struggle with the irregularities of their loss function and issues with convergence, for example by fixing all random initialization to a seed and then parameter-scanning on that value (Very expensive).

Notable examples of the difficulties faced would be https://arxiv.org/abs/1904.04998 which I have never seen converge from random initialization or https://arxiv.org/abs/1711.07837 which diverges without supervised pre-training.

Also, while many people use the Gaussian-distributed Euclidean norm of the difference between prediction and ground truth as their main loss term, there has been a lot of discussion recently about whether that is a good idea, because it forces the neural network to represent uncertainty with blur. But optical flow tends to have sharp edges where objects end.

Combined, that means the problems that I work on usually do not have a gaussian-distributed loss and usually are irregular and never convex, so I'm not sure if I could use G-GLN.

But the application to contextual bandits looks VERY interesting to me :)

I see great potential in using conv layers as pre-compression and then applying decision trees on the resulting intermediate representation.

What I previously did for object segmentation was to sample a random but fixed set of feature pair differences and use the signs as bit flags. I then trained the decision trees on those bits to predict the boundary shape around that pixel. I got the general idea from this paper: https://arxiv.org/abs/1406.5549
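A rough sketch of the "signs of feature pair differences as bit flags" idea, assuming the features are rows of a NumPy array. The function names and array sizes are hypothetical, and the decision-tree training step on top of the bits is left out.

```python
import numpy as np

def make_pair_sampler(num_features, num_pairs, seed=0):
    """Sample a random but fixed set of feature index pairs (i, j)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, num_features, size=num_pairs)
    j = rng.integers(0, num_features, size=num_pairs)
    return i, j

def difference_bits(features, pairs):
    """Turn each feature vector into bit flags: bit k is 1 exactly when
    feature i_k is larger than feature j_k."""
    i, j = pairs
    return (features[..., i] - features[..., j] > 0).astype(np.uint8)

# Toy data standing in for per-pixel feature vectors (assumed shapes).
pairs = make_pair_sampler(num_features=16, num_pairs=32, seed=42)
features = np.random.default_rng(1).normal(size=(10, 16))
bits = difference_bits(features, pairs)

print(bits.shape)  # one row of 32 bit flags per feature vector
```

Since only the sign of each difference is kept, the bits do not change when all features are rescaled by the same positive factor, which is part of what makes them convenient inputs for tree-based learners.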

But it sounds like moving from difference bits to halfspaces and from linear regression to GLN, your paper "Online Learning in Contextual Bandits using Gated Linear Networks" could greatly improve on that. I'd be curious to see how those bandits do on BSDS500.

BTW, are you aware of any discussion groups for AI image processing that are open to members of the public?

https://www.reddit.com/r/deeplearning/ seems to be mostly people that just took a Udemy / Coursera course, so it's usually about re-using existing models and almost no talk about research.

helges · on June 16, 2020

> BTW, are you aware of any discussion groups for AI image processing that are open to members of the public?

r/machinelearning is the most suitable subreddit for this, since it is actively frequented by researchers and the quality of discussion is much higher than in r/deeplearning

fxtentacle · on June 16, 2020

Thanks :)

0-_-0 · on June 15, 2020

I like the choice of user name

BrokrnAlgorithm · on June 15, 2020

What about findings w.r.t. online learning? I find the continual-learning quality of algorithms to be a topic that often seems to be treated as a side concern, although it carries a lot of relevance in applied settings.

fxtentacle · on June 15, 2020

I believe that to be a red herring. Their approach cannot learn any features that provide a lower-dimensional approximation of the input data. As a result, there is no intermediate representation which could change and thereby negatively affect previously learned classifiers.

But if I train 10 independent traditional networks, I also won't have newly learned data affect old performance. So in effect they give up the possibility to do transfer learning in exchange for avoiding the disadvantages of transfer learning. But that's a bad tradeoff.

With their approach you always train from scratch, which brings with it the need for huge training data sets.

So I can train a bird classifier on the traditional architecture with 500 labeled images and a pretrained resnet. Or I use a million bird images and this approach.

friendly_aixi · on June 15, 2020

1) Indeed, GLNs don't learn features... but I would claim they do learn some notion of an intermediate representation, it's just different from the DL mainstream -- in particular it's closely related to the inverse Radon transform in medical imaging.

2) Inputs which are similar in terms of cosine similarity will map to similar (data dependent) products of weight matrices, and thus behave similarly, which of course can affect performance in both good and bad ways. With the results we show on permuted MNIST, it's, well... just not particularly likely that they will interfere. This is a good thing -- why should completely different data distributions interfere with one another? The point is that the method is resilient to catastrophic forgetting when the cosine similarity between data items from different tasks is small. This highlights the different kind of inductive bias a halfspace-gated GLN has compared to a deep ReLU network.

3) Re bird example, that's slightly unfair. I am sure one could easily make use of the pre-trained resnet to provide informative features to a GLN -- it's early days for this method, hybrid systems haven't been investigated, so I don't know whether it would work better than current SOTA methods for image classification. But I would be pretty confident that some simple combination would work better than chopping the head off a pretrained network and fitting an SVM on top. This is all speculation on my part though. :)
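To illustrate point 2) above, here is a simplified sketch of a single halfspace-gated neuron, loosely following the description in the GLN paper. It is not the authors' implementation, and the class name, hyperparameters and initialisation are all assumptions. The side information `z` selects one weight vector via random halfspaces, the output is a sigmoid of the weighted input logits, and only the selected weight vector receives an online gradient step on the log loss.

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log1p(-p)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class HalfspaceGatedNeuron:
    """Hypothetical single GLN-style neuron with halfspace gating."""

    def __init__(self, input_dim, side_dim, num_halfspaces=4,
                 lr=0.1, clip=5.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random, fixed hyperplanes that partition the side-information space.
        self.normals = rng.normal(size=(num_halfspaces, side_dim))
        self.offsets = rng.normal(size=num_halfspaces)
        # One weight vector per region (2**num_halfspaces regions).
        self.weights = np.full((2 ** num_halfspaces, input_dim), 1.0 / input_dim)
        self.lr = lr
        self.clip = clip

    def context(self, z):
        """Index of the region that z falls into."""
        bits = (self.normals @ z > self.offsets).astype(int)
        return int(bits @ (1 << np.arange(len(bits))))

    def predict(self, p, z):
        return sigmoid(self.weights[self.context(z)] @ logit(p))

    def update(self, p, z, target):
        """One online gradient step on the log loss, active weights only."""
        c = self.context(z)
        x = logit(p)
        pred = sigmoid(self.weights[c] @ x)
        grad = (pred - target) * x
        # Keep weights inside a bounded convex set (a hypercube here).
        self.weights[c] = np.clip(self.weights[c] - self.lr * grad,
                                  -self.clip, self.clip)
        return pred

neuron = HalfspaceGatedNeuron(input_dim=3, side_dim=2)
p = np.array([0.6, 0.7, 0.8])      # input probabilities from a previous layer
z_a = np.array([100.0, 100.0])     # side information for "task A"
z_b = -z_a                         # very dissimilar side information, "task B"

initial_pred = neuron.predict(p, z_a)
weights_b_before = neuron.weights[neuron.context(z_b)].copy()
for _ in range(50):
    neuron.update(p, z_a, target=0.0)
```

In this toy run, training on `z_a` changes the prediction for `z_a` but leaves the weights used for `z_b` untouched, which is the mechanism behind the resilience to forgetting described above: dissimilar inputs tend to land in different regions and therefore update different weights.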

fxtentacle · on June 16, 2020

2) The problem that I would expect with a hybrid method is that conv features are usually trained to be redundant with dropout, so they should be highly correlated with each other and, thus, have a high cosine similarity.

3) I agree that my argument is scientifically unfair. I was trying to argue from the perspective of a prospective user. My customers tend to have a budget limit on how much their classifier is allowed to cost. Training from scratch would be too expensive. But a chopped resnet with some conv layers will work OK and be cheap enough.

So for me the user, the ecosystem around your architecture and the availability of pretrained models might make the critical difference on whether I'll use it or not.

BrokrnAlgorithm · on June 15, 2020

Good point as well - sometimes it's not about dimensionality reduction but more about persistent representation. Having this geared towards highly non-stationary environments is a nice thing to have.

BrokrnAlgorithm · on June 15, 2020

Still, there are a lot of domains where transfer learning is not the most applicable setting - I'm thinking of highly noisy and non-stationary settings such as finance. In some of these domains, especially time series, lack of data is often not the issue, e.g. high-frequency datasets.

Having models constantly re-train as the default setting is essentially what a rolling regression would do - having a rolling regression that doesn't catastrophically forget would be quite valuable.

xiaodai · on June 16, 2020

What stops you from using a few convolution layers and then using the GLN in the last layer? You achieve the best of both worlds if your theory is true.

fxtentacle · on June 16, 2020

That would work, but it would likely re-introduce the catastrophic forgetting problem, because now the GLN is dependent on an intermediate representation determined by conv layers.

Immortal333 · on June 15, 2020

"We show that this architecture gives rise to universal learning capabilities in the limit, with effective model capacity increasing as a function of network size in a manner comparable with deep ReLU networks."

What exactly does this statement mean?

janosett · on June 15, 2020

> "We show that this architecture"

They are demonstrating a new technique, Gated Linear Networks.

> gives rise to universal learning capabilities in the limit

they claim to show that with an unbounded amount of time and memory (network size / # params) this architecture can be used to learn/approximate any function

> with effective model capacity increasing as a function of network size

Model capacity here refers to the ability to memorize a mapping between inputs and outputs. They show that a network with more layers/weights will "memorize" more.

> in a manner comparable with deep ReLU networks

"Deep ReLU networks" are referring to commonly used modern deep neural network architectures. ReLU is a popular activation function: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)

fxtentacle · on June 15, 2020

They mean that if you add parameters, the learning capability of their approach grows by a similar amount as if you added the same number of parameters to a conv+ReLU network (the standard approach).

That "universal" is a weird claim in my opinion, but they mean that with enough parameters, this architecture can learn everything.

Immortal333 · on June 15, 2020

I was able to get the second part of the statement, but I haven't seen "in limit" used in a statement like this before.

Yes, universal approximation is a strong claim. NNs have been proven theoretically to be universal approximators.

friendly_aixi · on June 15, 2020

The result here is stronger, in the sense that typical NN universality results are statements with respect to just capacity (and not how you optimise them). Here, the result holds with respect to both capacity + a choice of suitable no regret online convex optimisation algorithm (e.g. online gradient descent). Of course, this is just one desirable property of a general purpose learning algorithm.

currymj · on June 16, 2020

Do you know of any other kinds of universal function approximators that also have a good regret bound like this, or is yours the first one?

global convergence to any arbitrarily weird function at rate O(sqrt(T)) seems amazing, almost too good to be true, and I’m wondering what the catch is. Maybe it’s just a moderately nice property but not extraordinary? Maybe there are some horrible constants hiding in there?

friendly_aixi · on June 16, 2020

1) It's definitely not the first. Other methods have universal guarantees of some form or other with well quantified rates of convergence, e.g. k-NN would be the best known example.

2) There are some restrictions on the class of density functions it can model, so arbitrarily weird is a bit strong, but the model class is very general.

3) The weights needed to model any function in this class although finite, can be arbitrarily large. The regret of a single neuron has a dependence on the diameter of the convex set your weights reside in, so there is a nasty constant of sorts in there, and this will also carry over when you analyse the regret of a complete network. With such a general statement, it's unavoidable sadly.

4) The universality result on its own is just a nice property. See it as the first stepping stone to a more meaningful analysis. What you really want is for the model class to grow as you add more neurons, using weights within a realistic range, and that the method performs well in practice on some problems people care about -- we provide empirical evidence that the capacity grows on par with deep relu networks with our capacity experiments, and show a bunch of results where the method works, but we don't have a theoretical characterisation of the class of density functions it can model well (i.e. if the function has some nice structural property, then a network of reasonable size is guaranteed to learn a good approximation quickly). Such a result would be extraordinary in my eyes. Because the network is composed of simple and well understood building blocks, I am optimistic that such an analysis will be possible in the future.
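As background for point 3), this is the textbook regret bound for projected online gradient descent on convex losses. It is not quoted from the GLN paper. The symbols are assumptions for this illustration: $D$ is the diameter of the convex weight set $\mathcal{W}$, $G$ bounds the gradient norms, $\eta$ is a fixed step size and $T$ is the number of rounds.

```latex
\mathrm{Regret}_T
  = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(w)
  \le \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}
  = D G \sqrt{T}
  \quad \text{for } \eta = \frac{D}{G\sqrt{T}} .
```

The $\sqrt{T}$ rate is independent of the target function, but the constant scales with $D$, so a weight set large enough to contain arbitrarily large weights makes the bound correspondingly loose.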

currymj · on June 16, 2020

thank you very much for the detailed response!

much to chew on here, it really does seem like a very interesting class of models. from the papers it sounds like in practice clipping weights to a small set works okay, so the constant factors shouldn't be too bad.

i may have to sit down and try to implement these...

contravariant · on June 16, 2020

Universal approximation doesn't seem that strong a claim at all. You could use simple polynomials to achieve just that (to some extent ReLUs are polynomials as well, provided you use 0/1 or integer weights, but that's probably beside the point).

It would be much weirder for a function with lots of parameters to not have universal approximation in some sense, as it would imply that you lose some degrees of freedom somewhere.
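A small numerical illustration of the polynomial remark, assuming a least-squares Chebyshev fit as the approximation method and `|x|` on [-1, 1] as an arbitrary non-smooth target (both are choices made for this sketch only). Raising the degree shrinks the worst-case error on the grid.

```python
import numpy as np
from numpy.polynomial import Chebyshev

x = np.linspace(-1.0, 1.0, 2001)
target = np.abs(x)  # continuous, but with a kink at 0

max_errors = {}
for degree in (2, 8, 32):
    # Least-squares polynomial fit in the Chebyshev basis.
    approx = Chebyshev.fit(x, target, degree)
    max_errors[degree] = float(np.max(np.abs(approx(x) - target)))

for degree, err in max_errors.items():
    print(degree, err)
```

This only shows that capacity grows with the number of parameters. It says nothing about how easy the parameters are to find by online optimisation, which is the part the regret bound addresses.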

jacksnipe · on June 15, 2020

"In limit" is just shorthand for "as N tends to infinity" (not necessarily N, but you get the idea).

fxtentacle · on June 15, 2020

"in limit" here means as you approach the edge case of having an unlimited number of parameters.

T-A · on June 15, 2020

Presumably (haven't read the paper yet) that their network provably becomes a universal function approximator in the limit of infinite size.

Reading... actually the proof seems to be in

https://arxiv.org/abs/1712.01897

jawarner · on June 15, 2020

As the network size increases, it can learn more complex functions. When the network gets bigger and bigger, it gets closer to being able to learn any arbitrary function.

nl · on June 16, 2020

I didn't realise Hutter was on leave from ANU at DeepMind.

caretak3r · on June 15, 2020

As a relative neophyte in this realm, this is fascinating to read. Comparing this to the models/methods used to derive said properties is good education for me.