
What about findings w.r.t. online learning? I find the continual-learning behaviour of algorithms to be a topic that often seems to be more of a side concern, although it carries a lot of relevance in applied settings.



I believe that to be a red herring. Their approach cannot learn any features that provide a lower-dimensional approximation of the input data. As a result, there is no intermediate representation that could change and thereby negatively affect previously learned classifiers.

But if I train 10 independent traditional networks, I also won't have newly learned data affect old performance. So in effect they give up the possibility of transfer learning in exchange for avoiding its disadvantages, and that's a bad tradeoff.

With their approach you always train from scratch, which brings with it the need for huge training data sets.

So I can train a bird classifier on the traditional architecture with 500 labeled images and a pretrained ResNet, or I use a million bird images and this approach.


1) Indeed, GLNs don't learn features... but I would claim they do learn some notion of an intermediate representation, it's just different from the DL mainstream -- in particular, it's closely related to the inverse Radon transform in medical imaging.

2) Inputs which are similar in terms of cosine similarity will map to similar (data-dependent) products of weight matrices, and thus behave similarly, which of course can affect performance in both good and bad ways. With the results we show on permuted MNIST, it's just not particularly likely that they will interfere. This is a good thing -- why should completely different data distributions interfere with one another? The point is that the method is resilient to catastrophic forgetting when the cosine similarity between data items from different tasks is small. This highlights the different kind of inductive bias a halfspace-gated GLN has compared to a deep ReLU network. (A toy sketch of the gating geometry follows this comment.)

3) Re the bird example, that's slightly unfair. I am sure one could easily make use of the pretrained ResNet to provide informative features to a GLN -- it's early days for this method, hybrid systems haven't been investigated, so I don't know whether it would work better than current SOTA methods for image classification. But I would be pretty confident that some simple combination would work better than chopping the head off a pretrained network and fitting an SVM on top. This is all speculation on my part though. :)
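
To make the gating point in 2) concrete, here is a toy sketch of halfspace gating (hypothetical variable names, not the paper's code): each neuron selects its weight vector by which side of a few fixed random hyperplanes the input falls on, so inputs with low cosine similarity tend to land in different regions and update disjoint weights. A real GLN does geometric mixing of probabilities with online convex updates; this sketch only shows the gating geometry.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 16, 4                     # input dimension, number of hyperplanes
    H = rng.standard_normal((k, d))  # fixed gating hyperplanes, never trained
    W = np.zeros((2 ** k, d))        # one weight vector per gating region

    def region(x):
        bits = (H @ x > 0).astype(int)          # side of each hyperplane
        return int(bits @ (2 ** np.arange(k)))  # region index in [0, 2^k)

    def predict(x):
        return W[region(x)] @ x  # only the gated weights fire

    def update(x, target, lr=0.1):
        r = region(x)  # learning touches a single region's weights,
        W[r] += lr * (target - W[r] @ x) * x
        # so inputs with low cosine similarity, which tend to land in
        # different regions, don't overwrite each other's weights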


2) The problem that I would expect with a hybrid method is that conv features are usually trained to be redundant via dropout, so they should be highly correlated with each other and thus have a high cosine similarity (a quick empirical check is sketched at the end of this comment).

3) I agree that my argument is scientifically unfair. I was trying to argue from the perspective of a prospective user. My customers tend to have a budget limit on how much their classifier is allowed to cost. Training from scratch would be too expensive, but a chopped ResNet with some conv layers on top will work OK and be cheap enough.

So for me, the user, the ecosystem around your architecture and the availability of pretrained models might make the critical difference in whether I'll use it or not.
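
Re the concern in 2), that is cheap to test empirically. A sketch assuming torchvision and numpy, with `images` as a placeholder (N, 3, 224, 224) tensor: extract frozen conv features and look at the pairwise cosine similarity between different inputs.

    import torch
    import numpy as np
    from torchvision import models

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()  # chop off the classification head
    backbone.eval()

    with torch.no_grad():
        feats = backbone(images).numpy()  # one feature row per input

    # normalize rows; pairwise dot products are then cosine similarities
    X = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cos = X @ X.T
    off = cos[~np.eye(len(cos), dtype=bool)]
    print(off.mean(), off.max())  # values near 1.0 would mean different
                                  # inputs share gating regions and interfere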


Good point as well - sometimes it's not about dimensionality reduction but about a persistent representation; having this geared towards highly non-stationary environments is a nice thing to have.


Still, there are a lot of domains where transfer learning is not the most applicable setting - I'm thinking of highly noisy and non-stationary settings such as finance. In some of these domains, especially time series, lack of data is often not the issue, e.g. with high-frequency datasets.

Having models constantly retrain as the default setting is essentially what a rolling regression does - a rolling regression that doesn't catastrophically forget would be quite valuable.
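
For concreteness, the rolling-regression default looks roughly like this (a minimal numpy sketch; `X` and `y` are placeholders for a time-ordered design matrix and targets). Everything older than the window is forgotten by construction, which is exactly what a forgetting-resistant online learner would improve on.

    import numpy as np

    window = 250  # e.g. roughly one year of daily observations
    preds = []
    for t in range(window, len(y)):
        Xw, yw = X[t - window:t], y[t - window:t]
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)  # refit from scratch
        preds.append(X[t] @ beta)  # data before the window has no influence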



