Variational Autoencoders Explained (anotherdatum.com)
96 points by lainon on Sept 15, 2018 | 16 comments



Two things I believe to be true...

1) Auto-encoders are overplayed, mostly because they're a pretty easy intro ML project. There was a brief moment (before ResNets and batch normalization showed up) when they were useful for bootstrapping a representation, but they aren't serving a terribly concrete purpose now that it's so much easier to get end-to-end deeper pipelines running and learning their own representations. The common criticism is that a representation that's great for reconstruction may still not be super useful for classification (or whatever your real goal is). And, compared to 'real' engineering, autoencoders do a pretty crap job at data compression.

2) That said, /variational/ auto-encoders are doing some interesting things. There was a nice paper [pdf: https://arxiv.org/pdf/1612.00410.pdf ] using variational methods to try to take advantage of the (possibly chimerical!) information bottleneck of Tishby. And I think the general idea of being able to have middle-of-the-stack loss may still have some use for helping models generalize; loss based entirely on reconstruction error seems a bit too constrictive, though.


Autoencoders lead to other ideas. Understanding AEs and VAEs helps you understand GANs, CycleGAN, PixelCNN, Pix2pix, and so on. AEs teach that you can do other things with neural networks, apart from supervised learning. VAEs teach that neural networks can parametrize more complex distributions.


Another recent work that takes advantage of VAEs is World Models: https://worldmodels.github.io/

It did have an issue where the VAE reconstructed unnecessary features, since the training did not take feature saliency into account.


> Auto-encoders are overplayed, mostly because they're a pretty easy intro ML project.

I think you mean "normal" autoencoders, like denoising autoencoders or plain identity autoencoders, which are used for feature learning. Note that variational autoencoders are not really autoencoders in that sense. They are called “autoencoders” only because the training objective that falls out of the probabilistic setup has an encoder and a decoder and resembles a traditional autoencoder.

Traditional autoencoders are the common intro projects used for representation learning and to bootstrap other networks, not variational autoencoders.


I don't think anyone pro-level would think of your point 1) as controversial or anything of the sort. This is a really well accepted, mainstream line of thought.


> auto-encoders are overplayed,

What are you using for unsupervised learning instead of autoencoders?


Triplet loss is a good alternative, and it plays nicely with large, partially labelled data sets. Here's a random blog post: https://omoindrot.github.io/triplet-loss


It's not unsupervised learning.


You can run triplet learning as unsupervised representation learning. Let aug(X) be an augmentation of X (noise, translation, etc.), and form the triplet (X, aug(X), Y), where Y is just some other sample. Let f be your neural network (or whatevs). Then reduce the distance d(f(X), f(aug(X))) while increasing d(f(X), f(Y)). No labels needed.
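
A rough sketch of that recipe, assuming PyTorch (f, aug and the margin here are just stand-ins, not from any particular paper):

    import torch
    import torch.nn.functional as F

    def aug(x):
        # stand-in augmentation: just a little Gaussian noise
        return x + 0.1 * torch.randn_like(x)

    def unsup_triplet_loss(f, x, y, margin=1.0):
        # x: a batch of samples, y: a batch of other samples (the negatives)
        anchor, positive, negative = f(x), f(aug(x)), f(y)
        d_pos = (anchor - positive).pow(2).sum(dim=1)
        d_neg = (anchor - negative).pow(2).sum(dim=1)
        # pull (X, aug(X)) together, push (X, Y) apart, up to a margin
        return F.relu(d_pos - d_neg + margin).mean()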


Has anyone done that and what is the advantage over modern autoencoders?

I would really like to know, because I'm currently working with autoencoders. Relying only on synthetic samples can introduce all kinds of bias into the denoising and require endless tweaking.

When you tie the weights and convolution kernels in the decoder to those of the encoder, you get relatively fast learning without an excess number of variables, and the encoder-decoder can be much deeper.
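
For the fully-connected case the tying is just a transposed weight matrix; a minimal PyTorch sketch (the conv case is analogous, reusing the encoder kernels in a transposed convolution):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TiedLinearAE(nn.Module):
        def __init__(self, n_in, n_hidden):
            super().__init__()
            self.enc = nn.Linear(n_in, n_hidden)
            self.dec_bias = nn.Parameter(torch.zeros(n_in))  # only extra decoder parameter

        def forward(self, x):
            z = torch.relu(self.enc(x))
            # decoder reuses the encoder weight matrix, transposed
            return F.linear(z, self.enc.weight.t(), self.dec_bias)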


Here's the OG triplet metric learning paper: https://arxiv.org/abs/1412.6622 Their strategy is for labeled data: just take X and X' with the same label, and Y with a different label.

And here's an example of unlabeled triplet learning: https://arxiv.org/abs/1711.02209 In this case, aug(X) might be a slightly time- or pitch-shifted example. So whatever the label is for X, it will be the same as the label for aug(X).

The advantage here is that as a framework it plays nicely with 'real' problems. Representation learning is rarely the actual problem; it's certainly not something an end-user cares about! If you are ultimately working on a classification problem and have a huge amount of unlabeled data and a tiny bit of labeled data, you can train the metric with both the labeled scheme from that paper (X and X' with the same label) and the unsupervised scheme (X and aug(X)) simultaneously, and take advantage of the large dataset.
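
In code it's basically just the sum of the two losses; a hedged PyTorch-style sketch (f, aug and the batches here are placeholders, not the papers' exact setup):

    import torch.nn.functional as F

    def triplet_margin(a, p, n, margin=1.0):
        # pull anchor/positive together, push anchor/negative apart
        return F.relu((a - p).pow(2).sum(dim=1) - (a - n).pow(2).sum(dim=1) + margin).mean()

    def combined_loss(f, aug, x, x_same, x_diff, u, u_other):
        # x / x_same share a label, x_diff has a different one (labeled scheme)
        supervised = triplet_margin(f(x), f(x_same), f(x_diff))
        # u is unlabeled; aug(u) is the positive, some other sample u_other the negative
        unsupervised = triplet_margin(f(u), f(aug(u)), f(u_other))
        return supervised + unsupervised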

A more general argument for metric learning is that, again, reconstruction error proooobably isn't what you actually care about. (Sure, you're sharing encoder and decoder weights, but why do you even need to decode?! How many parameters are being wasted on the need for the weight matrices to support decoding?) If clustering is what you want, metric learning gets at things being close or far from one another more directly.


Thanks for the info. I'll look into it.


That’s supervision of a different sort. Also, it may end up learning the effects of the noise, which may not reflect real data.


For text, language models.


Re 1) AEs are still a neat base for more advanced and very useful architectures: visual semantic segmentation uses a similar architecture that adds 1x1 skip connections, and LVAE/beta-VAE are useful for anomaly/fraud detection even today. AEs also give us a pretty cool way to do customizable dimensionality reduction, so PCA/ICA/t-SNE/UMAP/etc. might not be needed in the best case.
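
For the dimensionality-reduction use, a plain bottleneck AE already gives you a learned low-dimensional projection; a rough PyTorch sketch (sizes are arbitrary, not from any particular setup):

    import torch
    import torch.nn as nn

    class BottleneckAE(nn.Module):
        def __init__(self, n_in=784, n_code=2):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(), nn.Linear(128, n_code))
            self.decoder = nn.Sequential(nn.Linear(n_code, 128), nn.ReLU(), nn.Linear(128, n_in))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # after training with an MSE reconstruction loss, encoder(x) plays the
    # same role as a PCA / t-SNE projection of x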


The insight that made it possible for me to grasp VAEs was digging into the probabilistic setup that leads to this formulation. The neural networks are "just" powerful function approximators applied on top of this probabilistic framework.
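
Concretely, the whole training loss falls out of that setup: the encoder net outputs the parameters of q(z|x), the decoder net parametrizes p(x|z), and you minimize the negative ELBO. A rough sketch with a Gaussian prior, assuming PyTorch (encoder/decoder are placeholder networks, not the article's exact ones):

    import torch

    def vae_loss(encoder, decoder, x):
        mu, logvar = encoder(x)                                   # parameters of q(z|x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        x_hat = decoder(z)                                        # mean of p(x|z)
        recon = ((x - x_hat) ** 2).sum(dim=1).mean()              # -log p(x|z), Gaussian, up to constants
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()  # KL(q(z|x) || N(0, I))
        return recon + kl                                         # negative ELBO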



