Progressive Growing of GANs for Improved Quality, Stability, Variation [video] (youtube.com)
220 points by visarga on Oct 28, 2017 | 65 comments



I love how using training data from the Internet has resulted in the GANs believing that a picture of a "cat" often contains text -- or, at least, shapes resembling text -- at the top and bottom of the image. (Visible on the right side of the screen starting around 4:20.)


The interpolating animals look so much like my dreams it's unsettling. I often wake up when things get too "unrealistic" and people or animals are changing shape too rapidly. Whenever I see something come out of a NN that looks like something that came out of my brain I get a little future shock.


Did you see the horror faces it generated for humans? Good for Halloween!


I thought the cats looked like abominations.


The progress in this field is astonishing. Only a few years ago the typical demonstration in a GAN paper was an array of very small images -- I recall resolutions somewhere between 64x64 and 256x256. At the time, the argument was made that the resolution could hardly be increased, because it was really our brains doing the work of picking out objects and faces in those tiny samples and figuring out what they were supposed to represent.

Then here we are, with nearly indistinguishable 1024x1024 generations and trippy latent space interpolations. I know not every researcher or entrepreneur has the resources of NVIDIA to train for this many days, but let's not forget that the training only needs to happen once. It makes me wonder about the day a GAN manages to bankrupt stock photography services.


> I know not every researcher or entrepreneur has the resources of NVIDIA to train for this many days, but let's not forget, that part needs to occur only once.

It's not like they trained this on a GPU farm. According to the paper [1], they "trained the network on a single NVIDIA Tesla P100 GPU for 20 days".

[1] http://research.nvidia.com/sites/default/files/pubs/2017-10_...


Yes, but they probably did not get it right on the first try. You usually need hundreds of iterations to get it right. That does not imply 2,000 GPU-days, but it quickly pushes this out of reach of consumer hardware.


Where I live that thing alone is $17,000.


The P100 is essentially a 1080 Ti, so you can grab a 1080 Ti and get the same speed. The V100 might be the expensive option, up to 10x faster.


You should look at GPU memory bandwidth as a proxy for performance when training DNNs. The P100 is about 40% faster than a 1080 Ti. The V100 is only about 75% faster than a 1080 Ti.

Based on this, I expect these commodity GPU servers (with 10 1080 Ti cards) that cost 1/10th of the DGX-1 will be huge: https://www.servethehome.com/deeplearning11-10x-nvidia-gtx-1...


It may be the same hardware, but Nvidia is known to block existing hardware features in their drivers on consumer-level chips to drive prices. They did this in the past with quad-buffer stereo (full screen only, not available through OpenGL) and with decreased double-precision performance (transparently adding NOPs after each double-precision opcode).

I will believe that these devices are equal when someone shows me benchmarks proving that. Until then, I am skeptical based on past experience.


Worth pointing out the main criticism of GANs, which is that right now researchers don't really have a way to tell whether a GAN is just copying and pasting the training data (unlike in supervised learning, there is no "test set"). And in fact an ideal GAN could just learn to output the training set. One example someone found in the generated images for this model: https://twitter.com/nalkalchbrenner/status/92401333254951321....


This is a general problem for generative models, regardless of whether they are explicit latent density (variational autoencoders), implicit density (GANs), or fully observed (PixelCNN). To date no consensus has been reached in the research community on the "right" way to evaluate a generative model: e.g. Parzen window estimators, log-probability, ELBO, etc.

Yann LeCun himself (as of NIPS 2016) was pretty critical of probability-based metrics, as those have a strong dependence on the choice of model (e.g. if the model is poor, the log-probability is meaningless).

In GANs, the critic and the generator are trained w.r.t. each other, reaching some kind of equilibrium. A recent approach that seems to be "ok" for evaluating GANs was proposed in https://arxiv.org/abs/1705.05263: train a separate critic on the generator's outputs, for use only in evaluation (the generator never sees gradient information from this critic). This evaluation critic approximates the Wasserstein distance. One could imagine actually training the independent critic on a validation set of images not used during training.
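For the curious, a minimal PyTorch sketch of that evaluation idea. It assumes a trained generator G, a factory make_critic() that builds a fresh critic network, and a list of held-out real image batches val_batches -- all illustrative names, not from the paper or the linked work:

    import torch

    def estimate_wasserstein(G, make_critic, val_batches, z_dim=512,
                             steps=2000, clip=0.01, device="cpu"):
        # Train a *fresh* critic to separate held-out real images from generator
        # samples; its converged score gap approximates the Wasserstein distance.
        critic = make_critic().to(device)
        opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
        G.eval()
        for step in range(steps):
            real = val_batches[step % len(val_batches)].to(device)
            with torch.no_grad():                              # generator stays frozen
                fake = G(torch.randn(real.size(0), z_dim, device=device))
            loss = critic(fake).mean() - critic(real).mean()   # critic maximizes the gap
            opt.zero_grad()
            loss.backward()
            opt.step()
            for p in critic.parameters():                      # crude Lipschitz constraint
                p.data.clamp_(-clip, clip)                     # (WGAN-style weight clipping)
        # A larger gap means real and fake are easy to tell apart, i.e. a worse generator.
        return -loss.item()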


The critic idea seems interesting but doesn't really get to the question of whether/to what degree the GAN is just interpolating the training data. It seems more of a useful tool for diagnosing GANs.


If it were memorizing training data, it would do poorly against a Wasserstein critic trained on a validation set drawn from the same data distribution.


A GAN that just made verbatim copies from the training set would not be able to smoothly interpolate in latent space. Also, see Fig. 10 in the paper at http://research.nvidia.com/publication/2017-10_Progressive-G....


But they seem to have picked nearest neighbors in pixel space instead of z-space, which is not the best idea, no?


That is true, but if you don't look at them as a machine learning method, but rather as a computer graphics method, then it is quite impressive. It has the added benefit of being allowed to overfit as long as the average human does not find out. If you optimize for psychovisual metrics, GANs are fine.


Actually, GANs reach state of the art in anomaly/outlier detection and drug/molecule prediction, so there is certainly more to it than just artistic applications:

https://openreview.net/forum?id=S1EfylZ0Z

https://www.ncbi.nlm.nih.gov/pubmed/28703000

http://pubs.acs.org/doi/abs/10.1021/acs.molpharmaceut.7b0034...


But if you don't see it as a machine learning method, and don't care whether the things the GAN spits out are just memorized photos, that means you don't actually care about the synthesis part of the GAN? Then the only reason to get excited about it is the interpolation stuff, which significantly reduces how interesting it is in my eyes.


My interpretation of this "problem" is that it's a radically efficient solution to the problem of generating content in the absence of outside reinforcement.

Said another way: if we gave a human the task of [make a painting of a bridge] using a handful of examples of bridges as inspiration, and they made a 1:1 copy of one of them, that would be the most efficient result. However, there is generally a culturally implied task of [the new painting should not be a direct replication of one of the examples].

So this "problem" with GAN's is a novelty requirement which is not explicitly built in to the generation chain.


Can we please stop with these bullshit human comparisons that happen in every AI comment section?

Copying pictures is not efficient within the scope of this problem. The whole point of these algorithms is to extract (or ideally understand) the essential features of some class of objects and to be able to represent an object of that class with radically less data than would be required for the full description.

That is the only definition of efficiency that matters here.


No we can't. You're right that copying pictures is not the goal -- and that is my point. Simply copying pictures would satisfy the "adversarial" side of the network most efficiently, within the constraints of the GAN architecture. We would consider it a hack or a cheat because the problem is poorly defined, hence the paradox.

Human-level AI is the goal (at least mine), so every time we see something unexpected or a "failure" in ML, it's worth thinking about that failure mode and comparing it with how a human could hack the system.


To be frank - the problem is not poorly defined, you're just not aware of the definition.

In general in generative models, you have some "true data distribution" P and an estimator distribution Q.

The goal is to make P and Q the same, generally by minimizing some divergence between them.

The true objective is defined between the actual distributions P and Q, but because we only have finitely many data points, we define an empirical loss that uses just the observed samples from P. So if the model makes Q simply memorize the samples from P, it hasn't actually made P and Q similar; it has only minimized the empirical loss.

One practical way to get around this with GANs is to train a conditional GAN instead of an unconditional GAN, and then run the conditioned generation task on held-out samples from the validation set. Another good and perhaps more general solution is to train an inference network and to generate reconstructions on held out data points. If they look totally different, then the model is probably not very "representative".
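As a rough stand-in for training a full inference network, one can also just optimize a latent code per held-out image and see whether the generator can reproduce it at all. A minimal sketch, assuming a trained generator G and a held-out image tensor x_heldout (hypothetical names, with shapes that match the generator's output):

    import torch
    import torch.nn.functional as F

    def reconstruct(G, x_heldout, z_dim=512, iters=500, lr=0.05):
        # Search the latent space for a code that reproduces a validation image
        # the generator never saw; a persistently poor fit suggests the model
        # has not captured the underlying distribution.
        z = torch.randn(1, z_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(iters):
            loss = F.mse_loss(G(z), x_heldout)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return G(z).detach(), loss.item()   # best reconstruction and its final error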


Yeah. There are a lot of really sketchy transitions that don't look human in that video. Also, when the picture looks fully realistic, you can Google search it and find that there are lots of very similar images.

Now, the fact that they usually aren't identical to Google's finds is moderately impressive. So is the fact that some transitions are "smooth" - being able to move from one head/eye position to another. But the difference between drawing a face and copy-pasting someone's face onto a different hair+background is very significant, and quite often the algorithm seems to be doing the latter. (And in any case, GANs are clearly not the way humans draw faces.)

It would be interesting to see someone try to do the same thing without a neural network. How far would they get on the same training set? The dataset is 30,000 pre-aligned and cropped images. (Would be nice if there was a searchable version to make sure generated versions are not identical to something in that set.)

I bet you could get pretty far with just matching and region replacement, plus some color corrections. But no one would pay you for that.


There are a few things to say about this:

1. If you train a conditional GAN to do image inpainting (for example, left to right), the degree to which the model is copying and pasting the training set should be quite apparent -- by running the model with "given" parts taken from the test set.

2. I disagree that an ideal GAN could just output the training set. I think the right conceptual framework is that any generative model is trying to produce a distribution similar to the data distribution, and we try to accomplish this by using samples from the data distribution. So if the model memorizes the training set, then it isn't actually that close to the true underlying data distribution. In likelihood-based models (for example the usual generative RNN) you can test this by evaluating likelihood on a validation set.
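For the likelihood-based case, that check is just held-out negative log-likelihood. A small sketch, assuming a trained next-token model that maps a batch of token ids to logits (an illustrative interface, not any particular library's API):

    import torch
    import torch.nn.functional as F

    def heldout_nll(model, val_sequences):
        # Average next-token negative log-likelihood on sequences the model never
        # saw; a memorizing model scores badly here even if training loss is tiny.
        total, count = 0.0, 0
        with torch.no_grad():
            for seq in val_sequences:                  # seq: 1-D LongTensor of token ids
                logits = model(seq[:-1].unsqueeze(0))  # assumed to return (1, T, vocab) logits
                nll = F.cross_entropy(logits.squeeze(0), seq[1:], reduction="sum")
                total += nll.item()
                count += seq.numel() - 1
        return total / count                           # per-token NLL (lower is better)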


This is also relevant to your comment: https://arxiv.org/abs/1705.07663


Seems like you could validate using random points in the latent space, and crudely verifying that the nearest neighbors in image space aren't similar.
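Something like this, presumably (a crude NumPy sketch; G and train_images are assumed to be a generator callable returning arrays and an array of training images -- illustrative names only):

    import numpy as np

    def nearest_neighbor_check(G, train_images, n_samples=16, z_dim=512):
        # For a few random latent points, find the closest training image in raw
        # pixel space; tiny distances would hint at regurgitated training data.
        flat_train = train_images.reshape(len(train_images), -1)    # N x (H*W*C)
        results = []
        for _ in range(n_samples):
            z = np.random.randn(1, z_dim).astype(np.float32)
            sample = np.asarray(G(z)).reshape(-1)                   # generated image, flattened
            dists = np.linalg.norm(flat_train - sample, axis=1)     # L2 distance per training image
            idx = int(dists.argmin())
            results.append((idx, float(dists[idx])))
        return results    # (index of nearest training image, distance) for each sample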


Looking at synthetic celebrities is unsettling. They're all familiar. They're obviously celebrities, but it's impossible to remember their names.


Exactly like the actual celebrities, then... I think I recognized one of them.


Generated images start at 0:43 in the video. Link to paper here:

http://research.nvidia.com/sites/default/files/pubs/2017-10_...

Source: NVIDIA


Well, I guess the propaganda bots will be a little more convincing now that they can have unique profile photos.

Would be nice if I could use this to convince Facebook that some fictional image is myself, though.


Oh, damn. I've already had enough of fake personas, but now there is virtually no limit.


Holy crap that video. It makes me think of the scramble suit from A Scanner Darkly[1]

[1]: https://www.youtube.com/watch?v=BWne23FfKW8


Had the same thought! Imagine how much time they could have saved drawing it all!


Yea, but this looks much better. http://blendberg.com/pl/aktualnosci


I’ve seen this going around twitter and the video looks cool but I don’t really understand what’s going on.

Can someone explain it in layman’s terms?


Neural Networks are really good at identifying patterns in data. As a classic example, if you wanted to predict housing prices, you could build a data set that maps features about houses (square feet, location, proximity to Caltrain, etc) onto their actual price, and then train a network to recognize the complex relationship between features and pricing. Training happens by feeding the network features, letting it make a guess about the price, and then correcting the guess (backpropagation).

Convolutional Neural Networks work similarly, but with images. Instead of giving a CNN discrete features, you'll usually just use the pixels of the image itself. Through a series of layers, the CNN is able to build features itself (traditionally things like edges, corners) and learn patterns in image data. For example, a CNN might be trained on a dataset that maps images onto labels, and learn how to label new images on its own.

This video uses Generative Adversarial Networks (GANs) to actually generate new images. In this case, you have two networks "competing" against each other. One network is a traditional CNN trying to identify whether an image is "real" or computer generated, and the second network tries to generate new images to trick the first network.

We've been able to generate fairly realistic small images before (usually 64x64), but doing it this well on high-resolution (1024x1024) images is unprecedented.

http://kvfrans.com/generative-adversial-networks-explained/
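To make the two-network game concrete, here is a toy PyTorch sketch on one-dimensional "data" drawn from a Gaussian; the architectures and numbers are illustrative only, nothing like what the video uses:

    import torch
    import torch.nn as nn

    # "Real" data is just numbers drawn from N(3, 0.5); the generator has to learn
    # to produce numbers the discriminator can't tell apart from them.
    z_dim, data_dim = 8, 1
    G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(5000):
        real = torch.randn(64, data_dim) * 0.5 + 3.0
        fake = G(torch.randn(64, z_dim))

        # Discriminator: label real samples 1, generated samples 0.
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: try to make the discriminator call its samples real.
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    print(G(torch.randn(5, z_dim)).detach())   # samples should cluster around 3.0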


Typically in a neural network, you train a single network against a single loss function that is known in advance. For example, an autoencoder is (usually) a neural network that has a chokepoint somewhere and is trained to reconstruct the input image. Since there is a chokepoint (a layer that is significantly smaller than the input), it learns to compress the input and reconstruct it -- sort of like lossy image compression. To train it, and to tell how well it does, we can just measure the output against the input (the difference between the reconstructed image and the original input image). This tells us how well the network does, and gives us a well-known loss function we can use in advance.
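A minimal PyTorch sketch of that bottleneck autoencoder (the sizes are illustrative, e.g. 784 pixels squeezed through an 8-unit chokepoint):

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 8))
    decoder = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 784))
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    x = torch.rand(64, 784)                  # stand-in for a batch of flattened images
    for _ in range(200):
        recon = decoder(encoder(x))
        loss = ((recon - x) ** 2).mean()     # the known, fixed loss: reconstruction error
        opt.zero_grad(); loss.backward(); opt.step()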

But what if we don't have a loss function? Or we don't know it? (For example, how do we even measure "what makes a face a celebrity-like face"?) In that case we can train the network against another network that is itself trained to differentiate between a "real" input and a "fake" input. This new network takes an image as input and outputs a probability that the input is real or fake. We don't know the loss function, but by alternating which batch of images this network gets (fake or real), we can tell how well it does (it should estimate that the real ones are real, and the fake ones fake). By training these two networks in tandem, we can use the information from the new network (the discriminator) to tell the old network (the generator) how to generate new, better images. This way we don't really need to know the loss function in advance, because the discriminator serves as our loss function.

That is the general idea. In practice, it's fairly non-trivial to get these two networks to work together nicely... often one will get much better than the other, which prevents the other from learning.

In this particular paper, they are using a technique to grow the images to much larger sizes than you would normally be able to reach.
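Roughly, the growing works by fading each new, higher-resolution layer in gradually so training stays stable. A sketch of that blend on the generator side, with hypothetical module names (to_rgb_lo, new_block, to_rgb_hi) standing in for the real network pieces:

    import torch.nn.functional as F

    def grow_step(features, to_rgb_lo, new_block, to_rgb_hi, alpha):
        # Low-resolution path: the old output, simply upsampled 2x.
        lo = F.interpolate(to_rgb_lo(features), scale_factor=2, mode="nearest")
        # High-resolution path: the newly added block.
        hi = to_rgb_hi(new_block(features))
        # alpha ramps from 0 to 1 during training, fading the new layer in.
        return (1 - alpha) * lo + alpha * hi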


A neural network is fed hundreds of thousands of images of celebrity faces. From this it learns the typical characteristics of human faces, and can then be used to draw new faces with random combinations of those characteristics.

This work is interesting because it is the first one that has been this successful at high resolution, so the output images are large and detailed rather than postage stamp sized.



Wow. Watching this guy is painful


Why?


I wonder if something like this could be combined with 3D animation to produce super realistic computer-animated films.


Nice idea.


Some of the generated celebrity images seem pretty much indistinguishable from reality. Does reality exist any more? In the future, it may get harder and harder to know.


Wonder how long until this is co-opted by the porn industry and what the law will have to say about it.

Is it illegal to own a digital brain that can think up illegal porn?


I would say that it probably is. If you trained a network using illegal data (e.g. cp images), then not only did you have to possess that data at some point, which is of course illegal, but the data itself is at least partially encoded in the network weights, which I think should make it illegal.


I guess you might be able to get around this if you train only on legal content and can interpolate into content that would be illegal if a real recording.

However, I'm not sure whether there are any other applications for this specific interpolation scenario that would lead to it being developed, as the effort required to make it work is likely much higher.


Having the model produce realistic interpolations through areas of the latent space that had no associated training data is surely something that people will be trying to make happen.


In the US, child pornography with digital actors has already been ruled to be legal.


Note to all: please don't do this. There are lots of missing kids out there, and the agents working in this area don't need any more distractions.


Love how this was downvoted. Way to stay sane HN... I don't think that any conversations focused at further promoting child abuse imagery should be encouraged in this community, particularly when such misapplication of these technologies would profoundly derail existing investigations.

But fuck it. Downvote away!


The authors have a GitHub repo containing code and links to the paper and more outputs:

https://github.com/tkarras/progressive_growing_of_gans


Written in Theano, of course! R.I.P.


A GAN is a Generative Adversarial Network[0], the video is like animated Deep Dream[1] stuff but way more refined.

I don't like the horizontal sliding transition because I'm way focused on the bizarre iterations of the various targets.

Gonna have to update our camouflage patterns again to combat computer vision...

[0]: https://en.m.wikipedia.org/wiki/Generative_adversarial_netwo...

[1]: https://en.m.wikipedia.org/wiki/DeepDream


For people interested in the details, I wrote a small summary of the paper: https://github.com/aleju/papers/blob/master/neural-nets/Prog... (Prior knowledge of GANs is recommended, otherwise it's probably hard to understand.)


This+VR = porn industry dream (also propaganda industry dream)


Meanwhile, my Neural translation models don't converge. Sigh.


Interesting. In another 300-500 years, I am pretty sure we will start simulating sensory experience and ultimately the past. I am not sure if I am a toy simulation of the past from the future, right now.


> I am not sure if I am a toy simulation of the past from the future, right now.

Think about this: what is the only generation that lives in a time when massive recording, storage, and communication are possible, but that has not yet been influenced by AI? Ours. When an AI wants to simulate a "natural" human society, it will have only us as templates to train its humanGAN on. Even our past comments on reddit and Twitter have been used many times to train dialogue systems -- we're being uploaded with each post we make.


> In another 300-500 years

instead of "In the future", triggers me. Do you know something we do not? I am resigned to the notion that everything is up for grabs.

> I am not sure if I am a toy simulation

We need "reality discriminants". The trippy thing is if they could exist and their output is not necessarily boolean. There would be a threshold point at which beings can exist along the simulated-real spectrum, were by they can understand the output of the "reality discriminants", yet they are not real.


The way these GAN algorithms work is precisely by building a network that discriminates real from fake. That's why they become so good at this!



Interesting thought.



