Great work! Hacker News still seems to have a deeply skeptical culture with regard to machine learning - not sure why. There's always someone saying it's "not novel" and it's "just doing x".
Overfitting is a known issue in machine learning, people. If you still think all neural networks are doing is memorizing the dataset completely in the year 2021 - you might want to revisit the topic. It is one of the first concerns anyone training a deep model will have, and to assume this model is overfit _without_ providing specific examples is arguing in bad faith.
Sentdex has shown his GAN is able to generalize game logic like collision and friction with vehicles, and that it learns aspects of rendering such as a proper reflection of the sun on the back of the car.
He also showed weak points where the model is incapable of handling some situations, even attempting the impossible task of "splitting a car in two" to try and resolve a head-on collision. Even though this is a failure case, it should at least give you some intuition that the GAN isn't just spitting out frames memorized from the dataset, because that never happens in the dataset.
You will need to apply a little more rigor before outright dismissing these weights as merely overfit.
@sentdex Have you considered a guided diffusion approach now that that's all the rage? It's all rather new still but I believe it could be applied to these concepts as well.
Heh, yeah, tough crowd I guess. The full code, models, and videos are all released and people are still skeptical.
I feel like 95%+ of papers don't do anything besides tell you what happened and you're just supposed to believe them. Drives me nuts. Not sure why all the hate when you could just see for yourself. I'd welcome someone who can actually prove the model just "memorized" every combo possible and didn't do any generalization. I imagine the original GameGAN researchers from NVIDIA would be interested too.
Interesting @ guided diffusion, wasn't aware of its existence until now. We've had our heads down for a while. Will look into it, thanks!
> I feel like 95%+ of papers don't do anything besides tell you what happened and you're just supposed to believe them.
Honestly I think there's a big problem with page limits. My team recently had a pre-print that was well over 10 pages and we still didn't fit everything in, and then when we submitted to NeurIPS we had to cut it to 9! This seems to be a common problem and is why you should often check the different versions on arXiv. And we had more experiments and data to convey than in the pre-print. The problem is growing as we have to compare more things, and tables can easily take up a full page. I think this exaggerates the problem that always exists of not explaining things in detail and expecting readers to be experts. Luckily most people share source code, which helps show all the tricks the authors used, and blogging is becoming more common, which further helps.
> I'd welcome someone who can actually prove the model just "memorized" every combo possible
Indeed. Novel, efficient program synthesis is still novel, efficient program synthesis even if it's a novel, efficient data compression codec you're synthesising.
>> The full code, models, and videos are all released and people are still skeptical.
If you're uncomfortable with criticism of your work you should definitely try publishing it, e.g. at a conference or journal. It will help you get comfortable with being criticised very quickly.
Perhaps, but that criticism should be the easiest to ignore. The OP expresses frustration at lay criticism, and I expect that even brief contact with academic criticism will make the frustration the OP feels at lay criticism fade into irrelevance.
I've been learning about this stuff for about a year now. Your earlier experiments with learning to drive in GTA V were an inspiration for me - because they hit that perfect intersection of machine learning, accessibility in education, and just plain cool.
The pandemic hit and OpenAI had released DALL-E and CLIP. I was unemployed and bored with my Python skills and decided to just dive in. I found that a nice gentleman named Phil Wang on GitHub had been replicating the DALL-E effort and decided to start contributing!
We have a few checkpoints available with Colab notebooks ready, and there is also a research team with access to some more compute who will eventually be able to perform a full replication study and match a similar scale to OpenAI, and then some, because we are also working with another brilliant German team https://github.com/CompVis/ who has provided us with what they call a "VQGAN" (if you're not familiar) - an autoencoder that quantizes images into vision tokens, with the neat trick from GAN-land of using a discriminator in order to produce fine details.
We use their pretrained VQGAN to convert an image into digits. We use another pretrained text tokenizer to convert words to digits. Both sets of digits go into a Transformer, with a causal mask applied so the model can never peek ahead at the image tokens it is about to predict. The digits come out and we decode them back into text and image respectively. Then a loss is computed. Rinse, wash, repeat. Slowly but surely, text predicts image without ever having been able to actually _see_ the image. Insanity.
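If it helps make the wiring concrete, here's a minimal sketch of that text-then-image setup in PyTorch. To be clear, the class name, vocab sizes, dimensions, and the plain cross-entropy loss are my own stand-ins for illustration, not the actual DALLE-pytorch code:

    import torch
    import torch.nn as nn

    TEXT_VOCAB, IMAGE_VOCAB = 49408, 1024   # assumed BPE vocab + VQGAN codebook sizes
    TEXT_LEN, IMAGE_LEN = 64, 256           # 256 image tokens = a 16x16 latent grid

    class TextToImageTransformer(nn.Module):
        def __init__(self, dim=512, depth=6, heads=8):
            super().__init__()
            self.text_emb = nn.Embedding(TEXT_VOCAB, dim)
            self.image_emb = nn.Embedding(IMAGE_VOCAB, dim)
            self.pos_emb = nn.Embedding(TEXT_LEN + IMAGE_LEN, dim)
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, depth)
            self.to_image_logits = nn.Linear(dim, IMAGE_VOCAB)

        def forward(self, text_tokens, image_tokens):
            # Text tokens first, image tokens after.
            x = torch.cat([self.text_emb(text_tokens), self.image_emb(image_tokens)], dim=1)
            x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
            # Causal mask: each position attends only to earlier positions, so the
            # prediction for an image token never sees that token or anything after it.
            n = x.size(1)
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
            x = self.transformer(x, mask=mask)
            # The hidden state at position i predicts token i+1; grab the image positions.
            image_positions = x[:, TEXT_LEN - 1 : TEXT_LEN + IMAGE_LEN - 1]
            return self.to_image_logits(image_positions)

    model = TextToImageTransformer()
    text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))     # stand-in tokenized captions
    image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN))  # stand-in VQGAN codes
    logits = model(text, image)
    loss = nn.functional.cross_entropy(logits.reshape(-1, IMAGE_VOCAB), image.reshape(-1))
    loss.backward()

At inference time you'd feed just the caption, sample image tokens one at a time, and hand them to the VQGAN decoder to get pixels back.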
Anyway, taking a caption and making a neural network output an image from it has again hit that "perfect intersection of machine learning, accessibility in education, and just plain cool". I don't know if you could fit it into the format of your YouTube channel but perhaps it would be a good match?
FWIW I saw your video a couple of days ago via Reddit and I loved it a lot. Even sent a link to the video to a friend of mine because I think it was a very inspiring and interesting video.
One of the main problems with ML/NN is that it often works like magic, i.e. the trick works as long as the audience doesn't know the secret behind it. It's fascinating to a gullible audience, mundane bordering on boring to practitioners.
>able to generalize various game logic like collision/friction with vehicles and also learns aspects of rendering such as a proper reflection of the sun on the back of the car
It did none of that. What this model did is learn all the frames of the video and their chronological order according to the input.
> impossible task of "splitting a car in two" to try and solve a head-on collision.
It played back both learned versions at once, like reporting the confidence of a round thing being 50% ball and 50% orange.
In the end, everything is boiling down to matrix math, so you can always make the argument that no neural network is impressive if you want.
The model's size is ~173MB, depending on settings. That's not much space to have memorized every single possible combination of events, nor was our data enough to cover that either.
Your original self driving GTA5 videos are what helped me come to understand machine learning in the first place (along with some of Seth Bling's MarI/O, and a bit of Tom7's learn/play-fun magic). I used your tech to make an AI that played Donkey Kong Country in LSNES emulator shortly before Gym-Retro was released.
Not offhand, but you've probably inspired a lot of creativity with this across the internet... and a lot of copy cats. I'm looking forward to seeing what gets made.
>> The model's size is ~173MB, depending on settings. That's not much space to have memorized every single possible combination of events, nor was our data enough to cover that either.
The resolution of the images output by the model is very low (what is it exactly, btw?). It's not impossible that your model has memorised at least a large part of its data.
In fact the simplest explanation of your model's output (as of much of deep neural networks for machine vision) is that it's a combination of memorisation and interpolation. There was a recent-ish paper by Pedro Domingos that proposed an explanation of deep learning as memorisation of exemplars similar to support vectors (if I understood it correctly - I only gave it a high-level read).
It's also difficult to see from your demonstration exactly what the relation between the output and the input images is. You're showing some very simple situations in the video (go left, go right), but is that all that was in the input?
For example, I'd like to see what happens when you try to drive the car over the barrier. Was that situation in the input? And if so, how is it modelled in the output?
Finally, how do you see this having real-world applications? I don't mean necessarily right now, but let's say in 30 years time. So far, you need a fully working game engine to model a tiny part of an entire game in very low resolution and very poor detail. Do you see this as somehow being extended to creating a whole novel game from scratch? If so, how?
Edit: on memorisation, it's not necessary to memorise events, only the differences between sets of pixels in different frames. For instance, most of the background and the road stays the same during most of the "game". Again, the resolution is so low that it's not unfathomable that the model has memorised the background and the small changes to it necessary to model the input. So, it interpolates, but can it extrapolate to unseen situations that are nevertheless predicted by the physics you suggest it has learned, like driving over the barrier?
That is impressive! Less than twice the size of the ResNet-50 weights. Surely that is within an order of magnitude of an equivalent Unity or Godot game+models.
> My Tiger repelling rock^^^^^^leopard detection model works great on all animal pictures ... until you feed it a sofa
I'm sorry, how is this different from normal software engineering? There are dozens of unit/integration testing memes poking fun at specifically this (which is a mostly solvable problem in ML btw, when you use out-of-distribution data. Give your model a 3rd end state that represents "neither").
> id did none of that, what this model did is learn all the frames of video and their chronological order according to the input.
A better explanation is that the network knows what frame to generate given the current frame (and n previous frames) and the current user input. If it were memorizing, it would have to store an extremely large number of scenarios (the count grows exponentially, since any given frame has k possible actions leading from it to the next frame). If Sentdex can run the game for arbitrary length and take arbitrary actions, then it is a far more reasonable explanation that the model is generating the frames rather than memorizing them. Apply Occam's Razor.
Edit: Sentdex said the model was ~173MB, so that is not large enough to memorize the gameplay.
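For intuition, a back-of-envelope version of that argument (every number here is an assumption picked for illustration, not a measurement from the project):

    actions_per_frame = 3   # e.g. steer left, steer right, no input
    fps = 30
    seconds = 10            # even a short ten-second rollout

    frames = fps * seconds
    distinct_trajectories = actions_per_frame ** frames
    print(f"{distinct_trajectories:.2e} distinct ten-second action sequences")
    # ~1.4e+143 -- a lookup table holding frames for every playable trajectory would
    # have to be astronomically larger than ~173 MB of weights.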
Maybe I'm misinterpreting, but if you've ever seen a cat freak out about a cucumber (an entire video genre, apparently), ostensibly real intelligences make similar errors.
Beyond rote memorization, it looks like it could be explained by saying the model appears to have found a concept of consonance and dissonance that is bounded within the field of its inputs, and a networked grammar for interacting with the up/down/left/right inputs. Some people might find that technically trivial, but as a layman I am impressed.
The "magic" part is that the response of the network appears to be so complex relative to its inputs, but given the input is so limited from a controller, it's easy to attribute more meaning to it when it is working with a finitely bounded simulated model.
Generally I'd wonder, if the behaviour appears more complex than the stimuli, do we tend to attribute intent to it?
Yes, when it is there for no valid reason, or ridiculous reasons. Skepticism is not a default position you can take like a toddler refusing to eat their vegetables. You need some informed (and non-fallacious) intelligent reasoning behind that. "I'm skeptic about this thing using X because X is so hyped these days" is not such reasoning.
Well, it kind of is. Blockchain has been hyped by charlatans as the cure to all the world's ills. That means when you read something about blockchain you should be especially suspicious.
Similarly, I've read too many people hyping up glorified chatbots as one step below AGI (see the :o reactions to GPT3), so I'm now extra skeptical about claims about machine learning.
"I'm skeptical about this thing using X to do Y because the burden of proof is on people claiming X does Y and historically they have failed to meet that burden"
I don't know what skepticism has to do with ridiculous toddlers - they are almost universally incapable of grasping the nuances of epistemology.
There's skepticism and then there's being a non-expert in a field and talking with high confidence. How do you differentiate these? Conspiracy theorists use the same logic. You're right that skepticism is good, but it is easy to go overboard.
Sure, but skepticism should decrease when a community of experts is saying the same thing. As an example, anti-vaxxers often claim skepticism and that they have done their own research. The reason we don't trust them is that we think doctors have greater expertise in the subject than they do (it is, either way, trusting someone). Unless you're a virologist you probably don't actually have the expertise to verify vaccine claims.
So sure, you are right, but in the context of this discussion you're implying that the vast majority of ML researchers (myself included) are charlatans. I'm not sure what the meaningful difference here is. We're publishing results, people are actively reproducing them, and then some person on the internet that doesn't understand the subject comes along and says "you're full of shit." We can even disprove the claims being made (e.g. I've explained why the network can't be memorizing the game in another comment). That is literally happening in this thread (GAN Theft Auto is in fact a replication/extension effort). Is that meaningfully different from the anti-vaxxers?
I think it's a problem when it turns into being skeptical for the sake of it.
I haven't been on HN too long, but the top comment on most threads is a contrarian one (which I truly appreciate because it provides a different POV). Sadly, because this is encouraged through high upvotes, the crowd's tendency is to regress towards this approach, even when the rigour of the critique is lacking.
It can be, but it's certainly not an unmitigated good. Especially when it leads to aspersions of fraud and conspiratorial thinking (e.g. rasz's comment thread below).
Skepticism is good when it targets bold claims with vague proof. This is not a bold claim (it's a video demo showing the process) and its proof is not vague (you can inspect the source). Skepticism over something like GPT-2 without more than sample output is good. Skepticism over GPT-2 with a workable demo and source is unhelpful.
I like your YouTube videos in general and think this content is a great benefit to the community.
I wouldn't take the few negative comments personally - I've seen many GAN architectures that heavily overfit (including my own bobross pix2pix) get a lot of praise, while 'less violating' models (like yours) get more skepticism. Skepticism isn't bad! But I'd wager in your case it may be because you're a YouTuber, and other ML YouTubers are notorious for ripping off content (e.g. Siraj).
Not really related to this, but I’d personally love to see the difference in training times it would take an RL agent to adequately learn to drive a car in gta versus adequately flying a helicopter.
One throwaway line about GAN operating systems now made me want to see a shell GAN. Keypresses as inputs, 80x24 terminal screens as outputs. Could a neural network dream of Unix?
Wow, what an incredible video and showcase. This really puts GPT-3's power into perspective. I can't wait till the public has access to something that powerful - or maybe I should enjoy not receiving GPT-3 phishing emails in my inbox.
GPUs: am I a joke to you? Instead of using them to render polygons, let’s use them to train neural networks that produce models that make them unnecessary. I’m oversimplifying - but pretty wild nonetheless.
Something I'd like to see is a visualization of subsets of the network's internal state that correlate with simple quantities like compass direction, velocity, position, etc. It'd be really fascinating to see where in the model these things are being learned, whether they are concentrated in a small area or spread out, and whether this is somewhat consistent across different iterations of the model.
Me too! In a much simpler setting a former colleague of mine, Jacob Hilton, tried such an exploration for the vision part of an OpenAI CoinRun model. It's the first part of this paper: https://distill.pub/2020/understanding-rl-vision/
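If anyone wants to try that on these weights, one generic starting point is a linear probe over saved hidden activations. This is just a sketch with stand-in data; the activation hooks and the logged ground-truth quantities (speed, heading, position) would have to come from the actual project:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # activations: (n_frames, n_hidden) hidden states captured at one layer during play
    # velocity:    (n_frames,) ground-truth speed logged from the game while recording
    rng = np.random.default_rng(0)
    activations = rng.normal(size=(5000, 512))                 # stand-in data
    velocity = activations[:, :8].sum(axis=1) + rng.normal(scale=0.1, size=5000)

    X_tr, X_te, y_tr, y_te = train_test_split(activations, velocity, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print("probe R^2 on held-out frames:", probe.score(X_te, y_te))

    # Large |coefficient| on a few units suggests the quantity is concentrated there;
    # a flat coefficient profile suggests it's spread out across the layer.
    top_units = np.argsort(np.abs(probe.coef_))[::-1][:10]
    print("most predictive hidden units:", top_units)

Repeating that per layer and per training run would get at the "concentrated vs. spread out" and "consistent across iterations" questions directly.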
Can someone explain a bit more on the long term applicability, or maybe other use cases that might be easier to appreciate?
The reason I ask is that it seems very challenging to generate the training data for such systems. Could someone explain how this can go further than just replicating X? Assuming some creative freedom, could you give an idea of what the long-term application of this would be?
NB: please take my questions at face value without thinking I'm implying this isn't cool for what it is. I'm all for people having fun. I'm all for projects not needing to tackle some grander issue.
That is an interesting thought. I don't fully understand how though. The main challenge is the training data. If you need to first create the interactive experience... What would the added value be?
One example is creating novel combinations based on a (large) training set. If the network had learned enough about what looks realistic, it could create novel game experiences based on a prompt of, say, a film or a book.
Looks interesting, if very far from practical-- too bad it requires a "DGX" station to train
It seems to flicker/fade things in a lot, like the random poles that keep appearing and disappearing. It seems like there is not enough focus on temporal consistency or something?
If you look at the source output image before it was upscaled, you'll notice the resolution is so low that thin objects "fall between" the pixels. The upscaler then interprets them as air, it seems.
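A toy illustration of that sampling effect (made-up sizes and a naive strided downsample, not the model's actual internals):

    import numpy as np

    hi_res = np.zeros((64, 64))
    hi_res[:, 33] = 1.0                  # a one-pixel-wide pole in the full-res frame

    lo_res = hi_res[::4, ::4]            # naive 4x downsample: keep every 4th pixel
    print("pole pixels at full res:", int(hi_res.sum()))         # 64
    print("pole pixels after downsampling:", int(lo_res.sum()))  # 0 -- the pole vanished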
It should be possible to take arbitrary video training data (whether from a game or real-life) and automatically reconstruct the 3D models of all vehicles in the scene (and the skybox) and "play back" the scene in a video game engine.
This is the direction virtual and augmented reality is headed (Facebook Codec Avatar, and their room reconstruction technology).
Impressive. Makes you wonder if at some point in the future there isn't a game engine any more but tons of training material and you play in a generated dream.
Certainly impressive. And sure, maybe in a distant future. Though I think this is one of those things where creating a working prototype that is 75% complete is the "easy" part. The other 25% (which you need for an actual working product) will take forever. Like self-driving cars, nuclear fusion, etc.
Text dungeon was very much a 75% complete thing, yet it's greatly entertaining in its own right. I would happily play a dream game which just falls apart sometimes.
I fail to see the novelty here. What's the size difference between the model and all of the 64x32 image training data? If the difference is not significant, you're basically almost just scrubbing a video, right?
The GAN model is the game environment. You're playing a neural network. The novelty is that there's no game engine and no rules; the network just learned how to represent the game, and you can play it.
The first link provided seems to need a very detailed human-provided cost function for specific development needs.
The second one is indeed interesting research and seems to be a combination of the prior learned motion mapping working in tandem with a generative model.
I suppose you could say that the automation of the dataset counts as "augmentation", but the difference here is that the dataset is just pixels and inputs rather than all that animation info and simulation data. Yes, a simulation is running, but the GAN only gets the pixels and the input.
There's a similarity there though; you're right. In either case, the explicit goal of the video you posted is to combat the runtime constraints of generative models. I'm not certain it's a fair comparison.
The latter video and sentdex's result both seem to generalize to unique scenarios not present in the training set. This may mean they are creating an efficient representation of the underlying data in order to predict future samples more easily than simply overfitting.
The top level comment here is a shallow dismissal and Randomoneh could have answered these questions themselves before throwing out a smug comment like "I fail to see novelty here" when it's at the very least the first large-scale GAN successfully trained on GTA V.
The first link exposes the trick employed by your model.
>animation info and simulation data
But did your model learn any of that?
>explicit goal of the video you posted is to combat runtime constraints
The trick to motion mapping is feeding a lot of data with accompanying inputs to build an atlas you can reference during playback.
>first large-scale GAN successfully trained on GTA V
It's really cool. The problem I had is in the presentation. I immediately felt insincerity bordering on scamming the audience, because I assume someone working in this field would know how the sausage is made. From the YT clip: "the shadow and reflection works", "modeling of physics works". Do they? Or did your model build an atlas of video frames it can play back according to the fed input? I'm guessing weather/time of day was locked when recording training data - perfect shadow and constant sun position for a nice reflection. Searching for 1:1 matches of generated output in the training set would be interesting and pretty revealing.
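For what it's worth, that check is straightforward to sketch: for each generated frame, find its nearest training frame and see how close the match is (stand-in data and sizes below; nothing here uses the actual project's frames):

    import numpy as np

    def nearest_training_frame(generated, training):
        """generated: (H, W, 3); training: (N, H, W, 3); float arrays in [0, 1]."""
        diffs = training - generated[None]                        # broadcast over N frames
        mse = (diffs ** 2).reshape(len(training), -1).mean(axis=1)
        best = int(np.argmin(mse))
        return best, float(mse[best])

    training = np.random.rand(1000, 48, 80, 3)   # stand-in training frames, low resolution
    generated = np.random.rand(48, 80, 3)        # one stand-in generated frame
    idx, err = nearest_training_frame(generated, training)
    print(f"closest training frame: #{idx}, per-pixel MSE {err:.4f}")
    # near-zero MSE for most generated frames would suggest playback of memorized
    # frames; consistently large MSE would suggest the frames are genuinely new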
> I immediately felt insincerity bordering on scamming the audience
MFW I read this. Jeez man. Model size is 173MB. It didn't just memorize every possible combo.
How the hell you went from our excitement about a fun project we shared on YT to accusing us of "scamming" the audience I really don't know. What a terribly rude and hateful attitude you have =/
Don't take it personally. Commenters on HN are famous for dismissing successful ideas (remember Dropbox?).
I have one question: you mentioned that the training data was 100GB. Was it the same resolution as what is output by the model (ignoring supersampling)?
I wouldn't call it scamming, but 173MB is not small at all. At the resolution of this model, you can easily fit the entire Titanic movie in 173MB. Maybe even have enough space for audio.
Furthermore, no one is saying the model "memorized every possible combo". However, imagine you have a set of keyframes (maybe even multiple fragments per frame) and you need to interpolate between them. Not that hard of a task, is it?
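A rough back-of-envelope for that point, with assumed numbers (the model's real internal resolution and any codec comparison are guesses, not measurements):

    movie_minutes = 194              # Titanic's runtime
    fps = 24
    frames = movie_minutes * 60 * fps            # ~279,000 frames
    budget_bytes = 173 * 1024 ** 2               # the reported model size

    print(f"{budget_bytes / frames:.0f} bytes per frame on average")   # ~650 bytes
    # A modern codec at a very low resolution can get close to that by storing sparse
    # keyframes plus small inter-frame differences -- the interpolation point above.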
Models don't care about simulating our "intention" properly. They care about fitting the input in the simplest way possible. Think about a model like a lazy worker merely trying to look like it's working.
None of this makes NNs less exciting, but it should tell us that you can't go from 0 to 60 in one step and hope the NN will have great insight into what it's doing.
We need models that make smaller conceptual jumps, i.e. models that understand 3D space, then models that understand transformations in 3D space, then models that understand cityscapes, etc.
It sounds like you and others are trying to clarify how this demo doesn't live up to your idealized, subjective expectations. No one is claiming this to be a revolutionary or even useful video game engine.
It's a neural network that recreates a limited, yet fully dynamic gameplay segment only based on player input. It's a really neat and fun project.
I think it's quite telling that you accuse me of having idealized, subjective expectations and then describe the demo as "limited yet fully dynamic gameplay". It rotates the car to the left or right depending on whether you press left or right.
It's super-interesting but it doesn't recreate limited fully dynamic gameplay. It doesn't recreate any sort of dynamic gameplay. That's your idealized, subjective interpretation.
The driving seems pretty dynamic to me. Maybe "fully" was a bit hyperbolic, as I can't really justify or quantify what that would entail. On the other hand, saying that it's not dynamic at all seems equally misguided. Also you seem to disregard the "limited" and "segment" qualifiers which was there for a reason.
> However, imagine you have a set of keyframes (maybe even multiple fragments per frame) and you need to interpolate between them. Not that hard of a task, is it?
Interestingly, the video artifacts of this model look somewhat similar to those from simple motion interpolation algorithms such as ffmpeg's minterpolate, especially during fast camera motion.
https://ffmpeg.org/ffmpeg-filters.html#minterpolate
I feel scammed when a practitioner of the art tries to sell me on his model "learning the physics of the simulation. Look, it even figured out where to put the shadow".
Have you seen the video? The author even goes as far as suggesting the technique might be useful for (generating?) entire operating systems at https://www.youtube.com/watch?v=udPY5rQVoW0&t=853s. That's just wild.
I suggested there could be a "future where many game engines are entirely or even mostly AI based like this. Or even things like operating system or other programs."
The thought here was just a wondering of what the future might be and if we might have far more AI based programs.
I still think the answer is a strong yes, this is a glimpse into the future. Nowhere did I say GameGAN would be that engine. You're just trying your hardest to hate.
Manipulative much? I don't hate you (well, so far), you aren't being attacked, I'm just noting what a few informed people here don't like about your video. No, they aren't trolls. And, yes, everyone has different level of tolerance to exaggerations, of course.
Odd, pretty sure it was you who misrepresented what I said in attempts to manipulate.
You were also the one who "exaggerat[ed]" my claims. I made a general statement about my thoughts on future software being AI-based rather than human-coded.
I still think that's indeed the inevitable future. It doesn't seem remotely outrageous or exaggerated. I never said GameGAN would be that software, but you seem to want to make that the case so you can put it down.
What makes you believe neural networks aren't or could not be deterministic? What makes you think NNs could not eventually produce far more robust, reliable, and secure operating systems?
Seems obvious to me, but I guess you're more informed than me :)
You, like many YouTubers, made completely exaggerated claims in your commentary. Your model fits a sequence of inputs to a video frame. But you say "wow look, it even models the movement of the sun!". It's pretty absurd.