For anyone who isn't familiar, DALL-E is the state of the art for text-to-image generation. It's closed source, but it's astonishing: https://openai.com/blog/dall-e/
Hey Phil, not quite on topic, but I've used some of your implementations (both for hobby projects and for work), and the conciseness and clarity of your code is a delight. Many thanks!
They already had a small drone war: Azerbaijan vs. Armenia. Armenia had 1990s Soviet equipment; Azerbaijan had recently purchased drone tech. Armenia got totally destroyed.
These are previews of future wars, much like the wars in the decades before WW1 previewed the utter slaughter that was to come thanks to the new technological era of mechanized warfare.
OpenAI has a reputation and a history of pushing compute forward (AI in Dota 2, GPT, CLIP, DALL-E, etc.).
They have everything to lose by lying. If they say that these examples are not cherry-picked, then we have no reason (a priori) to doubt them.
On a side note, the fact that you could doubt the results are real is telling: each of their compute-heavy experiments strains belief at first, and then further reinforces their reputation.
This was a big pet peeve of mine early in my Linux usage as well (still is, really). Especially since it is not exactly straightforward to figure out what binaries get installed when you install a package. It almost feels like there should be some namespacing like pkg_name->cmd_name so you can at least tab-complete a package's binaries easily.
Yes, Debian has the same sort of thing, though maybe spread across a few commands. I actually just prototyped what I am thinking of here: https://github.com/seiferteric/pkg_namespace if anyone is interested.
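As a rough sketch of the pkg_name -> cmd_name idea (this is just an illustration, not the linked prototype), on Debian you can already recover the mapping by filtering the output of `dpkg -L`:

    # List the commands an installed Debian package ships, by filtering the
    # file list from `dpkg -L <package>` down to bin/sbin directories.
    import subprocess

    def package_binaries(pkg: str) -> list[str]:
        files = subprocess.run(
            ["dpkg", "-L", pkg], capture_output=True, text=True, check=True
        ).stdout.splitlines()
        bin_dirs = ("/bin/", "/sbin/", "/usr/bin/", "/usr/sbin/", "/usr/local/bin/")
        return sorted({f.rsplit("/", 1)[-1] for f in files if f.startswith(bin_dirs)})

    # e.g. package_binaries("coreutils") includes "ls", "cp", "mv", ...
    print(package_binaries("coreutils"))

Wiring something like that into shell completion would give you the pkg_name->cmd_name tab completion described above.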
I disagree with that. The primary command-line executable provided by systemd is systemctl, which is an apt name for the tool that interfaces with and controls the system daemon. The daemon itself is also an executable systemd provides, but not one that is intended to be invoked directly from the command line.
ImageMagick is also far more than the command-line utility it provides to interface with its library. Say it had been written by a third party but did the same thing using the ImageMagick library as a dependency; would it then be fine for it to have a different name?
Ryan Murdock/advadnoun here: Glad to see some interest in the project!
I've written a few follow-ups to this, with some public notebooks as well that produce qualitatively different results. If anyone is interested, it should be pretty easy to find the BigSleep method, which steers BigGAN in a very similar way to this, as well as the Aleph notebooks, which use the DALL-E decoder or Taming Transformers VQGAN to generate images with CLIP, depending on the version of the notebook.
One thing that has fascinated me about this is the possibility of translation across dissimilar categories.
What I mean is we have style transfer in the domain of text, and we can style transfer with images. And we can generate images from text. Can we style transfer from image to text or vice versa? Can prose be rewritten in a manner that, in some sense, adheres to aesthetic principles of impressionist painting?
Presumably there would be some kind of informational representation of text style discernable to an image generation system. And just like an artistic style can be extracted from a painting and transposed to a photograph, perhaps an interpretation of textual style could be applied to a photograph despite them being different mediums.
What would that even look like? I don't know, but I find the possibilities fascinating.
The temptation, I think, is to make a first pass at answering this question in a frustrating, cartoonishly shallow way. And I think systems will possibly be developed that just go ahead and do it before people are culturally ready to understand it in a non-frivolous way. Everyone needs to get those reactions out of their system, I guess, but there's a more nuanced possibility here that might allow clashing of dissimilar categories in ways never previously contemplated.
Nice! This points me back to my favorite mental model of machine learning/neural nets: it's always about shuffling in and out of a number of dimensions and the mappings between them.
For some reason Colab let me run this all last night. Wow just wow. Eventually I got the feeling that it matters if I'm watching.
Each of these links has about 9 images and the prompts that made them. Sometimes the image does not look like an animal right off the bat; then it seems like you asked for something the network had to say.
How long did it take for you? I just did one text-to-image run and Colab was churning away for a few hours, but then it choked after maybe 300-ish iterations with an error about some kind of space limit being reached.
Were you able to complete the 1050 iterations, and how long did it take?
- the model usually locks in within 200-300 iterations, so if you don’t like the result by then, retry
- in fact, you can tell if the model is off to a good start within 25-50 iterations and I encourage you to cherry-pick runs early and often; don’t be afraid to restart
- time to render depends on which GPU you get from colab, but I usually run the renders for 10 minutes a pop. About 1-2 minutes if I run them on a 3090 locally
- the prompt plays a big role in the quality of the result; “A painting of a dog playing fetch” will usually turn out better than “dog playing fetch”
- lucidrains/bigsleep produces better results generally than lucidrains/deepdaze (this is my subjective preference)
- the Colabs linked to from the big-sleep GitHub repo produce poorer results than running it as a Python package locally (this one might honestly be placebo); see the sketch below for the local route
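For anyone curious about the local route, running big-sleep as a Python package looks roughly like this. I'm going from memory of the repo's README, so treat the exact parameter names as approximate and check the current version:

    from big_sleep import Imagine

    # Text prompt in, periodic image checkpoints out (saved to the working directory).
    dream = Imagine(
        text = "a pyramid made of ice",   # the prompt; "A painting of ..." phrasing usually helps
        lr = 5e-2,                        # learning rate for the latent optimization
        save_every = 25,                  # write an image every 25 iterations
        save_progress = True              # keep intermediate images instead of overwriting
    )
    dream()

If I remember right, the package also installs a `dream` command-line entry point that does the same thing straight from the shell.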
> - the prompt plays a big role in the quality of the result; “A painting of a dog playing fetch” will usually turn out better than “dog playing fetch”
However, it can get taken very literally, in that you might get a picture that features a frame around the painting.
Another thing that comes to mind as a corollary is that the AI seems to like being constrained in its outputs. So adding something like “in the style of Monet” to the end of the prompt will return much more coherent results.
True. For the article, I experimented with a number of 'in the style of' prompts. Where there's a distinctive visual iconography with strong key features, BigSleep [1] does an amazing job of abstracting and reproducing that style. Besides artists, it also does very well with iconic movies like Blade Runner and The Matrix.
How does copyright work in those cases? For example, if you train your model on copyrighted images, wouldn't the result be a derivative of the works used in the dataset?
If the result comes from a "sum" of different images, how can you calculate the split for royalties?
Is it possible to "reverse engineer" the result to see which data points and at what proportion contributed to the final result?
If anyone is interested in trying this without Google Colab, I have a site that takes a text prompt and renders it for you: https://dank.xyz
One of the models is lucidrains’ implementation of the excellent Big Sleep model by Ryan Murdock. The other models are mostly based on the work of Federico Galatolo.
The queue is temporarily paused as I’m upgrading the hardware to a better GPU, but I encourage you to browse the existing renders or submit your own for when the server is back online.
This uses CLIP to optimize a GAN's input to generate an output matching a text description. Optimization is very slow, it's basically the same process as training. DALL-E uses a feedforward network to directly predict an image from text. But that model hasn't been published yet.
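For the curious, here is a minimal sketch of that optimization loop, assuming OpenAI's clip package and PyTorch. The generator below is a tiny random stand-in so the snippet is self-contained; Big Sleep plugs a pretrained BigGAN in its place and optimizes the latent the same way:

    import torch
    import torch.nn.functional as F
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)
    model = model.float()  # keep everything in fp32 so the backward pass is well-behaved

    # Encode the prompt once; it stays fixed during the optimization.
    with torch.no_grad():
        tokens = clip.tokenize(["a lizard king wielding a sword"]).to(device)
        text_feat = model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Stand-in generator: latent vector -> 64x64 RGB image. A real run would load
    # a pretrained GAN here and optimize its latent input in exactly the same way.
    generator = torch.nn.Sequential(torch.nn.Linear(128, 3 * 64 * 64), torch.nn.Sigmoid()).to(device)
    latent = torch.randn(1, 128, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=0.05)

    for step in range(300):  # each iteration is a full forward+backward pass, hence the slowness
        image = generator(latent).view(1, 3, 64, 64)
        image = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
        img_feat = model.encode_image(image)  # CLIP's input normalization omitted for brevity
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        loss = -(img_feat * text_feat).sum()  # maximize cosine similarity with the prompt
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

DALL-E, by contrast, amortizes all of this into a single forward pass of a trained model, which is why it doesn't pay this per-image optimization cost.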
Domain registrars will often make the initial registration of a domain cheaper than the renewal costs. Beware the renewal costs, or be prepared to rename to a new domain.
An interesting thing about CLIP is that when it doesn't know what something looks like, the generated pictures tend to contain the prompt text written in them instead. That's also why it confuses "an iPod" with "a piece of paper with iPod written on it".
There are many similar apps and projects. Many of them are Google Colab notebooks, which can be used for free via a web browser. See "List of sites/programs/projects that use OpenAI's CLIP neural network for steering image/video creation to match a text description" at https://www.reddit.com/r/MachineLearning/comments/ldc6oc/p_l...
> This is just a teaser. We will be able to generate images, sound, anything at will, with natural language. The holodeck is about to become real in our lifetimes.
Does anyone have any similar resources for other forms of media generated via natural language inputs?
For a non-neural AI take on text-to-scene generation (statistical parsing, symbolic representation, rule-based, 3d models with inverse kinematics), check out Bob Coyne's WordsEye:
The quality of the image really depends on the quality of the prompt, and a LOT of cherry picking.
I find that big sleep is also a better model than the one linked here (deep daze), generally.
I’ve generated several hundred images myself and found a few real treasures. Here are a few of my personal favourites:
“A painting of a murder in the style of Monet” [0]
“A photo of fellas in Paris” [1]
“A painting of Thanos wearing the Infinity Gauntlet in the style of Rembrandt” [2]
I definitely agree that in the general case the examples are underwhelming, but I believe there is a lot of potential here. Personally I’m super excited to unlock the potential of human-guided, AI-assisted creative tooling. Some Colab notebooks let you actively explore the latent space of a model to direct the results where you want them to go. As the generate-adjust feedback loop gets tighter, we’re gonna see some crazy things.
It would be interesting to assemble frames from a movie, say at scene changes, etc. and have this thing "narrate" a movie. It would be similar in concept (though not in content) to how narration for the visually impaired is offered as a form of assisted media viewing.
In fact, you could probably train it with existing visual descriptions from movies.
This is amazing. I highly recommend submitting something via the Google Colab notebook and actually seeing how the code generates the images over time... I'm currently waiting and watching "a lizard king wielding a sword" form, and the actual formation of the image is really interesting as well.
Kudos on making an accessible notebook to play around with this. Currently have one running 'baby jumps over a house' on my machine. It's at iteration 50 and starting to take shape, literally; will post here if it turns out to be decent.
I wrote a script a while back that pairs clip art to class names (e.g. MicrophoneRecorder -> pictures of a microphone and a tape recorder). The goal was to add a visual component to naming your abstractions.
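For anyone who wants to hack on something similar, here's a rough sketch of one way to do the pairing with CLIP. This is my guess at an approach, not the parent's actual script, and it assumes you already have a folder of candidate clip-art images:

    # Split a CamelCase identifier into words and rank candidate clip-art images
    # by CLIP similarity to the resulting phrase.
    import re
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def split_camel_case(name: str) -> str:
        # "MicrophoneRecorder" -> "microphone recorder"
        return " ".join(re.findall(r"[A-Z][a-z]*|[a-z]+", name)).lower()

    def rank_images(class_name: str, image_paths: list[str]) -> list[tuple[str, float]]:
        text = clip.tokenize([f"clip art of a {split_camel_case(class_name)}"]).to(device)
        images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
        with torch.no_grad():
            text_feat = model.encode_text(text)
            image_feat = model.encode_image(images)
            text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
            image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
            scores = (image_feat @ text_feat.T).squeeze(1)
        # Best-matching images first.
        return sorted(zip(image_paths, scores.tolist()), key=lambda t: -t[1])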
Is there a way to get it to use my GPU more aggressively? I have a 3090 and right now it's using ~5% of capacity, according to Task Manager's GPU usage for the browser window it's running in.
I guess it is supposed to help us find tumors and whatnot, synthesize images rather than rely on an artist to do it, etc. But if you listen to a lot of ML talks they sometimes say, "we trained this on images but the technique works on any 2D or 1D signal" so it's applicable to signal problems in general.
This is incredible... The first few images could easily be used as album cover art.
Is there a way to perform a similar translation with music? For example, if you play in D Minor (the saddest of all keys), is there a way to map the key or some other musical characteristic to a word and have the images be generated with the intermediate being the primary source? Or would the approach be to map images to certain characteristics of music directly?
Currently both Big Sleep and Deep Daze are generating 512x512. These ones are representative: https://postimg.cc/HVspWgPn
Few are "collages"; most images have full-area coherence.
“A man painting a completely red image” is very much a dadaist collage. The only complaint is that the ‘man’ could be rather more recognizable as such.
A few of my favorite results: https://m.imgur.com/a/bPETAG6 https://m.imgur.com/r9YdYRE https://youtu.be/tp2IuT-cgHc
https://www.reddit.com/r/MediaSynthesis/comments/mbhrkl/the_...