Text to Image Generation

corysama · on March 29, 2021

https://www.reddit.com/r/MediaSynthesis/ has a tiny community of people playing with tools like this.

A few of my favorite results: https://m.imgur.com/a/bPETAG6 https://m.imgur.com/r9YdYRE https://youtu.be/tp2IuT-cgHc

https://www.reddit.com/r/MediaSynthesis/comments/mbhrkl/the_...

weinzierl · on March 29, 2021

"The Notorious B.I.G. raps H.P. Lovecraft's Nemesis with AI"

Oh man, this video is crazy.. awesome.. crazy - not sure. The images are creepy, creepy as hell.

hypertele-Xii · on March 29, 2021

I've dabbled in trying to make something more... substantial, with these techniques.

https://www.youtube.com/watch?v=IU5MYV0CtpU

0x01DEED · on March 29, 2021

This is probably the best Notorious B.I.G. song that has ever existed.

exo-pla-net · on March 29, 2021

For anyone who isn't familiar, DALL-E is the state of the art for text-to-image generation. It's closed source, but it's astonishing: https://openai.com/blog/dall-e/

lucidrains · on March 29, 2021

I have a working repo for that

https://github.com/lucidrains/dalle-pytorch

It just needs to be trained

m3at · on March 29, 2021

Hey Phil, not quite on topic but I've used some of your implem (both for hobby projects and for work) and the conciseness and clarity of your code is a delight. Many thanks!

lucidrains · on March 29, 2021

Thanks for the kind words and glad you found them useful :)

komuher · on March 29, 2021

I love ur transformer repos, its always joy to re-implement stuff from ur code its easy to read and clean :D

lucidrains · on March 29, 2021

Thanks :D

kgin · on March 29, 2021

This is the first time I've seen new tech and felt a twinge of fear instead of thinking "oh cool".

EGreg · on March 29, 2021

Perhaps you’ve never seen this:

https://m.youtube.com/watch?v=9CO6M2HsoIA

Compare 2:33 in the above fictional video to this real one:

https://m.youtube.com/watch?v=DjUdVxJH6yI

https://www.thedrive.com/the-war-zone/35726/the-army-has-unv...

narrator · on March 29, 2021

They already had a small drone war. Azerbaijan vs Armenia. Armenia had 1990s soviet stuff. Azerbaijan had recently purchased drone tech. Armenia got totally destroyed.

https://www.msn.com/en-us/news/world/attack-drones-dominatin...

These are previews of future wars, much like the wars in the decades before WW1 previewed the utter slaughter that was to come thanks to the new technological era of mechanized warfare.

Blikkentrekker · on March 29, 2021

I don't see how this is scarier than the nuclear bombs that already exist.

EGreg · on March 29, 2021

Less collateral damage. L

Much more accessible to regular people.

Ability to be anonymous and hide in a crowd of drones.

cbogie · on March 29, 2021

yeah exactly

amelius · on March 29, 2021

> It's closed source, but it's astonishing

Really? Their name says OpenAI.

Anyway, how do we know their examples were not cherry-picked? Do they have an online demo?

woko · on March 29, 2021

OpenAI has a reputation and a history for pushing compute forward (AI in Dota2, GPT, CLIP, DALL-E, etc.).

They have everything to lose by lying. If they say that these examples are not cherry-picked, then we have no reason (a priori) to doubt them.

On a side-note, the fact that you could doubt the results are real is telling: each of their compute-heavy experiments shakes our belief and further reinforces their reputation.

ctdonath · on March 29, 2021

8-|

The singularity approaches.

otikik · on March 29, 2021

Small pet peeve of mine: The package name is `deep-daze` but the executable is called `imagine`.

This bothers me. More than it should.

Imagemagick used to do the same thing - but they judiciously renamed their `convert` executable to `magick`. Still not perfect, but an improvement.

If your package introduces one command-line executable, it should always be called the same as your package.

seiferteric · on March 29, 2021

This was a big pet peeve of mine early in my linux usage as well (still is really). Especially since it is not exactly straight forward to figure out what binaries get installed when you install a package. Almost feels like there should be some namespacing like pkg_name->cmd_name so you can at least tab complete package binaries easily.

EnigmaCurry · on March 29, 2021

Arch makes it pretty easy to recall:

# Who owns binary? pacman -Qo /bin/ls

# What files come with package? pacman -Ql coreutils | grep ......

seiferteric · on March 29, 2021

Yes debian has the same types of things, but maybe spread across a few commands. I actually just prototyped what I am thinking here https://github.com/seiferteric/pkg_namespace if anyone is interested.

anoncake · on March 30, 2021

It's just `apt-file search /path/to/file` actually. I think there's a dpkg command that is limited to installed packages as well.

Blikkentrekker · on March 29, 2021

I disagree with that. The primary command line executable provided by systemd is systemctl, this is an apt name to interface with and control the system daemon, which is an executable it also provides, but not one that is intended to be directly invoked from the command daemon.

ImageMagick is also far more than the command line utility that it provides to interface with it's library. Say that it had been written by a third party, but did the same thing using the ImageMagick library as a dependency, would it then be fine for it to have a different name?

advadnoun · on March 29, 2021

Ryan Murdock/advadnoun here: Glad to see some interest in the project!

I've written a few follow-ups to this, with some public notebooks as well that produce qualitatively different results. If anyone is interested, it should be pretty easy to find the BigSleep method, which steers BigGAN in a very similar way to this, as well as the Aleph notebooks, which use the DALL-E decoder or Taming Transformers VQGAN to generate images with CLIP, depending on the version of the notebook.

lucidrains · on March 29, 2021

Ryan's Big Sleep notebook https://colab.research.google.com/drive/1NCceX2mbiKOSlAd_o7I...

Aleph2Image notebook https://colab.research.google.com/drive/1Q-TbYvASMPRMXCOQjkx...

glenstein · on March 29, 2021

One thing that has fascinated me about this is the possibility of translation across dissimilar categories.

What I mean is we have style transfer in the domain of text, and we can style transfer with images. And we can generate images from text. Can we style transfer from image to text or vice versa? Can prose be rewritten in a manner that, in some sense, adheres to aesthetic principles of impressionist painting?

Presumably there would be some kind of informational representation of text style discernable to an image generation system. And just like an artistic style can be extracted from a painting and transposed to a photograph, perhaps an interpretation of textual style could be applied to a photograph despite them being different mediums.

What would that even look like? I don't know, but I find the possibilities fascinating.

The temptation, I think, is to make a first pass at answering this question in a frustrating, cartoonishly shallow way. And I think systems will possibly be developed that just go ahead and do it before people are culturally ready to understand it in a non-frivolous way. Everyone needs to get those reactions out of their system, I guess, but there's a more nuanced possibility here that might allow clashing of dissimilar categories in ways never previously contemplated.

ksm1717 · on March 29, 2021

Nice ! This points me back to the my favorite mental model of machine learning/nn. It’s always about shuffling in an out of a number of dimensions and the mappings between them.

When you make this please name it SynesthesAI

flowery · on March 29, 2021

For some reason Colab let me run this all last night. Wow just wow. Eventually I got the feeling that it matters if I'm watching.

Each of these links has about 9 images and the prompts that made them. Sometimes the image does not look like an animal right off the bat, then it seems like you asked for something the network had to say.

https://postimg.cc/vDvYdBhC https://postimg.cc/XpHnpzT8 https://postimg.cc/HVspWgPn

That Syd Mead style ... mwa

indigodaddy · on March 29, 2021

How long did it take for you.. I just did one text to image and Colab was churning away for a few hours but then choked after maybe 300ish image iterations saying something about some kind of space limit being reached..

Were you able to complete the 1050 iterations, and how long did it take?

indiv0 · on March 29, 2021

Not the OP, but some tips:

- the model usually locks in within 200-300 iterations, so if you don’t like the result by then, retry

- in fact, you can tell if the model is off to a good start within 25-50 iterations and I encourage you to cherry-pick runs early and often; don’t be afraid to restart

- time to render depends on which GPU you get from colab, but I usually run the renders for 10 minutes a pop. About 1-2 minutes if I run them on a 3090 locally

- the prompt plays a big role in the quality of the result; “A painting of a dog playing fetch” will usually turn out better than “dog playing fetch”

- lucidrains/bigsleep produces better results generally than lucidrains/deepdaze (this is my subjective preference)

- the colabs linked to from the big-sleep GitHub repo produce poorer results than running them as a python package locally (this one might honestly be placebo)

Hard_Space · on March 29, 2021

> - the prompt plays a big role in the quality of the result; “A painting of a dog playing fetch” will usually turn out better than “dog playing fetch”

However, it can get taken very literally, in that you might get a picture that features a frame around the painting.

indiv0 · on March 29, 2021

Indeed!

Another thing that comes to mind as a corollary is that the AI seems to like being constrained in its outputs. So adding something like “in the style of Monet” to the end of the prompt will return much more coherent results.

Hard_Space · on March 29, 2021

True. For the article, I experimented with a number of 'in the style of' prompts. Where there's a distinctive visual iconography with strong key features, BigSleep [1]does an amazing job of abstracting and reproducing that style. Besides artists, it also does very well with iconic movies like Blade Runner and The Matrix.

[1]The Deep-Daze author's most popular T2I mashup: https://github.com/lucidrains/big-sleep

https://colab.research.google.com/drive/1NCceX2mbiKOSlAd_o7I...

indigodaddy · on March 29, 2021

Very helpful feedback, thanks very much!

indigodaddy · on March 29, 2021

Wow! Fascinating images. The “Chinese ink painting made of poetry” one is particularly interesting and beautiful.

varispeed · on March 29, 2021

How does copyright work in those cases, for example if you train your model using copyrighted messages, then wouldn't the result be a derivative of the works used in the dataset? If the result comes from a "sum" of different images, how can you calculate the split for royalties? Is it possible to "reverse engineer" the result to see which data points and at what proportion contributed to the final result?

malka · on March 29, 2021

it is a grey area. There has been a huge drama in the furries world some time ago about this: https://www.reddit.com/r/HobbyDrama/comments/gfam2y/furries_...

indiv0 · on March 29, 2021

If anyone is interested in trying this without Google Colab, I have a site that takes a text prompt and renders it for you: https://dank.xyz

One of the models is lucidrains’ implementation of the excellent Big Sleep model by Ryan Murdock. The other models are mostly based on the work of Federico Galatolo.

The queue is temporarily paused as I’m upgrading the hardware to a better GPU, but I encourage you to browse the existing renders or submit your own for when the server is back online.

flowery · on March 29, 2021

The gallery is truly valuable as a reference - thank you.

bennyp101 · on March 29, 2021

Wow, I submitted one, then scrolled down the queue ... there is a /lot/ of porn requests, and some ... rather disturbng ones as well!

etaioinshrdlu · on March 29, 2021

This uses CLIP to optimize a GAN's input to generate an output matching a text description. Optimization is very slow, it's basically the same process as training. DALL-E uses a feedforward network to directly predict an image from text. But that model hasn't been published yet.

w_for_wumbo · on March 29, 2021

I think it'd be neat to take an old text only terminal game, and just auto generate the pictures based off responses.

_carbyau_ · on March 29, 2021

Then I might try Zork....

Honestly, I probably wouldn't, I don't want my mental vagaries of what a Grue looks like to be set by some random image.

ilaksh · on March 29, 2021

This is not really related but if you want to play Zork over Gemini I set up a whole gemini server and domain just for that. gemini://zork.club

.club was on sale for $1.17

oarsinsync · on March 29, 2021

Domain registrars will often make the initial registration of a domain cheaper than the renewal costs. Beware the renewal costs, or be prepared to rename to a new domain.

_carbyau_ · on March 29, 2021

But how about making lovecraftian horrors come into being!

astrange · on March 29, 2021

I tried making a grue with CLIP+FFT and got 1. a lot of construction cranes 2. some kind of bat thing.

https://imgur.com/a/UkPzcXZ

An interesting thing about CLIP is that when it doesn't know what something looks like, it instead generates pictures with the search text in them. That's why it confuses "an iPod" with "a piece of paper with iPod written on it".

mattkevan · on March 29, 2021

I ran some Lovecraft through Story2Hallucination[1] which uses Big Sleep to make videos from text.

The results were quite something - https://m.imgur.com/tfWLsSR

[1] https://github.com/lots-of-things/Story2Hallucination

barbs · on March 29, 2021

Maybe I should try hooking it up to Nethack!

Blikkentrekker · on March 29, 2021

> AssertionError: CUDA must be available in order to use Deep Daze

And here I was hoping to generate extensive phallic imagery on this most auspicious of nights.

This “CUDA” is Nvidia only if memory serve, correct?

exo-pla-net · on March 29, 2021

Yes, but you can engage in your phallic pursuits, for free, using Google Colab.

thr0w__4w4y · on March 29, 2021

Yeah, got the same thing and remembered reading "This will require that you have an Nvidia GPU". SOL, but still pretty cool.

Hard_Space · on March 29, 2021

I wrote a piece on this, including a chat with the creator of this notebook/GitHub (also commenting in this thread), a couple of weeks ago: https://rossdawson.com/futurist/implications-of-ai/future-of...

wsk7323 · on March 29, 2021

There are many similar apps and projects. Many of them are Google Colab notebooks, which can be used for free via a web browser. See "List of sites/programs/projects that use OpenAI's CLIP neural network for steering image/video creation to match a text description" at https://www.reddit.com/r/MachineLearning/comments/ldc6oc/p_l...

Disclosure: I am the author of this list.

enoreyes · on March 29, 2021

> This is just a teaser. We will be able to generate images, sound, anything at will, with natural language. The holodeck is about to become real in our lifetimes.

Does anyone have any similar resources for other forms of media generated via natural language inputs?

Mandatum · on March 29, 2021

Does this count? https://affinelayer.com/pixsrv/

sedachv · on March 29, 2021

For a non-neural AI take on text-to-scene generation (statistical parsing, symbolic representation, rule-based, 3d models with inverse kinematics), check out Bob Coyne's WordsEye:

http://www1.cs.columbia.edu/~coyne/papers/wordseye_siggraph....

https://www.wordseye.com/

http://www1.cs.columbia.edu/~coyne/

ultimatecrouton · on March 29, 2021

I would be interested in reading an explanation of how this works, which I can't see on the repo.

I'm familiar with SIRENs and CLIP, but not immediately obvious how the two are utilised here.

person_of_color · on March 29, 2021

Is it just me, or are the examples just not that good?

indiv0 · on March 29, 2021

The quality of the image really depends on the quality of the prompt, and a LOT of cherry picking.

I find that big sleep is also a better model than the one linked here (deep daze), generally.

I’ve generated several hundred images myself and found a few real treasures. Here’s a few of my personal favourites:

“A painting of a murder in the style of Monet” [0]

“A photo of fellas in Paris” [1]

“A painting of Thanos wearing the Infinity Gauntlet in the style of Rembrandt” [2]

I definitely agree that in the general case the examples are underwhelming, but I believe there is a lot of potential here. Personally I’m super excited to unlock the potential of human-guided, AI-assisted creative tooling. Some Colab notebooks let you active explore the latent space of a model to direct the results where you want them to go. As the generate-adjust feedback loop gets tighter we’re gonna see some crazy things.

[0]: https://www.reddit.com/r/MediaSynthesis/comments/l4hbkl/text...

[1]: https://www.reddit.com/r/MediaSynthesis/comments/l4eg64/text...

[2]: https://www.reddit.com/r/deepdream/comments/l4hq22/texttoima...

pugworthy · on March 29, 2021

It would be interesting to assemble frames from a movie, say at scene changes, etc. and have this thing "narrate" a movie. It would be similar in concept (though not in content) to how narration for the visually impaired is offered as a form of assisted media viewing.

In fact, you could probably train it with existing visual descriptions from movies.

cl42 · on March 29, 2021

This is amazing. I highly recommend submitting something via the Google Collab notebook and actually seeing how the code generates the images over time... I'm currently waiting and watching "a lizard king wielding a sword" form, and the actual formation of the image is really interesting as well.

cl42 · on March 29, 2021

If anyone is interested, here's the lizard king progression: https://twitter.com/Wojciech/status/1376371663542087682

867-5309 · on March 29, 2021

>Install >$ pip install deep-daze >Examples >$ imagine "a house in the forest" >That's it.

if only! just spent an hour on fresh installs of debian 10 and ubuntu 20.04 with python 2.7 and 3.6 alternatively - not having it

I understand it has a lot of required packages but please, write a bloody install guide

ronyfadel · on March 29, 2021

Computer hallucinating words. Reminds me of teratoma, a tumor that can develop hair, teeth, muscles and bone.

monkeydust · on March 29, 2021

Kudos on making an accessible notebook to play around with this. Currently have one running 'baby jumps over a house' on my machine. At iteration 50 and its starting to take shape, literally, will post here if it turns out to be decent.

mevorah · on March 29, 2021

I wrote a script a while back that pairs clip art to class names (ex. MicrophoneRecorder -> pictures of a microphone and tape recorder). The goal was to add a visual component to naming your abstractions.

Definitely going to update with this!

monkeydust · on March 29, 2021

Is there a way to get it to use my GPU more aggressively? Have a 3090 and right now its using ~5% of the capacity according to task manager looking at GPU usage of the browser window its running on.

rvz · on March 29, 2021

Going to predict that someone will find this out and sell some generated text-to-images as NFTs and hype it all over Twitter.

Then will cause another wave of copycats selling some generated text-to-images as NFTs.

flowery · on March 30, 2021

"super crab" ... it made a shrine with figurines and a logo https://postimg.cc/5jhgGtCv

flowery · on March 30, 2021

"the answer to life the universe and everything" https://postimg.cc/w3g4TMvp

flowery · on March 29, 2021

"minimalist pumpkin" https://postimg.cc/1fvHzzq8

r0b05 · on March 29, 2021

Remarkable! It's not quite there but it can only get better.

This is why I like HN on a Monday morning.

tirrex · on March 29, 2021

This looks like something I’ll be addicted.

I couldn’t think that would be possible, very interesting.

andyp-kw · on March 29, 2021

And we said that an AI couldn't produce art..

person_of_color · on March 30, 2021

Has anyone tried to sell their work as NFT?

joshspankit · on March 29, 2021

I find the titles much more “artsy” by reading the one above. Can easily picture a couple of them in a gallery

sabujp · on March 29, 2021

arg, needs cuda

clircle · on March 29, 2021

What is the point of this work? Other than that it is cool.

clircle · on March 29, 2021

Despite the downvotes, I'm still curious about the problem that this solves.

flowery · on March 30, 2021

I guess it is supposed to help us find tumors and whatnot, synthesize images rather than rely on an artist to do it, etc. But if you listen to a lot of ML talks they sometimes say, "we trained this on images but the technique works on any 2D or 1D signal" so it's applicable to signal problems in general.

What those are, I have no idea.

rajangdavis · on March 29, 2021

This is incredible... The first few images could easily be used as album cover art.

Is there a way to perform a similar translation with music? For example, if you play in D Minor (the saddest of all keys), is there a way to map the key or some other musical characteristic to a word and have the images be generated with the intermediate being the primary source? Or would the approach be to map images to certain characteristics of music directly?

jquaint · on March 29, 2021

I wonder if you could use another model that describes music and feed that text into this one?

Even something based on spotify's music labeling api would be super interesting!

coldcode · on March 29, 2021

I will get excited when I see this making images bigger than 256 pixels.

flowery · on March 30, 2021

Currently both Big Sleep and Deep Doze are generating 512x512. These ones are representative: https://postimg.cc/HVspWgPn Few are "collages", most images have full-area coherence.

aasasd · on March 29, 2021

> as album cover art

Indeed, what I see is an album cover generator.

“A man painting a completely red image” is very much a dadaist collage. The only complaint is that the ‘man’ could be rather more recognizable as such.