For those without a GPU / not a powerful enough one / wanting to use SD on the go, you can start the hlky stable diffusion webui (yes, web ui) in Google Colab with this notebook[0].
It's simple and it works, using colab for processing but actually giving you a URL (ngrok-style) to open the pretty web ui in your browser.
I've been using that on-the-go when not at my PC and it's been working very well for me (after trying numerous other colab-dedicated repos, trying to fix them, and failing).
Additionally, you can have all your generated images sync to Google Drive automatically.
Google is paying, and yes, you can use it for free, but they will disconnect you after a while. And if you abuse it too much, you won't be able to use it until the following day...
You can also buy Colab Pro and Colab Pro+, which have fewer limitations and faster GPUs.
In my usage Colab and Colab Pro were similar, with plain Colab occasionally OOMing during model loading. That said, I've actually been seeing slower times on Colab than yours, and I think slower than on my RTX 3080: ~15 secs per image. I'm not sure why, though.
You are much better off running it locally at those speeds. A P100 does 13 to 33 seconds a batch in my experience. Cloud-to-cloud data transfer (Hugging Face to Colab) is ridiculously fast tho.
I'm on Colab Pro and get about 3 steps per second when generating a single 512x512 image at a time, with slight throughput improvement when I batch 2-3 images
Yup, totally free (with a Google account). It's run as a learning resource and there's an upsell to Colab Pro and Colab Pro+, but for running Stable Diffusion it makes it very easy to get started!
I think it's less of an upsell and more of a data collection and market positioning play. Google would like to, for example:
1. Be central in the machine learning ecosystem. This has broad ripple effects, such as recruiting.
2. Track how you use machine learning. This can be used for everything from understanding trends in machine learning to, again, robustly identifying individuals for recruiting efforts.
The cost is nominal at Google scale for what Google is getting. I suspect the pricing for the higher-end tiers is less a money-making scheme and more an acknowledgment that, at some point, free is no longer sustainable (and if unlimited compute were free, it would be prone to abuse / misuse / overuse / wasteful use).
I started out using my old GTX 1080 on Thursday, could generate 512x512 just fine. That's in 8G of VRAM. It worked well on the hlky branch using webui (built using gradio).
Seeing that training etc. is much more memory intensive, and wanting to get faster results, I bought an RTX 3090, which has 24G of VRAM. However, it maxes out at about 1024x512, only twice as many pixels. Observing the card with GPU-Z, it never actually allocates more than 13.9G.
Using the lstein branch, I can't get above 896x512. Similarly, GPU-Z shows allocated VRAM never reaching 14G. The interface isn't as good as the webui on hlky either - never mind the web interface, a bigger problem is that it doesn't save all the parameters alongside generated images.
This is all running using Miniconda on Windows. On Linux it may be a different story, but my gaming PC is not dual-boot (yet).
Recommendations:
- Linux, with the display handled by the CPU's integrated graphics (and just...ditch miniconda please)
- Use a lower FP precision mode if available, to use the tensor cores (and to double "effective" memory); see the sketch below
- Batch things!
- I don't know what the max resolution of the diffusion network is; you may have to simply tile it past a certain point (with overlap, please! ;P)
Hope that helps somewhat. A 3090 should be more than enough for what you're doing, I'm stuck with P100s at best for me! (Cost :'( )
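To make the precision and batching suggestions concrete, here's a minimal sketch assuming the Hugging Face diffusers StableDiffusionPipeline (not the hlky/lstein forks themselves; the model ID and prompts are illustrative, and depending on your diffusers version you may also need an auth token / fp16 revision):

    import torch
    from diffusers import StableDiffusionPipeline

    # Half-precision weights: roughly halves VRAM use and lets the tensor cores do the work.
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",   # assumed model id
        torch_dtype=torch.float16,
    ).to("cuda")

    # Batch several prompts into one call instead of generating images one at a time.
    prompts = ["a watercolor painting of a fox"] * 4
    images = pipe(prompts, height=512, width=512, num_inference_steps=50).images
    for i, img in enumerate(images):
        img.save(f"out_{i}.png")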
I ordered an NVIDIA Tesla K80 off eBay (and the power adapter... and the blower fan shroud, etc.) and intend to install it when it arrives around Thursday or Friday. I'm hoping that after I install the NVIDIA Linux datacenter drivers I'll be able to use the card with SD.
My only worry is that because the K80 is two GPUs on one board, it might only utilize one of them, with only 12 GB of VRAM instead of both chips and all 24 GB.
4992 CUDA cores and 24 GB VRAM would be a pretty decent SD accelerator for only $150.
I really want to put one of these in my Dell Precision workstation, and I share similar concerns. I have an older Quadro in it now and could use a proper upgrade, but I really want to wait for the 4000 series cards due in a few months.
Craft Computing on YouTube has the best information from what I have seen so far. I don't like watching YouTube videos for information like this, but I understand why creators have moved to this medium in general. Linux should be much easier to configure for using the K80 to capacity.
You'll be able to render but it won't be fast, those CUDA cores are ancient and VRAM speed is slow. Check the Stable Diffusion discord for more info, but I found these comments:
> one minute per 512x512 @ 50 steps
> 1m20s to run 50 ddims on 512x512 vs 2080 ti in 12 seconds
You'll have to run the optimized model as well, since you can't connect the 2x 12GB together.
The architecture of the K80 just doesn't allow memory sharing or pooling. Newer architectures (Pascal and above, I think?) can allow it on the right hardware, but it's all datacenter and workstation cards for that.
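If you want to confirm what you're actually working with once it's installed, a quick PyTorch sketch (the K80 should enumerate as two separate ~12 GB devices rather than one 24 GB pool):

    import torch

    # A K80 shows up as two CUDA devices; a model only uses the device it's placed on
    # unless you explicitly split the work yourself.
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i}  {props.name}  {props.total_memory / 2**30:.0f} GiB")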
Man, feel like I shoulda picked up the M40 instead then, since its 24 GB of VRAM is all on a single GPU. :(
Oh well, I've spent $90 on dinners and had far less fun than I will have with this video card when it arrives, so I can't say it was wasted money... and I can always just buy an M40 off eBay.
If RTX 3090 prices keep dropping through the floor, I may just bite the bullet and pick one up. I saw a ZOTAC on sale for $999 recently, which is $500 less than the launch MSRP of $1499 (which honestly is where it should have launched anyway... so far as I'm concerned, these cards only just now hit reasonable pricing).
I totally disagree with ditching miniconda. The Colab notebooks that make use of it have been super easy to run and modify. There is documentation everywhere and it's very easy to find on SO and Google. It's a joy to use and I really like it for all of my Python workloads. I think of it like a Python VM that just works wherever I place it... so far, I haven't been let down.
Miniconda is a pain as it introduces its own package build format that (IMHO) just isn't very good. It might have been an improvement on Python's binary packages when it was released, but nowadays the conda package format creates more problems than it solves.
> I think of it like a Python VM that just works wherever I place it.
That's called a virtualenv, which is a feature built into python. Miniconda is a thin wrapper around virtualenv (actual python packages) and the conda package format. If you're using an IDE it probably has virtualenv support baked in.
Personally I prefer to use python-poetry for managing virtual envs, but honestly just using the virtualenv command directly is not hard if you're already using conda from the CLI.
Well, for starters (not that proficient with PyCharm as I only use it for less-professional work), I have tried for a few days to get PyCharm to use the miniconda environment on Windows, to no avail. It at least works on the command line, so I get to play around with it. I'm going to spend more time tomorrow trying alternatives.
Just checking, are you also using the same GPU for rendering your desktop? If so then try switching over to your integrated GPU or the 1080 if it’s still attached so you can leave 100% of the 3090 available to the network.
That probably won't help much since OP said they aren't even using 14GB of VRAM. I have dual GPUs and use the 2nd one (3060 Ti with no monitors connected) for rendering, which is nice because I have the full 8GB free.
I was looking at GPU graphs and neglected my physical RAM. This machine only has 32G and I didn't notice I was hitting a ceiling on memory allocations too - I took the error message about GPU memory allocations at face value.
I bumped my system commit cap (increased paging file size) by 24G and now I can use all my VRAM.
Have you tried something like 1024x768? Going to full 1024x1024 would double your VRAM usage so I can see why that wouldn’t work.
For my uses, the real benefit of having more VRAM is that you can generate more images simultaneously. My 3080 can generate only one 512x512 in 7 seconds but three 384x384 in that same timeframe. It’s allowed me to generate grids of hundreds of images in just a few minutes.
1024x768 didn't work until I bumped my system commit cap (i.e. increased my paging file max size). I hadn't paid attention to system memory, this box only has 32G.
How have you configured PyTorch in the 'setup' section for your card? The hlky/webui (shout out to Altryne) is configured for lower-end GPUs that are memory-constrained. The knobs that need a twistin' on these DL models feel infinitesimal.
Maybe look into booting off a USB stick as a means to test this. I wouldn't be surprised if there were some kind of driver reservation in Windows causing this issue.
How it holds the cigarette with its little paw.
Ahem, I mean, it's technically interesting how the model correctly extrapolated how this would look...
Unfortunately as long as there are people who are easily triggered by this sort of thing (seems like they got a bit of a rise out of you) they'll continue in this fashion.
I'd like to confirm that this works on my RTX 2060 with 6 GB VRAM on Windows. I didn't make any modifications to the provided source code; faces are a little problematic.
I don't use anaconda, so I created a new venv with Python 3.10, installed the requirements as proposed, registered with Hugging Face, created the API key, and ran the provided source code.
Any way to improve the quality of the faces? Also, how could I tune the parameters a bit? (I'm not familiar with this AI stuff at all, I'm just a humble Python programmer.)
It's my understanding that v1.5 will be coming out in a few weeks; I recall that hands and faces will be better-trained in the new model. I'm about to try what you did (install requirements manually) to get it to run in PyCharm on Windows. Neither miniconda nor anaconda really worked for me after spending time trying to get it to pick up the dependencies.
I feel like I'm going insane. Everyone says 512x512 should work with 8gb but when I do it I get:
CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 8.00 GiB total capacity; 5.62 GiB already allocated; 0 bytes free; 5.74 GiB reserved in total by PyTorch)
any ideas? I have a 3060ti with 8gb vram...
with 448x448 I get:
CUDA out of memory. Tried to allocate 902.00 MiB (GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated; 0 bytes free; 6.86 GiB reserved in total by PyTorch)
I've been trying to get the basujindal fork to work, but it seems to be putting all the work on the CPU. I've been running the example txt2img prompt for 30 minutes now and it's still not finished. It has reserved 4 GB of memory from the GPU, but the GPU doesn't appear to be doing any work; only the CPU is.
I now did everything I could to constrain the memory usage of the original SD repo; I was finally able to get it to run, and it produced green squares as output :(
What I did:
- scripts/txt2img.py, in function load_model_from_config (line 63): changed model.cuda() to model.cuda().half()
- removed invisible watermarking
- reduced n_samples to 1
- reduced resolution to 256x256
- removed the NSFW filter
Just can't get it to work and it's not producing an error message or anything that I could debug it with.
Your model is overflowing/underflowing and generating NaNs. I got it working with the memory-optimised fork at increased resolution (multiples of 32, 384x384) and full precision while keeping it within 4 GB.
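For what it's worth, the fp16 overflow is easy to see in isolation (a toy illustration, not the actual SD code path):

    import torch

    x = torch.tensor([70000.0])
    print(x.float())             # 70000.0 is fine in fp32
    print(x.half())              # inf, since fp16's largest finite value is ~65504
    print(x.half() - x.half())   # nan; NaNs downstream are what render as blank/green images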
Which is so silly since ML models should be the most portable thing in the world. It's just a series of math operations, not a bunch of OS/hardware specific API calls or something like that. We should be at a stage where each ML model is boiled down to a simple executable with zero dependencies at this point.
Agree 100% and I spend a fair amount of time wondering why this hasn't happened. I built piet-gpu-hal because I couldn't find any abstraction layer over compute shaders that supports precompiled shaders. A motivated person absolutely could write shaders to do all the operations needed by Stable Diffusion, and ship a binary in the megabyte range (obviously not counting the models themselves). That would support Metal, Vulkan, and D3D12. The only thing holding this back is a will to build it.
This is the part that tensorflow is really good at, while just about everything else lags behind. The tf saved model is the graph plus weights, and is super easy to just load up and run. (Also, tflite for mobile...)
But one of the tricky parts with stable diffusion is that people are trying to get it to run on lighter hardware, which is basically another engineering problem where simple apis typically won't expose the kind of internals people want to mess around with.
My laptop takes about 6 seconds per iteration so it's significantly slower, but if you're willing to wait I bet you'll have a much easier time plugging more RAM into your system than adding VRAM.
I've been running it fine on my 3060 Ti, then again I don't have any monitors connected so the full 8GB is free. Check VRAM usage, I'm guessing you don't have 8GB free, more like 5-6GB, since you have monitors connected.
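A quick way to check, assuming a reasonably recent PyTorch:

    import torch

    # Reports (free, total) bytes on the current CUDA device; the desktop compositor,
    # browsers, etc. can easily be holding a couple of GB.
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"free: {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")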
Also, you could try Visions of Chaos and use the Mode > Machine Learning > Text-to-Image > Stable Diffusion. It also has tons of other AI tools e.g. image-to-text captioning, diffusion model training, mandelbrot, music, and a ton more. The dev(s) push out updates almost every day.
Warning: You will first need to go through the 12 steps of Machine Learning setup[0], then it will download 300-400 GB of models, since it has scripts for pretty much every latent diffusion model out there. Some of them, e.g. Disco Diffusion, I find still give more interesting results, and you can get much higher res on a 3060 Ti, plus you have a TON more parameters to play with, not to mention you can train your own models and load those in (which I've been doing the past few weeks using my photography to get away from using unlicensed imagery :)
Oh sorry, I guess I need to mention that you need to put the text encoder on the CPU (or precompute the text embedding somehow). (I'm using a custom codebase to make that possible; I don't know how trivial that is to achieve with StableDiffusionPipeline.) Only the unet and vae should be on the GPU.
For your case with 8 GB you shouldn't need to do either of those things (run it all on the GPU), just make sure you have batch size 1 and are using the fp16 version.
On my 3070 I get that error unless I set my batch size to 1. My typical setup is to do six batches of one and it works fine (although I minimize the number of visible things on my screen while it's running). This reliably produces one image every 7-8 seconds.
For some reason — no idea why — this problem went away when I set n_samples to 1 and scale to 10.0 or less. Why these parameters would impact memory usage, I don’t know, but the image quality seems fine, afaict.
n_samples is the batching number. Total memory used scales like "Model Mem Size + n_samples * Batch Mem Size". The memory needed for a batch is smaller than the model but not trivial.
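Rough back-of-the-envelope with made-up numbers (illustrative only, not measured):

    model_mem_gib = 4.0    # fixed cost of the weights (assumed)
    per_sample_gib = 1.5   # activation memory per image in the batch (assumed)

    for n_samples in (1, 2, 4, 8):
        total = model_mem_gib + n_samples * per_sample_gib
        print(f"n_samples={n_samples}: ~{total:.1f} GiB")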
Not an Apple guy, but I think an Apple M chip will run at ⅓ the speed of a top-end RTX GPU; however, it uses system memory, so it can easily have 32 GB or 64 GB. That's pretty compelling, and if this is really a new class of application, Nvidia is going to have to think about more memory for mainstream-ish products.
Btw that's the kind of perf I see on my M1, but I keep seeing "0.00G VRAM used" for each generation. I wonder what that's about. In Activity Monitor I do see the GPU being used.
So go here, turn off the safety filter and you can search to see what SD was trained on. I suspect that if you actually want the bush tit bird and donkeys, you'll want to use that instead.
Yes I followed one recently, though it uses conda. The SD script runs in a conda environment, so when you uninstall conda your system is preserved and hasn't been stomped on.
Bandwidth of dual channel DDR4-3600: 48 GB/s
Bandwidth of PCIe 4 x16: 26 GB/s
Bandwidth of 3090 GDDR6X memory: 935.8 GB/s
Since neural network evaluation is usually bandwidth limited, it's possible that pushing the data through PCI-E from CPU to GPU is actually slower than doing the evaluation on CPU only for typical neural networks.
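Back-of-the-envelope using the figures above, and assuming ~4 GB of fp16 weights (an assumption, not a measurement): each evaluation has to read the weights at least once, so the available bandwidth sets a floor on step time.

    weights_gb = 4.0       # assumed fp16 model size
    links = {
        "dual-channel DDR4-3600": 48.0,   # GB/s, figure quoted above
        "PCIe 4.0 x16": 26.0,             # GB/s, figure quoted above
        "3090 GDDR6X": 935.8,             # GB/s, figure quoted above
    }
    for name, gbps in links.items():
        print(f"{name}: one pass over the weights ~ {weights_gb / gbps * 1000:.0f} ms")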
I once tried to start Firefox (back in the 2.5-3.0 days >:D) on a Celeron with 64MB RAM.
It worked perfectly fine, with the sole exception that the HDD LED was on solid the whole time, a single window took just over a literal half an hour to open, and loading a webpage took about 1-2 minutes.
You kind of can - projects like deepspeed (https://www.deepspeed.ai/) enable running a model that is larger than VRAM through various tricks like moving weights from regular system RAM into VRAM between layers. It can come with a performance hit though, depending on the model, of course.
For training you can often divide the batch size by n (and then only apply the backprop gradient update after every n batches for it to be mathematically equivalent). At a cost of speed, though.
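A minimal PyTorch sketch of that gradient-accumulation pattern (toy model and random data, just to show the shape of it):

    import torch

    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    accum_steps = 4  # effective batch = 4 x micro-batch size

    opt.zero_grad()
    for step in range(16):
        x, y = torch.randn(2, 8), torch.randn(2, 1)    # small micro-batch
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()                # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            opt.step()                                 # one update per accum_steps micro-batches
            opt.zero_grad()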
All I keep thinking is: how can I make money off of this? Ahh, the power of open source. Right now, my thinking is that it's just going to cut costs (sorry artists) in existing workflows, maybe change some endeavors from red to black profit margins. More likely, SD will be used as a basis for more specialized content training.
My work uses a monorepo without precise dependency tracking (Bazel or similar) so every single diff builds everything and runs a ton of tests. About 6 kWh of electricity per diff. Even for typos.
It's unfortunate that this article doesn't specify the amount of VRAM needed, other than saying it's "less than 10Gb". I have 6.1 GB of VRAM and I tried to follow the article until eventually encountering an "unable to allocate memory" error. (I'm now trying to run basujindal's repo as an alternative.)
Reduce the resolution and run with half-precision instead of full-precision and you should be able to avoid OOM errors. Author seems to have had 8GB VRAM available, so I'm guessing that's the "minimum required" for their solution.
It's not possible to halve the precision further. The precision was already dropped from float32 to float16 in the OP.
I used parameters to drop the resolution to 256x256, and now it's running, but it's somehow broken. Every output image it produces is literally a green square.
The green square issue has been well known, particularly on AMD cards, and I believe the solution is... full precision :c
But idk, I haven't had that issue.
My issue's that I can run it in <4GB VRAM, but I can only do a couple dozen images before some memory leak or something drives it out of memory (affects my 2070S too, but only after many more images). Restarting it isn't too bad, but it's enough to have me looking at using either of two AMD APUs that I have on hand.
I've been running Stable Diffusion on my M1 Macbook since the thread a few days ago about doing just that.
I am comically bad at getting it to generate what I want. e.g. "A furry watermelon" or "A dog flexing its biceps" just generates normal watermelons and normal dogs most of the time.
I have this running on my fairly mundane Radeon 5600XT at about 1 minute per image generated (under rootless podman, which is the really cool news to me), which isn't bad all things considered. Definitely get some interesting sounds from coil whine when it's going.
I believe I saw a repo that was doing exactly that. They also included a step at the end to reintegrate the results better.
I was also able to use the basic scripts to generate a few samples, pick one I liked, then used inpaint to expand the photo, masking out the original input so it wouldn't be altered.
I totally understand the frustration. Hop on the Conda train and don't look back. There is no performance penalty from using Conda for the boring stuff. The only thing it will cost you is more disk space. Otherwise, it's an absolute joy to use. You know where everything is if you want to inspect packages, bin files, wheels, etc. It seems like chasing your tail when you install these things from apt, git, curl, pip and brew/choco. I want to see where everything has come from and where it is going on my system; Conda gives me that in spades.
bfloat16 would indeed be nice. It's supported on a wide range of hardware (basically all mid-range to high-end Intel CPUs since 2013, AMD MI5 and up compute cards, ARM NEON, and NVIDIA cards since Pascal [10-series, 2016!]).
It could speed up calculations and significantly reduce memory requirements. I'd expect slightly worse results, though.
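The range difference is easy to see with a toy value (a PyTorch illustration):

    import torch

    x = torch.tensor([1e6])
    print(x.half())      # inf: fp16's largest finite value is ~65504
    print(x.bfloat16())  # finite (just rounded), since bfloat16 keeps fp32's exponent range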
Yeah, I don't think that's an Arch Linux problem. I had similar problems on Windows, and one version of the project was even supposedly set up to run in Docker. What is the point of setting up Docker if the whole setup and build process is not turnkey?
Seems like all of these projects are broken until you speak the shibboleth by guessing at random Python incantations. By this point, it's starting to feel intentional, like a way to mark you as part of an in-crowd, not an "L-User".
Unfortunately, I don't remember what I did. I did eventually get SD to work (though not in Docker, just as a normal Python project). If I had been sober at the time, I probably would have given up. I do know you need Python no newer than 3.9.
Yeah, I don't really get why Python developers tolerate this state of being where minor point releases are so commonly incompatible with each other, or having no standard way to manage building against different versions. I also don't get why ML developers continue with Python, considering these issues.
But that's the point of the Miniconda dependency. By using Conda, the project can be set up locally with the Python version it expects without clobbering your local, system-level install.
It was still a bit of a pain to learn how to use Conda, but it worked out a little better than figuring it out on my own.
Pretty easy actually. Just install miniconda from here [1]. It'll add some code to your bashrc / zshrc, so you'll need to reopen your terminal after installation.
[0]: https://github.com/altryne/sd-webui-colab