For those without a GPU / not a powerful enough one / wanting to use SD on the go, you can start the hlky stable diffusion webui (yes, web ui) in Google Colab with this notebook[0].
It's simple and it works, using colab for processing but actually giving you a URL (ngrok-style) to open the pretty web ui in your browser.
I've been using that on-the-go when not at my PC and it's been working very well for me (after trying numerous other colab-dedicated repos, trying to fix them, and failing).
Additionally, you can have all your generated images sync to Google Drive automatically.
Google is paying, and yes, you can use it for free, but they will disconnect you after a while. And if you abuse it too much, you won't be able to use it until the following day...
You can also buy Colab Pro and Colab Pro+, which have fewer limitations and faster GPUs.
In my usage Colab and Colab Pro were similar, with plain Colab occasionally OOMing during model loading. That said, I've actually been seeing slower times on Colab than yours, and I think slower than on my RTX 3080: ~15 secs per image. I'm not sure why, though.
You are much better off running it locally at those speeds. A P100 does 13 to 33 seconds a batch in my experience. Cloud-to-cloud data transfer (Hugging Face to Colab) is ridiculously fast tho.
I'm on Colab Pro and get about 3 steps per second when generating a single 512x512 image at a time, with slight throughput improvement when I batch 2-3 images
Yup, totally free (with a Google account). It's run as a learning resource and there's an upsell to Colab Pro and Colab Pro+, but for running Stable Diffusion it makes it very easy to get started!
I think it's less of an upsell and more of a data collection and market positioning play. Google would like to, for example:
1. Be central in the machine learning ecosystem. This has broad ripple effects, such as recruiting.
2. Track how you use machine learning. This can be used for everything from understanding trends in machine learning to, again, robustly identifying individuals for recruiting efforts.
The cost is nominal at Google scale for what Google is getting. I suspect the pricing for the higher-end tiers is less a money-making scheme and more an acknowledgment that, at some point, free is no longer sustainable (and if unlimited compute were free, it would be prone to abuse / misuse / overuse / wasteful use).
I started out using my old GTX 1080 on Thursday, could generate 512x512 just fine. That's in 8G of VRAM. It worked well on the hlky branch using webui (built using gradio).
Seeing that training etc. is much more memory intensive, and wanting to get faster results, I bought an RTX 3090, which has 24G of VRAM. However, it maxes out at about 1024x512, only twice as many pixels. Observing the card with GPU-Z, it never actually allocates more than 13.9G.
Using the lstein branch, I can't get above 896x512. Similarly, GPU-Z shows allocated VRAM never reaching 14G. The interface isn't as good as the webui on hlky either - never mind the web interface, a bigger problem is that it doesn't save all the parameters alongside generated images.
This is all running using Miniconda on Windows. On Linux it may be a different story, but my gaming PC is not dual-boot (yet).
Recommendations:
- Linux, with the display handled by the CPU's integrated graphics (and just...ditch miniconda please)
- Use a lower FP precision mode if available, to use the tensor cores (and to double "effective" memory); see the sketch below
- Batch things!
- I don't know what the max resolution of the diffusion network is; you may have to simply tile it past a certain point (with overlap, please! ;P)
Hope that helps somewhat. A 3090 should be more than enough for what you're doing, I'm stuck with P100s at best for me! (Cost :'( )
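To make the precision and batching suggestions concrete, here's a minimal sketch assuming the Hugging Face diffusers StableDiffusionPipeline (not the hlky/lstein forks themselves; the model ID and prompts are illustrative, and depending on your diffusers version you may also need an auth token / fp16 revision):

    import torch
    from diffusers import StableDiffusionPipeline

    # Half-precision weights: roughly halves VRAM use and lets the tensor cores do the work.
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",   # assumed model id
        torch_dtype=torch.float16,
    ).to("cuda")

    # Batch several prompts into one call instead of generating images one at a time.
    prompts = ["a watercolor painting of a fox"] * 4
    images = pipe(prompts, height=512, width=512, num_inference_steps=50).images
    for i, img in enumerate(images):
        img.save(f"out_{i}.png")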
I ordered an NVIDIA Tesla K80 off eBay (and the power adapter... and the blower fan shroud, etc.) and intend to install it when it arrives around Thursday or Friday. I'm hoping that after I install the NVIDIA Linux datacenter drivers I'll be able to use the card with SD.
My only worry is that because the K80 is two GPUs on one board, it might only utilize one of them, with only 12 GB of VRAM instead of both chips and all 24 GB.
4992 CUDA cores and 24 GB VRAM would be a pretty decent SD accelerator for only $150.
I really want to put one of these in my Dell Precision workstation, and I share similar concerns. I have an older Quadro in it now and could use a proper upgrade, but I really want to wait for the 4000 series cards due in a few months.
Craft Computing on YouTube has the best information from what I have seen so far. I don't like watching YouTube videos for information like this, but I understand why creators have moved to this medium in general. Linux should be much easier to configure for using the K80 to capacity.
You'll be able to render but it won't be fast, those CUDA cores are ancient and VRAM speed is slow. Check the Stable Diffusion discord for more info, but I found these comments:
> one minute per 512x512 @ 50 steps
> 1m20s to run 50 ddims on 512x512 vs 2080 ti in 12 seconds
You'll have to run the optimized model as well, since you can't connect the 2x 12GB together.
The architecture of the K80 just doesn't allow memory sharing or pooling. Newer architectures (Pascal and above, I think?) can allow it on the right hardware, but it's all datacenter and workstation cards for that.
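If you want to confirm what you're actually working with once it's installed, a quick PyTorch sketch (the K80 should enumerate as two separate ~12 GB devices rather than one 24 GB pool):

    import torch

    # A K80 shows up as two CUDA devices; a model only uses the device it's placed on
    # unless you explicitly split the work yourself.
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i}  {props.name}  {props.total_memory / 2**30:.0f} GiB")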
Man, feel like I shoulda picked up the M40 instead then, since its 24 GB of VRAM is all on a single GPU. :(
Oh well, I've spent $90 on dinners and had far less fun than I will have with this video card when it arrives, so I can't say it was wasted money... and I can always just buy an M40 off eBay.
If RTX 3090 prices keep dropping through the floor, I may just bite the bullet and pick one up. I saw a ZOTAC on sale for $999 recently, which is $500 less than the launch MSRP of $1499 (which honestly is where it should have launched anyway... so far as I'm concerned, these cards only just now hit reasonable pricing).
I totally disagree with ditching miniconda. The Colab notebooks that make use of it have been super easy to run and modify. There is documentation everywhere and it's very easy to find on SO and Google. It's a joy to use and I really like it for all of my Python workloads. I think of it like a Python VM that just works wherever I place it... so far, I haven't been let down.
Miniconda is a pain as it introduces its own package build format that (IMHO) just isn't very good. It might have been an improvement on Python's binary packages when it was released, but nowadays the conda package format creates more problems than it solves.
> I think of it like a Python VM that just works wherever I place it.
That's called a virtualenv, which is a feature built into python. Miniconda is a thin wrapper around virtualenv (actual python packages) and the conda package format. If you're using an IDE it probably has virtualenv support baked in.
Personally I prefer to use python-poetry for managing virtual envs, but honestly just using the virtualenv command directly is not hard if you're already using conda from the CLI.
Well, for starters (not that proficient with PyCharm as I only use it for less-professional work), I have tried for a few days to get PyCharm to use the miniconda environment on Windows, to no avail. It at least works on the command line, so I get to play around with it. I'm going to spend more time tomorrow trying alternatives.
Just checking, are you also using the same GPU for rendering your desktop? If so then try switching over to your integrated GPU or the 1080 if it’s still attached so you can leave 100% of the 3090 available to the network.
That probably won't help much since OP said they aren't even using 14GB of VRAM. I have dual GPUs and use the 2nd one (3060 Ti with no monitors connected) for rendering, which is nice because I have the full 8GB free.
I was looking at GPU graphs and neglected my physical RAM. This machine only has 32G and I didn't notice I was hitting a ceiling on memory allocations too - I took the error message about GPU memory allocations at face value.
I bumped my system commit cap (increased paging file size) by 24G and now I can use all my VRAM.
Have you tried something like 1024x768? Going to full 1024x1024 would double your VRAM usage so I can see why that wouldn’t work.
For my uses, the real benefit of having more VRAM is that you can generate more images simultaneously. My 3080 can generate only one 512x512 in 7 seconds but three 384x384 in that same timeframe. It’s allowed me to generate grids of hundreds of images in just a few minutes.
1024x768 didn't work until I bumped my system commit cap (i.e. increased my paging file max size). I hadn't paid attention to system memory, this box only has 32G.
How have you configured PyTorch in the 'setup' section for your card? The hlky/webui (shout out to Altryne) is configured for lower-end GPUs that are memory-constrained. The knobs that need a twistin' on these DL models feel infinitesimal.
Maybe look into booting off a USB stick as a means to test this. I wouldn't be surprised if there were some kind of driver reservation in Windows causing this issue.
How it holds the cigarette with its little paw.
Ahem, I mean, it's technically interesting how the model correctly extrapolated how this would look...
Unfortunately as long as there are people who are easily triggered by this sort of thing (seems like they got a bit of a rise out of you) they'll continue in this fashion.
I'd like to confirm that this works on my RTX 2060 with 6 GB VRAM on Windows. I didn't make any modifications to the provided source code; faces are a little problematic.
I don't use anaconda, so I created a new venv with Python 3.10, installed the requirements as proposed, registered with Hugging Face, created the API key, and ran the provided source code.
Any way to improve the quality of the faces? Also, how could I tune the parameters a bit? (I'm not familiar with this AI stuff at all, I'm just a humble Python programmer.)
It's my understanding that v1.5 will be coming out in a few weeks; I recall that hands and faces will be better-trained in the new model. I'm about to try what you did (install requirements manually) to get it to run in PyCharm on Windows. Neither miniconda nor anaconda really worked for me after spending time trying to get it to pick up the dependencies.
I feel like I'm going insane. Everyone says 512x512 should work with 8gb but when I do it I get:
CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 8.00 GiB total capacity; 5.62 GiB already allocated; 0 bytes free; 5.74 GiB reserved in total by PyTorch)
any ideas? I have a 3060ti with 8gb vram...
with 448x448 I get:
CUDA out of memory. Tried to allocate 902.00 MiB (GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated; 0 bytes free; 6.86 GiB reserved in total by PyTorch)
I've been trying to get the basujindal fork to work, but it seems to be putting all the work on the CPU. I've been running the example txt2img prompt for 30 minutes now and it's still not finished. It has reserved 4 GB of memory from the GPU, but the GPU doesn't appear to be doing any work; only the CPU is.
I now did everything I could to constrain the memory usage of the original SD repo; I was finally able to get it to run, and it produced green squares as output :(
What I did:
- scripts/txt2img.py, in function load_model_from_config (line 63): changed model.cuda() to model.cuda().half()
- removed invisible watermarking
- reduced n_samples to 1
- reduced resolution to 256x256
- removed the NSFW filter
Just can't get it to work and it's not producing an error message or anything that I could debug it with.
Your model is overflowing/underflowing and generating NaNs. I got it working with the memory-optimised fork at increased resolution (multiples of 32, 384x384) and full precision while keeping it within 4 GB.
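For what it's worth, the fp16 overflow is easy to see in isolation (a toy illustration, not the actual SD code path):

    import torch

    x = torch.tensor([70000.0])
    print(x.float())             # 70000.0 is fine in fp32
    print(x.half())              # inf, since fp16's largest finite value is ~65504
    print(x.half() - x.half())   # nan; NaNs downstream are what render as blank/green images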
Which is so silly since ML models should be the most portable thing in the world. It's just a series of math operations, not a bunch of OS/hardware specific API calls or something like that. We should be at a stage where each ML model is boiled down to a simple executable with zero dependencies at this point.
Agree 100% and I spend a fair amount of time wondering why this hasn't happened. I built piet-gpu-hal because I couldn't find any abstraction layer over compute shaders that supports precompiled shaders. A motivated person absolutely could write shaders to do all the operations needed by Stable Diffusion, and ship a binary in the megabyte range (obviously not counting the models themselves). That would support Metal, Vulkan, and D3D12. The only thing holding this back is a will to build it.
This is the part that tensorflow is really good at, while just about everything else lags behind. The tf saved model is the graph plus weights, and is super easy to just load up and run. (Also, tflite for mobile...)
But one of the tricky parts with stable diffusion is that people are trying to get it to run on lighter hardware, which is basically another engineering problem where simple apis typically won't expose the kind of internals people want to mess around with.
My laptop takes about 6 seconds per iteration so it's significantly slower, but if you're willing to wait I bet you'll have a much easier time plugging more RAM into your system than adding VRAM.
I've been running it fine on my 3060 Ti, then again I don't have any monitors connected so the full 8GB is free. Check VRAM usage, I'm guessing you don't have 8GB free, more like 5-6GB, since you have monitors connected.
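A quick way to check, assuming a reasonably recent PyTorch:

    import torch

    # Reports (free, total) bytes on the current CUDA device; the desktop compositor,
    # browsers, etc. can easily be holding a couple of GB.
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"free: {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")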
Also, you could try Visions of Chaos and use the Mode > Machine Learning > Text-to-Image > Stable Diffusion. It also has tons of other AI tools e.g. image-to-text captioning, diffusion model training, mandelbrot, music, and a ton more. The dev(s) push out updates almost every day.
Warning: You will first need to go through the 12 steps of Machine Learning setup[0], then it will download 300-400 GB of models, since it has scripts for pretty much every latent diffusion model out there. Some of them, e.g. Disco Diffusion, I find still give more interesting results, and you can get much higher res on a 3060 Ti, plus you have a TON more parameters to play with, not to mention you can train your own models and load those in (which I've been doing the past few weeks using my photography to get away from using unlicensed imagery :)
Oh sorry, I guess I need to mention that you need to put the text encoder on the CPU (or precompute the text embedding somehow). (I'm using a custom codebase to make that possible; I don't know how trivial that is to achieve with StableDiffusionPipeline.) Only the unet and vae should be on the GPU.
For your case with 8 GB you shouldn't need to do either of those things (run it all on the GPU), just make sure you have batch size 1 and are using the fp16 version.
On my 3070 I get that error unless I set my batch size to 1. My typical setup is to do six batches of one and it works fine (although I minimize the number of visible things on my screen while it's running). This reliably produces one image every 7-8 seconds.
For some reason — no idea why — this problem went away when I set n_samples to 1 and scale to 10.0 or less. Why these parameters would impact memory usage, I don’t know, but the image quality seems fine, afaict.
n_samples is the batching number. Total memory used scales like "Model Mem Size + n_samples * Batch Mem Size". The memory needed for a batch is smaller than the model but not trivial.
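Rough back-of-the-envelope with made-up numbers (illustrative only, not measured):

    model_mem_gib = 4.0    # fixed cost of the weights (assumed)
    per_sample_gib = 1.5   # activation memory per image in the batch (assumed)

    for n_samples in (1, 2, 4, 8):
        total = model_mem_gib + n_samples * per_sample_gib
        print(f"n_samples={n_samples}: ~{total:.1f} GiB")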
Not an Apple guy, but I think an Apple M chip will run at ⅓ the speed of a top-end RTX GPU; however, it uses system memory, so it can easily have 32 GB or 64 GB. That's pretty compelling, and if this is really a new class of application, Nvidia is going to have to think about more memory for mainstream-ish products.
Btw that's the kind of perf I see on my M1, but I keep seeing "0.00G VRAM used" for each generation. I wonder what that's about. In Activity Monitor I do see the GPU being used.
So go here, turn off the safety filter and you can search to see what SD was trained on. I suspect that if you actually want the bush tit bird and donkeys, you'll want to use that instead.
Yes I followed one recently, though it uses conda. The SD script runs in a conda environment, so when you uninstall conda your system is preserved and hasn't been stomped on.
Bandwidth of dual channel DDR4-3600: 48 GB/s
Bandwidth of PCIe 4 x16: 26 GB/s
Bandwidth of 3090 GDDR6X memory: 935.8 GB/s
Since neural network evaluation is usually bandwidth limited, it's possible that pushing the data through PCI-E from CPU to GPU is actually slower than doing the evaluation on CPU only for typical neural networks.
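Back-of-the-envelope using the figures above, and assuming ~4 GB of fp16 weights (an assumption, not a measurement): each evaluation has to read the weights at least once, so the available bandwidth sets a floor on step time.

    weights_gb = 4.0       # assumed fp16 model size
    links = {
        "dual-channel DDR4-3600": 48.0,   # GB/s, figure quoted above
        "PCIe 4.0 x16": 26.0,             # GB/s, figure quoted above
        "3090 GDDR6X": 935.8,             # GB/s, figure quoted above
    }
    for name, gbps in links.items():
        print(f"{name}: one pass over the weights ~ {weights_gb / gbps * 1000:.0f} ms")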
I once tried to start Firefox (back in the 2.5-3.0 days >:D) on a Celeron with 64MB RAM.
It worked perfectly fine, with the sole exception that the HDD LED was on solid the whole time, a single window took just over a literal half an hour to open, and loading a webpage took about 1-2 minutes.
You kind of can - projects like deepspeed (https://www.deepspeed.ai/) enable running a model that is larger than VRAM through various tricks like moving weights from regular system RAM into VRAM between layers. It can come with a performance hit though, depending on the model, of course.
For training you can often divide the batch size by n (and then only apply the backprop gradient update after every n batches for it to be mathematically equivalent). At a cost of speed, though.
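A minimal PyTorch sketch of that gradient-accumulation pattern (toy model and random data, just to show the shape of it):

    import torch

    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    accum_steps = 4  # effective batch = 4 x micro-batch size

    opt.zero_grad()
    for step in range(16):
        x, y = torch.randn(2, 8), torch.randn(2, 1)    # small micro-batch
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()                # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            opt.step()                                 # one update per accum_steps micro-batches
            opt.zero_grad()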
All I keep thinking is: how can I make money off of this? Ahh, the power of open source. Right now, my thinking is that it's just going to cut costs (sorry artists) in existing workflows, maybe change some endeavors from red to black profit margins. More likely, SD will be used as a basis for more specialized content training.
My work uses a monorepo without precise dependency tracking (Bazel or similar) so every single diff builds everything and runs a ton of tests. About 6 kWh of electricity per diff. Even for typos.
It's unfortunate that this article doesn't specify the amount of VRAM needed, other than saying it's "less than 10Gb". I have 6.1 GB of VRAM and I tried to follow the article until eventually encountering an "unable to allocate memory" error. (I'm now trying to run basujindal's repo as an alternative.)
Reduce the resolution and run with half-precision instead of full-precision and you should be able to avoid OOM errors. Author seems to have had 8GB VRAM available, so I'm guessing that's the "minimum required" for their solution.
It's not possible to halve the precision further. The precision was already dropped from float32 to float16 in the OP.
I used parameters to drop the resolution to 256x256, and now it's running, but it's somehow broken. Every output image it produces is literally a green square.
The green square issue has been well known, particularly on AMD cards, and I believe the solution is... full precision :c
But idk, I haven't had that issue.
My issue's that I can run it in <4GB VRAM, but I can only do a couple dozen images before some memory leak or something drives it out of memory (affects my 2070S too, but only after many more images). Restarting it isn't too bad, but it's enough to have me looking at using either of two AMD APUs that I have on hand.
I've been running Stable Diffusion on my M1 Macbook since the thread a few days ago about doing just that.
I am comically bad at getting it to generate what I want. e.g. "A furry watermelon" or "A dog flexing its biceps" just generates normal watermelons and normal dogs most of the time.
I have this running on my fairly mundane Radeon 5600XT at about 1 minute per image generated (under rootless podman, which is the really cool news to me), which isn't bad all things considered. Definitely get some interesting sounds from coil whine when it's going.
I believe I saw a repo that was doing exactly that. They also included a step at the end to reintegrate the results better.
I was also able to use the basic scripts to generate a few samples, pick one I liked, then used inpaint to expand the photo, masking out the original input so it wouldn't be altered.
I totally understand the frustration. Hop on the Conda train and don't look back. There is no performance penalty from using Conda for the boring stuff. The only thing it will cost you is more disk space. Otherwise, it's an absolute joy to use. You know where everything is if you want to inspect packages, bin files, wheels, etc. It seems like chasing your tail when you install these things from apt, git, curl, pip and brew/choco. I want to see where everything has come from and where it is going on my system; Conda gives me that in spades.
bfloat16 would indeed be nice. It's supported on a wide range of hardware (basically all mid-range to high-end Intel CPUs since 2013, AMD MI5 and up compute cards, ARM NEON, and NVIDIA cards since Pascal [10-series, 2016!]).
It could speed up calculations and significantly reduce memory requirements. I'd expect slightly worse results, though.
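The range difference is easy to see with a toy value (a PyTorch illustration):

    import torch

    x = torch.tensor([1e6])
    print(x.half())      # inf: fp16's largest finite value is ~65504
    print(x.bfloat16())  # finite (just rounded), since bfloat16 keeps fp32's exponent range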
Yeah, I don't think that's an Arch Linux problem. I had similar problems on Windows, and one version of the project was even supposedly set up to run in Docker. What is the point of setting up Docker if the whole setup and build process is not turnkey?
Seems like all of these projects are broken until you speak the shibboleth by guessing at random Python incantations. By this point, it's starting to feel intentional, like a way to mark you as part of an in-crowd, not an "L-User".
Unfortunately, I don't remember what I did. I did eventually get SD to work (though not in Docker, just as a normal Python project). If I had been sober at the time, I probably would have given up. I do know you need Python no newer than 3.9.
Yeah, I don't really get why Python developers tolerate this state of being where minor point releases are so commonly incompatible with each other, or having no standard way to manage building against different versions. I also don't get why ML developers continue with Python, considering these issues.
But that's the point of the Miniconda dependency. By using Conda, the project can be set up locally with the Python version it expects without clobbering your local, system-level install.
It was still a bit of a pain to learn how to use Conda, but it worked out a little better than figuring it out on my own.
Pretty easy actually. Just install miniconda from here [1]. It'll add some code to your bashrc / zshrc, so you'll need to reopen your terminal after installation.
[0]: https://github.com/altryne/sd-webui-colab