32GB M1 Max is taking 25 seconds on the exact same prompt as in the example.
Edit: it seems the "per second" requires the `--continuous` flag to bypass the initial startup time. With that, I'm now seeing the ~1 second per image time (if initial startup time is ignored).
I’m probably missing something but if the bottleneck is disk read speed, wouldn’t it only take about 5-6 seconds to fill the entire 32GB memory from disk? I just googled and found a benchmark quoting 5,507 MB/s read on an M1 Max.
Every time I execute: python main.py \
"a beautiful apple floating in outer space, like a planet" \
--steps 4 --width 512 --height 512
it re-downloads 4 gigs worth of stuff on every execution. Can't you have the script save the files, check if they're there, and only download if they're missing, or am I doing something wrong?
For me it does not re-download anything on the second run. But it is also only running on the CPU and is slow AF.
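If it helps: assuming main.py loads the model through diffusers, the download should land in the Hugging Face cache and be reused on later runs. A minimal sketch, not the repo's actual code (the model id and cache path here are assumptions):

```python
# Sketch only: with diffusers, from_pretrained() caches downloaded weights on
# disk (default ~/.cache/huggingface/hub) and reuses them on the next run.
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",   # assumed model id; use whatever main.py loads
    cache_dir="./models",             # optional; omit to use the default cache location
)
```

If it really is re-downloading every time, check that the default cache directory is writable, or point the HF_HOME environment variable somewhere that is.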
With 5 iterations the quality is...not good. It looks just like Stable Diffusion with low iteration count. Maybe there is some magic that kicks in if you have a more powerful Mac?
This is awesome! It only takes a few minutes to get installed and running. On my M2 mac, it generates sequential images in about a second when using the continuous flag. For a single image, it takes about 20 seconds to generate due to the initial script loading time (loading the model into memory?).
I know what I'll be doing this weekend... generating artwork for my 9 yo kid's video game in Game Maker Studio!
Does anyone know any quick hacks to the python code to sequentially prompt the user for input without purging the model from memory?
Answered my own question. Here's how to add an --interactive flag to the script to continuously ask for prompts and generate images without needing to reload the model into memory each time.
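In case it's useful to anyone else, here is roughly the shape of it as a minimal sketch, not the actual patch; the model id, device, variable names, and output path are all assumptions:

```python
# Hypothetical --interactive mode: load the pipeline once, then loop over
# prompts from stdin so the model stays in memory between generations.
import argparse
from diffusers import DiffusionPipeline

parser = argparse.ArgumentParser()
parser.add_argument("prompt", nargs="?", default=None)
parser.add_argument("--interactive", action="store_true")
parser.add_argument("--steps", type=int, default=8)
args = parser.parse_args()

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7").to("mps")  # assumed

def generate(prompt, i=0):
    image = pipe(prompt, num_inference_steps=args.steps).images[0]
    image.save(f"output_{i}.png")

if args.interactive:
    i = 0
    while True:
        prompt = input("prompt> ").strip()
        if not prompt:
            break
        generate(prompt, i)
        i += 1
elif args.prompt:
    generate(args.prompt)
```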
I've got a 500Mb wifi connection. it took me less than 5 minutes from git clone to having my first image (I did have python installed already, though).
They likely won't show any generative software until the next macOS version comes out; they don't usually showcase standalone features without a bigger strategy for tying them into the OS.
* on line 17 of `main.py` change `torch.float32` to `torch.float16` and change `mps:0` to `cuda:0`
* add a new line after line 17: `model.enable_xformers_memory_efficient_attention()`
The xFormers step is optional, but it should make it a bit faster; a rough sketch of the resulting setup is below.
For me this got it generating images in less than a second [00:00<00:00, 9.43it/s] and used 4.6GB of VRAM.
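For reference, here's roughly what those two edits amount to in diffusers terms, assuming the script loads the pipeline into a variable called `model` (the model id is an assumption):

```python
# Sketch of the CUDA + fp16 + xFormers setup described above; the last call
# requires the xformers package to be installed.
import torch
from diffusers import DiffusionPipeline

model = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",   # assumed model id
    torch_dtype=torch.float16,         # was torch.float32
).to("cuda:0")                         # was "mps:0"

model.enable_xformers_memory_efficient_attention()  # optional speed-up
```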
Mac shortcuts are exactly the use case for this. Menu bar, ask for a prompt, run script. I was always wary of shortcuts, but they're quite powerful and nicely integrated with the OS in the latest versions
What will be possible to do once these things run at interactive frame rates? It’s a little mind boggling to think about what types of experiences this will allow not so long from now.
Yeah, high-res performance is very non-linear, especially without swapping out the attention for xFormers, FlashAttention-2, or torch SDPA (and I don't think torch MPS works with any of those).
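For what it's worth, on CUDA with PyTorch 2.x you can ask diffusers to use torch's scaled-dot-product attention explicitly. A sketch, assuming `pipe` is an already-loaded pipeline (newer diffusers versions may already default to this):

```python
# Swap the UNet's attention processors for torch SDPA (needs PyTorch >= 2.0).
from diffusers.models.attention_processor import AttnProcessor2_0

pipe.unet.set_attn_processor(AttnProcessor2_0())
```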
True, not the best quality, but still fantastic results for a free model running locally on a laptop. Setting the steps between 10-20 seemed to produce the best results for me for realistic-looking images. About one out of 10 images was useful for my test case of "a realistic photo of a german shepard riding a motorcycle through Tokyo at night"
Good point. I haven't done a lot of testing yet. I'm not sure if the default of 8 steps yields poorer results than 10-20 steps. Either way, it was fast on my M2 mac with 8 to 20 steps, much faster than other models I've played with.
Was gonna comment the same thing; it feels ridiculous to include it here for local use. I believe you should be able to remove it if you edit the Python inference code from Hugging Face.
On my M1 MacBook, I did a test of 10 images, including the one-off loading time. With the safety checker: 10.51s; without: 9.48s. So not that big of a hit.
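If anyone wants to try skipping it, a common way with diffusers is to drop the checker when loading the pipeline. Sketch only; the model id is assumed and the repo's script may already expose a flag for this:

```python
# Load the pipeline without the safety checker (standard diffusers pattern).
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",   # assumed model id
    safety_checker=None,
    requires_safety_checker=False,
)
# or, on an already-loaded pipeline:
# pipe.safety_checker = None
```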
I've got a Windows laptop with an RTX 3080 in it that runs this model no problem. I don't have it to hand or else I'd post some timings.
On my desktop PC with a 4090 in it, I was getting speeds of 0.2 to 0.3 seconds for reasonably acceptable quality settings, so I would expect 0.5s or so on the laptop.
What Apple are ahead on is doing this on a fanless laptop that doesn't hit internal temperatures of triple digits.
> What Apple are ahead on is doing this on a fanless laptop that doesn't hit internal temperatures of triple digits.
You also forgot the bit where Apple are ahead on doing it on a laptop that can hit that performance without needing to be tethered to a power socket.
> You also forgot the bit where Apple are ahead on doing it on a laptop that can hit that performance without needing to be tethered to a power socket.
Kind of sad that a huge anti-competitive, trillion dollar company is the one offering it. Especially given their stances around user freedom.
I'd much rather innovation be distributed; the goalposts should move to a point where everyone is pushing towards the next thing. Having Apple be the only game in town is unhealthy.
45it/s (~0.1s per image) on a 7900 XTX here, so a discrete GPU is still an order of magnitude faster, at a lot higher power draw than the Macs. Being only 10x slower while untethered is quite a nice outcome.
> What Apple are ahead on is doing this on a fanless laptop that doesn't hit internal temperatures of triple digits.
I think you could pull this off on an Asus G14 in an ultra power-saver mode, with the fans off or running inaudibly. The cooling is so beefy that they will actually work fanless if you throttle everything down and mostly keep the GPU asleep.
The M chips could certainly sustain image generation better without a fan.
At this point, what Apple is ahead on is the hype that M-series Macs are that fast, and developers targeting them because things just work. Plenty of people should be able to run these models locally, but there's close to no nice software that does that out of the box for Windows or Linux.
Not sure why you think it's limited to M series Macs or has to do anything with Apple at all. It's just an instruction on how to run a diffusion model trained in a novel way on particular hardware.
It's possible to do on non-Apple Silicon Macs, just more annoying. There are a few generative AI implementations which use raw Metal but not sure what the most popular one is.
The implementation is not even optimized for Macs. LCM is just very easy to make fast (batch size = 1 and only 2 to 8 steps, depending on what kind of headline you are trying to make).
They also have a decent advantage for LLMs because of the high bandwidth to their unified system memory, versus GPUs whose limited VRAM has to reach system memory over PCIe.