Generate images in one second on your Mac using a latent consistency model (replicate.com)
215 points by bfirsh on Oct 27, 2023 | 71 comments



32GB M1 Max is taking 25 seconds on the exact same prompt as in the example.

Edit: it seems the "one second" claim requires the `--continuous` flag to bypass the initial startup time. With that, I'm now seeing ~1 second per image (once the startup time is excluded).


What does bypass startup time really do? Does it keep everything in memory or something?


Probably, you have to load the weights from disk at some point.


That's exactly it. These models are huge.


I’m probably missing something but if the bottleneck is disk read speed, wouldn’t it only take about 5-6 seconds to fill the entire 32GB memory from disk? I just googled and found a benchmark quoting 5,507 MB/s read on an M1 Max.


PyTorch checkpoint is slow to load.


The diffusers format this repo uses should be faster, but there is still some overhead, yeah.


Yeah, the PyTorch disk format is pretty bad.
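
If you're curious how much of the wait is just the load, a quick way to measure it (untested sketch; substitute whatever model ID main.py actually passes to from_pretrained, and older diffusers versions may also need custom_pipeline="latent_consistency_txt2img"):

    import time
    from diffusers import DiffusionPipeline

    start = time.time()
    # the first run downloads to the Hugging Face cache; later runs load from local disk
    pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")
    print(f"pipeline load took {time.time() - start:.1f}s")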


Every time I execute:

    python main.py \
      "a beautiful apple floating in outer space, like a planet" \
      --steps 4 --width 512 --height 512

it re-downloads 4 gigs worth of stuff. Can't the script save the files, check if they're already there, and only download them when needed, or am I doing something wrong?


For me it does not re-download anything on the second run. But it is also only running on the CPU and is slow AF.

With 5 iterations the quality is...not good. It looks just like Stable Diffusion with low iteration count. Maybe there is some magic that kicks in if you have a more powerful Mac?


Did you enable the virtualenv first? If not, it might not be caching the models properly.
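
Also worth checking where the Hugging Face cache is actually ending up; if each run resolves a different path, you'll re-download every time. A quick way to see it (small sketch):

    import os
    # huggingface_hub stores downloads under HF_HOME (default: ~/.cache/huggingface)
    print(os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface")))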


This is awesome! It only takes a few minutes to get installed and running. On my M2 mac, it generates sequential images in about a second when using the continuous flag. For a single image, it takes about 20 seconds to generate due to the initial script loading time (loading the model into memory?).

I know what I'll be doing this weekend... generating artwork for my 9 yo kid's video game in Game Maker Studio!

Does anyone know any quick hacks to the python code to sequentially prompt the user for input without purging the model from memory?


Answered my own question. Here's how to add an --interactive flag to the script to continuously ask for prompts and generate images without needing to reload the model into memory each time.

https://github.com/replicate/latent-consistency-model/commit...
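
The gist is just wrapping the generation call in an input() loop so the pipeline stays in memory (rough sketch of the idea, not the exact commit; it assumes the pipeline object is called `model` as elsewhere in main.py):

    # hypothetical interactive loop: load the pipeline once, then keep prompting
    while True:
        prompt = input("prompt> ").strip()
        if not prompt:
            break  # empty prompt exits
        result = model(prompt=prompt, width=512, height=512, num_inference_steps=4)
        result.images[0].save("output.png")
        print("wrote output.png")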


> It only takes a few minutes to get installed and running

A few minutes? I have to download at least 5GiB of data to get this running.


Lol. Yeah, I have 1.2Gb internet.


The stupid script seems to not know how to save to disk, so it downloads on every run.


I've got a 500Mb wifi connection. It took me less than 5 minutes from git clone to having my first image (I did have Python installed already, though).


Well, how do they look? I've seen some other image generation optimizations, but a lot of them make a significant tradeoff in reduced quality.


Interesting timing, because part of me thinks Apple's Scary Fast event has to do with generative AI.


I think the current rumors are MBPs, which would be odd (doing the Pros before the base models), but I wouldn't complain.


Not only odd because of that, but because it's less than a year since they got updated to M2 Pro/Max.


They likely won't show any generative software until the next macOS version comes out; they don't usually showcase standalone features without a bigger strategy that includes the OS.


If you want to run this on a Linux machine using the machine's CPU: follow the instructions, but before actually running the command to generate an image, open up main.py and change line 17 to

    model.to(torch_device="cpu", torch_dtype=torch.float32).to('cpu:0')

Basically, change the backend from mps to cpu.


For Linux CPU-only, you want https://github.com/rupeshs/fastsdcpu


It is very easy to tweak this to generate images quickly on an Nvidia GPU:

* after `pip install -r requirements.txt`, do `pip3 install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121`

* on line 17 of main.py, change torch.float32 to torch.float16 and change mps:0 to cuda:0

* add a new line after line 17: `model.enable_xformers_memory_efficient_attention()`

The xFormers stuff is optional, but it should make it a bit faster. For me this got it generating images in less than a second [00:00<00:00, 9.43it/s] and used 4.6GB of VRAM.
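
In other words, line 17 ends up looking something like this (sketch based on the mps line quoted elsewhere in the thread):

    # before, as shipped (per the thread):
    # model.to(torch_device="mps:0", torch_dtype=torch.float32)

    # after, for an Nvidia GPU:
    model.to(torch_device="cuda:0", torch_dtype=torch.float16)
    model.enable_xformers_memory_efficient_attention()  # optional; needs xformers installed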


This... but as a menu item that does it for you.


Mac Shortcuts are exactly the tool for this: menu bar item, ask for a prompt, run the script. I was always wary of Shortcuts, but they're quite powerful and nicely integrated with the OS in the latest versions.


GPT-4 can likely give you code for this.


What will be possible to do once these things run at interactive frame rates? It’s a little mind boggling to think about what types of experiences this will allow not so long from now.


Trippy VR is where my mind goes. Specifically with eye tracking to determine where to go and what to generate next.


Buy 60 machines, and it's interactive.


Alas you've mixed up throughput and latency.

But you might be able to generate at 15fps and interpolate between them or something.


TTS --> Prompt --> Generating live imagery from your rambles?


Thought it was too good to be true, tried it with an M2 Pro MacBook Pro.

Generation takes 20-40 seconds; when using "--continuous" it takes 20-40 seconds once and then keeps generating a new image every 3-5 seconds.


Without --continuous it vacates the memory again, which can be useful if you use the machine for other things.


Does anyone know of other image generation models that run well on a M1/M2 mac laptop?

I'd like to do some comparison testing. The model in the post is fast, but results are hit or miss for quality.


There are plenty of models to try with the Draw Things app. You can try SDXL on it to see what the quality looks like. The speed comparison is here: https://engineering.drawthings.ai/integrating-metal-flashatt...


Thanks!


https://github.com/lllyasviel/Fooocus#mac

It's not fast, but it's SOTA local quality as far as I know, and I've tried many UIs and augmentations.

Also, maybe it will run better if you grab PyTorch 2.1 or a nightly build.


It is fast, but only at 512x512 resolution: it will generate an image, from script start to finish, in 5 seconds. If you up it to 1024 it takes 10x as long.

This is on an M2 Max with 32GB.


Yeah, high-res performance is very non-linear, especially without swapping out the attention for xFormers, FlashAttention-2, or torch SDP (and I don't think torch MPS works with any of those).

That model doesn't work well at 1024x1024 anyway without some augmentations. You want this instead: https://huggingface.co/segmind/SSD-1B


Quality of these LCMs is not the best, though.


True, not the best quality, but still fantastic results for a free model running locally on a laptop. Setting the steps between 10-20 seemed to produce the best results for me for realistic-looking images. About one out of 10 images was useful for my test case of "a realistic photo of a german shepard riding a motorcycle through Tokyo at night".

https://github.com/simple10/ai-image-generator/blob/main/exa...
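
If you want to compare step counts yourself, a small sweep is easy enough (sketch; assumes the same pipeline object `model` from main.py):

    # hypothetical sweep over step counts to eyeball the quality difference
    prompt = "a realistic photo of a german shepard riding a motorcycle through Tokyo at night"
    for steps in (4, 8, 10, 20):
        result = model(prompt=prompt, width=512, height=512, num_inference_steps=steps)
        result.images[0].save(f"steps-{steps}.png")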


> Setting the steps between 10-20

But that's the point where regular diffusion (with the UniPC scheduler and FreeU) overtakes this in terms of quality.


Good point. I haven't done a lot of testing yet. I'm not sure if the default of 8 steps yields poorer results than 10-20 steps. Either way, it was fast on my M2 mac with 8 to 20 steps, much faster than other models I've played with.


The speed is impressive, but the output is honestly not. It feels like DALL-E 3 is light years ahead of it.


Maybe you could use this to get halfway there, then feed the image to DALL-E for enhancement?


Why bother with the safety checker if the model is running locally? I wonder how much faster it would be if the safety checks were skipped.


Was gonna comment the same thing; it feels ridiculous to include it here for local use. I believe you should be able to remove it if you edit the Python inference code from Hugging Face.

edit: I tried it out by copying this pipeline file locally and then disabling the safety checker. https://raw.githubusercontent.com/huggingface/diffusers/main...

On my M1 MacBook, I did a test of 10 images, including the one-off loading time. With the checker: 10.51s; without the safety checker: 9.48s. So not that big of a hit.


Not much faster tbh, but it's a bit of virtue signaling you're often required to do with generative AI.


I agree. It's pretty easy to bypass if you know a bit of Python, though.

Doing a search for "nsfw" in all subdirectories seems to turn up all the files you need to edit.


You only need to add two lines to the example main.py file to disable it; no need to go editing anything else.
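
Roughly, the idea is to replace the checker with a no-op right after the pipeline is constructed (sketch; exact lines and variable names may differ):

    # hypothetical no-op: keeps the checker's interface but skips the NSFW pass;
    # the real checker returns (images, has_nsfw_concept_flags)
    model.safety_checker = lambda images, **kwargs: (images, [False] * len(images))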


Seems like a waste of time; more of a nice-to-have / tip of the cap to yud.


Seeing this kind of image generation limited to M series Macs just goes to show how far ahead Apple is in the notebook GPU game.


I've got a Windows laptop with an RTX 3080 in it that runs this model no problem. I don't have it to hand or else I'd post some timings.

On my desktop PC with a 4090 in it, I was getting speeds of 0.2 to 0.3 seconds for reasonably acceptable quality settings, so I would expect 0.5s or so on the laptop.

What Apple are ahead on is doing this on a fanless laptop that doesn't hit internal temperatures of triple digits.


> What Apple are ahead on is doing this on a fanless laptop that doesn't hit internal temperatures of triple digits.

You also forgot the bit where Apple are ahead by doing it on a laptop that doesn't need to be tethered to a power socket to achieve that performance.


It's the same thing; power is heat when talking about chips.


> You also forgot the bit where Apple are ahead by doing it on a laptop that doesn't need to be tethered to a power socket to achieve that performance.

Kind of sad that a huge anti-competitive, trillion dollar company is the one offering it. Especially given their stances around user freedom.

I'd much rather innovation be distributed. The goal posts should be moved to a point everyone is pushing towards the next thing. Having Apple be the only game in town is unhealthy.


I'd say that rather than one company being the only one who can do it, there is only one company that can't do it, and it's Intel.


Ouch. But true.


45 it/s (~0.1s per image) on a 7900 XTX here, so it's still an order of magnitude faster on a GPU with a lot higher power draw than the Macs. Being only 10x slower while untethered is quite a nice outcome.


> What Apple are ahead on is doing this on a fanless laptop that doesn't hit internal temperatures of triple digits.

I think you could pull this off on an Asus G14 in an ultra power-saver mode, with the fans off or running inaudibly. The cooling is so beefy they will actually work fanless if you throttle everything down and mostly keep the GPU asleep.

The M chips could certainly sustain image generation better without a fan.


At this point what Apple is ahead with is the hype that M-series Macs are that fast, and the developers targeting them because things just work. Plenty of people should be able to run these models locally, but there's close to no nice software that does that out of the box for Windows or Linux.


It's because of the unified memory architecture. It's harder/different to do this on x86, because you have to have a large memory GPU and target that.


Not sure why you think it's limited to M series Macs or has to do anything with Apple at all. It's just an instruction on how to run a diffusion model trained in a novel way on particular hardware.


It's possible to do on non-Apple Silicon Macs, just more annoying. There are a few generative AI implementations which use raw Metal but not sure what the most popular one is.


The implementation is not even optimized for Macs. LCM is just very easy to make fast (batch size = 1 and only 2 to 8 steps, depending on what kind of headline you are trying to make).


They also have a decent advantage for LLMs because of their memory bandwidth to system memory, versus GPUs with limited VRAM that reach system memory over PCIe.


Got this working on an Intel Mac.


It mostly shows how shitty compatibility is between platforms that share the same roots.


Awesome




