I used this + Groq yesterday to augment (with a Chrome extension) the infinite fun game from Neal Agarwal, but generating actual images and not only emojis.
This feels like the future: near real-time image and LLM generation, using Mixtral from Groq as my prompt writer and the fal API for real-time image generation!
https://x.com/altryne/status/1760561501096575401?s=20
Idea: convert this into a side-scrolling game where the background gradually and seamlessly transitions into a rendering of the words we are dealing with as we progress.
I'm imagining the lush green landscape from the early parts of the demo slowly transforming into the dry, mountainous landscape from later images, while new characters appear in the foreground.
(I'd posted this comment incorrectly under the main HN post earlier, instead of as reply here. Too late to delete it apparently.)
I've had an idea for a Cards Against Humanity-style game, but using image generation instead: there's a central card for each round, you add something from your hand, and then pick from 5-10 generated images to submit.
Yep, this is using SDXL Lightning underneath, which was trained by ByteDance on top of Stable Diffusion XL and released as an open-source model. In addition to that, it is using our inference engine and real-time infrastructure to provide a smooth experience compared to other UIs out there (which, as far as I know, are not even comparable speed-wise: ~370ms for 4 steps here vs ~2-3 seconds in the Replicate link you posted).
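For anyone who wants to poke at the open model itself rather than the hosted demo, here's a minimal sketch of running the 4-step SDXL-Lightning UNet locally with diffusers, following ByteDance's model card; this is just the public checkpoint, not fal's inference engine, so don't expect the same latency:

```python
# Minimal sketch: running the open 4-step SDXL-Lightning UNet locally with
# diffusers, per ByteDance's model card. Public checkpoint only, not fal's
# inference engine or real-time infrastructure.
import torch
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, EulerDiscreteScheduler
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

base = "stabilityai/stable-diffusion-xl-base-1.0"
repo = "ByteDance/SDXL-Lightning"
ckpt = "sdxl_lightning_4step_unet.safetensors"  # distilled 4-step UNet

unet = UNet2DConditionModel.from_config(base, subfolder="unet").to("cuda", torch.float16)
unet.load_state_dict(load_file(hf_hub_download(repo, ckpt), device="cuda"))
pipe = StableDiffusionXLPipeline.from_pretrained(base, unet=unet, torch_dtype=torch.float16).to("cuda")

# Lightning expects a "trailing" timestep schedule and no classifier-free guidance.
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

image = pipe("baby raccoon wearing priest robes", num_inference_steps=4, guidance_scale=0).images[0]
image.save("raccoon.png")
```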
Any plans to make an API? I'm building a website to catalog fairly common objects, and could use images to spice it up. I was looking at pexels...but this is just so much better.
EDIT - ah you have one. You're welcome. Sign up here folks. :)
Couple of questions in that case:
a) What is the avg price per 512x512 image? Your pricing is in terms of machine resources, but (for my use case) I want a comparison to pexels.
b) What would the equivalent machine setup be to get inference to be as fast as the website demo?
c) Is the fast-sdxl api using the exact same stack as the website?
There's no hidden magic in the playground or in the demo app; we use the same API available to all customers, along with the same JS client and best practices documented in our docs.
For all your questions, I recommend playing with the API playground: you'll be able to test different image sizes and parameters and get an idea of the cost per inference.
If you have any other questions, say hello on our Discord and I'm happy to help you.
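For anyone who'd rather start from code than the playground, a call through fal's Python client looks roughly like the sketch below; the endpoint id, argument names, and output shape here are assumptions based on the docs and should be verified there:

```python
# Rough sketch of calling the hosted fast-sdxl endpoint with fal's Python
# client (`pip install fal-client`, FAL_KEY set in the environment). The
# endpoint id, argument names, and result shape are assumptions; check the
# API playground/docs for the exact schema and per-size pricing.
import fal_client

result = fal_client.subscribe(
    "fal-ai/fast-sdxl",
    arguments={
        "prompt": "baby raccoon wearing priest robes, photo, detailed",
        "image_size": "square_hd",        # assumed enum; smaller sizes are cheaper
        "num_inference_steps": 4,
    },
)
print(result["images"][0]["url"])  # hosted URL of the generated image
```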
As for the quality, I borrowed the query ([1]) that people used to test Stable Diffusion 3 and other models today: "Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right is a dog, on the left is a cat".
Spatial prompt adherence is a generally missing piece in SDXL (and previous versions of SD). Hoping that SD3 gets it into as good a shape as your examples!
_Really_ impressive demo, but it'd be oh-so-much-more impressive if it were smooth; right now, e.g., deleting a word or adding a space causes 4 inferences in quick succession, so it feels janky (EDIT: maybe intentional? The steps being displayed one by one?)
Btw, this is from fal.ai; I first heard of them when they posted a Stable Cascade demo the morning it was released.
They're *really* good, I *highly* recommend them for any inferencing you're doing outside OpenAI. Been in AI for going on 3 years, and on it 24/7 since last year.
Fal is the first service that sweats the details to get it running _this_ fast in practice, not just in papers: e.g. a WebSocket connection, short-lived JWTs to avoid having to go through an edge function to sign each request with an API key, etc.
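To make the token part concrete, here's a rough sketch of the general short-lived-token pattern, not fal's actual API; all names are illustrative. The backend keeps the API key and mints a JWT that expires within minutes, and the browser attaches that token directly to its WebSocket handshake, so no per-request signing hop is needed.

```python
# Illustration of the short-lived-token pattern described above (not fal's
# actual API). The backend keeps the long-lived API key and mints a JWT that
# expires quickly; the browser attaches the token to its WebSocket handshake.
# Requires `pip install pyjwt`.
import datetime
import jwt  # PyJWT

SIGNING_SECRET = "server-side-secret"  # hypothetical; never shipped to the client

def mint_realtime_token(user_id: str, ttl_seconds: int = 120) -> str:
    """Return a short-lived token the browser can present over the WebSocket."""
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": user_id,
        "scope": "realtime:inference",  # illustrative claim
        "iat": now,
        "exp": now + datetime.timedelta(seconds=ttl_seconds),
    }
    return jwt.encode(claims, SIGNING_SECRET, algorithm="HS256")

# The client would then connect to something like
#   wss://realtime.example.com/ws?token=<jwt>
# and the realtime server only verifies the signature and expiry:
token = mint_realtime_token("user-123")
print(jwt.decode(token, SIGNING_SECRET, algorithms=["HS256"]))
```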
Good point. If it’s this fast, maybe it should generate intermediate images along a smooth path through the latent space, rather than just jumping right to the target
It's sort of the inverse if I'm seeing it correctly: adding one character triggers one inference, but you see steps 1, 2, 3, and 4 of the inference
The latent-space framing became popular as a visual allegory, which accidentally muddied the technical term it originated from. There's nothing visually smooth about it; it's not a linear interpolation in 3D space, it's a chaotic journey through a 3-billion-dimensional space.
Well, it ends up being a journey through different images pulled from the same noise, so yes, any smoothness results more from the degree to which the sampling approach produces similar features when pulled towards slightly different target embeddings than from the images intrinsically being 'neighbors'.
These low-step approaches probably preserve a lot less of the 'noise' features in the final image so latent space cruising is probably less fun.
Sure, but it still has to result in a smooth interpolation. If the relation between latent and pixel space isn't continuous you're gonna have problems during learning.
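To make the "smooth path through latent space" idea concrete, here's a rough sketch of the usual trick: keep the prompt fixed and spherically interpolate between two initial noise latents, decoding a frame at each step. This uses a plain local diffusers SDXL pipeline, not the demo's stack, and is only a sketch of the technique being discussed.

```python
# Sketch of the "smooth path" idea: keep the prompt fixed and spherically
# interpolate (slerp) between two initial noise latents, decoding a frame at
# each step. Plain local diffusers SDXL, not the demo's stack.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two noise tensors of the same shape."""
    omega = torch.arccos(((a / a.norm()) * (b / b.norm())).sum().clamp(-1, 1))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

shape = (1, pipe.unet.config.in_channels, 1024 // 8, 1024 // 8)  # SDXL latent shape at 1024px
lat_a = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(1), device="cuda", dtype=torch.float16)
lat_b = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(2), device="cuda", dtype=torch.float16)

prompt = "baby raccoon wearing priest robes"
for i, t in enumerate(torch.linspace(0, 1, 8)):
    frame = pipe(prompt, latents=slerp(lat_a, lat_b, float(t)), num_inference_steps=8).images[0]
    frame.save(f"frame_{i:02d}.png")
```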
This and Groq were really surprising to see. I still remember, not too long ago, waiting around for ages just to get a messed up image generation from some site where you got like 20 generations for free with an account. The fact that we're at the point where you can just go to a website and get lightning-fast text and image generation without sign-ups or captcha solving is amazing. I didn't have this type of performance uplift in the cards for early 2024, especially to the extent that they (Groq and fal.ai) can afford to open up their demos completely.
Side note: In my opinion, the fast generations also make up for a lot of the shortcomings in image generation quality. I find that even if it messes up, a good result is usually just a seed or small prompt change away.
This is incredible. The reduction in latency has a profound effect in terms of the way I interact with this type of tool. The speed benefit is more than just more image generation. The speed here helps me easily keep the same train of thought moving along as I try different things.
Wow, this is super impressive, but does anybody know a way to generate consistent characters with Stable Diffusion?
What I mean is: if my first prompt is a girl talking to a cat and my second prompt is the girl playing with that cat, I want the girl and the cat to be the same in both pictures.
Is that possible? If so, any links or tutorials would be super helpful.
You can do this on Dashtoon Studio. They let you upload just one image and train a consistent character LoRA. It's software for AI comic creation. Found this video on their YouTube: https://www.youtube.com/watch?v=EEQwEvKQGvE
A LoRA is by far the most versatile approach, because you can get your character consistently in any pose and from any camera angle. IP-Adapter replicates too many traits from the input image, and you can't choose what not to replicate, like the pose, so getting a character from a portrait input to do anything else can become difficult. For ReActor you need a generated image into which you can swap a face; it works very well for realistic images, but for stylized images the style is not maintained, and the hairstyle won't get copied either.
So Dashtoon is the most reliable and easiest thing I've found so far, because collecting 20 images of a new character is hard, and the properties of the images in a LoRA training set really matter: how many close-ups, how many expressions, etc.
Check out https://scenario.gg - they let you train your own LoRAs on custom images of a character (you need around 20 or so images from different angles for good consistency). A bit simpler, and actually still pretty decent, is IP-Adapter, which they also support. Having the cat be consistent is going to be challenging without a custom LoRA, I reckon. See this for guidance: https://help.scenario.com/training-a-character-lora
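For reference, here's roughly what reusing a trained character LoRA looks like with diffusers; the directory, file name, and trigger word below are hypothetical stand-ins for whatever your LoRA trainer (Dashtoon, a scenario.gg export, kohya, etc.) produces.

```python
# Rough sketch of reusing a trained character LoRA across prompts with
# diffusers. The directory, file name, and trigger word ("mychar") are
# hypothetical stand-ins for whatever your LoRA trainer produces.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./loras", weight_name="my_character_lora.safetensors")

# The same trigger word keeps the character consistent across scenes.
scenes = ["a girl talking to a cat", "the same girl playing with that cat"]
for i, scene in enumerate(scenes):
    image = pipe(f"mychar, {scene}, illustration", num_inference_steps=30).images[0]
    image.save(f"scene_{i}.png")
```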
It's curious what it does with single characters. For me it often settles on a small, rather detailed building. The more I repeat the character (e.g., "111" vs "11111111"), the odder the building gets, which I can now see is pretty sensitive to the seed.
A word or a concept that is unknown has simply no impact on the output. Try to replace "baby raccoon" with "maxolhx" in the prompt, and it will ignore the word and render an Italian priest instead. Strictly speaking it still has an impact, but nothing we could easily describe. You're pretty much just playing with the seed.
Love it! I have to log off, but I should let you know that the generation seems to be different depending on whether you arrow up or arrow down into the seed when the focus is on the seed input (i.e. going up from 5 to 6 will give a different result than going down from 7 to 6).
This is an important thing, but not a consistent thing between humans. For example, while I can tell half of these are AI (and not just because I typed in the prompt), they have very different looks and feels to me:
• https://fastsdxl.ai/share/6djh0dlat0s6 "Will Smith facing a white plastic robot, close up, side view, renaissance masterpiece oil painting by da Vinci"
But there are many others like yourself who apparently have higher standards than I do.
(And there are also many who have lower standards than me, who were happy to print huge posters where the left and right eyes of the subject didn't match.)
What is the difference between SDXL Turbo (released last November) and Lightning (released two days ago)? I haven't seen any discussion of Lightning on here the last few days, and HN search only shows a few posts with no comments.
OK - I found the Lightning paper, and it seems the difference (to an end user) is that Turbo maxes out at 512px (with no LoRA support) while Lightning does 1024px (with LoRA support). The example images in the paper are also subjectively better in quality and composition, but appear more stylised to my eye.
This is a great idea. Being able to scan through previous images would encourage freer concept exploration, since you could then jump back to your favoured branch-off point easily.
Wow! This is really fast. This is actually the speed I want image generation to be at. It makes me want to explore and use the tool. Everything else has felt like begrudgingly throwing in a prompt and eventually giving up because the iteration cycle is too slow.
Of course the quality of what is being generated is not competitive with SOTA, but this is going in a really good direction!
Looking forward to it once it is refined. It still creates some strange artifacts: in the demo image from the HN post of the baby raccoon in priest robes, one can count five paws.
This is so quick! The demo being publicly consumable is powerful, though SDXL is not immune to abuse or NSFW (and possibly illegal) generations. I wonder who ends up being held accountable for such generations.
That's a strange question. Why is nudity something that an image generator shouldn't be able to create? Are genitals a fiction that should never be present in human art?
Oh indeed. I believe these models should be uncensored. But as we’ve seen with the latest SD model, and with the much more locked down ‘LLM-fronted’ image generators from OpenAI and Google, safety is a massive concern and so they’ve been ridiculously cautious. Not only with the outputs but also with the training material. (‘Porn-In-Porn-Out’)
Regardless of how we feel, lawyers and regulators wait at the door. We should expect new legal precedent within the next year re the generation of copyright-infringing, deepfake, and pornographic material.
Simply try to get it to output an image of a woman who isn't a beauty queen. Even when specifically prompted to produce an image of ugly people, it can only generate beautiful people.
But when I switched the prompt to include "portrait", yeah it produces women who are too attractive, like an actress wearing makeup to appear ugly for a role.
The neat thing about this speed, though, is that you can flip through the seeds quickly. Seed 626925 is giving me a fish holding some kind of gun, with what I guess are leather gloves. This has always been the main problem with SD imo: it can't really parse sentence structure, so adjective descriptions often don't attach to the thing you want.
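For anyone curious what that seed-flipping amounts to when you run the model yourself, here's a minimal local sketch with diffusers; the prompt is just a stand-in, and this is obviously not the demo's own stack.

```python
# What "flipping through seeds" amounts to locally: the same prompt with
# different fixed seeds for the initial noise. Illustrative diffusers sketch.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a fish wearing leather gloves, holding a toy gun, studio photo"
for seed in range(626920, 626930):  # quickly scan a handful of nearby seeds
    generator = torch.Generator("cuda").manual_seed(seed)
    pipe(prompt, generator=generator, num_inference_steps=8).images[0].save(f"seed_{seed}.png")
```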
Yeah, I like the speed of the renders. It feels relatively smooth.
OP's post to me feels like a marketing post, where the output image is a really close representation of the product they hope to sell. We always called these types of things "carefully selected, random examples"; in short, they are cherry-picked for their adherence to a standard.
In that same vein, mine is also a carefully selected, random example of the output you get when the algorithms don't work well, hence the "Needs work" qualification.
Both are useful, since you need to understand the limitations of the tools you are employing. In my case I stepped through animals until I found one that it could not render accurately. It did know that a coelacanth is a fish, but it couldn't produce an accurate image of one. Then I added modifiers that it could not place in context.
It's a bit like searching the debris field of a tornado for perfectly rounded debris particles and holding those up as a typical result without mentioning that you end up having to ignore all the splintery debris scattered from hell to breakfast around it.
Pixel art is a particularly hard thing for these models to do, especially without further fine-tunes or LoRAs, but I'm pretty sure you should be able to get that quality with one of nerijs's LoRAs [0]. For now, I'd do some prompt templating and try some variations of this: 'pixel-art picture of a cat. low-res, blocky, pixel art style, 8-bit graphics'
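Putting that together, here's a rough sketch of the prompt template plus one of nerijs's pixel-art LoRAs; the repo and weight file names are from memory of the Hugging Face page and worth double-checking.

```python
# Sketch of the suggestion above: a pixel-art prompt template plus one of
# nerijs's pixel-art LoRAs. The repo and weight names ("nerijs/pixel-art-xl",
# "pixel-art-xl.safetensors") are from memory of the Hugging Face page.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors")

TEMPLATE = "pixel-art picture of {subject}. low-res, blocky, pixel art style, 8-bit graphics"
for i, subject in enumerate(["a cat", "a cat sleeping on a keyboard", "a cat astronaut"]):
    image = pipe(TEMPLATE.format(subject=subject), num_inference_steps=30).images[0]
    image.save(f"pixel_{i}.png")
```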
Maybe it's an iOS thing? I just spent about an hour generating stuff from my iPad.
*Edit:* The previous poster specifically said they use an iPad; that's odd. It worked fine on my 12.9" iPad Pro with Display Zoom set to "More Space", i.e. more content shown. Could it have an issue with not enough screen space?
They've since added a refresh button, if you check the GitHub history. Both my previous comments were valid at the time of posting them, so again I don't understand the downvotes; I suspect it's a personal thing, as it's happened to many of my recent comments.
To be clear: I did not downvote any comment, and I don't agree with anyone who did, since I don't believe the reports of the website not working for someone on i(Pad) OS to be made-up.
That being said, I don't think the refresh button (actually a seed randomizer) is related. I was definitely already playing around with it on my iPad when you posted your comment. However, I am not sure about the timeline in relation to the first comment (above yours).