With GLIDE I think we've reached something of a plateau, architecture-wise, on the "text-to-image generator S curve". DALL-E 2 is a very similar architecture to GLIDE and has some notable downsides (poorer language understanding).
glid-3 is a relatively small model trained by a single guy on his workstation (aka me), so it's not going to be as good. It's also not fully baked yet, so YMMV, although it really depends on the prompt. The new latent diffusion model is really amazing though, and is much closer to DALL-E 2 for 256px images.
I think the open source community will rapidly catch up with OpenAI in the coming months. The data, code, and compute are all there to train a model of similar size and quality.
glid-3 is trained specifically on photographic-style images, and is a bit better at generalization compared to the latent diffusion model.
e.g. the prompt "half human half Eiffel tower. A human Eiffel tower hybrid" (I get mostly normal Eiffel towers from LDM, but some sensible results from glid-3).
glid-3 will be worse for things that require detailed recall, like a specific person.
With smaller models you kind of have to generate a lot of samples and pick out the best ones.
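That "sample a lot and cherry-pick" loop is easy to automate. A minimal sketch, where `generate_sample` and `score` are hypothetical stand-ins (in practice, `generate_sample` would run the diffusion sampler and `score` would be CLIP text-image similarity):

```python
# Best-of-n sampling: draw many candidates for a prompt, score each one,
# and keep only the top few. Both callables are assumptions, not part of
# any specific model's API.

def best_of_n(prompt, generate_sample, score, n=64, keep=4):
    """Draw n samples for `prompt` and return the `keep` best by score."""
    samples = [generate_sample(prompt) for _ in range(n)]
    # Rank candidates from best to worst by their score against the prompt.
    samples.sort(key=lambda s: score(prompt, s), reverse=True)
    return samples[:keep]
```

With a real reranker the score would typically be the cosine similarity between CLIP's text embedding of the prompt and its image embedding of each decoded sample.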