I think they meant scrollback, as conventionally a backlog would evoke "work yet to be done", whereas in this context we're talking about a conversation history one can revisit.
> 1.1 License.
> BRIA grants Customer a time-limited, non-exclusive, non-sublicensable, personal and non-transferable right and license to install, deploy and use the Foundation Model for the sole purpose of evaluating and examining the Foundation Model.
> The functionality of the Foundation Model is limited. Accordingly, Customer are not permitted to utilize the Foundation Model for purposes other than the testing and evaluation thereof.
> 1.2.Restrictions. Customer may not:
> 1.2.2. sell, rent, lease, sublicense, distribute or lend the Foundation Model to others, in whole or in part, or host the Foundation Model for access or use by others.
> The Foundation Model made available through Hugging Face is intended for internal evaluation purposes and/or demonstration to potential customers only.
A lot of these AI licenses are far more restrictive than old-school open source licenses were.
My company runs a bunch of similar web-based services and plans to do a background remover at some stage, but as far as I know there are no current models with a sufficiently permissive license that can also feasibly be downloaded and run in browsers.
Meta's second Segment Anything Model (SAM2) has an Apache license. It only does segmenting, and needs additional elbow grease to distill it for browsers, so it's not turnkey, but it's freely licensed.
Yeah, that one seems to be the closest so far. Not sure if it would be easier to create a background removal model from scratch (since that's a simpler operation than segmentation) or to distill it.
I got pretty far down that path during Covid for a feature of my SaaS, but limited to specific product categories on solid-ish backgrounds. Like with a lot of things, it's easy to get good, and it takes forever to get great.
Keep in mind that whether or not a model can be copyrighted at all is still an open question.
Everyone publishing an AI model is acting as if they owned the copyright over it, and as such they share it under a license, but there's no legal basis for such a claim at this point; it's all about pretending and hoping the law will later be changed to make their claim valid.
It's kind of silly to complain about not abiding by the model license when these models are trained on content not explicitly licensed for AI training.
You might say that the models were legally trained since no law mandates consent for AI training. But no law says that models are copyrightable either.
Surely they would at least be protected by Database Rights in the EU (not the US):
>The TRIPS Agreement requires that copyright protection extends to databases and other compilations if they constitute intellectual creation by virtue of the selection or arrangement of their contents, even if some or all of the contents do not themselves constitute materials protected by copyright
At some point the world's going to need a Richard Stallman of AI who builds up a foundation that is usable, has reasonable licensing, and is not in the total control of major corporations. OpenAI was supposed to fit that mold.
It doesn't, except that it runs it. There's no download link or code playground for running arbitrary code on it, so while technically it transfers the model to the computer where it's running (I think), it's not usually considered the same as distributing it.
I think it's a bit more subtle than that. The code of this tool runs in your browser and has the browser download the model from huggingface. So it does not host the model or provide it to you; it just does the download on your behalf, directly from where the owner of the model put it. The author of this tool is not providing the model to you, just automating the download for you.
Not saying it's not a copyright violation, and IANAL, but it's not an obvious one.
I think it's either running the model in the browser or a small part of it there. Maybe it's downloading parts of the model on the fly. But I kinda doubt it's all running on the server except for some simple RPC calls to the browser's WebGL.
Well, my question is about where it lies within the gray area between fully online and fully offline, so that wouldn't work.
Edit: Good call! It's fully offline - I disabled the network in Chrome and it worked. Says it's 176MB. I think it must be downloading part of the model, all at once, but that's just a guess.
The 176MB is in storage, which makes me think that my browser will hold onto it for a while. That's quite a lot. My browser really should provide a disk-clearing tool that's more like OmniDiskSweeper than Clear History. If, for instance, it showed just the entries over 20MB and my profile was using 1GB, there would be at most 50 of them, a manageable number to go through so I can clear the ones I don't need.
Yeah, this is why I think browsers need to start bundling some foundational models for websites to use. It's too unscalable if many websites start trying to store a significantly sized model each.
Google has started addressing this. I hope it becomes part of web standards soon.
"Since these models aren't shared across websites, each site has to download them on page load. This is an impractical solution for developers and users"
The browser bundles might become quite large, but at least websites won't be.
I was referring to the input image in the diagram: what is that, and how is the output image generated from it? Is it 256x256 noise that gets denoised into an image? I guess what I'm really asking is: what guides the process into the final image if it's not text-to-image?
The "input image" is just the noisy sample from the previous timestep, yes.
The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.
Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.
Both Latent Consistency Models and Adversarial Diffusion Distillation (the method behind SDXL Turbo) are methods that do not depend on any specific properties of the backbone. So, as Hourglass Diffusion Transformers are just a new kind of backbone that can be used just like the Diffusion U-Nets in Stable Diffusion (XL), these methods should also be applicable to it.
FID doesn't reward high-resolution detail. the inception feature size is 299x299! so we are forced to downsample our FFHQ-1024 samples to compute FID.
it doesn't punish poor detail either! this favours latent diffusion, which can claim to achieve a high resolution without actually needing correct textures to get good metrics.
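as an illustration of that downsample step (a sketch, not our evaluation code; names are placeholders):

```python
import torch
import torch.nn.functional as F

def to_inception_resolution(samples):
    """Downsample generated samples to InceptionV3's 299x299 input size before
    extracting features for FID; fine high-resolution detail is discarded here."""
    # samples: (batch, 3, 1024, 1024), values in [0, 1]
    return F.interpolate(samples, size=(299, 299), mode="bilinear", antialias=True)
```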
the FFHQ-1024 examples shouldn't be blurry. you can download the originals from the project page[0] — click any image in the teaser, or download our 50k samples.
the ImageNet-256 examples also aren't typically blurry (but they are 256x256 so your viewer may be bicubic scaling them or something). the ImageNet dataset _can_ have blurry, compressed or low resolution training samples, which can afflict some classes more than others, and we learn to produce samples like the training set.
cross-attention doesn't need to involve NATTEN. there's no neighbourhood involved because it's not self-attention. so you can do it the stable-diffusion way: after self-attention, run torch sdp with Q=image and K=V=text.
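for concreteness, a minimal sketch of that (module and dimension names are placeholders, not our actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Stable-diffusion-style cross-attention: queries from image tokens,
    keys/values from text tokens, plain scaled dot-product attention."""
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(text_dim, dim, bias=False)
        self.to_v = nn.Linear(text_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, text):
        # x: (batch, image_tokens, dim), text: (batch, text_tokens, text_dim)
        b, n, d = x.shape
        q = self.to_q(x).view(b, n, self.heads, d // self.heads).transpose(1, 2)
        k = self.to_k(text).view(b, -1, self.heads, d // self.heads).transpose(1, 2)
        v = self.to_v(text).view(b, -1, self.heads, d // self.heads).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # no neighbourhood mask needed
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))
```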
I tried adding "stable-diffusion-style" cross-attn to HDiT, text-conditioning on small class-conditional datasets (oxford flowers), embedding the class labels as text prompts with Phi-1.5. trained it for a few minutes, and the images were relevant to the prompts, so it seemed to be working fine.
but if instead of a text condition you have a single-token condition (class label) then yeah the adanorm would be a simpler way.
I'm one of the authors; happy to answer questions.
this arch is of course nice for high-resolution synthesis, but there's some other cool stuff worth mentioning..
activations are small! so you can enjoy bigger batch sizes. this is due to the 4x patching we do on the ingress to the model, and the effectiveness of neighbourhood attention in joining patches at the seams.
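the patching on the way in is nothing exotic; roughly this (a sketch of the idea, not our actual code, names are placeholders):

```python
import torch
import torch.nn as nn

class PatchIn(nn.Module):
    """Fold non-overlapping 4x4 pixel patches into token channels, shrinking the
    spatial extent 4x per side before any attention is computed."""
    def __init__(self, in_channels=3, width=256, patch=4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_channels * patch * patch, width)

    def forward(self, x):
        # x: (batch, channels, H, W) -> tokens: (batch, (H/4) * (W/4), width)
        b, c, h, w = x.shape
        p = self.patch
        x = x.view(b, c, h // p, p, w // p, p).permute(0, 2, 4, 1, 3, 5)
        x = x.reshape(b, (h // p) * (w // p), c * p * p)
        return self.proj(x)
```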
the model's inductive biases are pretty different than (for example) a convolutional UNet's. the innermost levels seem to train easily, so images can have good global coherence early in training.
there's no convolutions! so you don't need to worry about artifacts stemming from convolution padding, or having canvas edge padding artifacts leak an implicit position bias.
we can finally see what high-resolution diffusion outputs look like _without_ latents! personally I think current latent VAEs don't _really_ achieve the high resolutions they claim (otherwise fine details like text would survive a VAE roundtrip faithfully); it's common to see latent diffusion outputs with smudgy skin or blurry fur. what I'd like to see in the future of latent diffusion is to listen to the Emu paper and use more channels, or a less ambitious upsample.
it's a transformer! so we can try applying to it everything we know about transformers, like sigma reparameterisation or multimodality. some tricks like masked training will require extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN), but we're very happy with its featureset and performance so far.
but honestly I'm most excited about the efficiency. there's too little work on making pretraining possible at GPU-poor scale. so I was very happy to see HDiT could succeed at small-scale tasks within the resources I had at home (you can get nice oxford flowers samples at 256x256px with half an hour on a 4090). I think with models that are better fits for the problem, perhaps we can get good results with smaller models. and I'd like to see big tech go that direction too!
Hi Alex
Amazing work. I scanned the paper and dusted off my aging memories of Jeremy Howard’s course. Will your model live happily alongside the existing SD infrastructure such as ControlNet, IPAdapter, and the like? Obviously we will have to retrain these to fit onto your model, but conceptually, does your model have natural places where adapters of various kinds can be attached?
regarding ControlNet:
we have a UNet-shaped backbone, so the idea of "make trainable copies of the encoder blocks" sounds possible. the other part, "use a zero-inited dense layer to project the peer-encoder output and add it to the frozen-decoder output", also sounds fine. not quite sure what they do with the mid-block but I doubt there'd be any problem there.
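a rough sketch of the zero-init idea, simplified to a single frozen block rather than the full encoder/decoder wiring (names are placeholders, neither ControlNet's code nor ours):

```python
import copy
import torch
import torch.nn as nn

def zero_init_linear(dim):
    """Dense layer initialised to zero so the control branch starts as a no-op."""
    layer = nn.Linear(dim, dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

class ControlBranch(nn.Module):
    """Run a trainable copy of a frozen block on the control signal and add its
    zero-projected output back onto the frozen path."""
    def __init__(self, frozen_block, dim):
        super().__init__()
        # frozen_block is assumed to map (batch, tokens, dim) -> (batch, tokens, dim)
        self.frozen_block = frozen_block.requires_grad_(False)
        self.trainable_copy = copy.deepcopy(frozen_block).requires_grad_(True)
        self.proj = zero_init_linear(dim)

    def forward(self, x, control):
        out = self.frozen_block(x)
        return out + self.proj(self.trainable_copy(x + control))
```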
regarding IPAdapter:
I'm not familiar with it, but from the code it looks like they just run cross-attention again and sum the two attention outputs. feels a bit weird to me, because the attention probabilities add up to 2 instead of 1. and they scale the bonus attention output only instead of lerping. it'd make more sense to me to formulate it as a cross-cross attention (Q against cat([key0, key1]) and cat([val0, val1])), but maybe they wanted it to begin as a no-op at the start of training or something.
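to make the two formulations concrete (a sketch only; tensors assumed to be (batch, heads, tokens, head_dim), names are placeholders):

```python
import torch
import torch.nn.functional as F

def decoupled_sum(q, k_text, v_text, k_img, v_img, scale=1.0):
    """IP-Adapter-style: two separate cross-attentions whose outputs are summed
    (each softmax normalises to 1, so the combined attention mass is 2)."""
    out_text = F.scaled_dot_product_attention(q, k_text, v_text)
    out_img = F.scaled_dot_product_attention(q, k_img, v_img)
    return out_text + scale * out_img

def cross_cross(q, k_text, v_text, k_img, v_img):
    """The alternative suggested above: one softmax over the concatenated keys,
    so text and image tokens compete for the same attention mass."""
    k = torch.cat([k_text, k_img], dim=-2)
    v = torch.cat([v_text, v_img], dim=-2)
    return F.scaled_dot_product_attention(q, k, v)
```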
anyway.. yes, all of that should work fine with HDiT. the paper doesn't implement cross-attention, but it can be added in the standard way (e.g. like stable-diffusion) or as self-cross attention (e.g. DeepFloyd IF or Imagen).
I'd recommend though to make use of HDiT's mapping network. in our attention blocks, the input gets AdaNormed against the condition from the mapping network. this is currently used to convey stuff like class conditions, Karras augmentation conditions and timestep embeddings. but it supports conditioning on custom (single-token) conditions of your choosing. so you could use this to condition on an image embed (this would give you the same image-conditioning control as IPAdapter but via a simpler mechanism).
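a rough sketch of that adaptive-norm conditioning (simplified, not our actual implementation; names are placeholders):

```python
import torch
import torch.nn as nn

class AdaNorm(nn.Module):
    """Adaptive layer norm: scale/shift the normalised activations using parameters
    predicted from the mapping-network condition (class embed, timestep, image embed, ...)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, dim * 2)
        nn.init.zeros_(self.to_scale_shift.weight)  # start out as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x, cond):
        # x: (batch, tokens, dim), cond: (batch, cond_dim) single-token condition
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```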
But generally, most other UIs support it. It has serious limitations though; for example, it center-crops the input to 224x224px (which is enough for a surprisingly large number of uses, but not enough for many others).
Yes. I discussed this issue with the author of the ComfyUI IP-Adapter nodes. It would doubtless be handy if someone could end-to-end train a higher resolution IP-Adapter model that integrated its own variant of CLIPVision that is not subject to the 224px constraint. I have no idea what kind of horsepower would be required for that.
A latent space CLIPVision model would be cool too. Presumably you could leverage the semantic richness of the latent space to efficiently train a more powerful CLIPVision. I don’t know whether anyone has tried this. Maybe there is a good reason for that.
I appreciate the restraint of showing the speedup on a log-scale chart rather than trying to show a 99% speed up any other way.
I see your headline speed comparison is to "Pixel-space DiT-B/4" - but how does your model compare to the likes of SDXL? I gather they spent $$$$$$ on training etc, so I'd understand if direct comparisons don't make sense.
And do you have any results on things that are traditionally challenging for generative AI, like clocks and mirrors?
ah, originally lstein/stable-diffusion? yeah that was an important fork for us Mac users in the early days. I have to confess I've still never used a UI. :)
this year I'm hoping for efficiency and small models! even if it's proprietary. if our work can reduce some energy usage behind closed doors that'd still be a good outcome.
Not yet, we focused on the architecture for this paper. I totally agree with you though - pixel space is generally less limiting than a latent space for diffusion, so we would expect good behavior for inpainting and other editing tasks.
Most things require workarounds, some things aren't possible (or we haven't found workaround yet) and it's not as fast as CUDA. But stable-diffusion inference works, and so does textual inversion training. I was also able to run training of a T5 model with just a couple of tweaks.
I'd stick with PyTorch 1.12.1 for now. 1.13 has problems with backpropagation (I get NaN gradients now when I attempt CLIP-guided diffusion -- I think this applies to training too), and some einsum formulations are 50% slower (there is a patch to fix this; I expect it'll be merged soon), making big self-attention matmuls slow and consequently making stable-diffusion inference ~6% slower.
For performance reasons, glibc may not return freed memory to the OS. You can increase the incentive for it to do so by reducing MALLOC_ARENA_MAX to 2.
https://github.com/prestodb/presto/issues/8993
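Since glibc reads MALLOC_ARENA_MAX at process startup, it's normally set in the environment before launching rather than from inside the running process. A hypothetical Python launcher sketch (train.py is a placeholder):

```python
import os
import subprocess
import sys

# glibc picks up MALLOC_ARENA_MAX when it initialises malloc, so set it in the
# child's environment rather than in the already-running interpreter.
env = dict(os.environ, MALLOC_ARENA_MAX="2")
subprocess.run([sys.executable, "train.py"], env=env, check=True)  # train.py is a placeholder
```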