I think they meant scrollback, as conventionally a backlog would evoke "work yet to be done", whereas in this context we're talking about a conversation history one can revisit.
> 1.1 License.
> BRIA grants Customer a time-limited, non-exclusive, non-sublicensable, personal and non-transferable right and license to install, deploy and use the Foundation Model for the sole purpose of evaluating and examining the Foundation Model.
> The functionality of the Foundation Model is limited. Accordingly, Customer are not permitted to utilize the Foundation Model for purposes other than the testing and evaluation thereof.
> 1.2.Restrictions. Customer may not:
> 1.2.2. sell, rent, lease, sublicense, distribute or lend the Foundation Model to others, in whole or in part, or host the Foundation Model for access or use by others.
> The Foundation Model made available through Hugging Face is intended for internal evaluation purposes and/or demonstration to potential customers only.
A lot of these AI licenses are far more restrictive than old-school open source licenses were.
My company runs a bunch of similar web-based services and plans to do a background remover at some stage, but as far as I know there are no current models with a sufficiently permissive license that can also feasibly be downloaded and run in browsers.
Meta's second Segment Anything Model (SAM2) has an Apache license. It only does segmenting, and needs additional elbow grease to distill it for browsers, so it's not turnkey, but it's freely licensed.
Yeah, that one seems to be the closest so far. Not sure if it would be easier to create a background removal model from scratch (since that's a simpler operation than segmentation) or to distill it.
I got pretty far down that path during Covid for a feature of my SaaS, but limited to specific product categories on solid-ish backgrounds. Like with a lot of things, it's easy to get good, and it takes forever to get great.
Keep in mind that whether or not a model can be copyrighted at all is still an open question.
Everyone publishing an AI model is acting as if they owned the copyright over it, and as such they share it under a license, but there's no legal basis for such a claim at this point; it's all about pretending and hoping the law will later be changed to make their claim valid.
It's kind of silly to complain about not abiding by the model license when these models are trained on content not explicitly licensed for AI training.
You might say that the models were legally trained since no law mandates consent for AI training. But no law says that models are copyrightable either.
Surely they would at least be protected by Database Rights in the EU (not the US):
>The TRIPS Agreement requires that copyright protection extends to databases and other compilations if they constitute intellectual creation by virtue of the selection or arrangement of their contents, even if some or all of the contents do not themselves constitute materials protected by copyright
At some point the world's going to need a Richard Stallman of AI who builds up a foundation that is usable, has reasonable licensing, and is not in the total control of major corporations. OpenAI was supposed to fit that mold.
It doesn't, except that it runs it. There's no download link or code playground for running arbitrary code on it, so while technically it transfers the model to the computer where it's running (I think), it's not usually considered the same as distributing it.
I think it's a bit more subtle than that. The code of this tool runs in your browser and has the browser download the model from huggingface. So it does not host the model or provide it to you; it just does the download on your behalf, directly from where the owner of the model put it. The author of this tool is not providing the model to you, just automating the download for you.
Not saying it's not a copyright violation, and IANAL, but it's not an obvious one.
I think it's either running the model in the browser or a small part of it there. Maybe it's downloading parts of the model on the fly. But I kinda doubt it's all running on the server except for some simple RPC calls to the browser's WebGL.
Well, my question is about where it lies within the gray area between fully online and fully offline, so that wouldn't work.
Edit: Good call! It's fully offline - I disabled the network in Chrome and it worked. Says it's 176MB. I think it must be downloading part of the model, all at once, but that's just a guess.
The 176MB is in storage, which makes me think that my browser will hold onto it for a while. That's quite a lot. My browser really should provide a disk-clearing tool that's more like OmniDiskSweeper than Clear History. If, for instance, it showed just the entries over 20MB and my profile was using 1GB, there would be at most 50 of them, a manageable number to go through so I can clear the ones I don't need.
Yeah, this is why I think browsers need to start bundling some foundational models for websites to use. It's too unscalable if many websites start trying to store a significantly sized model each.
Google has started addressing this. I hope it becomes part of web standards soon.
"Since these models aren't shared across websites, each site has to download them on page load. This is an impractical solution for developers and users"
The browser bundles might become quite large, but at least websites won't be.
I was referring to the input image in the diagram: what is that, and how is the output image generated from it? Is it 256x256 noise that gets denoised into an image? I guess what I'm really asking is: what guides the process into the final image if it's not text-to-image?
The "input image" is just the noisy sample from the previous timestep, yes.
The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.
Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.
Both Latent Consistency Models and Adversarial Diffusion Distillation (the method behind SDXL Turbo) are methods that do not depend on any specific properties of the backbone. So, as Hourglass Diffusion Transformers are just a new kind of backbone that can be used just like the Diffusion U-Nets in Stable Diffusion (XL), these methods should also be applicable to it.
FID doesn't reward high-resolution detail. the inception feature size is 299x299! so we are forced to downsample our FFHQ-1024 samples to compute FID.
it doesn't punish poor detail either! this favours latent diffusion, which can claim to achieve a high resolution without actually needing correct textures to get good metrics.
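as an illustration of that downsample step (a sketch, not our evaluation code; names are placeholders):

```python
import torch
import torch.nn.functional as F

def to_inception_resolution(samples):
    """Downsample generated samples to InceptionV3's 299x299 input size before
    extracting features for FID; fine high-resolution detail is discarded here."""
    # samples: (batch, 3, 1024, 1024), values in [0, 1]
    return F.interpolate(samples, size=(299, 299), mode="bilinear", antialias=True)
```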
the FFHQ-1024 examples shouldn't be blurry. you can download the originals from the project page[0] — click any image in the teaser, or download our 50k samples.
the ImageNet-256 examples also aren't typically blurry (but they are 256x256 so your viewer may be bicubic scaling them or something). the ImageNet dataset _can_ have blurry, compressed or low resolution training samples, which can afflict some classes more than others, and we learn to produce samples like the training set.
cross-attention doesn't need to involve NATTEN. there's no neighbourhood involved because it's not self-attention. so you can do it the stable-diffusion way: after self-attention, run torch sdp with Q=image and K=V=text.
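for concreteness, a minimal sketch of that (module and dimension names are placeholders, not our actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Stable-diffusion-style cross-attention: queries from image tokens,
    keys/values from text tokens, plain scaled dot-product attention."""
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(text_dim, dim, bias=False)
        self.to_v = nn.Linear(text_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, text):
        # x: (batch, image_tokens, dim), text: (batch, text_tokens, text_dim)
        b, n, d = x.shape
        q = self.to_q(x).view(b, n, self.heads, d // self.heads).transpose(1, 2)
        k = self.to_k(text).view(b, -1, self.heads, d // self.heads).transpose(1, 2)
        v = self.to_v(text).view(b, -1, self.heads, d // self.heads).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # no neighbourhood mask needed
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))
```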
I tried adding "stable-diffusion-style" cross-attn to HDiT, text-conditioning on small class-conditional datasets (oxford flowers), embedding the class labels as text prompts with Phi-1.5. trained it for a few minutes, and the images were relevant to the prompts, so it seemed to be working fine.
but if instead of a text condition you have a single-token condition (class label) then yeah the adanorm would be a simpler way.
I'm one of the authors; happy to answer questions.
this arch is of course nice for high-resolution synthesis, but there's some other cool stuff worth mentioning..
activations are small! so you can enjoy bigger batch sizes. this is due to the 4x patching we do on the ingress to the model, and the effectiveness of neighbourhood attention in joining patches at the seams.
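the patching on the way in is nothing exotic; roughly this (a sketch of the idea, not our actual code, names are placeholders):

```python
import torch
import torch.nn as nn

class PatchIn(nn.Module):
    """Fold non-overlapping 4x4 pixel patches into token channels, shrinking the
    spatial extent 4x per side before any attention is computed."""
    def __init__(self, in_channels=3, width=256, patch=4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_channels * patch * patch, width)

    def forward(self, x):
        # x: (batch, channels, H, W) -> tokens: (batch, (H/4) * (W/4), width)
        b, c, h, w = x.shape
        p = self.patch
        x = x.view(b, c, h // p, p, w // p, p).permute(0, 2, 4, 1, 3, 5)
        x = x.reshape(b, (h // p) * (w // p), c * p * p)
        return self.proj(x)
```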
the model's inductive biases are pretty different than (for example) a convolutional UNet's. the innermost levels seem to train easily, so images can have good global coherence early in training.
there's no convolutions! so you don't need to worry about artifacts stemming from convolution padding, or having canvas edge padding artifacts leak an implicit position bias.
we can finally see what high-resolution diffusion outputs look like _without_ latents! personally I think current latent VAEs don't _really_ achieve the high resolutions they claim (otherwise fine details like text would survive a VAE roundtrip faithfully); it's common to see latent diffusion outputs with smudgy skin or blurry fur. what I'd like to see in the future of latent diffusion is to listen to the Emu paper and use more channels, or a less ambitious upsample.
it's a transformer! so we can try applying to it everything we know about transformers, like sigma reparameterisation or multimodality. some tricks like masked training will require extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN), but we're very happy with its featureset and performance so far.
but honestly I'm most excited about the efficiency. there's too little work on making pretraining possible at GPU-poor scale. so I was very happy to see HDiT could succeed at small-scale tasks within the resources I had at home (you can get nice oxford flowers samples at 256x256px with half an hour on a 4090). I think with models that are better fits for the problem, perhaps we can get good results with smaller models. and I'd like to see big tech go that direction too!
Hi Alex
Amazing work. I scanned the paper and dusted off my aging memories of Jeremy Howard’s course. Will your model live happily alongside the existing SD infrastructure such as ControlNet, IPAdapter, and the like? Obviously we will have to retrain these to fit onto your model, but conceptually, does your model have natural places where adapters of various kinds can be attached?
regarding ControlNet:
we have a UNet-shaped backbone, so the idea of "make trainable copies of the encoder blocks" sounds possible. the other part, "use a zero-inited dense layer to project the peer-encoder output and add it to the frozen-decoder output", also sounds fine. not quite sure what they do with the mid-block but I doubt there'd be any problem there.
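a rough sketch of the zero-init idea, simplified to a single frozen block rather than the full encoder/decoder wiring (names are placeholders, neither ControlNet's code nor ours):

```python
import copy
import torch
import torch.nn as nn

def zero_init_linear(dim):
    """Dense layer initialised to zero so the control branch starts as a no-op."""
    layer = nn.Linear(dim, dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

class ControlBranch(nn.Module):
    """Run a trainable copy of a frozen block on the control signal and add its
    zero-projected output back onto the frozen path."""
    def __init__(self, frozen_block, dim):
        super().__init__()
        # frozen_block is assumed to map (batch, tokens, dim) -> (batch, tokens, dim)
        self.frozen_block = frozen_block.requires_grad_(False)
        self.trainable_copy = copy.deepcopy(frozen_block).requires_grad_(True)
        self.proj = zero_init_linear(dim)

    def forward(self, x, control):
        out = self.frozen_block(x)
        return out + self.proj(self.trainable_copy(x + control))
```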
regarding IPAdapter:
I'm not familiar with it, but from the code it looks like they just run cross-attention again and sum the two attention outputs. feels a bit weird to me, because the attention probabilities add up to 2 instead of 1. and they scale the bonus attention output only instead of lerping. it'd make more sense to me to formulate it as a cross-cross attention (Q against cat([key0, key1]) and cat([val0, val1])), but maybe they wanted it to begin as a no-op at the start of training or something.
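to make the two formulations concrete (a sketch only; tensors assumed to be (batch, heads, tokens, head_dim), names are placeholders):

```python
import torch
import torch.nn.functional as F

def decoupled_sum(q, k_text, v_text, k_img, v_img, scale=1.0):
    """IP-Adapter-style: two separate cross-attentions whose outputs are summed
    (each softmax normalises to 1, so the combined attention mass is 2)."""
    out_text = F.scaled_dot_product_attention(q, k_text, v_text)
    out_img = F.scaled_dot_product_attention(q, k_img, v_img)
    return out_text + scale * out_img

def cross_cross(q, k_text, v_text, k_img, v_img):
    """The alternative suggested above: one softmax over the concatenated keys,
    so text and image tokens compete for the same attention mass."""
    k = torch.cat([k_text, k_img], dim=-2)
    v = torch.cat([v_text, v_img], dim=-2)
    return F.scaled_dot_product_attention(q, k, v)
```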
anyway.. yes, all of that should work fine with HDiT. the paper doesn't implement cross-attention, but it can be added in the standard way (e.g. like stable-diffusion) or as self-cross attention (e.g. DeepFloyd IF or Imagen).
I'd recommend though to make use of HDiT's mapping network. in our attention blocks, the input gets AdaNormed against the condition from the mapping network. this is currently used to convey stuff like class conditions, Karras augmentation conditions and timestep embeddings. but it supports conditioning on custom (single-token) conditions of your choosing. so you could use this to condition on an image embed (this would give you the same image-conditioning control as IPAdapter but via a simpler mechanism).
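a rough sketch of that adaptive-norm conditioning (simplified, not our actual implementation; names are placeholders):

```python
import torch
import torch.nn as nn

class AdaNorm(nn.Module):
    """Adaptive layer norm: scale/shift the normalised activations using parameters
    predicted from the mapping-network condition (class embed, timestep, image embed, ...)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, dim * 2)
        nn.init.zeros_(self.to_scale_shift.weight)  # start out as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x, cond):
        # x: (batch, tokens, dim), cond: (batch, cond_dim) single-token condition
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```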
But generally, most other UIs support it. It has serious limitations though; for example, it center-crops the input to 224x224px (which is enough for a surprisingly large number of uses, but not enough for many others).
Yes. I discussed this issue with the author of the ComfyUI IP-Adapter nodes. It would doubtless be handy if someone could end-to-end train a higher resolution IP-Adapter model that integrated its own variant of CLIPVision that is not subject to the 224px constraint. I have no idea what kind of horsepower would be required for that.
A latent space CLIPVision model would be cool too. Presumably you could leverage the semantic richness of the latent space to efficiently train a more powerful CLIPVision. I don’t know whether anyone has tried this. Maybe there is a good reason for that.
I appreciate the restraint of showing the speedup on a log-scale chart rather than trying to show a 99% speed up any other way.
I see your headline speed comparison is to "Pixel-space DiT-B/4" - but how does your model compare to the likes of SDXL? I gather they spent $$$$$$ on training etc, so I'd understand if direct comparisons don't make sense.
And do you have any results on things that are traditionally challenging for generative AI, like clocks and mirrors?
ah, originally lstein/stable-diffusion? yeah that was an important fork for us Mac users in the early days. I have to confess I've still never used a UI. :)
this year I'm hoping for efficiency and small models! even if it's proprietary. if our work can reduce some energy usage behind closed doors that'd still be a good outcome.
Not yet, we focused on the architecture for this paper. I totally agree with you though - pixel space is generally less limiting than a latent space for diffusion, so we would expect good behavior for inpainting and other editing tasks.
Most things require workarounds, some things aren't possible (or we haven't found workaround yet) and it's not as fast as CUDA. But stable-diffusion inference works, and so does textual inversion training. I was also able to run training of a T5 model with just a couple of tweaks.
I'd stick with PyTorch 1.12.1 for now. 1.13 has problems with backpropagation (I get NaN gradients now when I attempt CLIP-guided diffusion -- I think this applies to training too), and some einsum formulations are 50% slower (there is a patch to fix this; I expect it'll be merged soon), making big self-attention matmuls slow and consequently making stable-diffusion inference ~6% slower.
For performance reasons, glibc may not return freed memory to the OS. You can increase the incentive for it to do so by reducing MALLOC_ARENA_MAX to 2.
https://github.com/prestodb/presto/issues/8993
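Since glibc reads MALLOC_ARENA_MAX at process startup, it's normally set in the environment before launching rather than from inside the running process. A hypothetical Python launcher sketch (train.py is a placeholder):

```python
import os
import subprocess
import sys

# glibc picks up MALLOC_ARENA_MAX when it initialises malloc, so set it in the
# child's environment rather than in the already-running interpreter.
env = dict(os.environ, MALLOC_ARENA_MAX="2")
subprocess.run([sys.executable, "train.py"], env=env, check=True)  # train.py is a placeholder
```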