There's no information about what this is, beyond a teaser of a loss graph. Really hoping this is something that gets released to the world, not hidden behind closed doors.
I'd love to believe it's true but I suspect they're overstating the result, or it's a fluke. Presumably teams at large firms like Meta would have put a lot of effort into checking whether not-synchronise-every-step training could match synchronise-every-step training before investing hundreds of millions of dollars into the low-latency, high-throughput network hardware necessary for the latter.
We're pretty confident it's not a fluke, and paper + code are the next step, within a couple months.
It's not "synchronize every step", but it's "do something every step".
We double-, triple-, and quadruple-checked our results to make sure we really are getting results like this while only doing our thing every step, and it keeps holding up.
Don't trust our word for it, though, you'll see when the paper comes out :)
Um, so why announce something before even a paper with replicable details is available? To put it bluntly, what are we supposed to do with the information?
I could be less harsh if this was some grant requirement to release a report before a certain date, but I don't see any grant funding declaration.
We're excited about the potential and want to find other folks also excited about it that are interested in working for/with us to build things on the foundations of DisTrO!
Plus, it's so cool and mind-boggling to us that we wanted to share the hype a little bit; it was hard not being able to tell anyone we were working on it.
I'm happy to have the project on my radar, and though they could be a bit clearer about the provisional nature of the research I don't think it's wrong to want to hype the potential of it a bit.
Is synchronize-every-step training the status quo for training LLMs?
I've not kept up to date with training/optimizer research for quite some time, but during the deep learning craze there were papers like the ones about DistBelief/Downpour SGD[0] that showed how to scale up training by only doing occasional synchronization. Did that not transfer to transformer/LLM training?
Yes, ultimately everyone is currently doing something which looks like synchronous data parallel training on the outside.
The linked PDF is very light on detail, but what results they do claim are about a 1.2bn-parameter model. This is tiny; you don't need network-bound distributed training (i.e., anything beyond a single datacenter-class machine, or less if you're patient) to train a model that size. The comms requirements also scale with the model size, so I strongly suspect people hoping for embarrassingly-parallel-style scaling properties are going to be disappointed.
(They also appear to have, in part, reinvented parameter servers.)
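To put rough numbers on the scaling point, here's a back-of-envelope sketch (the bytes-per-gradient constant and model sizes are my own assumptions, not figures from the report):

```python
# Back-of-envelope only: a naive synchronous all-reduce exchanges on the
# order of one gradient value per parameter per step, so the traffic grows
# linearly with model size. Constants here are assumptions, not report data.
BYTES_PER_GRAD = 2  # bf16/fp16 gradient element

for params in (1.2e9, 70e9, 400e9):
    gb_per_step = params * BYTES_PER_GRAD / 1e9
    print(f"{params / 1e9:>6.1f}B params -> ~{gb_per_step:7.1f} GB of gradients per step")
# prints roughly 2.4 GB, 140 GB, and 800 GB per step respectively
```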
in particular, it appears that they only implement data parallel (DP) - at 1.2B you can fit a full copy of the model into memory, but larger models require splitting the weights across multiple machines (different techniques, e.g. fully sharded data parallel FSDP, tensor parallel TP, pipeline parallel PP, ...)
without more details it's unclear if the proposed technique keeps its speedups in that case
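To make "can fit" concrete, here is a rough memory estimate under standard mixed-precision Adam assumptions (my numbers, not anything from the report):

```python
# Rough rule of thumb for mixed-precision Adam training: fp16 weights (2 B)
# + fp16 grads (2 B) + fp32 master weights (4 B) + two fp32 Adam moments (8 B)
# ≈ 16 bytes per parameter. Assumed constants, not anything from the report.
BYTES_PER_PARAM = 16

for params in (1.2e9, 70e9):
    gb = params * BYTES_PER_PARAM / 1e9
    print(f"{params / 1e9:>5.1f}B params -> ~{gb:,.0f} GB of weights + optimizer state")
# ~19 GB for 1.2B (fits on a single accelerator); ~1,120 GB for 70B (needs sharding)
```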
I was into Nous at first, but it seems they mostly just do graphic design and vibes stuff so a16z gives them money. Which, whatever, nice work if you can get it, but don't use the same tactics for research projects.
AFAIK, the main bottleneck on training is memory bandwidth. Distributed GPU compute has multiple orders of magnitude less inter-GPU bandwidth than an equivalent number of colocated GPUs, because they don't share a physical bus but have a network connection instead. This work improves on that, but the fundamental limitations remain.
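For a sense of the gap, with assumed round numbers (not measurements):

```python
# Ballpark interconnect comparison; both figures are assumed round numbers.
nvlink_gb_s = 900              # aggregate NVLink bandwidth on a recent datacenter GPU
gigabit_internet_gb_s = 0.125  # a 1 Gbit/s consumer connection

print(f"colocated link is ~{nvlink_gb_s / gigabit_internet_gb_s:,.0f}x faster")
# ~7,200x, i.e. three to four orders of magnitude
```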
> Training large scale neural networks typically involves sharing gradients between all accelerators, which necessitates specialized, high-speed interconnects. To address this, we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.
This could be a HUGE deal.
Currently if you want to train giant LLMs you need a big pile of GPUs in the same location as each other due to the amount of information that needs to shuffle between them during training.
If DisTrO works as intended, it will be possible to train models using GPUs in different places - potentially enabling SETI@home style training where thousands of people with gaming PCs at home could donate their GPU time to a large training effort.
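To be clear, DisTrO's actual algorithm isn't published yet, so the following is only a generic sketch of the kind of trick that shrinks per-step traffic enough for volunteer-style setups: plain top-k gradient sparsification (a well-known compression idea, not DisTrO's method), with all names and numbers my own:

```python
# Toy sketch only: this is standard top-k gradient sparsification, NOT the
# (unpublished) DisTrO method. It just shows how a worker could ship orders
# of magnitude fewer values per step than a full gradient tensor.
import math
import torch

def compress_topk(grad: torch.Tensor, k_frac: float = 0.001):
    """Keep only the largest 0.1% of gradient entries (indices + values)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * k_frac))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def decompress_topk(indices, values, shape):
    """Rebuild a sparse gradient of the original shape on the receiving side."""
    out = torch.zeros(math.prod(shape), dtype=values.dtype)
    out[indices] = values
    return out.reshape(shape)

grad = torch.randn(1_200, 1_000)                    # stand-in gradient shard
idx, vals = compress_topk(grad)                     # what a worker would send
restored = decompress_topk(idx, vals, grad.shape)   # what peers reconstruct
print(f"sent {idx.numel() + vals.numel():,} numbers instead of {grad.numel():,}")
```

In a setup like that, only the small (indices, values) pairs would cross the internet each step rather than the full gradient; whatever DisTrO actually does, the report claims a reduction in that same per-step traffic of 1000x to 10,000x.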
> Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet), a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis, and matches AdamW+All-Reduce in convergence rates. This enables low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.
> DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.
> Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole.
Look at the PDF in simonw's first link. There's plenty of information. One part looked like the communication requirements were reduced to under 100 MB. That suggests a communication rate that could be handled by dirt-cheap instances spread across the globe, like on vast.ai or something.
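Quick sanity check on that, taking the ~100 MB figure at face value and assuming a link speed (my number, not theirs):

```python
# Back-of-envelope: time to ship a ~100 MB sync payload over a consumer uplink.
# The 100 MB figure is the one eyeballed from the PDF; the link speed is assumed.
sync_mb = 100
uplink_mbit_s = 100   # a decent home or cheap-VPS uplink

seconds = sync_mb * 8 / uplink_mbit_s
print(f"~{seconds:.0f} s to send {sync_mb} MB at {uplink_mbit_s} Mbit/s")
# ~8 s, which is plausible to overlap with a training step on slow hardware
```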
GaLore can do that too (if you transplant it), and there are similar methods from before the LLM era that do the same. They're not quite there on the loss-graph side, though (I'm actually unsure about GaLore, but both FLoRA and ReLoRA were not quite there on the loss graphs).