There's no information about what this is, beyond a teaser of a loss graph. Really hoping this is something that gets released to the world, not hidden behind closed doors.
I'd love to believe it's true but I suspect they're overstating the result, or it's a fluke. Presumably teams at large firms like Meta would have put a lot of effort into checking whether not-synchronise-every-step training could match synchronise-every-step training before investing hundreds of millions of dollars into the low-latency, high-throughput network hardware necessary for the latter.
We're pretty confident it's not a fluke, and paper + code are the next step, within a couple months.
It's not "synchronize every step", but it's "do something every step".
We double-, triple-, and quadruple-checked our results to make sure we really are getting results like this while only doing our thing every step, and it keeps holding up.
Don't trust our word for it, though, you'll see when the paper comes out :)
Um, so why announce something before even a paper with replicable details is available? To put it bluntly, what are we supposed to do with the information?
I could be less harsh if this was some grant requirement to release a report before a certain date, but I don't see any grant funding declaration.
We're excited about the potential and want to find other folks also excited about it that are interested in working for/with us to build things on the foundations of DisTrO!
Plus, it's so cool and mind-boggling to us that we wanted to share the hype a little bit; it was hard not being able to tell anyone we were working on it.
I'm happy to have the project on my radar, and though they could be a bit clearer about the provisional nature of the research I don't think it's wrong to want to hype the potential of it a bit.
Is synchronize-every-step training the status quo for training LLMs?
I've not kept up to date with training/optimizer research for quite some time, but during the deep learning craze there were papers like the ones about DistBelief/Downpour SGD[0] that showed how to scale up training by only doing occasional synchronization. Did that not transfer to transformer/LLM training?
Yes, ultimately everyone is currently doing something which looks like synchronous data parallel training on the outside.
The linked PDF is very light on detail, but what results they do claim are about a 1.2bn-parameter model. This is tiny; you don't need network-bound distributed training (i.e., anything beyond a single datacenter-class machine, or less if you're patient) to train a model that size. The comms requirements also scale with the model size, so I strongly suspect people hoping for embarrassingly-parallel-style scaling properties are going to be disappointed.
(They also appear to have, in part, reinvented parameter servers.)
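To put rough numbers on the scaling point, here's a back-of-envelope sketch (the bytes-per-gradient constant and model sizes are my own assumptions, not figures from the report):

```python
# Back-of-envelope only: a naive synchronous all-reduce exchanges on the
# order of one gradient value per parameter per step, so the traffic grows
# linearly with model size. Constants here are assumptions, not report data.
BYTES_PER_GRAD = 2  # bf16/fp16 gradient element

for params in (1.2e9, 70e9, 400e9):
    gb_per_step = params * BYTES_PER_GRAD / 1e9
    print(f"{params / 1e9:>6.1f}B params -> ~{gb_per_step:7.1f} GB of gradients per step")
# prints roughly 2.4 GB, 140 GB, and 800 GB per step respectively
```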
in particular, it appears that they only implement data parallel (DP) - at 1.2B you can fit a full copy of the model into memory, but larger models require splitting the weights across multiple machines (different techniques, e.g. fully sharded data parallel FSDP, tensor parallel TP, pipeline parallel PP, ...)
without more details it's unclear if the proposed technique keeps its speedups in that case
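To make "can fit" concrete, here is a rough memory estimate under standard mixed-precision Adam assumptions (my numbers, not anything from the report):

```python
# Rough rule of thumb for mixed-precision Adam training: fp16 weights (2 B)
# + fp16 grads (2 B) + fp32 master weights (4 B) + two fp32 Adam moments (8 B)
# ≈ 16 bytes per parameter. Assumed constants, not anything from the report.
BYTES_PER_PARAM = 16

for params in (1.2e9, 70e9):
    gb = params * BYTES_PER_PARAM / 1e9
    print(f"{params / 1e9:>5.1f}B params -> ~{gb:,.0f} GB of weights + optimizer state")
# ~19 GB for 1.2B (fits on a single accelerator); ~1,120 GB for 70B (needs sharding)
```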
I was into Nous at first, but it seems they mostly just do graphic design and vibes stuff so a16z gives them money. Which, whatever, nice work if you can get it, but don't use the same tactics for research projects.
AFAIK, the main bottleneck on training is memory bandwidth. Distributed GPU compute has multiple orders of magnitude less inter-GPU bandwidth than an equivalent number of colocated GPUs, because they don't share a physical bus but have a network connection instead. This work improves on that, but the fundamental limitations remain.
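For a sense of the gap, with assumed round numbers (not measurements):

```python
# Ballpark interconnect comparison; both figures are assumed round numbers.
nvlink_gb_s = 900              # aggregate NVLink bandwidth on a recent datacenter GPU
gigabit_internet_gb_s = 0.125  # a 1 Gbit/s consumer connection

print(f"colocated link is ~{nvlink_gb_s / gigabit_internet_gb_s:,.0f}x faster")
# ~7,200x, i.e. three to four orders of magnitude
```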
> Training large scale neural networks typically involves sharing gradients between all accelerators, which necessitates specialized, high-speed interconnects. To address this, we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.
This could be a HUGE deal.
Currently if you want to train giant LLMs you need a big pile of GPUs in the same location as each other due to the amount of information that needs to shuffle between them during training.
If DisTrO works as intended, it will be possible to train models using GPUs in different places - potentially enabling SETI@home style training where thousands of people with gaming PCs at home could donate their GPU time to a large training effort.
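To be clear, DisTrO's actual algorithm isn't published yet, so the following is only a generic sketch of the kind of trick that shrinks per-step traffic enough for volunteer-style setups: plain top-k gradient sparsification (a well-known compression idea, not DisTrO's method), with all names and numbers my own:

```python
# Toy sketch only: this is standard top-k gradient sparsification, NOT the
# (unpublished) DisTrO method. It just shows how a worker could ship orders
# of magnitude fewer values per step than a full gradient tensor.
import math
import torch

def compress_topk(grad: torch.Tensor, k_frac: float = 0.001):
    """Keep only the largest 0.1% of gradient entries (indices + values)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * k_frac))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def decompress_topk(indices, values, shape):
    """Rebuild a sparse gradient of the original shape on the receiving side."""
    out = torch.zeros(math.prod(shape), dtype=values.dtype)
    out[indices] = values
    return out.reshape(shape)

grad = torch.randn(1_200, 1_000)                    # stand-in gradient shard
idx, vals = compress_topk(grad)                     # what a worker would send
restored = decompress_topk(idx, vals, grad.shape)   # what peers reconstruct
print(f"sent {idx.numel() + vals.numel():,} numbers instead of {grad.numel():,}")
```

In a setup like that, only the small (indices, values) pairs would cross the internet each step rather than the full gradient; whatever DisTrO actually does, the report claims a reduction in that same per-step traffic of 1000x to 10,000x.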
> Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet), a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis, and matches AdamW+All-Reduce in convergence rates. This enables low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.
> DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.
> Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole.
Look at the PDF in simonw's first link. There's plenty of information. One part looked like the communication requirements were reduced to under 100 MB. That suggests a communication rate that could be handled by dirt-cheap instances spread across the globe, like on vast.ai or something.
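Quick sanity check on that, taking the ~100 MB figure at face value and assuming a link speed (my number, not theirs):

```python
# Back-of-envelope: time to ship a ~100 MB sync payload over a consumer uplink.
# The 100 MB figure is the one eyeballed from the PDF; the link speed is assumed.
sync_mb = 100
uplink_mbit_s = 100   # a decent home or cheap-VPS uplink

seconds = sync_mb * 8 / uplink_mbit_s
print(f"~{seconds:.0f} s to send {sync_mb} MB at {uplink_mbit_s} Mbit/s")
# ~8 s, which is plausible to overlap with a training step on slow hardware
```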
GaLore can do that too (if you transplant it), and there are similar methods from before the LLM era that do the same. They're not quite there on the loss-graph side, though (I'm actually unsure about GaLore, but both FLoRA and ReLoRA were not quite there on the loss graphs).