Microsoft Zero and DeepSpeed: Memory Efficient Large Neural Network Training

skwb · on Feb 10, 2020

I work in deep learning for 3D imaging, and memory has constantly been the primary bottleneck for our group. U-net for example tends to be fairly "chonky", and isn't really super great in terms of parameter efficiency (but it is nice when you need an out of the box network that just "works"...). This has led medical imaging to use a lot of "patching" and other sliding window sort of techniques to help get over this burden.

I tend to think that a lot of this is due to Facebook/Google/Etc being more interested in 2D picture images, and hasn't really put a ton of effort into developing approaches that are exponentially harder in terms of parameters. While I don't think I can comment on if parallelism is the future to solve (vs. single massive GPU memory chips vs. more efficient NN design vs. data compression techniques), I think this is where a lot of the bleeding technical edge will come from.

andbberger · on Feb 10, 2020

I love to hate on U-net. It works but it's just so inelegant. That is not a true convolution and only works for particular 'patch' sizes bothers me to no end.

I am not super up to date with the field, but has anyone caught on to using 'wavenet' like architectures yet? That is, dialated convolutions.

You have to be a little clever to get residual connections to work properly, but it's a true convolution that works for any patch size, is super-parameter efficient, and captures the same multi-scale features U-net was designed for.

Anecdotally, I used such an arch for some (unfortunately proprietary) 3D imaging work and achieved some nice results.

skwb · on Feb 11, 2020

> "It works".

Well that's sorta the point. Personally I'm not a super huge fan of creating a super specific network architecture and resulting in 2-3% difference in performance. Certainly if you're doing something where a configuration makes sense (LSTM for time series for example), but I think there needs to be a rethinking of the Grand Theory of Deep Learning Architecture TM.

And frankly I think a unsaid reason why U-net is so popular is that it does generalize reasonably well with limited data, which in many fields is not as massive as COCO.

I realize it's sorta asking too much (I both want a NN that works both out of the box, super easily, and doesn't require a TON of data), but I think that's where the current pains are for really explosive growth in AI.

andbberger · on Feb 11, 2020

> I think there needs to be a rethinking of the Grand Theory of Deep Learning Architecture TM.

strong agree. Although perhaps not so much a rethinking as a theory of all. Huge dearth of theory in the field. Daily practition involves regular use of black magic intuition for arch, problem posing and debugging. Weird times.

rckoepke · on Feb 11, 2020

I think that silicon and data compression are well-optimized domains already.

However, Neural Nets are provably horrible w.r.t resource efficiency, as demonstrated by recent advances like EfficientNet, which managed to decrease the number of parameters drastically and improve top 1% performance as well.

choppaface · on Feb 10, 2020

Even from the paper, it's hard to tell what this library actually does: Section 5 in https://arxiv.org/pdf/1910.02054.pdf

The paper talks about parameter partitioning and overlapped communication, but doesn't actually give many details on how those things happen.

The library appears to be an implementation of some common algos for solving the 'pebble game,' as explained decently here: https://medium.com/tensorflow/fitting-larger-networks-into-m...

The essential point is that:

(1) model parallelism is hard to do and has historically been done manually to scale wide models across GPUs

(2) inter-GPU I/O is expensive for vanilla data-parallel jobs (that typically use naive mirroring strategies)

(3) researchers have figured out now how to 'compile' a deep model so that layers span GPUs and save on both memory usage and I/O

(4) so scaling wide models is still hard, but now we have better tools for deep models

Existing all-reduce-based data-parallel problems have already been well-studied (see e.g. https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf ), so it's really nice to see gains through new techniques.

Definitely like seeing this 'compilation' being wrapped up into a library. Just wish they did a better job of communicating key ideas.

jeffra · on Feb 10, 2020

We tried to communicate the key ideas in the video released with the blog post. It shows how DeepSpeed and the ZeRO optimizer save memory, and shows exactly what happens during each iteration of training. It is quite different from standard data or model parallelism.

The ZeRO optimizer helps scale large models regardless of the model topology. It works equally well for wide or deep models. Please let us know if you have specific questions that we can address.

choppaface · on Feb 11, 2020

Oh sorry I didn't make it to the video because the blog post intro made me bounce straight to the paper. I agree the video is a big help versus what's given in the paper.

It looks like your approach plays the 'pebble counting' game described in the OpenAI article I linked. Or maybe you'd like to explain what's different.

What would really help in the video (and paper) is a grounded example (like Resnet10 or AlexNet or just a 2-layer MLP) and drawing the connection between GPU buffers and layers. I feel the video covers details of the memory savings in way too much precision while the intuition behind the method (and how it translates to a graphical model of a NN) is essentially absent.

jeffra · on Feb 10, 2020

I'm from the DeepSpeed team, we're happy to answer questions if people have them.

Tenoke · on Feb 10, 2020

This is great and looks very easy to use! I'd expect it to have a huge impact given how easy it makes for people to leverage a few or a few thousand GPUs. I do have a few questions, of course.

Is it getting a lot of internal use already (beyond the example we just heard about)?

Is it possible to do inference using a CPU and a lot of RAM using a model trained on multiple GPUs via DeepSpeed?

Does it work with TPUs right out of the box? It looks like maybe not - if not, any plans to support them?

Can you use DeepSpeed to train using a lot of CPUs + ram rather than GPUs?

jeffra · on Feb 10, 2020

> Is it getting a lot of internal use already (beyond the example we just heard about)?

We have hundreds of internal users of DeepSpeed using it to train production ready models, many of which have been already shipped.

> Is it possible to do inference using a CPU and a lot of RAM using a model trained on multiple GPUs via DeepSpeed?

It is definitely possible to do inference on CPU using a model trained on multiple GPUs via DeepSpeed. For models trained without model parallelism, this is straight forward. The tricky part is if the model was trained using model parallelism, which would require merging checkpoints corresponding to different pieces of the model into a single one.

> Does it work with TPUs right out of the box? It looks like maybe not - if not, any plans to support them?

The ZeRO technology is compatible with TPU or any accelerator in a cluster setting, but we have not tested it with the TPUs. It likely would require some small refactoring to get DeepSpeed to work with TPUs. We do not have any internal plans to support them yet, but of course completely open to contribution from the community.

> Can you use DeepSpeed to train using a lot of CPUs + ram rather than GPUs?

It is possible to use DeepSpeed to train using a lot of CPUs. The major limitation of the approach is that CPUs can be an order of magnitude slower than GPUs in terms of computational performance.

tixocloud · on Feb 11, 2020

Are you able to share the use cases for production ready models?

davidwhite · on Feb 11, 2020

Looks super cool. Does it remove the need for manual gradient checkpointing?

Also curious if there are expected to be memory / speed improvements if you're using it on a single GPU or if most gains come from improved parallelism across devices.

coolswan · on Feb 10, 2020

Will there be a whitepaper pertaining to T-NLG? Post is light on details.

Also, what's the timeline to release private preview? Outside of academia, who else will you likely be working with first?

saurkt · on Feb 11, 2020

We don't have an exact date, but, we plan to share more details in a later submission. If you want access, please send an email to [turing_ AT _microsoft _DOT_ com]. Remove underscores and spaces.

liuliu · on Feb 10, 2020

Looks like what it does is similar to what Alex did a few years back with One Weird Trick paper: https://arxiv.org/abs/1404.5997

When attempting to train transformers, I do notice a lot of time spend on allreduce more than with CNN models, probably due the parameter sizes. OWT seems to be natural to exploit for this situation (a lot of GEMMs, lot time spent on allreduce).

Edit:

Read the paper. The implementation is much less tricky than OWT, but for a good reason probably. Language model's GEMMs are smaller, therefore, partition the model would have efficiency impact (smaller GEMM will be slower). This does require much better interconnects, which NVLink / infiniband conveniently provides, that is also not available on consumer grade hardware anywhere (2-way NVLink is not meaningful).

easysnap · on Feb 15, 2020

ZeRO is mainly a clever improvement that moves optimizer computation into the 2 phases of Ring-AllReduce. It greatly helps Adam and similar optimizers to reduce per-GPU memory overhead.

The naive approach, as used in the well-known Megatron, completes Ring-AllReduce first so that each GPU has a full set of aggregated gradients. Then it does the same optimizer computation for all parameters on each GPU. That's OK for vanilla SGD because vanilla SGD has no optimizer state variable. But for Adam, the naive approach has to store a copy of full set of Adam m/v storage on each GPU, which is super memory consuming. Actually, after the 1st phase of All-Reduce each GPU has its subset of gradients. Each GPU can do Adam SGD for that subset, and importantly, it just need to keep m/v corresponding to that subset of gradients. After the Adam optimizer completes, the 2nd phase of Ring-AllReduce will scatter the updated parameters to all GPUs. Therefore, that's memory saving and computation saving. (The naive approach is more general as it allows optimizers to use optimizer variables of different network layers. But most optimizers, like Adam, don't really need that capability. ZeRO cleverly leveraged that locality.)

minimaxir · on Feb 10, 2020

Link to GitHub: https://github.com/microsoft/DeepSpeed

bitforger · on Feb 10, 2020

I wrote a blog post on the difficulties of memory-efficient training, which seems relevant: http://mitchgordon.me/machine/learning/2020/01/13/do-we-real...

The methods discussed there take a different angle at the problem.

tixocloud · on Feb 11, 2020

While on the surface, this looks interesting, can anyone help me understand who exactly needs to do and redo neural network training that will take advantage of these optimizations? I’m struggling to understand which companies/data scientists would use this.