
Also see the example repo README: https://github.com/microsoft/DeepSpeedExamples/tree/master/a...

> With just one click, you can train, generate and serve a 1.3 billion parameter ChatGPT model within 1.36 hours on a single consumer-grade NVIDIA A6000 GPU with 48GB memory. On a single DGX node with 8 NVIDIA A100-40G GPUs, DeepSpeed-Chat enables training for a 13 billion parameter ChatGPT model in 13.6 hours. On multi-GPU multi-node systems (cloud scenarios), i.e., 8 DGX nodes with 8 NVIDIA A100 GPUs/node, DeepSpeed-Chat can train a 66 billion parameter ChatGPT model in under 9 hours. Finally, it enables 15X faster training over the existing RLHF systems.

> The following are some of the open-source examples that are powered by DeepSpeed: Databricks Dolly, LMFlow, CarperAI-TRLX, Huggingface-PEFT

(disclaimer: MSFT/GH employee, not affiliated with this project)



> single consumer-grade NVIDIA A6000 GPU with 48GB memory

I wouldn't call an A6000 "consumer-grade" -- it costs about $5,000, and 99% of consumers who own a graphics card wouldn't have one.

Top of the line consumer grade GPU would be a Nvidia RTX 4090/3090 with 24GB VRAM.


It is solidly a workstation card that is often deployed in data centres. Consumers are far better off with a 4090.


“Consumer grade” here means “you can buy it in a store”. (This is not true of DGX devices.)


Your local MicroCenter doesn’t stock DGX A100/H100 ???


So "off the shelf", then.


Agreed, but two RTX 3090/4090 should be as capable in this regard (having 2x 24GB).


AFAIK the 4090 cannot share RAM like that (no NVLink).


Is it? It might have even more computing power, but are cards able to share VRAM now? My hands-on experience with all this is from a few years ago, and I think it was not possible back then.


DeepSpeed is designed to take care of spreading the work for you.

You can link two 3090s with NVLink to increase bandwidth.
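For what it's worth, DeepSpeed's ZeRO "spreads the work" by partitioning model and optimizer state across ranks rather than by true shared VRAM. Here's a toy sketch of the partitioning idea in plain Python (this is not DeepSpeed's actual API, just an illustration):

```python
# Toy illustration of ZeRO-style partitioning: each worker owns an even
# shard of the parameter list instead of every worker holding a full copy.

def partition(params, world_size):
    """Split a flat parameter list into one contiguous shard per rank."""
    shard_size = (len(params) + world_size - 1) // world_size  # ceil division
    return [params[i * shard_size:(i + 1) * shard_size]
            for i in range(world_size)]

params = list(range(10))       # stand-in for 10 parameter tensors
shards = partition(params, 2)  # two GPUs, e.g. a pair of 3090s

# Each rank materializes only its own shard; full parameters are
# all-gathered on demand during forward/backward, then freed again.
assert shards[0] == [0, 1, 2, 3, 4]
assert shards[1] == [5, 6, 7, 8, 9]
```

That's why two 24GB cards can hold state that wouldn't fit on one, even without pooled memory.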


I can see a future where each company has an "assistant AI model" trained/updated on its internal data at periodic intervals. The data sources could be group emails, Slack/Teams messages, docs, company PDFs and so on. Maybe MS will provide it to you, since it already has access to many of those data sources.


Honestly if you can train a decent model for $1,000, a 15 person company could afford to train it up monthly.


Note that by "train" here they mean "finetune that network a bit", not train from scratch.


For the 1.3 billion parameter one they do mean train from scratch on your 2 hour “coffee break”.

Which raises the question: Microsoft AI researchers get two-hour coffee breaks?

—edit—

The way they word it is confusing, but yeah: fine-tune the model on your two-hour coffee break.


No, you can see the time breakdown in the table under the "coffee break" quote: it is the time for the 3-step RLHF process only. Training a 1.3B parameter model from scratch is still a very large undertaking.
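For context, the three steps in that table are (1) supervised fine-tuning, (2) reward-model training, and (3) PPO-based RLHF. A plain-Python sketch of the pipeline ordering follows; the function names and model choices (OPT-1.3b actor, OPT-350m reward model, as in the example repo) are illustrative stand-ins, not DeepSpeed-Chat's actual entry points:

```python
# Hypothetical outline of a 3-step RLHF pipeline; step names follow the
# DeepSpeed-Chat README, the functions here are stand-ins.

def supervised_finetune(base_model, demo_data):
    # Step 1: fine-tune a pretrained LM on human demonstrations.
    return f"sft({base_model})"

def train_reward_model(base_model, comparison_data):
    # Step 2: train a (usually smaller) model to score responses.
    return f"rm({base_model})"

def rlhf_ppo(actor, reward_model, prompts):
    # Step 3: optimize the actor against the reward model with PPO.
    return f"ppo({actor}, {reward_model})"

actor = supervised_finetune("opt-1.3b", demo_data=[])
rm = train_reward_model("opt-350m", comparison_data=[])
final = rlhf_ppo(actor, rm, prompts=[])
assert final == "ppo(sft(opt-1.3b), rm(opt-350m))"
```

The quoted timings cover only these three steps on an already-pretrained base model, which is much cheaper than pretraining from scratch.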


FYI they don't compare to trlX because trlX is roughly just as fast. Similarly, they put trl in the worst light possible (trl is actually much faster than they claim).


We're doing some stuff with NVIDIA right now that I can't talk about yet. Super exciting though.


[flagged]


If you are going to advertise, at least give some hard figures on performance improvements.



