
Also see the example repo README: https://github.com/microsoft/DeepSpeedExamples/tree/master/a...

> With just one click, you can train, generate and serve a 1.3 billion parameter ChatGPT model within 1.36 hours on a single consumer-grade NVIDIA A6000 GPU with 48GB memory. On a single DGX node with 8 NVIDIA A100-40G GPUs, DeepSpeed-Chat enables training for a 13 billion parameter ChatGPT model in 13.6 hours. On multi-GPU multi-node systems (cloud scenarios), i.e., 8 DGX nodes with 8 NVIDIA A100 GPUs/node, DeepSpeed-Chat can train a 66 billion parameter ChatGPT model in under 9 hours. Finally, it enables 15X faster training over the existing RLHF systems.

> The following are some of the open-source examples that are powered by DeepSpeed: Databricks Dolly, LMFlow, CarperAI-TRLX, Huggingface-PEFT

(disclaimer: MSFT/GH employee, not affiliated with this project)



> single consumer-grade NVIDIA A6000 GPU with 48GB memory

I wouldn't call an A6000 "consumer-grade" -- it costs about $5,000, and 99% of consumers who own a graphics card wouldn't have one.

Top of the line consumer grade GPU would be a Nvidia RTX 4090/3090 with 24GB VRAM.


It is solidly a workstation card that is often deployed in data centres. Consumers are far better off with a 4090.


“Consumer grade” here means “you can buy it in a store”. (This is not true of DGX devices.)


Your local MicroCenter doesn’t stock DGX A100/H100 ???


So "off the shelf", then.


Agreed, but two RTX 3090/4090 should be as capable in this regard (having 2x 24GB).


AFAIK the 4090 cannot share RAM like that (no NVLink).


Is it? It might have even more computing power, but are cards able to share VRAM now? My hands-on experience with all this is from a few years ago, and I think it was not possible back then.


DeepSpeed is designed to take care of spreading the work for you.

You can link two 3090s with NVLink to increase bandwidth.
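For what it's worth, DeepSpeed's ZeRO "spreads the work" by partitioning model and optimizer state across ranks rather than by true shared VRAM. Here's a toy sketch of the partitioning idea in plain Python (this is not DeepSpeed's actual API, just an illustration):

```python
# Toy illustration of ZeRO-style partitioning: each worker owns an even
# shard of the parameter list instead of every worker holding a full copy.

def partition(params, world_size):
    """Split a flat parameter list into one contiguous shard per rank."""
    shard_size = (len(params) + world_size - 1) // world_size  # ceil division
    return [params[i * shard_size:(i + 1) * shard_size]
            for i in range(world_size)]

params = list(range(10))       # stand-in for 10 parameter tensors
shards = partition(params, 2)  # two GPUs, e.g. a pair of 3090s

# Each rank materializes only its own shard; full parameters are
# all-gathered on demand during forward/backward, then freed again.
assert shards[0] == [0, 1, 2, 3, 4]
assert shards[1] == [5, 6, 7, 8, 9]
```

That's why two 24GB cards can hold state that wouldn't fit on one, even without pooled memory.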


I can see a future where each company has an "assistant AI model" trained/updated on its internal data at periodic intervals. The data sources could be group emails, Slack/Teams messages, docs, company PDFs and so on. Maybe MS will provide it to you, since it already has access to many of those data sources.


Honestly if you can train a decent model for $1,000, a 15 person company could afford to train it up monthly.


Note that by "train" here they mean "finetune that network a bit", not train from scratch.


For the 1.3 billion parameter one they do mean train from scratch on your 2 hour “coffee break”.

Which raises the question: Microsoft AI researchers get two-hour coffee breaks?

—edit—

The way they word it is confusing, but yeah: fine-tune the model on your two-hour coffee break.


No, you can see the time breakdown in the table under the "coffee break" quote: it is the time for the 3-step RLHF process only. Training a 1.3B parameter model from scratch is still a very large undertaking.
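For context, the three steps in that table are (1) supervised fine-tuning, (2) reward-model training, and (3) PPO-based RLHF. A plain-Python sketch of the pipeline ordering follows; the function names and model choices (OPT-1.3b actor, OPT-350m reward model, as in the example repo) are illustrative stand-ins, not DeepSpeed-Chat's actual entry points:

```python
# Hypothetical outline of a 3-step RLHF pipeline; step names follow the
# DeepSpeed-Chat README, the functions here are stand-ins.

def supervised_finetune(base_model, demo_data):
    # Step 1: fine-tune a pretrained LM on human demonstrations.
    return f"sft({base_model})"

def train_reward_model(base_model, comparison_data):
    # Step 2: train a (usually smaller) model to score responses.
    return f"rm({base_model})"

def rlhf_ppo(actor, reward_model, prompts):
    # Step 3: optimize the actor against the reward model with PPO.
    return f"ppo({actor}, {reward_model})"

actor = supervised_finetune("opt-1.3b", demo_data=[])
rm = train_reward_model("opt-350m", comparison_data=[])
final = rlhf_ppo(actor, rm, prompts=[])
assert final == "ppo(sft(opt-1.3b), rm(opt-350m))"
```

The quoted timings cover only these three steps on an already-pretrained base model, which is much cheaper than pretraining from scratch.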


FYI they don't compare to trlX because trlX is roughly just as fast. Similarly, they put trl in the worst light possible (trl is actually much faster than they claim).


We're doing some stuff with NVIDIA right now that I can't talk about yet. Super exciting though.


[flagged]


If you are going to advertise, at least give some hard figures on performance improvements.



