Which is all the more curious, considering OpenAI said this only in January:
> Azure will remain the exclusive cloud provider for all OpenAI workloads across our research, API and products [1]
So... OpenAI is severely GPU constrained; it is hampering their ability to execute, onboard customers to existing products, and launch new ones. Yet they signed an agreement not to just go rent a bunch of GPUs from AWS???
Did someone screw up by not putting a clause in that contract saying "exclusive cloud provider, unless you cannot fulfil our requests"?
The relevance here is that Azure appears to be very well designed to handle the hardware failures that will inevitably happen during a training run lasting weeks or months and using many thousands of GPUs. There's a lot more involved than just renting a bunch of Amazon GPUs. In any case, the partnership between OpenAI and Microsoft appears quite strategic and can tolerate some build-out delays, especially if they are not Microsoft's fault.
One of Azure's unique offerings is very large HPC clusters with GPUs. You can deploy ~1,000-node scale sets with very high-speed networking. AWS has many single-server GPU offerings, but nothing quite like what Azure has.
Don't assume Microsoft is bad at everything and that AWS is automatically superior at all product categories...
Whether MS is good or not isn't really the point. If they're constrained by GPU availability, being locked into any single provider is going to be a problem.
Large scale sets are only needed for training. For inference, 8x NVIDIA A100 80GB (640 GB of GPU memory) is enough for a ~300B-parameter model at 16-bit precision (GPT-3 is 175B), or a ~1.2T-parameter model with 4-bit quantization (quantization impact is negligible for large models), so a single machine is sufficient.
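The arithmetic above can be sanity-checked with a quick sketch. This only counts the memory to hold the weights and ignores KV cache and activations, so it's an optimistic lower bound, not a deployment calculator:

```python
# Back-of-the-envelope check: do the model weights fit in one 8x A100 80GB node?
# Assumption: weights dominate; KV cache and activation memory are ignored.

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """GB needed just to store the weights (billions of params * bytes each)."""
    return params_billion * bytes_per_param

NODE_GB = 8 * 80  # 640 GB across eight A100 80GB GPUs

for params, bits in [(175, 16), (300, 16), (1200, 4)]:
    need = weights_gb(params, bits / 8)
    verdict = "fits" if need <= NODE_GB else "does NOT fit"
    print(f"{params}B params @ {bits}-bit: {need:.0f} GB -> {verdict}")
```

At 16 bits, 300B parameters need 600 GB; at 4 bits, 1200B parameters also need 600 GB. Both squeeze under the 640 GB ceiling, which is where the numbers in the comment come from.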
>So... OpenAI is severely GPU constrained, it is hampering their ability to execute, onboard customers to existing products and launch products. Yet they signed an agreement not to just go rent a bunch of GPU's from AWS???
> Did someone screw up by not putting a clause in that contract saying "exclusive cloud provider, unless you cannot fulfil our requests"?
I don't think Amazon offers what Azure does (yet) in terms of HPC or multi-GPU capacity. The blog post doesn't say how long the agreement is for, but the relationship probably makes sense at the moment.
All the cloud providers are building out this type of capacity right now. It's already having a big impact on quarterly spend, as we just saw in the NVDA Q1 results. AWS, Azure, and GCP for sure, but also smaller players like Dell, HPE, and even Nvidia themselves are trying to get into this market. (Disclaimer: I work at one of these places but don't feel like saying which.) I suspect the GPU constraints won't be around too long, at which point we'll find out whether OpenAI made a contractual mistake.
Barring a revolution in chip manufacturing, datacenter GPUs will likely always be in short supply relative to consumer GPUs. Their much larger dies result in terrible yields.
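The die-size argument can be illustrated with the classic Poisson yield model, Y = exp(-D * A). The die areas and defect density below are hypothetical round numbers chosen for illustration, not figures for any specific chip or process:

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: fraction of defect-free dies, Y = exp(-D * A)."""
    area_cm2 = die_area_mm2 / 100.0  # convert mm^2 to cm^2 to match D's units
    return math.exp(-defects_per_cm2 * area_cm2)

D = 0.1  # assumed defect density, defects per cm^2 (hypothetical)

consumer = poisson_yield(150.0, D)    # small consumer-class die
datacenter = poisson_yield(800.0, D)  # large datacenter-class die

print(f"150 mm^2 die yield: {consumer:.1%}")
print(f"800 mm^2 die yield: {datacenter:.1%}")
```

Because area sits in the exponent, yield falls off exponentially as dies grow: at the same defect density, the large die here yields roughly half as many good chips per defect-free-area as the small one, which is why huge-die datacenter parts stay scarce.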
[1]: https://openai.com/blog/openai-and-microsoft-extend-partners...