Which is all the more curious, considering OpenAI said this only in January:
> Azure will remain the exclusive cloud provider for all OpenAI workloads across our research, API and products [1]
So... OpenAI is severely GPU constrained; it is hampering their ability to execute, onboard customers to existing products, and launch new ones. Yet they signed an agreement not to just go rent a bunch of GPUs from AWS???
Did someone screw up by not putting a clause in that contract saying "exclusive cloud provider, unless you cannot fulfil our requests"?
The relevance here is that Azure appears to be very well designed to handle the hardware failures that will inevitably happen during a training run lasting weeks or months and using many thousands of GPUs. There's a lot more involved than just renting a bunch of Amazon GPUs. In any case, the partnership between OpenAI and Microsoft appears quite strategic and can tolerate some build-out delays, especially if they are not Microsoft's fault.
One of Azure's unique offerings is very large HPC clusters with GPUs. You can deploy ~1,000-node scale sets with very high-speed networking. AWS has many single-server GPU offerings, but nothing quite like what Azure has.
Don't assume Microsoft is bad at everything and that AWS is automatically superior at all product categories...
Whether MS is good or not isn't really the point. If they're constrained by GPU availability, being locked into any single provider is going to be a problem.
Large scale sets are only needed for training. For inference, 8x NVIDIA A100 80GB (640 GB of GPU memory) is enough for a ~300B-parameter model at 16-bit precision (GPT-3 is 175B), or a ~1.2T-parameter model with 4-bit quantization (quantization impact is negligible for large models), so a single machine is sufficient.
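The arithmetic above can be sanity-checked with a quick sketch. This only counts the memory to hold the weights and ignores KV cache and activations, so it's an optimistic lower bound, not a deployment calculator:

```python
# Back-of-the-envelope check: do the model weights fit in one 8x A100 80GB node?
# Assumption: weights dominate; KV cache and activation memory are ignored.

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """GB needed just to store the weights (billions of params * bytes each)."""
    return params_billion * bytes_per_param

NODE_GB = 8 * 80  # 640 GB across eight A100 80GB GPUs

for params, bits in [(175, 16), (300, 16), (1200, 4)]:
    need = weights_gb(params, bits / 8)
    verdict = "fits" if need <= NODE_GB else "does NOT fit"
    print(f"{params}B params @ {bits}-bit: {need:.0f} GB -> {verdict}")
```

At 16 bits, 300B parameters need 600 GB; at 4 bits, 1200B parameters also need 600 GB. Both squeeze under the 640 GB ceiling, which is where the numbers in the comment come from.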
>So... OpenAI is severely GPU constrained, it is hampering their ability to execute, onboard customers to existing products and launch products. Yet they signed an agreement not to just go rent a bunch of GPU's from AWS???
> Did someone screw up by not putting a clause in that contract saying "exclusive cloud provider, unless you cannot fulfil our requests"?
I don't think Amazon offers what Azure does (yet) in terms of HPC or multi-GPU capacity. The blog post doesn't say how long the agreement is for, but the relationship probably makes sense at the moment.
All the cloud providers are building out this type of capacity right now. It's already having a big impact on quarterly spend, as we just saw in the NVDA Q1 results. AWS, Azure, and GCP for sure, but also smaller players like Dell, HPE, and even Nvidia themselves are trying to get into this market. (Disclaimer: I work at one of these places but don't feel like saying which.) I suspect the GPU constraints won't be around too long, at which point we'll find out whether OpenAI made a contractual mistake.
Barring a revolution in chip manufacturing, datacenter GPUs will likely always be in short supply relative to consumer GPUs. Their much larger dies result in terrible yields.
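The die-size argument can be illustrated with the classic Poisson yield model, Y = exp(-D * A). The die areas and defect density below are hypothetical round numbers chosen for illustration, not figures for any specific chip or process:

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: fraction of defect-free dies, Y = exp(-D * A)."""
    area_cm2 = die_area_mm2 / 100.0  # convert mm^2 to cm^2 to match D's units
    return math.exp(-defects_per_cm2 * area_cm2)

D = 0.1  # assumed defect density, defects per cm^2 (hypothetical)

consumer = poisson_yield(150.0, D)    # small consumer-class die
datacenter = poisson_yield(800.0, D)  # large datacenter-class die

print(f"150 mm^2 die yield: {consumer:.1%}")
print(f"800 mm^2 die yield: {datacenter:.1%}")
```

Because area sits in the exponent, yield falls off exponentially as dies grow: at the same defect density, the large die here yields roughly half as many good chips per defect-free-area as the small one, which is why huge-die datacenter parts stay scarce.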
[1]: https://openai.com/blog/openai-and-microsoft-extend-partners...