One thing that doesn't seem worth bragging about: the startup time increases linearly with the cluster size. Wouldn't you want it to be constant? What's the issue there?
Disclaimer: I work with this team, but I didn't write or review the blog post.
The article stated the limiting factors were the throughput of scheduling pods on the cluster (from unrelated benchmarks, kube-scheduler manages roughly ~300 pods/sec today) and doing XLA compilation at pod launch rather than amortizing it once across all jobs.
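As a rough back-of-envelope (all numbers here are illustrative assumptions, not measurements from the post): if pods are admitted at a fixed scheduler throughput and each pod pays a JIT compile cost at launch, total startup grows roughly linearly with pod count, with compilation adding a constant tail on top.

```python
# Back-of-envelope for why startup grows ~linearly with cluster size.
# Assumed numbers, purely illustrative.
SCHED_THROUGHPUT = 300.0   # pods scheduled per second (rough kube-scheduler figure)
JIT_COMPILE_SEC = 60.0     # per-pod XLA JIT compile at launch (made up)

def startup_estimate(num_pods: int) -> float:
    scheduling = num_pods / SCHED_THROUGHPUT  # scheduling cost grows with pod count
    compile_tail = JIT_COMPILE_SEC            # compiles run in parallel on each pod,
                                              # so they add a roughly constant tail
    return scheduling + compile_tail

for pods in (1_000, 10_000, 50_000):
    print(f"{pods:>6} pods -> ~{startup_estimate(pods):6.0f} s")
```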
Optimizing kube-scheduler throughput is a good general opportunity and something I believe we'd like to see.
I believe AOT compilation just wasn't a critical optimization for this test; we would recommend it for large, long-running training jobs to keep pod start latency low after hardware failures and job restarts (from checkpoints).
> The start times we observed were impressive, but we believe we can improve these even further. We are working on areas such as optimizing scheduling in GKE to increase throughput and enabling ahead-of-time compilation in MaxText to avoid just-in-time compilations on the full cluster.
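For the AOT piece, here is a minimal sketch of what ahead-of-time compilation looks like in plain JAX (this is not MaxText's own tooling, and `train_step` plus its shapes are made up for illustration):

```python
# Minimal JAX ahead-of-time (AOT) compilation sketch, assuming a toy train_step.
import jax
import jax.numpy as jnp

def train_step(params, batch):
    # Placeholder "training step" standing in for a real model update.
    return params - 0.01 * jnp.mean(batch) * params

# Describe input shapes/dtypes abstractly so compilation can happen
# before any real data is available.
params_spec = jax.ShapeDtypeStruct((1024,), jnp.float32)
batch_spec = jax.ShapeDtypeStruct((32, 1024), jnp.float32)

compiled = jax.jit(train_step).lower(params_spec, batch_spec).compile()

# The compiled executable is then called with concrete arrays of matching shapes.
out = compiled(jnp.ones((1024,), jnp.float32), jnp.ones((32, 1024), jnp.float32))
print(out.shape)
```

Keeping pod start latency low across restarts additionally depends on persisting and reusing the compiled artifacts (e.g. a compilation cache), which this sketch doesn't cover.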
Thanks for the context. I remember recently reading a paper, from Baidu I think, claiming a container arrival rate in the millions per second, which made it practical to operate their whole site in the style of lambda/cloud functions.
Actually, now that I'm searching for it, it seems Baidu has a number of papers on workload orchestration at scale, specifically for machine learning.
I'll note a trend I've observed in recent ML: as we increasingly use accelerators and models correspondingly grow in size, we're returning to a "one machine, one workload" paradigm for the biggest training and inference jobs. You might have 8k accelerators but only 1,000 machines, and with one container per host, 300 schedules/second is fast.
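To make that concrete (with made-up but plausible numbers): 8,192 accelerators at 8 per host is about 1,024 hosts, so scheduling one pod per host at ~300 pods/sec takes only a few seconds.

```python
# Rough numbers, purely illustrative.
accelerators = 8192
per_host = 8
sched_per_sec = 300

hosts = accelerators // per_host        # 1024 machines
print(hosts / sched_per_sec)            # ~3.4 s to schedule one pod per host
```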
At the same time, as you note, we have functional models of container execution that are approaching millions of dispatches for highly partitionable work, especially in data engineering and ETL.
(Contributor on the blog post, all opinions my own)
Agreed, and we definitely weren't trying to brag! This is fast compared to people's expectations in the space, but slow compared to what we should be able to accomplish, and will accomplish, in the future.