off the shelf: https://hradec.com/ebooks/CGI/RMS_1.0/rfm/User_Interface/Alf...
although that was with something like 6-10k nodes, because there was an upper limit on how many dispatches alfred could do: it was single-threaded, dated from the early 90s, and was not really designed to scale that high.
https://renderman.pixar.com/tractor is probably what they use now, or https://www.opencue.io/
but any grid-engine-style dispatcher/manager will do what you want. It'll give you the primitives to manage a wildly larger scale than k8s.
These clusters ran on real steel (bare metal), since elastic clusters were horrendously expensive and the storage was, and still is, nowhere near fast enough.
Nowadays, I'd use AWS Batch, or at a push Airflow.
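To make "primitives" a bit more concrete: roughly, I mean a queue of tasks, dependencies between them, and a fixed number of slots to run them in. Here is a toy sketch in Python. It is not how alfred, Tractor, or OpenCue actually work; `Task`, `dispatch`, and `slots` are names made up for illustration, and tasks are assumed to finish instantly.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # Names of tasks that must finish before this one may start.
    deps: set = field(default_factory=set)


def dispatch(tasks, slots):
    """Hand out tasks in dependency order, at most `slots` per pass.

    Returns a list of "waves"; each wave holds the task names
    dispatched together in one scheduling pass.
    """
    pending = {t.name: t for t in tasks}
    done = set()
    waves = []
    while pending:
        # A task is ready once all of its dependencies are done.
        ready = sorted(n for n, t in pending.items() if t.deps <= done)
        if not ready:
            raise ValueError("dependency cycle or missing dependency")
        wave = ready[:slots]
        for name in wave:
            del pending[name]
        done.update(wave)  # toy assumption: the whole wave finishes at once
        waves.append(wave)
    return waves


# Three frames to render, then one composite that needs all of them.
frames = [Task(f"frame{i:03d}") for i in range(1, 4)]
comp = Task("composite", deps={t.name for t in frames})
print(dispatch(frames + [comp], slots=2))
```

The whole loop runs in a single thread, which is the kind of design that puts a ceiling on dispatches per second once you get to thousands of nodes.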