Carving the scheduler out of our orchestrator

schmichael · on Feb 2, 2023

As the Nomad Team Lead, this article is a gift - thank you Fly! - even if they're transitioning off of Nomad. Their description of Nomad is exactly what I would love people to hear, and their reasons for DIYing their own orchestration layer seem totally reasonable to me. Nomad has never wanted people to think we're The One True Way to run all workloads.

I hope Nomad covers cases like scaling-from-zero better in the future, but to do that within the latency requirements of a single HTTP request is quite the feat of design and implementation. There's a lot of batching Nomad does for scale and throughput that conflict with the desire for minimal placement+startup latency, and it's yet to be seen whether "having our cake and eating it too" is physically possible, much less whether we can package it up in a way operators can understand what tradeoffs they're choosing.

I've had the pleasure of chatting with mrkurt in the past, and I definitely intend to follow fly.io closely even if they're no longer a Nomad user! Thanks again for yet another fantastic post, and I wish fly.io all the best.

tptacek · on Feb 2, 2023

This whole article started with me rewatching your Nomad deep dive video and then chasing papers. :)

Dowwie · on Feb 3, 2023

Are you referring to, "Nomad Under the Hood" [1]? Which papers did you find useful?

[1] https://www.youtube.com/watch?v=m6DnmVqoXvw

sitkack · on Feb 3, 2023

I haven't watched the whole video so I don't know about papers referenced, but between chatgpt and papers mentioned in the source repo, here is a good starting list.

Large-scale cluster management at Google with Borg https://research.google/pubs/pub43438/

Omega: flexible, scalable schedulers for large compute clusters https://research.google/pubs/pub41684/

The Chubby lock service for loosely-coupled distributed systems https://disco.ethz.ch/courses/hs08/seminar/papers/osdi06-goo...

Apache Mesos https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Apac...

mentioned in the nomad repo and docs site

Sparrow: Distributed, Low Latency Scheduling https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf

SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/...

Raft: In search of an Understandable Consensus Algorithm https://raft.github.io/raft.pdf

----

Finally, take a look at the papers referenced on semanticscholar

https://www.semanticscholar.org/search?q=nomad%20orchestrati...

jallmann · on Feb 2, 2023

This was a great article. While I was at Livepeer (distributed video transcoding on Ethereum [1]), we converged onto a very similar architecture, disaggregating scheduling into the client itself.

The key piece is to have a registry [2] with a (somewhat) up-to-date view of worker resources. This could actually be a completely static list that gets refreshed once in a while. Whenever a client has a new job, they can look at their latest copy of the registry, select workers that seem suitable, submit jobs to workers directly, handle re-tries, etc.

One neat thing about this "direct-to-worker" architecture is that it allows for backpressure from the workers themselves. Workers can respond to a shifting load profile almost instantaneously without having to centrally deallocate resources, or wait for healthchecks to pick up the latest state. Workers can tell incoming jobs, "hey sorry but I'm busy atm" and the client will try elsewhere.

This also allows for rich worker selection strategies on the client itself; eg it can preemptively request the same job on multiple workers, keep the first one that is accepted, and cancel the rest, or favor workers that respond fastest, and so forth.

[1] We were more of a "decentralized job queue" than "distributed VM scheduler" with the corresponding differences, eg shorter-lived jobs with fluctuating load profiles and our clients could be thicker. But many of the core ideas are shared, even our workers were called "orchestrators" which in turn could use similar ideas to manage jobs on GPUs attached to it... schedulers all the way down!

[2] Here the registry seems to be constructed via the Corrosion gossip protocol; we used the blockchain with regular healthcheck probes.

tptacek · on Feb 2, 2023

I'm worried that I'm meaner about K8s in this than I mean to be, which would be a problem not least because I don't have enough K8s experience to justify meanness. I'm really more just surprised at how path-dependent the industry is; even systems that were consciously built not to echo Borg, even greenfield systems like Flynn that were reimaginactiments of orchestration, all seem to follow the same model of central, allocating schedulers based on distributed consensus about worker inventory.

vidarh · on Feb 2, 2023

Be mean about K8s, it can take it. I don't think any of the criticism of it is unreasonable. It's a huge, convoluted beast. To me, K8s makes most sense in terms of creating a swiss army knife that lots of people know.

You can do far better than K8s for specialised cases like yours and/or with people who has the right skills. But for a lot of people with simpler needs it's better to just stick to what is easy to hire for than look for an optimal solution.

intelVISA · on Feb 2, 2023

> You can do far better than K8s

"Nobody ever got fired for K8s" is how I imagine most SWEs justify using such a crappy system. Kinda like SAP.

btown · on Feb 2, 2023

However bad you think k8s is, SAP is worse on its best day.

snotrockets · on Feb 3, 2023

It ain't that. It's that k8s-aas is available everywhere (and if it's not, then it'd be soon). So knowledge is transferable.

And most don't want to, and shouldn't setup their own orchestrator. It's painful, cumbersome, and a waste of money, where you could get a k8s-aas everywhere.

Fly are the rare example of a company that does need to run their own orchestration, and indeed, k8s doesn't fit them (it rarely does, in such cases. Unless you sell k8s-aas).

vidarh · on Feb 3, 2023

I mostly agree with you, but would qualify that by saying that writing a specialized orchestrator is not that hard for a lot of cases. I have done so a couple of times.

Lots of decisions will likely end up similar to K8S - e.g. for one I ended up putting SkyDNS/CoreDNS for service discovery on top of Etcd and using a Nginx handler to route ingress - when we first upgraded our (many years old) system to doing that, K8S was not yet widely used and I was unaware K8S did the same.

But if you have a specific use case and understand it well, it can be worth it.

However, even if you have a situation like that where it's genuinely cost effective for you to write your own, a big reason why K8S might still be a better choice is that you can hire people who know K8S - if you build your own you need to worry about knowledge transfer for something very critical to your operation.

I think writing orchestrators is fun, yet I still think most people shouldn't do so for production (but write a toy one as a learning experience anyway)

snotrockets · on Feb 3, 2023

A lot of things are simple to write a naive implementation of, and sometime they useful to solve a small problem, even if just to prove this is the way to go, and its worth spending on a bigger, generalized and maybe more expensive to maintain in the short term solution.

I myself have written smol messaging queues, metrics collectors, mesh router, and a few more services that once it was agreed that they had a place in our systems, we migrated to NIH software, so we could enjoy the shoulders of giants (or at least buy support for it, rather than bus factoring it on myself and another colleague).

vidarh · on Feb 4, 2023

A production quality generic orchestrator is a lot of work because it needs to handle a vast amount of different scenarios (e.g. the list of volume types supported by k8s is vast). An orchestrator tailored to specific workloads and circumstances is often not for someone with a reasonable amount of devops experience. Hence about 1k lines of straightforward code for one of the production specialised orchestrators I wrote.

You don't avoid having to account for bus factoring anything by running k8s. It's dangerous to assume so. A k8s config for a similar sized fleet would've been many times the size of the custom orchestrator (heck, the k8s config for my home test k8s-cluster is bigger than that orchestrator), and you still need to train people on the architecture. In a custom orchestrator - at least the ones I've written - it's natural to encode a lot of assumptions about architectural choices in the code, and so all our devs could read the code and know the architecture if needed, without any presumption of needing to train them on tools they were unfamiliar with.

The reason I still agree most people should just run k8s is that it's easier to hire people who understand k8s than to hire people experienced to do a good job developing a custom setup in the first place. Our devs could trivially read and understand and modify the orchestrator once it was written as needed, but that's not the same as assuming they had the experience to make the architectural choices encoded in it in the first place.

K8s is a suitable straight-jacket to stop people from making really stupid choices, and so it can substitute for a lot of expensive and rare experience, and that's good. There are still plenty of places where there's room to do custom work.

EDIT: To add, I think a useful guideline to this is: Are you writing generic code that there's an existing solution to, or is your code effectively defining policy, architecture or config in a compact way? If the former, I'd agree you need a really good reason for doing so. In the latter, much less so. E.g my custom orchestrators have been small because 90% of the functionality we needed was there in other components - often the same components k8s ended up using - but where k8s needs to provide generic apis to allow applying a generic configuration, we could encode a specific architecture in the way we structured the orchestration. The availability of those components, incidentally, has gotten better thanks to the presence of the big, complex generic orchestration solutions, making it easier than ever to do custom orchestration when it works for you.

vidarh · on Feb 3, 2023

Surprisingly often "crappy but sufficient and possible to hire for" beats good.

Hiring good devops people with enough experience of alternatives to be able to pick an alternative or build something custom that is good enough to be worth it is hard.

rco8786 · on Feb 3, 2023

Ha! I said exactly this in a slack convo at work the other day.

AYBABTME · on Feb 3, 2023

It only became fashionable to call k8s crappy because it effectively won the orchestration wars. In reality it's quite good, and it changed the game for everyone. The hype for it went overboard a good bit so some swing back is justified. It's too complex for many use cases. The API surface isn't perfect but it's rather consistent and well defined, the project's codebase is mostly a good example of style at this size. Really I think the authors of k8s should be proud of what they built.

vidarh · on Feb 3, 2023

I think the complexity is the main source of people thinking it's "crappy". It certainly my main issue. E.g. I've built orchestrators from scratch smaller than some k8s configs I've worked on. However they were special purpose - they worked well for the specific type of loads, and for that they were suitable because their tiny size meant they were easy to tweak, but they'd have been totally inappropriate as a general-purpose alternative to k8s.

And so even though I don't like it much, I think it's fine to default to k8s for people (including at my current day job), and the default will always get a lot of crap because the default choice will get pressure to include all kinds of functionality that makes it more complicated in order to support all of the very diverse workloads that don't - and likely never will - justify a custom solution.

lupex · on Feb 2, 2023

Borg–at least as it is published in those papers()–and these other schedulers are really design to maximize the availability of large distributed apps.

() The real Borg now does a lot more...

But there is a completely different world if you look elsewhere. In batch systems like supercomputers (HPC), you want to maximize throughput, not availability. It is common to have something closer to your design: a inventory of jobs and scheduler that allocates workers to them. E.g. https://hpc-wiki.info/hpc/Scheduling_Basics

ilyt · on Feb 3, 2023

> all seem to follow the same model of central, allocating schedulers based on distributed consensus about worker inventory.

Because frankly, it's easy to code. You just look thru list of what you have, list of what you put on it, and assign based on this and that algorithm.

It keeps workers simple - register to cluster, send status (or wait for healthckeck), wait for request to start the job, done, you got a worker.

It keeps the scheduler to just be scheduler - take state, apply list of "changes" (jobs to start/stop), execute.

It keeps client simple - submit to scheduler, observe progress.

It is also easy to debug. You just have one place doing the scheduling (and not a bunch of workers playing the "stock market" of resources), running on one node without distributed mess to worry about. Any test is just "feed this state to code and see whether it reacted as we need". Any job that got "rejected" for scheduling have clear reason.

It also allows bigger optimizations to happen from point of view of a cluster.

Some designs are reinvented so many times because they are just good or easy to implement first approximation of solution.

linuxftw · on Feb 2, 2023

The nice thing about a centralized scheduler is the code for scheduling only needs to run in one place. If each node 'bids' on the workload, you need to run scheduling fit on every node for every application. You also need to work out a way to decide who wins the bid.

Every time you want to change your scheduling logic, guess what? Gotta update all the nodes.

The k8s scheduler is battle-tested. Across all the different environments and CI pipelines, it has scheduled billions, if not trillions, of pods at this point. If you have a better way of scheduling workloads across nodes, I advise you to make it work for kubernetes and sell it as an enterprise scheduler, and you'll make tons of money. Even if it's not better for the general use case, and just better for some edge (pun intended) use case, it would do quite well.

IMO, all roads lead to k8s. You're just eventually going to have to solve the same set of problems.

tptacek · on Feb 2, 2023

The code for scheduling in mainstream schedulers runs in lots of places, which is why mainstream schedulers tend to run Paxos or Raft.

zekrioca · on Feb 3, 2023

Distributed dbs are used to keep a reasonable recent view of resources. The logic itself for scheduling is centralized.

kalev · on Feb 2, 2023

I’m annoyed by the way this is written. The topic is super interesting but the author tried to hard to be funny and being a non-native reader it’s difficult to determine if certain words are technical jargon or trying to be funny.

romantomjak · on Feb 2, 2023

I'm quite the opposite. This style of writing is quite refreshing. It's playful, has enough technical details and is easy to read.

Dowwie · on Feb 2, 2023

I think it succeeded at being funny. tptacek is a Nümad lad.

tptacek · on Feb 2, 2023

I'm mostly just curious about which terms the reader thinks are made up.

titanomachy · on Feb 3, 2023

I particularly liked "Katamari Damacy scheduling", but I would guess that at least some of the cultural references (dilithium crystals, Marcellus Wallace) could be confusing to folks not steeped in a particularly nerdy subset of American-ish culture.

I think if references and jokes are landing for (say) 80% of your target audience then you're doing pretty well.

I liked the article. Interesting approach to scheduling as kind of a market exchange problem. It's cool to see what an infra service can look like when you ditch the constraints and path dependence that led to the "standard" cloud architecture and build something totally new. The user experience seems like multi-tenant Borg, but nimbler.

lmm · on Feb 3, 2023

That particular use of "Kleenex" is an Americanism I had to consciously interpret.

tptacek · on Feb 3, 2023

D'oh, I meant to link Katamari.

coredog64 · on Feb 2, 2023

Although it’s currently archived, there’s another open source orchestrator that’s similar to Borg: Treadmill.

AFS solves for Colossus, with packages being distributed into AFS.

[0] https://github.com/morganstanley/treadmill

filereaper · on Feb 2, 2023

Mesos always had these notions of two level scheduling that let you build your own orchestration.

Aurora, Marathon, etc... would add the flavor of Orchestration that's needed. Mesos provided the resources requested.

https://mesos.apache.org/documentation/latest/architecture/

necubi · on Feb 2, 2023

I remain sad that Mesos never really took off, and then k8s ate the market. It had a lot of really clever ideas and could do stuff well that k8s still can't (although it has been catching up, ten years later). In particular, the scheduler architecture allowed it to work well for both batch workloads and service-like workloads within a single resource pool.

tptacek · on Feb 2, 2023

Borg, Omega, K8s, and Nomad all have a separate scheduling pathway for batch jobs, don't they? My understanding is that this is the big complexifier for all distributed schedulers: that once you have two different service models for scheduling (placement sensitive delay insensitive, and placement insensitive delay sensitive), you now have two scheduler process, and you have to work out concurrency between their claims on resources.

The Omega design in particular is, like, ground up meant to address this problem; the whole paper is basically "why the Mesos approach is suboptimal for this problem".

necubi · on Feb 2, 2023

It's true that Omega was designed specifically to solve this problem well (and one of the big insights from the original Borg paper is how efficient it is to have heterogenous workloads on a single cluster). But (as a non-google person with no inside info) I think the amount of omega inspiration in kubernetes is overstated.

Until a few years ago there was very limited support for pluggable schedulers in kubernetes (I think it's gotten better recently but I'm a bit out of date). The beauty of Omega and Mesos is that schedulers were a first-level concept. How Spark thinks about job scheduling will not be the same as how Presto does, or a real-time system like Flink. Mesos provided a framework where you could create custom schedulers for each class of application you run.

For example, for a query engine you might care about scheduling latency more than anything else, and you'd be willing to accept subpar resources if you're able to get them more quickly. For a Spark job, the scheduler can introspect the actual properties of the job to figure out whether a particular resource is suitable.

tptacek · on Feb 2, 2023

For what it's worth: I didn't know if there was any Omega inspiration in K8s at all; I just know that Nomad was heavily inspired by it.

GauntletWizard · on Feb 3, 2023

There was a ton of Omega inspiration in K8s, in the sense that Omega was mostly a failed software project, and while the scheduler potions ended up being backported, K8s mostly avoided doing the hard things and settled for good enough because of it.

ilyt · on Feb 3, 2023

Yeah it was neat and oh so much less complex to setup than k8s from scratch.

But k8s is not just job scheduler, it comes with (at hefty complexity cost) with all the piping and plumbing around it so it appealed to developers that could pop out a single .yaml that described their whole architecture.

GauntletWizard · on Feb 3, 2023

That's precisely the reason Mesos lost - It didn't come batteries included with a scheduler, and nobody ever packed Aurora+Nomad nicely.

mochomocha · on Feb 2, 2023

I have my own fair share of complaints about k8s, but I can't say the author articulates clearly what is wrong with k8s scheduler exactly.

IMO it's one of the better part of k8s. The core scheduler is pretty well written and extensible through scheduling plugins, to implement whatever policies you heart desires (which we extensively make use of at Netflix).

The main issue I have with it is the lack of built-in observability, which makes it non-trivial to A/B test scheduling policies in large scale deployment setups because you want to be able to log the various subscores of your plugins. But it's so extensible through NodeAffinity and PodAffinity plugins that you can even delegate part (or all!) of the scheduling decisions outside of it if you want.

Besides observability, one issue we've had to overcome with k8s scheduling is the inheritance of the Borg design decisions around pod shape immutability, which makes implementing things like oversubscription less easy in a "native" way.

tptacek · on Feb 2, 2023

Nothing is wrong with the k8s scheduler!

Really, nothing is wrong with k8s at all, beyond our more general problem of "our users want to run Linux apps, not k8s apps".

K8s, Borg, Omega, Flynn, Nomad, and to some extent Mesos all share a common high-level scheduler architecture: a logically centralized, possibly distributed server process that functions like an allocator and is based on a consistent view of available cluster resources.

It's a straightforward and logical way to design an orchestrator. It's probably the way you'd decide to do it by default. Why wouldn't you? It's an approach that works well in other domains. And: it works well for clusters too.

The point of the post is that it's not the only way to design an orchestrator. You can effectively schedule without a centralized consistent allocator scheduler, and when you do that, you get some interesting UX implications.

They're not necessarily good implications! If you're running a cluster for, like, Pixar, they're probably bad. You probably want something that works like Borg or Omega did. You have a (relatively) small number of (relatively) huge jobs, you want optimal placement†, and you probably want to minimize your hardware costs.

We have the opposite constraints, so the complications of keeping a globally consistent real-time inventory of available resources and scheduling decisions don't pay their freight in benefits. That's just us, though. It's probably not anything resembling most k8s users.

† In fact, going back even before Borg but especially once Borg came on the scene, mainstream schedulers have been making this distinction --- between service jobs and batch jobs, where batch jobs are less placement sensitive and more delay sensitive. So one way to think about the design approach we're taking is, what if the whole platform scheduler thought in terms of a batch-friendly notion of jobs, and then you built the service placement logic on top of it, rather than alongside it?

audi0slave · on Feb 2, 2023

I wonder if you ever looked at sparrow paper[1] which came out Ion Stoica's lab. Its also a decentralized in nature focusing on low latency placement.

[1] https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf

tptacek · on Feb 2, 2023

Lot of similarities!

At a low level, there are distinctions like Sparrow being explicitly p2c based (choose two random workers, then use metrics to pick the "best" between them). We don't p2c at all right now; part of our idea is being able to scale-from-zero to serve incoming requests with sleeping Fly Machines, rather than keeping everything online. And, of course, we run whole VMs (/Docker containers), not jobs with fixed definitions running on long-running executors --- but that's just a detail.

My sense of it is that the big distinction though is that you'd run Sparrow alongside a classic scheduler design like Mesos or Borg or Omega. You'd schedule placement-sensitive servers with the Borglike, and handle Lambda-style HTTP request work with the Sparrowlike.

What we're working on instead if whether you can build something sufficiently Borg-like on top of something Sparrow-like, using "Sparrow-style" scheduling as a low-level primitive. This doesn't quite capture what we're doing (the scheduling API we export implicitly handles some constraints, for instance), but is maybe a good way to think about it.

I should have cited this paper! Thanks for linking it.

mochomocha · on Feb 2, 2023

I have a hard time being on board with the logic. Neither k8s nor mesos guarantee optimal placements (which as you pointed out is unrealistic given the NP-hardness of the problem), in fact they are explicitly trading placement quality for lower scheduling latency - and are mostly designed to schedule really fast.

Mainstream schedulers make no distinction between service & batch jobs out of the box - you need to explicitly go out of your way to implement differential preferences when it comes to scheduling latency for example.

I am surprised cost isn't a concern for you guys, I actually had assumed you went all in on bin-packing+firecracker oversubscription to maximize your margin.

tptacek · on Feb 2, 2023

Nothing guarantees optimal placement, but all the mainstream schedulers attempt an approximation of it. The general assumption in mainstream schedulers is that servers are placement-sensitive, and batch jobs aren't. If you assume there's only one kind of job (all servers, or all batch jobs), the design of a scheduler gets a lot simpler; like, most of what the literature talks about with respect to Mesos and Borg is how to do distributed scheduling for varying schedule regimes.

I like how I worded it earlier: instead of having a complex scheduler that keeps a synchronized distributed database of previous schedulings and available resources, what if you just scheduled everything as if it was a placement-insensitive batch job? Obviously, your servers aren't happy about that. But you don't expose that to your servers; you build another scheduling layer on top of the batch scheduling.

That's not precisely what we're doing (there are ultimately still constraints in our service model!) but it sort of gets at the spirit of it.

As for cost: we rent out server space. We rack servers to keep up with customer load. The more load we have, the more money we're making. If we're racking a bunch of new servers in FRA, that means FRA is making us more money. Ideally, we want to rack enough machines to have a decent amount of headroom past what we absolutely need for our current workload --- if our business is going well, customer load is consistently growing. So if ops is going well, we've generally got more machines than a mainstream scheduler would "need" to schedule on. But there's no point in having those machines racked and doing nothing.

At some point we'll reach a stage where we'll be more sensitive to hardware costs and efficiency. Think of a slider of sensitivity; we're currently closer to the insensitive size, because we're a high-growth business.

scottlamb · on Feb 3, 2023

> Nothing guarantees optimal placement, but all the mainstream schedulers attempt an approximation of it. The general assumption in mainstream schedulers is that servers are placement-sensitive, and batch jobs aren't.

I don't know the open source schedulers well, but modern Borg batch scheduling works quite differently than (my understanding of) your description. Batch jobs are still often placement-sensitive (needing to run near a large database, for example, for bandwidth reasons). The big distinction I see is that where serving tasks get tight SLOs about scheduling latency, evictions/preemptions, and availability of resources on the machine it's scheduled on, batch tasks basically don't. They can take a while to schedule, they can get preempted frequently, and they cram into the gap between the serving tasks' current usage and limit. E.g., if a serving job says it needs 10 cores but is only using 1, a batch job might use 8 of those then just kinda suffer if the serving job starts using more than 2, because CPU is "compressible", or it will be evicted if things get really bad. In the same situation with RAM (mostly "incompressible"), the batch job gets evicted ASAP if the serving job needs the RAM, or the system involves some second-class RAM solution (cross numa node, Optane, zramfs, rdma, page to SSD, whatever). Batch doesn't get better service in any respect, but it's cheaper.

> As for cost: we rent out server space. We rack servers to keep up with customer load. The more load we have, the more money we're making. If we're racking a bunch of new servers in FRA, that means FRA is making us more money.

and in the article, you wrote:

> It was designed for a cluster where 0% utilization was better, for power consumption reasons, than < 40% utilization. Makes sense for Google. Not so much for us.

IIUC, you mean that whoever you're renting space from doesn't charge you by power usage, so you have no incentive to prefer fully packing 1 machine before scheduling something on every machine available. Spreading is fine. Makes sense economically (although I'm a little sad to read it environmentally because the power usage difference should still be real).

I think another aspect to consider is avoiding "stranded" resources. Avoiding situations in which say a task that needs most/all of a machine's remaining RAM but very little CPU gets scheduled on a machine with a whole bunch of CPU available, effectively making that CPU unusable until something terminates. You've got headroom, but I presume that's based on forecasted need, and if that gets higher because you'll still have stranded resources when that need comes, the stranding is costing you real money.

Maybe this problem is avoided well enough just by spreading things out? or maybe you don't allow weird task shapes? or maybe (I'm seeing your final paragraph now about growth) it's just not worth optimizing for yet?

toast0 · on Feb 3, 2023

> Makes sense economically (although I'm a little sad to read it environmentally because the power usage difference should still be real).

Does fully loading 4 cores in one server save power over fully loading 2 cores in 2 servers? If you turn off the idle server, probably yes? If not, I'd have to see measurements, but I could imagine it going either way. Lower activity means less heat means lower voltage means less power per unit of work, maybe.

You're likely to get better performance out of the two servers though (which might not be great, because then you have a more variable product).

GauntletWizard · on Feb 3, 2023

In modern CPUs with modern thermal management approaches, it's probable that fully loading two cores in two servers is much more efficient than even powering off the second server, because in each machine the primary delta in power draw between idle-state and max-load is in thermal management (fans), and running cores more distributed will increase passive cooling, as well as allowing the CPU cores that are in use to run in more energy-efficient modes.

That said, I haven't done the actual math here, just seen the power draw benchmarks that show idle -> single core draw -> all core draw as a curve with idle and single core usage well under the ratio of number of cores, without even accounting for the fact that each core is more performant under single-core workloads.

scottlamb · on Feb 3, 2023

> Does fully loading 4 cores in one server save power over fully loading 2 cores in 2 servers?

That's the premise, and I have no particular reason to doubt it. There are several levels at which it might be true, from using deeper sleep states (core level? socket level?) to going wild and de-energizing entire PDUs.

> You're likely to get better performance out of the two servers though (which might not be great, because then you have a more variable product).

Yeah, exactly, it's a double-edged sword. The fly.io article says the following...

> With strict bin packing, we end up with Katamari Damacy scheduling, where a couple overworked servers in our fleet suck up all the random jobs they come into contact with. Resource tracking is imperfect and neighbors are noisy, so this is a pretty bad customer experience.

...and I've seen problems along those lines too. State-of-the-art isolation is imperfect. E.g., some workloads gobble up the shared last-level CPU cache and thus cause neighbors' instructions-per-cycle to plummet. (It's not hard to write such an antagonist if you want to see this in action.) Still, ideally you find the limits ahead of time, so you don't think you have more headroom than you really do.

tptacek · on Feb 3, 2023

No, it's not that power usage for us is free, it's that the business is growing (like any startup), so there is already a constant expansionary pressure on our regions; for the foreseeable future (years), our regions will tend to have significantly more servers than a scheduler would tell us we needed. Whatever we save in power costs by keeping some of those servers powered off, we lose in technical operational costs by keeping the rest of the servers running artificially hot.

steve_gh · on Feb 3, 2023

Actually, on thinking about it a bit more, it has a lot of similarities to a whole bunch of coordination problems that I worked on many years ago in the mobile phone industry. There was a problem was coordinating resource usage between a set of mobile phone base stations.

In the OP, you talked about a payoff curve for an individual server taking on a job based on its current loading. This is effectively a metric for the desirability of that server taking on another job, and that in turn would allow the supervisor to auction off jobs, assigning each job to the server which will benefit most from taking it.

You could even go one step further and implement a genetic algorithm classifier system. Each server would have a rule set for bidding for jobs coded it into the GA genotype. Every once in awhile a lower priority coordination process runs a generation of the GA, replacing the rule set on some of the less efficiently running servers with the offspring of the other more efficiently running servers. In this way the population of surfers evolves towards rulesets for bidding for jobs the drive the efficient use of your server resources.

mwcampbell · on Feb 2, 2023

> Think of a slider of sensitivity; we're currently closer to the insensitive size, because we're a high-growth business.

If the big-tech layoffs are any indication of a general trend toward belt-tightening, then reducing headroom, and even over-subscribing servers shared-hosting-style, must be tempting.

vidarh · on Feb 3, 2023

Even in a case where they for some reason might want to oversubscribe or at least keep the number of servers smaller the scheduling decision they want seems to be different.

It's the difference between "fit this into the smallest number of nodes so that we can shut down some nodes [possibly rented by the hour] if possible" vs "these nodes are sitting there whether or not they're doing anything, so spread the nodes across them".

In the long run there may come a time where packing tighter matters, e.g. if they become big enough to run their own data centres, where shutting down servers dynamically to save power might become worth it, but typical colo contracts for smaller number of racks rarely makes it worth your while to turn servers off.

tptacek · on Feb 2, 2023

Not for us, right now. We have if anything the opposite problem.

steve_gh · on Feb 3, 2023

The whole thing really comes down to a tension between on the one hand trying to find a globally optimal solution, which requires lots of state coordination, and locally optimal solutions, which are unlikely to be as good, but do not have the coordination overhead.

yencabulator · on Feb 3, 2023

This made me follow the link to https://fly.io/blog/ipv6-wireguard-peering/ and I think you have a copy-pasto:

> Our WireGuard mesh sees IPv6 addresses that look like fdaa:host:host::/48 but the rest of our system sees fdaa:net:net::48.

That was probably mean to be net:host -> host:net

dijksterhuis · on Feb 3, 2023

$CURRENT_JOB is breaking apart and refactoring an nightmarish "orchestration application" where about 70% of the codebase solely in django.

will be referencing and sharing this fairly frequently I reckon so massive thank you @tptacek

anyas · on Feb 3, 2023

What is the rationale for having the warm spares? If you're already able to spin up new VMs within an HTTP request, that seems pretty fast. Is the extra complexity just to make that path even faster?

tptacek · on Feb 3, 2023

The "warm spare" isn't running, but it's pre-loaded onto a particular worker. The slowest part of starting an app from scratch is checkout out the container; "warming up" the worker with the "spare" machine eliminates that.

nitwit005 · on Feb 3, 2023

I suppose I'm more curious how you tested this, than the basic idea. For example, if Nomad was doing unwanted movement of apps between machines, how do you prove the replacement code won't?

plaidfuji · on Feb 2, 2023

> With strict bin packing, we end up with Katamari Damacy scheduling, where a couple overworked servers in our fleet suck up all the random jobs they come into contact with.

korijn · on Feb 2, 2023

I can't shake the sense that they seem to just prefer building something new (and blog about it) over figuring out how to configure a/the scheduler properly.

Anyway, they did get it done and made it work, so whatever.

mrkurt · on Feb 2, 2023

We built Fly.io to treat chronic NIH syndrome.

korijn · on Feb 2, 2023

I don't understand how that relates to my comment.

Anyway, I'm happy you seem to be successful. Keep up the good work.

linuxftw · on Feb 2, 2023

NIH syndromes is 'not invented here syndrome.' This equates to the 'build something new and blog about it' remark you made, IMO.

vidarh · on Feb 3, 2023

Sometimes building something new is easier than beating a recalcitrant tool into submission.

ilyt · on Feb 3, 2023

And most times it's the devs being bored wanting to build something neat.