The article is a bit unclear because it's lacking the proper vocabulary. Priorities and deadlines (what the article calls "SLOs") are both valid ways to approach scheduling problems with different tradeoffs.
The fixed priority systems the article talks about trade off optimal "capacity" utilization for understandable failure dynamics in the overcapacity case. When you're over capacity, the messages that don't go through are the messages with the lowest priority. It's a nice property and simple to implement.
What the article proposes is better known as deadline scheduling. That's also fine and widely used, but it has more complicated failure dynamics in the overcapacity case. If your problem domain doesn't have an inherent "priority" linked to the deadlines, that may be acceptable, but in other cases it may not be.
Neither is inherently better, and there are other approaches with yet different tradeoffs.
I've seen enough versions of these systems and reached the conclusion that the best you can do is shift lower-priority work off-peak, but you still need to be overprovisioned, or else a p99 event will stall the system with too much work. People will talk about priorities, but on the scale of one day, almost everything is high priority.
In multitenant situations this is even more complicated, because I may have a time slice of the machine I’m “entitled” to, and within that I have priorities with which I want to allocate things. If one of my tasks can’t complete, I hope it’s to the benefit of another of my priorities. Not someone else’s.
I suspect you ultimately need a little language for this. Decoding priorities for frames in a video stream are already complex, and those are trivial compared to the sorts of scenarios Conway’s Law introduces.
Yeah, I don't see how deadlines/SLOs do any better when there's insufficient capacity to meet those of queued jobs at their concurrency.
> I don’t think it’s possible to meet a latency target through prioritization when there is a fundamental lack of capacity.
This seemingly implies that it would be achievable with target deadlines/SLOs. At capacity, I don't see a better solution than giving each priority (or target latency) a defined minimum resource allocation.
Under deadline scheduling, every pending job _eventually_ has highest priority as time elapses. (Assuming new jobs can’t arrive with deadlines in the past.) Every job is eventually serviced.
The “pain” experienced in an overload situation is spread among all late jobs. Contrast this with fixed-priority scheduling, where the lowest-priority jobs will be starved completely until the overload is resolved.
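A minimal sketch of that dynamic in Python (my own, not from the article; just the textbook earliest-deadline-first ordering):

```python
import heapq
import itertools
import time

class DeadlineQueue:
    """Earliest-deadline-first: the job whose absolute deadline is nearest runs next."""
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so jobs themselves are never compared

    def push(self, job, max_latency_s):
        deadline = time.monotonic() + max_latency_s
        heapq.heappush(self._heap, (deadline, next(self._tie), job))

    def pop(self):
        # Whatever sits at the top of the heap is the most urgent pending job
        # right now. Late jobs are never starved; as time passes they simply
        # become the head of the heap and get serviced.
        deadline, _, job = heapq.heappop(self._heap)
        return job, deadline
```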
That's what I was saying: in the case of insufficient capacity, the available capacity is divided among the priorities in non-equal but non-zero portions. The SLO/deadline method would effectively be doing something similar, as everything would be overdue and the most overdue gets highest priority to run. The only difference is that there's no unequal portioning unless there's additional logic to say x overdue of job A is more important than x overdue of job B, which amounts to setting priorities in the end.
> The “pain” experienced in an overload situation is spread among all late jobs. Contrast this with fixed-priority scheduling, where the lowest-priority jobs will be starved completely until the overload is resolved.
Though this is often not a good way to spread out the pain.
It's probably much worse for the 15s job to be a minute late than for the 8h job to be a minute late, but basic deadline scheduling will treat them the same. So you want sharing, but uneven sharing.
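One way to get that uneven sharing (a sketch, assuming each job carries its own latency tolerance): rank by lateness relative to the job's max latency rather than absolute lateness.

```python
def urgency(now_s, enqueued_at_s, max_latency_s):
    """Relative lateness: 1.0 means 'exactly at the deadline'.
    A 15s job that is a minute late scores 75/15 = 5.0, while an 8h job
    a minute late scores ~1.002, so the short job wins the next worker."""
    return (now_s - enqueued_at_s) / max_latency_s
```

Workers would then pop whichever pending job currently has the highest urgency.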
Thanks for sharing some of your insight into the problem domain. Do you happen to have any reference for laymen like me to onboard onto the subject? It's an awfully interesting topic, but when I stumble upon it I tend to fall back on JIT learning and troubleshooting, which is far from the best position to be in.
This is a class of problems known as scheduling. The Wikipedia page is a good place to start [0]. It focuses on task schedulers rather than message queues, but the same principles apply. For a deeper theoretical basis, Wiley has a good book [1]. Most undergrad curricula will have a module on this as well, so you can find info in most comp sci textbooks.
Another important aspect that seems to be overlooked in a lot of these discussions (specifically for Rails, but probably relevant in other contexts as well) is that if your jobs are very heterogeneous in terms of their degree of parallelizability then it will be very difficult to do effective provisioning and capacity planning.
Let's say you have two types of jobs: one that is highly parallelizable (e.g. a request to a third-party API), and one that doesn't parallelize well (video transcoding using 100% of available CPU). Then you'd want a lot of threads assigned to each CPU for the first type, but very few for the second type. If they're on the same queue, you can have a few threads and be blocked by the first job type most of the time, or have a lot of threads and get tons of context switching when the second type of job dominates. Either way your throughput will be very bad.
My current idea is to partition job types first by their degree of parallelizability (as in Amdahl's law), then by priority if necessary.
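Roughly what I mean, as a Python sketch (the `is_io_bound` flag and the pool sizes are made up):

```python
import os
from concurrent.futures import ThreadPoolExecutor

cpus = os.cpu_count() or 1

# Highly parallelizable jobs (e.g. third-party API calls) mostly wait on IO,
# so many threads per core is fine.
io_pool = ThreadPoolExecutor(max_workers=cpus * 16)

# Poorly parallelizable jobs (e.g. transcoding) saturate a core each; more
# threads than cores just buys context switching.
cpu_pool = ThreadPoolExecutor(max_workers=cpus)

def dispatch(job):
    pool = io_pool if job.is_io_bound else cpu_pool  # hypothetical job attribute
    return pool.submit(job.run)
```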
There are better ways to manage and measure queues, IMHO, by using basic queuing theory such as 'ρ' for utilization, 'ε' for error rate, 'Aθ' for activity step time, and Little's Law.
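For instance, Little's Law and the utilization ratio alone already tell you a lot (a sketch with made-up numbers):

```python
def jobs_in_system(arrival_rate_per_s, avg_time_in_system_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_time_in_system_s

def utilization(arrival_rate_per_s, workers, avg_service_time_s):
    """rho = lambda / (c * mu); as this approaches 1, queueing delay blows up."""
    return arrival_rate_per_s * avg_service_time_s / workers

print(jobs_in_system(50, 0.4))   # 50 jobs/s, 0.4s in system -> ~20 jobs in flight
print(utilization(50, 30, 0.5))  # 50 jobs/s, 30 workers, 0.5s each -> ~0.83
```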
In a similar vein, something about queuing that has annoyed me as a developer for multiple large FANG corporations is poor thinking about queue metrics. The TLDR is that metrics provided by the queue itself are rarely helpful for knowing if your service is healthy, and when it is not healthy they are not very useful for determining why.
Most queue processing services that I have seen have an alarm on (a) oldest message age, and (b) number of messages in the queue.
In every team I joined I have quickly added a custom metric (c) that subtracts the time a message was /initially/ added to the queue from the time of successful processing. This metric tends to uncover lots of nasty edge cases regarding retries, priority starving, and P99 behavior that are hidden by (a) and (b).
Having 100000 messages in the queue is only an issue if they are not being processed at (at least) 100000/s. Having a 6-hour-old message in the queue is concerning, but maybe it is an extreme outlier, so alarming is unnecessary. But you can bet your bottom dollar that if your average processing latency spikes by 10x that you want to know about it.
The other thing that is nice about an end-to-end latency metric is that (a) and (b) both tend to look great all the way up to the point of failure/back pressure and then they blow up excitingly. (c), on the other hand, will pick up on things like a slight increase in application latency, allowing you to diagnose beforehand if your previously over-provisioned queue is becoming at-capacity or under-provisioned.
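A sketch of that metric (c), assuming the message carries its original enqueue timestamp through retries and you have some metrics client handy:

```python
import time

def record_end_to_end_latency(message, metrics):
    """Emit (c): time from the *first* enqueue to successful processing.
    'first_enqueued_at' must be stamped once at creation and preserved across
    retries/requeues, otherwise retries hide the real latency.
    (Both the field and the metrics client here are hypothetical.)"""
    latency_s = time.time() - message["first_enqueued_at"]
    metrics.histogram("queue.end_to_end_latency_seconds", latency_s)
```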
I was just talking with a Temporal solutions engineer this week and this metric is their recommended one for autoscaling on. Instead of autoscaling on queue depth, you scale on queue latency! Specifically for them they split up the time from enqueue to start, and then the time from start to done, and you scale on the former, not the total ("ScheduleToStart" in their terms).
Time from enqueue to start isn't a good metric - it completely disregards the queue size. Enqueuing 1M jobs won't change this metric as it only updates once the job reaches the front of the queue, and when the 1Mth job does that the situation is already over.
I had much better results with a metric that shows estimated queue time for jobs that are getting enqueued right now (queue_size * running_avg_job_processing_time / parallelism).
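Something like this, roughly:

```python
def estimated_queue_time_s(queue_size, avg_job_duration_s, parallelism):
    """Rough wait for a job enqueued right now: everything already ahead of it,
    divided across the workers draining the queue."""
    return queue_size * avg_job_duration_s / max(parallelism, 1)

# 1,000,000 jobs ahead at 50ms each across 200 workers -> ~250s of expected wait,
# visible the moment the backlog appears instead of once jobs reach the front.
```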
>>> I was just talking with a Temporal solutions engineer
Aha! Just as the second season of Loki dropped. Makes sense now
Less sarcastically: this ties in with the article, I guess. run_at time is the enqueue, and then you are arguing for two latencies: time from enqueue to start, and from start to complete.
If you want to write any complicated concurrent code, the simplest and best way is one long polling loop state machine. You might use a thread to call a blocking API, but the majority of the logic should be in a polling loop.
I used to love chained callbacks when I was 16, and later I thought threads were the greatest, and I've written a bunch of device drivers that operate at different IPLs.
But 20 years ago a cofounder made me realize that a long polling loop is easier and faster to write, and much easier to understand than threads. That insight has made countless projects simpler and easier and I recommend considering it. You may be surprised, as I was.
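A toy example of the shape I mean (not anyone's production code): one loop owns all the state, and the only blocking call is pushed out to an edge thread that feeds it.

```python
import queue
import threading

events = queue.Queue()

def blocking_input(src):
    """The only place that blocks: runs on its own thread and just feeds the loop."""
    for item in src:
        events.put(("data", item))
    events.put(("eof", None))

def run_loop():
    """All the logic is one loop of (state, event) -> state transitions."""
    state, seen = "running", 0
    while state != "done":
        kind, payload = events.get()
        if kind == "data":
            seen += 1       # handle the work inline; no callbacks, no shared state
        elif kind == "eof":
            state = "done"
    return seen

threading.Thread(target=blocking_input, args=(range(5),), daemon=True).start()
print(run_loop())  # -> 5
```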
Ideally, you don't have blocking APIs at all. There isn't any reason nowadays. Every blocking API in your system is there for historic reasons (i.e. it's >20 years old) or because somebody made a terrible design decision in the last 20 years.
I am obviously referring to operations that do actual IO or wait for work on other threads only.
I don't know that I agree. Blocking logic is far easier to understand, as it lets you ignore partial operations. It also lets you specify, in a more direct manner, where data is buffered.
Maybe, though that is a bit of having your cake and eating it, too.
I am very open to the argument that you should be able to do both. I'm even open to the idea that this is closer to preemptive multitasking. Where the manual version basically lost, and it is taking time for some of us to realize it. :)
I have yet to try it IRL, but I’ve always wondered if having a single hyper-scaled worker pool against a single queue would be best suited for a typical web app Do Things Async layer. Add a job timeout so that a single rogue process can’t saturate workers… okay and now add a long_jobs queue for the things that need to run longer… and one more for these other things that need to run longer than that… and… shit.
Deadline-based scheduling actually lets you do super clever things like "time-shifting" -- you have 10 jobs that each take 10 seconds, and the deadline is 300 seconds out -- you can fit them in between other jobs. In addition, if you know the requirements of the job in terms of computational needs, you can determine really efficient colocation patterns.
IMHO, the problem is that it's really hard for people to think in terms of "I want my job that takes 10 CPU seconds to be done in 300 wall clock seconds". So what batch processing frameworks do is estimate these things and figure out where to place work. You can also do stuff like deny requests if there isn't capacity (because you know all the scheduled work for the next quantum).
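A rough sketch of that kind of admission check (not a full schedulability analysis; assumes you know each job's CPU-seconds up front):

```python
def can_admit(new_cpu_s, new_deadline_s, accepted, workers, now_s):
    """Admit the new job only if, for every deadline, all work due by then
    still fits in the capacity available by then.
    'accepted' is a list of (cpu_seconds_remaining, deadline_s) tuples."""
    jobs = accepted + [(new_cpu_s, new_deadline_s)]
    for _, deadline in jobs:
        work_due = sum(cpu for cpu, d in jobs if d <= deadline)
        capacity = workers * max(deadline - now_s, 0.0)
        if work_due > capacity:
            return False
    return True
```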
If the author is here: your CSS is breaking the code example. If you remove "white-space: wrap" from the rule `code[class="language-"], pre[class="language-"]`, it seems to do the trick (though I'm not sure if it breaks things on other pages).
I largely agree, however I do believe that it is necessary to reserve capacity for lower latency jobs if the variance in job durations is large.
For example, suppose you have a burst of 1 hour latency jobs, each of which processes in 10 minutes. It will not take many of these to consume all available workers.
If that burst is followed by a single high-priority, 10s-latency job, welp, that job's latency objective will not be met, since the soonest that a worker will free up to take this work is 10 minutes.
So I think the ideal worker pool design does include some amount of reserved capacity for low-latency work.
A general-purpose worker can of course grab low-latency work if it's idle! But the reverse is not true: an idle low-latency worker should not be picking up any long-running job.
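A sketch of that asymmetric setup (queue names and the callable-job convention are made up):

```python
import queue

def general_worker(fast_q: queue.Queue, slow_q: queue.Queue):
    while True:
        try:
            job = fast_q.get_nowait()  # grab low-latency work first when available
        except queue.Empty:
            job = slow_q.get()         # otherwise take a long-running job
        job()

def low_latency_worker(fast_q: queue.Queue):
    while True:
        # Reserved capacity: never picks up a long job, so a burst of
        # 10-minute jobs can't push the 10s-latency work back by 10 minutes.
        job = fast_q.get()
        job()
```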
># If workers are quiet, job1 will be run first in 10 minutes
># If workers are busy, job2 will be run first in 11 minutes
># If workers are too busy, both jobs will exceed their max latency.
So... priorities for tasks in a background queue.
I agree explicit latency tolerance is often a great way to do this - it lets you know what you can relax and reschedule, and if it's far enough in the future you can predict load / scale preemptively. Plus it works without having to deal with defining who gets what priority integer (or when to subdivide floats). But it degrades to the same behavior as priorities.
Thanks to all commenters for sharing their experiences and constructive opinions. It shows that this post is incomplete and far from being perfect. So, I just wrote a post-scriptum to improve it a bit for future readers.
Just quickly skimmed this, but it seems the conclusion is wrong:
A job needs two attributes to define when it should be started: run_at and max_latency. That means the job worker only needs to order them by run_at + max_latency, and takes the first. It seems both flexible and simple.
Just considering two jobs (run_at=10,max_latency=15), (run_at=11,max_latency=13), it's clear that following that approach, the first task would be unnecessarily blocked by the second, or you'd run jobs earlier than run_at specified.
I guess the messages should be sorted by that tuple (run_at, max_latency), and only once they've passed their run_at does run_at + max_latency come into play. Or something even more twisted.
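In code, the less twisted version would look something like this (a sketch): only jobs whose run_at has passed compete, and among those the earliest run_at + max_latency wins.

```python
def next_job(jobs, now):
    """jobs: objects with .run_at and .max_latency.
    At t=10, job2 (run_at=11, max_latency=13) is not yet eligible, so job1
    (run_at=10, max_latency=15) runs instead of being blocked by it."""
    eligible = [j for j in jobs if j.run_at <= now]
    if not eligible:
        return None
    return min(eligible, key=lambda j: j.run_at + j.max_latency)
```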
Well duh: you use prioritised queues precisely when there is a capacity limit. In lots of cases, the facility is specifically mandated to achieve as close to full capacity as possible.
Which is not to agree with the claim that latency queues and priorities can't achieve latency goals. Your hard requirements establish a minimum viable capacity, and you fill in the bubbles with softer work. Priorities let you distinguish between hard and soft, and to offer fairness among soft.
What is missing from this picture is idleness. For example, suppose I have a 10 sec SLO job A and a 5 min SLO job B. If I only get a few Bs sporadically, I may want to define queue X = A only, and queue Y = A,B to use the idle compute to process more As. In the wild, this is a delicate balancing act.
Even if you get lots of Bs, if B takes 15s to run, and you ever get n_worker requests for B at the same time immediately followed by a single A, you lose, even with plenty of capacity to spare.
You need either dedicated workers for low latency tasks or some sort of preemption to meet SLOs with such heterogeneous tasks.
Right, you need to think about the queue wait time you are willing to tolerate, in addition to the time it takes a job to run. If you aren't willing to wait in the queue, then you will need to have idle capacity.
This does not work if your upstream server can only handle X concurrent requests (think an ML GPU) and you need to time out the job before processing it.
It's useful to have queues based on job type because it allows you to stream messages onto the queue as they come in, and then batch pull the work off for processing (many processes are more efficient--possibly vastly so--when run as a batch).
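For example (a sketch around a plain in-process queue; a broker client would look different but the shape is the same):

```python
import queue
import time

def pull_batch(q, max_items=100, max_wait_s=0.5):
    """Collect up to max_items, flushing at least every max_wait_s so
    batching never adds more than that to processing latency."""
    batch, deadline = [], time.monotonic() + max_wait_s
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```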
It's interesting to see how the Rails world still thinks in terms of the number of processes listening to a queue, instead of thinking in cloud-native, elastic, serverless terms.
There's always an autoscaling delay, but Rails itself (and the community) don't seem to fit into the serverless paradigm well such that these questions around how to design your queues come up.
I think a lot of Lambda developers or Cloud Run developers would instead say "well, my max instances is set to 500, I am pretty sure I'm going to break something else before I hit that", you know? Especially when using the cloud's nice integrations between their queues and their event-driven serverless products, it's super easy to get exactly as much compute as you need to keep your latency really low.
Yeah when your Rails monolith's image is several GiBs, uses roughly the same amount of memory, and takes almost a minute to cold start, autoscaling has a lot of inertia and gets pretty expensive.