The article is a bit unclear because it's lacking the proper vocabulary. Priorities and deadlines (what the article calls "SLOs") are both valid ways to approach scheduling problems with different tradeoffs.
The fixed priority systems the article talks about trade off optimal "capacity" utilization for understandable failure dynamics in the overcapacity case. When you're over capacity, the messages that don't go through are the messages with the lowest priority. It's a nice property and simple to implement.
What the article proposes is better known as deadline scheduling. That's also fine and widely used, but it has more complicated failure dynamics in the overcapacity case. If your problem domain doesn't have an inherent "priority" linked to the deadlines, that may be acceptable, but in other cases it may not be.
Neither is inherently better, and there are other approaches with yet different tradeoffs.
I've seen enough versions of these systems and reached the conclusion that the best you can do is shift lower-priority work off-peak, but you still need to be overprovisioned, or else a p99 event will stall the system with too much work. People will talk about priorities, but on the scale of one day, almost everything is high priority.
In multitenant situations this is even more complicated, because I may have a time slice of the machine I’m “entitled” to, and within that I have priorities with which I want to allocate things. If one of my tasks can’t complete, I hope it’s to the benefit of another of my priorities. Not someone else’s.
I suspect you ultimately need a little language for this. Decoding priorities for frames in a video stream are already complex, and those are trivial compared to the sorts of scenarios Conway’s Law introduces.
Yeah, I don't see how deadlines/SLOs do any better when there's insufficient capacity to meet those of queued jobs at their concurrency.
> I don’t think it’s possible to meet a latency target through prioritization when there is a fundamental lack of capacity.
This seemingly implies that it would be achievable with target deadlines/SLOs. At capacity, I don't see a better solution than giving each priority (or target latency) a defined minimum resource allocation.
Under deadline scheduling, every pending job _eventually_ has highest priority as time elapses. (Assuming new jobs can’t arrive with deadlines in the past.) Every job is eventually serviced.
The “pain” experienced in an overload situation is spread among all late jobs. Contrast this with fixed-priority scheduling, where the lowest-priority jobs will be starved completely until the overload is resolved.
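A minimal sketch of that dynamic in Python (my own, not from the article; just the textbook earliest-deadline-first ordering):

```python
import heapq
import itertools
import time

class DeadlineQueue:
    """Earliest-deadline-first: the job whose absolute deadline is nearest runs next."""
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so jobs themselves are never compared

    def push(self, job, max_latency_s):
        deadline = time.monotonic() + max_latency_s
        heapq.heappush(self._heap, (deadline, next(self._tie), job))

    def pop(self):
        # Whatever sits at the top of the heap is the most urgent pending job
        # right now. Late jobs are never starved; as time passes they simply
        # become the head of the heap and get serviced.
        deadline, _, job = heapq.heappop(self._heap)
        return job, deadline
```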
That's what I was saying: in the case of insufficient capacity, the available capacity is divided among the priorities in non-equal but non-zero portions. The SLO/deadline method would effectively be doing something similar, as everything would be overdue and the most overdue gets highest priority to run. The only difference is that there's no unequal portioning unless there's additional logic to say x overdue of job A is more important than x overdue of job B, which amounts to setting priorities in the end.
> The “pain” experienced in an overload situation is spread among all late jobs. Contrast this with fixed-priority scheduling, where the lowest-priority jobs will be starved completely until the overload is resolved.
Though this is often not a good way to spread out the pain.
It's probably much worse for the 15s job to be a minute late than for the 8h job to be a minute late, but basic deadline scheduling will treat them the same. So you want sharing, but uneven sharing.
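One way to get that uneven sharing (a sketch, assuming each job carries its own latency tolerance): rank by lateness relative to the job's max latency rather than absolute lateness.

```python
def urgency(now_s, enqueued_at_s, max_latency_s):
    """Relative lateness: 1.0 means 'exactly at the deadline'.
    A 15s job that is a minute late scores 75/15 = 5.0, while an 8h job
    a minute late scores ~1.002, so the short job wins the next worker."""
    return (now_s - enqueued_at_s) / max_latency_s
```

Workers would then pop whichever pending job currently has the highest urgency.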
Thanks for sharing some of your insight into the problem domain. Do you happen to have any reference for laymen like me to onboard onto the subject? It's an awfully interesting topic, but when I stumble upon it I tend to fall back on JIT learning and troubleshooting, which is far from the best position to be in.
This is a class of problems known as scheduling. The Wikipedia page is a good place to start [0]. It focuses on task schedulers rather than message queues, but the same principles apply. For a deeper theoretical basis, Wiley has a good book [1]. Most undergrad curricula will have a module on this as well, so you can find info in most comp sci textbooks.
Another important aspect that seems to be overlooked in a lot of these discussions (specifically for Rails, but probably relevant in other contexts as well) is that if your jobs are very heterogeneous in terms of their degree of parallelizability then it will be very difficult to do effective provisioning and capacity planning.
Let's say you have two types of jobs: one that is highly parallelizable (e.g. a request to a third-party API), and one that doesn't parallelize well (video transcoding using 100% of available CPU). Then you'd want a lot of threads assigned to each CPU for the first type, but very few for the second type. If they're on the same queue, you can have a few threads and be blocked by the first job type most of the time, or have a lot of threads and get tons of context switching when the second type of job dominates. Either way your throughput will be very bad.
My current idea is to partition job types first by their degree of parallelizability (as in Amdahl's law), then by priority if necessary.
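Roughly what I mean, as a Python sketch (the `is_io_bound` flag and the pool sizes are made up):

```python
import os
from concurrent.futures import ThreadPoolExecutor

cpus = os.cpu_count() or 1

# Highly parallelizable jobs (e.g. third-party API calls) mostly wait on IO,
# so many threads per core is fine.
io_pool = ThreadPoolExecutor(max_workers=cpus * 16)

# Poorly parallelizable jobs (e.g. transcoding) saturate a core each; more
# threads than cores just buys context switching.
cpu_pool = ThreadPoolExecutor(max_workers=cpus)

def dispatch(job):
    pool = io_pool if job.is_io_bound else cpu_pool  # hypothetical job attribute
    return pool.submit(job.run)
```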
There are better ways to manage and measure queues, IMHO, by using basic queuing theory such as 'ρ' for utilization, 'ε' for error rate, 'Aθ' for activity step time, and Little's Law.
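For instance, Little's Law and the utilization ratio alone already tell you a lot (a sketch with made-up numbers):

```python
def jobs_in_system(arrival_rate_per_s, avg_time_in_system_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_time_in_system_s

def utilization(arrival_rate_per_s, workers, avg_service_time_s):
    """rho = lambda / (c * mu); as this approaches 1, queueing delay blows up."""
    return arrival_rate_per_s * avg_service_time_s / workers

print(jobs_in_system(50, 0.4))   # 50 jobs/s, 0.4s in system -> ~20 jobs in flight
print(utilization(50, 30, 0.5))  # 50 jobs/s, 30 workers, 0.5s each -> ~0.83
```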
In a similar vein, something about queuing that has annoyed me as a developer for multiple large FANG corporations is poor thinking about queue metrics. The TLDR is that metrics provided by the queue itself are rarely helpful for knowing if your service is healthy, and when it is not healthy they are not very useful for determining why.
Most queue processing services that I have seen have an alarm on (a) oldest message age, and (b) number of messages in the queue.
In every team I joined I have quickly added a custom metric (c) that subtracts the time a message was /initially/ added to the queue from the time of successful processing. This metric tends to uncover lots of nasty edge cases regarding retries, priority starving, and P99 behavior that are hidden by (a) and (b).
Having 100000 messages in the queue is only an issue if they are not being processed at (at least) 100000/s. Having a 6-hour-old message in the queue is concerning, but maybe it is an extreme outlier, so alarming is unnecessary. But you can bet your bottom dollar that if your average processing latency spikes by 10x that you want to know about it.
The other thing that is nice about an end-to-end latency metric is that (a) and (b) both tend to look great all the way up to the point of failure/back pressure and then they blow up excitingly. (c), on the other hand, will pick up on things like a slight increase in application latency, allowing you to diagnose beforehand if your previously over-provisioned queue is becoming at-capacity or under-provisioned.
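A sketch of that metric (c), assuming the message carries its original enqueue timestamp through retries and you have some metrics client handy:

```python
import time

def record_end_to_end_latency(message, metrics):
    """Emit (c): time from the *first* enqueue to successful processing.
    'first_enqueued_at' must be stamped once at creation and preserved across
    retries/requeues, otherwise retries hide the real latency.
    (Both the field and the metrics client here are hypothetical.)"""
    latency_s = time.time() - message["first_enqueued_at"]
    metrics.histogram("queue.end_to_end_latency_seconds", latency_s)
```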
I was just talking with a Temporal solutions engineer this week and this metric is their recommended one for autoscaling on. Instead of autoscaling on queue depth, you scale on queue latency! Specifically for them they split up the time from enqueue to start, and then the time from start to done, and you scale on the former, not the total ("ScheduleToStart" in their terms).
Time from enqueue to start isn't a good metric - it completely disregards the queue size. Enqueuing 1M jobs won't change this metric as it only updates once the job reaches the front of the queue, and when the 1Mth job does that the situation is already over.
I had much better results with a metric that shows estimated queue time for jobs that are getting enqueued right now (queue_size * running_avg_job_processing_time / parallelism).
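Something like this, roughly:

```python
def estimated_queue_time_s(queue_size, avg_job_duration_s, parallelism):
    """Rough wait for a job enqueued right now: everything already ahead of it,
    divided across the workers draining the queue."""
    return queue_size * avg_job_duration_s / max(parallelism, 1)

# 1,000,000 jobs ahead at 50ms each across 200 workers -> ~250s of expected wait,
# visible the moment the backlog appears instead of once jobs reach the front.
```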
>>> I was just talking with a Temporal solutions engineer
Aha! Just as the second season of Loki dropped. Makes sense now
Less sarcastically: this ties in with the article, I guess. run_at time is the enqueue, and then you are arguing for two latencies: time from enqueue to start, and from start to complete.
If you want to write any complicated concurrent code, the simplest and best way is one long polling loop state machine. You might use a thread to call a blocking API, but the majority of the logic should be in a polling loop.
I used to love chained callbacks when I was 16, and later I thought threads were the greatest, and I've written a bunch of device drivers that operate at different IPLs.
But 20 years ago a cofounder made me realize that a long polling loop is easier and faster to write, and much easier to understand than threads. That insight has made countless projects simpler and easier and I recommend considering it. You may be surprised, as I was.
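A toy example of the shape I mean (not anyone's production code): one loop owns all the state, and the only blocking call is pushed out to an edge thread that feeds it.

```python
import queue
import threading

events = queue.Queue()

def blocking_input(src):
    """The only place that blocks: runs on its own thread and just feeds the loop."""
    for item in src:
        events.put(("data", item))
    events.put(("eof", None))

def run_loop():
    """All the logic is one loop of (state, event) -> state transitions."""
    state, seen = "running", 0
    while state != "done":
        kind, payload = events.get()
        if kind == "data":
            seen += 1       # handle the work inline; no callbacks, no shared state
        elif kind == "eof":
            state = "done"
    return seen

threading.Thread(target=blocking_input, args=(range(5),), daemon=True).start()
print(run_loop())  # -> 5
```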
Ideally, you don't have blocking APIs at all. There isn't any reason nowadays. Every blocking API in your system is there for historic reasons (i.e. it's >20 years old) or because somebody made a terrible design decision in the last 20 years.
I am obviously referring to operations that do actual IO or wait for work on other threads only.
I don't know that I agree. Blocking logic is far easier to understand, as it lets you ignore partial operations. It also lets you specify, in a more direct manner, where data is buffered.
Maybe, though that is a bit of having your cake and eating it, too.
I am very open to the argument that you should be able to do both. I'm even open to the idea that this is closer to preemptive multitasking. Where the manual version basically lost, and it is taking time for some of us to realize it. :)
I have yet to try it IRL, but I’ve always wondered if having a single hyper-scaled worker pool against a single queue would be best suited for a typical web app Do Things Async layer. Add a job timeout so that a single rogue process can’t saturate workers… okay and now add a long_jobs queue for the things that need to run longer… and one more for these other things that need to run longer than that… and… shit.
Deadline-based scheduling actually lets you do super clever things like "time-shifting" -- you have 10 jobs that each take 10 seconds, and the deadline is 300 seconds out -- you can fit them in between other jobs. In addition, if you know the requirements of the job in terms of computational needs, you can determine really efficient colocation patterns.
IMHO, the problem is that it's really hard for people to think in terms of "I want my job that takes 10 CPU seconds to be done in 300 wall clock seconds". So what batch processing frameworks do is estimate these things and figure out where to place work. You can also do stuff like deny requests if there isn't capacity (because you know all the scheduled work for the next quantum).
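A rough sketch of that kind of admission check (not a full schedulability analysis; assumes you know each job's CPU-seconds up front):

```python
def can_admit(new_cpu_s, new_deadline_s, accepted, workers, now_s):
    """Admit the new job only if, for every deadline, all work due by then
    still fits in the capacity available by then.
    'accepted' is a list of (cpu_seconds_remaining, deadline_s) tuples."""
    jobs = accepted + [(new_cpu_s, new_deadline_s)]
    for _, deadline in jobs:
        work_due = sum(cpu for cpu, d in jobs if d <= deadline)
        capacity = workers * max(deadline - now_s, 0.0)
        if work_due > capacity:
            return False
    return True
```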
If the author is here: your CSS is breaking the code example. If you remove "white-space: wrap" from the rule `code[class="language-"], pre[class="language-"]`, it seems to do the trick (though I'm not sure if it breaks things on other pages).
I largely agree, however I do believe that it is necessary to reserve capacity for lower latency jobs if the variance in job durations is large.
For example, suppose you have a burst of 1 hour latency jobs, each of which processes in 10 minutes. It will not take many of these to consume all available workers.
If that burst is followed by a single high-priority, 10s-latency job, welp, that job's latency objective will not be met, since the soonest that a worker will free up to take this work is 10 minutes.
So I think the ideal worker pool design does include some amount of reserved capacity for low-latency work.
A general-purpose worker can of course grab low-latency work if it's idle! But the reverse is not true: an idle low-latency worker should not be picking up any long-running job.
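A sketch of that asymmetric setup (queue names and the callable-job convention are made up):

```python
import queue

def general_worker(fast_q: queue.Queue, slow_q: queue.Queue):
    while True:
        try:
            job = fast_q.get_nowait()  # grab low-latency work first when available
        except queue.Empty:
            job = slow_q.get()         # otherwise take a long-running job
        job()

def low_latency_worker(fast_q: queue.Queue):
    while True:
        # Reserved capacity: never picks up a long job, so a burst of
        # 10-minute jobs can't push the 10s-latency work back by 10 minutes.
        job = fast_q.get()
        job()
```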
># If workers are quiet, job1 will be run first in 10 minutes
># If workers are busy, job2 will be run first in 11 minutes
># If workers are too busy, both jobs will exceed their max latency.
So... priorities for tasks in a background queue.
I agree explicit latency tolerance is often a great way to do this - it lets you know what you can relax and reschedule, and if it's far enough in the future you can predict load / scale preemptively. Plus it works without having to deal with defining who gets what priority integer (or when to subdivide floats). But it degrades to the same behavior as priorities.
Thanks to all commenters for sharing their experiences and constructive opinions. It shows that this post is incomplete and far from being perfect. So, I just wrote a post-scriptum to improve it a bit for future readers.
Just quickly skimmed this, but it seems the conclusion is wrong:
A job needs two attributes to define when it should be started: run_at and max_latency. That means the job worker only needs to order them by run_at + max_latency, and takes the first. It seems both flexible and simple.
Just considering two jobs (run_at=10,max_latency=15), (run_at=11,max_latency=13), it's clear that following that approach, the first task would be unnecessarily blocked by the second, or you'd run jobs earlier than run_at specified.
I guess the messages should be sorted by that tuple (run_at, max_latency), and only once they've passed their run_at does run_at + max_latency come into play. Or something even more twisted.
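In code, the less twisted version would look something like this (a sketch): only jobs whose run_at has passed compete, and among those the earliest run_at + max_latency wins.

```python
def next_job(jobs, now):
    """jobs: objects with .run_at and .max_latency.
    At t=10, job2 (run_at=11, max_latency=13) is not yet eligible, so job1
    (run_at=10, max_latency=15) runs instead of being blocked by it."""
    eligible = [j for j in jobs if j.run_at <= now]
    if not eligible:
        return None
    return min(eligible, key=lambda j: j.run_at + j.max_latency)
```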
Well duh: you use prioritised queues precisely when there is a capacity limit. In lots of cases, the facility is specifically mandated to achieve as close to full capacity as possible.
Which is not to agree with the claim that latency queues and priorities can't achieve latency goals. Your hard requirements establish a minimum viable capacity, and you fill in the bubbles with softer work. Priorities let you distinguish between hard and soft, and to offer fairness among soft.
What is missing from this picture is idleness. For example, suppose I have a 10 sec SLO job A and a 5 min SLO job B. If I only get a few Bs sporadically, I may want to define queue X = A only, and queue Y = A,B to use the idle compute to process more As. In the wild, this is a delicate balancing act.
Even if you get lots of Bs, if B takes 15s to run, and you ever get n_worker requests for B at the same time immediately followed by a single A, you lose, even with plenty of capacity to spare.
You need either dedicated workers for low latency tasks or some sort of preemption to meet SLOs with such heterogeneous tasks.
Right, you need to think about the queue wait time you are willing to tolerate, in addition to the time it takes a job to run. If you aren't willing to wait in the queue, then you will need to have idle capacity.
This does not work if your upstream server can only handle X concurrent requests (think an ML GPU) and you need to time out the job before processing it.
It's useful to have queues based on job type because it allows you to stream messages onto the queue as they come in, and then batch pull the work off for processing (many processes are more efficient--possibly vastly so--when run as a batch).
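For example (a sketch around a plain in-process queue; a broker client would look different but the shape is the same):

```python
import queue
import time

def pull_batch(q, max_items=100, max_wait_s=0.5):
    """Collect up to max_items, flushing at least every max_wait_s so
    batching never adds more than that to processing latency."""
    batch, deadline = [], time.monotonic() + max_wait_s
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```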
It's interesting to see how the Rails world still thinks in terms of the number of processes listening to a queue, instead of thinking in cloud-native, elastic, serverless terms.
There's always an autoscaling delay, but Rails itself (and the community) don't seem to fit into the serverless paradigm well such that these questions around how to design your queues come up.
I think a lot of Lambda developers or Cloud Run developers would instead say "well, my max instances is set to 500, I am pretty sure I'm going to break something else before I hit that", you know? Especially when using the cloud's nice integrations between their queues and their event-driven serverless products, it's super easy to get exactly as much compute as you need to keep your latency really low.
Yeah when your Rails monolith's image is several GiBs, uses roughly the same amount of memory, and takes almost a minute to cold start, autoscaling has a lot of inertia and gets pretty expensive.