Does anyone here run a multi-tenant SaaS and actually handle job scheduling fairly across customers?
We occasionally used to have all workers tied up on a single customer's long-running tasks. We mitigated that with a throttler we wrote that defers a job if the customer already has too many resources in use, but it's not ideal.
I'd love a priority-based, customer-throttled (e.g. max concurrent tasks) queue.
We can prioritize by low/medium/high using separate queues, and could create a set of queues per customer, but that starts to explode the number of queues we have and feels unmanageable.
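To make the "max concurrent tasks per customer" idea concrete, here is a minimal sketch of a throttler that defers a job when its customer is at the cap. This is not the throttler described above; `CustomerThrottle`, `handle`, `run`, and `defer` are hypothetical names, and the counters live in one process only. With several worker machines the counts would have to sit in a shared store instead.

```python
import threading
from collections import defaultdict


class CustomerThrottle:
    """Caps how many jobs each customer may run at once (in-process only)."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self._running = defaultdict(int)  # customer_id -> jobs currently running
        self._lock = threading.Lock()

    def try_acquire(self, customer_id):
        """Reserve a slot and return True, or return False if the cap is reached."""
        with self._lock:
            if self._running[customer_id] >= self.max_concurrent:
                return False
            self._running[customer_id] += 1
            return True

    def release(self, customer_id):
        """Free a slot once a job finishes (successfully or not)."""
        with self._lock:
            if self._running[customer_id] > 0:
                self._running[customer_id] -= 1


def handle(job, throttle, run, defer):
    """Run the job if its customer is under the cap, otherwise defer it.

    `run` and `defer` are caller-supplied callables (assumptions): `defer`
    would typically re-enqueue the job with a delay.
    """
    customer = job["customer_id"]
    if not throttle.try_acquire(customer):
        defer(job)
        return False
    try:
        run(job)
    finally:
        throttle.release(customer)
    return True
```

Note this only limits concurrency; it does nothing about priority, and a deferred job still costs a dequeue and re-enqueue each time it is bounced.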
Yes. I've seen it in all kinds of teams. Anything that allows a developer to retrieve some data after a local server restart essentially gets treated by devs as a system of record, regardless of the intricacies or guarantees involved.
My personal experience is that abuse of queuing/messaging systems along this axis is rampant. Engineering leaders must keep a close eye on how these types of mechanisms are utilized to ensure things don't go off the rails.
I've seen far too many serious data loss events that boil down to "we lost our AMQP queue". It's critical that developers understand the limitations of the systems that run their code rather than just jumping aboard that "SQL is for old people" hype train.
We implemented this with additional DB checks. For example, we put only one job per customer on the queue at a time; the rest stay in the database until the one in progress has been processed.
Actually, most of the prioritization can be implemented through an additional DB. With SQS you usually need a persistent record of each job anyway to keep its status, processing times, and results. You can put only the few highest-priority items on the queue, just enough to guarantee that workers stay busy for the next 10-30 minutes.
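A rough sketch of that DB-side selection, using SQLite so it runs standalone. The `jobs` table, its columns, and the function names are assumptions for illustration, not our actual schema. It picks at most one job per customer, highest priority first, and skips any customer that already has a job on the queue. Actually sending the picked jobs to SQS is left out.

```python
import sqlite3

# Hypothetical schema: status moves pending -> queued -> done.
SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    customer_id TEXT NOT NULL,
    priority INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
)
"""


def pick_jobs_to_enqueue(conn, limit):
    """Mark up to `limit` jobs as queued, at most one per customer.

    Customers that already have a queued job are skipped, so each customer
    has no more than one job in flight at a time.
    """
    rows = conn.execute(
        """
        SELECT id, customer_id FROM jobs AS j
        WHERE j.status = 'pending'
          AND NOT EXISTS (
              SELECT 1 FROM jobs AS q
              WHERE q.customer_id = j.customer_id AND q.status = 'queued'
          )
        ORDER BY j.priority DESC, j.id ASC
        """
    ).fetchall()

    picked, seen = [], set()
    for job_id, customer_id in rows:
        if len(picked) >= limit:
            break
        if customer_id in seen:
            continue  # this customer already got a slot in this round
        seen.add(customer_id)
        picked.append(job_id)

    conn.executemany(
        "UPDATE jobs SET status = 'queued' WHERE id = ?",
        [(job_id,) for job_id in picked],
    )
    conn.commit()
    return picked


def mark_done(conn, job_id):
    """Record completion so the customer's next pending job becomes eligible."""
    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    conn.commit()
```

A scheduler loop would call `pick_jobs_to_enqueue` periodically, push the returned ids to the queue, and have workers call `mark_done` when they finish. With more than one scheduler instance the select-then-update step would need row locking or a single transaction to avoid double-queuing.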
Thanks for the suggestion, I'll reflect on that. We were also considering an "overflow" queue that would receive jobs when a customer had recently been inserting jobs into the main queue at a high rate, but that didn't solve the "cost" problem of a single job potentially being large.
Was hoping to avoid having "side car" infrastructure for this, but I don't think I can escape it :)