There is a subtler issue. Even if your average input and output rates are OK, the lumpiness (stochastic variations) in input and processing rates can cause queues to build up.
In the simplest example, with "well behaved" arrival and processing rates and a single server (an M/M/1 queue), the average number of requests in the system is ρ/(1−ρ), where the utilization ρ = avg arrival rate / avg processing rate. So as the arrival rate approaches the processing rate, the average queue length grows without bound. In practice, you want to keep average utilization below about 80% to keep queue lengths reasonable.
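For reference, the textbook M/M/1 quantities, writing λ for the average arrival rate and μ for the average service rate:

```latex
\rho = \frac{\lambda}{\mu}, \qquad
L = \frac{\rho}{1-\rho} \quad \text{(mean number in system)}, \qquad
L_q = \frac{\rho^{2}}{1-\rho} \quad \text{(mean number waiting)}
```

At ρ = 0.8 this gives L = 4 and L_q = 3.2; at ρ = 0.95 it is already L = 19, which is why the usual rule of thumb is to stay well under full utilization.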
Alongside a link to Google's explanation of why they do it, that's a very reasonable and helpful reply: "Changing the queuing method from the standard first-in, first-out (FIFO) to last-in, first-out (LIFO) [...] can reduce load by removing requests that are unlikely to be worth processing."
> Most services process queues in FIFO (first-in first-out) order. During periods of high queuing, however, the first-in request has often been sitting around for so long that the user may have aborted the action that generated the request. Processing the first-in request first expends resources on a request that is less likely to benefit a user than a request that has just arrived. Our services process requests using adaptive LIFO. During normal operating conditions, requests are processed in FIFO order, but when a queue is starting to form, the server switches to LIFO mode.

> Adaptive LIFO and CoDel play nicely together, as shown in figure 2. CoDel sets short timeouts, preventing long queues from building up, and adaptive LIFO places new requests at the front of the queue, maximizing the chance that they will meet the deadline set by CoDel. HHVM, Facebook's PHP runtime, includes an implementation of the Adaptive LIFO algorithm.
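A minimal sketch of the adaptive part, not HHVM's actual implementation (the threshold and deadline fields are made up): serve FIFO normally, switch to LIFO once the queue grows past a threshold, and shed anything that has already blown its deadline.

```ts
// Illustrative adaptive FIFO/LIFO queue with deadline-based shedding.
interface QueuedRequest {
  enqueuedAt: number;   // ms timestamp when the request was queued
  deadlineMs: number;   // CoDel-style per-request deadline
  handle: () => void;
}

class AdaptiveLifoQueue {
  private items: QueuedRequest[] = [];
  constructor(private lifoThreshold = 100) {}

  push(req: QueuedRequest): void {
    this.items.push(req);
  }

  next(): QueuedRequest | undefined {
    while (this.items.length > 0) {
      // Under load, take the newest request (LIFO); otherwise the oldest (FIFO).
      const req =
        this.items.length > this.lifoThreshold
          ? this.items.pop()!
          : this.items.shift()!;
      // Shed requests that have already exceeded their deadline.
      if (Date.now() - req.enqueuedAt <= req.deadlineMs) return req;
    }
    return undefined;
  }
}
```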
For instance, a bunch of clients all make a request to a server at the same time, briefly saturating it. If all the clients have the same timeout without jitter, they will all retry together at the same moment once the timeout expires, saturating the server again and again. Jitter helps by "spreading" those clients out in time, thus "diluting" the server load. The server can then process the requests without becoming saturated.
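A minimal sketch of that jittered retry loop (the "full jitter" variant; all names and defaults here are illustrative):

```ts
// Retry with exponential backoff plus full jitter: each client waits a random
// delay in [0, cap], so synchronized clients spread out instead of retrying in lockstep.
async function retryWithJitter<T>(
  attempt: () => Promise<T>,
  maxRetries = 5,
  baseMs = 100,
  maxMs = 10_000,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (i >= maxRetries) throw err;
      const cap = Math.min(maxMs, baseMs * 2 ** i); // exponential cap
      const delay = Math.random() * cap;            // full jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```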
The basic idea behind that is also used in all sorts of networks where multiple stations share the same medium and everyone is free to transmit. If a "collision" is detected, stations wait a random backoff interval before sending again, in the hope that next time there won't be another collision.
If you use a linked list it’s like always adding at the head of the list and always removing from the head.
Let's say you have two clients for your server: one sends a reliable 1 request per second, and the other sends 10,000 requests every hour in a single burst. Assume your server can handle 100 requests/second with a 1-second timeout.
The LIFO queue will basically result in the 1 req/s client always getting its requests answered, while when the burst arrives from the other client, most of its requests will get dropped.
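Back-of-the-envelope, under those assumptions: the 10,000-request burst would take 100 seconds to drain at 100 req/s, but with a 1-second timeout only the ~100 newest requests are worth serving at any moment. Under LIFO, roughly 100 of the burst get answered before the rest time out, while the steady 1 req/s client, whose requests are always among the newest, barely notices.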
This is a strange article because it doesn't even mention the simplest and most common form of backpressure, which is to make requests synchronous, i.e. what you see in TCP and standard Unix sockets/pipes.
The article makes it seem like you have three options: (1) Buffer, (2) drop, or (3) “control” the producer (and gives examples of “telling” the producer to slow down.)
But the simplest thing to do is for a producer to not send a second request until the first one is done. If you have an upstream system you have to pipeline requests into, just don’t complete/ack a client request until the upstream system has acknowledged it. So your slow database server that only supports 10 concurrent writes can be limited 10 writes at a time, and your clients will see their request block until the database server serves their request.
The really hard part, and the reason why you’d need ad-hoc (often out-of-band) “signaling” to control the producer, comes when you decide you want to have unbounded concurrency. It’s tempting, because synchronous requests are slow! You can’t start the next request until the previous one is acknowledged! How inefficient! But unless you have a true need to do otherwise, it’s also the simplest and most reliable way of doing things. It’s how things like file copying work straight out of the box on just about every operating system: Read from one place, write to another, but don’t just keep reading forever: Writes block until they’re fully received, then you read the next block. Add some buffers as needed to make things a bit more efficient, but the abstraction is still synchronous.
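A minimal sketch of that bounded, synchronous style (the semaphore, the limit of 10, and `writeToDb` are all illustrative): the client's request isn't acknowledged until the upstream write completes, and at most 10 writes are ever in flight.

```ts
// A tiny counting semaphore: acquire() resolves only when a slot is free,
// so callers wait (await) instead of piling up unbounded concurrent work.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private slots: number) {}

  async acquire(): Promise<void> {
    if (this.slots > 0) {
      this.slots--;
      return;
    }
    // No slot free: park here until release() hands us one.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next();   // hand the slot directly to a waiter
    else this.slots++;
  }
}

const dbWrites = new Semaphore(10); // the slow database only supports 10 concurrent writes

// Hypothetical request handler: the client's request completes only after the
// upstream database has acknowledged the write.
async function handleWrite(row: unknown, writeToDb: (row: unknown) => Promise<void>) {
  await dbWrites.acquire(); // backpressure: excess callers queue up here, bounded by the semaphore
  try {
    await writeToDb(row);
  } finally {
    dbWrites.release();
  }
}
```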
TCP windowing is not exactly out-of-band signaling, but it is used to adapt the sender to the receiver through negotiation: it looks at round-trip times and adjusts accordingly, which is a form of backpressure when it is closing the window.
I like the article, but I am not sure that I agree with the terminology: I would not call "buffering" a form of back pressure. Imho there is really only one type of back pressure: the one the author calls "control". The other 2 are just ways to "release" pressure.
I agree. The author has interpreted “backpressure” to mean pressure from behind, but my understanding of the term is the other direction, like water pooling up at a partially blocked drain. Water is prevented from entering the pipe because the pipe is at capacity.
Backpressure is an implicit signal from consumer to producer that there is not enough capacity. It propagates backwards along the flow, like water backing up in a network of pipes.
In their example, the conveyor belt keeps adding chocolates because there is no backpressure. The effect is that Lucy engages in load shedding, and the producer of the chocolates has no say or insight into what happens when that load has to be shed. If there was backpressure, the producer gets to choose what happens.
Yup, this is what WHATWG's Streams spec[0] (linked in the article) says. It defines backpressure as a "process of normalizing flow from the original source according to how fast the chain can process chunks" where the reader "propagates a signal backwards through the pipe chain".
Mozilla's documentation[1] similarly defines backpressure as "the process by which a single stream or a pipe chain regulates the speed of reading/writing".
The article confuses backpressure (the signal used for regulation of the flow) with the reason backpressure is needed (producers and consumers working at different speeds). It should be fairly clear from the metaphor, I would have thought: With a pipe of unbounded size there is no pressure. The pressure builds up in a fixed-size pipe when the consumer is slower than the producer, which in turn slows down the producer. (Or the pipe explodes, or springs a leak and has to drop data on the ground.)
Edit: I think the article also misses one of the most important ways to release pressure, and that is scaling throughput, either horizontally (e.g. by adding more servers) or vertically (e.g. by optimizing the software).
The first is not always possible (not all systems are elastic or can be, due to money/resources), and the second is not really a way to handle back-pressure. You could have back-pressure in the most optimized system.
Note that I meant "handling (forward) pressure (or preventing back-pressure) can be done by simply having your system be performant enough". Of course this is not a dynamic property you can adjust at runtime; I just wanted to add it for the sake of completeness, because it should be part of the mindset. Sometimes a database falling over simply needs a few queries optimized.
> You could have back-pressure in the most optimized system.
Most batch processing systems. But you don't want back pressure in interactive or real-time systems, like graphics, gaming, audio, cars, planes, or even just real time collaboration systems (e.g. Figma). In all these cases back pressure is to be avoided.
WebSockets is an interesting case where the underlying transport (TCP) provides backpressure for free, but the way the API was designed in browsers throws it away. For example, it's trivial to fill your device's memory by opening a large local file in your browser and attempting to stream it to a server in a tight WebSocket send loop.
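The usual mitigation is to re-introduce backpressure by hand: check `bufferedAmount` and stop sending while it is high. A rough sketch (the chunk size and threshold are arbitrary), since the browser gives no "drained" event to wait on:

```ts
// Manually re-introducing backpressure on a browser WebSocket:
// stop sending while too many bytes are still buffered, and poll until they drain.
async function sendFileOverWebSocket(ws: WebSocket, file: Blob, chunkSize = 64 * 1024) {
  const highWaterMark = 1 * 1024 * 1024; // pause when ~1 MiB is queued (arbitrary)
  for (let offset = 0; offset < file.size; offset += chunkSize) {
    while (ws.bufferedAmount > highWaterMark) {
      await new Promise((r) => setTimeout(r, 10)); // no drain event, so we poll
    }
    ws.send(await file.slice(offset, offset + chunkSize).arrayBuffer());
  }
}
```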
I'm not sure if there was an alternative when WebSockets was designed. Did we even have promises yet?
This sort of thing is solved nowadays with WHATWG streams. They're a bit verbose to work with but I've been impressed with the design.
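For example, a minimal sketch assuming a modern browser or Node 18+: `pipeTo` only pulls from the source as fast as the sink accepts chunks, so the slow writer throttles the reader.

```ts
// WHATWG streams: pipeTo() propagates backpressure from the slow sink back to the source.
async function demo() {
  let i = 0;
  const source = new ReadableStream<number>(
    {
      pull(controller) {
        // pull() is only called when the stream's internal queue has room.
        controller.enqueue(i++);
        if (i >= 1_000) controller.close();
      },
    },
    new CountQueuingStrategy({ highWaterMark: 10 }),
  );

  const slowSink = new WritableStream<number>({
    async write(chunk) {
      await new Promise((r) => setTimeout(r, 5)); // simulate a slow consumer
      console.log(chunk);
    },
  });

  await source.pipeTo(slowSink); // the producer never gets more than ~10 chunks ahead
}
```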
In case anyone is interested, I wrote an async/await stream library for JavaScript/Node.js which supports backpressure management. It's heavily tested and used as part of SocketCluster (pub/sub SDK) https://socketcluster.io/ which is also heavily tested.
Controlling the producer is such a hard problem. Even with exponential backoff and backoff times in the response headers, you still get at minimum a 2x throughput increase from the producers during a retry storm.
The problem is that the most common backpressure techniques, like exponential backoff and sending a Retry-After time in the response header, are constrained in the maximum backoff they can impose, and in some scenarios that is much, much less than what you need.
For example, imagine a scenario where a customer explores 10 items on Amazon and then finally places an order, so 10 rps for the product page and 1 rps for the order page. If the order service goes down, customers slowly get stuck on the order page, and even with backpressure your RPS on the order page keeps growing; exponential backoff doesn't help either.
While dropping requests is a good idea, that behavior isn't designed in by default every time; systems go into a metastable state, and you need the ability to control throughput on the producer side.
You could solve it by keeping a separate layer in between, like a load balancer or some gateway layer, that is resilient against such throughput spikes and lets you control the throughput reaching your service, slowly scaling it back up as your requirements allow (by user or at random).
For frontend devices, it gets exponentially harder to control the throughput. Having an independent config API that can control the throughput is the best solution I've come across.
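As a sketch of what that gateway-side throttle could look like, here is a plain token bucket whose rate could be driven from such a config API (all names and numbers are made up):

```ts
// Token bucket: refill at `ratePerSec`, allow at most `burst` tokens to accumulate.
// The gateway sheds requests when no token is available, and the rate itself can
// be lowered/raised centrally without touching the clients.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(public ratePerSec: number, private burst: number) {
    this.tokens = burst;
  }

  tryRemoveToken(): boolean {
    const now = Date.now();
    this.tokens = Math.min(
      this.burst,
      this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSec,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const orderServiceLimit = new TokenBucket(50, 100); // start conservative, raise via config

function gatewayHandle(forward: () => Promise<Response>): Promise<Response> {
  if (!orderServiceLimit.tryRemoveToken()) {
    // Shed load early and tell the producer when to come back.
    return Promise.resolve(
      new Response(null, { status: 429, headers: { "Retry-After": "5" } }),
    );
  }
  return forward();
}
```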
Isn't the ideal solution to make the throttled system faster? Like autoscaling horizontally and/or vertically, sharding, or just writing better code? Everything in this article is about coping with back pressure, but solving it is frequently possible.
Serverless systems are pretty decent at scaling quickly these days. The problem is rarely lack of servers in my experience, though. You usually run out of database connections or some other bottleneck first.
>Like autoscale horizontally and/or vertically, sharding or just writing better code?
Sometimes you just have a server, not some elastic horizontal setup, or a budget for vertical scaling.
And "writing better code" is an optimization thing, not a general way to handle back-pressure (you can have back-pressure in a fully optimized program too, and even for non-optimal code you don't want to just have to iterate and refactor/improve the code every time there's back-pressure).
In some ways, yes: one solution to your system being throughput-constrained is to remove the constraint.
But, there's a cost to that, it may be development time, capital, or operational expense. Certainly, sometimes you spend a few minutes replacing a bad sort with a much better sort and get immediate and large benefits. But often it's the case where rare conditions cause loads that are too expensive to handle immediately.
Having backpressure set up and monitored in advance also comes at a cost, but it allows you to make specific choices about how the system behaves in overload, lets you monitor the overload conditions, and hopefully/maybe gives you some feedback about your overall capacity.
For example, many systems have worse throughput when overloaded. Having a strict concurrency limit prevents throughput loss from context switching, and the errors (or simply the effects of queueing) can propagate as backpressure. Monitoring active threads gives a sense of available capacity; monitoring time spent with threads at max, the depth of the queue, or the number of requests rejected gives a sense of unmet demand.
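A minimal sketch of that pattern (names invented): a hard cap on in-flight work, rejection when full, and counters you can export to whatever metrics system you use.

```ts
// Strict concurrency limit with load shedding and simple metrics:
// `active` shows how close you are to capacity, `rejected` shows unmet demand.
class ConcurrencyLimiter {
  active = 0;
  rejected = 0;

  constructor(private maxConcurrent: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      this.rejected++;                 // overload signal; propagates upstream as an error
      throw new Error("overloaded");   // callers translate this into e.g. HTTP 503
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
    }
  }
}
```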
In practice, yeah, this is the fix in most systems because in a microservice context it would feel gross to have a slow consumer reach out and throttle a fast producer.
It's still important to understand, though, because autoscaling has its own back pressure that you run into eventually. Accounts may have quotas, or regions might be out of a given instance type, or out of a type at your preferred spot price.
It is not possible in general, because if your load can scale infinitely, then there won't ever be enough compute to satisfy it. In practice, you may be able to apply some assumptions to bound the amount of compute necessary. It is still almost always good to have a plan for how to shed load if your assumptions fail. Systems built without this are prone to failing catastrophically.
It doesn't matter. As long as there is a rate mismatch over a sufficiently long timescale, queues will build up and consume memory, latency will increase, and depending on how things are structured, overall throughput will go down because of scheduling overhead and memory pressure.
This is actually an inherent problem _within_ horizontally scaled services that have cross dependencies.
"sufficiently long timescale" is long enough to figure out a solution :)
But seriously the question was "ideal solution". Not everyone can do it but just right-sizing compute and removing or offloading bottlenecks is entirely possible. So is just throwing in the towel and discarding excess load. Reddit does that all the time and their IPO was a big success. You can also scale to accommodate massive headroom by just throwing money at it. There are plenty of systems where response times are mandatory and any back pressure mitigation that slows traffic is unacceptable.
The general pattern is a fixed set of resources that are consumed/retired at a fixed maximum rate, where the optimal design consistently gets as close to the maximum rate as possible without exceeding the resource limit. There are (at least) two other places in software where backpressure is used, both commonly found in the database world:
Storage I/O scheduling, which unlike the network case is often not interrupt-driven. As you approach the IOPS rate limit for the system due to high priority tasks, the rate of IOPS scheduled for low priority tasks like background write-back in databases is dynamically reduced to maintain headroom for high priority tasks. This is implemented as backpressure on low-priority tasks.
In query processing, a query as a user would understand it can materialize upwards of millions of sub-operations on the same server that can run concurrently given adequate system resources. The number of sub-operations that can be in-flight and retired per second is approximately fixed and shared across all concurrent user queries. For large queries, you incrementally materialize these sub-operations at a rate based on the instantaneous capacity of the system to handle new sub-operations. This is backpressure based on execution slot (and related memory) availability.
Writing software for barrel processors takes this to the extreme. The entire software design principle is to consistently generate fine-grained concurrent threads of execution at runtime that are close to the hardware concurrency limit globally (which is very high) but never exceeds it. Highly optimized code quickly gets pretty weird but it is basically all backpressure mechanics to maintain throughput.
Here's something I do regularly: I have B bits of data, where B is multiple orders of magnitude larger than the RAM I have available. I can process chunks of B in parallel with a near linear improvement in throughput, but still at an overall lower rate than I can read it from storage.
In other words, I/O is faster than CPU for this task.
A naive design where I read the data as fast as possible on a dedicated thread, and dump it into an unbounded queue that a thread-per-core consumes, will quickly run out of memory.
By putting a limit on queue depth, the queue can communicate back to the storage reader that it can't accept more data. This is backpressure. The reader in turn can decide whether to e.g. wait, slow down, discard etc. as appropriate for the use case.
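A minimal sketch of that bounded queue, written single-process/async rather than with real threads, but with the same shape: `push` blocks once `maxDepth` chunks are waiting, and that wait is the backpressure signal to the reader.

```ts
// Bounded queue: the producer awaits push() when the queue is full (backpressure),
// and consumers await pop() when it is empty.
class BoundedQueue<T> {
  private items: T[] = [];
  private putWaiters: Array<() => void> = [];
  private takeWaiters: Array<(item: T) => void> = [];

  constructor(private maxDepth: number) {}

  async push(item: T): Promise<void> {
    while (this.items.length >= this.maxDepth && this.takeWaiters.length === 0) {
      // Queue full and nobody waiting to take: block the producer.
      await new Promise<void>((resolve) => this.putWaiters.push(resolve));
    }
    const taker = this.takeWaiters.shift();
    if (taker) taker(item);   // hand the chunk straight to a waiting consumer
    else this.items.push(item);
  }

  async pop(): Promise<T> {
    const item = this.items.shift();
    if (item !== undefined) {
      const putter = this.putWaiters.shift();
      if (putter) putter();   // wake the blocked producer: there is room again
      return item;
    }
    return new Promise<T>((resolve) => this.takeWaiters.push(resolve));
  }
}
```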
You won't/can't have it in a direct function call invocation style of programming. E.g. if you have a control loop like "call A, pass result to B, pass result to C" then it's impossible for A to be "too fast."
Network calls are the biggest source of asynchronously queued execution, but you can find models where you have it on a local machine too with multiprocessing. A trivial, silly, non-network, single-machine example might be something like unpacking compressed files and then doing [thing] with their contents: maybe you have enough CPU to do them in parallel, but you don't want to blow up your disk by unpacking all of them with no throttling. Even in your lexer/parser example, if you wanted to parallelize those steps with a queue in between them, in theory you could have such a huge input that you ran out of memory... in practice, nah, that's not very likely the way you'd do it, or a problem you'd have.
Sometimes "just drop things" or "just make the slow part faster" still aren't really easy/feasible/acceptable even without distributed systems.
I dunno if I'd really call it backpressure like the linked article does, though: "But other forms of backpressure can happen too: for example, if your software has to wait for the user to take some action."
Ehhh I think labeling user input as backpressure, because the software is waiting for _input_, is somewhere between confusing and inaccurate. When I have seen backpressure discussed in my day job, it has always involved a (theoretical or real) slow consumer, and therefore some queue in front of that consumer. I agree that "networking" or not, is irrelevant.
Unix pipes have built in back pressure. A very simplified view, assuming blocking calls and single threads: if the process on the left side of the pipe produces data too fast and fills up the buffer, the operating system won’t give it any additional CPU cycles, until the program on the right side of the pipe caught up with reading and freed up the buffer.
Things become more complicated when multi-threading and/or asynchronous IO gets involved. If the producer on the left side of the pipe has multiple threads, e.g. one writer thread and multiple worker threads producing data for the writer thread, then it has to invent its own signaling between the different threads to avoid throwing away data when its internal buffers fill up. This is effectively a kind of back pressure.
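In userland the same signal shows up in, e.g., Node's writable streams: `write()` returns false once the internal buffer passes its high-water mark, and you wait for `'drain'` before writing more. A rough sketch:

```ts
import { Writable } from "node:stream";

// Respect the writable stream's buffer: stop writing when write() returns false
// (the buffer is past its highWaterMark) and resume on 'drain'. This is the
// blocking-pipe behaviour, expressed as an explicit signal instead of a blocked process.
async function writeAll(out: Writable, chunks: Iterable<Buffer>): Promise<void> {
  for (const chunk of chunks) {
    if (!out.write(chunk)) {
      await new Promise<void>((resolve) => out.once("drain", () => resolve()));
    }
  }
  out.end();
}
```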
In your example with the lexer and the parser: if the lexer is multi-threaded, you either need an unlimited buffer for the tokens or some signaling between the threads to slow down when the parser cannot keep up. This example is a bit artificial, since most languages don't allow parallel lexers. But with parallel compilers and one (incremental) linker, this becomes more realistic.