I like the Circuit Breaker as a stability pattern.
It is also explained very well in "Release It!" by Michael T. Nygard ( http://pragprog.com/book/mnee/release-it )
This seems like a problem that only manifests in environments with heavyweight concurrency primitives: threads, processes, etc.
If you're going to hit this kind of problem often, I'd suggest using a language with better concurrency primitives, either actors (Erlang/Akka in Scala) or a reactor (node.js).
Granted, it's often not up to a late-stage maintainer which language was used, but the state of the art in language design is giving us much more power to deal with the kind of problems this "design pattern" solves.
I don't understand - how does using actors/reactor allow you to skip using circuit breakers?
You might implement it differently, for example the circuit breaker might just be a wrapper around your original function `Request => Future[Response]`. But I don't see how using actors saves you from thinking about the consequences of throwing a normal workload at a failing service.
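For what it's worth, here's a minimal sketch of that wrapper idea in Scala (the class name, threshold, and error handling are made up for illustration, and it omits the half-open/reset-after-timeout logic a real breaker needs):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// Sketch: wrap any Request => Future[Response] so that after `maxFailures`
// consecutive failures we fail fast instead of hitting the broken service.
// (A real breaker also needs a half-open state that retries after a timeout.)
class CircuitBreaker[Req, Res](underlying: Req => Future[Res], maxFailures: Int)
                              (implicit ec: ExecutionContext) {
  private val failures = new AtomicInteger(0)

  def apply(req: Req): Future[Res] =
    if (failures.get() >= maxFailures)
      Future.failed(new RuntimeException("circuit open: failing fast"))
    else
      underlying(req).andThen {
        case Success(_) => failures.set(0)            // service healthy again, reset
        case Failure(_) => failures.incrementAndGet() // count the failure
      }
}
```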
It doesn't. Most of these design patterns apply to Java where these patterns are not necessarily natural.
If you're using an actor-model language then you probably implemented that pattern without thinking about it, because it's just part of handling each state of an actor. Especially in Erlang, where you have a timeout keyword and where the OTP framework exposes more complex error-recovery patterns.
I've definitely implemented patterns like this with Akka futures. Fundamentally, concurrency and figuring out when to send test requests to the resource was the hard part, not Java or Scala.
In fact, unnatural as they are, I often had to use Java-style patterns to break through Akka patterns. Like it or not, the "Java-style patterns" are a lot closer to the metal than futures or actors are.
Using a reactor or actors doesn't completely mitigate the problem, but it does make the symptoms considerably less harmful.
Some hypothetical examples:
Java/Ruby: 8 threads, mean service time: 50ms (20 requests/s per thread). A system like this has an upper-bound throughput of 160 requests/s. If one request is allowed to time out after, say, 5s, the system is effectively reduced to 7 threads while one is "locked" servicing the timed-out request. It doesn't take many requests like this to significantly degrade performance, and the only remedy (absent circuit breakers) is to throw a ton more hardware/workers at the problem = $$$.
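Spelling out the arithmetic (same assumed numbers as above, not measurements):

```scala
// Back-of-the-envelope throughput for the thread-bound example.
val threads            = 8
val meanServiceTimeSec = 0.05                          // 50 ms per request
val perThreadRps       = 1 / meanServiceTimeSec        // 20 requests/s per thread
val maxRps             = threads * perThreadRps        // 8 * 20 = 160 requests/s upper bound
// While one thread sits in a 5 s timeout, only 7 threads serve traffic:
val degradedRps        = (threads - 1) * perThreadRps  // 7 * 20 = 140 requests/s
```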
Consider the problem in, say, node.js. If a single request times out, that request will cause problems, but it won't have _nearly_ the same capacity-starving effects as the thread-bound example above; the request will simply time out, and other requests won't be starved, because the number of in-flight concurrent requests isn't limited by the thread/process count of the system.
I'm realizing as I write this that I'm straying a little off-topic here, but the point is, using process/thread-based concurrency in a high-performance system where failure is likely is a bad idea. It's just too easy to get the kind of failures Fowler describes: "What's worse if you have many callers on a unresponsive supplier, then you can run out of critical resources leading to cascading failures across multiple systems."
Your typical synchronous servlet container (Tomcat, Jetty, etc.) maintains a thread pool and dedicates a single thread to each request for that request's lifetime. These thread pools can easily hold hundreds of threads on everyday hardware (vs. something like forking Unicorn, where 8 Ruby processes consume quite a bit of memory).
This works well for many workloads. It allows a straightforward blocking-IO model, and you don't typically have to worry about a few slow requests bogging everything down (which you do if you only have 4-8 Unicorn processes). I'd say in many apps the database becomes a bottleneck before the pool runs out of threads.
For Tomcat 8: "The default HTTP and AJP connector implementation has switched from the Java blocking IO implementation (BIO) to the Java non-blocking IO implementation (NIO)."
Keeping average response time low is only part of the point of a circuit breaker. Another goal might be to reduce load on the back end service to give it a chance to recover.
How do actors or a reactor handle the specific issue of "remote calls to software running in different processes, probably on different machines across a network"? The examples you gave assume that everything can be done in the same process, which isn't always possible. (E.g., a network-based service that uses dedicated GPU hardware for performance reasons, hardware too expensive to duplicate on each machine.)
The recovery method will be specific to the remote resource. E.g., if failures are often congestion-related, then a retry with an exponential delay might work. A circuit breaker is just one of many possible fault detection and recovery modes. What language has built-in support for all of them?
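As a sketch of the exponential-delay retry (the retry count and starting delay are arbitrary, and the blocking sleep is just to keep the example short):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Sketch: retry a failing call with an exponentially growing delay between attempts.
@tailrec
def retryWithBackoff[T](op: () => Try[T], retriesLeft: Int, delayMs: Long): Try[T] =
  op() match {
    case Success(v)                     => Success(v)
    case Failure(e) if retriesLeft <= 0 => Failure(e)
    case Failure(_) =>
      Thread.sleep(delayMs)                              // crude: blocks the calling thread
      retryWithBackoff(op, retriesLeft - 1, delayMs * 2) // double the delay each time
  }

// Hypothetical usage: retryWithBackoff(() => Try(callRemote()), retriesLeft = 3, delayMs = 100)
```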
Also, I've used this idiom in a different context. Some of my code parses a file containing multiple records, one per line. Some of the records are unparseable, for rare but valid reasons I won't get into. The proper action in that case is to report the problem, skip the record, and go on to the next.
But sometimes people use the wrong format entirely. If (say) 95 of the first 100 records are invalid, then it's very likely that the wrong file was passed as input. In that case, the "circuit breaker" should trip, rather than incorrectly trying to parse the rest of the file, which might have occasional "valid" records. My implementation has a low-level reader, plus an intermediate circuit breaker to add the specific failure policy.
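Roughly the shape of that intermediate layer, sketched in Scala (the 95-of-100 threshold, names, and Either-based error type are just for illustration, not my actual code):

```scala
// A low-level reader yields one line per record, we parse each, and we abort
// early if nearly all of the first `window` records are invalid (wrong file,
// most likely) rather than grinding through the rest.
def parseWithBreaker[A](lines: Iterator[String],
                        parse: String => Either[String, A],
                        window: Int = 100,
                        maxBad: Int = 95): Iterator[Either[String, A]] = {
  var seen = 0
  var bad  = 0
  lines.map { line =>
    val result = parse(line)
    seen += 1
    if (result.isLeft) bad += 1
    if (seen <= window && bad >= maxBad)
      throw new IllegalArgumentException(
        s"$bad of the first $seen records failed to parse; wrong input file?")
    result // otherwise: report/skip the bad record and carry on
  }
}
```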
Since your suggested solution doesn't seem to match the problem, and because the referenced idiom applies to more than concurrency, I respectfully disagree with your assessment.
Isn't this independent of the concurrency primitive? It is a fundamental characteristic of making a distributed call to a resource that the application in question needs to act on the response, or the failure thereof, in a timely manner (i.e. not fire-and-forget).
Even if you are using actors or the reactor pattern at some point they need the equivalent of a circuit breaker in them. It just pushes it to a lower layer.
Unless you're just advocating avoiding distributed calls that require a timely response whenever you can, which I agree with as far as it's feasible, but that obviously can't always be avoided.
In the book, IIRC, the example for Circuit Breaker was a timeout of Oracle DB connections in a connection pool, due to a firewall in the middle. You'd have to read the book to get the full story, but basically it was a resource-contention issue: a socket was cut off in the middle without being closed at the connection end. Sockets, by nature, don't time out. That's why Nygard says in the book that firewalls are a way of violating TCP/IP standards.
Connection problems are not limited to concurrency-heavy environments. Your USB device might have a flaky cable, or some issue might be happening over the network. No matter the language, you'll have to deal with that.
I agree that, at least in Erlang/OTP, timeouts are first-class citizens (not sure about Akka), so that kind of pattern is implemented more naturally.
As a lover of Erlang, I don't see why this pattern wouldn't be useful in Erlang. (Though, to some degree it's already there, as (IIRC) a node unresponsive to heartbeats will eventually be brought down, and messages will fail to send immediately.) It's perfectly possible for gen_servers to get wedged and exhibit exactly this behavior…
Nice to see "Release It!" get a shout-out from Martin Fowler. That book is one of the most valuable and terrifying reads I've ever encountered for those of us working on big, complex systems.
If I understand correctly, the goal is to avoid too many calls being blocked for $timeout seconds at the same time, by failing fast when too many calls have _already_ failed.
Am I correct that blocked calls can still accumulate before $threshold calls time out? E.g. if 1000 calls are initiated before 5 of them time out, there are still 1000 blocked calls.
It seems that this problem can also be resolved by limiting the number of in-flight calls: if too many calls are already waiting for a response, reject further calls.
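E.g. something like this (a sketch; the limit and the rejection exception are made up, and a real system would surface a nicer error):

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{ExecutionContext, Future}

// Reject new calls outright once `maxInFlight` calls are already waiting
// on the remote service, instead of letting blocked calls pile up.
class InFlightLimiter(maxInFlight: Int)(implicit ec: ExecutionContext) {
  private val permits = new Semaphore(maxInFlight)

  def call[T](op: () => Future[T]): Future[T] =
    if (!permits.tryAcquire())
      Future.failed(new RuntimeException("too many calls already in flight"))
    else
      op().andThen { case _ => permits.release() } // release on success or failure
}
```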