Lessons learned writing highly available code (medium.com/imgur-engineering)
182 points by _jomo on Oct 2, 2015 | 59 comments



Many good points there, but I think a lot of people could be misled by the advice about timeouts and retries. In my experience, which is a lot more than the author's two years, having each component retry and eventually fail on its own schedule is a terrible way to produce an available and debuggable system. Simple retries are fine and even necessary, but there should only be one part of the system making the often-complex decisions about when to declare something dead and give up. That way you don't end up with things that are half-dead or half-connected, which is a real pain in the butt.


in my experience, which is also a fair bit more than two years, that is deeply not-great advice for what the author's talking about, which is available systems. Systems with one-true-deciders need to serialize through the decider, which tends to lower availability and limits scalability. And, the decider can die or make wrong decisions, or get partitioned outside of a perfectly functioning set of shards or whatever.

It is, no lie, far harder to make a system that can stay available and do the right thing during a partition. Even though the research has been around for decades, we're still taking baby steps in actual functioning systems. Even Riak, the poster child for this sort of thing, was straight-up last-write-wins for years while they were making fun of everyone else. Hard problems!


I think you're confused about what "one part of the system" means. I don't mean one process running on one piece of hardware. I mean one software module with many instances - itself a distributed system, with one purpose in life (determine liveness) so it can be simple and verifiable. There's no serialization, and it does not lower availability or scalability as you claim. In fact, it preserves those properties far better than "everyone doing their own failure detection a hundred different ways" does.


oh, not confused at all! I use erlang, like the OP. Right now I have several erlang nodes that are doing a few hundred thousand http connections concurrently. Looking at the logs, it looks like about 10-20 of those concurrent processes are dying every second for various reasons (upstream crappiness, network errors, parse errors, sunspots for all I know right now), and are getting resurrected by their supervisors. Every blue moon, an entire network target will go out and the restart frequency will be too high, and the supervisors themselves will commit suicide and be restarted by their supervisors.

Sometimes a machine will go down and come back up. Looks like the one I'm on has an uptime of about 90 days, which seems low, but doesn't really matter. If too many of them go down at the same time, I'm notified. Haven't ever been notified.

You don't have to do 'failure detection a hundred different ways' (what does that even mean?) if you just crash, and build your systems to go with the flow, independently, like ants.


Well, somebody has to detect the failures, and restart failed processes, etc. so you can just "go with the flow". If you can "just crash" that's great, but it doesn't lead to much helpful advice for people who actually design the underlying systems.


Sorry I wasn't clear. That's exactly the advice. Design systems so that they can crash harmlessly in the small, and availability and scalability come to you quite naturally.

I agree the (distant) second best alternative is to do as you suggest: to try to rely on a paxos implementation that keeps a stable leader and tries to keep track of liveness. That's pretty hard too! Besides all of aphyr's jepsen test series, check this shit out: https://github.com/coreos/fleet/issues/1289 . Tough! Tough even when you think you know what you're doing. And of course no help against freak occurrences where you can't elect a stable leader any more, or you flap leaders, or the election gets stuck.

Currently the only known-good implementation of raft is in Idris. Do you know Idris? I sure as hell don't. Fortunately I know ants. Ants can die in great numbers and still move the leaf. In practice, in production, thousands of terabytes, even on AWS, my ants move and never bother me. Let it crash.


Consensus algorithms and failure-detection algorithms are two different things. For example, a gossip-based failure detector is statistically likely to converge in well bounded time despite partitions, without providing the absolute guarantees of a consensus system. Your criticisms are specific to the category that's irrelevant to this discussion. Please, if you've only built on top of such systems and merely dabbled in their implementation, don't try to bluff your way through.


Could you go into greater detail about how the only known-good implementation of Raft is in Idris? Never heard of Raft before you mentioned it (I don't deal with distributed systems day-to-day), but I am reading the paper now.

and what's ants?


First, a correction -- the only known-good implementation of Raft is actually in Coq (https://github.com/uwplse/verdi), which predates Idris (which intends to be an easier-to-use Coq).

So the problem with distributed consensus algorithms is that they are hard to understand. It didn't help that Lamport wrote his original paper playfully, using a complex and unfamiliar metaphor. As a result, many implementations of the relevant algorithms tend to miss complex edge cases. Even famous ones that many large companies rely on have either had serious bugs or have been misunderstood and misused by downstream applications.

There are a couple of ways to try to fix this. The common way is to try to write a bunch of unit tests. This doesn't work. Unit tests test only those things that your tests manage to cover, and you will probably not think of all of the edge cases.

The next most common way is to use something like QuickCheck, which automatically generates millions of cases and spends as long as you want (days, hours, weeks) hammering your code. This is much better, but still nondeterministic.
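As a toy illustration of that approach (a hypothetical sketch, not from the thread - the function and the properties tested are made up), here is a QuickCheck-style property test in Python using Hypothesis, checking the merge properties a convergent register needs:

    # Hypothetical sketch: QuickCheck-style property testing with Hypothesis.
    # The "algorithm" under test is a toy last-writer-wins register merge.
    from hypothesis import given, strategies as st

    def lww_merge(a, b):
        """Merge two (timestamp, node_id, value) registers; the greater tuple wins.
        Comparing the whole tuple keeps the merge deterministic even on ties."""
        return max(a, b)

    registers = st.tuples(st.integers(), st.integers(), st.text())

    @given(registers, registers, registers)
    def test_merge_converges(x, y, z):
        # Convergence needs commutativity, associativity, and idempotence;
        # Hypothesis generates thousands of cases and shrinks any failure.
        assert lww_merge(x, y) == lww_merge(y, x)
        assert lww_merge(lww_merge(x, y), z) == lww_merge(x, lww_merge(y, z))
        assert lww_merge(x, x) == x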

The better way is to go fully deterministic, and prove out that your algorithm works, either by exhaustively checking all possible interleavings (code paths) with a model checker, or by mathematical proof.

Historically, the model checkers (e.g. Spin or Groove) have used a pseudolanguage that you describe your algorithm in, which is then exhaustively run to completion. This approach can prove that you are on the right track, but since you cannot run those pseudolanguages in production, they are not the complete solution, because you must then translate the pseudolanguage into source code of your chosen language. This is nontrivial and very frequently there are subtle transcription errors.

An alternative approach is to use a model checker that uses your native language directly; e.g., Erlang has the brilliant Concuerror program, which instruments your actual code. This is great because if it can verify that your code works properly, then you are done; no transcription is needed. Nevertheless I don't believe there are yet any Concuerror-assisted public distributed consensus algorithms, even in Erlang. I would love to be mistaken on this point.

The last approach is to take a proof assistant language like Coq or Idris, and formally prove out the properties of the algorithm using mathematical proof techniques. This is probably optimal, because the exhaustive model checkers can, with complex/slow enough algorithms, run forever or run out of memory trying to test all the cases. However, Coq and Idris are not exactly popular languages and at this time it's not easy to implement line of business applications with them. So although there is a proven, correct implementation of Raft in Coq that guarantees linearizability, good luck accessing it. If you don't use Coq, then you're forced to transcribe it to your chosen language, which, as before, is error prone and does not result in a proven implementation.

It would be possible to mechanically transcribe Coq/Idris code into a more common language while maintaining its provability, but to my knowledge that hasn't happened yet. More likely is that Idris and its successors inherit mainstream language features and start making inroads.

Note also that maybe you don't care about being correct. For example, in the spirit of moving fast and breaking things quickly or whatever, at least one major notable VC-funded project in the news has taken the approach of just increasing timeouts in order to mask bugs in their implementation. And 99.999% of the time that will probably work fine, just as centralizing your database into one king-hell instance and just dealing with downtime every blue moon will probably be fine too. Own your own reliability alpha.


ants (the insect) are an analogy for Erlang workers


But isn't this having one part of the system responsible for retries (the tree/hierarchy of supervisor processes)?


> Even Riak, the poster child for this sort of thing, was straight up last-write-wins for years while they were making fun of everyone else.

Disclaimer: I've only used Riak for toy projects, never on a real production app, but...

I think that's overly simplistic. They've had vector clocks from basically the beginning (though I gather they've been superseded now) and have favored the strategy of offloading conflict resolution to the code that's reading the value. If you don't implement that conflict resolution it might default to last-write-wins, but that's not the same thing as "straight up last-write-wins", since they give you the tools to implement something more intelligent.


I was referring to the shipped default. You could always change to client-resolution, but then, how's the client supposed to know? I would have looked forward to their increasingly-good-looking CRDT ideas if that company hadn't undergone uncommanded descent into terrain.


Yeah. I run ~1 autonomous retry then alert/log and pass the decision upstream.

I tend to find more than one retry with a clear timeout is better. [e.g. I know after Y seconds that X is dead to the world and it can safely be re-run from the top]


Health checks are orthogonal to retries. It's not one or the other, you should have both (along with sane limits on retries so you don't cause a cascading failure with a persistently-down server).


No, it's not one or another, but they're related. Or should be. I've had to debug systems with dozens of different retry/failure intervals that had to relate to one another in very complex ways for failure detection/response to happen properly. It sucked. Systems where the different layers/modules handle simple retries but give up only according to one set of times and rules are much more robust and pleasant to work with.


If you are interested in highly available systems, I highly recommend the Pragmatic Programmers book "Release It!: Design and Deploy Production-Ready Software". The author includes some amusing but educational horror stories, such as debugging deadlocks in Oracle's JDBC drivers when "can't happen" Java exceptions thrown by connection close() were silently eaten.

https://pragprog.com/book/mnee/release-it

Another clever tip I read (not from this book) was to use prime numbers for recurring events and retry wait times. The idea was that prime numbers would help avoid unwanted system stampedes or synchronization.


I second that book. The airline story was particularly shocking and illustrated well the need for its techniques.


The article mentions exponential retry periods for that.


> 6. Prefer battle-tested tools over the “new hotness”.

Well, that won't go over well with the HN crowd, I bet! It's also the most important point in my opinion. Just because a new tool solves an old problem doesn't mean it won't have all new problems of its own. Can probably go even further:

Any popular tool or framework less than a year old has critical problems that aren't known yet. If it's not a popular tool, it will always have critical problems that aren't known.


I agree completely here. Many of my dev1s want to use the latest and greatest stuff, but you will most likely not find those technologies in large, stable production systems unless it's to facilitate some one-off service or piece of architecture that doesn't need SLAs.

I remember a slew of small tech companies having issues when they decided to build out their app stack with nginx fronting the NoSQL hotness, which inevitably led to overload scenarios because of the lack of constraints/scalability considerations.

Please note: this was years ago, and I know that nginx is a great web server and Mongo is a great NoSQL DB, but without a focus on scalability you can end up with a large headache.


It's why I stuck with client-server architecture and battle-hardened code over web apps wherever possible. Never regretted it. ;)


Exponential back off is a good idea, but to make it even better you'll want to add some random jitter to the delay period. Especially in the case of deadlocks this helps avoid another deadlock when two things would have retried at exactly the same time.


Absolutely. I made some graphs of different kinds of jitter approaches at http://www.awsarchitectureblog.com/2015/03/backoff.html. Simply backing off with delay does very little to spread out spikes of work, and can lead to longer MTTRs if the cause of the issue was overload.


Would it be possible for the server to simply return a queue position to the contending clients? For instance, "There are 37 clients ahead of you. Try again in 37ms." (assuming each write is averaging 1ms)

I suspect if the client and server worked together in this fashion you could get much closer to the linear completion time seen with no backoff and also closer to linear work (since each client would try exactly twice in an ideal world where the server accurately predicted when it should try again).


Sorry. I accidentally down voted your comment when trying to up vote it on my phone! The article was very interesting.


I think most exponential backoff schemes I've looked at treat the 'backoff delay' as a range and do randrange(0, current_backoff) or similar which does what you say.

Truncated exponential (where you cap the upper limit of the retry delay to some maximum) is also often a good idea, to prevent a short service outage from spiking the retry timers to crazy numbers.
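For concreteness, a minimal sketch of capped ("truncated") exponential backoff with full jitter in Python - the function name, base, and cap are illustrative, not from the article:

    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0):
        """Run operation(), retrying on failure with truncated exponential
        backoff plus full jitter: sleep uniformly in [0, min(cap, base * 2^n))."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the failure to the caller
                delay = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, delay))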


Doesn't exponential back-off mean that you select the time until retry uniformly at random from the interval [0, ..., 2^n-1] in the n-th round of failure, or something along those lines?


I'm guessing MySQL doesn't support this (hence the need for a cron job), but Postgres lets you set a statement_timeout on the connection. It will force-kill queries that go beyond that timeout. I worked on an app not too long ago that would occasionally have some queries go off the rails and start blocking everything. We set up Postgres to automatically kill off anything taking more than 30s, and were then able to root out the issues without worrying about everything blocking on these broken queries and taking down our systems.


Author here. We thought about using statement_timeout, but we didn't like the lack of flexibility when we do have long-running queries that aren't necessarily deleterious to performance. Instead we opted to use different users ("cron" for long-running jobs, and a normal read/write user pair for everything else) that the long-query-killer script kills at different timeouts, effectively implementing a per-user timeout rather than a global one.

I bet the best approach might be to have the statement_timeout be the largest of all of your per-user timeouts (in case your watchdog script fails, can't connect, etc. for whatever reason).
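For anyone curious what such a watchdog looks like, here is a rough sketch in Python with psycopg2 - the role names, timeouts, and DSN are made up for illustration, not Imgur's actual script:

    # Hypothetical per-user long-query killer, run periodically from cron.
    import psycopg2

    TIMEOUTS_SECONDS = {"cron": 3600, "webapp": 30}  # per-user limits (illustrative)

    def kill_long_queries(dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for user, limit in TIMEOUTS_SECONDS.items():
                # pg_terminate_backend() kills the whole backend; use
                # pg_cancel_backend() if you only want to cancel the statement.
                cur.execute(
                    """
                    SELECT pg_terminate_backend(pid)
                    FROM pg_stat_activity
                    WHERE usename = %s
                      AND state = 'active'
                      AND now() - query_start > %s * interval '1 second'
                    """,
                    (user, limit),
                )

    if __name__ == "__main__":
        kill_long_queries("dbname=app host=db.example.internal")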


I've not tried it, but according to PostgreSQL documentation, this should be possible (and recommended) to set on a per role basis?

> The ALTER ROLE command allows both global and per-database settings to be overridden with user-specific values. (...)

> The SET command allows modification of the current value of those parameters that can be set locally to a session; it has no effect on other sessions. The corresponding function is set_config(setting_name, new_value, is_local). (...)

> Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions.


Clever hack.


If that's set on the connection, could a misbehaving/poorly-written client still give you trouble (obviously it would be better to not have any other teams connecting directly to your DBs, but have to you work with what you find in place, sometimes...)? Or is this something specified at the DB side of the connection, not the client?

My practice with long-running query killer scripts is to have them ignore queries that are known to be long by running those queries on a specific user/machine(s) and then hoarding and protecting those credentials.


For MySQL, this is one of many things handled by the excellent Percona Toolkit: pt-kill


That's good advice, backed by battle-hardened experience.

I would like to add a few points.

1. Design the program as if it crashed the last time it ran. Always start up by recovering from the last crash state; a normal shutdown is simply the degenerate case. This makes restarting and retrying the program much simpler (see the sketch after this list).

2. Do retry with RANDOM exponential backoff to spread out the load. You don't want all the retries to end up at the same intervals.
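A minimal sketch of point 1, assuming a simple journal file (the names and file layout are hypothetical, just to show the shape):

    # Hypothetical sketch of "always start up as if recovering from a crash":
    # the worker journals each completed item, and startup resumes from
    # whatever the journal says, whether the last shutdown was clean or not.
    import json
    import os

    JOURNAL = "progress.journal"

    def load_last_position():
        if not os.path.exists(JOURNAL):
            return 0                      # first run and post-crash look identical
        with open(JOURNAL) as f:
            lines = [l for l in f.read().splitlines() if l.strip()]
        return json.loads(lines[-1])["position"] if lines else 0

    def record_position(position):
        with open(JOURNAL, "a") as f:
            f.write(json.dumps({"position": position}) + "\n")
            f.flush()
            os.fsync(f.fileno())          # durable before the item counts as done

    def process(item):
        print("processing", item)         # stand-in for the real (idempotent) work

    def run(work_items):
        start = load_last_position()      # recovery *is* the normal startup path
        for i in range(start, len(work_items)):
            process(work_items[i])
            record_position(i + 1)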



Re 1: better to design a program with no state beyond configuration.


Easier said than done; if you don't have any state, you're probably not doing anything all that interesting.

The trick is to keep your state in the right place, the answer usually being to punt state management to a database/message queue/etc. - you definitely want to avoid, say, stateful frontends.


At Amazon (AWS) I had started to refer to some of our control plane components as "eventually reliable"; somewhat of a pun with a good dash of truth.


How about (automated) testing? Seems the author missed an important one.


It's necessary, but insufficient. You should have automated testing for each individual component. The problem is that by itself, automated testing won't give you a reliable distributed system. You need to assume that components will fail in ways that won't be caught by your tests, and design your architecture accordingly.

(Among other problems, automated testing won't catch a hardware failure, or someone tripping over the power cord, or a cosmic ray flipping a bit and corrupting memory.)


> automated testing won't catch a hardware failure, or someone tripping over the power cord

While it's never 100%, you can still test quite a lot of those conditions.

We developed a TCP proxy[0] that allows us to test such cases: Datastore X being down / timing out / being excessively slow.

[0] https://github.com/shopify/toxiproxy


Netflix approach this with their "chaos engineering" -- characterise fault behaviour by triggering known faults.


sure but that's a larger topic than just the "best practices" article from OP. large-scale distributed systems are very difficult to test in an automated way and mostly rely on real production "tests" (ie, real traffic) to test comprehensively.


I think that, thanks to the famous Jepsen series, vendors of distributed database systems are slowly acknowledging that this kind of testing (even if it is very difficult to perform) is nevertheless a must. But for application software it is still a luxury.


Start a group of VMs / Docker containers. Write some script to randomly reboot one or a few of them?

Tests are not that hard.

Debugging the issues is harder.

To the original post author's comment about gdb/go: you can't use gdb to debug this type of problem.
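For the random-reboot idea, a minimal sketch assuming disposable Docker containers and the docker CLI on PATH (container names are made up; this is no substitute for a real fault-injection setup like Jepsen or Toxiproxy):

    # Hypothetical chaos script: restart a random container from a fixed
    # allow-list at intervals while integration tests run against the cluster.
    import random
    import subprocess
    import time

    TARGETS = ["app-node-1", "app-node-2", "app-node-3"]  # illustrative names

    def chaos_loop(interval_seconds=30):
        while True:
            victim = random.choice(TARGETS)
            print("restarting", victim)
            subprocess.run(["docker", "restart", victim], check=True)
            time.sleep(interval_seconds)

    if __name__ == "__main__":
        chaos_loop()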


Among experienced teams, most failures aren't caused by single-node/single-service errors. They've already designed & tested for that case, and the ability to handle them is baked into the architecture.

The interesting failures are caused by a cascade of errors - someone writes an innocent bug, which causes a single-node fault, which exercises some pathway in the fault recovery code that has unintended side-effects, which results in an unexpected condition elsewhere in the system.


What you suggest is a start and can certainly be helpful in uncovering some failure modes of the application, but it is by no means a complete answer. Debugging can be hard, but at least it is tractable - you start with the effect and slowly work your way toward the cause. But anticipating which changes in the environment (including but not limited to: sudden spikes in traffic, changes in the statistical distribution of data, subtle partial hardware failures, peculiarities of the systems software, etc.) will result in failures is impossible. Worst of all, the causes can each be relatively benign in isolation, but when they are all present they interact in some devilish way, resulting in a total breakdown of your service. This Facebook engineering blog post comes to mind as an example: https://code.facebook.com/posts/1499322996995183/solving-the.... So even if you test rigorously, you still need to monitor everything and have some smart people on call to look into the issue when some obscure metric goes haywire.


That one is important. I give details in this comment:

https://news.ycombinator.com/item?id=10322298


Indeed - I'd argue that's the most important one.


Those are nice. The exponential backoff is the exception: it was the cause of much degradation in TCP, and just left unnecessary holes in service. Amazon ditched it in favor of jitter for the same reason, but a different problem. If using backoff, I suggest doing non-exponential backoff where you tie it to likely disruption times w/ some error margin.

Another commenter mentioned testing. That's a good idea. Here's two articles on that:

https://queue.acm.org/detail.cfm?id=2800697

http://www.drdobbs.com/testing/testing-complex-systems/24016...


Fixed-interval + jitter can work; exponential + jitter can work too. Be super careful with retries that don't involve jitter: depending on your system architecture you may end up with highly correlated bursts of retry load. You need to be careful of intermediate levels in your system when retrying, and be able to reason about what's going to happen under common failure modes (you won't be able to anticipate the rare ones or freak occurrences, but you can defend in depth as best you can).

If a low-level dependency starts to flake out (bad code/config/data push, unreliable host or network, load spike causing your storage to melt), retrying at intermediate stages can amplify the load and turn a small or medium-sized problem into a cascading failure. You can partially mitigate this by retrying 0 or 1 times and then just giving up, especially if you control all your callers or can afford that flakiness in your error budget.

Otherwise, depending on your needs, you could maintain a retry budget (don't unless you really know you need it): instead of a flat K retries before giving up, enforce a process-wide per-operation cap on your retry rate. When the system is healthy, the overall retry rate will be low, so you have "room" to issue all K retries if needed. But if your load spikes, or your downstream dependencies start flaking out for some reason (your "fault" or otherwise), you'll avoid a positive feedback loop of death: you quickly hit your retry rate cap and shed load that would only have made things worse and was destined to fail anyway.
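A rough sketch of the retry-budget idea, token-bucket style (the class name and numbers are illustrative):

    # Hypothetical process-wide retry budget: retries spend from a bucket that
    # refills as a fraction of first attempts, so a healthy system can retry
    # freely while a melting one quickly sheds its retries.
    import threading

    class RetryBudget:
        def __init__(self, ratio=0.1, initial=10.0, max_tokens=100.0):
            self._lock = threading.Lock()
            self._tokens = initial
            self._ratio = ratio            # ~1 retry allowed per 10 requests
            self._max = max_tokens

        def record_request(self):
            with self._lock:
                self._tokens = min(self._max, self._tokens + self._ratio)

        def can_retry(self):
            with self._lock:
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return True
                return False               # budget exhausted: give up, shed load

Call record_request() on every first attempt and gate each retry on can_retry(); when dependencies flake out en masse the budget drains and the retry storm never starts.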

EDIT: A few other remarks:

* OP's general point about putting limits on everything is really important, and closely related to the general wisdom in distributed systems development: everything can and will fail, so make the failure modes obvious and explicit. You get general robustness by avoiding runaway queries, but you also force discussions at design/development time like "what happens when this queue fills up" or "what happens if you're never able to take that lock" (though you should try to avoid needing to take locks ;)

* Instrument the hell out of the paths through your system and add lots of metrics. Logging is great, do it wherever it makes sense and you can, but a lot of the time you'll be able to get more insight (or the same amount of insight with less time spent doing ad-hoc log analytics) with something simple like counters whose value you can track over time and plot on a graph somewhere. Examples: classify downstream dependency latency into buckets, count of times you hit a code path (maybe some users require a more complex DB query than others or you're coding defensively, never expect a case to happen, but want to know if it does), cache hit rate, top sources of traffic in your system. Eventually you'll want to optimize bottlenecks and without data to identify them and prove they're fixed, you're flying blind.
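To make the counters/buckets point concrete, a minimal hand-rolled sketch (a stand-in for whatever metrics library you actually use):

    # Hypothetical process-local counters with coarse latency buckets, to be
    # flushed or scraped into your metrics pipeline elsewhere.
    import threading
    from collections import defaultdict

    _counters = defaultdict(int)
    _lock = threading.Lock()

    def incr(name, amount=1):
        with _lock:
            _counters[name] += amount

    def observe_latency(name, seconds):
        # Classify into buckets so distribution shifts show up on a graph.
        for bound in (0.01, 0.05, 0.1, 0.5, 1.0, 5.0):
            if seconds <= bound:
                incr("%s.le_%ss" % (name, bound))
                return
        incr("%s.gt_5s" % name)

    # Example call sites:
    # incr("cache.hit"); incr("cache.miss")
    # observe_latency("db.primary.query", elapsed_seconds)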


Another rule of mine is to avoid languages/frameworks with large, complicated run-times. They introduce a huge cognitive load and go against the rule of keeping things simple.


Imgur is the only site I regularly see offline or overloaded.


I wouldn't call imgur highly available. More like highly unavailable.


They are rarely down in my experience. Then again most of the work is just serving images from AWS so they shouldn't be down.


Stopped reading after the story about the cron script killing long-running DB queries - it means the system is a huge mess, where killing the request is a simpler solution than finding which part of the code generates it. I've seen such systems, and I'm sure that's a hopeless state; I can only read "thankful to architecture principles" here as sarcasm.


Why not both? It's not unheard of for a deploy with index issues to cause a load spike in production. Yes, you need to know about the query to fix it, but killing runaway queries once they're obviously orphaned can also mean the difference between an outage and a slight degradation in site response times.


Because if you're killing queries, you can forget about data consistency. And when data is in that condition, we should never call it a "good enough" way.



