How to sleep at night having a cloud service: common architecture do's (danielsada.tech)
382 points by dshacker on Nov 12, 2019 | 69 comments



One thing missing here is to avoid synchronous communication. Sync comms tie client state to server state; if the server fails, the client will be responsible for handling it.

If you use queue-based services your clients can 'fire and forget', and then your error handling logic can be encapsulated by the queue/consumers.

This means that if you deploy broken code, rather than a cascading failure across all of your systems, you just have a queue backup. Queue backups are also really easy to monitor, and make a great smoke-signal alert.

The other way to go, for sync comms, would be circuit breakers.

My current project uses queue-based communications exclusively and it's great. I have retry-queues, which use over-provisioned compute, and a dead-letter for manually investigating messages that caused persistent failures.
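
A rough sketch of that consumer shape, assuming SQS via boto3 - the queue URLs, MAX_ATTEMPTS, and handle() are placeholders, not details from my actual system:

    import boto3

    sqs = boto3.client("sqs")
    WORK_Q, RETRY_Q, DLQ = "https://sqs/.../work", "https://sqs/.../retry", "https://sqs/.../dlq"
    MAX_ATTEMPTS = 3

    def handle(body):
        ...  # the actual (idempotent) business logic goes here

    def consume_once():
        resp = sqs.receive_message(QueueUrl=WORK_Q, MaxNumberOfMessages=1,
                                   MessageAttributeNames=["All"])
        for msg in resp.get("Messages", []):
            attempts = int(msg.get("MessageAttributes", {})
                              .get("attempts", {}).get("StringValue", "0"))
            try:
                handle(msg["Body"])
            except Exception:
                # push to the retry queue, or to the DLQ after too many attempts
                target = RETRY_Q if attempts + 1 < MAX_ATTEMPTS else DLQ
                sqs.send_message(QueueUrl=target, MessageBody=msg["Body"],
                                 MessageAttributes={"attempts": {
                                     "DataType": "Number",
                                     "StringValue": str(attempts + 1)}})
            finally:
                sqs.delete_message(QueueUrl=WORK_Q, ReceiptHandle=msg["ReceiptHandle"])

The retry-queue consumers run the same loop, just reading from RETRY_Q on the over-provisioned compute.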

Isolation of state is probably the #1 suggestion I have for building scalable, resilient, self-healing services.

100% agree with and would echo the content in the article, otherwise.

edit: Also, idempotency. It's worth taking the time to write idempotent services.
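
As a minimal sketch of what that looks like, assuming each message carries a unique id (the in-memory set is just for illustration; a real service would use a durable store):

    processed_ids = set()

    def apply_side_effect(body):
        print("processed:", body)

    def handle_message(msg):
        if msg["id"] in processed_ids:
            return                      # duplicate delivery: safe no-op
        apply_side_effect(msg["body"])  # the actual work
        processed_ids.add(msg["id"])

    handle_message({"id": "m-1", "body": "add item 42"})
    handle_message({"id": "m-1", "body": "add item 42"})  # redelivered, does nothing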


Queues introduce entire other dimensions of complexity. Now you've got to monitor your queue size (and ideally autoscale when the queue backlog grows), and have a dead letter queue for messages that failed processing and monitor that. Tracing requests is harder b/c now your logs are scattered around the worker fleet, so debugging becomes harder. You need more APIs for the client to poll the async state, and you need some data store to track the async state (and now you've got to worry about maintaining and monitoring that data store). It's a can of worms that should be avoided when possible.

The only way to know whether or not to accept this kind of complexity is to think about your use cases. Quite often it's fine (and desirable) to fail fast and make the client retry.


A few other things to consider when working w/ Queue/Message Bus systems.

- Back pressure complexity.

- Sharing a queue service, leading to a potential central point of failure/capacity issue.

- Message schema management.


> Queues introduce entire other dimensions of complexity. Now you've got to monitor your queue size (and ideally autoscale when the queue backlog grows), and have a dead letter queue for messages that failed processing and monitor that.

Wouldn't you need similar mechanisms without a queue? It seems to me queues give more visibility and more hooks for autoscaling without adding additional instrumentation to the app itself.


The queue sits behind a service. If you don't do the work in the service, and do it in a queue instead, you've got more infrastructure to manage, monitor, and autoscale.


> Now you've got to monitor your queue size (and ideally autoscale when the queue backlog grows), and have a dead letter queue for messages that failed processing and monitor that.

These are both trivial things to do though. I don't see how it's any more complex than monitoring a circuit breaker, or setting up CI/CD.

> Tracing requests is harder b/c now your logs are scattered around the worker fleet, so debugging becomes harder.

Correlation IDs work just as well in a queue-based system as in a sync system.
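
For example, a small sketch of the idea - the producer stamps each message with an id and every worker log line carries it, so one request can be traced across the fleet (names here are illustrative):

    import logging, uuid

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s corr=%(corr_id)s %(message)s")
    log = logging.getLogger("worker")

    def publish(body):
        return {"correlation_id": str(uuid.uuid4()), "body": body}

    def consume(msg):
        ctx = {"corr_id": msg["correlation_id"]}
        log.info("received", extra=ctx)
        # ... do the work ...
        log.info("done", extra=ctx)

    consume(publish("add item 42"))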

> You need more APIs for the client to poll the async state, and you need some data store to track the async state (and now you've got to worry about maintaining and monitoring that data store).

Not sure what you mean. Again, in the Amazon Cart example, your state is just the cart - regardless of sync or async. You don't add any new state management at all.


It's certainly fairly simple to use queues for straightforward, independent actions, such as sending off an email when someone says they forgot their password. It's less obvious to me how your proposal lines up with things that are less so, such as a user placing an order.

So I'm having trouble envisioning how your system actually works. At least in the stuff I work on, realistically, very few things are "fire and forget". Most things are initiated by a user and they expect to see something as a result of their actions, regardless of how the back end is implemented.


> It's less obvious to me how your proposal lines up with things that are less so.

Usually you have a sync wrapper around async work, maybe poll based.

As an example, I believe that Amazon's "place in cart" is completely async with queues in the background. But, of course, you may want to synchronously wait on the client side for that event to propagate around.

You get all of the benefits in your backend services - retry logic is encapsulated, failures won't cascade, scaling is trivial, etc. The client is tied to service state, but so be it.

You'll want to ensure idempotency, certainly. Actually, yeah, that belongs in the article too. Idempotent services are so much easier to reason about.

So, assuming an idempotent API, the client would "send", then poll "check", and call "send" again upon a timeout. Or, more likely, a simple backend service handles that for you, providing a sync API.

Going from Sync to Async usually means splitting up your states explicitly.

For example, given two communication types: Sync (<->) and Async (->).

We might have a sync diagram like this:

A <-> B

A calls B, and B 'calls back' into A (via a response).

The async diagram would look like one of these two diagrams:

A -> B -> A

or:

A -> B -> C

Whatever code in A happens after what the sync call would have been gets split out into its own handler.
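
Here's a tiny in-memory sketch of that split, with queue.Queue standing in for a real message bus (the cart names are made up):

    import queue

    cart_events = queue.Queue()

    def notify_user(result):
        print("cart updated:", result)

    # Sync shape (A <-> B): the continuation runs inline in A.
    def add_to_cart_sync(cart_id, item):
        result = {"cart_id": cart_id, "item": item}   # pretend this is B's response
        notify_user(result)

    # Async shape (A -> B -> A): A only publishes; the continuation becomes its own handler.
    def add_to_cart_async(cart_id, item):
        cart_events.put({"cart_id": cart_id, "item": item})

    def on_cart_updated():
        notify_user(cart_events.get())

    add_to_cart_async("cart-1", "book")
    on_cart_updated()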

If your system is extremely simple this may not be worth it. But you could say that about anything in the article, really.


> If your system is extremely simple this may not be worth it. But you could say that about anything in the article, really.

I don't know. The level of complexity this introduces seems to be way higher than anything in the original article.

E.g. for placing something in the cart, it's not only the next page that is reliant upon it, but anything that deals with the cart - things like checkout, removing from cart, updating quantities, etc. Adding to cart has to be mindful of queued checkout attempts. And vice versa. It sounds way messier than the comparatively isolated concepts such as CI, DI, and zero downtime deploys.

Async communication certainly seems desirable across subsystems that are only loosely connected. E.g. shopping, admin, warehouse, accounting, and reporting subsystems. But by using asynchronous comms you're actually introducing more state into your application than synchronous comms. State you should be testing - both in unit & integration tests (somewhat easy) and full end-to-end tests (much more expensive).

I'm sure Amazon has all sorts of complexities that are required at their scale. But you can heavily benefit from the techniques in the OP even if you aren't Amazon scale.


> The level of complexity this introduces seems to be way higher than anything in the original article.

I don't find it very complex at all. You send a message to a service. You want to get some state after, you query for it.

> but anything that deals with the cart - things like checkout, removing from cart, updating quantities, etc. Adding to cart has to be mindful of queued checkout attempts.

How so? Those pages just query to get the cart's state. You'd do this even in a sync system. The only difference is that on the backend this might be implemented via a poll. On subsequent pages you'd only poll the one time, since the 'add-to-cart' call was synchronous.

> But by using asynchronous comms you're actually introducing more state into your application than synchronous comms.

I don't see how. Again, with the cart example, there is always the same state - the 'cart'. You mutate the cart, and then you query for its state. If you have an expectation of its state, due to that mutation, you just poll it. You can trivially abstract that into a sync comm at your edge.

    import time

    def make_sync(mutation, query, expected_state, timeout=30.0, interval=0.5):
        mutation()  # fire the async message
        deadline = time.monotonic() + timeout
        while not expected_state(query()):
            if time.monotonic() > deadline:
                raise TimeoutError("state did not converge in time")
            time.sleep(interval)  # simple retry/backoff between polls


Your solution seems to assume only one thing will be accessing what is being mutated at once. If another thread comes in and gets the cart (e.g. maybe the user reloads the page), it isn't waiting on the operation to be processed anymore. If you remove it from the queue after a few seconds of failure then fine. But if the point is "self healing" it presumably hangs around for a while.

You have to deal with this to some extent in any webapp that has more than 1 OS thread or process. But if you're keeping actions around for minutes or hours instead of seconds you're going to have to account for a lot of weird stuff you normally wouldn't.

If you really wanted something like this, I would think you would want a concept of "stale" data and up-to-date data. If a process is OK with stale data, the service can just return whatever it sees. But if a process isn't OK with it (like, say, checkout), you probably need to wait on the queue to finish processing.

And since the front end may care about these states, you probably need to expose this concept to clients. It seems like a client should be able to know if it's serving stale data so you can warn the user.
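
Something like this sketch is what I'm picturing - the read path reports whether unprocessed mutations are still pending, so callers can decide whether stale is acceptable (all names hypothetical):

    pending_mutations = {"cart-1": 2}   # e.g. derived from per-key queue depth
    carts = {"cart-1": ["book"]}

    def get_cart(cart_id, allow_stale=True):
        stale = pending_mutations.get(cart_id, 0) > 0
        if stale and not allow_stale:
            raise RuntimeError("cart has pending updates; wait or retry")
        return {"items": carts[cart_id], "stale": stale}

    print(get_cart("cart-1"))                   # browsing: stale is fine
    # get_cart("cart-1", allow_stale=False)     # checkout: insist on fresh data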

Maybe I'm mistaken.


Yeah. And this kind of system often leads to large delays without clear causes, so people re-tap many times and get into weird states.

On the extreme end of doing this well, you have stuff like Redux, which effectively does this locally plus a cache so you don't notice it. Redux has some super nice attributes and some definite advantages, but it is many times more complicated than a sync call.


> I don't know. The level of complexity this introduces seems to be way higher than anything in the original article.

HTTP isn't synchronous, we just often pretend it is. You can pretend messages are synchronous using exactly the same semantics and get exactly the same terrible failure modes when requests or responses are lost or delayed.


The article says it is open source and that it will accept pull requests.


I like guides like this that can help beginners bridge the gap between hobby and professional quality development.

I’ll add one more tip, the one I think has saved me more sleep and prevented more headache than any other as I’ve developed a SaaS app over the last 5 years.

It’s simple: Handle failure cases in your code, and write software that has some ability to heal itself.

Here are a few things I’ve developed that have saved my butt over the years:

1) An application that is deployed alongside the primary application, tails error logs and replays failed requests. (Idempotent requests make this possible)

2) Many built-in health checks, like checking back pressure on queues and auto-throttling event emitters when queues get backed up (see the sketch after this list)

3) Local event buffering to deal with latency spikes in things like SQS.
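
For item 2, a hedged sketch of the back-pressure check (boto3/SQS; the queue URL and threshold are placeholders):

    import time
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # placeholder
    MAX_BACKLOG = 10_000

    def backlog_depth():
        attrs = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"])
        return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    def emit_with_backpressure(send_one_event):
        while backlog_depth() > MAX_BACKLOG:
            time.sleep(5)        # auto-throttle the emitter until the queue drains
        send_one_event()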

I hope to eventually write more about these systems on our blog, but I never seem to find the time.


4) Sometimes it's better to fail fast and try again than spend time writing extensive error handling and retry logic.


But unless you inform the audience of those times, they will continue being ignorant of when to use your advice, as if you'd never given this advice at all.


As a beginner, until I heard that advice a little while ago it hadn't occurred to me. I disagree that advice which doesn't clarify everything has no value. It transforms an unknown unknown into a known unknown that you can fiddle around with or Google further to learn about.


> write software that has some ability to heal itself.

IMO, this is the biggest change that helps me sleep at night, and it starts with treating all servers as cattle. For me it means every server can be rebuilt and deployed at the press of a button, which leads to having failed health checks automatically redeploy servers.
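
As a rough sketch of what that looks like in practice (the health endpoint and redeploy command are placeholders for whatever your infra uses):

    import subprocess, time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/healthz"   # placeholder
    FAILURE_LIMIT = 3

    failures = 0
    while True:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                failures = 0 if resp.status == 200 else failures + 1
        except Exception:
            failures += 1
        if failures >= FAILURE_LIMIT:
            subprocess.run(["./redeploy.sh"], check=False)   # rebuild the cattle
            failures = 0
        time.sleep(30)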


That's good advice but it's hardly simple.


You know the one thing that has helped me out the most: an error reporting service AND then addressing _every_ error.

That is to say, my service should emit zero 500 errors.

Then my reporting is easy to interpret and consistently meaningful. I don't have to worry about bullshit noise "oh that's just X it does that sometimes."

Sleeping at night is a lot easier when you have less keeping you awake.


This. I have a really hard time measuring it, but ever since we really worked on error reporting our week-end sleep factor has greatly improved.

For a complex system though, don't underestimate how hard this is to do:

- Every cloud service needs to be routed to a common service

- All of your software, every language, even that cool Go experiment

- All of the third party software

- Logs all have to agree on a format, and JSON is not always an option.

Finally ... justification of time spent fixing things with no observable side effect(s). Most cloud stuff is reliable against first-order failures and so is tolerant of a lot of things; it's designed that way. But once the wheels come off, and they will come off ... buckle up if you haven't been fixing those errors. If you aren't clean on second-order failures, you're in for a rough ride.


We use AWS, and one benefit of their hosted Elasticsearch is that they can build you a Lambda that syncs CloudWatch logs to ES, handling a variety of different formats. So we have our Beanstalk web requests + some Lambda infra + our main web backend etc. all synced to ES with very little effort.

You do have the downside that they don’t have eg nicely synced structure, but that also has the upside that the structure is closer to what the dev is used to so nobody ever needs to go back to CloudWatch or any other logs to get more details or a less processed message. The other downside is you have to write a different monitor for each index, though this has the upside that you can also have different triggers per index. In our small team we just message different slack channels which makes for a nice lightweight opt in/out for each error type.

It’d definitely be tricky to get everything aligned in eg the same JSON format, but this sort of middle ground isn’t too hard and still has benefits - you just need to be already syncing in any format to CloudWatch - which if you’re in AWS you probably are.


Totally agree. In my experience "that's just X it does that sometimes" have been symptoms of some of the scariest bugs in the system we've been working on. A couple of examples:

- a caching issue that was a "just X" on a single server, but took product search (and by extension most of the business) offline if two servers happened to encounter the same problem at the same time.

- a "just X" on user logins, which turned out to be a non-thread-safe piece of code that resulted in complete outage of all authn/authz-related actions once demand hit a critical point.

On top of that, having a culture where some errors are okay to leave unfixed is tremendously damaging to team values. I've not seen a team with this attitude where the number of "just X" errors wasn't steadily increasing, with many of the newer ones being quite obvious and customer-affecting problems.


I like to keep a slack channel (and saved kibana search) for 500s for this exact reason. System-wide we should have no 500s, and when they happen I like to tackle them immediately. I also have daily reports for other various errors, like caught exceptions, invalid auths, etc just so I can see where things aren't going quite right in case it's indicative of something weird going on.


Just about everything mentioned here is well-handled by Google App Engine. I still think it’s the way to go for most projects, but I don’t think they’ve marketed themselves well lately. I’m sure there are other good providers too; I don’t see the downside to using PAAS.


GAE is incredible and poorly marketed. It's the only serverless product I know of that allows me to use whatever server framework I want (Flask, Rails, Spring) but be blissfully ignorant of the underlying VMs. I spent a week looking at all the other major alternatives out there, and I don't think GAE has any real competitors. It's just a different kind of serverless... in a really good way.

Having said that, it has some serious shortcomings: baked-in monitoring (at least for Python) is much worse than, say, Datadog + Sentry. Additionally, Google doesn't have any great relational serverless databases (which is what I personally want for a regular webapp) -- they do have some solid non-relational databases. Also, no secret store... it's very tricky to securely store secrets inside GAE.

To me, the perfect platform for a webapp is GAE + Aurora + some undiscovered secrets store.


Recently they’ve introduced Berglas which has been quite nice in handling secrets. You can store things in env variables as just secret names and it “transforms” them transparently for you into real secrets at runtime. And you can keep your env vars safely in version control.

So at least one problem's solved.


Are there any particular downsides you have with storing secrets as environment variables? It's working in my app, although configuration is done via the web UI [of Elastic Beanstalk] to keep secrets out of SCM.


Storing secrets in env vars is very common in practice, although it presents a slightly bigger attack surface than using something like Hashicorp's Vault to just pull the secrets into memory.

You can sometimes find debug pages etc. for apps and runtimes that are set up to show all set environment variables, or crash monitoring software that captures env vars and sends them elsewhere by default. Those risks can be managed, but having sensitive information not set in the process environment is more 'secure by default'. It also means that in the event someone finds a way to remotely execute code in your process (eval() on an unsanitized input, anyone?) it's much harder to dump out secrets.
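
For the pull-into-memory approach, a minimal sketch with Vault's KV v2 engine via the hvac client (address, token bootstrapping, and secret path are assumptions for illustration):

    import os
    import hvac

    # The Vault token itself still needs bootstrapping (instance identity,
    # k8s auth, etc.); env vars here are just to keep the sketch short.
    client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])
    secret = client.secrets.kv.v2.read_secret_version(path="myapp/prod")
    db_password = secret["data"]["data"]["db_password"]   # held only in process memory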


What's your issue with CloudSQL or Spanner?


What's Aurora?


Aurora Serverless database from AWS.


I came here to post this same thing. You get all of this for "free" from GAE. I built a $100MM company with three engineers on GAE, and we not only slept fine, we'd all go camping together offgrid.


> A 4 9’s means you can only have 6 minutes down a year.

4 9's is 52 minutes of downtime a year. Keep in mind that single region EC2 SLA is only 99.99%. And if you rely on a host of services with an SLA of 99.99, yours is actually worse than 99.99. So if you want to actually get to 99.99, your components have to be better than this, meaning you will have to go multi-region. So achieving this is actually way harder than this simple step.


This is a very salient point. If your service relies on N other services, each with a SLA of 99.99%, the chance of a single request having at least one failure is:

    1 - .9999^N 
Which means if you make 10 requests, you go from 99.99% to 99.9% or from 52 minutes to 8.77 hours of downtime a year.

In most cases you're likely to be making a lot more than 10 service calls.


Depends on whether those 9s are in series or in parallel. In series they multiply to produce lower availability, but in parallel they give you higher availability.
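
For concreteness, the two composition rules, using the 99.99% figure from the thread:

    from math import prod

    def series(avails):     # every component must be up
        return prod(avails)

    def parallel(avails):   # only one component needs to be up
        return 1 - prod(1 - a for a in avails)

    print(series([0.9999] * 10))   # ~0.9990 -> roughly 8.8 hours of downtime/year
    print(parallel([0.9999] * 3))  # failure probabilities multiply, so far more 9s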


> AWS will use commercially reasonable efforts to make the Included Services each available for each AWS region with a Monthly Uptime Percentage of at least 99.99%, in each case during any monthly billing cycle....

So to achieve 99.99% within a region, every component should have at least 3 nodes, and to do better than that the deployment should go multi-region, which escalates costs quickly.

Most applications in reality don't even need four 9s, so this works beautifully for everyone. I work in the outsourcing industry, and in the bad old days we had huge penalties and many rounds of explanations even for applications with no redundancy requirements ;).

But it's just Amazon credits nowadays and no one blinks an eye, so it's a win-win for all.


3 nodes of a component in parallel would give you 99.9999% for that component.


Yes, but not in AWS land. The committed SLA for availability of an entire region is still 4 nines regardless.


Hmmm. That's good to know.

So in that case you have to replicate across three regions to get 6 nines. So one component needs 9 copies running around the world to have 6 nines for the component.


Pretty much. As I said above, it works because most internal apps within the enterprise don't even need 2 nines.


Updated, Thanks!


It's still meaningful to discuss 99.99% on top of things that are around 99.99%.

For example, let's say you have a service on AWS and all your clients are on AWS. If AWS is down, you are down but so are your clients. But your clients want you to be up 99.99% of the time that AWS is up. As long as both sides are aware of the implications, this is fine.

As long as you're within the same order of magnitude, it can make sense. If a customer wanted me to be up 99.99% of the time on top of a service that is only up 99.9% of the time, I would push back.


I'd recommend using an APM product off the shelf to get a lot of the mentioned functionality in the article (Monitoring, Tracing, Anomaly Detection). I would definitely _not_ recommend trying to roll all that yourself, unless you have a ton of time and resources.

There are a few good ones out there; we use Instana and it's working really well.


This is all good advice for the app tier, but in my experience the most painful outages relate to the data store. Understand your read/write volume, have a plan for scaling up/out, implement caching wherever practical, and have backups.


I wanted to lay down some of the common things in the app tier. I think data stores get complex really quickly; it's not easy replicating and sharding unless you've got some experience with it under your belt or you use a tool.


Good article. I would add one thing to this - pick a database that scales horizontally and is distributed. CockroachDB, Elasticsearch, Mongo, Cassandra/Scylla are all good choices. If you lose one node, you don't have to be afraid of your cluster going down, meaning you can do maintenance and reconfiguration without downtime. If your load is low or bursty you can even get away with running these on some small servers such as t3 instances (probably t3.large at a minimum). Running a cloud managed database is also a good option.


Yes, and together with that, I recommend putting all state in the distributed database (or distributed file storage for large blobs). This allows you to gracefully handle crashes, stop and restart servers, etc. because you don’t lose any state in the process.


Having only one mirror is scary. If one goes down, it's like Murphy's law kicks in. So you want at least 3 things to have to go wrong in order to take down your system; 2 is not enough. Also have redundancy everywhere, in case your checker agent stops working, for example. You want 2 of everything and at least 3 of those that should never fail.


As a solo founder, I have almost everything mentioned in this article set up, except CI/CD. I can certainly see its value, but being able to easily take down parts of my production system and replace them with instrumented variants is very useful to me when things go wrong. I find that CI usually gets in the way of this. Maybe it's just a bad habit that I need to ditch :)


I think the first step is having CI but not reacting to it. For a while I shared your view, but after just "enabling it in the background" nowadays it's really useful. Even if it's just an "it compiles" check.


Depends on the kind of instrumentation you're talking about. I think metrics should always be collected, and I love having software where reasonable logs are on by default, and unreasonable levels of logging can be enabled/disabled at runtime without toggling.


Another big benefit to CI/CD is being able to work on multiple branches in parallel. That means finishing one branch, sending that to CI/CD, then working on another, sending that to CI/CD, and so on.


> being able to easily take down parts of my production system and replace them with instrumented variants is very useful to me when things go wrong

Sorry what did you mean by this?


When some service is misbehaving, I have a script to take it down and replace it instantly by an instrumented version with more logs and whatnot.


Service reconfiguration is a thing and, in systems that handle dependency management well, tends to not be too painful. (I'm in the process of writing a TypeScript library for doing exactly this, designed for NestJS but usable outside of it.)


Got it. So it's like you're running the same service, but with a DEBUG flag on?


Haven’t seen anything related to third-party services that your cloud service relies on. I’m talking mostly about APIs that you use that might crash at some point. Any recommendations on that part?


This is a great list. I feel a little happy with myself that I knew about most of these.

Except for identifying each request, I had never heard about that. It's so simple yet so brilliant, gotta start doing it.


Good article. Fire drills are worthy of mention. Simulate parts going down, practice recovery.


Nice article, may I ask what tools you used to produce the illustrations?


OneNote! :) I'm an SDE on OneNote.


Pretty funny that HN traffic seems to have killed the site.


Great article Sada :). Hope OneNote is treating you well!


Hey thanks Vishaal! Having fun everyday :) We miss you over here.


I applaud the author for sharing their notes. But also, this is why HN (and general upvote-anything-that-looks-interesting forums) sucks. If you are actually defining architecture, you should not be reading these kind of blog posts. I get that they are interesting to the layman, but so is The Anarchist's Cookbook. Don't make whatever you read in The Anarchist's Cookbook.

And I'm crabbing about this because I am easily susceptible to Anarchists Cookbooks. I have had to implement X tech before, and googled for "How do I X", and some blog post came up saying "For X, Use Y". I'm too lazy to read 5 books on the general concept, so I just dive in and immediately download Y and run through the quick-start guide. After spending a while getting it going and getting past the "quick start", I wonder, "Ok, where's the long-start? What's next?" And that doesn't exist. And later, after a lot of digging, it turns out Y actually really sucks. But the blog post didn't go into that. I wasted my time (my own fault) because I read a short blog post.

A lot of people live by Infrastructure as Code, and so they will reach for literally anything which has that phrase in its description. But you don't need it to throw together an MVP, and a lot of the IaC "solutions" out there are annoying pieces of crap. I guarantee you that if you pick any of them up, you are in for months of occasionally painful edge cases where the answer to your problem is "You just weren't using it the right way."

In reality, if you want to be DevOps (yes, I'm using DevOps as an adjective, ugh) you should probably develop your entire development and deployment workflows by hand, and only when you've accomplished all of the basic requirements of a production service by hand (bootstrapping, configuration, provisioning, testing, deployment, security, metrics, logging, alerts, backup/restore, networking, scalability, load testing, continuous integration, immutable infrastructure & deployments, version-controlled configuration, documentation, etc), then you can start automating it all. If you've done all of these things before, automating it all from the start may be a breeze. If you haven't, you may spend a ton of time on automation, only later to learn that the above need to be changed, requiring rework of the automation.


Yeah, in reality I was wary about adding the links; these solutions are often created after a problem arose in the system. But I wanted to provide a few places to see what the initial paths are for a beginner to find out more. I can't count the times when, before I was as knowledgeable, I'd sit through a conversation about Chef and Puppet that didn't make sense to me, even after I read what Chef and Puppet did.

It could be, on the other hand, that you really don't grasp the need for these technologies until you've had to manually implement one. Things like the UUIDs are basic to hook up to anything.

I can often relate this feeling to nutritional advice. "If you want to be more healthy, eat more avocados" - if you only eat avocados, but don't understand what's behind it, or the premise behind it, you'll probably get fat. But if someone tells you "Avocados are a good way to supplement your fats without blah blah blah" and you understand that there isn't a unique solution, then you'll probably be healthier.


Blame the economics of the internet. Quick starts are all you need for eyeballs.

I enjoyed the article. If I'm going to sleep easy at night I'm also going to do a lot more study than just read a blog post here and there.

I agree about the "where's the long-start guide?" sentiment though. I think that all the time. You see people say here is how to do event sourcing! Which is nice for their toy sized domain and toy architecture they have concocted. It's usually totally unworkable or woefully incomplete in a production setting.

That being said, finding a multitude of reference architectures, plus a handful of "war stories", plus an in-depth long-form educational resource about the topic is usually enough to get you to something workable and real-world.




