1. Business level constraints (time, human, fiscal and other resources, stakeholders) trump technical constraints every time. Identifying these should be step zero in any design process.
2. A business-level risk model assists with appropriate design with respect to both security and availability and should ultimately drive component selection.
3. Content seems very much focused on public IP services provided through multiple networked subsystems. While this is a very popular category of modern systems design, not all systems fall into this category (eg. embedded), and even if they do, many complex systems are internal, and public-facing interfaces are partly shielded/outsourced (Cloudflare, AWS, etc.).
4. Existing depth in areas such as database replication could perhaps be grouped in a generic fashion as examples of fault tolerance and failure / issue-mitigation strategies.
5. Asynchronicity and communication could be grouped together under architectural paradigms (eg. state, consistency and recovery models), since they tend to define at least subsystem-local architectural paradigms. (Ever tried restoring a huge RDBMS backup or performing a backup between major RDBMS versions where downtime is a concern? What about debugging interactions between numerous message queues, or disparate views of a shared database (eg. blockchain, split-capable orchestration systems) with supposed eventual consistency?)
6. Legal and regulatory considerations are often very powerful architectural concerns. In a multinational system with components owned by disparate legal entities in different jurisdictions, potential regulatory ingress (eg. halt/seize/shut down national operations) can become a significant consideration.
7. The new/greenfield systems design perspective is a valid and common one. However, equally commonly, established organizations' subsystems are (re-)designed/upgraded, and in this case system interfaces may be internal or otherwise highly distinct from public service design. Often these sorts of projects are harder because of downtime concerns, migration complexity and organizational/technical inertia.
I very much want to hear the words "failure isolation" during a systems design interview. Usually as the answer to "Why did you break that functionality out into a separate service?". The answer should involve "independent scaling" and "failure isolation".
Honestly, a big reason for separate services is cultural. It tends to be easier to assign accountability and ownership when you are operating a service (monitoring, independent deployment, KPIs fall out more naturally). With a monolith you have to work really hard to prevent a diffusion of responsibility.
Yes, to some extent this means you are "shipping your org chart", but that's mostly inevitable at scale anyway.
I tried to explain this once in an interview and the interviewer was not happy. They didn't like the idea of deploying a dozen different services. They also didn't like my ideas around keeping the services relatively simple and scaling out as needed. I can say I'm glad I didn't take that job.
Having the wrong service boundaries can obscure the data flow in ways that can be hard to fix. You end up with high fanout or ugly caching layers where a simpler structure might have been possible before process boundaries were introduced.
Best to give it some time, figure out where the important interactions are and separate things that don't interact heavily.
>Your interviewer might have just failed in explaining the problem constraints properly.
The simpler and more common explanation is that the interviewer was looking to have his biases reflected by the interviewee. In my experience it is difficult for a certain kind of mind to distinguish between "This person is stupid/crazy/wrong for the job" and "This person knows something I don't".
I really don't buy this; it's mostly a cultural fad. You can design your system as multiple services interacting via a strict API even in a monolith. Failure isolation is a pattern of how to handle errors, whether they come across as a 40x HTTP status code, a JSON field called "error" or a method call throwing an exception. And a single system enables you to work with simple method calls across services instead of all the complication SOA can bring with signaling errors/serialization over a network layer. Not to mention all the operational overhead that comes with managing more hosts and more services.
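To make that concrete, here's a minimal sketch of a "strict API inside a monolith", with errors surfacing as exceptions instead of HTTP status codes. The BillingService/BillingError names are made up for illustration, not from any framework:

```python
from abc import ABC, abstractmethod


class BillingError(Exception):
    """Plays the role an HTTP 4xx/5xx or an "error" JSON field would
    play across a network boundary."""


class BillingService(ABC):
    """The only surface other modules are allowed to call."""

    @abstractmethod
    def charge(self, user_id: str, cents: int) -> str:
        """Returns a charge id, or raises BillingError."""


class InProcessBilling(BillingService):
    def charge(self, user_id: str, cents: int) -> str:
        if cents <= 0:
            raise BillingError("amount must be positive")
        # ... talk to the payment gateway here ...
        return f"charge-{user_id}-{cents}"


def checkout(billing: BillingService, user_id: str, cents: int) -> None:
    # The caller handles failure exactly as it would handle a failed RPC,
    # just without serialization or a network hop.
    try:
        charge_id = billing.charge(user_id, cents)
        print("charged:", charge_id)
    except BillingError as err:
        print("billing failed:", err)


checkout(InProcessBilling(), "u1", 500)
```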
You make the assumption that failures will be covered by defensive programming techniques. Unexpected failures will take down a monolith. One bad db query can deadlock all consumers. The greater the separation between components, the more easily you can isolate all failures, not just expected ones.
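For what it's worth, even in-process you can soften (not eliminate) this with a bulkhead: give the risky dependency its own small thread pool and a timeout, so a hung query burns only those threads rather than every request handler. A rough sketch, with slow_report_query standing in for the bad query:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Dedicated, small pool for the risky dependency: if it hangs, only
# these two threads are lost, not every request handler in the process.
report_pool = ThreadPoolExecutor(max_workers=2)


def slow_report_query() -> str:
    time.sleep(10)  # pretend this is the bad, deadlocking DB query
    return "report"


def handle_request() -> str:
    future = report_pool.submit(slow_report_query)
    try:
        return future.result(timeout=1.0)
    except TimeoutError:
        return "report unavailable"  # degrade instead of piling up


print(handle_request())
```

A process or service boundary is still the only isolation that survives the dependency exhausting memory or taking down the whole runtime, which I think is the parent's point.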
I've been calling it "fault domains" but pretty much the same thing. Also, "designing for failure modes". Things working at scale is to be expected. How to gracefully fail is what is hard.
If you're looking for "key" (dare I say "buzz") words or phrases, you're only going to get people who are gaming the system -- unless you're exceptionally lucky. That's assuming you're a large-enough employer to actually do this type of thing.
If you're not a large employer, I can elaborate, but I'll have to tread a bit carefully, hence my reticence to do so.
I of course would follow up with "Why?" and "How?". And I'm really not looking for those exact words, but the fact that nothing even close was on this post made me want to succinctly bring up the topic.
Say you have a single section of your code that's very compute intensive. If it's part of a larger monolith then you would spin up multiple instances of the entire monolith to keep up with traffic. To take a concrete example, say you have an app that does login, some social stuff, some billing, pulling from an external API and finally your special sauce - rendering a PDF of what your user has created. The last step is a bottleneck but none of the others are. If you split the renderer out into its own service, you could increase the number of renderer instances alone and your entire system would chug along.
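A sketch of that shape, with the expensive render step behind a queue so that only the number of render workers needs to grow. The render_pdf function and the in-process queue are stand-ins; in a real deployment the queue would be something like SQS/RabbitMQ and the workers separately deployed instances:

```python
import queue
import threading

render_jobs: "queue.Queue[str]" = queue.Queue()


def render_pdf(doc_id: str) -> None:
    print(f"rendering {doc_id}")  # the expensive special-sauce step


def render_worker() -> None:
    while True:
        doc_id = render_jobs.get()
        render_pdf(doc_id)
        render_jobs.task_done()


# The monolith (login, social, billing) just enqueues work; only the
# worker count needs to grow with PDF traffic. In a real deployment
# these would be separate processes/instances, not threads.
WORKER_COUNT = 4
for _ in range(WORKER_COUNT):
    threading.Thread(target=render_worker, daemon=True).start()

for doc in ("a", "b", "c"):
    render_jobs.put(doc)
render_jobs.join()
```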
You've asked the right question though. Is independent scaling "necessarily" better? No, it's a trade-off.
* To get this feature you've sacrificed the simplicity of having a single codebase leading to a single binary that can be deployed with ease.
* You've added extra components within your system like load balancers.
* What would earlier have been a function call is now RPC with network and serialisation overhead (see the sketch after this list).
* Your system might be a little more difficult to debug unless you collect logs in one place and it's possible to follow a single user across multiple components.
* It's possible for your system to fail in ways that weren't possible before. For instance there might be a network issue between two specific nodes and it's hard to figure that out unless you have proper monitoring in place.
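To illustrate the function-call-to-RPC point above: the same lookup now needs a URL, a timeout, (de)serialisation and handling for failure modes that simply didn't exist before. The user-service host and /users/ path here are hypothetical:

```python
import json
import urllib.error
import urllib.request


def get_user_local(user_id: int) -> dict:
    # Monolith version: a plain function call, no new failure modes.
    return {"id": user_id, "name": "alice"}


def get_user_rpc(user_id: int) -> dict:
    # Service version: network hop, serialisation, and a timeout.
    url = f"http://user-service.internal/users/{user_id}"
    try:
        with urllib.request.urlopen(url, timeout=2.0) as resp:
            return json.loads(resp.read())
    except urllib.error.URLError as err:
        # A category of failure the local call could never produce.
        raise RuntimeError(f"user-service unreachable: {err}") from err


print(get_user_local(1))
```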
Of course there are advantages other than failure isolation and independent scaling but I haven't gone into those.
Note that your example seems to only matter if I use differently spec'd instances for different services. If they all have the same ratio of e.g. memory to cpu to disk, I'm not sure what slicing my instances across services really gets me. If you slice it across services, you add a few instances, and your system chugs along. If you don't, you still add a few instances, and it still chugs along.
I could maybe see it if you have some services that take huge amounts of memory and others that take huge amounts of CPU. If you have a standard "monolithic app instance", and had to scale up your CPU by 10x but memory only by 1x because of your PDF renderer, you will likely be wasting large amounts of memory. But unless you have huge disparities in memory vs cpu use (and services don't vary in use together), I don't really know what sort of cost savings you can get here - wouldn't this be the actual value add of being able to independently scale: less cost because you can more accurately hit your resource (cpu vs memory vs disk and so on) requirements for your project?
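A back-of-the-envelope version of that waste argument, with entirely made-up numbers:

```python
# Hypothetical fixed-ratio instance: 4 vCPU + 16 GB RAM per box.
VCPU_PER_INSTANCE, GB_PER_INSTANCE = 4, 16

# The PDF renderer needs lots of CPU but barely any memory.
renderer_vcpu_needed, renderer_gb_needed = 400, 40

instances_for_cpu = renderer_vcpu_needed // VCPU_PER_INSTANCE   # 100 boxes
memory_provisioned = instances_for_cpu * GB_PER_INSTANCE        # 1600 GB
memory_wasted = memory_provisioned - renderer_gb_needed         # 1560 GB

print(f"instances: {instances_for_cpu}, wasted RAM: {memory_wasted} GB")
# Only pays off if CPU:RAM needs really do diverge this much between services.
```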
In contrast, from my experience, separating your web/app/database/cache layers from each other tends to be extremely beneficial for independent scaling because they almost always vary widely in how they consume which resources (memory vs cpu vs disk and so on). They also tend to be written with this in mind and so it is basically free to do so.
An aside, but many of your downsides apply not just to services, but also to scaling an application horizontally. If the name of the game is scaling, then many of these you will pay for regardless of the question of monolith vs. services.
All that above aside, I definitely feel the other benefits. But I really don't get it - every article I read about services seems to mention independent scaling, when it seems like a fairly suspect benefit. Maybe I just haven't worked on the correct project.
Coming from Square (which is mostly SOA, but with an old monolithic service), we had quite a few services which really needed to be separated for performance reasons:
- One needed a large in-process cache in order to deliver good performance; it would have consumed too much memory on each instance of the monolith.
- Some services used large ML models, which also would have consumed too much memory on monolith instances.
- A lot of our payment-related services had hourly or daily batch jobs. Anything with big resource spikes probably shouldn't share a machine with latency-sensitive code (like online payment processing or just web handlers).
- Related to the above, some jobs had to be done by a master instance. If the monolith did them, they would have disproportionately affected a single instance of the monolith.
CPU vs memory is one area, but also being blocked on IO and network saturation. An endpoint that is essentially a glorified proxy is going to scale differently than one that does real CPU work. Doing all of that in a synchronous platform is going to require massive memory for the threaded IO, or it's going to be really expensive on an event loop. So build two separate systems, put them on specialized hardware (make sure you have 10G network on the proxy).
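A toy version of the IO-bound side, using asyncio to stand in for the event-loop model; the 50 ms sleep is a stand-in for waiting on the upstream:

```python
import asyncio


async def proxy_endpoint(request_id: int) -> str:
    # IO-bound: the handler mostly waits on the upstream, so one event
    # loop keeps thousands of these in flight with very little memory.
    await asyncio.sleep(0.05)  # stand-in for the upstream round trip
    return f"proxied {request_id}"


async def main() -> None:
    results = await asyncio.gather(*(proxy_endpoint(i) for i in range(1000)))
    print(len(results), "responses")
    # The CPU-heavy endpoint gains nothing from this model and would
    # block the loop, which is one reason to host it separately.


asyncio.run(main())
```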
Each thing takes some resources. Application performance tends to get worse over time, usually because the focus is on adding features, rather than improving efficiency (sweeping generalization). So eventually you'll find yourself in a spot where your monolithic app doesn't have much headroom on existing machines. You switch to a new auth system that's 20% more expensive, and the whole system is running out of memory.
This is a contrived example. I mostly agree with you. I think you do hit a point at the application level where the big monolithic app is a little bit too big, and that's a tough spot to be in.
If you can keep things small, light and efficient mono will work forever. But always keep in mind that it can outgrow your instance sizes. So prepare to either move up to the next size instance (tough if you're running your own hardware), or start thinking about splitting out parts of the app.
Resource utilization is generally a function of traffic. Why is adding more machines to a specific service's load balancer pool better than adding more machines to the monolith's load balancer pool? You're saving maybe a little bit of memory by not loading unnecessary code, but that seems inconsequential given OS level caching.
One possible argument is that you can reserve capacity by service, so a computationally expensive but unimportant endpoint doesn't accidentally gobble up the resources you added to alleviate the starvation that was slowing down a different endpoint.
But you could also do that with smarter load balancing - deploy a monolith to all hosts, but partition traffic between separate pools depending on endpoint at the load balancer level.
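A sketch of that routing idea; the pool addresses and path prefixes are made up, and in practice the split would live in the load balancer's path-based rules rather than in application code:

```python
import random

# Every host runs the full monolith; the pools only differ in which
# traffic the load balancer sends them, so capacity is reserved per
# endpoint class without splitting the codebase.
POOLS = {
    "render": ["10.0.1.10", "10.0.1.11"],                 # expensive endpoints
    "default": ["10.0.2.10", "10.0.2.11", "10.0.2.12"],   # everything else
}


def pick_backend(path: str) -> str:
    pool = "render" if path.startswith("/render") else "default"
    return random.choice(POOLS[pool])


print(pick_backend("/render/pdf/123"))
print(pick_backend("/login"))
```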
I don't think performance isolation is a good argument for microservices. I don't think failure isolation is either - the interactions are likely to create more, trickier failures than the isolation will prevent.
The real argument is about scaling the organization. Much easier to work on and frequently deploy small codebases with small numbers of commits and committers per day, communicating across team boundaries via Thrift IDL files, than to have thousands of engineers on one codebase and thousands of potentially breaking changes introduced between every deploy.
Scaling your app isn't just about turning knobs. It's about making design tradeoffs based on your usage pattern. Most apps are very framework dependent and mixing and matching frameworks within a monolith is a nightmare. Having more flexibility around platform allows you to scale independently.
This seems like a great example of how seductive it can be to over-fragment a system, and suddenly have hard-to-debug microservices for all those little pieces, when really all that's needed in a case like this is a front-end app that does login, sessions, billing, everything except the rendering work which goes off to a queue-based worker service.
Surely, you understand I'm not literally only looking for buzzwords. I would certainly ask deeper questions on what they are trying to accomplish and why.
In my experience as a small-business ops guy, other devs just aren't interested. I try to get them to learn a bit and avoid the personnel SPOF, but they have their own bailiwicks to lounge in.
It's in my own interest to spread the knowledge around, because I like holidays-that-are-holidays.
Does "system" here mean "system of internet services"? I'm designing large systems and hope to learn more - but none of my systems have servers. Anywhere.
Haven't read the entire guide yet. But I hope it has a few lines somewhere about over-engineering a solution. Yes, fault tolerance, asynchronicity and individual scalability are virtues you want, but not for a super simple problem that needs functional work. I've been in so many discussions with people that talk about all these virtues and spend too little time on making that core function do what it is supposed to do.
Brilliant work. I may convert this into an MkDocs "formatted" project using the Material theme. I've done the same thing for the Open Guide to AWS, which I'm still working on. It vastly improves the readability and accessibility of the information.
I am only learning Erlang now (through the futurelearn course[1], posted here, about a month ago now). One of the reasons it interests me is that it was designed from the start to support high-availability concurrent, distributed processes. It's a functional, dynamic language meaning that you can reprogram a system at runtime if you want (see: gen_server).
Like I said, I'm only starting and I don't know how real Erlang systems are built. However, I suspect that they tend to eschew the orthodoxy of treating a relational database as a single source of truth with stateless app servers (these two features are the core of all the systems in the OP's thing) and instead embrace distributed, redundant statefulness. If this can be done (without becoming impossible to reason about), I suspect it represents an optimal server system in terms of resource usage, availability, and probably even performance, in a world where most organizations' datasets can fit entirely in RAM from the time they are a twinkle in someone's eye to their eventual dissolution.
I should stress that "questioning orthodoxy" is something of a hobby, which probably biases me.
This looks like a great guide, thanks! Makes me wonder how effective things like Google's app engine are in autoscaling your web apps. "Serverless" code seems too good to be true.
A missing area is identity management. Most likely this should be separated from your system (e.g. don't have a table somewhere with username, password in it).
In consumer-facing systems, OpenID Connect (the better option) is what Google practices; OAuth is used by most others.
In enterprise software, SAML is the common parlance.
That leads naturally to questions about API authorization (are API calls made on behalf of system users? If not, start probing further).
Always enlightening to start asking questions about identity management very early on in designing systems.
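As a rough sketch of what "separate it from your system" looks like in code terms: the application never stores passwords, it only redirects to the identity provider and validates the token it gets back. The IdP URL and the two helper functions below are hypothetical stubs, standing in for whatever your OIDC client library actually provides:

```python
IDP_AUTHORIZE_URL = "https://idp.example.com/authorize"  # made-up IdP


def exchange_code_for_tokens(code: str) -> dict:
    # Stub: a real implementation POSTs the code to the IdP's token endpoint.
    return {"id_token": f"token-for-{code}"}


def verify_id_token(id_token: str) -> dict:
    # Stub: a real implementation checks signature, issuer, audience, expiry.
    return {"sub": "user-123", "email": "user@example.com"}


def login_redirect_url(client_id: str, redirect_uri: str) -> str:
    # Step 1: send the browser to the identity provider; no password
    # (and therefore no password table) ever touches your system.
    return (f"{IDP_AUTHORIZE_URL}?response_type=code"
            f"&client_id={client_id}&redirect_uri={redirect_uri}"
            f"&scope=openid+profile+email")


def handle_callback(code: str) -> dict:
    # Step 2: trade the one-time code for tokens, validate the ID token,
    # and only then treat the claims as an authenticated identity.
    tokens = exchange_code_for_tokens(code)
    claims = verify_id_token(tokens["id_token"])
    return {"user_id": claims["sub"], "email": claims.get("email")}


print(handle_callback("demo-code"))
```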