Amazon’s distributed computing manifesto (1998) (allthingsdistributed.com)
247 points by werner on Nov 16, 2022 | 53 comments



This is one example of the CEO making something happen that essentially birthed AWS.

Bezos, of all people, was like "make it happen." And it happened. It was basically work for no reason except future-proofing. Having someone up the food chain OK this much work for the future (and no hard-dollar benefit) is highly unusual.

And besides that they've done some incredible things with their infrastructure, like authorization. Distributed authorization is really hard, but at AWS it's completely invisible. Remove a permission from an IAM role and it moves through AWS really, really fast. It's totally magic. Anyone who was abused by CORBA knows how hard that is to do well.

Their newer stuff (like Cognito) is sort of weird, but other things are surprisingly solid given how big AWS is. Small shops have trouble shipping feature complete software, and BigCorps can be even worse. AWS has gotten really good at it.


I wonder if we are far too quick to bestow credit on CEOs for something other people effected. Sure, the CEO is the one who signs off on everything, but the question to ask is: could any other person in the CEO role at that time have done anything different? I don't know the details in this particular case, but I'll go out on a limb and say the CEO quite likely did not proclaim "make this happen". Business was growing at an unbelievable pace, their systems were probably stressed to the max, their development was likely choked, and the technical team came up and said this is what we need to do, otherwise we can't handle more than this much traffic. What choice does the CEO have? He says "send me your proposal" and broadcasts it.

As for AWS, as far as I remember, Bezos was initially against the idea. The idea was the brainchild of one Andy Jassy, who along with Rick Dalzell convinced a reluctant board to try it out. They realized that they had been unintentionally building this cloud platform for some years in order to provide sellers with computing resources. Opening it up to public users was just a small sales move. Whether they did it or not, they were going to continue to invest in their cloud platform and nothing would change as far as their technical direction was concerned, so the board finally relented.


Distributed authorization is indeed hard! IAM is one of the few AWS services (maybe the only one) that isn't regional, and that's because permissions must propagate globally for correctness' sake. As a distributed systems junkie, I'm shocked that other folks aren't as interested in authorization systems, because they really push the boundaries of what we can do with data consistency at scale.

It's unfortunate that only Amazon themselves can add new permissions to IAM to secure their services. Why can't our applications add new permissions to IAM and query those? This is going to be a shameless plug, but it was this very problem that caused my cofounders and me to quit our jobs and start a company. Together (and now with a community of hundreds of users and contributions from a few well-known companies) we built SpiceDB[0], the culmination of state-of-the-art distributed systems and authorization technology developed in the open instead of behind closed doors at a hyperscaler. We were mostly inspired by the internal system at Google, which is actually more powerful than AWS's or Google Cloud's IAM services, despite a fork of it powering GCP's IAM.

[0]: https://github.com/authzed/spicedb
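
To make the model concrete, here's a toy sketch of the Zanzibar-style relationship-tuple idea in plain Python. This is not SpiceDB's actual API; the schema and names are invented for illustration. The point is that applications define their own object types, relations, and permissions, and checks are answered from stored tuples:

    # Illustrative only: a toy Zanzibar-style relation-tuple store, not SpiceDB's API.
    from collections import defaultdict

    class TupleStore:
        def __init__(self):
            # (object_type, object_id, relation) -> set of subject ids
            self.tuples = defaultdict(set)

        def write(self, obj_type, obj_id, relation, subject):
            self.tuples[(obj_type, obj_id, relation)].add(subject)

        def check(self, obj_type, obj_id, permission, subject, schema):
            # A permission is whatever union of relations the application defined.
            return any(
                subject in self.tuples[(obj_type, obj_id, rel)]
                for rel in schema[obj_type][permission]
            )

    # Application-defined schema: "read" on a document is granted to readers or writers.
    schema = {"document": {"read": ["reader", "writer"], "write": ["writer"]}}

    store = TupleStore()
    store.write("document", "manifesto-1998", "reader", "user:alice")
    assert store.check("document", "manifesto-1998", "read", "user:alice", schema)
    assert not store.check("document", "manifesto-1998", "write", "user:alice", schema)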


It's easy till you break IAM itself with your policy complexity, and random services start dying because other AWS components a few layers deep can't get IAM to parse the policy.


Intriguing. Can you share details or an overview of why it failed for you? It would be a handy list of gotchas for me.


Essentially there's a maximum size of IAM policy, which AFAIK is not documented properly anywhere - get close to it or exceed it and you start getting random failures everywhere.


Character limits & the number of applied policies are all publicly documented: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_i.... I'm not aware of any evaluation-complexity limits and have never run into that sort of problem in my ~10 years of dealing with IAM.

I expect you ran into this sharp bit: "You can add as many inline policies as you want to an IAM user, role, or group. But the total aggregate policy size (the sum size of all inline policies) per entity cannot exceed the following limits." Calculating the sum would be a pain as a user.
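
As a rough sketch of what calculating that sum could look like (boto3, with a hypothetical role name "MyRole", ignoring pagination; the compact-JSON length is only an approximation of how IAM counts non-whitespace characters):

    import json
    import boto3

    iam = boto3.client("iam")
    role_name = "MyRole"  # hypothetical role name

    total = 0
    for policy_name in iam.list_role_policies(RoleName=role_name)["PolicyNames"]:
        doc = iam.get_role_policy(RoleName=role_name, PolicyName=policy_name)["PolicyDocument"]
        # Serialise compactly to approximate IAM's counting, which per the
        # linked docs does not count whitespace.
        total += len(json.dumps(doc, separators=(",", ":")))

    print(f"Approximate aggregate inline policy size for {role_name}: {total} characters")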


We didn't use inline policies much, but we had many policies linked across different objects, and the error message never pointed at the actual problem. Somehow we didn't stumble upon the docs you mention (those are going into my notes :D).

I no longer work on that project, but it was a considerable blocker when I was leaving, as SageMaker notebooks started randomly failing to start depending on the role they were launched with.


Yeah, I can see that happening. There are combinations of roles etc that might hit the limit.

Do you remember what was failing? That would give some insight into how these get evaluated.

I know that S3 does evaluation differently than the other services, which gave me some insight into the process. Unfortunately I forgot what the insight was (doh).


The service that hit it was SageMaker Notebooks, or specifically the underlying EC2 instance (which you normally don't see as a customer, afaik). It failed trying to attach a network interface to the instance because of an IAM failure mentioning something rhyming with "blown stack" (it's been over a year, so I don't recall the details).


Can you (or anyone) say a bit about how the auth service is implemented from a distributed systems perspective? For example is it some kind of custom KV store?


In AWS, authentication and authorization happen within the application.

For the purposes of authorization, services integrate with a library that handles retrieving and caching policies based on caller identity. Services create a context that includes all of the relevant metadata (service, operation, resources, etc.), and the library evaluates the policy and says allow or deny.

Doing it all in the application means that if the control/distribution systems for auth go down, most things that are in motion will remain in motion, and that the authentication/authorization code deploys out at per-service granularity, which also scopes the blast radius.

There are some pretty obvious pain points (doing anything as a library means updating the world for new features), but it has nice degradation properties and is relatively straightforward to grok as a service owner.
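
A minimal sketch of that library pattern (names like fetch_policies and is_authorized are illustrative, not AWS internals): policies are cached per caller with a TTL, evaluation happens in-process, and a stale copy keeps being served if the control plane is unreachable.

    import time

    class PolicyCache:
        def __init__(self, fetch_policies, ttl_seconds=300):
            self._fetch = fetch_policies   # callable that pulls policy docs from the control plane
            self._ttl = ttl_seconds
            self._cache = {}               # caller identity -> (expires_at, policies)

        def policies_for(self, caller):
            now = time.monotonic()
            entry = self._cache.get(caller)
            if entry is None or entry[0] < now:
                try:
                    self._cache[caller] = (now + self._ttl, self._fetch(caller))
                except Exception:
                    if entry is None:
                        raise              # never seen this caller: fail closed
                    # control plane unreachable: keep serving the stale copy
            return self._cache[caller][1]

    def is_authorized(cache, caller, service, operation, resource):
        # Evaluated inside the service process; an explicit Deny always wins.
        allowed = False
        for stmt in cache.policies_for(caller):
            matches = (stmt["service"] == service
                       and operation in stmt["operations"]
                       and resource.startswith(stmt["resource_prefix"]))
            if matches and stmt["effect"] == "Deny":
                return False
            if matches and stmt["effect"] == "Allow":
                allowed = True
        return allowed

    # Toy usage with an invented policy shape:
    cache = PolicyCache(lambda caller: [
        {"service": "s3", "operations": {"GetObject"}, "resource_prefix": "bucket/", "effect": "Allow"},
    ])
    print(is_authorized(cache, "role/reader", "s3", "GetObject", "bucket/key"))  # True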


Well, it's really tough, because (1) every operation has to check if the calling entity is authorized, (2) changes need to propagate super quickly, and (3) performance needs to be pretty much realtime.

At some level every API call is authorized (and tracked).

To be honest, this is one of the secret sauces that makes AWS go. Someone once told me that they're not doing anything exciting, just caching, but I'm pretty sure they didn't really know what was going on.


I think this is a very interesting and clear manifesto, and almost certainly the right thing for Amazon at the right time (and presumably part of what led to AWS). However, at one of the previous companies I worked at, we got an ex-Amazon person as CTO who grew the company from 50 engineers to 500 in 2 years and pushed microservices everywhere. Very impressive, and I think all-microservices made sense at one level to handle that sort of growth, but it doesn't make sense technically in a lot of cases.

Essentially I think we've gone too far: service-oriented architectures turned into "micro" services, which come with a lot of complexity and distributed systems issues. I think for most small companies monoliths are right, for medium-sized companies (say 50+) it makes sense to carefully introduce a few separate services, and only for large companies (say 300+) do many services (which may or may not be "micro") start to pay off. I've heard it said that "microservices solve a people problem, not a technical one", and I think that's true.


This sounds similar to Flexport's CTO situation (he came from Amazon) and attempted microservice-ification of everything. Except it sounds like they weren't even able to get wheels up and are still floundering after years of planning and attempted execution.


It is most likely true. If you take away all constraints related to planning, communication, prioritization, collaboration, development efficiency, etc., most of the arguments against monoliths go away. What remains are considerations for memory, bandwidth, and the like that could motivate a breakup.


As one of the main designers of the original system (but who had left by the time this architectural change was done), that is an interesting read. Always good to see the things that we missed in 1994/1995, even though we believed we were thinking far, far ahead.


I'm sure it would have been nice to have that tech in 94 and yet at the same time I get the feeling it had to play out the way it did for Amazon to succeed. Without the first part of the journey Amazon would not have gone on to build AWS.


The rest of us in 1994 were doing Sun RPC calls while getting started with DCOM and CORBA. Amazon's bet on distributed computing is actually quite interesting given the landscape back then.


So, interestingly, they made that bet internally, standardised their own platform, and then released some sort of abstracted on-demand cloud compute services. However, the tools they developed for themselves and the SOA style of development would have been valuable to others too. Google did the same. Netflix did the same. None of this stuff really emerged as a product. I'd argue it still hasn't. If it had, maybe we'd be doing things a bit differently now. But then I guess proprietary RPC-based architectures sort of failed along the way when we look at the list you mentioned.


They didn't fail per se, hence why we now have gRPC, after Web Services, Jini, RMI, .NET Remoting, XML-RPC, JSON-RPC, ...

Every generation keeps re-inventing them.


Shel and I would not have touched DCOM nor CORBA with a 22m fishing rod.


CORBA was interesting, but the authorization side really strangled it. I worked with Tivoli back in the day, which was pretty much the largest production CORBA application in existence. CORBA allowed them to be super flexible when implementing methods, but the auth was brutal. In the end they had to cache all the auth information everywhere just to get decent performance out of it.


I've read the stories about what you guys built. It's pretty epic. And it's not like you were trying to do cool stuff; it was literally based on a need. That's what's amazing. Just manipulating software and infrastructure to do something it wasn't particularly made for just yet.


Sure, but plenty of us did, Nokia Networks infrastructure had plenty of CORBA for several years, and so did many CERN research projects processing HLT data.


"And with every few orders of magnitude of growth the current architecture would start to show cracks in reliability and performance, and engineers would start to spend more time with virtual duct tape and WD40 than building new innovative products. At each of these inflection points, engineers would invent their way into a new architectural structure to be ready for the next orders of magnitude growth."

That last part, to me, is the key to success: getting the whole business to do things in a new way. That is fucking hard. If you can get your business to do it, you have an invaluable superpower. The more things you can reinvent, and the faster you can reinvent them, the more superpowers you have. It's one thing to change your architecture. But also imagine getting every employee to change how they deal with vacations, suppliers, customers, finance, or with entirely new industries. The easier it is to adapt and change, the longer you survive and the more you thrive. Evolution, baby.


I wish there was a way to quantify the externalities of "success" of this kind. How many developers had to burnout? How many relationships had to suffer, or never even had a chance to bloom because "success" didn't leave time for anything else? And also to be considered are the downstream effects of a culture of such "success", like how Amazon's warehouse employees are treated.


"But also imagine getting every employee to change how they deal with vacations"

Interesting example. Why would changing distributed computing architecture have an impact on vacation policy?


I'm saying architecture is just one way of changing an organization. Other ways of changing an organization, separate from anything technical, might include changing people's schedules or vacation policy, or who you hire, or where, or how. Another would be how you store parts, make orders, assemble products. Or starting work in an entirely new industry.

Maybe you work at a company that sometimes works with the government. As a result, the whole company might develop a hiring process which is very slow, very detailed, and excludes certain people from being hired. But probably only a very small number of employees actually have to conform to those government requirements. You can apply them to all new hires "for simplicity", but it makes it harder to hire for non-government positions. So changing how you hire, to make it easier and faster to hire people of a wider background, benefits your organization. If your org can't easily make those changes, it will be disadvantaged.


Oh, I got it, you meant the "architecture" of the firm (i.e. the org chart).


> All of this was being done before terms like service-oriented architecture existed.

I feel like the first time I heard the term was the early 2000s, and wasn't it a mainframe thing first? Dunno, just wondering.

Anyhow, it's nicely written, very concise, and worth noting how the original author focuses more on "What kind of realistic options do we have?" than winning the A vs. B vs. C argument in one fell swoop.


SOA as a buzzword started with DCOM and CORBA distributed computing, then evolved into the XML spaghetti of XML-RPC and Web Services.

Ironically, when .NET was launched, Microsoft's vision was web services everywhere, with orchestration servers like BizTalk.

We got there eventually, only using REST (aka JSON-RPC) and gRPC instead.


Things like Java RMI existed beforehand, and there were elements of the industry moving towards server-based partitioning of services. The big difference is that none of it was formalized, and there was little consistent language with which to speak about these paradigms. At the beginning, yes, people would discuss having one mainframe call another mainframe, but today that would be SoA.


Yeah, the buzzwords have changed, but some version of the concept has been in the air at least since I was learning Delphi in the 90s


I remember seeing the term around the mid to late 2000s. But it was also used primarily in the context of enterprisey J2EE, WebLogic servers, and various IBM hardware that made everything way more complicated than it needed to be.


It was definitely pre-2000. The first “SoA” firm I worked for, I started at in '99; they had been doing it for 2 years already, and most of the crew brought it from a prior gig.


Although not using the same buzzwords, we had the same architecture deep into the 1980s in Inmos/Transputer/Occam-land.


> We propose moving towards a three-tier architecture where presentation (client), business logic and data are separated. This has also been called a service-based architecture. The applications (clients) would no longer be able to access the database directly, but only through a well-defined interface that encapsulates the business logic required to perform the function.

It is really interesting to see a recent(ish) trend away from this three tier design and back towards tighter coupling between application layers. Usually due to increased convenience & developer ergonomics.

We've got tools that 'generate' business layers from/for the data layer (Prisma, etc).

We've got tools that push business logic to the client (Meteor, Firebase, etc.).


For what it's worth, Amazon's architecture for the core retail business has, if anything, moved even further up in abstraction. Tighter coupling is something that simple usecases can afford. Large scale but low-complexity can be closely coupled. High-complexity can't be.

The thing about Amazon's systems is that they are horrendously complex. In ~2016 I was working on the warehousing software, and it was a set of some hundreds of microservices in the space, which also communicated (via broad abstraction) to other spaces (orders, shipments, product, accounting, planning, ...) which were abstractions over 100s of other microservices.

So what I observed at the time was a broad increase in abstraction horizontally, rather than vertically. This manifesto describes splitting client-server into client-service-server; the trend two decades later was splitting <a few services, one for each domain> into <many services, one for each slice of each domain>, often with services that simply aggregated the results of their subdomains for general consumption in other domains.
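
As a toy illustration of that aggregator shape (the subdomain services here are made-up stubs, not real Amazon services): one service fans out to several subdomain services and merges their answers into a single view that other domains can consume.

    import asyncio

    async def get_inventory(item_id):      # subdomain service 1 (stubbed)
        return {"on_hand": 12}

    async def get_inbound(item_id):        # subdomain service 2 (stubbed)
        return {"in_transit": 40}

    async def get_reservations(item_id):   # subdomain service 3 (stubbed)
        return {"reserved": 5}

    async def availability(item_id):
        # The aggregator fans out to each subdomain service and merges the
        # results into one answer for other domains (ordering, planning, ...).
        on_hand, inbound, reserved = await asyncio.gather(
            get_inventory(item_id), get_inbound(item_id), get_reservations(item_id)
        )
        return {
            "item": item_id,
            "available_now": on_hand["on_hand"] - reserved["reserved"],
            "available_soon": inbound["in_transit"],
        }

    print(asyncio.run(availability("B000123")))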

I'm sure things have only gotten more complicated since then (in particular, a large challenge at the time was the general difficulty in producing maintainable asynchronous workflows, so lots of work was being done synchronously or very-slightly-asynchronously that should have been done in long-running or even lazy workflows).


A big part of the difference has to be that if you have a small number of developers (esp. n=1) and you can deploy everything at once, then those layers just get in the way of fast change. It seems Amazon were optimising for the ability to distribute data because they had big volume, and hide its form so they could change it without having to change lots of applications.

Of course, there’s some cargo-culting around services where people jump to that architecture before they need it, but for most apps YAGNI. It’s cool that their architecture was driven by clear needs “just in time” to allow them to continue to scale.


Nowadays you separate services by business capability and not by "layer". Layers just lead to dependencies, and dependencies lead to bad reliability and terrible development speed.


What Amazon were describing here is simply the division between a frontend web gateway service (or, in modernity, client-delivered SPAs); an API backend service to serve the XHRs of the web-gateway / SPA; and some kind of DBMS where user-visible query schema is separable from storage architecture via e.g. views. I don't think there's any modern system that doesn't have those things, no?


Certainly you can build a server-side rendered web application without a strict separation between frontend and backend, and you absolutely should do so if you can. The common separation into frontend and backend microservices exists only because JavaScript is so terrible that it's worth the effort to use a different language for the backend, but at the same time you can't go all the way, because frontend tooling for backend languages (e.g. Java) is even worse. Introducing this technical separation generally only causes more complexity, inefficient network communication, and bad developer experience. It is a historic wart that will hopefully go away over time.

As for "a DB where query schema is separated from storage via views": the usual pattern nowadays is to not share data wherever possible (by building self-contained microservices that are aligned with business capabilities instead of layers), to have a private database per microservice (in which case the view indirection is pointless), and then to provide a stream of business events to other microservices, which build their own replicated data model from it, thus decoupling their data model from external influences.

I haven't seen any modern company use views to decouple schemas, but I suppose it is the obvious solution in a 1998 world where everyone shares the same database. If you add asynchronous replication to that, it is basically identical to modern event-based replication.
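
A minimal sketch of that event-based replication (event names and fields invented): the consuming service builds its own private read model from the producer's business events, and applying an event is idempotent, so at-least-once delivery is safe.

    local_read_model = {}   # this service's private, replicated view of orders

    def apply_event(event):
        # Idempotent per (order_id, version): replaying the stream is safe.
        kind, order_id, version = event["type"], event["order_id"], event["version"]
        current = local_read_model.get(order_id, {"version": -1})
        if version <= current["version"]:
            return  # already applied (at-least-once delivery)
        if kind == "OrderPlaced":
            local_read_model[order_id] = {"version": version, "status": "placed",
                                          "items": event["items"]}
        elif kind == "OrderShipped":
            current.update(version=version, status="shipped")

    for event in [
        {"type": "OrderPlaced", "order_id": "o-1", "version": 1, "items": ["book"]},
        {"type": "OrderShipped", "order_id": "o-1", "version": 2},
        {"type": "OrderShipped", "order_id": "o-1", "version": 2},  # duplicate delivery
    ]:
        apply_event(event)

    print(local_read_model)  # {'o-1': {'version': 2, 'status': 'shipped', 'items': ['book']}}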


It sounds like you're talking mostly about CRUD OLTP systems. Amazon in 1998 didn't actually have very many of those!

Consider instead what 1998 systems engineering looks like in the context of a Big Data OLAP data-warehouse (one where having denormalized replicas of it per service would cost multiples of your company's entire infra budget), where different services are built to either:

1. consume various reporting facilities of the same shared data-warehouse, adding layers of authentication, caching, API shaping, etc.; to then expose different APIs for other services to call. Think: BI; usage-based-billing reporting for invoice generation; etc.

2. abstract away Change Data Capture ETL of property tables from partners' smaller, more CRUD-y databases into your big shared data warehouse (think: product catalogues from book publishers), where the service owns an internal queue for robust at-least-once idempotent upserts into append-only DW tables.

At scale, an e-commerce storefront is more like banking (everything is CQRS; all data needs to be available in the same place so that realtime(!) use-cases can be built on top of joining gobs of different tables together) than it is like a forum or an issue-tracker.

There's a reason Amazon was the company to define the Dynamo architecture: their DW got so big it couldn't live on any one vertically-scaled cluster, so they had to transpose it all into a denormalized serverless key-value store (and do all the joins at query time) to keep those Big Data use-cases going!
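
For what it's worth, the at-least-once idempotent upsert from point 2 above can be sketched roughly like this (SQLite standing in for the warehouse; the table, key, and message shapes are invented for illustration):

    import sqlite3

    dw = sqlite3.connect(":memory:")
    dw.execute("""CREATE TABLE catalog_changes (
        source_id TEXT, change_seq INTEGER, title TEXT, price_cents INTEGER,
        PRIMARY KEY (source_id, change_seq))""")   # append-only: one row per change

    def ingest(change):
        # Idempotent: replaying a CDC message with the same (source_id, change_seq)
        # is a no-op, so at-least-once delivery from the internal queue is safe.
        dw.execute(
            "INSERT OR IGNORE INTO catalog_changes VALUES (?, ?, ?, ?)",
            (change["source_id"], change["change_seq"], change["title"], change["price_cents"]),
        )

    for msg in [
        {"source_id": "isbn-0131103628", "change_seq": 1, "title": "The C Programming Language", "price_cents": 4999},
        {"source_id": "isbn-0131103628", "change_seq": 1, "title": "The C Programming Language", "price_cents": 4999},  # redelivery
        {"source_id": "isbn-0131103628", "change_seq": 2, "title": "The C Programming Language", "price_cents": 4499},
    ]:
        ingest(msg)

    print(dw.execute("SELECT COUNT(*) FROM catalog_changes").fetchone())  # (2,)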


It’s interesting to think about how much of a perspective shift this must have been, especially the service oriented bits. Interestingly, I think it might not have even been made completely in the authors’ minds at the time of this proposal. (Which is understandable, of course. It’s a proposal, not a retrospective on already accepted ideas.)

For example,

> In the case of DC processing, customer service and other functions need to be able to determine where a customer order or shipment is in the pipeline. The mechanism that we propose using is one where certain nodes along the workflow insert a row into some centralized database instance to indicate the current state of the workflow element being processed.

definitely doesn’t seem to reflect the hiding of a database behind an interface. (From a workflow node’s perspective, rows in that centralized database should be an implementation detail it has no knowledge of.)

Then again, this is part of their pitch for workflow processing, not service-oriented architecture.


Anecdotally, it was at least 2015 before the DC processing system was actually mostly operating against service-oriented interfaces (when I left in 2016 we had a few old tools left that still talked to the databases directly :/ ).


> Currently much of our database access is ad hoc with a proliferation of Perl scripts that to a very real extent run our business.

There are companies started later than 2010 where this was still the case. Interesting to think about how shipping things quickly is so different than scaling them up.


Seeing that Werner Vogels has submitted this entry, I wonder if he can comment how long it took to actually build this out. When did a form of this service oriented architecture work in production at Amazon?


I don't think this is really from 1998.


The blog post is recent, but it describes much older work, so I think the “(1998)” tag is right.

“Distributed Computing Manifesto

Created: May 24, 1998”


The blog post (what the OP link takes you to) is from 2022, but the manifesto itself (the substance of the post) is from 1998; so both dates should be used:

> Amazon's distributed computing manifesto (1998) (2022)


Or:

> Amazon's 1998 distributed computing manifesto

Title has already been changed from TFA to include 'Amazon', and it's still 2022 so no need for that.





