
It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single points of failure.

Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.



> Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

That doesn't help either. us-east-1 hosts AWS's internal control plane, and a bunch of things are only available in us-east-1 at all - most importantly CloudFront, ACM certificates for CloudFront, and parts of IAM.

And that last one is the real problem. When IAM has a sniffle, everything else collapses, because literally everything else depends on IAM. If I were to guess, IAM probably handles millions, if not billions, of requests a second, because every action on every AWS service triggers at least one request to IAM.


The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM underpins everything inside AWS, too.


IAM, hands down, is one of the most amazing pieces of technology there is.

The sheer volume is one thing, but IAM's policy engine is another thing entirely. Up to 5,000 roles per account, dozens of policies that can affect any given user entity, and on top of that you can create IAM policies that blanket-affect all entities in an account (or only a filtered subset), with each policy definition allowed to be, what, 10 kB or so in size. Filters can include multiple wildcards anywhere, so you can't take a fast path through an in-memory index, and policies can also use variables that are evaluated on demand.

And all of that is reachable not on an account-specific endpoint that could be sharded off a shared instance should one account's load become too expensive - no, it's a global (and region-shared) endpoint. And if that weren't enough, every call is shipped off to CloudTrail's event log, always, with full context to give you an audit and debug trail.

To achieve all that at a service quality where a policy change takes effect in under 10 seconds and each call returns in milliseconds is nothing short of amazing.
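
To make that concrete, here's a rough sketch (a hypothetical policy, not from any real account) of the kind of statement the engine has to handle: a wildcard in the resource ARN plus a policy variable that can only be resolved from the request context at evaluation time.

    import re

    # Hypothetical statement: wildcard resource plus the ${aws:username} policy
    # variable, which has to be substituted from the request context before the
    # wildcard match can even run.
    statement = {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::team-*-data/home/${aws:username}/*",
    }

    def resolve_variables(pattern: str, context: dict) -> str:
        # Replace each ${key} with the value from the request context.
        return re.sub(r"\$\{([^}]+)\}", lambda m: context.get(m.group(1), ""), pattern)

    print(resolve_variables(statement["Resource"], {"aws:username": "alice"}))
    # -> arn:aws:s3:::team-*-data/home/alice/*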


No harshness intended, but I don't see the magic.

IAM is solid, but is it any more special than any other distributed AuthN+AuthZ service?


Scale is a feature. 500M per second in practice is impressive.


The scale, speed, and uptime of AWS IAM is pretty special.


IAM is very simple, and very solid.

The scale, speed, and uptime, are downstream from the simplicity.

It's good solid work, I guess I read "amazing" as something surprising or superlative.

(The simple, solid, reliable services should absolutely get more love! Just wasn't sure if I was missing something about IAM.)


It's not simple, that's the point! The filter rules, and the ways rules and their effects combine, are highly complex. The achievement is how fast it is _despite_ the network being involved on at least two hops - first from the calling service to IAM, and then from IAM to the database.
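
Even the most basic combination rule - and this is only the top layer, before SCPs, permission boundaries, session policies and resource policies pile on - looks roughly like this (an illustrative sketch, not AWS code):

    def combine(decisions):
        # decisions: "Allow"/"Deny" results from every matching statement
        # across every attached policy.
        if "Deny" in decisions:
            return "Deny"    # an explicit deny always wins
        if "Allow" in decisions:
            return "Allow"   # at least one explicit allow, no deny
        return "Deny"        # implicit deny when nothing matches

    print(combine(["Allow", "Deny"]))  # Deny
    print(combine(["Allow"]))          # Allow
    print(combine([]))                 # Deny (implicit)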


I think it's simple. It's just a stemming pattern matching tree, right?

The admin UX is ... awkward and incomplete at best. I think the admin UI makes the service appear more complex than it is.

The JSON representation makes it look complicated, but with the data compiled down into a proper processable format, IAM is just a KVS and a simple rules engine.

Not much more complicated than nginx serving static files, honestly.

(Caveat: none of the above is literally simple, but it's what we do every day and -- unless I'm still missing it -- not especially amazing, comparatively).
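
Something like this toy sketch is what I mean by "a KVS and a simple rules engine" - obviously nothing like AWS's actual data model, just the shape of it:

    from fnmatch import fnmatch

    # Policies keyed by principal, statements matched with simple wildcards.
    POLICIES = {
        "arn:aws:iam::123456789012:user/alice": [
            {"Effect": "Allow", "Action": "s3:Get*", "Resource": "arn:aws:s3:::logs-*"},
            {"Effect": "Deny",  "Action": "s3:*",    "Resource": "arn:aws:s3:::logs-prod/*"},
        ],
    }

    def is_allowed(principal: str, action: str, resource: str) -> bool:
        allowed = False
        for stmt in POLICIES.get(principal, []):
            if fnmatch(action, stmt["Action"]) and fnmatch(resource, stmt["Resource"]):
                if stmt["Effect"] == "Deny":
                    return False   # explicit deny wins
                allowed = True
        return allowed             # implicit deny if nothing allowed

    print(is_allowed("arn:aws:iam::123456789012:user/alice",
                     "s3:GetObject", "arn:aws:s3:::logs-2024/app.log"))  # True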


IAM policies can have some pretty complex conditions that require IAM to sync with other systems often - like when a tag value is used to allow devs access to all servers carrying a role:DEV tag.
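
Something like this hypothetical statement (aws:ResourceTag is the real condition key; the tag name and value are made up for the example) - the engine needs the resource's current tags at evaluation time to decide:

    # Allow devs to start/stop only the instances currently tagged role=DEV.
    dev_server_access = {
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "*",
        "Condition": {"StringEquals": {"aws:ResourceTag/role": "DEV"}},
    }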


In my (imagined) architecture, the auth requester sends the asset attributes (including tags, in this example) along with the auth request, so the auth service doesn't have to do any lookups against other systems. Updates are pushed in a message-queue style, and policy tables are cached and eventually consistent.
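
Roughly this shape, as a toy sketch of that imagined design (not how AWS actually does it):

    from fnmatch import fnmatch

    class CachedAuthorizer:
        """Decisions never do a synchronous lookup: the caller ships the
        resource attributes, and policies only change via pushed updates."""

        def __init__(self):
            self.policies = {}   # principal -> statements, eventually consistent

        def apply_update(self, message: dict):
            # Called by the message-queue consumer on every policy change.
            self.policies[message["principal"]] = message["statements"]

        def authorize(self, principal: str, action: str, attributes: dict) -> bool:
            allowed = False
            for stmt in self.policies.get(principal, []):
                if not fnmatch(action, stmt["action"]):
                    continue
                # Conditions are checked only against caller-supplied attributes.
                if all(attributes.get(k) == v for k, v in stmt.get("conditions", {}).items()):
                    if stmt["effect"] == "Deny":
                        return False
                    allowed = True
            return allowed

    authz = CachedAuthorizer()
    authz.apply_update({"principal": "dev-team", "statements": [
        {"effect": "Allow", "action": "ec2:*", "conditions": {"tag:role": "DEV"}},
    ]})
    print(authz.authorize("dev-team", "ec2:StartInstances", {"tag:role": "DEV"}))  # True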


> we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive.

You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.


Stocks stopped being indicative of anything decades ago though.


The irony is that true resilience is very complex, and complexity can be a major source of outages in and of itself


I have enjoyed this paper on such dynamics: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...

It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of the foldings of manifolds. Now the systems are higher dimensional and the advice more practical/heuristic.


No shot that happens until an outage breaks at least an entire workday in the US timezones. The only complaint I personally heard was from someone who couldn't load reddit on the train to work.


Well, by the time it really goes down for a whole day, Amazon leadership will be brazen enough to say "OK, enough of this, my site is down, we'll call back once systems are up, so don't bother us for a while." Also, maybe the responsible human engineers will have been fired by then, and AI can be infinitely patient while working through unsolvable issues.


No complaints from folks who couldn’t load reddit at work?


Reddit was back by the time work started, so all good there lol


when did we have resilience?


The Cold War was pretty good in terms of resilience.



