
It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single points of failure.

Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.



> Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

That doesn't help either. us-east-1 hosts AWS's internal control plane, and a bunch of things are only available in us-east-1 at all - most importantly CloudFront, ACM certificates for CloudFront, and parts of IAM.

And that last one is the real problem. When IAM has a sniffle, everything else collapses, because literally everything else depends on IAM. If I were to guess, IAM probably handles millions, if not billions, of requests a second, because every action on every AWS service triggers at least one request to IAM.


The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM underpins everything inside AWS, too.


IAM, hands down, is one of the most amazing pieces of technology there is.

The sheer volume is one thing, but IAM's policy engine is another thing entirely. Up to 5,000 roles per account, dozens of policies that can affect any given user entity, and on top of that you can create IAM policies that blanket-affect all entities in an account (or only a filtered subset), with each policy definition allowed to be, what, 10 kB or so in size. Filters can include multiple wildcards anywhere, so you can't take a fast path through an in-memory index, and policies can also use variables that are evaluated on demand.

And all of that is reachable not on an account-specific endpoint that could be sharded off a shared instance should one account's load become too expensive - no, it's a global (and region-shared) endpoint. And if that weren't enough, every call is shipped off to CloudTrail's event log, always, with full context to give you an audit and debug trail.

To achieve all that at a service quality where a policy change takes effect in under 10 seconds and each call returns in milliseconds is nothing short of amazing.
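
To make that concrete, here's a rough sketch (a hypothetical policy, not from any real account) of the kind of statement the engine has to handle: a wildcard in the resource ARN plus a policy variable that can only be resolved from the request context at evaluation time.

    import re

    # Hypothetical statement: wildcard resource plus the ${aws:username} policy
    # variable, which has to be substituted from the request context before the
    # wildcard match can even run.
    statement = {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::team-*-data/home/${aws:username}/*",
    }

    def resolve_variables(pattern: str, context: dict) -> str:
        # Replace each ${key} with the value from the request context.
        return re.sub(r"\$\{([^}]+)\}", lambda m: context.get(m.group(1), ""), pattern)

    print(resolve_variables(statement["Resource"], {"aws:username": "alice"}))
    # -> arn:aws:s3:::team-*-data/home/alice/*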


No harshness intended, but I don't see the magic.

IAM is solid, but is it any more special than any other distributed AuthN+AuthZ service?


Scale is a feature. 500M per second in practice is impressive.


The scale, speed, and uptime of AWS IAM is pretty special.


IAM is very simple, and very solid.

The scale, speed, and uptime, are downstream from the simplicity.

It's good solid work, I guess I read "amazing" as something surprising or superlative.

(The simple, solid, reliable services should absolutely get more love! Just wasn't sure if I was missing something about IAM.)


It's not simple, that's the point! The filter rules, and the ways rules and their effects combine, are highly complex. The achievement is how fast it is _despite_ the network being involved on at least two hops - first from the calling service to IAM, and then from IAM to the database.
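
Even the most basic combination rule - and this is only the top layer, before SCPs, permission boundaries, session policies and resource policies pile on - looks roughly like this (an illustrative sketch, not AWS code):

    def combine(decisions):
        # decisions: "Allow"/"Deny" results from every matching statement
        # across every attached policy.
        if "Deny" in decisions:
            return "Deny"    # an explicit deny always wins
        if "Allow" in decisions:
            return "Allow"   # at least one explicit allow, no deny
        return "Deny"        # implicit deny when nothing matches

    print(combine(["Allow", "Deny"]))  # Deny
    print(combine(["Allow"]))          # Allow
    print(combine([]))                 # Deny (implicit)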


I think it's simple. It's just a stemming pattern matching tree, right?

The admin UX is ... awkward and incomplete at best. I think the admin UI makes the service appear more complex than it is.

The JSON representation makes it look complicated, but with the data compiled down into a proper processable format, IAM is just a KVS and a simple rules engine.

Not much more complicated than nginx serving static files, honestly.

(Caveat: none of the above is literally simple, but it's what we do every day and -- unless I'm still missing it -- not especially amazing, comparatively).
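
Something like this toy sketch is what I mean by "a KVS and a simple rules engine" - obviously nothing like AWS's actual data model, just the shape of it:

    from fnmatch import fnmatch

    # Policies keyed by principal, statements matched with simple wildcards.
    POLICIES = {
        "arn:aws:iam::123456789012:user/alice": [
            {"Effect": "Allow", "Action": "s3:Get*", "Resource": "arn:aws:s3:::logs-*"},
            {"Effect": "Deny",  "Action": "s3:*",    "Resource": "arn:aws:s3:::logs-prod/*"},
        ],
    }

    def is_allowed(principal: str, action: str, resource: str) -> bool:
        allowed = False
        for stmt in POLICIES.get(principal, []):
            if fnmatch(action, stmt["Action"]) and fnmatch(resource, stmt["Resource"]):
                if stmt["Effect"] == "Deny":
                    return False   # explicit deny wins
                allowed = True
        return allowed             # implicit deny if nothing allowed

    print(is_allowed("arn:aws:iam::123456789012:user/alice",
                     "s3:GetObject", "arn:aws:s3:::logs-2024/app.log"))  # True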


IAM policies can have some pretty complex conditions that require IAM to sync with other systems often - like when a tag value is used to allow devs access to all servers carrying a role:DEV tag.
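
Something like this hypothetical statement (aws:ResourceTag is the real condition key; the tag name and value are made up for the example) - the engine needs the resource's current tags at evaluation time to decide:

    # Allow devs to start/stop only the instances currently tagged role=DEV.
    dev_server_access = {
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "*",
        "Condition": {"StringEquals": {"aws:ResourceTag/role": "DEV"}},
    }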


In my (imagined) architecture, the auth requester sends the asset attributes (including tags, in this example) along with the auth request, so the auth service doesn't have to do any lookups against other systems. Updates are pushed in a message-queue style, and policy tables are cached and eventually consistent.
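
Roughly this shape, as a toy sketch of that imagined design (not how AWS actually does it):

    from fnmatch import fnmatch

    class CachedAuthorizer:
        """Decisions never do a synchronous lookup: the caller ships the
        resource attributes, and policies only change via pushed updates."""

        def __init__(self):
            self.policies = {}   # principal -> statements, eventually consistent

        def apply_update(self, message: dict):
            # Called by the message-queue consumer on every policy change.
            self.policies[message["principal"]] = message["statements"]

        def authorize(self, principal: str, action: str, attributes: dict) -> bool:
            allowed = False
            for stmt in self.policies.get(principal, []):
                if not fnmatch(action, stmt["action"]):
                    continue
                # Conditions are checked only against caller-supplied attributes.
                if all(attributes.get(k) == v for k, v in stmt.get("conditions", {}).items()):
                    if stmt["effect"] == "Deny":
                        return False
                    allowed = True
            return allowed

    authz = CachedAuthorizer()
    authz.apply_update({"principal": "dev-team", "statements": [
        {"effect": "Allow", "action": "ec2:*", "conditions": {"tag:role": "DEV"}},
    ]})
    print(authz.authorize("dev-team", "ec2:StartInstances", {"tag:role": "DEV"}))  # True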


> we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive.

You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.


Stocks stopped being indicative of anything decades ago though.


The irony is that true resilience is very complex, and complexity can be a major source of outages in and of itself


I have enjoyed this paper on such dynamics: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...

It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of the foldings of manifolds. Now the systems are higher dimensional and the advice more practical/heuristic.


No shot that happens until an outage breaks at least an entire workday in the US timezones. The only complaint I personally heard was from someone who couldn't load reddit on the train to work.


Well, by the time it really goes down for a whole day, Amazon leadership will be brazen enough to say "OK, enough of this, my site is down, we'll call back once systems are up, so don't bother us for a while." Also, maybe the responsible human engineers will have been fired by then, and AI can be infinitely patient while working through unsolvable issues.


No complaints from folks who couldn’t load reddit at work?


Reddit was back by the time work started, so all good there lol


when did we have resilience?


The Cold War was pretty good in terms of resilience.



