The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM underpins everything inside AWS, too.
IAM, hands down, is one of the most amazing pieces of technology there is.
The sheer volume is one thing, but IAM's policy engine is another. Up to 5000 different roles per account, dozens of policies that can affect any given user entity, and on top of that you can create IAM policies that apply to all entities in an account (or only a filtered subset), and each policy definition can be what, 10 kB or so, in size. Filters can include multiple wildcards anywhere, so you can't take a fast path through an in-memory index, and they can use variables that are evaluated on demand as well.
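To make the wildcard-plus-variable point concrete, here is a minimal sketch of matching a resource pattern against a request. It is not how IAM is implemented; the `matches` helper and the shape of the `context` dict are assumptions made up for illustration. The only things taken from IAM are the `*` and `?` wildcards and the `${...}` variable syntax.

```python
import re

# Matches policy variables such as ${aws:username}.
VAR = re.compile(r"\$\{([^}]+)\}")

def _wildcards_to_regex(text):
    """Translate '*' and '?' wildcards to regex; everything else is literal."""
    out = []
    for ch in text:
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def matches(pattern, value, context):
    """Return True if `value` matches `pattern` once variables are resolved.

    `context` is a hypothetical per-request dict of variable values.
    Substituted values are escaped, so a user literally named '*' does not
    turn into a wildcard. A missing variable means "no match" in this sketch.
    """
    parts = []
    pos = 0
    for m in VAR.finditer(pattern):
        parts.append(_wildcards_to_regex(pattern[pos:m.start()]))
        key = m.group(1)
        if key not in context:
            return False
        parts.append(re.escape(context[key]))
        pos = m.end()
    parts.append(_wildcards_to_regex(pattern[pos:]))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None
```

Because the variable is only known at request time, the pattern cannot be fully compiled or indexed ahead of time, which is the difficulty described above.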
And all of that is reachable not on an account-specific endpoint that could be sharded off from a shared instance should one account's load become too expensive; no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context to provide an audit and debug trail.
To achieve all that at a service quality where a change to an IAM policy takes effect in less than 10 seconds and calls take milliseconds is nothing short of amazing.
It's not simple, that's the point! The filter rules, and the ways to combine rules and their effects, are highly complex. The achievement is how fast it is _despite_ the network being involved in at least two hops: first service to IAM, and then IAM to database.
I think it's simple. It's just a stemming pattern matching tree, right?
The admin UX is ... awkward and incomplete at best. I think the admin UI makes the service appear more complex than it is.
The JSON representation makes it look complicated, but with the data compiled down into a proper processable format, IAM is just a KVS and a simple rules engine.
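As a rough sketch of what "a KVS and a simple rules engine" could look like: JSON statements are compiled once into regexes, stored in a dict keyed by principal, and evaluated with default deny and explicit deny overriding allow (those two rules are documented IAM behaviour). Everything else here, including `POLICY_STORE`, `put_policy` and `is_allowed`, is a hypothetical name for illustration, and conditions, variables and policy types are left out entirely.

```python
import re
from dataclasses import dataclass

def compile_glob(glob):
    """Compile a '*'/'?' wildcard string into a regex, once, at write time."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in glob
    )
    return re.compile(regex, flags=re.DOTALL)

@dataclass(frozen=True)
class Statement:
    effect: str            # "Allow" or "Deny"
    action: re.Pattern     # compiled action wildcard
    resource: re.Pattern   # compiled resource wildcard

# The "KVS": principal name -> list of compiled statements (in-memory stand-in).
POLICY_STORE = {}

def put_policy(principal, raw_statements):
    """Compile JSON-style statements and store them under the principal."""
    POLICY_STORE[principal] = [
        Statement(s["Effect"], compile_glob(s["Action"]), compile_glob(s["Resource"]))
        for s in raw_statements
    ]

def is_allowed(principal, action, resource):
    """Default deny; any matching Deny wins over any matching Allow."""
    allowed = False
    for st in POLICY_STORE.get(principal, ()):
        if st.action.fullmatch(action) and st.resource.fullmatch(resource):
            if st.effect == "Deny":
                return False
            allowed = True
    return allowed
```

The linear scan over statements is the part a real system would have to replace with something smarter once the number of policies per principal grows.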
Not much more complicated than nginx serving static files, honestly.
(Caveat: none of the above is literally simple, but it's what we do every day and -- unless I'm still missing it -- not especially amazing, comparatively).
IAM policies can have some pretty complex conditions that require it to sync to other systems often. Like when a tag value is used to allow devs access to all servers with the role:DEV tag.
In my (imagined) architecture, the auth requester sends the asset attributes (including tags in this example) with the auth request, so the auth service doesn't have to do any lookups against other systems. Updates are pushed message-queue style; policy tables are cached and eventually consistent.
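A minimal sketch of that imagined architecture, to show the moving parts: the request carries the tags, the auth service only consults its local rule cache, and rule updates arrive over a queue and are applied by version. All names here (`AuthRequest`, `AuthService`, the rule dict layout, the `ssh:Connect` action) are invented for the example and say nothing about how IAM actually works.

```python
from dataclasses import dataclass
from queue import Queue, Empty

@dataclass
class AuthRequest:
    principal_tags: dict   # attributes of the caller, sent by the requester
    action: str
    resource_tags: dict    # attributes of the asset, sent by the requester

class AuthService:
    def __init__(self, updates):
        self._updates = updates   # Queue of (rule_id, version, rule) messages
        self._rules = {}          # local cache: rule_id -> (version, rule)

    def drain_updates(self):
        """Apply pending pushed updates; older versions never overwrite newer."""
        while True:
            try:
                rule_id, version, rule = self._updates.get_nowait()
            except Empty:
                return
            current = self._rules.get(rule_id)
            if current is None or version > current[0]:
                self._rules[rule_id] = (version, rule)

    def authorize(self, req):
        """Decide from the request and the local cache only, no remote lookups."""
        self.drain_updates()
        for _version, rule in self._rules.values():
            if (
                rule["action"] == req.action
                and all(req.principal_tags.get(k) == v
                        for k, v in rule["principal_tags"].items())
                and all(req.resource_tags.get(k) == v
                        for k, v in rule["resource_tags"].items())
            ):
                return True
        return False
```

The trade-off is that the auth service has to trust the tags the requester sends, and a decision can be made on a rule table that is a few updates behind.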