AWS IAM Roles, a tale of unnecessary complexity (infosec.rodeo)
255 points by wglb on Nov 11, 2022 | 87 comments



The author was right that it only makes sense when you know the history of how we got to where we are - but there is a bit more to it. It is all a series of band-aids. Band-aids all the way down. As somebody who has been involved with AWS since 2012, I've seen them all get added incrementally in response to explosions in usage and complexity and customer unhappiness and frustration.

Explicit allows - being all you could do in an IAM policy - were easy(ish) when there was a handful of AWS services and API actions. As more and more services and policy actions were added, they became unwieldy. Enter Permission Boundaries, where you could wrap a few explicit denies around them. Kubernetes RBAC is nearly at the same place and could now really use those too - but I digress.

Also, early on in my AWS journey, even having two accounts (one non-prod and one prod) was only done half the time and viewed as a best practice to think about - people generally just opened one AWS account. But when IAM wasn't enough (i.e. there wasn't enough granularity on the resources or the conditions exposed etc.) the answer became separate AWS accounts as the only, or at least the easiest, way to enforce these authorization boundaries/separations with a blunt instrument you could trust. It also helped to keep your bills straight before AWS could break them down by Tag.

Then you often needed cross-account role assumptions to deal with the inevitable cases where things or people needed access between these accounts.

Then the explosion in AWS accounts led to AWS Organizations to provision and manage them all. It built in Service Control Policies and OUs as a tool/layer to help further manage/constrain IAM policies/permissions (IAM policies, Permission Boundaries, and SCPs now forming a Venn diagram with each other these days).

But AWS Organizations managing heaps of accounts was also too painful to use and get right and so they brought in AWS Control Tower to help make setting up Organizations easier.

So, in short, this all makes sense when you understand the inability to totally rewrite/refactor important complex systems used by customers (breaking backward compatibility), and the instinct to instead keep solving all the challenges with a steady stream of incremental band-aids that you can announce at re:Invent.


What in your opinion is a good/best implementation then? We're currently engaged in re-designing our authorization flow, and I was planning to use a model quite similar to AWS IAM (policies, etc.). Contrary to this article, I actually thought the IAM model was simple in that we can tie each entity's access down to a set of policies that we can independently develop, store, and process for providing access.


If you are developing a simple service that only needs access to an S3 bucket and a DynamoDB table etc. at runtime then it is pretty straightforward to write the explicit allows to the right resources afforded in the 'base' AWS IAM.
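For a service like that, the whole thing fits in a couple of statements. A minimal sketch (the bucket name and table ARN are placeholders):

```python
import json

def service_policy(bucket: str, table_arn: str) -> dict:
    # Least-privilege policy for a service that reads objects from one
    # S3 bucket and queries one DynamoDB table - nothing else.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": [table_arn],
            },
        ],
    }

policy = service_policy(
    "my-app-data",
    "arn:aws:dynamodb:us-east-1:123456789012:table/my-app-table",
)
print(json.dumps(policy, indent=2))
```

The pain starts when the list of statements grows past a handful and the resources stop being this easy to name.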

Where things usually get tricky is for the CI/CD pipelines and/or the administrative users that need way more access. It is very hard to scope to true least privilege - including for things like lots of ECS/Fargate where you don't want people to mess with each other's Task Definitions/Tasks/Services hosted within the same AWS account for one example. The various AWS services are very hit-and-miss for how well you can scope resources and what conditions they offer you.

Security will say "no resource star", which is a best practice but quite difficult to get right in most larger accounts. Permission Boundaries help in being able to flip the conversation to "let's list all the things these users/pipelines shouldn't be able to do, instead of what they should, to then constrain the more wildcard-y/star-y IAM policy we need." But even those are getting harder these days because there is so much you don't want your average operator/admin to be able to do too.

Usually people throw up their hands and over-provision generally or give each team their own AWS account and overly permissive access within it - but ring-fenced to their own stuff at least as a risk trade-off. Though I think the pendulum may be swinging back from a bazillion AWS accounts (with all those problems) back to trying to solve the IAM problem with additional new tools (CIEMs etc) that will help them to manage IAM as-a-service with a pretty UI or by letting you scope down users/roles to only the activity they have done within the last 7 days etc.

There is a great line that "complexity isn't created or destroyed - it is just made somebody else's problem" - do you want to make these an AWS IAM problem or an AWS account-management problem? A pipeline/automation problem or a heavily-staffed security team who can write great IAM policies/PBs/SCPs problem? A SaaS vendor we can procure a CIEM from problem? etc.


IAM, like all things AWS, has its own API and rich CLI. So, we pulled this problem outside of IAM entirely.

We created a policy system that allows us to define these individual minimized policies based upon specific services that we've created. We have a tool that can then combine these bite-sized policies into a larger policy, merging compatible actions and resources to give you a resultant policy that is equivalent to, but often much smaller than, the simple concatenation of all the individual policies.

You can use the resulting policy in a variety of ways. It's very easy to just make a custom role, set this as an inline policy, and then use some custom tools to keep the policy updated.

In some cases, we went with a "policy.d" directory in a project source tree that contains symlinks to all the small specific policies it's using, and some deployment commands that use these symlinks to create a resultant policy document. If you want to add or remove a policy for a project, it's as easy as adding or removing a symlink. Likewise, it makes it much easier to audit which policies are actually attached to the project.
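A sketch of what that combining step might look like (simplified: real statements can also carry Condition, NotAction, etc., which a production tool would need to handle):

```python
def merge_policies(policies: list) -> dict:
    # Merge statements that share an Effect and Action list by
    # unioning their Resources, shrinking the combined document.
    merged = {}
    for pol in policies:
        for stmt in pol["Statement"]:
            key = (stmt["Effect"], tuple(sorted(stmt["Action"])))
            merged.setdefault(key, set()).update(stmt["Resource"])
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": effect, "Action": list(actions), "Resource": sorted(resources)}
            for (effect, actions), resources in merged.items()
        ],
    }

read_a = {"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject"],
                         "Resource": ["arn:aws:s3:::bucket-a/*"]}]}
read_b = {"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject"],
                         "Resource": ["arn:aws:s3:::bucket-b/*"]}]}
combined = merge_policies([read_a, read_b])  # one statement, two resources
```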


Thanks for your response, but what I was looking for is your opinion on what is a better solution if IAM has all these issues. We're just starting the implementation, so this is very timely.

In terms of our situation, we provide fine grained access to distributed resources, mainly data elements: think records/fields. An example is to define which user, group, and role can access which records and which fields within each record and to what extent (e.g., can't access SSN at all, can only get last 4 digits of phone number, can see first/last name, etc.).

I really liked the policy approach of IAM so my plan was to let data owners define policies that are then applied to users, groups, and roles. At run time our coordinator engine will check levels of access to each query (that could be to one data store like Postgres or Salesforce or a federated query spanning multiple data stores). By assigning a set of policies (with IAM's effect/action/resource/condition model), we can make this happen in a flexible way as I see it. Effect also has "deny," so that would be very useful for a majority of situations.
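If it helps, the core of that evaluation model (implicit default deny, with an explicit deny overriding any allow) is only a few lines. This is an illustrative sketch with exact-match actions and resources, no wildcard handling:

```python
def is_allowed(policies: list, action: str, resource: str) -> bool:
    # IAM-style evaluation: an explicit Deny anywhere wins; otherwise
    # at least one Allow is required (implicit default deny).
    allowed = False
    for pol in policies:
        for stmt in pol["Statement"]:
            if action in stmt["Action"] and resource in stmt["Resource"]:
                if stmt["Effect"] == "Deny":
                    return False
                allowed = True
    return allowed

policies = [{"Statement": [
    {"Effect": "Allow", "Action": ["records:Read"],
     "Resource": ["record/123", "record/999"]},
    {"Effect": "Deny", "Action": ["records:Read"],
     "Resource": ["record/999"]},
]}]
```

Here `is_allowed(policies, "records:Read", "record/123")` passes, `record/999` is explicitly denied despite the allow, and anything unmentioned falls through to the default deny.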

A hierarchical model like Google's as mentioned in the article doesn't seem as flexible as this IMO.


I’ve found this blogpost very helpful in thinking about designing a permission system: https://tailscale.com/blog/rbac-like-it-was-meant-to-be/


Ahh sorry. AWS IAM is a good model it just struggles to scale to ~13,000 possible API actions in the platform as it’s grown. The models all have trade-offs - but the fact Kubernetes pretty much used the original AWS IAM model for their RBAC so many years later shows it is a good one…


One potential solution: https://github.com/cerbos/cerbos. It's a standalone service (deployed alongside your app) which evaluates access decisions at runtime against contextual/arbitrary data on the principal and resources.

In your case, your resource could be a "record" for more global yes/no decisions, or perhaps as a "field" for more granular cases. Things like "can only get last 4 digits of phone number" could be achieved through attribute-based conditions set within the policies.
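To make the masking idea concrete - this is not Cerbos's actual policy syntax, just a plain-Python illustration of per-field rules (the rule names are made up):

```python
def apply_field_policy(record: dict, field_rules: dict) -> dict:
    # field_rules maps field name -> "allow" | "last4"; anything else,
    # or an unlisted field, is treated as deny and dropped entirely.
    visible = {}
    for field, value in record.items():
        rule = field_rules.get(field, "deny")
        if rule == "allow":
            visible[field] = value
        elif rule == "last4":
            visible[field] = "*" * (len(value) - 4) + value[-4:]
    return visible

record = {"name": "Ada", "ssn": "123-45-6789", "phone": "5551234567"}
shown = apply_field_policy(record, {"name": "allow", "phone": "last4"})
# shown == {"name": "Ada", "phone": "******4567"} - the SSN never leaves
```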

> I really liked the policy approach of IAM so my plan was to let data owners define policies that are then applied to users, groups, and roles

An advantage of Cerbos is that policies are defined and deployed separately from business logic in (yaml/json) config files, so no changes are required in code when policies need updating.

> At run time our coordinator engine will check levels of access to each query

Can't wrap my head around this particular part - is this checking if an entity can or cannot run a particular query, or specifically based on the "things" the query is returning?

(as a disclaimer I should mention that I work there, although Cerbos itself is Apache 2 licensed and completely free to use)


It will really depend on your workflows, what services you use, what risks you're trying to manage and what trade-offs you're willing to accept on usability vs security.

For some scenarios, resource-based policies can form the foundation of your auth flow. If you look at the flows described in the docs [1], resource evaluation is simpler. You still need to solve the problem of effectively managing all of those resource policies and the limitations on where they can be applied [2]. That might be an easier problem to solve than trying to express everything as an identity policy. You're then less concerned with wider permissions at the IAM level and move the responsibility to the owner of the resource.

[1] https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_p... [2] https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_a...


In my opinion issues start to arise with dynamic provisioning on the stack level. I do not mean cloudformation stacks specifically, but rather stacks as a "bag of heterogeneous resources deployed together, with inter-dependencies on each other and a singular purpose".

As long as you have a constant number of stacks consisting of ec2 (even with individual resources autoscaling), lambdas, whatever really, you can write an IAM policy for that. It might be tricky but generally doable.

As soon as you get into a variable number of stacks you also get into dynamic IAM generation, and that is really hard. Add IAM adjustments for user-based inputs, sprinkle in cross-account access, and there you have it: an endless stream of new IAM headaches.
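The usual escape hatch is to generate scoped policies from a naming convention at provisioning time. A hypothetical sketch (the resource naming scheme here is invented):

```python
def stack_scoped_policy(account_id: str, stack_name: str) -> dict:
    # Generate a deploy policy confined to resources that follow a
    # "<stack_name>-*" naming convention. Only works if every resource
    # type you use actually supports ARN-level scoping.
    prefix = f"{stack_name}-"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:*"],
             "Resource": [f"arn:aws:s3:::{prefix}*"]},
            {"Effect": "Allow", "Action": ["lambda:*"],
             "Resource": [f"arn:aws:lambda:*:{account_id}:function:{prefix}*"]},
        ],
    }

pol = stack_scoped_policy("123456789012", "orders-api")
```

Cross-account access breaks this quickly, because the convention has to hold on both sides of the trust boundary.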


Simple vs. complex is often a measure of the types of system the user "wants" vs. the range of cases that the domain might present for solution.

e.g.

* I just want to do something simple => Solution space presented is too complex

* I have a complex use case => Solution space is easy to map my problem into


Yes, it is band-aids all the way down. In my experience few, very few, know all the band-aids and why they were placed there. The turnover within the teams and the "good enough" thinking do not help either.


Unfortunately there’s no alternative. Band-aids all the way down is the reality of successful computing platforms.


Until they get replaced one day and then go out of business.

This usually takes 5-10 years longer than customers would like.


Another challenge is to keep in mind that it all has to be granular, performant, and work at scale.

Wait until your situation has you dancing around the 6K character limit for policy documents.


The last I looked, there was a different character limit for inline (2k) vs attached (10k) policies, as well as a character limit for all aggregated policies that applied to a single principal+resource+request.

The API forbids you from exceeding the character limit for individual policies, but the latter limit is only something you can encounter at "run time" or when a request occurs. I asked our account rep at the time what would happen if the sum of all policies was larger than that character limit, they said some arbitrary policy statements would be dropped.
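One mitigation is to measure policy size the way IAM does (whitespace doesn't count toward the quota) and fail your pipeline before the API or runtime does. The default limit below is the commonly cited managed-policy figure; check the current IAM quotas for your policy type:

```python
import json
import re

def policy_chars(policy: dict) -> int:
    # IAM counts policy size against its quota with whitespace
    # stripped, so serialize and strip before counting.
    return len(re.sub(r"\s+", "", json.dumps(policy)))

def check_size(policy: dict, limit: int = 6144) -> bool:
    # 6144 is the managed-policy character quota; inline policies and
    # the per-principal aggregate differ, so treat this as configurable.
    return policy_chars(policy) <= limit

tiny = {"Version": "2012-10-17", "Statement": []}
assert check_size(tiny)
```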


> Explicit allows being all you can do in an IAM policy were easy(ish) when there was a handful of AWS services and API actions. As there were more and more services and policy actions etc. they became unwieldy.

How does adding more AWS services to the platform make following least-privilege unwieldy? Surely your workload does not need permissions to each new service, so new services and new IAM permissions being available is a no-op.


Because now the central ops team has to keep updating the policies to permit access to new services as they're released, or when someone complains they can't use one.


The challenge is for your IT/Ops folks to develop policies that allow developers to create IAM roles that delegate appropriate permissions to services, without giving those developers permission to accidentally grant too much access to automated processes that could, by malice or misconfiguration, create security vulnerabilities via elevated access.

When new services come along, a new set of rules needs to be designed to even allow devs to try that service out.


As someone that builds platforms for a living, the core reason I believe is that AWS is now a spaghetti monolith. It was somewhat a natural consequence of the type of service AWS provides, which is infrastructure. It evolves very slowly because its impact on its tenants is hard to predict. Therefore it's logical to apply band-aids everywhere.

AWS might be in the trenches of the biggest tech debt, in terms of impact, known to humankind.

How to solve this? I think lifecycle management. Define lifespan of services so that these can be replaced eventually.


Another way to do it is by coupling price cuts to migrations to the new service.

(This only works if the old service is secure and maintainable enough to run indefinitely.)


And even if you manage IAM right, if your setup is complex enough you might one day find out random AWS components stop provisioning or working because you just blew the stack in the IAM policy engine with a too-big policy.

Damned if you do, damned if you don't...


involved... you're talking like you worked on the IAM team yourself :P

thanks for the great post


I worked on the cloud/DevOps team of a big customer for more than 5 years then as an AWS SA for almost 5 after that. Now I work for an AWS security partner.

And thanks glad you enjoyed it :)


Almost ALL manuals for setting up something on AWS, third party tools, and sometimes AWS official tools, effectively have a step that requires admin permissions, like being able to attach arbitrary policies to an instance.

At that point you can either:

- give everybody Admin access

- involve the same 2 or 3 trusted people in processes that they shouldn't really be involved in

- dig down into permissions to try to build something yourself that apparently the people writing the manuals gave up on doing

Looks very bleak; does anybody have a different view on this? Seems unlikely that big organizations work like this on AWS.


Here's an example of an effective setup:

* Have a separate AWS account for each developer that allows for experimentation without risking shared environments. Engineers can have admin access on their own account to enable rapid prototyping and experimentation.

* Have separate AWS accounts for separate services. Ideally a separate account per service/stage pairing. If you're even more mature in your cloud operational journey, you may go for one account per service/stage/region or per cell in a cellular architecture if you're really advanced. If you're just a small startup, maybe just go for a separate account for beta and prod stages.

* Because running services have their own account, the blast radius of an engineer modifying permissions is quite limited already. However, changes to infra (including IAM changes) should all be done through a CI/CD pipeline. Engineers can only change infra by submitting a PR to update the IaC definition and passing through your peer review and automated checks. Tools like AWS CDK with self-mutating pipelines where the pipeline itself is modeled via IaC are great for this.

* Use a higher level of abstraction to manage IAM permissions like AWS CDK. Manually figuring out the permissions needed is a nightmare and an exercise in frustration.

* Try to keep AWS credentials ephemeral with whatever third-party services you're working with.

* If you're using AWS CodePipelines, run all your pipelines in a separate DevOps account where all the pipelines live. Set a role that engineers can use on this account that lets them debug pipelines without mutating. If you're using Github actions or something else, you obviously need to guard permissioning of that other service and try to use decent practices for getting AWS credentials. In the case of Github, for example, use OIDC to assume ephemeral credentials to an IAM role rather than saving secrets of an access/secret key pair.
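For the GitHub OIDC case, the trust policy on the role is the key piece. A sketch building it (account ID, org, and repo are placeholders; the OIDC provider must already be registered in the account):

```python
def github_oidc_trust_policy(account_id: str, org: str, repo: str) -> dict:
    # Lets workflows in the given GitHub repo assume this role via
    # OIDC - short-lived credentials, no access keys stored in GitHub.
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {f"{provider}:aud": "sts.amazonaws.com"},
                # Scope the sub claim as tightly as you can, e.g. to one branch.
                "StringLike": {f"{provider}:sub": f"repo:{org}/{repo}:*"},
            },
        }],
    }

trust = github_oidc_trust_policy("123456789012", "my-org", "my-repo")
```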


Major caveat. You're going to want to have at least usage monitoring tools, and ideally usage limiting tools, set up before you give developers their own individual AWS accounts… at the sort of scale where sharing a dev account stops being viable, accidentally creating and leaving expensive resources around to rack up a bill becomes far too easy.


This is the way. I’ve seen this happen countless times. It’s happened to me too. It’s happened to colleagues.

The worst case I’m aware of from first-hand knowledge was a large cluster of resources that got deployed for a product demo by a sales engineer and forgotten about. Turned into a nice ~$100,000 surprise in the quarterly budget.

Netflix built a tool for managing IAM permission requests as an auditable workflow, called ConsoleMe: https://github.com/Netflix/consoleme


Yup. The scope of the discussion was around permissioning/security, so I didn't get into billing, but you're absolutely right.

You should have CloudTrail, billing alarms, and dashboards all set up. It may also be a good idea to set up automatic spring cleaning that nukes resources every two weeks or so unless they have special tags to mark retention.
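The selection logic for that kind of reaper is simple; the deletion side is the service-specific part. A sketch of the tag/age filter (the tag name and window are arbitrary):

```python
from datetime import datetime, timedelta, timezone

def resources_to_reap(resources, max_age_days: int = 14, retain_tag: str = "retain"):
    # resources: iterable of (resource_id, created_at, tags).
    # Returns ids old enough to nuke that lack the retention tag.
    # A real reaper would list resources via the AWS APIs and delete them.
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [
        rid for rid, created, tags in resources
        if created < cutoff and retain_tag not in tags
    ]

now = datetime.now(timezone.utc)
candidates = [
    ("i-forgotten-demo", now - timedelta(days=30), set()),
    ("i-keep-me", now - timedelta(days=30), {"retain"}),
    ("i-fresh", now - timedelta(days=1), set()),
]
doomed = resources_to_reap(candidates)  # ["i-forgotten-demo"]
```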


I came to realize that AWS is built for people whose full-time job is AWS. It's like Jira: to get the most out of it you need a Jira workflow/configuration expert. You can make do as a smart cookie with shit to get done, but ultimately you'll be making compromises by virtue of not having time to dedicate to fully grokking it.

So I guess my experience is that the answer is supposed to be option 3 on your list, but that only happens if you have someone to delegate as the AWS expert and give them time to actually do that role.


This is also why just 'use the cloud' is not really good advice, as with any broad and scaled system you can't just tack it on to a project you happen to be doing. Inversely, not using the cloud because you need to know what you're doing isn't a good argument either; if you are going to build something that has to scale, be a bit elastic, and want to build on existing knowledge, you will end up using a larger system that requires domain-specific knowledge either way.

AWS isn't a hosting company (and that includes let-us-pretend-we-are services; looking at you: lightsail), just like Azure and GCP aren't. But even if you get a hosting company (Vultr? prgmr? DO?) you'll still need to know how to build virtual machine images, how to deploy them, and how to cycle them. None of that knowledge will be something native to whatever project you are building.

The only thing that gets somewhat close is the Deno and Vercel style of stuff, but even then you'll need to know how those work in order to make your project fit.


No this is exactly how it works. There’s technically ideal policy and a bunch of exceptions which are behind a “break glass in case of emergency” account. Realistically nothing works however many hours you spend in IAM unless you use the emergency account so that’s what everyone does.


Right.

In practice, you manually bootstrap your CI/CD, and if you really want to be secret squirrel secure, you can audit that.

Everything going forward gets committed via code, peer review, and least-privileged access, and is automatically deployed by the robot.

At that point, you might hand out some break glass credentials for a few items, or when your CI/CD goes down, but usage of that should alert and be manually reviewed.

You now have a pretty decent setup, as everything is checked.


That's what we say we do. Reality is that most of the time the entire team is logged in via delegated admin accounts working out what the automation fucked up or deleting resources which don't want to go away when the automation craps out or futzing with things that are broken or missing in Terraform.

Fully driving a cloud via code is a complete lie.


I don't think it's a lie; just a pain in the arse.

I've worked in, and defined, secure environments where your only option is you need to raise another PR. For me this often degenerates into the PR spinning up a copy in another account, and the PR history being "test, test, test, ffs, test".

Like this: https://github.com/secureweb/homelab/commits/parca

The more practical way around the problem is to have non-production accounts where you can delegate admin, without the risk of the production and tooling accounts, and use that to make sure your code works before you actually raise the PR and merge.

It's prudent to remember that such security should be risk based, so whilst there's risk that you leak non-prod credentials to a crypto farmer, the risk of your teams being non-productive is greater.

Largely agree with you though. I usually just run k8s environments and make concession to use the cloud stuff when it makes sense. Not being able to run locally is akin to punching yourself in the nuts. Creating a namespace in k8s and deploying ephemeral tests either locally or in a dev cluster is better experience in my opinion.

But you do have a lot of folks who treat their cloud provider like their favourite sports team.


We have a 20:1 ratio of developers to platform engineers, where developers cannot do anything in AWS except via Infrastructure as Code. It works very well. Yes, you do need to make sure you only platform things you actually need, and not try to abstract the entire service catalog. But if you just pick 4 RDBMS options, 2 document-oriented options, 2 queues, 2 PubSubs, 1 object store, 1 traffic management option, and one catch-all observability option, you really don't need to do all that much over time. Granted, this doesn't work for small-scale stuff or immature organisations.


IAM is a big pain. When deploying "something new" (say a lambda executing another lambda that accesses Dynamo...) I spend more time screwing around with Terraform to configure IAM roles, permissions, and other garbage than I do actually debugging the code. Many junior developers don't understand it at all. "It works with my credentials!" (Of course it does: you're running it with admin access in your own dev account.)


We have a pretty solid set of organisation roles for devops, dev and others across several tens of (maybe hundreds) of accounts. Our accounts are divided into prod and nonprod accounts, so if you want the devs to have less permissions in prod that’s perfectly possible.

The only thing we have to be careful of is cross-account AssumeRole chaining.

But really it's a lot of work, and I guess a competitive advantage, and quite specific to every organisation, so nobody ever made something like that open source.


Automated processes can help. But I’m not sure how you get around this universal feature of computing platforms. Someone has to have access to grant restricted access.

If this feels wrong to you, then you’re probably right that AWS is not the appropriate level of abstraction for your problem.


Rant: I'm glad the author called out GCP as well, but the idea that it's the user's fault for granting too many permissions is absurd. Has the author tried doing this at scale? Try a super simple example: I want a user to have access to query a schema in BigQuery. Can you guess the magic combo of permissions that is going to take? Because the documentation is not going to correctly tell you. Thus after spending way too much time trying to get it done one ends up widely over-permissioning to just get the damn thing to work. You could argue their model is sound and it's just that how they aggregated permissions into roles is absurd or even just argue that they need better defined roles and I would accept that, but still it's quite frustrating as is. I mean just look at the length of the reference: https://cloud.google.com/iam/docs/understanding-roles


if I understand correctly, the author is an infosec auditor who reviews security of those aws/gcp setups for their job? Then their job is the opposite of an engineer: the engineer is desperately trying to get shit done without managing IAM madness fulltime (hence admin admin everything), an auditor cannot give two craps about usability on the other hand, and will try to impose an idealistic model of perfectly tuned, strictly as-needed-basis permissions with no regard to how much time it would take to work them out in practice, and to maintain, because that's not their job.


Yea I am not a fan of the Cloud and many of its complexities but if I have to be honest, when you have 5k employees working on AWS resources: yes, this complexity is actually requested by customers. It was built to meet their granular needs. Same thing with policies.

To the author of the article: AWS is too complex? Then it is not for you.

That said, various cloud providers have different levels of complexity and it is definitely up for debate and criticism.


This resonates in an interesting way; like, I remember thinking S3 and then EC2 were pretty inspired simple concepts, and then they just kept getting more and more complicated over the years. I feel like if I were to start using them today as the sole engineer I am I'd be really daunted by the extreme complexity they've seemingly added. In particular, VPC to me looks like an over-engineered mess designed to replicate the historical baggage of network hardware and addressing when all they really needed to do was add nested security groups (as well as allow multiple security groups, which they did finally do at some point) and call it a day, but if you are some ridiculously large company that has a million legacy network engineers and want to build some super complicated stack of domains I guess you like doing it with virtual hardware? But it sucks because cloud (very much "to me") was the thing that allowed small teams--or even one sole bad-ass engineer--to rapidly build scalable infrastructure, NOT something that would be used by giant companies to build empires :(.


> AWS is too complex? Then it is not for you.

But then neither are GCP and Azure. I only know three other decently reliable cloud providers from there, but you also won’t get as much community support nor legal adjustment like GCP does for EU hosting.

You might be right, but that only further explains why it’s such a pain at this point.


One of the core value propositions of Linode/DigitalOcean is exactly this - reduce complexity - but that also comes with reduced flexibility and features.

The fact that Linode/DigitalOcean exist and rely on this core tenet is a testament to quite the reverse: AWS/GCP/Azure are complex to meet the needs of their customers, or they wouldn't allow Linode/DigitalOcean to exist and consequently absorb their customers. They co-exist on the axis of complexity. In some ways, it is loose evidence that AWS/GCP/Azure are complex to serve their customers, not because of incompetence or some other reason.


AWS has its Lightsail range of products, which offers exactly this: a much more simplified and opinionated set of products with far fewer knobs to turn.


The problem is all of these providers should have simple roll ups.


Roll ups have their own problems. In particular, you really don't want them changing over time. I have banned all of the "managed policies" at my company because AWS is free to add new permissions and you have no way of knowing.


Well that can be solved with versioning.

Rollups can simply be a suggested collection of permissions. When they change, view a diff, and accept what you need.
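The review step could be as small as a set diff over the action lists of the pinned version vs. the new one:

```python
def diff_actions(old_actions, new_actions):
    # What would accepting the new version of a rolled-up policy
    # add or remove? Surface it for human review before updating.
    old, new = set(old_actions), set(new_actions)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

delta = diff_actions(
    ["s3:GetObject", "s3:PutObject"],
    ["s3:GetObject", "s3:ListBucket"],
)
# delta == {"added": ["s3:ListBucket"], "removed": ["s3:PutObject"]}
```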


The CDK has made managing IAM so much easier for applications. It’s one of the main reasons we moved from Terraform to CDK.


We did the opposite, because there was so much obfuscation about what exactly CDK was doing behind the curtains with respect to "small" things like IAM. We needed to know exactly which role was created or modified, etc., and we just couldn't get that with the basic interfaces that CDK provided. Writing those roles, users, groups, policies, and attachments out explicitly as their own resource statements made things so much clearer, especially with respect to the relationships to other resources, and less risky.


You can, but it takes a little bit of wrangling.

You can inspect the entire construct tree that is created (minus anything created by Custom Resources, which are opaque to the CDK) via the Aspects visitor pattern. It allows you to audit the entire tree of resources.

The Annotations framework allows you to add flags (notes, warnings, and errors) to any object in the construct hierarchy.


You can be pretty granular in CDK; just skip the L3 constructs and build everything using the L2 ones, or if you hate yourself go for the L1s, which means manipulating the CloudFormation template within CDK.



How does CDK compare to Pulumi? It looks like a similar concept. I haven’t tried out either.


CDK 'compiles' down to CloudFormation templates, basically making them much easier to write using TypeScript/Python/Java/C# instead of JSON/YAML.

The real thing it does, though, is give you higher-level object-oriented constructs with best practices baked in. It has much more sensible defaults and, almost ironically, the fewer parameters you pass to these classes the more opinionated the CloudFormation comes out.

The example that blew my mind is if you don't specify a password for RDS it provisions an AWS SecretsManager Secret, generates a random password and puts it in there and then tells the RDS to use that Secret. If you do specify a password it doesn't do that stuff. Lots of stuff like that - it turns encryption on by default and creates the keys if you don't specify, it creates private subnets and a NAT gateway for VPCs if you don't specify.

It was basically "it's too hard to fix the service APIs or their CloudFormation, so we'll fix the problem outside of / on top of them, with a tool users run on their laptops or in their pipelines to deal with generating the thousands of lines of CF boilerplate that are required to really do the right thing these days."

Of course you can be very explicit in most of these constructs and the more explicit you are about what you want the less of its opinions happen.


Unfortunately, a large problem with this lipstick on a pig approach that Amazon took with CDK is that the moment something fails to work you are right back to combing through Cloud Formation scripts. The abstraction is so leaky that I don't recall a single CDK project that I worked with where I didn't have to inspect the CF output and painstakingly map back to the CDK sources.


Thanks for that info. Apparently, I'm falling behind enough that I didn't even realize that Terraform had its own CDK as well.

I personally agree with you about CloudFormation. I dislike everything about it.

I think the people who build the Terraform and Pulumi ecosystems are plain and simple better architects than AWS employees.

I make that relatively blind assumption because I've done an interview for AWS. They optimize their hiring process for hiring blindly loyal drones. They want you to memorize company values and relate everything you've ever done to each and every value. They make you read them off a list one-by-one. It felt so dystopian to me.

AWS seems like the last place a creative, talented developer would want to go. The phone screen would scare most of them off.


We had to move to Pulumi from CDK. CDK's reliance on CloudFormation makes it unusable because CF is not a production ready service - it is extremely slow and its failure modes are absurd.

Pulumi does less "magic" than CDK, though they're working on higher level constructs, but ultimately it's just the better technology.


> CF is not a production ready service

This is an absurd and simply untrue statement. Many teams at Amazon deploy their production AWS infrastructure with CF.


Well, I disagree. I've seen CF get into bad states that can't be recovered without support going in. It shouldn't be controversial to say that it's much slower than alternatives as well.


The difference is that AWS support can recover them and will help you. In that way it is a service not a tool. They are also the support team for the services that are being provisioned/managed so it is "one throat to choke".

On the contrary, I have seen many many destructive terraform applies that really messed everything up - without a helpful support team to call (unless you are paying Hashicorp) that can just get you out of the jam.

Yes it is a bit slower and often you need to wait for it to rollback when something goes wrong - but 9 times out of 10 that rollback succeeds. I'll take that trade-off...


CDK makes Cloudformation easy to use. Tight integration between cdk and services. Lots of reusable bits and parts. Forces you into component (construct) mindset to handle complexity.

Pulumi bypasses CF by creating and destroying resources directly. The focus on cross-cloud and multi-cloud means you're thinking in higher-order abstractions, but you lose the finer knobs unless you break through the indirection.

CDK has ways to integrate directly into CodePipeline and use its own abstractions directly (such as with ECR or S3 deployments). Pulumi half expects you to have solved your CI/CD on your own.

Reliability is a different problem. System shits its pants halfway through a CDK deployment? CDK rolls back and figures out the issues. Crash during a Pulumi run? Unknown state, but you can rerun it and usually get back up and running.


The author does a good job showing how AWS got to this point, but I think the majority of their argument is that they don't care for the semantics. Lines like:

>>A consequence of this change was to further remove any coherent meaning for the term “role”

and

>>So what is an IAM role then? Simply put, it’s a principal (i.e. an identity) with no long-lived credentials, that can be impersonated for arbitrary purposes.

Show that this is the main grievance. And to that I say - I hope you never try to mix Azure workloads at the same time as you are working on other cloud providers, as the nomenclature differences will drive you crazy.

I do agree that the GCP model is more sensible (unless you have to 'unlearn' the AWS style first), but my major grievance is the combination and intersection of lots of different permission layers in AWS - you can have SCPs acting as a hard explicit deny at the top level, then your policies attached to roles/users, and then some services (such as S3) can have resource-based policies as well. Even the best engineers can forget a setting or two and have to spend time troubleshooting, especially across a lot of accounts.
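The interplay of those layers can be sketched as a toy model (very simplified, and not AWS's real algorithm: it ignores conditions, NotAction, permission boundaries, session policies and the cross-account subtleties of resource policies; a "policy" here is just an action-to-effect map):

```python
# Simplified sketch of AWS's multi-layer policy evaluation.
# Each "policy" is just a dict mapping action -> "Allow" | "Deny".

def evaluate(action, scps, identity_policies, resource_policies=()):
    """Return True if the action is allowed across all layers."""
    all_policies = list(scps) + list(identity_policies) + list(resource_policies)

    # 1. An explicit Deny in ANY layer always wins.
    if any(p.get(action) == "Deny" for p in all_policies):
        return False

    # 2. SCPs act as a guardrail: if SCPs are attached, at least one
    #    must allow the action or it is filtered out entirely.
    if scps and not any(p.get(action) == "Allow" for p in scps):
        return False

    # 3. Otherwise some identity or resource policy must explicitly allow
    #    the action; the default is an implicit deny.
    return any(p.get(action) == "Allow"
               for p in list(identity_policies) + list(resource_policies))

scp = {"s3:GetObject": "Allow", "s3:DeleteBucket": "Deny"}
role_policy = {"s3:GetObject": "Allow", "s3:DeleteBucket": "Allow"}

print(evaluate("s3:GetObject", [scp], [role_policy]))     # allowed
print(evaluate("s3:DeleteBucket", [scp], [role_policy]))  # SCP deny wins
print(evaluate("s3:PutObject", [scp], [role_policy]))     # implicit deny
```

The role policy allowing s3:DeleteBucket is irrelevant once an SCP denies it, which is exactly the kind of cross-layer interaction that burns time in troubleshooting.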


The author keeps saying that roles are identities, but they are exactly the opposite. A role is the set of permissions that a token issued for that role will have. Whether an identity can issue a token for that role is governed separately. A role is not an identity, a service account, or a principal.



From today's inbox it seems like they're taking heed. I expect the following makes sense to someone:

"We have updated our back-end permissions policy to simplify on-boarding to self-managed AWS CloudFormation StackSets. Starting October 31, 2022, AWS CloudFormation StackSets no longer requires sns* permissions in AWSCloudFormationStackSetExecutionRole as a prerequisite for getting started with self-managed StackSets.

StackSets requires you to create a service role named AWSCloudFormationStackSetExecutionRole that trusts the customized administration role for each target account. Previously, you needed to provide sns:*, s3:*, and cloudformation:* permissions to AWSCloudFormationStackSetExecutionRole as prerequisites to allow StackSets to manage and provision resources on your behalf. Now, StackSets has removed the explicit need for sns:* permissions as prerequisites in AWSCloudFormationStackSetExecutionRole. Please note, you will still need to provide s3:* and cloudformation:* permissions in AWSCloudFormationStackSetExecutionRole."
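For concreteness, the trust policy on that execution role looks roughly like this (the account ID is made up, and AWSCloudFormationStackSetAdministrationRole is the default admin role name, which you may have customized):

```python
import json

# Sketch of the trust policy the target-account execution role needs so the
# administration role in the admin account can assume it.
ADMIN_ACCOUNT_ID = "111122223333"  # placeholder, substitute your own

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "AWS": f"arn:aws:iam::{ADMIN_ACCOUNT_ID}:role/"
                   "AWSCloudFormationStackSetAdministrationRole"
        },
        "Action": "sts:AssumeRole",
    }],
}

print(json.dumps(trust_policy, indent=2))
```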


Ignoring all the warts of how schizophrenic the different AWS layers are, I personally like the policy document format.

I wouldn’t mind a similar style at the OS/kernel level for control groups and other permissions of that type. Having a list of principals and allowed operations would be easier to reason about (imho) than systemd unit files, or selinux, or filesystem permissions, or firewall rules, or the rest of the mess which is Linux security.

I think Fuchsia is doing something exactly like that?
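Purely hypothetical, but an IAM-style policy for OS-level permissions might look something like this (none of these services, actions, or resource names exist anywhere today; they are made up for illustration):

```python
import json

# Hypothetical: a single policy-document format replacing systemd sandboxing
# directives, firewall rules, and filesystem permissions for one service.
policy = {
    "Version": "1",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "nginx.service"},
            "Action": ["net:BindPort", "fs:Read"],
            "Resource": ["port:80", "port:443", "/var/www/*"],
        },
        {
            # Deny statements would win over allows, IAM-style.
            "Effect": "Deny",
            "Principal": "*",
            "Action": "kernel:LoadModule",
            "Resource": "*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

The appeal is that principal, action, and resource are all explicit in one place, rather than smeared across unit files, selinux contexts, and mode bits.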


Unix groups are much, much easier to correctly configure than everything you mentioned, and at least an order of magnitude simpler as well.


To be fair selinux is a bar so low you'd have to dig to do worse.


The problem is that no one hierarchy fits every org. Sideways relationships that cross hierarchies exist, other exceptions exist.

I find directed-graph-based approach in spicedb to be a better primitive than hierarchy with weird exceptions.

Directed edges are a much easier way to think about relationships even if they are logically equivalent to hierarchies.

https://github.com/authzed/spicedb
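A toy sketch of the idea (not SpiceDB's actual engine or API, just the intuition): a permission check becomes a reachability question over directed relationship edges, so a sideways grant that crosses the hierarchy is just one more edge rather than an exception.

```python
from collections import defaultdict

# object -> set of (relation, subject) edges
edges = defaultdict(set)

def write_relationship(obj, relation, subject):
    edges[obj].add((relation, subject))

def check(obj, permission, subject, seen=None):
    """Can `subject` reach `obj` via `permission`, following parent edges?"""
    seen = seen or set()
    if (obj, subject) in seen:       # guard against cycles in the graph
        return False
    seen.add((obj, subject))
    for relation, target in edges[obj]:
        if relation == permission and target == subject:
            return True
        # inheritance is just traversal: viewers of a parent can view the child
        if relation == "parent" and check(target, permission, subject, seen):
            return True
    return False

write_relationship("folder:root", "viewer", "user:alice")
write_relationship("doc:readme", "parent", "folder:root")
write_relationship("doc:secret", "viewer", "user:bob")  # sideways grant

print(check("doc:readme", "viewer", "user:alice"))  # True, via folder:root
print(check("doc:secret", "viewer", "user:alice"))  # False
```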


A few years ago I was tasked with migrating a database from one AWS org to another, and thanks to IAM the other org could grant me that access without my touching anything else there, and the migration needed no further support from them. Some time later I found a key in the database I had migrated which turned out to give me full access to their AWS... So, be careful with it, and check or even remove old keys after a few years.


Interestingly enough, this is one of the core use cases for IAM roles. IAM role credentials have a maximum lifetime of 12 hours.

The article is complaining about the thing that exists to avoid the issue you point out.


Implementation of EVERY service on AWS is unnecessarily complicated. Not an exaggeration. Making things simple takes considerable thought and effort, and that costs money. AWS is a nano-margin business. As long as something works and doesn't fall apart, they ship it.


Anyone complaining about AWS has obviously never had GCP inflicted on them…


GCP is an order of magnitude better at IAM. And the article is actually exactly about that, with comparison.


I would like to disagree with that; GCP seems to do a much better job at permission granularity, but does a "punch me in the face" level worse job at binding IAM roles to principals.

With AWS, if I have "role-A", used by "user1", and I now wish to bind it to "user2", do you know how many times I have to make reference to "user1"? ZERO

With GCP, binding is authoritative, so if I don't copy-paste the "members" from the previous incantation of that roleBinding over, buh-bye to "user1". Stark-raving insanity
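To make the failure mode concrete, here is a sketch of the merge step you are forced to do yourself (real code would go through getIamPolicy/setIamPolicy and pass the returned etag for optimistic concurrency; the member names here are made up):

```python
# Why GCP's authoritative setIamPolicy is dangerous: you must read-modify-write
# the whole policy, or members you fail to copy over silently lose access.

def add_member(policy, role, member):
    """Add a member to a role binding without clobbering existing ones."""
    for binding in policy["bindings"]:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return policy
    policy["bindings"].append({"role": role, "members": [member]})
    return policy

policy = {"bindings": [
    {"role": "roles/viewer", "members": ["user:user1@example.com"]},
]}

# Naive "authoritative" write, forgetting to copy the existing members:
clobbered = {"bindings": [
    {"role": "roles/viewer", "members": ["user:user2@example.com"]},
]}  # user1 just silently lost access

# Read-modify-write keeps both.
merged = add_member(policy, "roles/viewer", "user:user2@example.com")
print(merged["bindings"][0]["members"])
```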


I've worked on plenty of multi-account projects, and never felt a need to use Organizations. So far as I can tell from this, it allows you to attach permissions to roles in multiple sub-accounts - why wouldn't you be able to use CDK for that?


Organizations also consolidates billing, which means you can (potentially) get usage discounts.

There's a lot of reasons to use Organizations, but I really don't want to write an essay. But it allows things such as segregated security accounts, centralized logging, etc. It can help your security posture a bunch.


Interesting! Technically I guess a lot of that is possible with raw CDK (centralized logging would be, for instance, if you created appropriate roles for each of the managed accounts), but it never hurts to have it done for you automatically! Sounds like it provides sensible default templates for things you _could_ roll by hand if you were feeling masochistic :)

Plus, as the sibling comment said, it predated CDK so was probably the only option at the time.


You could, but Organisations predates heavy adoption of CDK. I imagine it also guarantees that the roles keep having the correct permissions.


That logic is complex, but not that much. The industry is full of RBAC implementations that are orders of magnitude worse, and don't necessarily do what their owners think they do.


If you make the choice to say account == project, you're good to go in most cases. If you need deep permission hierarchies, you're screwed unless you go ABAC, but that is not amazingly implemented in any of the big clouds. Another saving scenario might be using a public cloud but not using it for flexible deployments and workloads (which are not the same thing and also indicates that a cloud may not be the best place to do this).

Azure is the worst of all worlds because its Azure AD implementation is much better than classic AD but much worse than modern IAM. It's really weird how they made something that on the surface seems like a nice modern replacement but then ends up biting you in the ass as soon as you try to stop playing datacenter-in-the-cloud and actually do something new/innovative.

Same goes for GCP and Azure in terms of service credentials when compared to IRSA and Instance Roles, or the other way around: non-standard credentials in AWS vs. OIDC or x509 in GCP (and Azure, and Kubernetes...).

Granted, those aren't amazing either, but just for initial identities (after which you should switch into a specific principal - bare user principal policies deserve to be burned).

The odd thing about most of this (in GCP and AWS, Azure seems to still have a two-faced API in the 'standard for me but not for thee' style) is that at some point resource interfaces got more and more standardised within each big cloud, yet the IAM part did not grow the same way. Perhaps because it wasn't possible to refactor/discard legacy systems in time (Just like EC2 Classic is only now going into some sort of read-only mode), or the scaling issues weren't apparent early enough?

Some scaling solutions are a whole different can of worms, like the folder/inheritance structure which becomes the same mess as AWS with the boundary structure. That is one thing where a more classical "Directory" approach doesn't have this issue as much, but you trade it for the single-OU-leaf-at-a-time issue.

Seeing how we do sometimes have to apply the ugliest of kludges for complex IAM scenarios (i.e. a microservice accessing two different resources in different scopes [ AWS accounts ] and some resources support resource-level policies to allow this but others are always IAM-local, so now you have to make your IaC pick the target that has the least flexible IAM context, create the principal there, and refer to that principal from all the remote contexts), you'd think instead of developing shrink-wrapping extensions more focus would be applied to ABAC and resource-local policies...
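As a concrete instance of that kludge, a resource-based policy (here an S3 bucket policy; all account IDs and names are made up) granting access to a principal that lives in a different account:

```python
import json

# The microservice's role lives in account 111122223333; the bucket's account
# grants it access via a resource-based policy, because that is the only side
# with a flexible enough IAM context.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowMicroserviceFromOtherAccount",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/my-microservice"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
}

print(json.dumps(bucket_policy, indent=2))
```

For resources that have no resource-based policies, there is no equivalent move, which is what forces the IaC contortions described above.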

Regardless of the implementation, I do question systems that cannot separate credentials from principals. A principal is what you are granted to BE post-authentication, and you need credentials for that. A policy grants a principal one or more ACTIONS. Therefore, a "user" should be credentials, but not a principal. But this is getting too much into semantics... as long as your authenticity is not tied to your authorisation the system should be better off (even if it ends up calling principals roles - just call users not-principals in that case, and disallow policy attachments or user references).

A thing that does make more sense than the author of the article states is making things like this 'just another service'. When you don't do that, but make it 'special' instead, you create a single-root problem. An example is the problem where billing is either in the local account, or in the org account, and nowhere else (in AWS). In GCP your projects and billing accounts are just two resources that can exist in arbitrary numbers and be linked as needed. Billing shouldn't be special, just like IAM shouldn't be special.


GCP has "Workload Identity" as an analogue to IRSA for a little while now.

What I appreciate about GCP's IAM is that it seems to work well with simple ACLs vs. needing a lot of conditional policies, but this is probably because it strongly encourages folder and project structure to define different permission domains.

Still no capability model unfortunately. The closest I have seen is SPIFFE (https://spiffe.io/docs/latest/spiffe-about/overview/) which tries to tie most permissions to the identity of the requester using a policy engine instead of direct capabilities, but without broad support in... anything, it requires a ton of proxies to existing resources to work as anything other than a service mesh. It is also not as fine-grained as capabilities, but I think it's a reasonable first step since in a distributed software ecosystem the question of what owns a capability is definitely tied up in the identity of a thing running (probably virtualized) on something somewhere, as opposed to within a single operating system where the kernel can conveniently track identity/ownership.


MULTICS reinvents itself. It's penguins all the way down.


What are the best resources for getting better at working with IAM? Already pretty experienced but I feel like my mental model could be better.


I learned so much in this thread.



