Whether you agree with the article's recommendations or not, I do not understand how there are so many commenters saying "IAM is not that complicated". Even engineers internally at AWS frequently get tripped up with IAM permission settings. It's rare that someone gets them right on the first try.
Just some of the things that make it challenging:
1. There are permissions at various layers. If anything along the chain doesn't line up, permission denied.
2. You need deep understanding of each service's specific IAM setup. It's not enough to write a policy that will grant you read access to a DynamoDB table. Your application probably also needs to grant access to the GSI/LSI indices created.
3. Ancillary permission requirements are not obvious if you're not familiar with the details of how a service works. Want a Lambda function to have logs and traces? Make sure you have the relevant CW and X-Ray permissions set on it.
4. Permission related failures do not make the root cause immediately clear. Your S3 get operation may fail because you're missing permissions to the related KMS key. The usage of the ancillary KMS API calls here is not obvious unless you inspect the configuration details of the resource.
5. Secrets related permissions are especially tricky. To be able to read a cross-account secret, you need to grant the IAM identity permissions to get the secret value, grant the identity permissions to decrypt the associated KMS key used for the secret, grant the related account identities permissions to decrypt the key in the KMS resource policy, and grant the related account identities permissions to get the secret value in the secret resource policy. This is assuming there's no other things like SCPs and permissions boundaries mucking it up.
6. The out-of-the-box managed policies are too broad and will often have you granting much more permissions than you need if you use them.
Low-level IaC tools like CloudFormation and Terraform suck for this. They leave too much complexity to the end user to get right. CDK does mitigate the issue somewhat with its grantX methods, but even those are fairly limited and require you to write manual policy statements for many use-cases.
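To make point 2 concrete, a hand-written DynamoDB read policy has to enumerate the index ARNs separately from the table ARN. A minimal sketch in Python (the table name, index wildcard, and account ID are made up for illustration):

```python
import json

# Hypothetical table; GSIs/LSIs are separate resources under the table ARN,
# at arn:...:table/NAME/index/INDEXNAME.
table_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/orders"

# Read access to the table alone is not enough if the application queries
# an index -- the index ARNs must be granted too.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": [
                table_arn,               # the base table
                f"{table_arn}/index/*",  # every GSI/LSI on it
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Forget the second resource entry and `Query` calls against the GSI fail with an access-denied error even though "read access to the table" is granted.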
IAM is an excellent example of why from a developer perspective security is "broken".
Some security group at a company will have a "review" of your permissions. Occasionally they will run a "sweep" and yank permissions out from under you.
Instead, here we have an API that the security team can actively manage and use to PROVIDE A SOLUTION. Should a developer on some project have to have domain knowledge of IAM to make a perfect bespoke (and it WILL be bespoke) least-permission policy?
No, of course not. That domain knowledge should be a service in any substantive AWS org, where they provide it to you and, much more importantly, DEBUG it for you when it doesn't work.
Because here's the deal: IAM may be a bit ugly and have some cruft and evolution, and I believe S3 permissions are another entire headache atop IAM, but this is what an extremely fine grained permissions model looks like: detail hell.
ALL detailed permissions models will look like this. Defining perfect names (by definition coarse grained) to communicate the precise multidimensional n-brane border of a policy is basically impossible.
Here is another issue: in my last job they were obsessed with short-duration tokens and TOTP. Ok great. Hey wait, if I need to run an automated cluster-wide job that will take hours (backup, cleanup, log analysis, etc), what do I do then?
Security team didn't care. Automation? What's that? Just sit there watching the log and manually refresh the keys.
So I end up using a software TOTP generator and hacking it that way. I should not be doing that. It is likely a security hole. The security team should have heard my requirements, accepted them as a necessity (they are) and provided me a solution.
One to add to the list is that IAM conditions[0] are extremely powerful but there's no good way to know which conditions to use in which scenario and troubleshooting is very difficult.
For instance, if you look at the EC2 CreateNetworkInterface action[1] you'll see that there are three possible resources (network-interface (required), security-group (not required), subnet (required)), and each of those resources has several possible condition keys associated.
What's not obvious is which condition keys will be available in any given request. I've run the same CreateNetworkInterface request with the same parameters and IAM role twice in a row and by looking in the "encoded authorization message" that was returned with the failure in each case I found that in one case the resource was a security group while in the other case it was a subnet. Depending on the resource type different condition keys are available in the context. So if you want to allow CreateNetworkInterface but only if the ec2:SecurityGroupID is 'abc' it might or might not work.
An extra challenge is the encoded authorization message is truncated in CloudTrail so if you're using CloudFormation you don't actually get to see what the context was if a call fails. Then you have to find a way to make the same call CloudFormation made using an SDK so you can get the full text of the encoded authorization message.
There's no easy way to just say "try this API call with this role and tell me exactly what the context would be and what part of the IAM policy hits it if any"
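For what it's worth, the "encoded authorization message" can be decoded with sts:DecodeAuthorizationMessage, and the decoded JSON is where the resource and condition keys show up. A sketch of pulling them out (the record below is made up, and the field names only roughly follow the real DecodedMessage format):

```python
import json

# A trimmed, made-up example of what decode-authorization-message returns.
# In practice you'd get this string from:
#   boto3.client("sts").decode_authorization_message(
#       EncodedMessage=...)["DecodedMessage"]
decoded = json.dumps({
    "allowed": False,
    "explicitDeny": False,
    "context": {
        "action": "ec2:CreateNetworkInterface",
        "resource": "arn:aws:ec2:us-east-1:123456789012:subnet/subnet-0abc",
        "conditions": {"items": [
            {"key": "ec2:Subnet",
             "values": {"items": [{"value": "subnet-0abc"}]}},
            {"key": "ec2:Vpc",
             "values": {"items": [{"value": "vpc-0def"}]}},
        ]},
    },
})

msg = json.loads(decoded)
# Which resource was evaluated, and which condition keys were in context.
resource = msg["context"]["resource"]
condition_keys = [c["key"] for c in msg["context"]["conditions"]["items"]]
print(resource, condition_keys)
```

This is how you can see, after the fact, that one failed call was evaluated against the subnet and another against the security group, with different condition keys present each time.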
> I've run the same CreateNetworkInterface request with the same parameters and IAM role twice in a row and by looking in the "encoded authorization message" that was returned with the failure in each case I found that in one case the resource was a security group while in the other case it was a subnet.
Well EC2 would process these requests by first verifying subnet-related permissions before moving on to security group permissions. Variations in the error messages could reflect the point at which the request encounters a permission issue?
Policy simulator is indeed a great option except I didn't have access to it at the time because it was disabled via SCP :D
Kidding aside though, I'll try that if I face a similar issue in the future. It has been improving quite a bit lately.
> Well EC2 would process these requests by first verifying subnet-related permissions before moving on to security group permissions. Variations in the error messages could reflect the point at which the request encounters a permission issue?
I would think the context would be deterministic in that case but I verified calling the API with the same parameters using the same role twice in a row ended up with different 'resource' values in the context. It was almost like under the hood boto3 or something else was changing the order of the parameters in the API call which was changing the way the context was created. I could've put in a support case but had bigger fish to fry.
Something similar recently tripped me up: Some parts of AWS IAM are extremely detailed and you can create insanely specific policies allowing very precise control (almost to a fault). Other parts are very broad and unspecific.
For example, I recently needed to allow some EC2 instances to push a private IP around between them. I would have assumed I could create some policy along the lines of "Yeah, VMs with this role can push 10.20.30.40 around between their network interfaces". I haven't been able to find any way to restrict these IP addresses, so now I have the smallest policy I could create: "This role can assign fuck-any internal IPs to these interfaces, let's hope for the best." Doesn't really feel the greatest.
How does this control the private IP address that can be assigned? How does this stop the VM from just grabbing any IP? There isn't even anything IP-shaped in that policy.
This is all true, and it's a pain, but the situation is still improved from 20 years ago, when all of these layers were in separately-managed systems with no integration at all. Need to access the database? Well, it's in another datacenter that we haven't added to the backbone yet, so it'll need to traverse the internet. That means you'll need an ACL to get to the outbound NAT -- talk to datacenter team A for that. Then you'll need an ACL at Datacenter B to let your NAT'd IP in -- ticket datacenter B for that, we don't have any of our own people there. Then you can talk to the DBAs to get a username and password -- make sure they lock it down to just the schemas you need, for reasons of least-privilege.

At a large org you probably still have to talk to all those teams, but at a well-run one the conversation can be streamlined to a few pull requests against their IaC. At a small org running on one account, you can probably do it yourself in one merge.

AWS and GCP (not sure about Azure, but maybe them too) both now also offer relatively painless ways of auditing roles to see what permissions are actually in use, so you can trim them to what is needed. This kind of feature is not really feasible with the permissions spread across 5 heterogeneous systems.
Sure, we could just put everything on one VLAN and hand out . credentials, but you can do the equivalent in the cloud too.
I gave up trying to reason with it when I attempted to upload a docker image. Turns out I had permissions to upload an image, but not individual layers.
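For anyone hitting the same wall: a docker push is several distinct ECR API calls, and each needs its own action -- `ecr:PutImage` alone only covers the manifest, not the layers. A sketch of a push policy (the repository ARN and account ID are placeholders):

```python
import json

# Placeholder repository for illustration.
repo_arn = "arn:aws:ecr:us-east-1:123456789012:repository/my-app"

push_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # docker login; this action doesn't support resource-level
            # scoping, so it has to be granted on "*".
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
        },
        {
            # The layer-by-layer upload flow, then the manifest itself.
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:PutImage",
            ],
            "Resource": repo_arn,
        },
    ],
}

print(json.dumps(push_policy, indent=2))
```

Miss any of the layer-upload actions and you get exactly the failure described above: allowed to "upload an image", denied on the individual layers.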
And a lot of complexity comes from allowing a user or an external service to access some resources that my account owns. I remember inside AWS, an engineer who understands IAM thoroughly can have enormous influence because the engineer will easily become the go-to person for all kinds of design discussions. IAM is truly a complex beast.
Of all the things you mentioned, I think the ones related to 3 trip up even seasoned infrastructure engineers. Are you spinning up Karpenter? Well, my summer child, I hope you are aware of every single possible permission that EC2 nodes need to bootstrap themselves and join an EKS cluster. And let me tell you, that list is not tiny.
A lot of times in the “developer guides” AWS includes the correct policies as a role buried in the docs somewhere. But those guides are often not tailored to work with Terraform and the like so if you go the IaC route you need to figure them out, often by trial and error.
> 1. There are permissions at various layers. If anything along the chain doesn't line up, permission denied.
- I am shocked that you don't seem to find Deny By Default the best thing in the world... (looking at you Azure...)
> You need deep understanding of each service's specific IAM setup.
- Color me shocked...
> Ancillary permission requirements are not obvious if you're not familiar with the details of how a service works.
- Imagine...Having to understand how stuff works to be gainfully employed....
> Permission related failures do not make the root cause immediately clear.
Cloudtrail is your friend...
> Secrets related permissions are especially tricky.
- Define the complaint....
> The out-of-the-box managed policies are too broad and will often have you granting much more permissions than you need if you use them.
At least for AWS, you are not supposed at any point in time to use out-of-the-box managed policies. Instead, you should use them as templates for your own policies or create your own Customer Managed Policies from scratch.
"...Another best practice is to create a customer managed IAM policy that you can assign to users. Customer managed policies are standalone identity-based policies that you create and which you can attach to multiple users, groups, or roles in your AWS account. Such a policy restricts users to performing only the AWS Private CA actions that you specify..." - https://docs.aws.amazon.com/privateca/latest/userguide/auth-...
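A sketch of that workflow: build the pared-down document locally, then register it as a customer managed policy. The names, ARN, and the particular acm-pca actions chosen here are placeholders for illustration, not a recommended policy:

```python
import json

# A trimmed-down document, derived from (but much narrower than) what an
# AWS managed policy would grant.
trimmed_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["acm-pca:IssueCertificate", "acm-pca:GetCertificate"],
        "Resource": ("arn:aws:acm-pca:us-east-1:123456789012:"
                     "certificate-authority/*"),
    }],
}

# With credentials configured, registering it as a customer managed
# policy would be:
#   boto3.client("iam").create_policy(
#       PolicyName="app-private-ca-access",
#       PolicyDocument=json.dumps(trimmed_doc),
#   )
print(json.dumps(trimmed_doc, indent=2))
```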
> - I am shocked that you don't seem to find Deny By Default the best thing in the world... (looking at you Azure...)
The problem is not deny by default, but the complexity of setting "allow just the things I need". This is not easy.
> Cloudtrail is your friend...
Having to dig into the data of another service (that hopefully your org permissions allow you to read) instead of just being able to see a clear error message is not great DX. There are maybe valid security or performance reasons for not returning clear error messages, but there is a trade-off to usability made here.
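A sketch of what that digging looks like in practice, assuming your permissions let you read CloudTrail at all. The records below are made up but shaped like `lookup_events` output, where the full record is a JSON string under `CloudTrailEvent`:

```python
import json

# Made-up events shaped like boto3.client("cloudtrail").lookup_events()
# results; real code would page through the API instead.
events = [
    {"EventName": "GetObject", "CloudTrailEvent": json.dumps({
        "errorCode": "AccessDenied",
        "errorMessage": "User is not authorized to perform kms:Decrypt",
    })},
    {"EventName": "PutObject", "CloudTrailEvent": json.dumps({})},
]

# Pull out just the denials and their error messages.
denials = []
for e in events:
    record = json.loads(e["CloudTrailEvent"])
    if record.get("errorCode") == "AccessDenied":
        denials.append((e["EventName"], record.get("errorMessage")))

for name, msg in denials:
    print(name, "->", msg)
```

Even then, as noted above, the error message often only hints at the missing permission (here, a KMS one hiding behind an S3 call) rather than naming the policy that caused the deny.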
> At least for AWS, you are not supposed at any point in time to use out-of-the-box managed policies. Instead, you should use them as templates for your own policies or create your own Customer Managed Policies from scratch.
Right, but because they're so broad, the templates themselves are overly broad. Even just using them as a reference, it's difficult to pare down to just what you need. You will inevitably go too far and have to play around with combinations until you identify the real need.
---
The rest of your comments essentially boil down to saying "skill issue"/"git gud". I think that downplays just how hard these things are to get right. I worked at AWS for almost 8 years and have used it for several more years as a customer since then. I still wind up with runtime errors due to permissions issues that I need to debug. I still find myself needing to spend lots of time shuffling through official docs and blog posts people have written about how to set up specific combinations of AWS services. I've seen other engineers within AWS struggle with this. I've spoken with many founders at startups who've struggled with this. The biggest challenge comes up when first learning and getting acquainted with a service. You don't even know what you don't know, and there are many hurdles that can pop up along the way.
I mentioned it in my last comment, but CDK is probably the single biggest improvement to DX in the space here.
I'm not the person you're condescending to, but it is possible IMO to simultaneously recognize the security value in deny-by-default and Principle of Least Privilege while also finding it challenging to work with AWS's IAM permissions in practice.
The same way the person is condescending to the ones who don't find it so difficult. I would even go and argue that if you are already having issues with IAM, how do you expect to handle what is actually difficult?
CloudTrail is almost never useful on its own. So often CloudTrail will tell you something is denied and give you literally no other useful, direct information, especially as you start dealing with SCP-related denials. CloudTrail gives you a pile of metadata and says "here's everything, you figure it out". A mature audit solution would tell me the exact policy and line number that caused the denial and not play these guessing games.
The comment about out-of-the-box policies is true, I suppose, but hard to take seriously. Almost every policy example you encounter in the AWS documentation is insecure by default. They've gotten better over time noting this and pointing to better examples for different use cases. But it's still pretty bad.
I always heard the F-16 is a pretty easy airplane to fly... for trained pilots.
Maybe I have done consulting at too many startups or large enterprises with large cloud deployments, where most of the team seems to have barely spent any time with the docs. Some even proudly state they learned it by "looking in with colleagues"... or "on the job". Yes, it's Friday and that makes me grumpy...