Whether you agree with the article's recommendations or not, I do not understand how there are so many commenters saying "IAM is not that complicated". Even engineers internally at AWS frequently get tripped up with IAM permission settings. It's rare that someone gets them right on the first try.
Just some of the things that make it challenging:
1. There are permissions at various layers. If anything along the chain doesn't line up, permission denied.
2. You need deep understanding of each service's specific IAM setup. It's not enough to write a policy that will grant you read access to a DynamoDB table. Your application probably also needs to grant access to the GSI/LSI indices created.
3. Ancillary permission requirements are not obvious if you're not familiar with the details of how a service works. Want a Lambda function to have logs and traces? Make sure you have the relevant CW and X-Ray permissions set on it.
4. Permission related failures do not make the root cause immediately clear. Your S3 get operation may fail because you're missing permissions to the related KMS key. The usage of the ancillary KMS API calls here is not obvious unless you inspect the configuration details of the resource.
5. Secrets related permissions are especially tricky. To be able to read a cross-account secret, you need to grant the IAM identity permissions to get the secret value, grant the identity permissions to decrypt the associated KMS key used for the secret, grant the related account identities permissions to decrypt the key in the KMS resource policy, and grant the related account identities permissions to get the secret value in the secret resource policy. This is assuming there's no other things like SCPs and permissions boundaries mucking it up.
6. The out-of-the-box managed policies are too broad and will often have you granting much more permissions than you need if you use them.
Low-level IaC tools like CloudFormation and Terraform suck for this. They leave too much complexity to the end user to get right. CDK does mitigate the issue somewhat with its grantX methods, but even those are fairly limited and require you to write manual policy statements for many use-cases.
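To make point 2 concrete, a hand-written DynamoDB read policy has to enumerate the index ARNs separately from the table ARN. A minimal sketch in Python (the table name, index wildcard, and account ID are made up for illustration):

```python
import json

# Hypothetical table; GSIs/LSIs are separate resources under the table ARN,
# at arn:...:table/NAME/index/INDEXNAME.
table_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/orders"

# Read access to the table alone is not enough if the application queries
# an index -- the index ARNs must be granted too.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": [
                table_arn,               # the base table
                f"{table_arn}/index/*",  # every GSI/LSI on it
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Forget the second resource entry and `Query` calls against the GSI fail with an access-denied error even though "read access to the table" is granted.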
IAM is an excellent example of why from a developer perspective security is "broken".
Some security group at a company will have a "review" of your permissions. Occasionally they will run a "sweep" and yank permissions out from under you.
Instead, here we have an API that the security team can actively manage and use to PROVIDE A SOLUTION. Should a developer on some project have to have domain knowledge of IAM to make a perfect bespoke (and it WILL be bespoke) least-permission policy?
No, of course not. That domain knowledge should be a service in any substantive AWS org, where they provide it to you and, much more importantly, DEBUG it for you when it doesn't work.
Because here's the deal: IAM may be a bit ugly and have some cruft and evolution, and I believe S3 permissions are another entire headache atop IAM, but this is what an extremely fine grained permissions model looks like: detail hell.
ALL detailed permissions models will look like this. Defining perfect names (by definition coarse grained) to communicate the precise multidimensional n-brane border of a policy is basically impossible.
Here is another issue: in my last job they were obsessed with short-duration tokens and TOTP. Ok great. Hey wait, if I need to run an automated cluster-wide job that will take hours (backup, cleanup, log analysis, etc), what do I do then?
Security team didn't care. Automation? What's that? Just sit there watching the log and manually refresh the keys.
So I end up using a software TOTP generator and hacking it that way. I should not be doing that. It is likely a security hole. The security team should have heard my requirements, accepted them as a necessity (they are) and provided me a solution.
One to add to the list is that IAM conditions[0] are extremely powerful but there's no good way to know which conditions to use in which scenario and troubleshooting is very difficult.
For instance, if you look at the EC2 CreateNetworkInterface action[1] you'll see that there are three possible resources (network-interface (required), security-group (not required), subnet (required)), and each of those resources has several possible condition keys associated.
What's not obvious is which condition keys will be available in any given request. I've run the same CreateNetworkInterface request with the same parameters and IAM role twice in a row and by looking in the "encoded authorization message" that was returned with the failure in each case I found that in one case the resource was a security group while in the other case it was a subnet. Depending on the resource type different condition keys are available in the context. So if you want to allow CreateNetworkInterface but only if the ec2:SecurityGroupID is 'abc' it might or might not work.
An extra challenge is the encoded authorization message is truncated in CloudTrail so if you're using CloudFormation you don't actually get to see what the context was if a call fails. Then you have to find a way to make the same call CloudFormation made using an SDK so you can get the full text of the encoded authorization message.
There's no easy way to just say "try this API call with this role and tell me exactly what the context would be and what part of the IAM policy hits it if any"
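For what it's worth, the "encoded authorization message" can be decoded with sts:DecodeAuthorizationMessage, and the decoded JSON is where the resource and condition keys show up. A sketch of pulling them out (the record below is made up, and the field names only roughly follow the real DecodedMessage format):

```python
import json

# A trimmed, made-up example of what decode-authorization-message returns.
# In practice you'd get this string from:
#   boto3.client("sts").decode_authorization_message(
#       EncodedMessage=...)["DecodedMessage"]
decoded = json.dumps({
    "allowed": False,
    "explicitDeny": False,
    "context": {
        "action": "ec2:CreateNetworkInterface",
        "resource": "arn:aws:ec2:us-east-1:123456789012:subnet/subnet-0abc",
        "conditions": {"items": [
            {"key": "ec2:Subnet",
             "values": {"items": [{"value": "subnet-0abc"}]}},
            {"key": "ec2:Vpc",
             "values": {"items": [{"value": "vpc-0def"}]}},
        ]},
    },
})

msg = json.loads(decoded)
# Which resource was evaluated, and which condition keys were in context.
resource = msg["context"]["resource"]
condition_keys = [c["key"] for c in msg["context"]["conditions"]["items"]]
print(resource, condition_keys)
```

This is how you can see, after the fact, that one failed call was evaluated against the subnet and another against the security group, with different condition keys present each time.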
> I've run the same CreateNetworkInterface request with the same parameters and IAM role twice in a row and by looking in the "encoded authorization message" that was returned with the failure in each case I found that in one case the resource was a security group while in the other case it was a subnet.
Well EC2 would process these requests by first verifying subnet-related permissions before moving on to security group permissions. Variations in the error messages could reflect the point at which the request encounters a permission issue?
Policy simulator is indeed a great option except I didn't have access to it at the time because it was disabled via SCP :D
Kidding aside though, I'll try that if I face a similar issue in the future. It has been improving quite a bit lately.
> Well EC2 would process these requests by first verifying subnet-related permissions before moving on to security group permissions. Variations in the error messages could reflect the point at which the request encounters a permission issue?
I would think the context would be deterministic in that case but I verified calling the API with the same parameters using the same role twice in a row ended up with different 'resource' values in the context. It was almost like under the hood boto3 or something else was changing the order of the parameters in the API call which was changing the way the context was created. I could've put in a support case but had bigger fish to fry.
Something similar recently tripped me up: Some parts of AWS IAM are extremely detailed and you can create insanely specific policies allowing very precise control (almost to a fault). Other parts are very broad and unspecific.
For example, I recently needed to allow some EC2 instances to push a private IP around between them. I would have assumed I could create some policy along the lines of "Yeah, VMs with this role can push 10.20.30.40 around between their network interfaces". I haven't been able to find any way to restrict these IP addresses, so now I have the smallest policy I could create: "This role can assign fuck-any internal IPs to these interfaces, let's hope for the best." Doesn't really feel the greatest.
How does this control the private IP address that can be assigned? How does this stop the VM from just grabbing any IP? There isn't even anything IP-shaped in that policy.
This is all true, and it's a pain, but the situation is still improved from 20 years ago, when all of these layers were in separately-managed systems with no integration at all. Need to access the database? Well, it's in another datacenter that we haven't added to the backbone yet, so it'll need to traverse the internet. That means you'll need an ACL to get to the outbound NAT -- talk to datacenter team A for that. Then you'll need an ACL at Datacenter B to let your NAT'd IP in -- ticket datacenter B for that, we don't have any of our own people there. Then you can talk to the DBAs to get a username and password -- make sure they lock it down to just the schemas you need, for reasons of least-privilege.

At a large org you probably still have to talk to all those teams, but at a well-run one the conversation can be streamlined to a few pull requests against their IaC. At a small org running on one account, you can probably do it yourself in one merge.

AWS and GCP (not sure about Azure, but maybe them too) both now also offer relatively painless ways of auditing roles to see what permissions are actually in use, so you can trim them to what is needed. This kind of feature is not really feasible with the permissions spread across 5 heterogeneous systems.
Sure, we could just put everything on one VLAN and hand out . credentials, but you can do the equivalent in the cloud too.
I gave up trying to reason with it when I attempted to upload a docker image. Turns out I had permissions to upload an image, but not individual layers.
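For anyone hitting the same wall: a docker push is several distinct ECR API calls, and each needs its own action -- `ecr:PutImage` alone only covers the manifest, not the layers. A sketch of a push policy (the repository ARN and account ID are placeholders):

```python
import json

# Placeholder repository for illustration.
repo_arn = "arn:aws:ecr:us-east-1:123456789012:repository/my-app"

push_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # docker login; this action doesn't support resource-level
            # scoping, so it has to be granted on "*".
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
        },
        {
            # The layer-by-layer upload flow, then the manifest itself.
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:PutImage",
            ],
            "Resource": repo_arn,
        },
    ],
}

print(json.dumps(push_policy, indent=2))
```

Miss any of the layer-upload actions and you get exactly the failure described above: allowed to "upload an image", denied on the individual layers.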
And a lot of complexity comes from allowing a user or an external service to access some resources that my account owns. I remember inside AWS, an engineer who understands IAM thoroughly can have enormous influence because the engineer will easily become the go-to person for all kinds of design discussions. IAM is truly a complex beast.
Of all the things you mentioned, I think the ones related to 3 trip up even seasoned infrastructure engineers. Are you spinning up Karpenter? Well, my summer child, I hope you are aware of every single possible permission that EC2 nodes need to bootstrap themselves and join an EKS cluster. And let me tell you, that list is not tiny.
A lot of times in the “developer guides” AWS includes the correct policies as a role buried in the docs somewhere. But those guides are often not tailored to work with Terraform and the like so if you go the IaC route you need to figure them out, often by trial and error.
> 1. There are permissions at various layers. If anything along the chain doesn't line up, permission denied.
- I am shocked that you don't seem to find Deny By Default the best thing in the world... (looking at you Azure...)
> You need deep understanding of each service's specific IAM setup.
- Color me shocked...
> Ancillary permission requirements are not obvious if you're not familiar with the details of how a service works.
- Imagine...Having to understand how stuff works to be gainfully employed....
> Permission related failures do not make the root cause immediately clear.
Cloudtrail is your friend...
> Secrets related permissions are especially tricky.
- Define the complaint....
> The out-of-the-box managed policies are too broad and will often have you granting much more permissions than you need if you use them.
At least for AWS, you are not supposed at any point in time to use out-of-the-box managed policies. Instead, you should use them as templates for your own policies or create your own Customer Managed Policies from scratch.
"...Another best practice is to create a customer managed IAM policy that you can assign to users. Customer managed policies are standalone identity-based policies that you create and which you can attach to multiple users, groups, or roles in your AWS account. Such a policy restricts users to performing only the AWS Private CA actions that you specify..." - https://docs.aws.amazon.com/privateca/latest/userguide/auth-...
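A sketch of that workflow: build the pared-down document locally, then register it as a customer managed policy. The names, ARN, and the particular acm-pca actions chosen here are placeholders for illustration, not a recommended policy:

```python
import json

# A trimmed-down document, derived from (but much narrower than) what an
# AWS managed policy would grant.
trimmed_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["acm-pca:IssueCertificate", "acm-pca:GetCertificate"],
        "Resource": ("arn:aws:acm-pca:us-east-1:123456789012:"
                     "certificate-authority/*"),
    }],
}

# With credentials configured, registering it as a customer managed
# policy would be:
#   boto3.client("iam").create_policy(
#       PolicyName="app-private-ca-access",
#       PolicyDocument=json.dumps(trimmed_doc),
#   )
print(json.dumps(trimmed_doc, indent=2))
```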
> - I am shocked that you don't seem to find Deny By Default the best thing in the world... (looking at you Azure...)
The problem is not deny by default, but the complexity of setting "allow just the things I need". This is not easy.
> Cloudtrail is your friend...
Having to dig into the data of another service (that hopefully your org permissions allow you to read) instead of just being able to see a clear error message is not great DX. There are maybe valid security or performance reasons for not returning clear error messages, but there is a trade-off to usability made here.
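A sketch of what that digging looks like in practice, assuming your permissions let you read CloudTrail at all. The records below are made up but shaped like `lookup_events` output, where the full record is a JSON string under `CloudTrailEvent`:

```python
import json

# Made-up events shaped like boto3.client("cloudtrail").lookup_events()
# results; real code would page through the API instead.
events = [
    {"EventName": "GetObject", "CloudTrailEvent": json.dumps({
        "errorCode": "AccessDenied",
        "errorMessage": "User is not authorized to perform kms:Decrypt",
    })},
    {"EventName": "PutObject", "CloudTrailEvent": json.dumps({})},
]

# Pull out just the denials and their error messages.
denials = []
for e in events:
    record = json.loads(e["CloudTrailEvent"])
    if record.get("errorCode") == "AccessDenied":
        denials.append((e["EventName"], record.get("errorMessage")))

for name, msg in denials:
    print(name, "->", msg)
```

Even then, as noted above, the error message often only hints at the missing permission (here, a KMS one hiding behind an S3 call) rather than naming the policy that caused the deny.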
> At least for AWS, you are not supposed at any point in time to use out-of-the-box managed policies. Instead, you should use them as templates for your own policies or create your own Customer Managed Policies from scratch.
Right, but because they're so broad, the templates themselves are overly broad. Even just using them as a reference, it's difficult to pare down to just what you need. You will inevitably go too far and have to play around with combinations until you identify the real need.
---
The rest of your comments essentially boil down to saying "skill issue"/"git gud". I think that downplays just how hard these things are to get right. I worked at AWS for almost 8 years and have used it for several more years as a customer since then. I still wind up with runtime errors due to permissions issues that I need to debug. I still find myself needing to spend lots of time shuffling through official docs and blog posts people have written about how to set up specific combinations of AWS services. I've seen other engineers within AWS struggle with this. I've spoken with many founders at startups who've struggled with this. The biggest challenge comes up when first learning and getting acquainted with a service. You don't even know what you don't know, and there are many hurdles that can pop up along the way.
I mentioned it in my last comment, but CDK is probably the single biggest improvement to DX in the space here.
I'm not the person you're condescending to, but it is possible IMO to simultaneously recognize the security value in deny-by-default and Principle of Least Privilege while also finding it challenging to work with AWS's IAM permissions in practice.
The same way the person is condescending to the ones who don't find it so difficult. I would even go and argue that if you are already having issues with IAM, how do you expect to handle what is actually difficult?
CloudTrail is almost never useful on its own. So often CloudTrail will tell you something is denied and give you literally no other useful, direct information, especially as you start dealing with SCP-related denials. CloudTrail gives you a pile of metadata and says "here's everything, you figure it out". A mature audit solution would tell me the exact policy and line number that caused the denial and not play these guessing games.
The comment about out-of-the-box policies is true, I suppose, but hard to take seriously. Almost every policy example you encounter in the AWS documentation is insecure by default. They've gotten better over time noting this and pointing to better examples for different use cases. But it's still pretty bad.
I always heard the F-16 is a pretty easy airplane to fly... for trained pilots.
Maybe I have done consulting at too many startups or large enterprises with large cloud deployments, where most of the team seems to have barely spent any time with the docs. Some even proudly state they learned it by "looking in with colleagues"... or "on the job". Yes, it's Friday and that makes me grumpy...