Whether you agree with the article's recommendations or not, I do not understand how there are so many commenters saying "IAM is not that complicated". Even engineers internally at AWS frequently get tripped up with IAM permission settings. It's rare that someone gets them right on the first try.
Just some of the things that make it challenging:
1. There are permissions at various layers. If anything along the chain doesn't line up, permission denied.
2. You need deep understanding of each service's specific IAM setup. It's not enough to write a policy that will grant you read access to a DynamoDB table. Your application probably also needs to grant access to the GSI/LSI indices created.
3. Ancillary permission requirements are not obvious if you're not familiar with the details of how a service works. Want a Lambda function to have logs and traces? Make sure you have the relevant CloudWatch and X-Ray permissions set on it.
4. Permission related failures do not make the root cause immediately clear. Your S3 get operation may fail because you're missing permissions to the related KMS key. The usage of the ancillary KMS API calls here is not obvious unless you inspect the configuration details of the resource.
5. Secrets related permissions are especially tricky. To be able to read a cross-account secret, you need to grant the IAM identity permissions to get the secret value, grant the identity permissions to decrypt the associated KMS key used for the secret, grant the related account identities permissions to decrypt the key in the KMS resource policy, and grant the related account identities permissions to get the secret value in the secret resource policy. This is assuming there's no other things like SCPs and permissions boundaries mucking it up.
6. The out-of-the-box managed policies are too broad and will often have you granting much more permissions than you need if you use them.
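To make point 5 concrete, here is a sketch of the four grants a cross-account secret read needs. All account IDs, ARNs and names below are made up; the shape of the statements is the point:

```python
import json

# Hypothetical identifiers -- substitute your own.
CONSUMER_ROLE = "arn:aws:iam::111111111111:role/app-role"  # reader in account A
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:222222222222:secret:db-creds-AbCdEf"
KEY_ARN = "arn:aws:kms:us-east-1:222222222222:key/1234abcd-12ab-34cd-56ef-1234567890ab"

# Grants 1 + 2: identity policy on the consumer role (account A)
identity_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "secretsmanager:GetSecretValue", "Resource": SECRET_ARN},
        {"Effect": "Allow", "Action": "kms:Decrypt", "Resource": KEY_ARN},
    ],
}

# Grant 3: the KMS key policy in account B must let the foreign principal decrypt
key_policy_statement = {
    "Effect": "Allow",
    "Principal": {"AWS": CONSUMER_ROLE},
    "Action": "kms:Decrypt",
    "Resource": "*",  # a key policy refers to the key itself
}

# Grant 4: the secret's resource policy in account B must let the foreign principal read
secret_policy_statement = {
    "Effect": "Allow",
    "Principal": {"AWS": CONSUMER_ROLE},
    "Action": "secretsmanager:GetSecretValue",
    "Resource": "*",
}

print(json.dumps(identity_policy, indent=2))
```

Miss any one of the four and you get the same generic access-denied error, with no hint about which side of which account is missing the grant.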
Low-level IaC tools like CloudFormation and Terraform suck for this. They leave too much complexity to the end user to get right. CDK does mitigate the issue somewhat with its grantX methods, but even those are fairly limited and require you to write manual policy statements for many use cases.
IAM is an excellent example of why from a developer perspective security is "broken".
Some security group at a company will have a "review" of your permissions. Occasionally they will run a "sweep" and yank permissions out from under you.
Instead, here we have an API that the security team can actively manage and PROVIDE A SOLUTION. Should a developer on some project have domain knowledge of IAM to make a perfect bespoke (and it WILL be bespoke) least-permission policy?
No, of course not. That domain knowledge should be a service in any substantive AWS org, where they provide it to you and, much more importantly, DEBUG it for you when it doesn't work.
Because here's the deal: IAM may be a bit ugly and carry some cruft from its evolution, and I believe S3 permissions are another entire headache atop IAM, but this is what an extremely fine grained permissions model looks like: detail hell.
ALL detailed permissions models will look like this. Defining perfect names (by definition coarse grained) to communicate the precise multidimensional n-brane border of a policy is basically impossible.
Here is another issue: in my last job they were obsessed with short-duration tokens and TOTP. Ok great. Hey wait, if I need to run an automated cluster-wide job that will take hours (backup, cleanup, log analysis, etc), what do I do then?
Security team didn't care. Automation? What's that? Just sit there watching the log and manually refresh the keys.
So I end up using a software TOTP generator and hacking it that way. I should not be doing that. It is likely a security hole. The security team should have heard my requirements, accepted them as a necessity (they are) and provided me a solution.
One to add to the list is that IAM conditions[0] are extremely powerful but there's no good way to know which conditions to use in which scenario and troubleshooting is very difficult.
For instance, if you look at the EC2 CreateNetworkInterface action[1] you'll see that there are three possible resources (network-interface (required), security-group (not required), subnet (required)) and each of those resources has several possible condition keys associated.
What's not obvious is which condition keys will be available in any given request. I've run the same CreateNetworkInterface request with the same parameters and IAM role twice in a row and by looking in the "encoded authorization message" that was returned with the failure in each case I found that in one case the resource was a security group while in the other case it was a subnet. Depending on the resource type different condition keys are available in the context. So if you want to allow CreateNetworkInterface but only if the ec2:SecurityGroupID is 'abc' it might or might not work.
An extra challenge is the encoded authorization message is truncated in CloudTrail so if you're using CloudFormation you don't actually get to see what the context was if a call fails. Then you have to find a way to make the same call CloudFormation made using an SDK so you can get the full text of the encoded authorization message.
There's no easy way to just say "try this API call with this role and tell me exactly what the context would be and what part of the IAM policy hits it if any"
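The closest workaround is to decode the message yourself via sts:DecodeAuthorizationMessage (which needs its own permission grant) and fish the resource out of the context. A sketch, with a made-up and heavily abbreviated payload standing in for a real decoded message:

```python
import json

# Stand-in for the output of:
#   aws sts decode-authorization-message --encoded-message <blob>
# A real payload is much larger; this shape and these keys are illustrative only.
decoded = json.dumps({
    "allowed": False,
    "context": {
        "action": "ec2:CreateNetworkInterface",
        "resource": "arn:aws:ec2:us-east-1:111111111111:subnet/subnet-0abc",
        "conditions": {"items": [{"key": "ec2:SubnetID"}]},
    },
})

def denied_resource(decoded_message: str) -> str:
    """Return the resource the denial was evaluated against."""
    return json.loads(decoded_message)["context"]["resource"]

# Which resource type did IAM pick this time -- subnet or security group?
print(denied_resource(decoded))
```

That still only tells you what the context was for one particular evaluation, not which policy line caused the deny.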
> I've run the same CreateNetworkInterface request with the same parameters and IAM role twice in a row and by looking in the "encoded authorization message" that was returned with the failure in each case I found that in one case the resource was a security group while in the other case it was a subnet.
Well EC2 would process these requests by first verifying subnet-related permissions before moving on to security group permissions. Variations in the error messages could reflect the point at which the request encounters a permission issue?
Policy simulator is indeed a great option except I didn't have access to it at the time because it was disabled via SCP :D
Kidding aside though, I'll try that if I face a similar issue in the future. It has been improving quite a bit lately.
> Well EC2 would process these requests by first verifying subnet-related permissions before moving on to security group permissions. Variations in the error messages could reflect the point at which the request encounters a permission issue?
I would think the context would be deterministic in that case but I verified calling the API with the same parameters using the same role twice in a row ended up with different 'resource' values in the context. It was almost like under the hood boto3 or something else was changing the order of the parameters in the API call which was changing the way the context was created. I could've put in a support case but had bigger fish to fry.
Something similar recently tripped me up: Some parts of AWS IAM are extremely detailed and you can create insanely specific policies allowing very precise control (almost to a fault). Other parts are very broad and unspecific.
For example, I recently needed to allow some EC2 instances to push a private IP around between them. I would have assumed I can create some policy along the lines of "Yeah, VMs with this role can push 10.20.30.40 around between their network interfaces". I haven't been able to find any way to restrict these IP addresses, so now I have the smallest policy I could create: "This role can assign fuck-any internal IPs to these interfaces, let's hope for the best." Doesn't really feel the greatest.
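For reference, the narrowest statement I could come up with looks something like the below (account, ARN and tag values are hypothetical). As far as I can tell there is simply no condition key that pins the address itself:

```python
# This scopes *which interfaces* the role can touch, but I found no
# condition key that restricts *which private IP* it may assign.
assign_ip_statement = {
    "Effect": "Allow",
    "Action": [
        "ec2:AssignPrivateIpAddresses",
        "ec2:UnassignPrivateIpAddresses",
    ],
    "Resource": "arn:aws:ec2:eu-west-1:111111111111:network-interface/*",
    # Tag-based scoping (assuming this action supports it) is about as
    # precise as it seems to get here.
    "Condition": {"StringEquals": {"aws:ResourceTag/failover-group": "web"}},
}
```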
How does this control the private IP address that can be assigned? How does this stop the VM from just grabbing any IP? There isn't even anything IP-shaped in that policy.
This is all true, and it's a pain, but the situation is still improved from 20 years ago, when all of these layers were in separately-managed systems with no integration at all. Need to access the database? Well, it's in another datacenter that we haven't added to the backbone yet, so it'll need to traverse the internet. That means you'll need an ACL to get to the outbound NAT -- talk to datacenter team A for that. Then you'll need an ACL at Datacenter B to let your NAT'd IP in -- ticket datacenter B for that, we don't have any of our own people there. Then you can talk to the DBAs to get a username and password -- make sure they lock it down to just the schemas you need, for reasons of least-privilege.

At a large org you probably still have to talk to all those teams, but at a well-run one the conversation can be streamlined to a few pull requests against their IaC. At a small org running on one account, you can probably do it yourself in one merge.

AWS and GCP (not sure about Azure, but maybe them too) both now also offer relatively painless ways of auditing roles to see what permissions are actually in use, so you can trim them to what is needed. This kind of feature is not really feasible with the permissions spread across 5 heterogeneous systems.
Sure, we could just put everything on one VLAN and hand out credentials, but you can do the equivalent in the cloud too.
I gave up trying to reason with it when I attempted to upload a docker image. Turns out I had permissions to upload an image, but not individual layers.
And a lot of complexity comes from allowing a user or an external service to access some resources that my account owns. I remember inside AWS, an engineer who understands IAM thoroughly can have enormous influence because the engineer will easily become the go-to person for all kinds of design discussions. IAM is truly a complex beast.
Of all the things you mentioned, I think things related to 3 are the ones that trip up even seasoned infrastructure engineers. Are you spinning up Karpenter? Well, my summer child, I hope you are aware of every single possible permission that EC2 nodes need to bootstrap themselves and join an EKS cluster. And let me tell you, that list is not tiny.
A lot of times in the “developer guides” AWS includes the correct policies as a role buried in the docs somewhere. But those guides are often not tailored to work with Terraform and the like so if you go the IaC route you need to figure them out, often by trial and error.
> 1. There are permissions at various layers. If anything along the chain doesn't line up, permission denied.
- I am shocked that you don't seem to find Deny By Default the best thing in the world... (looking at you Azure...)
> You need deep understanding of each service's specific IAM setup.
- Color me shocked...
> Ancillary permission requirements are not obvious if you're not familiar with the details of how a service works.
- Imagine...Having to understand how stuff works to be gainfully employed....
> Permission related failures do not make the root cause immediately clear.
Cloudtrail is your friend...
> Secrets related permissions are especially tricky.
- Define the complaint....
> The out-of-the-box managed policies are too broad and will often have you granting much more permissions than you need if you use them.
At least for AWS, you are never supposed to use the out-of-the-box managed policies directly. Instead, you should use them as templates for your own policies or create your own Customer Managed Policies from scratch.
"...Another best practice is to create a customer managed IAM policy that you can assign to users. Customer managed policies are standalone identity-based policies that you create and which you can attach to multiple users, groups, or roles in your AWS account. Such a policy restricts users to performing only the AWS Private CA actions that you specify..." - https://docs.aws.amazon.com/privateca/latest/userguide/auth-...
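As an illustration (table name and account are invented), a from-scratch customer managed policy for the DynamoDB case from point 2 at the top of the thread would look roughly like:

```python
import json

# Invented identifiers. Note the separate /index/* ARN: leaving it out is
# exactly the GSI/LSI trap mentioned at the top of the thread.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": [
            "arn:aws:dynamodb:us-east-1:111111111111:table/Orders",
            "arn:aws:dynamodb:us-east-1:111111111111:table/Orders/index/*",
        ],
    }],
}
print(json.dumps(policy, indent=2))
```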
> - I am shocked that you don't seem to find Deny By Default the best thing in the world... (looking at you Azure...)
The problem is not deny by default, but the complexity of setting "allow just the things I need". This is not easy.
> Cloudtrail is your friend...
Having to dig into the data of another service (that hopefully your org permissions allow you to read) instead of just being able to see a clear error message is not great DX. There are maybe valid security or performance reasons for not returning clear error messages, but there is a trade-off to usability made here.
> At least for AWS, you are not supposed at any point in time to use out-of-the-box managed policies. Instead, you should use them as templates for your own policies or create your own Customer Managed Policies from scratch.
Right, but because they're so broad, the templates themselves are overly broad. Even just using them as a reference, it's difficult to pare down to just what you need. You will inevitably go too far and have to play around with combinations until you identify the real need.
---
The rest of your comments essentially boil down to saying "skill issue"/"git gud". I think that downplays just how hard these things are to get right. I worked at AWS for almost 8 years and have used it for several more years as a customer since then. I still wind up with runtime errors due to permissions issues that I need to debug. I still find myself needing to spend lots of time shuffling through official docs and blog posts people have written about how to set up specific combinations of AWS services. I've seen other engineers within AWS struggle with this. I've spoken with many founders at startups who've struggled with this. The biggest challenge comes up when first learning and getting acquainted with a service. You don't even know what you don't know, and there are many hurdles that can pop up along the way.
I mentioned it in my last comment, but CDK is probably the single biggest improvement to DX in the space here.
I'm not the person you're condescending to, but it is possible IMO to simultaneously recognize the security value in deny-by-default and Principle of Least Privilege while also finding it challenging to work with AWS's IAM permissions in practice.
The same way the person is condescending to the ones who don't find it so difficult. I would even argue that if you are already having issues with IAM, how do you expect to handle what is actually difficult?
CloudTrail is almost never useful on its own. So often CloudTrail will tell you something is denied and give you literally no other useful, direct information, especially as you start dealing with SCP-related denials. CloudTrail gives you a pile of metadata and says "here's everything, you figure it out". A mature audit solution would tell me the exact policy and line number that caused the denial and not play these guessing games.
The comment about out-of-the-box policies is true, I suppose, but hard to take seriously. Almost every policy example you encounter in the AWS documentation is insecure by default. They've gotten better over time noting this and pointing to better examples for different use cases. But it's still pretty bad.
I always heard the F-16 is a pretty easy airplane to fly... for trained pilots.
Maybe I have done consulting at too many startups and large enterprises with big cloud deployments, where most of the team seems to have barely spent any time with the docs. Some even proudly state they learned it by "looking in with colleagues"... or "on the job". Yes, it's Friday and that makes me grumpy...
I agree with people saying IAM is not that complicated. On the other hand, I think I also agree that IAM is extremely complicated. It's both. I'm serious!
The problem with IAM isn't the functionality it offers; it's almost exactly what you want. I mean look at what it actually does, isn't it literally exactly what you would do? Sure, maybe the terms are confusing or something, but on the whole... it's hard to argue that what it offers isn't basically what you want. You want to allow identities to do things. You group those things into roles. And so on.
That said, the more powerful and granular permissions and ACLs get, the more grand the architecture you have to craft to make good use of it, and I think at some point, your own IAM rules become a work of engineering themselves. You wind up having to specifically engineer around and for IAM. This is not unique to cloud IAM; when doing complex NixOS setups, I have occasionally realized that my webs of plumbing secrets through SOPS to systemd units, and setting up group permissions for UNIX domain sockets between services, winds up getting quite complex quite quickly, and that it is basically, yes, engineering of its own sort. And if you add cgroups and network namespaces and nftables rules and seccomp, god help you, it's even worse than IAM! And that's just on a single machine...
IAM is "not complicated and complicated" in the same sense as UI frameworks/ecosystems/boost are. The "concept" is fairly simple, but you have to know a ton of bits and bobs to make sense of it all, for your particular use case. And you often have to dig deeper than you would like.
I don't run any clouds myself, but if I were AWS or whoever, I'd think there are a whole lot of ways to make this process more ergonomic without sacrificing functionality. A tool that can report permission failures with a "should we allow this yes/no?" button for admins. A user (real or service) tries to run a cache invalidation, or write to a Redshift DB, or add an IP to a security group, and gets a "permission denied". The admin gets an email or a ticket saying what failed and clicks a button to allow or deny.
Typing that out it's really the massive gulf between the abstraction "User wants to invalidate a cache" vs the implementation of 87 granular grants with obscure nomenclature.
Isn't this already how it works tho? At least in my time doing internships, I would hit some screen saying access needed and just press a couple of buttons tagging my manager and some sec team and soon enough it would be granted or explained that, actually no, I was accessing the wrong stuff, and be pointed in the right way.
This is more or less what the article proposes except the article proposes doing this one time after collecting a full list of actions.
Both that solution and the one in the article miss one point though: if you use the AWS Console at all, it makes hundreds of calls to all manner of AWS services in all available regions. Because of this you can't just assume the calls made by a role intended for interactive use over some period are the "correct" privileges for that role, because someone just clicking around in the console will generate thousands of API calls to many different services.
Please don't do what this article advises. These are 10 min of my life I will never get back...
IAM is not that complicated. The example diagram for AWS is showing all the features available. You can use only what you require.
Do two things from the beginning:
1) Least Privilege - Only the permissions required: You need to do that because, due to the constant zero days around the different software vendors, you are statistically guaranteed to have a security event. Establishing security boundaries will allow you to contain the damage.
In combination with....
2) Temporary Privileges - Assume the roles with the permissions you require only when you require them: This will make it harder on an attacker, since they will need to compromise you while you are holding those elevated privileges. When you finish your task, either manually or programmatically, detach from the role or assume a different one with lower or different permissions.
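On AWS the mechanics of point 2 are just sts:AssumeRole with a short session duration. A sketch (the role ARN is hypothetical, and actually running it needs boto3 plus base credentials that are allowed to assume the role):

```python
def assume_task_role(role_arn: str, task: str, duration_seconds: int = 900):
    """Fetch short-lived credentials for a single task, then let them lapse.

    Hypothetical helper: requires boto3 and base credentials permitted
    to sts:AssumeRole into role_arn.
    """
    import boto3  # imported lazily so this file loads without boto3 installed

    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"task-{task}",
        DurationSeconds=duration_seconds,  # credentials expire on their own
    )
    return resp["Credentials"]  # AccessKeyId / SecretAccessKey / SessionToken

# e.g.: creds = assume_task_role("arn:aws:iam::111111111111:role/db-maintenance", "backup")
```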
The article completely addresses the problems in #1 and you've offered no solutions. It offers a solution for determining what the hell the least privileges are, while you assume prescient knowledge of them without justification. Your #2 is also exactly one of its recommendations.
> The article completely addresses the problems in #1 and you’ve offered no solutions.
You don't define the privileges you need by running around with full privileges. The article is pitching a tool the author developed, but you have been able to do the same, at least on AWS, for a long time. And if you don't know what policies you need, you should talk to either the vendor or the application creator, as you will never be able to exercise all the code paths...
I really love the trend of simply restating what an article argues against without addressing the actual article's points.
I had an exhausting discussion on Reddit about why storing UTC is not always sufficient, with commenters who continually proved they hadn't read the article or the rest of the comments.
I'm gonna go with "a hyperbolic, but potentially-necessary corrective to the mass cargo-culting of dev practices only appropriate to global-scale organizations, with a possibly egotistical amount of clickbait self-promotion" on that one.
Temporary Privileges is an example of security theater. It does not change the attack surface in a threat model because the model is not temporal. Has anyone ever shown any practical evidence that an attacker will face a problem waiting for the elevated permissions, to support this idea?
And if you still really want to do that, you don't need AWS roles as a separate concept for this. You can just use temporary membership in groups.
AWS IAM model is overcomplicated compared to GCP/Azure.
It becomes clear when you try to migrate the project with multiple envs (projects in GCP aka accounts in AWS) with cross-env accesses.
It's not just theater: it can be the difference between someone who finds an unlocked laptop needing to refresh access before doing anything more, and them having 100% unfettered access. This is only one attack vector that gets safer with temporary tokens and short expiries, and for the dev under normal work conditions it just means every few hours (or whatever TTL) you need to place your finger on your fingerprint reader for one second.
I think you’re talking about something different. AWS session tokens let you use your SSO to request session tokens that have a short expiry. So you can do API/console actions but if an attacker takes the creds, they expire. It also lets you generate session tokens that only have the subset of your allowed perms that you need for that workflow.
Yea. Like, I’m 100% on board with the idea that AWS IAM is full of footguns, is overcomplicated, and is hard to get right (I assume other cloud platforms are similar, but my expertise is largely specific to AWS).
But also it’s a complex and very important problem space.
> Temporary Privileges is an example of security theater. It does not change the attack surface in a threat model because the model is not temporal. Has anyone ever shown any practical evidence to support this idea?
I agree. I think the benefit of this is quite low. If someone takes over your machine, things are lost anyway. If they take over your machine but for some reason cannot access your password manager (or so) or your 2FA to increase privileges, you at least gain some time before the attack happens since the attacker has to wait, but it's not a major win.
The only case where this really helps is if the attacker gains only temporary access to you (maybe a temporary vulnerability in the browser) but can't "persist" it. In that case, you can reduce the blast radius.
Are you aware that you can temporarily associate a role with an EC2 instance via an instance profile? And attach and detach them via API or on a schedule? Because if you do that, and you hack the machine (Linux, Windows or Mac, not relevant...), but you don't have the role with the privileges you need, you are going nowhere.
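A sketch of that attach/detach flow (helper names are mine; actually running it requires boto3 and the relevant EC2/IAM permissions):

```python
def attach_profile(instance_id: str, profile_arn: str) -> str:
    """Attach an instance profile; returns the association id needed to detach.

    Hypothetical helper: requires boto3 and permission for the EC2 call.
    """
    import boto3  # lazy import so this file loads without boto3 installed

    ec2 = boto3.client("ec2")
    resp = ec2.associate_iam_instance_profile(
        IamInstanceProfile={"Arn": profile_arn},
        InstanceId=instance_id,
    )
    return resp["IamInstanceProfileAssociation"]["AssociationId"]

def detach_profile(association_id: str) -> None:
    """Detach a previously attached instance profile."""
    import boto3

    boto3.client("ec2").disassociate_iam_instance_profile(
        AssociationId=association_id,
    )
```

Run attach_profile right before the privileged job and detach_profile right after (or from a scheduler), and the box holds the elevated role only inside that window.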
Yeah, so that means the attacker has to wait for the next schedule. As I said, that's an advantage but I wouldn't classify it as a major win.
It's different if you use a different machine for the privileged account. Then an attacker has to take over that second machine too. IMHO this is a much better concept, but it also increases friction significantly.
It also helps in large companies where lots of people may have needed the higher permissions over the years - if you do temporary permissions then someone currently/soon having those permissions needs to be hacked, as opposed to hacking any one of the many people who needed those permissions a few years ago and still have them because they weren't temporary and got forgotten about.
I think this is a different case though. The point of those "temporary privileges" is not so that someone only can use them sometimes, it's that they only do use them sometimes.
What you say (that people should only have the permissions that they need) is orthogonal to that.
Why do you assert that all threat models have no temporal component?
A realistic attack that compromises a low privilege session may not be able to leverage that into higher privileges. Therefore, limiting the use of high privilege to a smaller window of time definitely reduces the attack surface.
GCP handles #1 with their recommended actions on IAM roles. Good luck doing this manually. GCP will give advice on each IAM role, such as which principal hasn't used their access in three months, and encourage you to take action (remove it).
GCP has a glaring problem with temporary privileges. The console doesn’t support them. So if there is some operational activity that a user needs to do via gui rarely you either have to manually give and take those permissions or let them sit with elevated permissions all the time.
And the lack of roles of roles is beyond idiotic, it’s downright dangerous.
Just temporarily add the user to yet another group with the needed permissions. Or use an IAM conditional policy.
Or impersonate a service account (which is more or less the same as assuming an AWS role).
GCP's lack of the "AWS role" concept is great and straightforward.
As well as its lack of both identity-bound policies and resource-bound policies at the same time.
That hell wouldn't exist without the guys who like to endlessly and "dutifully" set up and rearrange security groups, IAM, and LDAP hierarchies, just in order to feel important, I guess. Every company bigger than a dozen employees has this type of guy, and they are a nightmare to work with; their vision is very often detached from the reality of how the company works, but somehow they have the illusion that the policy existing mainly in their heads is more important than the real behaviours of people in the organization.
These guys are often hired to implement regulation or certification requirements and the organization, if its goal is to comply, has to change its behavior and processes.
Not saying your point is not true, I met guys who did it just because too. But it's not always malice or incompetence on their part.
My experience is that both of you are right: the security people implement important regulations, and they do so without ever looking at the business processes themselves. Then it's up to the targeted people to chase exceptions and recategorization and and and, which on one hand creates a friction which eats up lots of resources and time, and secondly pokes holes in that exact perfect structure it was supposed to create. And all this could be avoided if security worked hand in hand with business but no, security is all ivory towers and business is all "don't touch my rights". Aka, guaranteed constant conflict and frustration.
That's because the business's goal is to make something work, and security's goal is to stop something from happening (or to comply with a process with that goal in mind).
Hopefully not the same ‘thing’ being targeted of course, because then it will get really bad, but yes conflict is inevitable.
Sorry for your experience. But it's not always like this.
Sane security guys always remember that they are there to support the business, not to slow down the development or to prevent useful information and assets from being accessed. So they are ready to accept compromises and are always trying to control the potential risks keeping the comfort of the colleagues in mind.
The problem with IAM systems is they tend to try to encompass so many different functionalities, and stay unopinionated, that there are just so many ways to achieve similar end results. This opens the way for endless bikeshedding, and unfortunately is inevitable to some degree in large enough organizations.
This is a bit of a shameless plug, but I hope since it's an open source project it's okay. I'm working on a suite of tools called Otterize (otter and authorize, get it, haha :) that automates workload IAM for Kubernetes workloads.
You label your Pods to get an AWS/GCP/Azure role created, and in a Kubernetes resource specify the access you need, and everything else is done by the Otterize Kubernetes operators so that your pod works.
It's a lot simpler than all the kungfu you normally have to do, but it's not magic, honestly, it's just the result of limiting scope and having an opinionated view of what the development workflow should look like. Basically, instead of maximizing on capabilities, it trades some capabilities to maximize on developer comfort.
Check it out if you're keen on contributing, or just think IAM has a tendency to devolve into a mess ridden with politics.
github.com/otterize/intents-operator and docs.otterize.com
What’s nice about having 30 years of experience is that I don’t need anyone else’s confirmation when I realize something is poorly designed. If I can build firewalls in OpenBSD or Cisco IOS in text mode, SELinux, etc and IAM is coming off as byzantine, it’s because it is.
Not that I blame Amazon. I think they’re a victim of their own success in this regard and it was a solution that was devised ad hoc reactively as they ran into authorization problems rather than something that was architected top down. When you do that you always end up with a mess, but they may not have had a choice.
I am well aware that identity and access authorisation is complex, and AWS, GCP and Azure desperately needed to add the capability to their portfolio as this is mandatory for enterprise sales which is where the big bucks are.
But boy as soon as they started adding IAM they took all the fun out of deploying my personal shit to any cloud.
I put one startup on fly.io just because it was too difficult to communicate AWS intricacies to the founders. I'm ok having a fixed secret to authorize client A to talk to API B where needed, and the actual inner network is all inside wireguard tunnels, automatically provided by Fly.
I'm glad someone wrote about it. IAM is hella confusing to me. Roles, role impersonation, and discovering which roles my user needs to use some service. It's terrible. At least on GCP, clicking through the menus, the wizards and front end are nice enough that they make enabling things and getting IAM right easier.
I think the usual part of AWS IAM is simple. You develop your app, it uses a Lambda role or an ECS task role, and you keep adding permissions to the role as needed. When you query a new table in DynamoDB, your app won't work until you add the missing permission. You always add those permissions to the CDK/CloudFormation definition of the role, so that the role works in any AWS account the app is deployed to. CDK actually handles most of this automatically, as you define the relationships between your cloud resources.
The more complicated stuff starts happening when you have many services that need to access each other's resources directly. Then you need to think a bit more about the architecture and how you expose resource names, managed policies and roles between services. It's no longer just simple role definitions within the CloudFormation stack, but you have to pass around account identifiers, regions and resource ARNs in the configuration.
The cognitive dissonance from trying to keep both AWS IAM and GCS IAM in my head at the same time, when they're almost but not quite entirely unlike each other, is maddening and gives me Induced Dissociative Identity Disorder (IDID).
They should merge them together into one big happy standard, and call it Worldwide Enterprise Access Rights Environment (WEARE).
Maybe if they signed up enough celebrities to sing a song about it, it would make the world a better place...
IAM is hideous; users, organisations and policies are a baroque mess where you're expected to hand-edit JSON files specifying resources, then debug the cryptic error messages thrown when even their own examples are pasted in.
Plus the documentation is out of date in so many places, describing actions to take that have long since changed.
It's ok for simple single-user use cases, and for large organisations that can afford to allocate the time and effort to administer the thing; anywhere in between, just say no.
(ranting a bit because I just lost the whole morning adding a policy to allow a single user to do a limited set of things).
IAM could be way better with some default tools. For all I know they already exist, but I haven't seen anything close in the training or poking around.
Example: even with something as abstruse as SELinux, you can literally just attempt whatever you're trying to do, then pipe the failure log into `audit2allow`, and the permissions will be set to least privilege automatically; in most situations it then just works.
You can also pipe the failure log into `audit2why`, which spits out a bunch of advice in plain text as well as a copy/paste command to implement it.
IAM is in a sorry state if it could stand to take design notes from SELinux.
My authorization boundaries are almost entirely in Userify within projects and server groups.
I only associate other permissions through AWS instance roles, because I try to not give out specific IAM roles or keys to users or developers AT ALL -- only to the instances they're logging into.
Obviously this probably can't work for all companies and is dependent on how you have your environments set up, but we even run development on EC2 instances, and with Userify we get a color coded view for who can log into which instances, and then those instances already have the correct custom role.
GCP IAM is the worst. AWS IAM is not nearly as bad.
GCP sucks so bad as a product that the only way to tell what IAM policies apply to your service account is to run some kind of analysis query exported to BigQuery (which will cost you money).
You'd think you could just go into the console, click on the service account, and see which policies and roles are bound to it? That would make sense, and be convenient. But this is Google we're talking about. Engineering principles will always trump customer experience.
It's much worse than that of course. The default roles give too many permissions, for nearly anything you want to do. Often you are limited by what you can control, to only at an Org level, or Folder, or Project. Yet making a custom role is often difficult, leaving you to usually just slap on the default roles, making your resources insecure. Much of the time, a user must have an Admin-level permission over all VMs in order to SSH into them with GCP creds. Kind of defeating the purpose of having IAM to begin with.
I think the only reason we haven't heard of more GCP accounts getting compromised due to the shitty default policies is, thankfully, GCP has few customers.
I quite like the way JupyterHub deals with scopes¹ via roles²:
You create a role called "cleanup user servers", which requires a scope of "read:users" and "admin:server"³ (i.e. a role is bundle of scopes), and then define the services that role has, such as "garbage-collect".
You then create a new service named "garbage-collect" and associate an API key with it, and boom: that API key has the permissions to read user metrics and administer the servers, and is tied to a specific service.
Want to add another service to that role (e.g. "berate-users")? Just add it as a service in that role definition, and you can use the exact same API key.
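For reference, roughly what this looks like in a `jupyterhub_config.py` (the role and service names come from the comment above; treat the exact scope strings as illustrative — JupyterHub spells the server-admin scope `admin:servers`). Modelled here as plain data with a small helper so the mechanics are visible:

```python
# Illustrative role/service data, as it would appear in jupyterhub_config.py
# via c.JupyterHub.load_roles and c.JupyterHub.services.
load_roles = [
    {
        "name": "cleanup-user-servers",
        "scopes": ["read:users", "admin:servers"],  # a role is a bundle of scopes
        "services": ["garbage-collect"],            # services holding this role
    }
]

def scopes_for(service_name, roles):
    """Every scope a service inherits from the roles that list it."""
    return sorted(
        scope
        for role in roles
        if service_name in role["services"]
        for scope in role["scopes"]
    )

print(scopes_for("garbage-collect", load_roles))
```

Appending "berate-users" to the role's `services` list is exactly the "same bundle of scopes, no new policy plumbing" move the comment describes.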
The solution that the OP says they want in the article is akin to how cloudknox (whatever it's called now and there are others) handles permissions in the big three providers.
It ingests log data and permission/role assignments then reconciles them to then allow you to create custom roles that only have the permissions that someone actually uses.
The sector of tools is called CIEM. Cloud Infrastructure Entitlement Management.
Here's the thing though... PA and MS charge PER MANAGED RESOURCE. It's crazy. This is something that should be a core capability, but it's an added charge. It's a space that is screaming for open-source tooling to make it less rent-seek-ish.
I feel like all the innovation around wireguard is a reaction to this situation.
Cloud security, especially from AWS, is (as described elsewhere in the comments) byzantine, but I've always felt that the real underlying problem is generic advice that only serves to protect the backside of the cloud provider. What's really needed are crystal clear patterns that cover >95% of actual use cases.
The author mentions building this solution multiple times, but doesn't link to any resources as far as I can tell, so here's the one I've used in the past [1].
In one mode, it proxies all your app's API calls and tells you exactly what permissions you need.
> What is this obvious solution? You, an application developer, need to launch a new service. I give you a service account that lets you do almost everything inside of that account along with a viewer account for your user that lets you go into the web console and see everything.
This is exactly what AWS Organizations does, and it has been available for many years already. It now comes with IAM Identity Center, which gives you SSO into multiple AWS accounts. So the setup I use is one management account running IAM Identity Center with users and groups. Then each product gets one or a few AWS accounts that they own. Super simple and effective for small organizations. For larger orgs you would also need SCPs and maybe AWS Control Tower or similar.
Warts and all, AWS released each feature to address specific needs of their customers. Unfortunately, it is hard to build a complex product with simple, consistent conceptual models and orthogonal product features. I'm saying this because my current company has been trying to build its own IAM system outside of the cloud. Man, that is hard. So many options, so many scenarios, so many debates, and so much work. That perspective makes me appreciate how hard it is to build a flexible IAM system that supports multiple accounts, multiple roles, multiple services, multiple granularities, and multiple access models.
Dealing with IAM in the context of a cloud provider is honestly a dream scenario. Imagine being responsible for it in an enterprise with thousands of users and applications. Many of these apps were written before the concept of IAM even existed.
I believe folders (or groups or similar) are the right solution to this, just not in the way that Google is implementing it. Basically, you group resources into folders and then users have read or write access to that folder.
This way, your database guys can access one big folder with all the database stuff, your server guys can access server stuff and your frontend guys can deploy to an S3 bucket, but not much more.
This is the level of granularity you need in the real world. I don't believe that any organisation, no matter how big or sophisticated, has an employee who can have roles/datastore.backupsAdmin but not roles/datastore.backupSchedulesAdmin.
> i don't believe that any organisation, no matter how big or sophisticated has an employee that can have roles/datastore.backupsAdmin, but not roles/datastore.backupSchedulesAdmin.
Believe it. Based on past experience as CTO and head of Security Engineering at one of the biggest orgs, this split is used and necessary, unless you want to inject yet another approval loop somewhere.
The first one lets someone get, list, or delete the backups; the second lets someone make backups happen or not happen. I can absolutely see forcing regular backups to happen (a regulatory requirement) being a different person's job than using the backups, and even different from the admin who can delete those backups.
(Delete means the backup admin can make it as if a backup didn't happen by deleting, but that's not what the compliance regulation covers, it has to happen in the first place, which is what the scheduleadmin covers.)
We have enough advanced cryptographic schemes by this point where you can just encrypt most of the sensitive data stored in cloud environments at the client-level and render it useless to an attacker (barring considerations that they might hold on to the encrypted data until a weakness in the encryption scheme is discovered).
Store secrets directly with the client so attacks compromise the data of only one user, not all of them. What if they lose the key, or multiple groups need access? Shamir's secret sharing. What if we might not trust some of the k-of-n group members? Require interaction with a public ledger that provably logs secret access as part of the secret-sharing scheme. What about machine learning on a massive amount of user data? Well, homomorphic encryption isn't quite there yet, but how much sensitive info do you really need for your training data?
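The Shamir split mentioned above is compact enough to sketch. This is a toy over a fixed prime field, not production crypto (no constant-time arithmetic, no share authentication); it only shows the k-of-n mechanics:

```python
import secrets

P = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this

def split_secret(secret, k, n, p=P):
    """Split `secret` into n shares; any k of them reconstruct it."""
    # Random degree-(k-1) polynomial with the secret as the constant term.
    coeffs = [secret] + [secrets.randbelow(p) for _ in range(k - 1)]

    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod p
            acc = (acc * x + c) % p
        return acc

    return [(x, f(x)) for x in range(1, n + 1)]

def recover_secret(shares, p=P):
    """Lagrange interpolation at x = 0 over GF(p)."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % p
                den = (den * (xi - xj)) % p
        secret = (secret + yi * num * pow(den, -1, p)) % p
    return secret
```

Fewer than k shares reveal nothing about the secret, which is what makes the scheme usable for the "what if they lose the key" case.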
We’re not going to eliminate security flaws in systems without provably correct programs by default (which is probably never going to happen), and even with that, you have the whole social element of security, which means you still need to design the system in a way that limits the damage one or a few people can do. Which requires a different type of identity management than IAM.
Coming at this from a different angle: I've always found that many security issues are actually data-retention issues. The root cause is a system that is bleeding logs, context, and valuable traces; so all of these safeguards (specialized access, elevated access, bespoke roles, etc.) act not as remedies to the bleeding, but as tourniquets.
When data isn't being lost to the void, undo-ability grows. And having perfect undo-ability is genuinely "bulletproof" security. Security, as traditionally practiced, then becomes needless undo prevention, which is a lot simpler to tackle than disaster-averting prevention.
As a side note to all this: Can we please get rid of encoding policy logic in JSON or other proprietary policy languages? I just want Rego, man.
While we're at it, publish these IAM engines as public, open-source projects so that we can write proper unit tests. I don't want to call a policy simulator API.
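Until then, here's a flavour of what unit-testable policy evaluation could look like. This is a toy, not IAM's actual engine (no principals, conditions, or NotAction, and `fnmatch` wildcards only approximate IAM's `*`/`?`), but it shows the explicit-deny-wins shape you'd want to write assertions against:

```python
from fnmatch import fnmatchcase

def is_allowed(policy, action, resource):
    """Toy IAM-style evaluation: explicit Deny wins, else any matching Allow."""
    allowed = False
    for stmt in policy["Statement"]:
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        matches = any(fnmatchcase(action, a) for a in actions) and \
                  any(fnmatchcase(resource, r) for r in resources)
        if matches and stmt["Effect"] == "Deny":
            return False  # an explicit deny always overrides any allow
        if matches and stmt["Effect"] == "Allow":
            allowed = True
    return allowed  # no matching statement means implicit deny

# Hypothetical bucket policy: reads allowed, but private/ prefix denied outright.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:Get*",
         "Resource": "arn:aws:s3:::my-bucket/*"},
        {"Effect": "Deny", "Action": "s3:*",
         "Resource": "arn:aws:s3:::my-bucket/private/*"},
    ],
}
```

With the engine as a plain function, "can this principal still read that prefix after my change?" becomes an ordinary unit test instead of a round-trip to a policy simulator API.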