ISO any articles/documents on scaling access control. For example: you have 100,000 users, 90k of them have access to some resource and 10k do not, and you can't use groups that your customer knows about. The obvious solutions are "where allowed_user_ids = ... big list" or "where disallowed_user_ids NE ... small list"; the latter isn't a solution, as you can't optimize this query with a normal tree-like index.
I suppose you could use some sort of bloom filter, or create/maintain groups behind the scenes somehow, but haven't seen many articles cover this.
In most cases (even with many 100s of thousands or millions of users), there are far fewer _roles_. So you can generally answer the question of "which users can access this resource" by answering the questions "which roles can access this resource", and "which users have those roles". If you're using SQL to store roles and role assignments, your query for all users becomes:
select * from users
join user_roles on user_roles.user_id = users.id
where user_roles.role_id in (... small list ...)
This can get tricky if you have 100s of thousands or millions of _roles_, and each of those roles can be dynamically assigned access to a significant percentage of your resources. But that might suggest you're structuring roles incorrectly in the first place (and you should be using fewer roles with more users per role).
All that being said, I think there's definitely more to be written here. Keep a lookout -- we might do some more writing about the topic.
> This can get tricky if you have 100s of thousands or millions of _roles_
If you use something like a bitmap index (e.g. Roaring Bitmaps), you can easily manage roles in-line on each user row if the role membership is typically sparse.
You can still maintain a separate Roles table as a canonical reference, but you would no longer need to join on it to determine who has what.
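As a rough sketch of the inline idea in plain Postgres (no Roaring extension, just an int[] of role ids per user with a GIN index; the table and column names here are made up):

    ALTER TABLE users ADD COLUMN role_ids INT[] NOT NULL DEFAULT '{}';
    CREATE INDEX users_role_ids_idx ON users USING GIN (role_ids);
    -- all users holding any of the roles that grant access to a resource
    SELECT * FROM users WHERE role_ids && ARRAY[42, 97, 103];

A real Roaring bitmap would compress large sparse sets far better; this just shows the access pattern.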
Yes and no: there is no bitmap SQL type, but at least MSSQL packs multiple BIT-columns on the same table into the same word/byte and then uses bit tests for filtering. I suspect Oracle and Postgresql can do the same, but I don't know for sure.
Disclosure: I'm a cofounder of Authzed, a YC company building a permissions database[0] inspired by Google's Zanzibar paper[1].
This is actually a really hard problem and depends on the systems with which you are integrating. We call this problem "ACL filtering"[2] and there are two general strategies: pre and post filtering.
We have a blog post[3] describing our API for pre-filtering, which can stream results that you can then use to build a SQL query or data structures like bloom filters/bitmaps. We currently have a proposal on GitHub[4] for an extension to that strategy adding a denormalization/caching layer (which applies a strategy similar to the Leopard indexing system used internally at Google).
You might also be surprised at the performance you can achieve with post-filtering by building an iterator in your programming language of choice that will batch together permission checks and amortize the cost of filtering those results from the set of all results that you pull out of your database.
Additionally, if you're interested in deeper database-specific integrations, there are hooks into various components such as Postgres's Row Level Security, but that typically means eschewing a cloud service that operates your database for you (e.g. RDS) so that you can install your own plugins.
The key insight is to define policies as purely virtual; a policy never refers to real users or real resources. Instead, it refers to virtual users (via "roles") or virtual resources (via "tags"), which can then be associated with real users/resources without modifying the policy.
Interesting, this neatly divides authorization checks into three-ish questions: 1. What roles are assigned to the user?, 2. What tags are assigned to the resource?, 3. Do the answers to 1 and 2 overlap? Each of which could have a cached denormalized index and should compress well.
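A minimal relational sketch of those three questions, with illustrative table names (a policies table linking roles to tags; access = any overlap):

    -- 1. which roles does the user have?  2. which tags does the resource have?
    -- 3. do they overlap via some policy?
    SELECT EXISTS (
      SELECT 1
      FROM user_roles ur
      JOIN policies p ON p.role_id = ur.role_id
      JOIN resource_tags rt ON rt.tag_id = p.tag_id
      WHERE ur.user_id = $1 AND rt.resource_id = $2
    ) AS allowed;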
Yeah I've built this kind of stuff before, was just hoping for some interesting information on the data access side - when you can't use tags/roles. But I always push for this approach when I can.
I think I'm missing something. Are you reframing your original constraint, that you can't use groups that your customer knows about, as you can't use tags/roles? You can certainly use tags/roles that your customer doesn't know about. This is a common issue in something like healthcare data security.
Could be a lot of reasons, but what if you have millions of tags/roles, and a user has a large subset of those roles? This actually happened at one job...
What if you sell reporting on location-based data with access control, and customers want fine-grained control over who sees what? Then you also can't always use groups. Example: I'm a healthcare provider with 15k locations. Manager X is responsible for 5k locations, so when we create reports we filter all the data by those 5k locations first. Same with viewing any data in the product.
Just curious how others approached this. You could obviously do grouping somehow.
I have a similar use case but even worse. Rather than locations, users are given access directly to the consumer/patient, and each resource belongs to some consumer x. So for any access, not just reporting, we have to filter by a huge list of accessible consumers, which can be as small as 50 but as large as 100K+, which is a massive challenge to scale.
We can obviously still manage this with careful optimization, but I'm not sure this approach scales indefinitely, say to 1M+ consumers. I wonder if this can only be handled by reframing business requirements (like the locations you are using) or whether someone has a better design/ideas.
Yeah, same problem, but I don't think I ever saw more than 50k resources per user.
I think the best solution potentially is to have the system monitor the "whale" users and have a special case where you turn their list of resources into a single group/tag behind the scenes and add this single identifier to all the relevant objects. It'd be hell to get right and keep consistent but it'd be fast.
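Something like this, relationally (all names hypothetical, and keeping the synthetic group consistent as resources churn is the hard part alluded to above):

    -- one behind-the-scenes group per "whale" user
    CREATE TABLE synthetic_groups (group_id INT PRIMARY KEY, user_id INT NOT NULL);
    ALTER TABLE resources ADD COLUMN group_ids INT[] NOT NULL DEFAULT '{}';
    CREATE INDEX resources_group_ids_idx ON resources USING GIN (group_ids);
    -- the whale's huge IN-list collapses to one indexed containment test
    SELECT * FROM resources WHERE group_ids @> ARRAY[$1];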
It's hard to name things. There are consumers whose data the system records; by resource I meant the data of the consumer (the bulk of the system); and then obviously there are users.
Users are authorized per consumer (a set of consumers plus a set of roles/actions they can perform). We have been able to move the role aspect to the app layer very cheaply: user A is doing X, say running a report; we know role R is needed for it; we then ask the system to run the report for the consumers user A has access to, and an index on consumer across all resources in the system allows a fast query. But if the consumer list is too large, efficiency falls.
We can't make automatic groups at the system level because adding a group column at the resource level and updating it would be too expensive. And there's just no natural large grouping: one user might have access to consumers 1..100K, while another has a random 100K out of 1..1M.
Have separate admin-facing and user-facing RBAC. Don't try to force them under one roof. You have an organization, and a product. Both have different authorization needs (many products don't have any), even if they may use the same authentication mechanism.
I tend to keep admin RBAC simple and group-oriented. It's mostly for direct data access permissions (per table, etc), but can add others too. Different groups for different departments. Mainly for internal CRUD apps at first (e.g. Django Admin).
OTOH, user-facing RBAC is project-oriented and uses the virtual policy system described by Tailscale. The users themselves, in a project-manager role, can manage their own policies, memberships, and resource tags. I like to provide sane defaults and support for modifying policies should they ever need it. Importantly, one user can have membership in different projects with different roles.
You'd create a project with the 15K resources, add the customer to it in a manager role, and then they choose the policies, tags, and roles to make it work for their organization.
Ofc, admins can also be users. Simply create one or more projects for internal use.
Seconded! I can get to a million authentication-related articles and services, but "authorization theory" seems to be very hard to find. (It doesn't help that the big text box in the sky treats the two words as somewhat interchangeable.)
"This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use."
You could use hashed token claims to encode a role or other identifying info.
If you use a webhook you don't have to worry about this, since the user doesn't get to see what you forward as the authorization claims payload. But if you're using a JWT to hold state, then since JWTs are publicly viewable the payload has to be non-obvious and non-deterministic.
Maybe I'm not understanding. With a JWT you can verify the signature from when you originally signed that user's JWT. So even if I change some data in my JWT, on the backend it will not match the signature. So just reject it if it doesn't match, say when someone changes permission:read to permission:admin.
Roles are an approach towards query optimization, but the question placed a constraint on the user not knowing about them. Encoding the role information so it is not visible to the user is a potential solution.
By not knowing about them I mean that the customer/company never defines the roles as part of onboarding, etc. In this case they define the restrictions per-user.
> The obvious solutions are "where allowed_user_ids = ... big list" or "where disallowed_user_ids NE ... small list"; the latter isn't a solution, as you can't optimize this query with a normal tree-like index.
This is not exactly an answer, but it's worth remembering that the lower the selectivity of a condition, the more likely it is that brute force and ignorance will work out better: load data in batches and filter out the disallowed rows with a secondary positive index lookup (assuming you've pre-materialised your 'is disallowed' list). For a lot of cases that could easily beat trying to be more clever (see the sketch below).
The more general question is of course rather more complicated and interesting, but other people in other replies already have better answers there than I do.
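For the original 90k-allowed/10k-disallowed shape, a sketch of that (Postgres, with a pre-materialised disallowed table; names made up):

    CREATE TABLE disallowed_users (user_id INT PRIMARY KEY);
    -- scan the big set in batches; each row costs one cheap index probe
    -- against the small disallowed list
    SELECT u.*
    FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM disallowed_users d WHERE d.user_id = u.id);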
Sounds to me like you could have a huge table of authorizations and compile it silently into a list of roles whenever the contents of that table are modified. This recompilation step may be comparatively slow, but it needs to be executed quite seldom, and the roles/groups you end up with should be small in number and quite easy to index.
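A hedged sketch of that recompilation in Postgres (hypothetical authorizations table): bucket users by identical access sets, and each distinct set becomes a behind-the-scenes role.

    WITH user_sets AS (
      SELECT user_id,
             array_agg(resource_id ORDER BY resource_id) AS resource_set
      FROM authorizations
      GROUP BY user_id
    )
    -- every distinct resource_set is a candidate synthetic role
    SELECT resource_set, array_agg(user_id) AS members
    FROM user_sets
    GROUP BY resource_set;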
Sometimes it's customer enforced. Just based on various places I've worked.
One place for example sold location-based software, so users could have access to say 10k out of 100k locations, but those locations might not always be next to each other. Data would be stored with a location id, and to do things like "show the average star rating for the locations this user has access to", we'd do "aggregate star rating where location id IN [....]"
Oh okay, so you can store authorization data in a table mapping user_ids to location_ids, right? Then instead of interpolating `IN (big_list)` into every query (which ships a huge literal list each time and tends to optimize poorly at that size) you join on the mapping table (which can use indexes). Something like
CREATE TABLE locations (
location_id INT PRIMARY KEY,
rating INT
);
CREATE TABLE authorized_locations (
user_id INT REFERENCES users (user_id),
location_id INT REFERENCES locations (location_id)
);
CREATE INDEX authorized_locations_user_id_idx ON authorized_locations (user_id);
CREATE INDEX authorized_locations_location_id_idx ON authorized_locations (location_id);
SELECT AVG(rating) FROM locations JOIN authorized_locations USING (location_id) WHERE user_id = $1;
However, let's say you have an architecture where you have 100s of these "product" tables, since every piece of the app restricts access by location. So not only do you have Product but maybe you have FormResult{LocationID}, Feedback{LocationID} etc.
So for a service to find out what a user has access to, first they call the User service to get a list of locations, and then they pass the big list of ids to their own DB, whatever that DB is.
I guess this is one case where a monolith and good ol' SQL would win. :)
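And if you do end up shipping the list across services, passing it as one array parameter instead of a giant interpolated literal at least keeps the statement stable (Postgres syntax; a workaround, not a fix):

    -- one bind parameter carrying the whole id list
    SELECT AVG(rating) FROM locations WHERE location_id = ANY($1::int[]);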
> Oso’s Django and SQLAlchemy integrations turn partials from Polar into database queries... The SQL they produce relies heavily on nested subqueries
Live by ORM, die by ORM. This strikes me as particularly bad because these authorization queries may be running on every request. It's great to see Oso went direct to SQL to address this. And the asides about logical programming were fun as well.
Building bad queries with an ORM isn't necessarily the ORM engine's fault. It's really just a tool and, as such, its value really relies on one's proficiency with it.
But you can build any query with Sqlalchemy - even when using the ORM rather than the Core, you still have that control, right?
Don’t get me wrong, I’ve read stuff by these guys and I know they know what they’re talking about. Certainly, if you have to make it work for Django as well you’d be considering going straight to sql.
As an aside, I walked through their stuff a little while back and I'm very interested in what they're building.
agree this seems like the worst of both worlds between doing things application-side and doing things database-side. imo you're much better off embracing your db's security features
Well, most ORMs can't take advantage of CTEs, group concat, array types, JSON types, and whatnot when building queries from code, for obvious reasons. So they often can't return multi-dimensional data in a single query.
But ORMs are still useful to get an application off the ground fast.
do ORMs still target lowest common denominator ansi sql? I only know that EF doesn't - it has "providers" for each rdbms it supports and will liberally emit CTEs and db specific language features like "outer apply" or "join lateral" but then stupidly ignore powerful features like "for json" or "json_agg/json_build_object" which could eliminate the outrageously naive way it encodes 1:m and m:m results - it boggles how smart and stupid orms are at the same time.
fwiw, PostgreSQL has a built-in mechanism for filtering rows based on authorization rules: row-level security [0].
This can simplify your data-access layers quite a lot and pushes you towards better security practices like limiting the scope of permissions granted to your applications' role.
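A minimal sketch, reusing the authorized_locations mapping from upthread (the policy and setting names are arbitrary):

    ALTER TABLE locations ENABLE ROW LEVEL SECURITY;
    CREATE POLICY user_locations ON locations
      USING (EXISTS (
        SELECT 1 FROM authorized_locations al
        WHERE al.location_id = locations.location_id
          AND al.user_id = current_setting('app.user_id')::int
      ));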
If you like Polar but can't use it for whatever reason, RLS does a lot of what Polar does.
It doesn't, you can use any variable available. So for example in postgrest you get all variables in a JWT exposed as GUC (Grand Unified Configuration) variables and you can use that in your RLS checks.
The most common way is to create a role though, yeah. I find that creating a role is not a downside, since it is generally just a row in a table that postgresql's permission system also uses.
How does this work in terms of Postgres connections/pools? My understanding of this is that you'd do something like `SET SESSION user_id TO 123;` which sets the variable `user_id` for that connection. Subsequent queries could use that variable to do row-based authorization but they need to use the same connection. Is this how Postgrest does it? (guaranteeing the same connection for the lifecycle of a request that is)
IIRC postgrest starts a transaction for each request, and inside of that sets the variables. I've done some testing to make sure that variables do not leak between requests.
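Roughly this shape per request ('app.user_id' is an arbitrary custom GUC; SET LOCAL reverts at COMMIT, so nothing leaks back into the pool):

    BEGIN;
    SET LOCAL app.user_id = '123';      -- visible only inside this transaction
    SELECT avg(rating) FROM locations;  -- RLS policies read current_setting('app.user_id')
    COMMIT;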
The performance hit can be quite big. I had a query that went from 1s to 400ms by disabling RLS. The security policy was a simple where org = abc. When I encounter performance hits like these, I refactor the slow query into a function with SECURITY DEFINER and a huge warning that you're on your own regarding security. Besides that, it's nice not having to worry if your SQL is accessing stuff the current role isn't allowed to see.
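The escape hatch looks roughly like this (hypothetical names; a SECURITY DEFINER function runs with its owner's privileges, and if that owner is the table's owner, RLS doesn't apply by default):

    -- WARNING: bypasses RLS; you are on your own regarding security
    CREATE FUNCTION avg_rating_for_org(p_org INT) RETURNS NUMERIC
    LANGUAGE sql SECURITY DEFINER AS $$
      SELECT AVG(rating) FROM locations WHERE org = p_org;
    $$;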
I do and the answer is always it depends. I'm not being glib! There are a lot of variables at play that will affect performances.
It moves computation closer to the data, and in aggregate that generally offsets most increases in query time.
If you design your schemas carefully the performance cost is easy to swallow. As always analyze your queries under different table sizes and see what works for you.
The benefit is that your application code doesn't have to use any complex RBAC->SQL compilation. You can just `select foo, bar, baz from mytable;` and RLS will take care of making sure your application servers never see the data that the user doesn't have access to.
In my experience the performance hit is not much larger than having the permission check part of your SELECT/UPDATE/DELETE query, but I don't have hard numbers.
What happens when you have sweeping changes to existing policies? It seems like you have to chase down every other line of DSL and fix policies individually.
Are there any policy-language-libraries-backed-by-sql like Polar but that aren't based on logic programming languages? I don't really want to learn logic programming for this purpose nor do I want to require it on my coworkers.
I guess I'm just looking for a library + SQL shorthand that can easily interpolate request variables and session variables that gets declared in code where a route is declared. Just spitballing but something like `(blogs.id = $req.blogid).userid = $session.userid OR (users.id = $session.userid).isAdmin`.
This [0] is close but it doesn't have enough momentum to be well documented let alone usable as a library in every language you'd want (Go, C#, Python, Node.js, etc.).
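FWIW, the spitballed expression above would presumably compile down to something like this (hand-expanded, $1 = request blog id, $2 = session user id):

    SELECT EXISTS (SELECT 1 FROM blogs WHERE id = $1 AND userid = $2)
        OR EXISTS (SELECT 1 FROM users WHERE id = $2 AND isAdmin) AS allowed;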
Good pattern! At least a library that makes sense. Thanks.
One point about production usage: you may need to adapt this to a NoSQL backend, as we want to query authorization logic against a caching layer for performance reasons.
Well, you proposed converting SQL architecture to NoSQL with the goal of adding caching. So how would you add caching in the NoSQL case? What prevents you from doing the same with SQL?