Why Google Zanzibar shines at building authorization

AndreasHae · on June 25, 2023

We successfully used a Relationship-based Authorization System based on the Zanzibar paper at my last job building a B2B SaaS leaning heavily on cross-company integration.

The flexibility in defining rules through tuples helped us iterate rapidly on new product features. We used self-hosted Ory Keto [0] instances as the implementation, though we would have preferred a managed solution. We were checking out Auth0 Fine Grained Authorization [1] but it was still in Alpha back then.

[0]: https://www.ory.sh/keto/ [1]: https://auth0.com/developers/lab/fine-grained-authorization

kkajla · on June 25, 2023

> though we would have preferred a managed solution

We completely agree here, which is why we initially started out with our managed cloud offering, Warrant Cloud[1]. While Zanzibar is powerful, operating it with solid latency/availability can be quite challenging.

[1] https://warrant.dev/

free652 · on June 26, 2023

So how do you manage filtering of a billion records?

comboy · on June 25, 2023

Can anybody explain me why there seem to be much focus on scalability in this context? I mean we have 8 billion people. If the whole planet registers, home PC can handle it, plus it partitions beautifully if necessary in case of authentication. So what am I missing?

smarx007 · on June 25, 2023

Forget about 8B people in this context. If you have 1000 microservices in the company and each has 100 rps, you are looking at ca. 100k rps to a Zanzibar-style system to authorize every request (not to authenticate a user).

dietr1ch · on June 26, 2023

Why does it need to be checked on a per-request level?

I'd expect you to be able to give short-lived capability tokens to clients that each machine can verify down the stack without making new rpcs. This would avoid the fan-out of all the internal services.

Is it just to prevent abuse?

akajla · on June 26, 2023

You can encode capabilities/permissions as scopes in distributed tokens (e.g. OAuth) but this can start to break down if you have very granular, fine-grained permissions (e.g. user:1 has 'editor' access to 1000s of documents/objects). This is similar to the problem that Carta ran into while building out their permissions[1].

In addition, yes - validating permissions on each request makes it so that you can revoke privilege(s) with immediate effect without needing a token to be invalidated.

[1] https://medium.com/building-carta/authz-cartas-highly-scalab...

smarx007 · on June 26, 2023

I think it's best to refer to the Zanzibar paper: https://www.usenix.org/system/files/atc19-pang.pdf

zcw100 · on June 26, 2023

... or the annotated one from the Authzed folks https://authzed.com/zanzibar

smarx007 · on June 26, 2023

Wow, I am also impressed by the tech behind! https://github.com/authzed/zanzibar-annotated

dietr1ch · on June 26, 2023

That's neat! All papers should allow this discussion on them.

BTW, didn't Google released something like it too early?

AlphaSite · on June 26, 2023

Does the token identify every resource you have access to? I think is for multi tenant applications with fine grained access control.

paulddraper · on June 26, 2023

"Oh, you just [insert complex solution here]"

You need one capability token per principal and resource and perhaps access right.

mort96 · on June 26, 2023

This isn't meant to invalidate what you're saying, but this whole thread reads like a parody to me. 1000 services all making requests to Zanzibar, and this oreo keto thing.

Makes me think of this: https://m.youtube.com/watch?v=y8OnoxKotPQ

sporkland · on June 27, 2023

You could apply to S3, RDS, BigTable, Spanner, Firestore, etc. I feel like engineer orthodoxy (monoliths vs microservices, every monolith I've seen accesses a remotely DB and every microservice tends to access a DB which are themselves monoliths), no "god" services tend to break down for a lot of these important high scale stateful facilities.

smarx007 · on June 26, 2023

Pure gold.

comboy · on June 25, 2023

Thanks.

arekkas · on June 26, 2023

Glad to see that you used Ory Keto! :)

Ory does have a managed service offering now for Ory Keto as well!

smarx007 · on June 25, 2023

Very happy to see industry discover the power of graphs and especially, a triple-based representation (cf. RDF [0]; subjects are “subjects”, relationships are “predicates”, and objects are “objects”).

Now, a genuine question: why try to shoehorn a freeform graph (because the list of relationships is not hardcoded) into a relational DB instead of using a graph DBMS like Neo4j, Apache Jena (Fuseki) etc. From looking at the source code briefly [1], I didn’t see any extreme SQL optimizations. This indicates to me that Warrant would either support a very limited set of query types, or be very slow on quite a few types of them. Also see “billion triple challenge” in the academia around this.

Good luck with your startup!

[0]: https://www.w3.org/TR/rdf11-primer/

[1]: https://github.com/warrant-dev/warrant/tree/main/pkg/authz/o...

zdragnar · on June 25, 2023

Is Neo4j a good option? I've not heard great things about it performance-wise, though this was some years ago when tinkerpop/gremlin was starting to make news in my circles, and we were operating on extremely dense graphs.

WirelessGigabit · on June 26, 2023

I have experience with Neo4j as a consumer of the database, but as part of a project where someone else wrote the queries.

I hate it. It's extremely expensive. It's slow. Very slow. It only recently had multiple databases per instance. It doesn't support per database encryption. Did I mention it's slow?

We also looked at the ongdb effort, but that went offline all of the sudden due to licensing issues. Now it's back but they reset (?) the version number. Confusing. Also, that one is built in version 3-ish. So no multi-db. While you can spin up multiple instance (it's free?, it's still Java, i.e. slow and eats memory.

smarx007 · on June 26, 2023

The only thing I like in Neo4j is Cypher – it's powerful and intuitive. I don't use Neo4j because of two reasons:

1) It has no support for subgraph queries. In other words, you can't run a query on a graph and have the query result be a graph too. Instead, you will get a tabular result set. In SPARQL-based systems, you can run a 'CONSTRUCT' query. Very useful if you want to process the results by other parts of the code that also expect a graph (composability). See [1] and [2] if you want to take SPARQL for a spin.

2) It has no support for a standard graph data format. Their blog had some posts about using CSVs but they are a tabular data format, which means that some acrobatics are needed to extract a graph from CSV (actually, two CSVs) and none of this would be standard. Also some attempts to fit a graph peg into a tree-shaped hole (JSON, XML). To my knowledge, RDF is the only widely used standard to actually represent graphs. Unfortunately, there is a lot of confusion around RDF because (a) RDF is actually just a model and there are multiple file formats – I recommend Turtle, and (b) RDF has a semantic web heritage – forget semantic web and just use a graph data format.

But I know that industry is most familiar with Neo4j, that's why I mentioned it. To my knowledge, Stardog is one of the most advanced and performant systems (with on-prem deployment) but is very expensive. Amazon Neptune and Azure Cosmos are cloud-only, which is a hindrance for many projects. Bottom line is that graph DBMSs have a long way to go and more interest from the community is needed to motivate more dev effort.

P.S. For dense graphs, a graph DBMS may not be the best solution. Graph DBMSs also lose their appeal if your queries are not traversal-heavy.

[1]: https://data.nobelprize.org/sparql

[2]: https://query.wikidata.org/

jnsaff2 · on June 26, 2023

I have recently gone through at 3.x to 4.x upgrade as an ops person for a ~1TB database.

My takeaway: https://www.youtube.com/watch?v=JNC1CpJQxzg

The engineering quality along with documentation left a pretty bad taste.

Tho sometimes some aspects being really primitive were helpful for getting out of trouble.

zcw100 · on June 26, 2023

Graph databases excel when you need maximum flexibility and this is when the shape of the graph is constantly changing by adding new, novel, and unexpected data. As soon as you put it behind an application it becomes static and you end up paying a huge price for flexibility that will not be needed or used.

esafak · on June 26, 2023

Same reason why vector databases are a thing. When performance matters, you go purpose built.

susscrofa · on June 25, 2023

The Zanzibar paper has a section on the consistency model, which says that the race conditions outlined are solved by respecting update order. It then solves it by using Spanner as underlying storage (which is kind of lazy).

How does Warrant deal with consistency?

kkajla · on June 25, 2023

You've highlighted a very important part of the paper. A lot of the external consistency guarantees provided by Zanzibar are facilitated by Spanner and its TrueTime mechanism. Warrant doesn't currently support/use Spanner. However, for the databases we do support (MySQL and Postgres - which are both ACID compliant), we've implemented the zookie protocol using the incrementing transaction ids they provide. This approach works for single writer deployments of these databases, so know that write throughput and overall availability will be lower. We started with this approach because most teams still use MySQL/Postgres. Warrant is built to support running on different types of databases, so we will be working on support for Spanner and other multi-writer distributed databases like Cockroach and Yugabyte in the future. I hope that helps.

aseipp · on June 25, 2023

The fact they did it that way is actually a perfect example of why Google is considered so far ahead of competitors technologically and operationally by their engineers. When you have a powerful building block like Spanner that engineers can use, they then can work on the product instead of wasting time on brittle consistency models, custom storage layers, and providing their own uptime guarantees.

This goes for every part of their stack. As a result, things like Colossus, BigTable, and Spanner effectively act like force multipliers for their engineers, because they provide the guarantees they can't get elsewhere. The fact other people at other random companies can't do that? Not their problem in the slightest, actually.

skybrian · on June 25, 2023

It's been many years, but a downside back when I worked there was infrastructure churn. Migrating off deprecated infrastructure meant you had to do a lot of work just to stay where you are. Mostly unstaffed products (like Google Reader, say) were at risk of going under due to technical debt.

When App Engine launched, that was great for me because I could write internal tools that were mostly off the treadmill. Unless you used one of App Engine's less-used API's (which themselves eventually got deprecated), your more obscure team-specific services could keep running.

So, lots of great technology is not necessarily great for productivity. I don't know what's happened since. I expected that launching Cloud would result in more mature infrastructure because external customers won't tolerate churn as much. I guess it's sort of true?

hn_go_brrrrr · on June 26, 2023

They updated the churn policy to require infrastructure teams to migrate their users, not just dump the work on them. That greatly eased the unfunded mandate load on product teams. That hasn't stopped infrastructure teams from making sweeping changes, though. There's one in particular happening now that's enormous -- to riff on the "changing the engines midflight" analogy, it's replacing the fuselage without anyone noticing.

danpalmer · on June 26, 2023

Which one is that? Feel free to just hint or PM me.

javier2 · on June 26, 2023

I worked for a place that didnt value solid engineering in this way, and our systems where always so janky and half broken. I dont expect building Spanner, but being able to migrate certain huge tables at all would be nice. But no, they always balked at spending time building engineering tools. Even a solid job execution system would be cool, but no, janky it is.

iainmerrick · on June 26, 2023

I think that view might be a bit out of date.

Most of that stuff is available to external users in Google Cloud; so why isn’t Google Cloud more popular? I don’t have hard numbers handy, but it seems to me that GCP is behind both AWS and Azure in terms of dev mindshare.

GCP has plenty of great tools, but it can also be quite awkward to use, and it’s lacking some useful stuff like lightweight edge functions.

re-thc · on June 26, 2023

> but it seems to me that GCP is behind both AWS and Azure in terms of dev mindshare

AWS has 1st mover advantage and the biggest ecosystem.

Azure has Microsoft behind it = everyone that used Office, Sharepoint, SQL Server, C#, etc that wanted to move to the cloud.

Google doesn't have such a backing. Oracle cloud (huge growth) might be stronger in that sense.

iainmerrick · on June 26, 2023

I agree, but it’s surprising that things are that way, especially as they’re not short of cash. If they have this big tech edge over their competitors, where’s the benefit?

re-thc · on June 26, 2023

> If they have this big tech edge over their competitors, where’s the benefit?

Most of it is behind huge paywalls compared to their competitors. There's e.g. no serverless version of spanner that is pay-per-usage. Same with Big Table. Even the newly released AlloyDB has a huge starting cost.

If no 1 knows and no 1 can feel for the advantage it doesn't really help.

kevincox · on June 26, 2023

A lot of stuff is proprietary so adopting it as a third party can be much riskier. While Google Cloud does have a pretty solid depreciation policy it still means that you are locked to Google Cloud and whatever their future is. At least internal users can escalate if there are serious issues.

kjgkjhfkjf · on June 25, 2023

Ironically, by the time Spanner became generally available, Google had largely lost their appetite for launching new products.

noah_buddy · on June 25, 2023

Why is it lazy? Seems like leveraging a tool Google built for distributed systems specifically for consistency guarantees.

sunk1st · on June 25, 2023

As I understood it from context, the word lazy was being used to complain that the reference to Spanner wasn’t in-lined.

susscrofa · on June 26, 2023

Right. As someone who's not a systems guru, I would love some insight if/how the consistency guarantees can be achieved using common distributed database approaches.

sulam · on June 25, 2023

I was super curious to hear how Zanzibar is being used to control access to buildings. Don’t be fooled, this is not that post. ;)

amelius · on June 25, 2023

I was super curious to hear why the Zanzibar office of Google was somehow better at authorization than e.g. their Mountain View office.

jnwatson · on June 25, 2023

At least one Google office I know has building access controlled by Zanzibar.

ryanjshaw · on June 26, 2023

I was surprised that Google even had offices in Zanzibar.

scotty79 · on June 26, 2023

It's such a weird name. It's like naming a product Google California.

kkajla · on June 25, 2023

Apologies for the confusion! Maybe an interesting idea for us to explore next :P

ants_everywhere · on June 25, 2023

I wouldn't be surprised if it is somewhere. Just have the smart card system use Cloud IAM for access control decisions.

hn_go_brrrrr · on June 26, 2023

That sounds like a nightmare. A cloud outage means no one can badge in anywhere, including at DCs?

Xymist · on June 26, 2023

Isn't that exactly what happened to Facebook a little while ago?

alxcb · on June 25, 2023

hahah, exactly, tought the same.

seymon · on June 25, 2023

One thing I find difficult using access control systems as a distributed service like Zanzibar is a convenient and performant way to search and filter resource data using permissions. For example defining database queries that should only return resources a subject has access to based on Zanzibar permissions.

kkajla · on June 25, 2023

At Google, I believe some client applications build and maintain "permission-aware" search indexes based on the permissions in Zanzibar. In essence, Zanzibar can be queried to figure out the object ids a particular subject has access to. These object ids can then be hydrated via a database query or separate service call.

At Warrant, we're experimenting with allowing customers to maintain searchable metadata in Warrant and exposing a "query" API[1] that can automatically hydrate objects based on that metadata.

[1] https://docs.warrant.dev/warrants/query-warrants/

rektide · on June 25, 2023

SpiceDB/Authzed has "Lookup"s. There's LookupResources for finding what a user has access to, and LookupSubjects now too to see who has access to a resource. Great capability. https://authzed.com/blog/lookup-subjects

h1fra · on June 26, 2023

Same reason I didn't invest much in this field yet.

When you access one resource it's fine to a do a roundtrip, but with listing, filtering, searching if you don't join at query time it doesn't work. I'm not entirely sure how they achieve it and I found it annoying that it's never mentioned because it's very common.

scarmig · on June 25, 2023

It's exciting to see so much action in this space.

How would you compare Warrant to other Zanzibar (ZaaS?) offerings? Particularly Ory and Authzed/SpiceDb.

zedadex · on June 26, 2023

> Over the last couple years, authorization (AKA “authz”) has become a hot topic of debate. Proponents of various authz frameworks, libraries, and philosophies have voiced their opinions on how it should be implemented, jockeying for position to become the de facto way to implement authz

As a developer of a tiny internal webapp - this is fascinating to read! I like to keep things as simple as possible, but as with anything our scope and use cases have grown over time.

Our authzn can handle some of this stuff - our rules, built atop our org's existing IAM, are very similar to these directed relationship tuples - but as we need to grow that out any further I'm excited to look into which aspects of ReBaC we're still missing.

Thanks for the link!

AceJohnny2 · on June 25, 2023

Plugging another company that's been implementing Google-Zanzibar-like auth tech: https://authzed.com/

They've posted a number of interesting articles on the topic here, such as this one listing competing implementations (but 2y old): https://authzed.com/blog/zanzibar-implementations

gneray · on June 25, 2023

Is anyone here using Warrant or other Zanzibar-like services?

If so, how did you evaluate them relative to each other and/or building yourselves?

sails · on June 25, 2023

> A Flexible, Uniform Data Model for Authorization

Are there good examples of similar applications of data modules for similarly niche use cases? I get that there are obviously endless data models, but this seems to extend beyond that into a more integrated concept and I don’t quite know why that seems to be the case.

kkajla · on June 25, 2023

I think GraphQL might be a good example. Some might not consider it to be very niche, but its intention is to consolidate dependent API queries such that the client can fetch all the data it needs in a single request. In both Zanzibar and GraphQL, the idea behind the schema/modeling language is to provide a layer where logic specific to relationships between data (in the case of GraphQL) or logic specific to authorization (in the case of Zanzibar) can be specified such that neither the server nor the client need to worry about it and can instead query for data in a simpler way.

sails · on June 26, 2023

Thanks, I suppose GraphQL is a valid example, I was thinking niche in terms of application eg Zanzibar is auth only, and you would struggle to use it for much else

deanCommie · on June 25, 2023

Can anyone who's taken a close look at both Zanzibar and Amazon's IAM compare and contrast them?

joshuanapoli · on June 26, 2023

Zanzibar evaluates recursively, where AWS IAM is single-pass.

So AWS groups do not nest, but Zanzibar groups do. In Zanzibar a relation on an object (an implicit set of users) can be the subject of a rule; you can define “users who have editor permission on an object also have viewer permission” in one rule. This isn’t possible in AWS; there is no way to reference the set of principals who are allowed a particular action or policy.

I think that AWS policies tend to have a lot of duplicate rules because of lacking recursion. Zanzibar rules should be easier to maintain and audit.

AWS IAM is also just quite “hairy” from gradual evolution over the years. On the other hand, Zanzibar has a clean model.

It would be nice to have a compiler that would emit AWS IAM Policies given Zanzibar-style rule tuples.

TheNewsIsHere · on June 26, 2023

These services aren’t really going after the same problems.

Zanzibar is a Google-internal implementation of the concepts outlined in this paper, focused on managing authorization as a function of relationships between objects.

AWS IAM is primarily for AAA services with AWS, though you can use it with AWS Identity Center to provide SSO to other systems via IAM.

deanCommie · on June 26, 2023

Oh I assumed Zanzibar would have also been the model for GCP's public IAM?

TheNewsIsHere · on June 26, 2023

I don’t know one way or the other. OP claimed elsewhere in this threat that Zanzibar is used to manage authorization records for services like Google Drive and YouTube.

But as far as Zanzibar itself, it’s not something Google makes available externally.

Having played in all the major (and common) sandboxes (so not like, Oracle), the GCP, Azure, and AWS permission systems are all fairly similar. They each have their foibles but their conceptual designs are all fairly similar. But that’s not a criticism: anyone designing that kind of IAM service really isn’t going to end up with something that different given the goals involved.

torgard · on June 26, 2023

Zanzibar is their internal system, and they've released a paper describing it.

https://storage.googleapis.com/pub-tools-public-publication-...

TheNewsIsHere · on June 26, 2023

Did you mean to reply to a higher level comment? :)

say_it_as_it_is · on June 26, 2023

Yet another Zanzibar system emerges. There's no functional advantage to using this system over any of the others, and the others aren't necessary either for the vast majority of needs.

Zanzibar is overkill for the majority of needs and introduces far too much complexity. It is the solution that covers scenarios of the likes in which you will never see. You will never grow into needing them, either. It is the pinnacle of over engineered software. The reason why people form companies offering it as a solution is to try to recover hundreds of hours of effort cost on something they didn't need.

esafak · on June 26, 2023

The services hide all that complexity; that's the value. What authorization solution do you prefer, for when one is needed?

say_it_as_it_is · on June 28, 2023

One far less complicated that doesn't involve zanzibar

fiddlerwoaroof · on June 25, 2023

How does Zanzibar relate to capability-based schemes?

kkajla · on June 25, 2023

Another interesting feature of capability-based systems (that is outside Zanzibar's scope) is that capabilities can themselves be used to gain access to an object. This is because they are unforgeable tokens, meaning they essentially have authentication baked into them. Zanzibar leaves the authentication piece to an external service and focuses on providing the ability to define, store, and query access rights for subjects.

kkajla · on June 25, 2023

As I understand it, "capabilities" in capability-based schemes uniquely reference an object and specify a list of access rights on that object. This seems fairly similar to tuples in Zanzibar, which reference a unique object, an access right, and a unique subject whom the access right belongs to. You can think of Zanzibar as a layer used for defining, storing, and querying for capabilities.

hamburglar · on June 25, 2023

Yeah, I believe capability/verb simply maps directly to relation in Zanzibar speak. “Can edit” vs “is an editor”. I’m more accustomed to the verb style, so whenever I read about authz systems that use relations or roles, I’m constantly mapping the concepts in my head to try to find examples where they aren't 1:1 and have yet to think of any.

leetrout · on June 25, 2023

Have you seen authzed's caveats?

https://authzed.com/blog/caveats

simongray · on June 26, 2023

Add this to the list of blog posts praising RDF without anyone realising it.

And the additional element added to the tuple is reminiscent of quads, also in heavy use in RDF implementations or similar graph databases.

rantingdemon · on June 26, 2023

Fascetiously, for internal use cases, just use AD

s09dfhks · on June 25, 2023

[flagged]

kkajla · on June 25, 2023

Google doesn't actually offer Zanzibar as a product/service (in GCP or otherwise) to customers. However, they do use it internally to manage permissions across their various products (Google Docs, Drive, YouTube, etc.) and have had a lot of success doing so. Because of that, there are many open source implementations of Zanzibar out there (as others have commented). Warrant also maintains our own open source implementation of Zanzibar[1] which powers our managed cloud offering, Warrant Cloud[2].

[1] https://github.com/warrant-dev/warrant

[2] https://warrant.dev

jsnell · on June 25, 2023

If you genuinely aren't trying to be snarky, then what was the point of this post? You clearly don't even know what Zanzibar is. If you were trying to find out more about it there would be hundreds of more useful questions to ask.

ec109685 · on June 25, 2023

Yeah, it’s really frustrating when Google EOL’s a paper they published.

rzzzt · on June 25, 2023

The actual implementation is closed source. The backing idea is described in the inaugural paper which third parties have used to build hosted and open source alternatives that you can try.

sowbug · on June 25, 2023

This tired trope has jumped the shark with this comment. Not only is it off topic, but even if it did refer to an actual Google product, it's disingenuous because its only point was to be snarky.

lopkeny12ko · on June 26, 2023

It's not disingenuous. Unless you've been living under a rock, Google has a very reliable track record of killing projects [0]. It's a very reasonable question.

[0] https://killedbygoogle.com/

kweingar · on June 26, 2023

I'm confused as to why anyone outside Google would care about the possibility of Google deprecating Zanzibar. It's an internal-only project that handles authorization across Google services. It's totally invisible to all users and customers. I can't think of a reason why anybody would be affected by Google switching to a new global authorization system.

What are you concerned about here: a potential impact on users, or the internal impact for Google engineers that could come from switching to a new system?

esafak · on June 26, 2023

Zanzibar is a system described in a public paper, not a product; it can't be killed. Indeed, it has been implemented as a service by numerous other companies.