The distinguishing feature I see compared to other systems is the ACL ordering and consistency, which is indeed difficult to do at scale. Looks like Spanner is doing most of the heavy lifting, great use case for the database.
Well, even more broadly, it's how generalizable it is while still providing ordering guarantees (though not necessarily perfect ones... see my long sibling post).
Using Windows-style ACEs for ACLs is also perfectly scalable and consistent (and more performant), so long as users don't end up in too many groups and objects only inherit ACLs from objects on the same shard. It's just nowhere near as generalizable as Zanzibar, which allows much more complex dependencies.
There's always tradeoffs! But this is the best system I've seen for general ACL evaluation against objects that haven't been updated recently.
I’ve been part of similarly generalized ACL systems and it’s pretty straightforward and very similar to Zanzibar. Though we didn’t need n ACLs and could assume the list wasn’t too long, so we didn’t need a tree. If we had, I believe we’d have ended up in a similar place as Zanzibar; there are a limited number of ways to solve that problem.
“Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use”
"This caching, along with aggressive pooling of read requests, allows Zanzibar to issue only 20 million read RPCs per second to Spanner." ("Only")
I'm surprised by all the numbers they give out: latency, regions, operation counts, even servers. The typical Google paper omits numbers on the Y axis of its most interesting graphs. Or it says "more than a billion", which makes people think "2B", when the actual number might be closer to 10B or even higher.
If your conclusion is "throwing servers at problems" after years of reading papers about Google infrastructure, you are probably not an infrastructure person.
A more serious conclusion would be that all this infrastructure enables application devs and researchers alike to "throw servers at problems". And this work is exactly the opposite: the teams spent years, sometimes even decades, sweating the details to figure out the most effective and efficient ways of utilizing those servers.
It's not servers. It's who will have the best submarine cables. A game in which Apple is not participating btw. Not even with an Elon style moon, err, low earth orbit shot.
It doesn't matter if iCloud is slow or not, since most of the interactions with it are in the background. And all of its content, e.g. the App Store or Apple Music, is cached by CDNs which are hosted in pretty much every country.
And there's much new fun to be had when you have this many servers around. For example - once you start shuffling around tens of petabytes a day you quickly notice that bit flips are very real. Computers do what we tell them to do with incredibly high probability, but it is always below 1.
> There's also a story behind that project name. That is not the original project name. The original project name was force-removed by my SVP. Once my hands are free again, I can explain
> Zanzibar was not the original name of the system. It was originally called "Spice". I have read Dune more times than I can count and an access control system is designed so that people can safely share things, so the project motto was "the shares must flow"
It sounds pretty appropriate. It's another SF reference: "Stand on Zanzibar", one of the earliest inspirations for cyberpunk, in which we are routinely reminded that all of humanity could stand on the small island of Zanzibar, but that growth is problematic.
What do other large (non-Google-scale) to medium companies use for authorization? Can anyone recommend open-source (preferably) or closed-source products?
https://github.com/ory/ladon is an option. Essentially, it imposes a lot of the fine-grained access control model on you, but then it's up to you to implement the actual database/business-logic layer [1] as well as the API layer to actually expose the service.
We use LDAP for managing group memberships (i.e. person x is a member of `engineering` and `eng_team_y`; only members of `eng_team_y` can change the deployment status of service Z). We then define ACLs for these groups. IDK how they are enforced, but they're visible/malleable via Ansible recipes, such that the process of adding permissions for your group (or user) involves submitting a diff to said Ansible recipe and getting approval from an SRE.
In practice, we use Kerberos to obtain/distribute authorization tokens, which live for less than 24 hours. The authorization-value of these tokens is determined by the LDAP affinities of the bearer. If everything is configured correctly (which it always is, until you need new permissions / switch teams), all you have to do is auth with kerberos at the beginning of each day. We have ~200 engineers.
Excellent paper. As someone who has worked with filesystems and ACLs, but never touched Spanner before, I have some questions for any Googler who has played with Zanzibar. (in part because full-on client systems examples are limited)
A check on my understanding: Zanzibar is optimized to handle zookies that are a bit stale (say, 10s old). In this case, the indexing systems (such as Leopard) can be used to vastly accelerate query evaluation.
Questions I have (possibly missed explanations in the paper):
1. If I understand the zookie time (call it T) evaluation correctly, access questions for a given user are effectively "did a user have access to a document at or after T"? How in practice is this done with a check() API? The client/Zanzibar can certainly use the snapshots/indexes to give a True answer, but if the snapshot evaluation is false, is live data used (and if so, by the client or by Zanzibar itself)? (e.g. how is the case handled of a user U just gaining access to a group G that is a member of some resource R?)
2. Related to #1, when is a user actually guaranteed to lose access to a document (at a version they previously had access to)? E.g. if a user has access to document D via group G and the user is evicted from G, the protocol seems to inherently allow the user to forever access D unless D is updated. In practice, is there some system (or application control) that will eventually block U from accessing D?
3. Is check latency going to be very high for documents that are being modified in real time (so zookie time is approximately now or close to now) that have complex group structures? (e.g. a document nested 6 levels deep in a folder where I have access to the folder via a group)? That is, there's nothing Zanzibar can do but "pointer chase", resulting in a large number of serial ACL checks?
4. How do clients consistently update ACLs alongside their "reverse edges"? For instance, the Zanzibar API allows me to view the members of a group (READ), but how do I consistently view which groups a user is a member of? (Leopard can cache this, but I'm not sure if this is available to clients, and regardless it doesn't seem to be able to answer the question for "now" - only for a time earlier than the indexed time).
Or for a simpler example, if I drag a document into a folder, how is the Zanzibar entry that D is a child of F made consistent with F's view of its children?
E.g. can you do a distributed transaction with ACL changes and client data stored in Spanner?
5. It looks like the Watch API is effectively pushing updates whenever the READ(obj) would change, not the EXPAND(object). Is this correct? How are EXPAND() changes tracked by clients? Is this even possible? (e.g. if G is a member of some resource R and U is added to G, how can a client determine U now has access to R?)
Used to be a Googler and worked on an ACL model built on top of Zanzibar. I didn't work directly on Zanzibar so listen to ruomingpang over me.
> 3. There's nothing Zanzibar can do but "pointer chase", resulting in a large number of serial ACL checks?
Zanzibar enforced a max depth and would fail if the pointer-chasing traversed too deeply. Zanzibar would also fail if it traversed too many nodes.
> 4. How do clients consistently update ACLs alongside their "reverse edges"?
One of the recommended solutions was to store your full ACL (which includes a Zookie) in the same Spanner row of whatever it protected. So, if your ACL is for books, you might have:
CREATE TABLE books (
  book SERIAL PRIMARY KEY,
  acl ZanzibarAcl
);
Alternatively, you could opt to only store the current zookie instead of the full ACL. Then the check becomes:
1. Fetch Zookie from Spanner
2. Call Zanzibar.Check with the zookie
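To make that concrete, here is a rough sketch of the flow in Python. The Spanner read, the Zanzibar client, and the "viewer" relation are all made-up stand-ins, not the real APIs:

# Toy, in-memory stand-ins; none of these names are real Spanner or Zanzibar APIs.
BOOKS = {"moby-dick": {"zookie": b"opaque-consistency-token"}}

def fake_spanner_read(book_id):
    # Stand-in for reading the protected row, which carries the stored zookie.
    return BOOKS[book_id]

def fake_zanzibar_check(obj, relation, user, at_least_as_fresh):
    # Stand-in for Zanzibar.Check, evaluated no staler than the given zookie.
    return user == "alice" and relation == "viewer"

def can_view(user, book_id):
    row = fake_spanner_read(book_id)          # step 1: fetch the zookie
    return fake_zanzibar_check(               # step 2: Check with that zookie
        obj=f"book:{book_id}", relation="viewer",
        user=user, at_least_as_fresh=row["zookie"])

print(can_view("alice", "moby-dick"))  # True in this toy setup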
> but how do I consistently view which groups a user is a member of?
I remember this as a large source of pain to implement. Zanzibar didn't support this use-case directly. As rpang mentioned in a sibling comment, you need an ACL-aware index. Essentially, the algorithm is:
1. Call Zanzibar.Check on all groups the user might be a part of.
There's a bunch of clever tricks you can use to prune the search space that I don't know the details of.
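For illustration only, the naive version (before any pruning) might look like this; `check` is just a stand-in for Zanzibar.Check:

def groups_for_user(user, candidate_groups, check):
    # Naive reverse lookup: one Check per candidate group. An ACL-aware
    # index would prune candidate_groups drastically before this step.
    return [g for g in candidate_groups
            if check(obj=f"group:{g}", relation="member", user=user)]

# Toy membership table standing in for Zanzibar:
memberships = {("group:eng", "bob"), ("group:oncall", "bob")}
check = lambda obj, relation, user: (obj, user) in memberships
print(groups_for_user("bob", ["eng", "oncall", "finance"], check))  # ['eng', 'oncall']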
How would you deal with questions like "provide all content accessible to a user" in a system like this? Would you watch and replicate to your own database?
Semi-off topic: What is the latest and greatest in authorization mechanisms lately?
I like capability-based at the OS level, but sadly I'm not doing anything that interesting. For things like webapps, is there anything better than ACLs or role-based? Or at least any literature talking about them? Probably overkill for the application I work on, but it'd be nice to take inspiration from best practices.
Replace "digital object" with "a PDF of your checking account transactions for 2018". You want to control who can do what with that PDF. Your privacy is at stake.
Sure, that’s privacy in the sense of “no one can access my stuff, unauthorized”.
I struggled with the sentence because, at the same time, creating one global centralized authentication source creates the opposite of privacy in the sense of anonymity. Certainly OT wrt the actual content of the work...
I may be misunderstanding the issue you're pointing out here... but I note that while the paper/sentence talks about "authorization" you're talking about centralized "authentication."
As an authorization system, Zanzibar focuses on: can agent A (identified through some means) perform action X on object Y. It isn't about deciding whether an arbitrary actor is agent A, but about prescribing what actions agent A can perform against the universe of all possible objects (which likewise are referenced abstractly and not stored within the system itself).
The knowledge that A could do X on Y is information that might be disclosed (and thus entails some privacy risk)... but inherently doesn't reveal: anything about the identity of A; whether A has ever done X; or what Y's contents are or what it represents.
On the other hand, perhaps you mean that because membership in sets of users is also stored within it (via a sort of "is member of" permission), you can use that to de-anonymize who a given actor is. This might work, but it assumes you can uniquely derive which agent from a set of abstract agents represents that individual, and that you extrinsically know something about the person being the only person in this specific set of sets.
This reminds me I need to get my authz paper published, now sooner rather than later...
I've built an authz system that is built around labeled security and RBAC concepts. Basically:
- resource owners label resources
- the labels are really names for ACLs in a directory
- the ACL entries grant roles to users/groups
- roles are sets of verbs
There are unlimited verbs, and unlimited roles. There are no negative ACL entries, which means they are sets -- entry order doesn't matter. The whole thing resembles NTFS/ZFS ACLs, but without negative ACL entries, and with indirection via naming the ACLs.
ACL data gets summarized and converted to a form that makes access control evaluation fast to compute. This data then gets distributed to where it's needed.
The API consists mainly of:
- check(subject, verb, label) -> boolean
- query(subject, verb, label) -> list of grants (supports wildcarding)
- list(subject) -> list of grants
- grant(user-or-group, role, label)
- revoke(user-or-group, role, label)
- interfaces for creating verbs, roles, and labels, and adding/removing verbs from roles
Note that access granting/revocation is done using roles, while access checking is done using verbs.
What's really cool about this system is that because it is simple it is composable. If you model certain attributes of subjects (e.g., whether they are on-premises, remote, in a public cloud, ...) as special subjects, then you can compose multiple check() calls to get ABAC, CORS/on-behalf-of/impersonation, MAC and DAC, SAML/OAuth-style authorization, and more. When I started all I wanted was a labeled security system. It was only later that compositions came up.
Because we built a summarized authz data distribution system first, all the systems that have data will continue to have it even in an outage -- an outage just becomes longer-than-usual update latencies.
check() performance is very fast, on the order of 10us to 15us, with no global locks, and this could probably be made faster.
check() essentially looks up the subject's group memberships (with the group transitive closure expanded) and the {verb, label}'s direct grantees, and checks whether the intersection is empty (access denied) or not (access granted). In the common case (the grantee list is short) this requires N log M comparisons, and in the worst case (the two lists are comparable in size) it requires O(N) comparisons. This means check() performance is naturally very fast when using local authz data. Using a REST service adds latency, naturally, but the REST service itself can be backended with summarized authz data, making it fast. Using local data makes the system reliable and reliably fast.
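As a toy illustration of that check(), here is a sketch over in-memory summarized data (all names and structures are made up, and Python sets stand in for the sorted lists a real implementation would intersect):

# summarized data: (verb, label) -> set of direct grantees (users or groups)
grants = {
    ("read", "finance-reports"): {"group:finance", "user:alice"},
}

# subject -> precomputed transitive closure of group memberships
group_closure = {
    "user:bob": {"group:finance", "group:employees"},
    "user:alice": {"group:employees"},
}

def check(subject, verb, label):
    # Granted iff the subject, or any group in its closure, is a direct grantee.
    grantees = grants.get((verb, label), set())
    principals = {subject} | group_closure.get(subject, set())
    return not principals.isdisjoint(grantees)

assert check("user:bob", "read", "finance-reports")       # via group:finance
assert check("user:alice", "read", "finance-reports")     # direct grant
assert not check("user:alice", "write", "finance-reports")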
query() does more work, but essentially amounts to a union of the subject's direct grants and a join of the subject's groups and the groups' direct grants.
Special entities like "ANYONE" (akin to Authenticated Users in Windows) and "ANONYMOUS" also exist, naturally, and can be granted. These are treated like groups in the summarized authz data. We also have a "SELF" special entity which allows one to express grants to any subject who is the same as the one running the process that calls check().
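And the composition mentioned above can be as simple as ANDing check() calls, e.g. treating an attribute of the caller's environment as a special subject. Purely illustrative:

def abac_check(check, subject, env_subject, verb, label):
    # ABAC via composition: both the real subject and a special "attribute"
    # subject (e.g. "ENV:on-premises") must hold the {verb, label} grant.
    return check(subject, verb, label) and check(env_subject, verb, label)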
I was going by the twitter thread, but I looked and found this in Wikipedia:
> the Zanzibar Archipelago, together with Tanzania's Mafia Island, are sometimes referred to locally as the "Spice Islands" (a term borrowed from the Maluku Islands of Indonesia).
What's interesting to me here is not the ACL thing, it's how in a way 'straightforward' this all seems to be.
It's the large architecture of a fairly basic system, done, I suppose, 'professionally'.
I'm curious to know how this works organizationally. What kind of architects are involved? Because this system would have to interact with any number of others, how do they do requirements gathering? Do they just 'have experience' and 'know what needs to be done', or is this something socialized with 'all the other teams'?
And how many chefs in that kitchen once the preparation starts? Because there's clearly a lot of pieces. Do they have just a few folks wire it out and then check with others? Who reviews designs for such a big thing?
Or was all of this developed organically, over time?
Zanzibar is basically the brainchild of a Bigtable Tech Lead + a Principal Engineer from Google's security and privacy team [1]. This led to a very sound and robust original design for the system. But it also greatly evolved over time as the system scaled up and got new clients with new requirements and new workloads.
Especially at Google, you first see the same problem appearing and getting solved in multiple products, then someone tries to come up with a more generic solution that works for most projects and, just as importantly, can serve more traffic than the existing solutions. Having to rewrite things on a regular basis because of growth is painful, but can also be a blessing in disguise.
Who that someone working on the generic solution is can vary. Sometimes it's one or more of the teams already mentioned. Sometimes, like in this case, it's someone with expertise in related areas who takes the initiative. And a project of this scope invariably gets reviewed on a regular basis by senior engineers, all the way to Urs (who leads all of technical infrastructure). Shared technologies require not just headcount to design and write the systems, but also to operate them (by SREs when they're large enough), so you need to get upper management involved as well.
This project says way more about the organization than any specific technical competence.
I'm not close to Google, but from those I know on the product side it can be 'a Gaggle' with nobody really in charge ... but I guess if you have enough self-motivated conscientious actors, and mature people, without ugly turf wars, who can have reasonable discussions, and responsible enough people in charge that can steer things in an appropriate direction ... it works.
But the fact this is an evolution and not a 'new product' is probably a prerequisite - so many smart people are hard to corral around new ideas, but if it's been done A, B, C times, then a 'Z' solution speaks to an engineer's sense of efficiency, and it should be natural for such an org to want to do it.
I won't name names, but I worked at a large tech company that could not get 'Single Sign On' to work. It was really frustrating to think so many reasonably smart people couldn't figure that out.
We don't need genius, I think, just a wealth of experience and a lot of common sense.
The system is actually pretty complicated and nonobvious once you consider its caching layers, heavy reliance on spanner, assumption that ACL read times can be stale, and the various assumptions and limitations in the namespace controls.
The underlying model of role based access control (and viewing groups as just other resources with ACLs) is already well known.
I love reading about Google's systems, but I wish I could work on those problems at scale, that is my dream really. I wonder what more systems Google has that we don't know about.
I know Borg inspired what we know as k8s, but surely there must be more things that Google has made internally that are not open source.
Curious about this and would like to know more about it from anyone in the trenches at Google.
> I love reading about Google's systems, but I wish I could work on those problems at scale, that is my dream really. I wonder what more systems Google has that we don't know about.
I work for Google and I used to have this exact thought too. I think the reality is not quite as rosy, though far from bad!
You have to realize that there are hundreds of people who work on systems like this, and as a consequence, your day to day work is more or less the same as what you would do on systems of a smaller scale.
Before I joined Google I always wondered what things they did differently and what magical knowledge Googlers must have possessed. After joining I realized that while on average the engineers are definitely more capable than other places I've worked, there's no special wisdom and instead they just have more powerful primitives/tools to work with.
Of course, maybe I am mistaken and just don't know of the magic?
> there's no special wisdom and instead they just have more powerful primitives/tools to work with
Reminds me of compound interest. Google operates at a scale where the company has enough brainpower to design systems like GFS/Colossus and Borg, which enable systems like Spanner, which enable systems like Zanzibar, and so on.
The harsh truth of working at Google is that in the end you are moving protobufs from one place to another. They have the most talented people in the world but those people still have to do some boring engineering work.
Can you believe that people at Google still have to, like, eat lunch and stuff? And talk to each other? The coffee is the same damn color every day too.
Maybe there's a place, somewhere, for the purest-of-the-pure non-boringest thoughts.
Sandstorm.io doesn't use protobuf, it uses Cap'n Proto, which was designed to replace protobuf.
Fun fact: The Zanzibar project was started in ~2011 specifically to replace my main project at the time, which was trying to solve the same problems. Apparently, some senior engineers felt letting me work on core infrastructure was too dangerous. They succeeded in turning my project into a lame duck and making me quit, which is when I then started working on Cap'n Proto and Sandstorm.io. In retrospect I'm glad it happened.
Yeah... Google is not always the most fun place to work on big infrastructure projects.
The point is you're writing mostly business logic and glue. You get a server request, you transform it with some logic, call some other servers, combine the responses and run some more logic, and return a response.
The scalability and interesting work has been factored out and handed off to infrastructure teams that build stuff like this auth framework, load balancers, highly scalable databases, data center cluster management tools, etc.
Which really is the smart way to do it. To the extent that you can stand on the shoulders of giants who've basically made scalability the default, you are free to focus on what you're actually trying to build. The only downside is that if all the interesting engineering challenges are already solved for you, the remainder might not be that interesting to people who enjoy engineering challenges.
Adtech is interesting because of the scale, complexity, and timing required compared to many other software projects. It gets a bad rap, but the engineering involved is not boring.
Colossus is actually the only project I can think of for which they had one of the leaders sit down with Kirk McKusick and have a chat for ACM Queue, instead of a paper. https://queue.acm.org/detail.cfm?id=1594206
And they reveal exactly zero details. I know a bit about it, but not enough to say exactly what semantics it offers to file system clients. I believe it is not POSIX-like, hence the need to layer Spanner and GCS over it.
There's been a teensy bit more detail than that, e.g. [1]. If you think about exactly the file semantics that Bigtable would require (append, pread), that's exactly what is provided. Note that Colossus and D are two separate things. Google systems can use D without Colossus, and a long time ago people used Colossus without D, although today Colossus/D is implied. The presentation gives the broad strokes of how Colossus is able to bootstrap itself from Chubby. It helps if you've also read the Bigtable paper [2].
??? The original GFS paper, which the chat references repeatedly, was clear about the semantics not being POSIX-like. The interview mentions that, too, along with stuff like snapshots. Colossus is basically the same, with increased scalability.
Yeah the whole time I was there every time I had to use bigtable or Spanner I always muttered to myself “I wish I could be using some free software garbage instead of this proprietary Google stuff right now.” Every Googler secretly yearns for the performance, reliability, and elegance of MySQL.
Uncool, given that Google only exists thanks to free software.
Google started as a C++, Java and Python shop. Android was born out of the Linux kernel.
If I remember correctly, long ago, the initial database they used was actually an in-house fork of MySQL.
Easy to say bad things about free software now that you have billions and hundreds of geniuses to work on your own stuff.
Actually, this kind of comment reminds me of the "Linux is cancer" era of Microsoft. Funny how Google is now becoming the new MS now that it has all the markets, while MS pretends to be nice now that it's not the top dog anymore.
I agree with you both! I don't think Google (or probably even the parent) meant to disrespect or discredit the value and contributions of open source. But internal, non-open platforms aren't really "vendor lock-in" or just NIH. They're often of much higher quality (if you have the resources), simply by solving the exact problems you have directly.
Disclaimer: I've been bitter about MySQL-for-everything-ism lately too but think Java is pure heaven.
To be fair, quite a few engineers at Google did yearn for MySQL, and a single instance at that. Not for any of the traits you list, but because it would have let them no longer worry about HA, replication, request hedging, key hotspots, etc. It would have also meant not having a product that works when there are more than a handful of users, but that's another story. BT was a lot of work to write for, and that's why Spanner evolved the way it did.
Um, not really. Care to elaborate? If something is available open-sourced we're typically free to use it, as long as we are abiding by the license conditions.
Not strictly true. Most software would probably require at least some modifications to run internally but as far as I know there’s no policy preventing open source software in production, quite the opposite.
There’s more, but if I can’t find a reference to them on Google Search I’ll assume it’s not my place to discuss it publicly.
Using protobufs as a base layer may seem like lock-in, but it very much is the opposite. Protobufs are surprisingly simple and maybe even elegant once you get past the ugly parts, but most importantly it decouples software from arbitrary protocols and makes it much easier to deal with changing implementations. (Not to mention the potential for rich backwards and forwards compatibility.)
Why? The build-or-buy trade-off is very different at Google scale and this is one of the few organizations that can build everything in-house for their specific needs.
As a side-note: 95th percentile latency statistics are pretty meaningless at this scale. With a million requests per second, a 95th percentile latency of 10ms still means that 50,000 requests per second are slower than that.
This is absolutely incredible. Since we saw login with Apple yesterday, makes me wonder if any of the other big companies can compete with this. Curious about Facebook/Netflix/Amazon.
Netflix seems zippy, but I've never looked at the request timings, which could differ pretty dramatically from UI load times. I imagine Google also dwarfs their login scale. Would be interesting to see numbers capturing full load time from clicking the login UI to successful redirect (or however you would measure this without including the time of the page load post-login).
This isn't a login service, this is an ACL service. Related space, but different concerns. You wouldn't send a user's password here to find out if it's correct (authentication), you'd use this to figure out if a user can do something once you know who they are (authorization) :)
Also, generating the login page etc is often more expensive than the actual 'validate the username and password'. Getting to the server is also going to dwarf these latencies; you probably don't store all your passwords in PoPs, so you need to make the full trek to your local Google datacentre to complete a login :)
In fact, validate the username and password might need to be artificially slowed down to protect against side channel and credential stuffing attacks :)
Awkward. I realized it was used for authz, but for some reason I assumed it would be used for authn as well. Now I’m wondering how Google does authn...
And yeah, the second half of my comment is trying to scope down the comparison to one that is reasonably “fair”
That's my corner of Google. We haven't published anything comparable to this paper in the time I've worked on it (maybe we could—I'm pleasantly surprised to see the Zanzibar folks got approval to share qps numbers and everything) but here's a bit about how it worked back in 2006:
fwiw, while we do our fair share of password checking, we do a _lot_ more oauth token and cookie checking. Most folks just stay signed in on both mobile and web, so no need to recheck their passwords. In contrast, session credentials get checked on every request.
In addition to what the neighbor comment says about authorization, an ACL service is an internal service: it provides an “if (the user is allowed to X) then ...” to the business-logic code. It's not a user-facing service.
I'm not saying this is the case at all, but I've noticed through experience that depending on how a system at scale is distributed that .1% outside your 99.9% may be impacting a specific user or group of users, or group of resources, or etc. So they may be getting 100% of their requests outside your 99.9% latency.
Not sure how I feel about adopting a country's name for a project.
Or more to the point, I'm not sure how I would feel if, every time I searched my country's name on the web, this Google project appeared rather than my actual country.
i.e. Zanzibar is a national identity, not just a "spice" island.
> We define availability as the fraction of “qualified” RPCs the service answers successfully within latency thresholds: 5 seconds for a Safe request, and 15 seconds for a Recent request as leader re-election in Spanner may take up to 10 seconds. ... To compute availability, we aggregate success ratios over 90-day windows averaged across clusters. Figure 5 shows Zanzibar’s availability as measured by these probers. Availability has remained above 99.999% over the past 3 years of operation at Google. In other words, for every quarter, Zanzibar has less than 2 minutes of global downtime and fewer than 13 minutes when the global error ratio exceeds 10%.
Basically, they're counting by number of requests. That's fairly typical for Google, who in their SRE book point out that measuring only total outages is a poor indicator of actual user experiences. Imagine if you had an electric company that had frequent brownouts and rolling blackouts but bragged about never having a total blackout. You'd be fairly unimpressed.
Google SREs also make the point that beyond five nines, your efforts are rendered moot by reliability issues you cannot control. Mostly network issues. If you have 99.99999% reliability but the mobile data network only has 99.99%, you've wasted a lot of money on something most folks will never notice.
Overall uptime isn't the only stat that matters here; the distribution of downtime matters too. One 15-minute outage in three years is a lot worse than 900 one-second outages over that same time period. One-second blips are a part of the web; we click refresh and move on--not even knowing whose fault it was.
It says greater than 5 nines, and it's usually much greater - in usual times, these core services are usually at six or seven nines as measured client side. But it doesn't take long at three nines to destroy your five nine SLA.
The other portion is client side retry logic. It's incredibly easy for developers to mark a lookup with a retry policy and timer, and one of the reasons that that latency is so low is so that even if there's a timeout, the pageview can succeed. The application code doesn't see the error at all if the retry is successful, it just takes longer. The retry code is very good and it's already known at the first rpc call where the retry should go - the connection pool maintains connections to multiple independent servers.
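The pattern is roughly the following sketch (purely illustrative, not the actual Google RPC retry code; the lookup callable and timings are made up):

import time

def lookup_with_retry(lookup, backends, attempt_timeout=0.01, deadline=0.05):
    # Try independent backends until one answers or the overall deadline passes;
    # the caller never sees the transient failures, only a bit of extra latency.
    start = time.monotonic()
    last_err = None
    for backend in backends:
        if time.monotonic() - start > deadline:
            break
        try:
            return lookup(backend, timeout=attempt_timeout)
        except (TimeoutError, ConnectionError) as err:
            last_err = err  # retry transparently against the next server
    raise last_err if last_err else TimeoutError("no backend answered in time")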
It kinda depends on what availability means. That .001% unavailability might be degraded service, might be .001% of clients having a bad time across the entire year, might be 'acts of god' (i.e. broken CPUs and the like). This kind of service is also usually fairly low down on the stack, and higher-level applications can usually degrade gracefully. If they couldn't, complex applications such as Google would fail to operate; there's always _something_ broken.