I thought this was mostly a solved problem? SSSD or FreeIPA and a bunch of LDAP ...

mfdutra · on Sept 13, 2016

We have hundreds of thousands of servers and we also run sshd inside containers everywhere, so we probably have over a million endpoints we can SSH into. Cache is only good for things you log in to frequently, like your workstation or a handful of servers. We don't accept the risk of not being able to login in a system. Our only dependency is a signed certificate. If things go south really really bad, we can just get the private key and sign certificates by hand.

hug · on Sept 13, 2016

That's an extremely good answer to the question I posed. Thank you.

Essentially the answer is that there's less immediate reliance on moving parts, which makes sense when you get to a certain level of scale and risk aversion -- a scale much greater than the one I work at, for sure.

otterley · on Sept 13, 2016

Why do you run sshd inside containers? I'm sort of surprised that Facebook does anything but run servers inside them. Running another init seems like an antipattern to me.

markild · on Sept 13, 2016

Would be interested in hearing this as well.

Could be that you would want to have a more generic interface for your containers, and not have to integrate too tightly with your container framework?

otterley · on Sept 13, 2016

> We don't accept the risk of not being able to login in a system.

There are plenty of ways to mitigate the risk of LDAP server unavailability. Besides, the document said that the bastion server uses it, so why isn't what's good for the bastion good enough for the rest of the fleet?

jetpks · on Sept 13, 2016

LDAP works really well until you have either too many users, or too many endpoints. Then it doesn't work at all. There's no (cost effective) way to get past all of the replication traffic flying all over the place. Eventually consistent models are awful for things like ssh keys and passwords, and that's exacerbated significantly when you add caching to the mix to ease the load on your LDAP cluster.

Adding extra brittle moving parts like LDAP and Linux clustering to a greenfield deployment just doesn't seem attractive when a CA is so easy to run, and you should already have one if you're doing config management sanely.

otterley · on Sept 13, 2016

Your premise would make sense if LDAP replication were expensive, but it isn't. LDAP database modifications are relatively rare: you only make them when a new user is added; a user's credentials change; or a user is deleted (which should be never for various post-termination accounting reasons). Even at Facebook, the change rate should be relatively low.

Also you're making an assumption about the need for consistency, when as a practical matter there's rarely a need for it. Caching is effective and practical for this use case and you'd have to make a very strong case that it should be thrown out.

Finally, it is my experience that people grossly misjudge the difficulty of securely and scalably running a CA. Most such comments come from those who have never actually operated one.

timv · on Sept 13, 2016

> why isn't what's good for the bastion good enough for the rest of the fleet?

mfdutra's comment that you replied to said: If things go south really really bad, we can just get the private key and sign certificates by hand.

That's obviously an incomplete answer - it doesn't explain how you connect to the server if you can't get onto the bastion due to a directory failure - but the basic premise is that the core system is isolated & contains everything it needs to handle the authentication process.

otterley · on Sept 13, 2016

That doesn't answer my question at all.

timv · on Sept 13, 2016

I'll assume that you didn't mean that to be as snarky as it sounds.

The bastion host is not strictly necessary in the process. Although mfdutra didn't spell it out explicitly, they clearly have a means for accessing their core servers without needing to get on to the bastion - that's where the signing by hand comes in. The benefits of using LDAP in that case justify the cost of having an external dependency - by integrating with the central password store, the bastion provides a path into the system that delivers the desired security features (centralised identity management, auditing back to individual user accounts), but other paths can exist if needed.

That trade-off isn't applicable to core servers. If they use LDAP (exclusively) for authentication and the directory is down then you're hosed. There's no other path into the server that you're trying to admin other than to get onto the server.

You could use LDAP with a fallback to some other scheme (like signed certs, or a small set of locally defined users managed with a password vault, etc) but then you've got 2 different paths onto each server and both paths need to be maintained, tested, and regularly audited. That's certainly possible, but why would you want to when you can just have 1 authentication mechanism that has zero dependencies?

otterley · on Sept 13, 2016

I dislike guessing what someone's intent and rationale are, when we can simply ask, as we can here.

In any event:

First, the CA itself is a dependency: if it fails, no login certificates will issue, and users won't be able to log in. So they traded one kind of service dependency for another.

Second, it hasn't been established that a backup login path is available. In many environments this may not even be acceptable (e.g. PCI and billing systems, which I know Facebook has).

superuser2 · on Sept 13, 2016

>service dependency

A CA is just some bytes, not a service. And it has been established that there's a backup login path: use (a copy of) the CA outside of the automated certificate signing service to manually sign the needed certificates.

They'd be screwed if they lost the CA's private key, but it is much easier to keep some data around than to keep a service functioning properly.

otterley · on Sept 13, 2016

You're mistaking the CA certificate (i.e. the certificate that signs the login certificate) for the signing process itself.

In the scheme described, users don't log into servers with the CA's certificate and private key -- the CA's private key is always protected, preferably in an HSM of some sort. Instead, the CA issues a signed certificate to the user with a set of principals; and that latter certificate is the one used to log in.

So, the process that issues certificates (the "Authority", as opposed to the "Authority Certificate") is the one I'm concerned with here.

superuser2 · on Sept 14, 2016

No I'm not.

>If things go south really really bad, we can just get the private key and sign certificates by hand.

That very clearly states that someone has access to the CA cert's private key outside the context of the automated signing service, and can use it to manually sign certs for users if the CA service is down. So the CA service can be bypassed if it goes down.

otterley · on Sept 14, 2016

OK, so now you're dependent on the person, who might be unreachable during an emergency.

My point is that you cannot eliminate all dependencies. And if I must have dependencies, I'd rather put my trust in a well-engineered, time-tested, highly-available system. When properly implemented, LDAP + SSSD is such a system.

At any rate, an even faster and more reliable emergency response system would be to place a static user ID and password in a lockbox (virtual or physical) somewhere and use that to log in. You don't need a complex CA infrastructure to attain that; NSS fallbacks to static /etc files would suffice.

tacticus · on Sept 13, 2016

You have a panic room, war room, or the like: somewhere with significant physical security and dedicated ports that skip the bastion entirely. You might have to go for the axe method in getting into the room during an incident, but you can if need be.

otterley · on Sept 13, 2016

You can do that even if LDAP is your primary authentication method. NSS and PAM have supported fallback methods since their inception, nearly two decades ago.
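To make the fallback idea concrete, here is a minimal sketch of an ordered lookup across account sources. It is only a model: the names `lookup_user`, `local_files`, `directory` and the `breakglass` account are hypothetical, and the "skip an unavailable source and keep going" rule is an assumption that roughly mirrors how an `nsswitch.conf` line such as `passwd: files sss` behaves by default.

```python
class SourceUnavailable(Exception):
    """Raised by a source that cannot be reached (e.g. the directory is down)."""


def lookup_user(name, sources):
    """Try each source in order; skip ones that are down or have no entry."""
    for source in sources:
        try:
            entry = source(name)
        except SourceUnavailable:
            continue  # this source is down: fall through to the next one
        if entry is not None:
            return entry  # first hit wins
    return None  # nobody knows this user


# Hypothetical local account, standing in for a static /etc/passwd entry.
LOCAL_ACCOUNTS = {"breakglass": {"uid": 900, "source": "files"}}


def local_files(name):
    return LOCAL_ACCOUNTS.get(name)


def directory(name):
    # Simulate an outage of the central directory.
    raise SourceUnavailable("directory unreachable")


if __name__ == "__main__":
    print(lookup_user("breakglass", [local_files, directory]))
    print(lookup_user("alice", [local_files, directory]))
```

During the simulated outage the local account still resolves while a directory-only user does not, which is the trade-off being argued about: the fallback path exists, but it is a second path that has to be maintained.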

andy_ppp · on Sept 13, 2016

How many engineers can sign into live servers and read information about my Facebook account? Rather scarily, it sounds like all of them!

How many engineers have access to, or a copy of, your global CA? If I get a copy of your global CA, can I create a cert allowing me to log into any of the machines?

tptacek · on Sept 13, 2016

Signed certificates work even when your AAA system is entirely offline. An individual server can make a valid authentication/authorization decision just by validating the signature.
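As a rough illustration of that point, the sketch below has a "server" decide on a login using nothing but the CA's public key and the presented certificate. Everything here is an assumption made for illustration: it uses toy textbook RSA with small, insecure parameters rather than the real OpenSSH certificate format, and the names `ca_sign`, `server_accepts` and the payload fields are hypothetical.

```python
import hashlib
import json

# Toy textbook-RSA key built from two Mersenne primes. INSECURE: illustration only.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus, known to every server
E = 65537                           # public exponent, known to every server
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, held only by the CA


def _digest(payload):
    """Hash the certificate payload to an integer smaller than N."""
    data = json.dumps(payload, sort_keys=True).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def ca_sign(payload):
    """CA side: needs the private exponent D."""
    return {"payload": payload, "sig": pow(_digest(payload), D, N)}


def server_accepts(cert, login_as, now):
    """Server side: uses only the public values N and E, and no network calls."""
    payload = cert["payload"]
    if pow(cert["sig"], E, N) != _digest(payload):
        return False  # not signed by our CA, or tampered with
    if not (payload["valid_after"] <= now <= payload["valid_before"]):
        return False  # outside the validity window
    return login_as in payload["principals"]


if __name__ == "__main__":
    now = 1_000_000.0
    cert = ca_sign({
        "key_id": "alice",
        "principals": ["alice"],
        "valid_after": now - 60,
        "valid_before": now + 3600,
    })
    print(server_accepts(cert, "alice", now))
    print(server_accepts(cert, "root", now))
```

The decision depends only on the signature, the validity window, and the principal list, so it keeps working when every central service is unreachable. The flip side is that anyone holding `D` can mint a certificate for any principal.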

There are some significant advantages to working that way.

windowsworkstoo · on Sept 13, 2016

I guess they are worried about total network isolation for some hosts...even if the LDAP infra is up, if there is a segment isolated and you haven't logged in for a while, cached creds won't help.

That said, if it's that isolated how are you connecting to it in the first place :)

I agree with you though, we just use AD and SSSD, with the additional step of having extended our AD schema to store pubkeys and sudo roles for the Linux part of our infra.

skywhopper · on Sept 12, 2016

I haven't read this article closely yet, but based on other similar protocols, I _think_ the key element is, can you still log in to your servers when all your LDAP is catastrophically down for some reason. Sure, it's unlikely, but for someone like Facebook that's an unacceptable risk. This also means you don't have to actually _worry_ about maintaining an HA LDAP infrastructure that can support all the SSH auth you need.

hug · on Sept 12, 2016

That exact scenario is covered by the concept of "cached credentials" in the case of SSSD. If your authentication services are down (or the box you're attempting to log into is offline) there's a period in which you can log in using the last used password for your account.

Granted, you only have the configurable cache timeout in which to solve your problem, but that's usually on the order of days. If your authentication infrastructure is down for longer than that, you have bigger fish to fry.

subway · on Sept 12, 2016

Cached credentials only save the day if the cache has been populated. Engineers should be logging in infrequently enough that the cache is rarely populated.
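The two conditions being discussed (the entry must exist, and it must not have aged out) can be sketched as a small model. This is not how SSSD is implemented; the class name `OfflineCredentialCache`, its methods, and the three-day limit are assumptions chosen purely for illustration.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass


@dataclass
class CachedCredential:
    salt: bytes
    password_hash: bytes
    cached_at: float  # time of the last successful online login


class OfflineCredentialCache:
    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._entries = {}

    @staticmethod
    def _hash(password, salt):
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def record_online_login(self, user, password, now):
        """Called only after the directory has confirmed the password."""
        salt = os.urandom(16)
        self._entries[user] = CachedCredential(salt, self._hash(password, salt), now)

    def check_offline(self, user, password, now):
        """Called when the directory is unreachable."""
        entry = self._entries.get(user)
        if entry is None:
            return False  # this user never logged in on this host
        if now - entry.cached_at > self.max_age_seconds:
            return False  # entry is older than the configured limit
        candidate = self._hash(password, entry.salt)
        return hmac.compare_digest(entry.password_hash, candidate)


if __name__ == "__main__":
    day = 86400.0
    cache = OfflineCredentialCache(max_age_seconds=3 * day)
    print(cache.check_offline("alice", "hunter2", now=0.0))    # never populated
    cache.record_online_login("alice", "hunter2", now=0.0)
    print(cache.check_offline("alice", "hunter2", now=1 * day))  # within the limit
    print(cache.check_offline("alice", "hunter2", now=4 * day))  # past the limit
```

With a very large fleet, most (user, host) pairs look like the first case: there was never an online login to populate the entry, so the timeout never comes into play.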

deathanatos · on Sept 13, 2016

Is it not possible to force the cache to have a particular set of users (say, critical devops folks) always populated?