
I thought this was mostly a solved problem? SSSD or FreeIPA and a bunch of LDAP servers. Cached credentials will keep you online in the case of a transient authentication issue.

HA LDAP is pretty much a solved problem as well. Say what you want about various Microsoft server products, Active Directory is second to none in this regard, and is a highly flexible and extremely robust service.

Am I missing something specific about Facebook that makes this not an option?




We have hundreds of thousands of servers and we also run sshd inside containers everywhere, so we probably have over a million endpoints we can SSH into. A cache is only good for things you log in to frequently, like your workstation or a handful of servers. We don't accept the risk of not being able to log in to a system. Our only dependency is a signed certificate. If things go south really really bad, we can just get the private key and sign certificates by hand.
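
For readers unfamiliar with the manual fallback, signing a certificate by hand with OpenSSH's `ssh-keygen` looks roughly like the sketch below. This is not Facebook's actual procedure: the file paths, key ID, principal names, and one-hour validity are hypothetical placeholders, and the function only builds the command so it can be reviewed before anyone runs it.

```python
from typing import List


def build_manual_sign_command(
    ca_key_path: str,
    user_pubkey_path: str,
    key_id: str,
    principals: List[str],
    validity: str = "+1h",
) -> List[str]:
    """Build the argv for signing a user's public key with an SSH CA key.

    Uses standard ssh-keygen options: -s (CA private key), -I (certificate
    identity), -n (comma-separated principals), -V (validity interval).
    """
    # Requiring a principal is a choice made for this sketch, so that an
    # emergency certificate is never valid for every account by accident.
    if not principals:
        raise ValueError("at least one principal is required")
    return [
        "ssh-keygen",
        "-s", ca_key_path,
        "-I", key_id,
        "-n", ",".join(principals),
        "-V", validity,
        user_pubkey_path,
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cmd = build_manual_sign_command(
        ca_key_path="/secure/offline/ca_key",
        user_pubkey_path="/tmp/oncall_id_ed25519.pub",
        key_id="oncall-emergency",
        principals=["root"],
    )
    print(" ".join(cmd))
```

Keeping the validity window short limits the damage if an emergency certificate leaks, at the cost of having to re-sign during a long outage.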


That's an extremely good answer to the question I posed. Thank you.

Essentially the answer is that there's less immediate reliance on moving parts, which makes sense when you get to a certain level of scale and risk aversion -- a scale much greater than the one I work at, for sure.


Why do you run sshd inside containers? I'm sort of surprised that Facebook does anything but run servers inside them. Running another init seems like an antipattern to me.


Would be interested in hearing this as well.

Could be that you would want to have a more generic interface for your containers, and not have to integrate too tightly with your container framework?


> We don't accept the risk of not being able to log in to a system.

There are plenty of ways to mitigate the risk of LDAP server unavailability. Besides, the document said that the bastion server uses it, so why isn't what's good for the bastion good enough for the rest of the fleet?


LDAP works really well until you have either too many users or too many endpoints. Then it doesn't work at all. There's no (cost-effective) way to get past all of the replication traffic flying all over the place. Eventually consistent models are awful for things like SSH keys and passwords, and that's exacerbated significantly when you add caching to the mix to ease the load on your LDAP cluster.

Adding extra brittle moving parts like LDAP and Linux clustering to a greenfield deployment just doesn't seem attractive when a CA is so easy to run, and you should already have one if you're doing config management sanely.


Your premise would make sense if LDAP replication were expensive, but it isn't. LDAP database modifications are relatively rare: you only make them when a new user is added; a user's credentials change; or a user is deleted (which should be never for various post-termination accounting reasons). Even at Facebook, the change rate should be relatively low.

Also you're making an assumption about the need for consistency, when as a practical matter there's rarely a need for it. Caching is effective and practical for this use case and you'd have to make a very strong case that it should be thrown out.

Finally, it is my experience that people grossly misjudge the difficulty of securely and scalably running a CA. Most such comments come from those who have never actually operated one.


> why isn't what's good for the bastion good enough for the rest of the fleet?

mfdutra's comment that you replied to said: If things go south really really bad, we can just get the private key and sign certificates by hand.

That's obviously an incomplete answer - it doesn't explain how you connect to the server if you can't get onto the bastion due to a directory failure - but the basic premise is that the core system is isolated & contains everything it needs to handle the authentication process.


That doesn't answer my question at all.


I'll assume that you didn't mean that to be as snarky as it sounds.

The bastion host is not strictly necessary in the process. Although mfdutra didn't spell it out explicitly, they clearly have a means for accessing their core servers without needing to get on to the bastion - that's where the signing by hand comes in. The benefits of using LDAP in that case justify the cost of having an external dependency - by integrating with the central password store, the bastion provides a path into the system that delivers the desired security features (centralised identity management, auditing back to individual user accounts), but other paths can exist if needed.

That trade-off isn't applicable to core servers. If they use LDAP (exclusively) for authentication and the directory is down then you're hosed. There's no other path into the server that you're trying to admin other than to get onto the server.

You could use LDAP with a fallback to some other scheme (like signed certs, or a small set of locally defined users managed with a password vault, etc.) but then you've got 2 different paths onto each server, and both paths need to be maintained, tested, and regularly audited. That's certainly possible, but why would you want to when you can just have 1 authentication mechanism that has zero dependencies?


I dislike guessing what someone's intent and rationale are, when we can simply ask, as we can here.

In any event:

First, the CA itself is a dependency: if it fails, no login certificates will issue, and users won't be able to log in. So they traded one kind of service dependency for another.

Second, it hasn't been established that a backup login path is available. In many environments this may not even be acceptable (e.g. PCI and billing systems, which I know Facebook has).


>service dependency

A CA is just some bytes, not a service. And it has been established that there's a backup login path: use (a copy of) the CA outside of the automated certificate signing service to manually sign the needed certificates.

They'd be screwed if they lost the CA's private key, but it is much easier to keep some data around than to keep a service functioning properly.


You're mistaking the CA certificate (i.e. the certificate that signs the login certificate) for the signing process itself.

In the scheme described, users don't log into servers with the CA's certificate and private key -- the CA's private key is always protected, preferably in an HSM of some sort. Instead, the CA issues a signed certificate to the user with a set of principals; and that latter certificate is the one used to log in.

So, the process that issues certificates (the "Authority", as opposed to the "Authority Certificate") is the one I'm concerned with here.


No I'm not.

>If things go south really really bad, we can just get the private key and sign certificates by hand.

That very clearly states that someone has access to the CA cert's private key outside the context of the automated signing service, and can use it to manually sign certs for users if the CA service is down. So the CA service can be bypassed if it goes down.


OK, so now you're dependent on the person, who might be unreachable during an emergency.

My point is that you cannot eliminate all dependencies. And if I must have dependencies, I'd rather put my trust in a well-engineered, time-tested, highly-available system. When properly implemented, LDAP + SSSD is such a system.

At any rate, an even faster and more reliable emergency response system would be to place a static user ID and password in a lockbox (virtual or physical) somewhere and use that to log in. You don't need a complex CA infrastructure to attain that; NSS fallbacks to static /etc files would suffice.


You have a panic room, war room, or similar: somewhere with significant physical security and dedicated ports that skips the bastion entirely. You might have to go for the axe method to get into the room during an incident, but you can if need be.


You can do that even if LDAP is your primary authentication method. NSS and PAM have supported fallback methods since their inception, nearly two decades ago.


How many engineers can sign into live servers and read information about my Facebook account? Rather scarily, it sounds like all of them!

How many engineers have access to, or a copy of, your global CA? If I get a copy of your global CA, can I create a cert allowing me to log into any of the machines?


Signed certificates work even when your AAA system is entirely offline. An individual server can make a valid authentication/authorization decision just by validating the signature.

There are some significant advantages to working that way.
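
To make the "no AAA system needed" point concrete, here is a rough sketch of the shape of that local decision. It is a simplified model, not OpenSSH's implementation: the `Certificate` fields are hypothetical, the cryptographic check is delegated to a caller-supplied `signature_is_valid` function standing in for verification against a locally stored CA public key, and a principal is assumed to match the login name directly. The point is that the only inputs are the presented certificate, local trust material, and the clock.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Certificate:
    # Hypothetical, simplified stand-in for an SSH user certificate.
    key_id: str
    principals: FrozenSet[str]
    valid_after: int   # Unix seconds
    valid_before: int  # Unix seconds


def authorize_offline(
    cert: Certificate,
    login_user: str,
    now: int,
    signature_is_valid: Callable[[Certificate], bool],
) -> bool:
    """Decide a login using only local data; no directory lookup is made."""
    # 1. The certificate must be signed by a CA this host already trusts.
    if not signature_is_valid(cert):
        return False
    # 2. The certificate must be inside its validity window.
    if not (cert.valid_after <= now < cert.valid_before):
        return False
    # 3. The requested account must be one of the certificate's principals.
    return login_user in cert.principals


if __name__ == "__main__":
    cert = Certificate("alice-laptop", frozenset({"alice"}), 1000, 2000)
    trust_everything = lambda c: True  # placeholder verifier for the demo
    print(authorize_offline(cert, "alice", 1500, trust_everything))
```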


I guess they are worried about total network isolation for some hosts...even if the LDAP infra is up, if there is a segment isolated and you haven't logged in for a while, cached creds won't help.

That said, if it's that isolated how are you connecting to it in the first place :)

I agree with you though, we just use AD and SSSD, with the additional step of having extended our AD schema to store pubkeys and sudo roles for the Linux part of our infra.


I haven't read this article closely yet, but based on other similar protocols, I _think_ the key element is, can you still log in to your servers when all your LDAP is catastrophically down for some reason. Sure, it's unlikely, but for someone like Facebook that's an unacceptable risk. This also means you don't have to actually _worry_ about maintaining an HA LDAP infrastructure that can support all the SSH auth you need.


That exact scenario is covered by the concept of "cached credentials" in the case of SSSD. If your authentication services are down (or the box you're attempting to log into is offline) there's a period in which you can log in using the last used password for your account.

Granted, you only have the configurable cache timeout in which to solve your problem, but that's usually on the order of days. If your authentication infrastructure is down for longer than that, you have bigger fish to fry.
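
As a rough illustration of how such a cache behaves, the sketch below models the decision a host makes when the directory is unreachable. It is a simplified model, not SSSD's code: the function name and data layout are hypothetical, and the `expiration_days` parameter only loosely mirrors SSSD's `offline_credentials_expiration` setting (days since the last successful online login, with 0 meaning no limit).

```python
from typing import Dict

SECONDS_PER_DAY = 86400


def can_login_offline(
    last_online_login: Dict[str, float],
    user: str,
    now: float,
    expiration_days: int,
) -> bool:
    """Return True if a cached login would be accepted while offline.

    last_online_login maps usernames to the Unix time of their most recent
    successful online login on this particular host.
    """
    last = last_online_login.get(user)
    if last is None:
        # The user never logged in here while online, so nothing is cached.
        return False
    if expiration_days == 0:
        # Modeled as "no limit" on the age of a cached login.
        return True
    return now - last <= expiration_days * SECONDS_PER_DAY


if __name__ == "__main__":
    cache = {"alice": 0.0}  # hypothetical host where only alice has logged in
    print(can_login_offline(cache, "alice", 2 * SECONDS_PER_DAY, 3))
    print(can_login_offline(cache, "bob", 2 * SECONDS_PER_DAY, 3))
```

The first branch is the important one at fleet scale: a host nobody has logged in to recently has an empty cache, so the timeout never comes into play.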


Cached credentials only save the day if the cache has been populated. Engineers should be logging in infrequently enough that the cache is rarely populated.


Is it not possible to force the cache to have a particular set of users (say, critical devops folks) always populated?



