Hacker News
Scalable and secure access with SSH (facebook.com)
161 points by samber on Sept 12, 2016 | 61 comments



I found this really useful. It's lightweight, simple and powerful.

However, some pieces of advice in the article could prove catastrophic:

"You need to distribute ca.pub to your entire fleet. Remember, this is meant to be public, so complete access lockdown isn't the goal."

Complete access lockdown is vital. You need to prevent the CA's public key from being altered, corrupted or deleted across your fleet. When distributing the CA's public key, you definitely need to use a secure transport. The CA's public key is what your system of trust is built on. You need to make sure you are trusting the right thing.
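(For anyone following along: trusting that distributed file on each host is a one-line sshd_config directive, which is exactly why its integrity matters so much. Path is illustrative.)

    # /etc/ssh/sshd_config on every fleet host
    # ca.pub is the CA public key you distributed; whoever can replace
    # this file controls who may log in.
    TrustedUserCAKeys /etc/ssh/ca.pub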

"Copy the latter to the CA server and get it signed. Because this is public information, the transport isn't important. You can copy and paste, or fax it"

When copying the user's public key to the server, the transport is still important, otherwise a MITM could swap the public key and your CA would happily sign the wrong thing.


For (possibly) smaller scale I wrote a self-service SSH CA: https://github.com/nsheridan/cashier


> We do not place SSH certificates on laptops because it is difficult to control everything that individual employees run on them. Even though they are centrally managed, laptops are more prone to vulnerabilities than the bastion servers. Therefore, we do not trust them with SSH private keys.

If the laptop doesn't have an SSH private key, how do you SSH to the bastion host?

> In your .ssh/ directory, you'll see id_ecdsa and id_ecdsa.pub. Copy the latter to the CA server and get it signed. Because this is public information, the transport isn't important. You can copy and paste, or fax it; just don't copy id_ecdsa anywhere.

The transfer of id_ecdsa.pub doesn't need secrecy, but it does need integrity. You don't want to accidentally sign an attacker's public key.
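A rough sketch of a signing flow that preserves integrity, with an out-of-band fingerprint check (the key ID, principal and validity period are illustrative):

    # On the CA host, before signing, compare this fingerprint with
    # the one the user reads to you over a trusted channel:
    ssh-keygen -lf id_ecdsa.pub

    # Then sign, embedding the identity, allowed principals and lifetime:
    ssh-keygen -s user_ca -I alice@example.com -n alice -V +52w id_ecdsa.pub
    # This produces id_ecdsa-cert.pub, which is returned to the user.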


> If the laptop doesn't have an SSH private key, how do you SSH to the bastion host?

> These hosts use centralized LDAP and Kerberos installations to share account information, and they require two-factor authentication to protect against password leakage

It appears that you use your password + some unspecified second factor.


I thought this was mostly a solved problem? SSSD or FreeIPA and a bunch of LDAP servers. Cached credentials will keep you online in the case of a transient authentication issue.

HA LDAP is pretty much a solved problem as well. Say what you want about various Microsoft server products, Active Directory is second to none in this regard, and is a highly flexible and extremely robust service.

Am I missing something specific about Facebook that makes this not an option?


We have hundreds of thousands of servers and we also run sshd inside containers everywhere, so we probably have over a million endpoints we can SSH into. A cache is only good for things you log into frequently, like your workstation or a handful of servers. We don't accept the risk of not being able to log in to a system. Our only dependency is a signed certificate. If things go south really, really bad, we can just get the private key and sign certificates by hand.


That's an extremely good answer to the question I posed. Thank you.

Essentially the answer is that there's less immediate reliance on moving parts, which makes sense when you get to a certain level of scale and risk aversion -- a scale much greater than the one I work at, for sure.


Why do you run sshd inside containers? I'm sort of surprised that Facebook does anything but run servers inside them. Running another init seems like an antipattern to me.


Would be interested in hearing this as well.

Could be that you would want to have a more generic interface for your containers, and not have to integrate too tightly with your container framework?


> We don't accept the risk of not being able to login in a system.

There are plenty of ways to mitigate the risk of LDAP server unavailability. Besides, the document said that the bastion server uses it, so why isn't what's good for the bastion good enough for the rest of the fleet?


LDAP works really well until you have either too many users or too many endpoints. Then it doesn't work at all. There's no (cost-effective) way to get past all of the replication traffic flying all over the place. Eventually consistent models are awful for things like SSH keys and passwords, and that's exacerbated significantly when you add caching to the mix to ease the load on your LDAP cluster.

Adding extra brittle moving parts like LDAP and linux clustering to a greenfield deployment just doesn't seem attractive when a CA is so easy to run, and you should already have one if you're doing config management sanely.


Your premise would make sense if LDAP replication were expensive, but it isn't. LDAP database modifications are relatively rare: you only make them when a new user is added; a user's credentials change; or a user is deleted (which should be never for various post-termination accounting reasons). Even at Facebook, the change rate should be relatively low.

Also, you're making an assumption about the need for consistency when, as a practical matter, there's rarely a need for it. Caching is effective and practical for this use case, and you'd have to make a very strong case that it should be thrown out.

Finally, it is my experience that people grossly misjudge the difficulty of securely and scalably running a CA. Most such comments come from those who have never actually operated one.


> why isn't what's good for the bastion good enough for the rest of the fleet?

mfdutra's comment that you replied to said: If things go south really really bad, we can just get the private key and sign certificates by hand.

That's obviously an incomplete answer - it doesn't explain how you connect to the server if you can't get onto the bastion due to a directory failure - but the basic premise is that the core system is isolated & contains everything it needs to handle the authentication process.


That doesn't answer my question at all.


I'll assume that you didn't mean that to be as snarky as it sounds.

The bastion host is not strictly necessary in the process. Although mfdutra didn't spell it out explicitly, they clearly have a means for accessing their core servers without needing to get on to the bastion - that's where the signing by hand comes in. The benefits of using LDAP in that case justify the cost of having an external dependency - by integrating with the central password store, the bastion provides a path into the system that delivers the desired security features (centralised identity management, auditing back to individual user accounts), but other paths can exist if needed.

That trade-off isn't applicable to core servers. If they use LDAP (exclusively) for authentication and the directory is down then you're hosed. There's no other path into the server that you're trying to admin other than to get onto the server.

You could use LDAP with a fallback to some other scheme (like signed certs, or a small set of locally defined users managed with a password vault, etc.), but then you've got two different paths onto each server, and both paths need to be maintained, tested, and regularly audited. That's certainly possible, but why would you want to when you can just have one authentication mechanism that has zero dependencies?


I dislike guessing what someone's intent and rationale are, when we can simply ask, as we can here.

In any event:

First, the CA itself is a dependency: if it fails, no login certificates will issue, and users won't be able to log in. So they traded one kind of service dependency for another.

Second, it hasn't been established that a backup login path is available. In many environments this may not even be acceptable (e.g. PCI and billing systems, which I know Facebook has).


>service dependency

A CA is just some bytes, not a service. And it has been established that there's a backup login path: use (a copy of) the CA outside of the automated certificate signing service to manually sign the needed certificates.

They'd be screwed if they lost the CA's private key, but it is much easier to keep some data around than to keep a service functioning properly.


You're mistaking the CA certificate (i.e. the certificate that signs the login certificate) for the signing process itself.

In the scheme described, users don't log into servers with the CA's certificate and private key -- the CA's private key is always protected, preferably in an HSM of some sort. Instead, the CA issues a signed certificate to the user with a set of principals; and that latter certificate is the one used to log in.
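To make that concrete, you can inspect an issued certificate and see the principals and validity it carries; the CA key itself never leaves the signing host (output abbreviated and illustrative):

    ssh-keygen -L -f id_ecdsa-cert.pub
    #   Type: ecdsa-sha2-nistp256-cert-v01@openssh.com user certificate
    #   Signing CA: ECDSA SHA256:...
    #   Key ID: "alice@example.com"
    #   Valid: from 2016-09-12T00:00:00 to 2016-09-19T00:00:00
    #   Principals: alice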

So, the process that issues certificates (the "Authority", as opposed to the "Authority Certificate") is the one I'm concerned with here.


No I'm not.

>If things go south really really bad, we can just get the private key and sign certificates by hand.

Very clearly states that someone has access to the CA cert's private key outside the context of the automated signing service, and can use it to manually sign certs for users if the CA service is down. So the CA service can be bypassed if it goes down.


OK, so now you're dependent on the person, who might be unreachable during an emergency.

My point is that you cannot eliminate all dependencies. And if I must have dependencies, I'd rather put my trust in a well-engineered, time-tested, highly-available system. When properly implemented, LDAP + SSSD is such a system.

At any rate, an even faster and more reliable emergency response system would be to place a static user ID and password in a lockbox (virtual or physical) somewhere and use that to log in. You don't need a complex CA infrastructure to attain that; NSS fallbacks to static /etc files would suffice.
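That fallback is just the standard lookup order in /etc/nsswitch.conf: local files are consulted even when the directory is unreachable, e.g.:

    # /etc/nsswitch.conf -- try local files first, then SSSD/LDAP
    passwd:  files sss
    shadow:  files sss
    group:   files sss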


You have a panic room/war room, etc.: somewhere with significant physical security and dedicated ports that skip the bastion entirely. You might have to go for the axe method to get into the room during an incident, but you can if need be.


You can do that even if LDAP is your primary authentication method. NSS and PAM have supported fallback methods since their inception, nearly two decades ago.


How many engineers can sign into live servers and read information about my Facebook account? Rather scarily, it sounds like all of them!

How many engineers have access to / a copy of your global CA? If I get a copy of your global CA, can I create a cert allowing me to log into any of the machines?


Signed certificates work even when your AAA system is entirely offline. An individual server can make a valid authentication/authorization decision just by validating the signature.

There are some significant advantages to working that way.
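Right: the entire server-side configuration can be as small as this, with no network dependency at login time (paths illustrative):

    # /etc/ssh/sshd_config
    TrustedUserCAKeys /etc/ssh/ca.pub
    # Optional: map cert principals to local accounts, per user (%u)
    AuthorizedPrincipalsFile /etc/ssh/principals/%u

    # /etc/ssh/principals/root might then list the principals allowed
    # to log in as root, one per line, e.g.:
    #   prod-admins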


I guess they are worried about total network isolation for some hosts...even if the LDAP infra is up, if there is a segment isolated and you haven't logged in for a while, cached creds won't help.

That said, if it's that isolated how are you connecting to it in the first place :)

I agree with you though, we just use AD and SSSD, with the additional step of having extended our AD schema to store pubkeys and sudo roles for the Linux part of our infra.


I haven't read this article closely yet, but based on other similar protocols, I _think_ the key element is, can you still log in to your servers when all your LDAP is catastrophically down for some reason. Sure, it's unlikely, but for someone like Facebook that's an unacceptable risk. This also means you don't have to actually _worry_ about maintaining an HA LDAP infrastructure that can support all the SSH auth you need.


That exact scenario is covered by the concept of "cached credentials" in the case of SSSD. If your authentication services are down (or the box you're attempting to log into is offline), there's a period in which you can log in using the last-used password for your account.

Granted, you only have the configurable cache timeout in which to solve your problem, but that's usually on the order of days. If your authentication infrastructure is down for longer than that, you have bigger fish to fry.
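For reference, this is controlled by a couple of sssd.conf settings (domain name and value illustrative):

    # /etc/sssd/sssd.conf
    [domain/example.com]
    cache_credentials = True

    [pam]
    # Days that cached credentials remain usable while offline
    # (0 = no limit)
    offline_credentials_expiration = 7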


Cached credentials only save the day if the cache has been populated. Engineers should be logging in to any given host infrequently enough that the cache is rarely populated.


Is it not possible to force the cache to have a particular set of users (say, critical devops folks) always populated?


The implementation seems quite novel, but I can't shake the feeling that it's a bit over-engineered. The additional complexity of this over a standard solution (e.g. FreeIPA + SSSD) may introduce other issues that need to be mitigated.

The approach of using certificates instead of a cache like SSSD just shifts the issue from cache expiration over to certificate expiration. With certificates, you are likely to set some period after which user certificates expire, and if you have an expired certificate and the CA is down, you still cannot log in. If you don't expire your certificates, you have a security risk.

With FreeIPA you can also map public keys to specific users and have sshd pull authorized_keys from SSSD instead of the actual file. So if a user leaves, their keys get removed; no need to worry about old keys lingering on your servers.
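That lookup is typically wired in via sshd's AuthorizedKeysCommand, e.g.:

    # /etc/ssh/sshd_config
    # sshd asks SSSD for the user's keys instead of reading a file;
    # the username is passed to the command automatically.
    AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
    AuthorizedKeysCommandUser nobody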

As somebody already mentioned, LDAP (which FreeIPA uses as backend) has a battle-proven HA capability. With some sane configurations of the underlying servers (separate physical equipment, networking, etc), you don't need to worry about your backend mysteriously going down.


Tangent: "When designing a security system, regardless of purpose or protocol, you need to think of authentication and authorization separately. Authentication securely verifies that the client is who it claims to be, but does not grant any permissions. After the successful authentication, authorization decides whether or not the client can perform a specific action."

These two operations are often abbreviated "authn" (for autheNtication) and "authz" (for authoriZation) in security frameworks.


In fact, we use the terms authn, authz, authnz and AAA a lot at Facebook. :-)


When I was doing security work, AAA stood for "authorization, authentication and audit"


The last one was also known as "accounting."


But not referred to in the article, which I found surprising since they're such common terms for those in the know. :)


How confusing for those of us in British English countries!

(or should I say 'confuzing'? :p )


How do you manage what servers clients trust?

I've seen a few SSH CA projects for signing user certs, but not for the server side certificates.

When you're running containers, if they're spawned frequently, that can mean a significant amount of known_hosts file churning. It sounds like everyone funnels through a small set of bastions, so you'd only need to update them there, but I'm wondering if anyone else uses an SSH CA to solve that side of the equation?

All the ingredients needed are in `ssh-keygen`, but that doesn't feel super awesome to me.

Solving server identity is why I built https://github.com/square/sharkey


You've not seen that?

It's HostCertificate in sshd_config. Then in known_hosts:

@cert-authority *.example.com ssh-rsa AAAAB…

https://www.digitalocean.com/community/tutorials/how-to-crea...
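For completeness, issuing the host certificate itself is the same ssh-keygen flow with -h (names and lifetime illustrative):

    # On the CA: sign the server's host key
    ssh-keygen -s host_ca -I server1 -h -n server1.example.com \
        -V +52w /etc/ssh/ssh_host_rsa_key.pub

    # On the server, in sshd_config:
    HostCertificate /etc/ssh/ssh_host_rsa_key-cert.pub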

Edit: Oh, you mean for making them more short lived.


You can trust a CA easily enough. So do you just glob some shell scripts around ssh-keygen?

Short-lived is one solution to limiting their lifetime: the other is to use a CRL format.

Either way, you need software to manage the issuance and later revocation.

I suppose you could build this into your host imaging profile, or use config management software. I'm just interested in what people do.


This looks very similar to Netflix's BLESS package [1]

[1] HN discussion at https://news.ycombinator.com/item?id=11746425


Yes, I was coming to post the exact same thing.


Another promising solution that uses these principles is Teleport (http://gravitational.com/teleport)


If I'm reading this right, it looks like this improves over Netflix BLESS IIRC. My concern with something like BLESS was that it allows you in to root or some deploy/app user based on a cert trust chain, but doesn't seem to trace back to the actual user who's making the changes and/or differentiate between two users logged in at once. This embeds the user information in the cert. Looks nice.


How do they handle revocation? The problem I had with this approach in the past was no CRL support.


We can use RevokedKeys for that, but in fact we normally issue short-lived certificates. If we ever have a mass certificate leak, we'll just rotate the entire CA.
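For anyone curious, RevokedKeys points at a Key Revocation List that you distribute to every host (paths illustrative):

    # On the CA: create the KRL, revoking a user's public key
    ssh-keygen -k -f /etc/ssh/revoked_keys id_ecdsa.pub
    # Later additions update it in place:
    #   ssh-keygen -k -u -f /etc/ssh/revoked_keys another_key.pub

    # On every server, in sshd_config:
    RevokedKeys /etc/ssh/revoked_keys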


Also, we bind certificates to the hosts that requested them. If a certificate and its private key move somewhere else, they will be useless.

ssh-keygen -O source-address=1.2.3.4...


Interesting, so I'm assuming this wouldn't work for staff that roam with laptops?


We only do that in bastion servers. It would be a bit trickier with laptops, but doable I guess.


Makes sense. I missed that section, much clearer now, thanks. Any plans to release more details on how the bastion hosts are configured? Seems like that would be the complex bit to get right.


Absolutely nothing special about them. They trust the same LDAP/Kerberos infra the laptops are part of. We have two-factor authentication, which is pretty simple with PAM. And they're only accessible if you are in our corporate network.


With RevokedKeys. It looks like Facebook also uses short-lifespan certificates.


So staff that leave have 1 week of access to production systems?


No, they use RevokedKeys. For that to happen, the RevokedKeys update that runs when someone leaves would need to fail, as well as the network security on your servers and/or client VPN.


To be clear they said they "could" use RevokedKeys. See my reply further down for my line of thinking.


Again: you can revoke keys.


I get that, but you need to update the revocation list on every single server. At which point you've just undone the benefit of not having to manage `authorized_keys` on each server.


I would argue it's a lot easier to maintain a single CRL across your entire infrastructure (you can regularly push it to all hosts, easily monitor for non-matching versions through your monitoring tools, etc.) than it is to maintain a customised authorized_keys file for each server or server group (n keys across m servers is a lot of combinations, with no easy way to check correctness).


Sure, but wouldn't the complexity of managing distributed authorized_keys be significantly higher than propagating revokedkeys across a fleet of servers? You get the benefit of central management/administration with the benefit of distributed ssh keys. It's almost the best of both worlds.


It doesn't seem like it's more work, although you are right in that it's probably more user friendly as you don't have to wait for keys to propagate to various servers before you can login. We've previously used a cron script to pick up authorized_keys from a centrally published source, I guess that same infrastructure just moves to the CRL.


Thinking about it, you make a good point. The way you do it is "central" but much less of a single point of failure than something like LDAP. It would require a pretty contrived scenario for a failure to go undetected for an extended period, though that could be slightly more likely depending on how often you rotate keys. It would be nice if OpenSSH supported CRLs.


How anticlimactic. Nothing new or internal, just "hey here's what's in the OpenSSH manpage: TrustedUserCAKeys and AuthorizedPrincipalsFile".



