Hacker News
Scalable and secure access with SSH (facebook.com)
161 points by samber on Sept 12, 2016 | 61 comments



I found this really useful. It's lightweight, simple and powerful.

However, some pieces of advice in the article could prove catastrophic:

"You need to distribute ca.pub to your entire fleet. Remember, this is meant to be public, so complete access lockdown isn't the goal."

Complete access lockdown is vital. You need to prevent the CA's public key from being altered, corrupted or deleted across your fleet. When distributing the CA's public key, you definitely need to use a secure transport. The CA's public key is what your system of trust is built on. You need to make sure you are trusting the right thing.
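(For anyone following along: trusting that distributed file on each host is a one-line sshd_config directive, which is exactly why its integrity matters so much. Path is illustrative.)

    # /etc/ssh/sshd_config on every fleet host
    # ca.pub is the CA public key you distributed; whoever can replace
    # this file controls who may log in.
    TrustedUserCAKeys /etc/ssh/ca.pub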

"Copy the latter to the CA server and get it signed. Because this is public information, the transport isn't important. You can copy and paste, or fax it"

When copying the user's public key to the server, the transport is still important, otherwise a MITM could swap the public key and your CA would happily sign the wrong thing.


For (possibly) smaller scale I wrote a self-service SSH CA: https://github.com/nsheridan/cashier


> We do not place SSH certificates on laptops because it is difficult to control everything that individual employees run on them. Even though they are centrally managed, laptops are more prone to vulnerabilities than the bastion servers. Therefore, we do not trust them with SSH private keys.

If the laptop doesn't have an SSH private key, how do you SSH to the bastion host?

> In your .ssh/ directory, you'll see id_ecdsa and id_ecdsa.pub. Copy the latter to the CA server and get it signed. Because this is public information, the transport isn't important. You can copy and paste, or fax it; just don't copy id_ecdsa anywhere.

The transfer of id_ecdsa.pub doesn't need secrecy, but it does need integrity. You don't want to accidentally sign an attacker's public key.
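A rough sketch of a signing flow that preserves integrity, with an out-of-band fingerprint check (the key ID, principal and validity period are illustrative):

    # On the CA host, before signing, compare this fingerprint with
    # the one the user reads to you over a trusted channel:
    ssh-keygen -lf id_ecdsa.pub

    # Then sign, embedding the identity, allowed principals and lifetime:
    ssh-keygen -s user_ca -I alice@example.com -n alice -V +52w id_ecdsa.pub
    # This produces id_ecdsa-cert.pub, which is returned to the user.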


> If the laptop doesn't have an SSH private key, how do you SSH to the bastion host?

> These hosts use centralized LDAP and Kerberos installations to share account information, and they require two-factor authentication to protect against password leakage

It appears that you use your password + some unspecified second factor.


I thought this was mostly a solved problem? SSSD or FreeIPA and a bunch of LDAP servers. Cached credentials will keep you online in the case of a transient authentication issue.

HA LDAP is pretty much a solved problem as well. Say what you want about various Microsoft server products, Active Directory is second to none in this regard, and is a highly flexible and extremely robust service.

Am I missing something specific about Facebook that makes this not an option?


We have hundreds of thousands of servers and we also run sshd inside containers everywhere, so we probably have over a million endpoints we can SSH into. A cache is only good for things you log into frequently, like your workstation or a handful of servers. We don't accept the risk of not being able to log in to a system. Our only dependency is a signed certificate. If things go south really, really bad, we can just get the private key and sign certificates by hand.


That's an extremely good answer to the question I posed. Thank you.

Essentially the answer is that there's less immediate reliance on moving parts, which makes sense when you get to a certain level of scale and risk aversion -- a scale much greater than the one I work at, for sure.


Why do you run sshd inside containers? I'm sort of surprised that Facebook does anything but run servers inside them. Running another init seems like an antipattern to me.


Would be interested in hearing this as well.

Could be that you would want to have a more generic interface for your containers, and not have to integrate too tightly with your container framework?


> We don't accept the risk of not being able to login in a system.

There are plenty of ways to mitigate the risk of LDAP server unavailability. Besides, the document said that the bastion server uses it, so why isn't what's good for the bastion good enough for the rest of the fleet?


LDAP works really well until you have either too many users or too many endpoints. Then it doesn't work at all. There's no (cost-effective) way to get past all of the replication traffic flying all over the place. Eventually consistent models are awful for things like SSH keys and passwords, and that's exacerbated significantly when you add caching to the mix to ease the load on your LDAP cluster.

Adding extra brittle moving parts like LDAP and linux clustering to a greenfield deployment just doesn't seem attractive when a CA is so easy to run, and you should already have one if you're doing config management sanely.


Your premise would make sense if LDAP replication were expensive, but it isn't. LDAP database modifications are relatively rare: you only make them when a new user is added; a user's credentials change; or a user is deleted (which should be never for various post-termination accounting reasons). Even at Facebook, the change rate should be relatively low.

Also, you're making an assumption about the need for consistency when, as a practical matter, there's rarely a need for it. Caching is effective and practical for this use case, and you'd have to make a very strong case that it should be thrown out.

Finally, it is my experience that people grossly misjudge the difficulty of securely and scalably running a CA. Most such comments come from those who have never actually operated one.


> why isn't what's good for the bastion good enough for the rest of the fleet?

mfdutra's comment that you replied to said: If things go south really really bad, we can just get the private key and sign certificates by hand.

That's obviously an incomplete answer - it doesn't explain how you connect to the server if you can't get onto the bastion due to a directory failure - but the basic premise is that the core system is isolated & contains everything it needs to handle the authentication process.


That doesn't answer my question at all.


I'll assume that you didn't mean that to be as snarky as it sounds.

The bastion host is not strictly necessary in the process. Although mfdutra didn't spell it out explicitly, they clearly have a means for accessing their core servers without needing to get on to the bastion - that's where the signing by hand comes in. The benefits of using LDAP in that case justify the cost of having an external dependency - by integrating with the central password store, the bastion provides a path into the system that delivers the desired security features (centralised identity management, auditing back to individual user accounts), but other paths can exist if needed.

That trade-off isn't applicable to core servers. If they use LDAP (exclusively) for authentication and the directory is down then you're hosed. There's no other path into the server that you're trying to admin other than to get onto the server.

You could use LDAP with a fallback to some other scheme (like signed certs, or a small set of locally defined users managed with a password vault, etc.), but then you've got two different paths onto each server, and both paths need to be maintained, tested, and regularly audited. That's certainly possible, but why would you want to when you can just have one authentication mechanism that has zero dependencies?


I dislike guessing what someone's intent and rationale are, when we can simply ask, as we can here.

In any event:

First, the CA itself is a dependency: if it fails, no login certificates will issue, and users won't be able to log in. So they traded one kind of service dependency for another.

Second, it hasn't been established that a backup login path is available. In many environments this may not even be acceptable (e.g. PCI and billing systems, which I know Facebook has).


>service dependency

A CA is just some bytes, not a service. And it has been established that there's a backup login path: use (a copy of) the CA outside of the automated certificate signing service to manually sign the needed certificates.

They'd be screwed if they lost the CA's private key, but it is much easier to keep some data around than to keep a service functioning properly.


You're mistaking the CA certificate (i.e. the certificate that signs the login certificate) for the signing process itself.

In the scheme described, users don't log into servers with the CA's certificate and private key -- the CA's private key is always protected, preferably in an HSM of some sort. Instead, the CA issues a signed certificate to the user with a set of principals; and that latter certificate is the one used to log in.
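To make that concrete, you can inspect an issued certificate and see the principals and validity it carries; the CA key itself never leaves the signing host (output abbreviated and illustrative):

    ssh-keygen -L -f id_ecdsa-cert.pub
    #   Type: ecdsa-sha2-nistp256-cert-v01@openssh.com user certificate
    #   Signing CA: ECDSA SHA256:...
    #   Key ID: "alice@example.com"
    #   Valid: from 2016-09-12T00:00:00 to 2016-09-19T00:00:00
    #   Principals: alice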

So, the process that issues certificates (the "Authority", as opposed to the "Authority Certificate") is the one I'm concerned with here.


No I'm not.

>If things go south really really bad, we can just get the private key and sign certificates by hand.

Very clearly states that someone has access to the CA cert's private key outside the context of the automated signing service, and can use it to manually sign certs for users if the CA service is down. So the CA service can be bypassed if it goes down.


OK, so now you're dependent on the person, who might be unreachable during an emergency.

My point is that you cannot eliminate all dependencies. And if I must have dependencies, I'd rather put my trust in a well-engineered, time-tested, highly-available system. When properly implemented, LDAP + SSSD is such a system.

At any rate, an even faster and more reliable emergency response system would be to place a static user ID and password in a lockbox (virtual or physical) somewhere and use that to log in. You don't need a complex CA infrastructure to attain that; NSS fallbacks to static /etc files would suffice.
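That fallback is just the standard lookup order in /etc/nsswitch.conf: local files are consulted even when the directory is unreachable, e.g.:

    # /etc/nsswitch.conf -- try local files first, then SSSD/LDAP
    passwd:  files sss
    shadow:  files sss
    group:   files sss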


You have a panic room/war room, etc.: somewhere with significant physical security and dedicated ports that skip the bastion entirely. You might have to go for the axe method to get into the room during an incident, but you can if need be.


You can do that even if LDAP is your primary authentication method. NSS and PAM have supported fallback methods since their inception, nearly two decades ago.


How many engineers can sign into live servers and read information about my Facebook account? Rather scarily, it sounds like all of them!

How many engineers have access to / a copy of your global CA? If I get a copy of your global CA, can I create a cert allowing me to log into any of the machines?


Signed certificates work even when your AAA system is entirely offline. An individual server can make a valid authentication/authorization decision just by validating the signature.

There are some significant advantages to working that way.
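Right: the entire server-side configuration can be as small as this, with no network dependency at login time (paths illustrative):

    # /etc/ssh/sshd_config
    TrustedUserCAKeys /etc/ssh/ca.pub
    # Optional: map cert principals to local accounts, per user (%u)
    AuthorizedPrincipalsFile /etc/ssh/principals/%u

    # /etc/ssh/principals/root might then list the principals allowed
    # to log in as root, one per line, e.g.:
    #   prod-admins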


I guess they are worried about total network isolation for some hosts...even if the LDAP infra is up, if there is a segment isolated and you haven't logged in for a while, cached creds won't help.

That said, if it's that isolated how are you connecting to it in the first place :)

I agree with you though, we just use AD and SSSD, with the additional step of having extended our AD schema to store pubkeys and sudo roles for the Linux part of our infra.


I haven't read this article closely yet, but based on other similar protocols, I _think_ the key element is, can you still log in to your servers when all your LDAP is catastrophically down for some reason. Sure, it's unlikely, but for someone like Facebook that's an unacceptable risk. This also means you don't have to actually _worry_ about maintaining an HA LDAP infrastructure that can support all the SSH auth you need.


That exact scenario is covered by the concept of "cached credentials" in the case of SSSD. If your authentication services are down (or the box you're attempting to log into is offline), there's a period in which you can log in using the last-used password for your account.

Granted, you only have the configurable cache timeout in which to solve your problem, but that's usually on the order of days. If your authentication infrastructure is down for longer than that, you have bigger fish to fry.
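For reference, this is controlled by a couple of sssd.conf settings (domain name and value illustrative):

    # /etc/sssd/sssd.conf
    [domain/example.com]
    cache_credentials = True

    [pam]
    # Days that cached credentials remain usable while offline
    # (0 = no limit)
    offline_credentials_expiration = 7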


Cached credentials only save the day if the cache has been populated. Engineers should be logging in to any given host infrequently enough that the cache is rarely populated.


Is it not possible to force the cache to have a particular set of users (say, critical devops folks) always populated?


The implementation seems quite novel, but I can't shake the feeling that it's a bit over-engineered. The additional complexity of this over a standard solution (e.g. FreeIPA + SSSD) may introduce other issues that need to be mitigated.

The approach of using certificates instead of a cache like SSSD just shifts the issue from cache expiration over to certificate expiration. With certificates, you are likely to set some period after which user certificates expire, and if you have an expired certificate and the CA is down, you still cannot log in. If you don't expire your certificates, you have a security risk.

With FreeIPA you can also map public keys to specific users and have sshd pull authorized_keys from SSSD instead of the actual file. So if a user leaves, their keys get removed; no need to worry about old keys lingering on your servers.
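That lookup is typically wired in via sshd's AuthorizedKeysCommand, e.g.:

    # /etc/ssh/sshd_config
    # sshd asks SSSD for the user's keys instead of reading a file;
    # the username is passed to the command automatically.
    AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
    AuthorizedKeysCommandUser nobody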

As somebody already mentioned, LDAP (which FreeIPA uses as backend) has a battle-proven HA capability. With some sane configurations of the underlying servers (separate physical equipment, networking, etc), you don't need to worry about your backend mysteriously going down.


Tangent: "When designing a security system, regardless of purpose or protocol, you need to think of authentication and authorization separately. Authentication securely verifies that the client is who it claims to be, but does not grant any permissions. After the successful authentication, authorization decides whether or not the client can perform a specific action."

These two operations are often abbreviated "authn" (for autheNtication) and "authz" (for authoriZation) in security frameworks.


In fact, we use the terms authn, authz, authnz and AAA a lot at Facebook. :-)


When I was doing security work, AAA stood for "authorization, authentication and audit"


The last one was also known as "accounting."


But not referred to in the article, which I found surprising since they're such common terms for those in the know. :)


How confusing for those of us in British English countries!

(or should I say 'confuzing'? :p )


How do you manage what servers clients trust?

I've seen a few SSH CA projects for signing user certs, but not for the server side certificates.

When you're running containers, if they're spawned frequently, that can mean a significant amount of known_hosts file churning. It sounds like everyone funnels through a small set of bastions, so you'd only need to update them there, but I'm wondering if anyone else uses an SSH CA to solve that side of the equation?

All the ingredients needed are in `ssh-keygen`, but that doesn't feel super awesome to me.

Solving server identity is why I built https://github.com/square/sharkey


You've not seen that?

It's HostCertificate in sshd_config. Then in known_hosts:

@cert-authority *.example.com ssh-rsa AAAAB…

https://www.digitalocean.com/community/tutorials/how-to-crea...
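For completeness, issuing the host certificate itself is the same ssh-keygen flow with -h (names and lifetime illustrative):

    # On the CA: sign the server's host key
    ssh-keygen -s host_ca -I server1 -h -n server1.example.com \
        -V +52w /etc/ssh/ssh_host_rsa_key.pub

    # On the server, in sshd_config:
    HostCertificate /etc/ssh/ssh_host_rsa_key-cert.pub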

Edit: Oh, you mean for making them more short lived.


You can trust a CA easily enough. So do you just glob some shell scripts around ssh-keygen?

Short-lived is one solution to limiting their lifetime: the other is to use a CRL format.

Either way, you need software to manage the issuance and later revocation.

I suppose you could build this into your host imaging profile, or use config management software. I'm just interested in what people do.


This looks very similar to Netflix's BLESS package [1]

[1] HN discussion at https://news.ycombinator.com/item?id=11746425


Yes, I was coming to post the exact same thing.


Another promising solution that uses these principles is Teleport (http://gravitational.com/teleport)


If I'm reading this right, it looks like this improves over Netflix BLESS IIRC. My concern with something like BLESS was that it allows you in to root or some deploy/app user based on a cert trust chain, but doesn't seem to trace back to the actual user who's making the changes and/or differentiate between two users logged in at once. This embeds the user information in the cert. Looks nice.


How do they handle revocation? The problem I had with this approach in the past was no CRL support.


We can use RevokedKeys for that, but in fact we normally issue short-lived certificates. If we ever have a mass certificate leak, we'll just rotate the entire CA.
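For anyone curious, RevokedKeys points at a Key Revocation List that you distribute to every host (paths illustrative):

    # On the CA: create the KRL, revoking a user's public key
    ssh-keygen -k -f /etc/ssh/revoked_keys id_ecdsa.pub
    # Later additions update it in place:
    #   ssh-keygen -k -u -f /etc/ssh/revoked_keys another_key.pub

    # On every server, in sshd_config:
    RevokedKeys /etc/ssh/revoked_keys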


Also, we bind certificates to the hosts that requested them. If a certificate and its private key move somewhere else, they will be useless.

ssh-keygen -O source-address=1.2.3.4...


Interesting, so I'm assuming this wouldn't work for staff that roam with laptops?


We only do that in bastion servers. It would be a bit trickier with laptops, but doable I guess.


Makes sense. I missed that section, much clearer now, thanks. Any plans to release more details on how the bastion hosts are configured? Seems like that would be the complex bit to get right.


Absolutely nothing special about them. They trust the same LDAP/Kerberos infra the laptops are part of. We have two-factor authentication, which is pretty simple with PAM. And they're only accessible if you are in our corporate network.


With RevokedKeys. It looks like Facebook also uses short-lifespan certificates.


So staff that leave have 1 week of access to production systems?


No, they use RevokedKeys. For that to happen, the RevokedKeys update that runs when someone leaves would need to fail, as well as the network security on your servers and/or client VPN.


To be clear they said they "could" use RevokedKeys. See my reply further down for my line of thinking.


Again: you can revoke keys.


I get that, but you need to update the revocation list on every single server. At which point you've just undone the benefit of not having to manage `authorized_keys` on each server.


I would argue it's a lot easier to maintain a single CRL across your entire infrastructure (you can regularly push it to all hosts, easily monitor for non-matching versions through your monitoring tools, etc.) than it is to maintain a customised authorized_keys file for each server or server group (n keys across m servers is a lot of combinations, with no easy way to check correctness).


Sure, but wouldn't the complexity of managing distributed authorized_keys be significantly higher than propagating revokedkeys across a fleet of servers? You get the benefit of central management/administration with the benefit of distributed ssh keys. It's almost the best of both worlds.


It doesn't seem like it's more work, although you are right in that it's probably more user friendly as you don't have to wait for keys to propagate to various servers before you can login. We've previously used a cron script to pick up authorized_keys from a centrally published source, I guess that same infrastructure just moves to the CRL.


Thinking about it, you make a good point. The way you do it is "central" but much less of a single point of failure than something like LDAP. It would require a pretty contrived scenario for a failure to go undetected for an extended period, though that could be slightly more likely depending on how often you rotate keys. It would be nice if OpenSSH supported CRLs.


How anticlimactic. Nothing new or internal, just "hey here's what's in the OpenSSH manpage: TrustedUserCAKeys and AuthorizedPrincipalsFile".



