My company has an internal bit of infrastructure that takes what I think is a somewhat novel approach: it lets us never have any secrets stored unencrypted on disk. There's a server (a set of servers, actually, for redundancy) called the secret server, and its only job is to run a daemon that owns all the secrets. When an app on another server is started up, it must be done from a shell (we use cap) which has an SSH agent forwarded to it. In order for the app to get its database passwords and various other secrets, it makes a request to the secret server (over a TLS-encrypted socket), which checks your SSH identity against an ACL (different identities can have access to different secrets) and does a signing challenge to verify the identity; if all passes muster, it hands the secrets back. The app process keeps the secrets in memory and your cap shell disconnects, leaving the app unable to fetch any more secrets on your behalf.
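To make the challenge/response concrete, here's roughly what the client side of that looks like. This isn't our actual code; the wire framing, hostname, port, and secret names are invented for illustration.

    # Rough sketch of the client side (framing and names invented): connect over
    # TLS, sign the server's nonce with a key from the forwarded agent, and get
    # back whatever secrets the ACL allows for that identity.
    import json, socket, ssl
    import paramiko  # reaches the forwarded agent via SSH_AUTH_SOCK

    def fetch_secrets(host="secretserver.internal", port=4433):
        ctx = ssl.create_default_context(cafile="/etc/ssl/internal-ca.pem")
        with socket.create_connection((host, port)) as raw, \
             ctx.wrap_socket(raw, server_hostname=host) as tls:
            nonce = tls.recv(32)                        # server-issued challenge
            key = paramiko.Agent().get_keys()[0]        # private key never leaves the workstation
            sig = key.sign_ssh_data(nonce)
            sig_bytes = sig if isinstance(sig, bytes) else sig.asbytes()
            tls.sendall(json.dumps({
                "pubkey": key.get_base64(),
                "signature": sig_bytes.hex(),
                "want": ["db_password"],
            }).encode())
            return json.loads(tls.recv(65536))          # kept in memory only, never written to disk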
The other kink is that the secret server itself reads its secrets from a symmetrically-encrypted file, and when it boots it doesn't actually know how to decrypt it. There's a master key for this that's stored GPG-encrypted so that a small number of people can retrieve it and use a client tool that sends the secret server an "unlock" command containing the master key. So any time a secret server reboots, someone with access needs to run:

    gpg --decrypt mastersecret | secret_server_unlock_command someserver
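On the server side, the unlock step amounts to holding the ciphertext until someone hands over the master key. A minimal sketch, assuming (purely for illustration) a Fernet-encrypted JSON file rather than whatever cipher and layout are really in use:

    # Illustrative only: assumes a Fernet-encrypted JSON blob on disk; the real
    # file format and symmetric cipher could be anything.
    import json
    from cryptography.fernet import Fernet

    class SecretStore:
        def __init__(self, path="/etc/secret_server/secrets.enc"):
            self.path = path
            self.secrets = None                 # stays locked until unlock() is called

        def unlock(self, master_key: bytes):
            with open(self.path, "rb") as f:
                ciphertext = f.read()
            # plaintext exists only in this process's memory, never on disk
            self.secrets = json.loads(Fernet(master_key).decrypt(ciphertext))

The unlock client is then little more than "read the key from stdin (i.e. from gpg --decrypt) and ship it over the same authenticated channel."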
There are some obvious drawbacks to this whole system (constraining pushes to require an SSH agent connection is a biggie and wouldn't fly in some places, and agent forwarding is not without its security implications) and some obvious problems it doesn't solve (secrets are obviously still in RAM), but on the whole it works very well for distributing secrets to a large number of apps, and we have written tools that have all but eliminated any individual's need to ever actually lay eyes on a secret (e.g. if you want to run any tool in the mysql family, there's a tool that fetches the secret for you and spawns the one you want with MYSQL_PWD temporarily set in the env, so you need not copy/paste the password or be tempted to stick it in a .my.cnf).
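That mysql wrapper is only a few lines of glue. A hypothetical version (fetch_secrets() is a stand-in for our client library, and the secret name is made up) looks something like:

    # Hypothetical wrapper: fetch the password, put it in the child's environment
    # only, and replace this process with the requested mysql-family tool.
    import os, sys
    from secret_client import fetch_secrets   # stand-in for the client sketched above

    def main():
        secrets = fetch_secrets()
        env = dict(os.environ, MYSQL_PWD=secrets["db_password"])
        tool, args = sys.argv[1], sys.argv[2:]            # e.g. "mysql", "mysqldump"
        os.execvpe(tool, [tool] + args, env)              # password never touches argv or a file

    if __name__ == "__main__":
        main()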
This reminds me of OpenStack Barbican (previously called CloudKeep... kinda), initially built by Rackspace. There's a good intro video at [1].
One of the interesting (and optional) things it does is provide an agent to run on your instances that require the secrets. The agent implements a FUSE filesystem, and access to this filesystem is controlled by policy. For example, a policy can say "Allow exactly 1 read of /secrets/AWS.json within 120 seconds of boot". Any out-of-policy access attempt can cause the instance to be blacklisted, preventing any future secret access, etc.
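Conceptually the policy check is just "N reads of a path within T seconds of boot". A toy sketch of that enforcement (the policy format here is made up, not Barbican's actual schema):

    # Toy sketch of a "1 read of /secrets/AWS.json within 120s of boot" rule;
    # the policy structure is invented, not Barbican's real schema.
    import time

    BOOT_TIME = time.monotonic()

    POLICY = {
        "/secrets/AWS.json": {"max_reads": 1, "within_seconds": 120},
    }
    read_counts = {}

    def allow_read(path):
        rule = POLICY.get(path)
        if rule is None:
            return False                                       # default deny
        if time.monotonic() - BOOT_TIME > rule["within_seconds"]:
            return False                                       # too long after boot
        read_counts[path] = read_counts.get(path, 0) + 1
        if read_counts[path] > rule["max_reads"]:
            return False                                       # out of policy -> candidate for blacklisting
        return True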
This looks really great. I watched the video and the rationale and tradeoffs they discussed sounded exactly like conversations we had back when building our system. The FUSE filesystem and agent panics are features that I wish I'd thought of.
The system sounds very well thought-out, though probably not applicable at my $work location.
> When an app on another server is started up, it must be done from a shell
That's a no-go for many setups. It doesn't integrate well with how Linux distros usually start services (systemd, upstart, sysv init, ...), and means you have to have another way to manage dependencies between your services.
> When an app on another server is started up, it must be done from a shell (we use cap) which has an SSH agent forwarded to it. In order for the app to get its database passwords and various other secrets, it makes a request to the secret server (over a TLS-encrypted socket), which checks your SSH identity against an ACL
At this point you could have used ssh right away, no? Any reason you used TLS + checking SSH agent instead?
> That's a no-go for many setups. It doesn't integrate well
> with how Linux distros usually start services (systemd,
> upstart, sysv init, ...)
Change the daemon config file to use a small wrapper script, which initializes the SSH environment and then execs the target binary. Assuming a reasonable setup, this should be trivial.
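Something like this as the init system's exec target, i.e. a wrapper that drops the agent socket into the environment and then execs the real binary (paths are placeholders):

    #!/usr/bin/env python3
    # Illustrative wrapper for an init system: point the service at an agent
    # socket, then replace ourselves with the real daemon. Paths are placeholders.
    import os, sys

    os.environ["SSH_AUTH_SOCK"] = "/run/deploy/agent.sock"   # wherever the forwarded agent socket lives
    os.execv("/usr/local/bin/the-real-app", ["/usr/local/bin/the-real-app"] + sys.argv[1:])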
> At this point you could have used ssh right away, no?
> Any reason you used TLS + checking SSH agent instead?
It sounds like they take an SSH identity certificate from the agent, send it via TLS, and then the remote process verifies it. This would have fewer potential security issues than trying to lock down a user's SSH login shell.
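i.e., on the receiving end the check is roughly "is this public key in the ACL, and does its signature over my nonce verify". A sketch, with the wire format again invented and an RSA key assumed for simplicity:

    # Server-side sketch: look the presented key up in an ACL and verify the
    # agent's signature over the nonce we issued. Framing is invented.
    import base64
    import paramiko

    ACL = {
        # "<base64 public key blob>": ["db_password"],   # populated from config
    }

    def verify_client(pubkey_b64, signature_blob, nonce):
        allowed = ACL.get(pubkey_b64)
        if allowed is None:
            return None                                  # identity not in the ACL
        key = paramiko.RSAKey(data=base64.b64decode(pubkey_b64))
        if not key.verify_ssh_sig(nonce, paramiko.Message(signature_blob)):
            return None                                  # bad signature
        return allowed                                   # names of secrets this identity may read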
> Change the daemon config file to use a small wrapper script, which initializes the SSH environment and then execs the target binary. Assuming a reasonable setup, this should be trivial.
Well, the point is that the SSH session needs to have an agent forwarded from somewhere else. If the host on which the service runs can initiate it itself, the whole security aspect is gone.
> This would have fewer potential security issues than trying to lock down a user's SSH login shell.
Locking down a login shell (usually by not running a shell in the first place) is a solved problem; gitolite, for example, uses it as the basis of its architecture. Yes, you have to be careful, but you must also be careful when manually validating certificates.
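For reference, the gitolite-style lockdown is a forced command plus option flags in authorized_keys, along the lines of (the command path and key are placeholders):

    command="/usr/local/bin/secret-fetch alice",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... alice@example.com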
> At this point you could have used ssh right away, no? Any reason you used TLS + checking SSH agent instead?
Yeah, using the SSH login method is actually quite slow for something you want to call at app startup on N instances during a push (at a minimum, your process responsible for whatever gatekeeping you do has to be respawned for every request, which necessarily puts a lower bound on the latency). I'm sure this could have been tracked down and optimized, but as jmillikin points out, another downside is that the additional per-user config can get kind of messy and error prone. Implementing logic like this at the .ssh/config level is (in my opinion) kind of easy to goof up and hard to test.
If anyone's interested in a somewhat out-of-the-box version of what's described above, using a Consul server/cluster to hold this information should give you basically everything ntucker listed. It's pretty trivial to set up, and configuring it to store its data on an encrypted partition is also pretty simple. It's got ACLs and can support TLS connections as well. It's also got a bunch of features that the above system doesn't have, like being distributed (redundancy isn't the same thing as consensus) and datacenter-aware (I'd prefer to have different secrets per datacenter, when possible).
We've been using it to store our application secrets for some time and have had no complaints.
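For the curious, reading a value back out of Consul is just an HTTP call against the KV endpoint with an ACL token. The key path, token, and addresses below are placeholders:

    # Minimal read from Consul's KV store over TLS with an ACL token.
    # The key path, token, and addresses are placeholders.
    import base64
    import requests

    def get_secret(key="secret/app/db_password", token="REPLACE_ME"):
        resp = requests.get(
            "https://127.0.0.1:8501/v1/kv/" + key,
            params={"token": token},                      # Consul ACL token
            verify="/etc/consul.d/ca.pem",                # internal CA for the TLS listener
        )
        resp.raise_for_status()
        return base64.b64decode(resp.json()[0]["Value"])  # KV values come back base64-encoded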
This sounds like a pretty standard bastion server configuration. The use of SSH is novel; usually I see the bastion address provided as a command-line option and a TLS certificate used to authenticate the client.