Great to hear that tag-based auth is coming. I'm at a loss about how to use it without something like that. It looks like you either have to handle each instance individually (which makes no sense when AWS has been pushing auto scaling and spot instances for a decade – instances are ephemeral in our world), or have one rule that applies to everything in the account. To me, being able to limit access to groups of instances is a required feature.
I really miss projects and folders (current firm is AWS, previous was GCP). I find GCP more usable on a number of other fronts though.
Whenever there's a service that maps to the other, I just always seem to find the GCP service easier/faster to learn and use effectively. BigQuery, Stackdriver, Pub/Sub, Dataproc, Compute, load balancer, et al. Getting stuff done with those is miles easier, in my experience, than with the comparable AWS offerings, at least if you don't already have extensive experience with one over the other.
I’m puzzled. Many of the comments here seem focused on browser-based ssh which isn’t new, or even the most significant thing here. Using IAM instead of passing around .pem files feels like a huge improvement.
This will look up your username in AWS IAM, and if it has the right permissions, it creates an account and copies the public ssh key associated with that user.
Google Compute Engine has had this functionality for years (at least the browser based SSH). Furthermore, Google's free Cloud Shell feature is fantastic.
Over the years, AWS has put their focus entirely on "Enterprise" customer functionality as opposed to "developer friendly" capabilities.
gcloud has had it, but as of at least a month ago there are terrible races. You can create a machine, log in with normal ssh, log in with cloud shell, and then the next normal ssh login will fail because cloud shell will modify the machine’s ssh keys.
I once found a race in the UI, reported it with complete repro instructions, and then Google made me have to do a 30 minute hangout meeting where I had to repro the bug in front of the Google engineer. The call resulted in two different tickets... we found another bug.
So gcloud might have some nicer features, but the engineers building the web app seem to have some basic misunderstandings about concurrency. The BigQuery UI (new one not the old one) is similarly riddled with bugs.
FWIW, it sounds like one advantage of the new AWS service is that it will provision a new SSH key each time you connect. Whereas, I _think_ the GCP one provisions one key per machine.
Google's browser-based method uses a new key each time, set to expire from the server-side authorized_keys file in a short time (minutes). I strongly suspect, but haven't checked, that this is also true for their Cloud Console mobile app's SSH feature.
You're mostly right about gcloud, with some nuances for multi-user or network-homedir environments among other exceptions.
Gcloud does however have one of the snazziest possible hacks for resetting and securely communicating Windows account passwords:
Transcript from Alphabet Q1 2019 Earnings Call; April 29, 2019 [1]
> Sundar Pichai, CEO Google:
> We are also deeply committed to becoming the most customer-centric cloud provider for enterprise customers, and making it easier for companies to do business with us thanks to new contracting, pricing, and more. Today, 9 of the world's 10 largest media companies, 7 of the 10 largest retailers, and more than half of the 10 largest companies in manufacturing, financial services, communications, and software use Google Cloud.
> Some of the companies that we announced at Next included: The American Cancer Society and McKesson in Healthcare; Media and Entertainment companies like USA Today and Viacom; Consumer Packaged Goods brands like Unilever; Manufacturing and Industrial companies like Samsung and UPS; and Public Sector organizations like Australia Post.
> Finally, to support our customers' growth, we also announced the addition of two new Cloud regions in Seoul and Salt Lake City, which we plan to open in 2020. These new Cloud regions will build on our current footprint of 19 Cloud regions and 58 data centers around the world.
To be fair, given their history of shutting things down, it’s easy to see anything that isn’t selling ads as a hobby for google. Even if that’s not the case, it can be hard to shake that gut-level feeling. I suspect that’ll be the hardest thing for most people to get over when considering a google stack.
Every now and then I consider using GCP for a new project, because their tech is obviously better. I never do though, because every time, I stumble across somebody with a horror story.
You know how it goes. They built their business on Google's platform, and it was a dream until some AI detected a pattern of activity it didn't like and they were excommunicated. The app was shut down, the website stopped getting traffic, and all the money they had charged customers via Google Pay was frozen. No appeals, emails go to /dev/null and after a month of campaigning on social media they finally get an email from an intern saying that after review, they won't be changing the automated decision.
This is bad news for ScaleFT, which provided this service via bastion servers (although not IAM based).
Rackspace managed AWS environments use this for high compliance systems.
The problems it solves are a) that login attempts are logged on a separate system for compliance and b) user management is handled in a centralized way. Both are handled with EC2 Instance Connect.
Another competitor to this is hashicorp vault, which does both certificate based ssh and AWS authentication. (I find the certificate based approach better though)
They could present value if their product works across cloud-providers (not familiar with their business model, but IAM is generally regarded as one of the biggest ways you can get locked into AWS).
There's also SSM Session Manager [1]. Not exactly ssh, but you get mostly the same features with ssh access completely disabled and the whole session being logged to an S3 bucket or some log aggregation service.
I'm a bit confused about this one.. we have been using SSM Session Manager for quite some time now and this looks like it does the same. We also export all logs during the session with SSM and you can see which user initiated the session. What am I missing here?
For dev environments SSH is essential, but in production environments I 100% agree with using SSM Session Manager instead of SSH. Getting terminal access to a production server is sometimes necessary but it ought to be temporary access, all actions are logged, and treated as an exception situation rather than routine. SSM session manager provides all that without requiring SSH keys and SSH firewall rules in production.
SSM session manager is basically a HTTP wrapper over a shell. You have to use browser for SSM which mostly works until it doesn't. I had trouble sometimes copy pasting to it.
This new service is basically a managed SSH so things like port forwarding etc will work. With SSM you can't do port forwarding etc because it is not SSH aware.
I really really really try to not need serial console access to my machines. I try to only rarely need SSH access.
But when you've got some sort of bug or issue that you're not getting any metrics out of, no logs recorded, no kernel crash dump, nothing sent over netconsole, nothing showing up on the instance console screenshot... Sometimes serial console is what you need.
But, for the borked networking case, I'd recommend not modifying your networking on live instances. Make your changes on a test instance, figure out what works, and add it to your configuration management ;)
I'm honestly a little disappointed here. I feel like this is not fully baked, but it is so close.
Unlike SSM, Instance connect goes direct over SSH - so you either need to be inside of your AWS network, on a bastion host that can route to your AWS network, or use a public IP address.
It would be great if they combined this functionality with the HTTP wrapping capability so that I do not need to expose SSH/route to SSH ports in any way but can also use IAM policy to control which unix user a given IAM principal can land in the host as (Example use case would be I would only want a certain class of user to land as a user with sudo/root access).
This is still valuable to my use case, and we'll go ahead with it using the bastion approach most likely until they hopefully integrate this with their HTTP SSH wrapper.
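On the IAM half of that wish: as far as I can tell, Instance Connect already exposes an `ec2:osuser` condition key on the `ec2-instance-connect:SendSSHPublicKey` action, so a policy can pin which OS user a principal may push a key for. Below is a minimal sketch of such a policy document built in Python. The helper name `instance_connect_policy` and the ARN are made up for illustration, and this does nothing for the "don't expose SSH at all" half of the complaint.

```python
import json


def instance_connect_policy(instance_arn, os_user):
    """Build an IAM policy document allowing key pushes for one OS user only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ec2-instance-connect:SendSSHPublicKey",
                "Resource": instance_arn,
                # ec2:osuser restricts the login name the key may be pushed for.
                "Condition": {"StringEquals": {"ec2:osuser": os_user}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    arn = "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"
    print(json.dumps(instance_connect_policy(arn, "ec2-user"), indent=2))
```

A "sudo-capable" class of user would then get a second statement (or a separate policy) naming the privileged OS account, while everyone else only gets the unprivileged one.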
But sigh I just built a PKI infrastructure provisioning system using a gigantic shell script, maintenance user with sudo permissions and ssh access where a master node would command a fleet of slave nodes.
I guess all of my work was for naught since this seems to cover some of my needs for user and ssh key provisioning.
Oh well, it'll work elsewhere on all other clouds. And I guess I should release it publicly, it's just not pretty enough yet. Every time I do, gremlins come out of the bushes complaining that the code isn't elegant enough for them.
Do you still need to create users manually on each machine? There have also been many tools out there to pull the ssh key from IAM and use it via AuthorizedKeysCommand previously, but my problem is always creating the user accounts, especially if you don't want to keep a separate list in LDAP/Kerberos (or similar, like Active Directory).
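For context, the "pull the key from IAM" tools mentioned here tend to look roughly like the sketch below. It assumes the local login name equals the IAM user name and that the user has uploaded an SSH public key to IAM. The function name `active_ssh_keys` is made up, and pagination of the key listing is ignored for brevity. Note that it only answers "which keys?"; the local account still has to exist, which is exactly the problem described above.

```python
import sys


def active_ssh_keys(iam, user_name):
    """Return the active SSH public keys that IAM holds for user_name."""
    keys = []
    listing = iam.list_ssh_public_keys(UserName=user_name)
    for meta in listing["SSHPublicKeys"]:
        if meta["Status"] != "Active":
            continue  # skip keys the user has deactivated
        detail = iam.get_ssh_public_key(
            UserName=user_name,
            SSHPublicKeyId=meta["SSHPublicKeyId"],
            Encoding="SSH",  # ask for the OpenSSH one-line format
        )
        keys.append(detail["SSHPublicKey"]["SSHPublicKeyBody"])
    return keys


def main(argv):
    # boto3 is imported here so the helper above can be exercised without it.
    import boto3

    for key in active_ssh_keys(boto3.client("iam"), argv[1]):
        print(key)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

sshd would call it through something like `AuthorizedKeysCommand /usr/local/bin/iam_keys.py %u` together with an `AuthorizedKeysCommandUser` line in sshd_config (the script path is hypothetical).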
This is what I'm wondering as well. Does the fact that everything is logged by what an IAM user does work as compliance, or are individual user accounts on the operating system still required?
Interesting, but I connect via ssh from Windows 10 PowerShell. I wonder why this isn't a standard use-case. I suppose I can get it to work as long as it's "openssh" or something compatible
Not gunna lie, it makes me sad that we need some huge, fancy graphics engine just to emulate a 25 year old technology. Why are people so obsessed with their browser? No matter how much JS you layer on, it'll never be as fast as a terminal.
I mean, this is definitely cool, but we should also try to stop using ssh so much. There's a long list of reasons why using ssh leads to bad things (but not an anti-pattern - I wish people would stop using that phrase to mean anything that sometimes leads to bad things) so I just hope this functionality doesn't exacerbate its use.
I am tired of hearing about why I shouldn't use SSH. SSH is fantastic and I always want it enabled on every single server so when things go wrong I can debug. I've had AWS "solutions architects" puzzled about why I'm still using SSH but they can never justify any other solution. First they tell me that I should just log everything we'd want to look at. I do log everything I think would be useful, but unforeseen things happen and it is really handy to have shell access to the misbehaving server. Then they suggest I use Systems Manager to perform server updates but I have no need for that because I use an immutable deployment model.
Managing server access in a multi-account organization is a real issue, though. I currently manage 11 AWS accounts and the best solution I've implemented so far is extending NSS and configuring sshd to query our identity provider (Okta) for user/group information and SSH public keys. Each type of server is configured to permit access to a subset of Okta groups. For example, members of the DevOps group get full privileges, anyone else that has a use-case for using SSH (like in a QA environment) gets some form of limited access. With this in place, I can grant/revoke privileges and manage developer keys all from one central location.
I hear you, but every time I ask "what are you doing that you need SSH?", the answer comes down to "well I have these crappy random applications and I don't have enough visibility into the system or the apps." For those cases, I think SSH is a crutch that keeps the system from maturing.
Another way to think about it is, if you're an SRE, you want to eliminate toil, and an interactive SSH terminal is toil.
- Encourages you to go to instances and check stuff rather than improve monitoring / health checks.
- Can do quick fixes on a few boxes rather than re-running the deploy. Great! But terrible when the person who knows how to do that is away.
- Tailing log files rather than centralised log management for all the things.
- Trying things out / quickly checking something in production rather than being rigorous about keeping test / staging in sync with prod.
The “problem” is ssh is such a great affordance (until you have tons and tons of instances and you can’t do anything by hand anymore) that it means you don’t need to fix internal processes and tools around deployment, configuration and monitoring.
If there’s no workaround you feel the pain and will be forced to set things up right, usually with benefits to security and repeatability.
As is often the case, the best thing about ssh (in terms of managing infra instances) is also the worst thing.
With that said at very small scales it might be overkill to automate all the things so sure, fill in the gaps with ssh and a wiki page.
> Encourages you to go to instances and check stuff rather than improve monitoring / health checks.
I don't think it does, well it doesn't when you have > 40 machines anyway. Plus it doesn't give the ability to compare and contrast simply. (graphs are _awesome_)
> Tailing log files rather than centralised log management for all the things.
Yes, I tend to agree. But proper centralised logging is either exceptionally hard, or a hefty splunk tax. That also encourages people to derive graphs from logs, which is arse about face. Graphs first, logs when you are desperate.
> Trying things out / quickly checking something in production
I can see this, but normally one would expect people to not have general access to prod, if they are going to do that...
SSH is wonderful. The problems are more what it enables, and what it lacks (or isn't designed to do).
For example, managing ssh keys for an individual is gloriously simple, but managing them for a large organization is a huge headache. You want to use ssh certificates, but even those are implemented in a weird way, and really you should use an SSO system for auth. (This makes that easier/better, so, yay?)
When people start sshing into production servers, they end up making local changes. They focus more on the "pets" aspect of managing systems rather than as "cattle". They have to install a litany of extra software to diagnose and troubleshoot bugs, rather than expose system metrics and tightly control the app environment and its operation.
Remote access to production app servers is basically a backdoor waiting to happen, and may violate corporate security policies. When you have local user access to a Linux host, it's almost guaranteed you can privesc to root.
Finally, almost everyone I have ever seen will either force-ignore/auto-accept host key changes, or just accept them blindly (because IPs and hostnames may change, and there may be multiple environments you haven't logged in to, etc). This completely defeats the purpose behind MITM protection, which is the main intent of using SSH, though these days its other features may be arguably more of a reason to use it.
And for the tech hipsters out there: "it isn't serverless!!"
>They have to install a litany of extra software to diagnose and troubleshoot bugs, rather than expose system metrics and tightly control the app environment and its operation.
There's a lot of things that require more than easily exported system metrics and logs to troubleshoot.
While I've played around with using PCP's perf plugin to try and remotely do things with perf, generate flamegraphs, etc., it doesn't work nearly as well as just SSH'ing into the thing and running perf directly, especially if the perf data file is going to be large. I don't see how you could do serious performance engineering work without SSH access.
But, I think I'm nitpicking here, because I generally agree that there should be very little to no reason to login to servers via SSH day to day.
In the ‘pets vs cattle’ analogy, it totally makes sense that even with cattle, occasionally you bring one in for a checkup by a vet to see if you can detect any problems that might affect the herd. Ssh into a production box to check everything is working as expected and take some readings. Sure.
On the other hand, I tend to lean more towards a ‘wild animals’ model, where, sure, you can tranquilize one and bring it in to look over, but once it’s got the smell of humans on it, it’s doomed if you let it back out in the wild again.
Once you ssh into a production box, it is forever tainted. Sure, poke around in it, install some perf tools to run some diagnostics, learn what you can about its behavior. But then, rather than putting it back into the wild to serve traffic, out of mercy, you should destroy it and replace it with a clean instance.
I don't disagree with this, but, at the same time, if I haven't made any actual changes to the application, I'll generally not worry about manually taking it out of service, because autoscaling will be getting rid of it soon enough anyway.
As an alternative, I try to implement solutions that can install and run software out of band and still get what you need without a persistent connection and opening up security groups (example: AWS SSM Run Command).
I make extensive use of Run Command (And Automations!) but they're not a replacement for every use case, and very much not a replacement for the specific use case you're replying to.
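For anyone who hasn't used it, the Run Command approach discussed here looks roughly like this from boto3. This is a sketch only: it assumes the instance runs the SSM agent with a suitable instance profile, and the helper names `run_shell_command` and `command_output` are made up for illustration.

```python
def run_shell_command(ssm, instance_id, command):
    """Ask SSM to run one shell command on an instance; return the command ID."""
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",  # AWS-provided document for shell commands
        Parameters={"commands": [command]},
    )
    return response["Command"]["CommandId"]


def command_output(ssm, command_id, instance_id):
    """Fetch the status and stdout of a previously sent command."""
    result = ssm.get_command_invocation(
        CommandId=command_id,
        InstanceId=instance_id,
    )
    return result["Status"], result["StandardOutputContent"]


# Usage sketch (needs boto3 and AWS credentials; the instance ID is hypothetical):
#   import boto3
#   ssm = boto3.client("ssm")
#   cid = run_shell_command(ssm, "i-0123456789abcdef0", "uptime")
#   print(command_output(ssm, cid, "i-0123456789abcdef0"))
```

The call is asynchronous, so in practice you poll `command_output` until the status leaves the pending/in-progress states. That is fine for scripted checks, but it is not an interactive shell.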
> almost everyone I have ever seen will either force-ignore/auto-accept host key changes,
Thanks for the super informative answer. About the quoted portion... yeah! I assume it's my responsibility to do something... like manually check the host IP or something? What is the recommended practice to deal with this situation?
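One common answer (not something stated in this thread, so treat it as an assumption) is to get the host key fingerprint through a channel you already trust, for example the instance's first-boot console output if your image prints it there, and compare it with what the server presents instead of accepting blindly. Here is a small sketch of that comparison, using the base64 SHA-256 format that OpenSSH prints; the function names are made up.

```python
import base64
import hashlib


def fingerprint(public_key_line):
    """Return the OpenSSH-style SHA256 fingerprint of a public key line."""
    # A public key line looks like "<type> <base64 blob> [comment]".
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest base64-encoded with the "=" padding removed.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def host_key_is_trusted(presented_key_line, trusted_fingerprints):
    """True if the presented host key matches a fingerprint obtained out of band."""
    return fingerprint(presented_key_line) in trusted_fingerprints
```

The presented key line could come from something like `ssh-keyscan`. The point is that the trusted set must not come from the same unauthenticated connection.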
Nothing bad with SSH per se, but building your infrastructure in a way that makes ad-hoc remote changes unnecessary is something to strive for. For anything but small deployments, automation, immutability and reproducibility will keep you sane. Less moving parts, things don’t suddenly change, easy to audit, easy to rollback, etc.
* Works on amzn linux 2 - installed by default on newer versions
* otherwise: $ sudo yum install ec2-instance-connect
* The SSH public keys are only available for one-time use for 60 seconds in the instance metadata.
* you can send up your own SSH keys `aws ec2-instance-connect send-ssh-public-key`
* cloudtrail logs connections for auditing
* doesn't support tag based auth but it's on the roadmap
* plans to enable it in popular linux distros in addition to amzn linux 2
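On the CloudTrail point in the list above: key pushes should show up as `SendSSHPublicKey` events, so a quick audit query can be sketched like this. It assumes CloudTrail is enabled in the region; the helper name `ssh_key_push_events` is made up, and only the first page of results is read.

```python
def ssh_key_push_events(cloudtrail, max_results=20):
    """List (time, user) pairs for recent Instance Connect key pushes."""
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "SendSSHPublicKey"}
        ],
        MaxResults=max_results,
    )
    return [
        (event["EventTime"], event.get("Username", "unknown"))
        for event in response["Events"]
    ]


# Usage sketch (needs boto3 and AWS credentials):
#   import boto3
#   for when, who in ssh_key_push_events(boto3.client("cloudtrail")):
#       print(when, who)
```

This records who pushed a key and when, not what was typed in the session afterwards.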
Install local client:
$ aws s3 cp s3://ec2-instance-connect/cli/ec2instanceconnectcli-latest.tar.gz .
$ pip install ec2instanceconnectcli-latest.tar.gz
$ mssh instanceid
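As far as I understand it, `mssh` is a convenience wrapper around the same `send-ssh-public-key` call mentioned above, followed by a plain ssh within the 60-second window. Below is a rough sketch of doing the two steps yourself with boto3. It assumes you already have a key pair on disk (e.g. from `ssh-keygen`) and a network route to the instance; the helper names are made up.

```python
def push_public_key(eic, instance_id, availability_zone, os_user, public_key):
    """Push a public key for one OS user; it is only usable for a short time."""
    response = eic.send_ssh_public_key(
        InstanceId=instance_id,
        InstanceOSUser=os_user,
        SSHPublicKey=public_key,
        AvailabilityZone=availability_zone,
    )
    return response["Success"]


def ssh_command(private_key_path, os_user, host):
    """Build the ssh invocation to run right after the key push."""
    return ["ssh", "-i", private_key_path, f"{os_user}@{host}"]


# Usage sketch (needs boto3 and AWS credentials; all values are hypothetical):
#   import boto3, subprocess
#   eic = boto3.client("ec2-instance-connect")
#   with open("/tmp/eic_key.pub") as f:
#       ok = push_public_key(eic, "i-0123456789abcdef0", "us-east-1a",
#                            "ec2-user", f.read())
#   if ok:
#       subprocess.run(ssh_command("/tmp/eic_key", "ec2-user", "10.0.0.5"))
```

Because the second step is ordinary ssh, things like port forwarding flags can be added to the command list as usual.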