Dynamic Route53 records for AWS auto scaling groups with Terraform (meltwater.com)
87 points by traxmaxx on Feb 7, 2020 | 36 comments



You can attach an ELB to the ASG. Then DNS name to ELB. That's how AWS expects you to use it.
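For example, with the AWS CLI (all names, IDs and ARNs here are illustrative):

    # register the ASG's instances with an ALB target group
    aws autoscaling attach-load-balancer-target-groups \
      --auto-scaling-group-name my-asg \
      --target-group-arns "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123"

    # then point a friendly name at the load balancer with a Route53 alias record
    # (the AliasTarget HostedZoneId is the ALB's canonical zone for its region, not your own zone)
    aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE --change-batch '{
      "Changes": [{"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "app.example.com.", "Type": "A",
        "AliasTarget": {"HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "my-alb-1234567890.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true}}}]}'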


I can see why you might not want to, though. App ELBs charge by usage and can get somewhat expensive (like running another EC2 instance or two). They can also have cold-start performance issues in specific circumstances (traffic spikes).


Yup, but that cost is usually negligible for most companies, and it's probably not worth the extra complexity introduced by production hacks.


That doesn't solve the problem of the hostname on the EC2 instance itself being the same across all instances, which makes it harder to see which logs came from which hosts.

It also doesn't let you look at the logs and then quickly SSH into a single machine in the ASG.


Is this not already a solved issue?

Install a log agent like fluentd on the machine. Have it inject the host IP and other contextual metadata into the logs, then forward them to your central log system.

When you see the error message in your logs, you get the internal IP and can SSH in.
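A minimal sketch of the injection step using fluentd's record_transformer filter (file path and field names are illustrative; the "#{...}" Ruby snippets are evaluated once when fluentd loads the config):

    cat <<'EOF' > /etc/td-agent/conf.d/add-host-meta.conf
    <filter **>
      @type record_transformer
      <record>
        hostname "#{Socket.gethostname}"
        host_ip "#{Socket.ip_address_list.detect(&:ipv4_private?).ip_address}"
      </record>
    </filter>
    EOF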

Persistent internal IPs/hostnames also mean you are not treating hosts as ephemeral. It's always good in the cloud to get things to a point where you can just blow away instances and they auto-recreate. It's even possible with traditional services requiring persistent storage: put the storage on a separate volume and have the instance startup scripts discover available volumes and attach them as required.
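The discovery step in a startup script can be as simple as this sketch (tag name and device are illustrative; in practice you would also filter on the instance's availability zone):

    # find an available EBS volume tagged for this service and attach it to ourselves
    vol=$(aws ec2 describe-volumes \
      --filters "Name=tag:service,Values=my-service" "Name=status,Values=available" \
      --query 'Volumes[0].VolumeId' --output text)
    # instance ID from the metadata service (IMDSv1 style; IMDSv2 needs a session token)
    iid=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 attach-volume --volume-id "$vol" --instance-id "$iid" --device /dev/xvdf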


What you need is probably something like https://github.com/adhocteam/ec2ssh (I've never used it, but I have built similar ones) -- and then you tag the log entries with the instance ID.

so you can do "ec2ssh i-0017c8b3"
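The ones I built were essentially just this (a sketch, not the actual ec2ssh code; the SSH user is illustrative):

    ec2ssh() {
      local ip
      # resolve the instance ID to its private IP via the EC2 API
      ip=$(aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
      ssh "ec2-user@${ip}"
    }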

IMHO: hacking around debugging tools is better (mostly because it's more reliable) than hacking around production configurations. One problem you will see with the latter is that changing Route53 records frequently makes you subject to API rate limits.


Post author here.

We absolutely put an ELB/ALB in front of these ASGs as well. The post mentions a few use cases where unique hostnames with internal Route53 records are helpful for us.


This is a pretty sweet Terraform module that I will personally be testing out soon.

I was curious what they are using for the Lambda function, and it turns out it's a framework-less Python script[0].

One thing I'm not clear on yet is whether using this implies one such Lambda for every auto scaling group or not.

[0] https://github.com/meltwater/terraform-aws-asg-dns-handler/b...


I just checked with Jim, the author of the blog post and module (we both work for Meltwater).

He confirmed "You only need one instance per account, we have always been fine with just one".

I will let him fill in further details here but I figured you would be interested in a short update on this.


Great to hear! Please let us know how the testing goes!


Can't this be solved by using IP addresses for hostnames? This can be part of the bootstrap script, which ASG/Launch Configuration already supports via UserData[1].
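A minimal user_data sketch (assumes a systemd-based AMI and the IMDSv1 metadata endpoint):

    #!/bin/bash
    # set the hostname to the instance's private IP, e.g. ip-10-0-1-23
    ip=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
    hostnamectl set-hostname "ip-${ip//./-}"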

What I can't understand is this: if your logs are in ELK and your metrics are in Prometheus/Grafana, why do you need SSH access? Sounds like that's a good problem to solve.

[1] - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-dat...


Post author here.

SSH access is a last resort, but it can be necessary in certain cases, for example if our log forwarding breaks. And SSH is just one example; it can also be helpful to curl endpoints on the host directly without hitting the ELB/ALB.

The post actually provides the user_data script we use.


I already do this from the configuration management system on my instances, but one thing I don't have that I'd love Route53 to support is handling in-addr.arpa zones for my IP addresses, so I can get reverse IP lookups for my VPC networks without having to run my own resolver.


We run our reverse zones on Route53. It is a little bit cumbersome, but overall it works relatively well with VPC private hosted zones.
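For reference, creating one of those reverse zones as a VPC private hosted zone looks roughly like this (zone, region and VPC ID are illustrative):

    aws route53 create-hosted-zone \
      --name 1.0.10.in-addr.arpa \
      --caller-reference "rev-zone-$(date +%s)" \
      --hosted-zone-config PrivateZone=true \
      --vpc VPCRegion=us-east-1,VPCId=vpc-0123456789abcdef0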


Are you using RFC1918 networks or are you using public subnets you have actual delegation for?


Interesting, which configuration management system do you use?


Chef and hating it.


I wrote this and use it regularly: a Docker app that adds and removes instances from Route53, similar to their Terraform solution. Similar idea, different implementation.

https://github.com/discobean/route53-sidecar


Why are ASGs 'outdated' in a world with Kubernetes?

Our nodes are in ASGs.


Post author here.

I brought up that point because I think most developers prefer the user experience of Lambda/Kubernetes, where they don't have to manage individual instances in Auto Scaling Groups. ASGs are certainly not 'outdated' for our use cases, and especially not for those responsible for running the underlying infrastructure (e.g. when running Kubernetes nodes).


I think AWS not having automatic hostnames for ASG instances is a way to lure you into Lambdas; at least that's how I got hooked.


Why should an instance created by an ASG have a hostname? These are cattle, not pets. I use Serilog for logging with an EC2 enricher that automatically adds the instance ID and the IP address.

Since Serilog does structured logging, I can use either an Elasticsearch or Mongo sink and do complex queries.

If I routinely need to log into an instance to troubleshoot, I need to be capturing data and sending it to a central logging system.


> Why should an instance created by an ASG have a host name?

It means you can connect to it just by knowing its instance ID.

Adding the IP address everywhere also works.

There can be some nice SSH config options though, like using a particular key for everything under *.prod.myaws.com.
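e.g. appending something like this to ~/.ssh/config (user and key path are illustrative):

    cat >> ~/.ssh/config <<'EOF'
    Host *.prod.myaws.com
        User ec2-user
        IdentityFile ~/.ssh/prod.pem
    EOF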


You can also use SSM Session Manager to get a similar experience for those instances: https://docs.aws.amazon.com/systems-manager/latest/userguide...

I haven't had to manage SSH keys in a long time ;)

With this I just have a bash function for my various environments (e.g. dev = dssm) where I pass in the ID of the instance giving me issues if I really need to log into the server.

e.g.

function dssm { aws --region us-west-2 --profile my-dev-profile-name ssm start-session --target "$1"; }

Then:

dssm i-abcdef123456

And I'm dropped into a shell. SSM Session Manager is far from perfect, but it gets the job done, is fully auditable, gets logged (including commands run), and best of all works with SAML IAM profiles right out of the gate. No more sharing keys, no more managing keys; it's great!


Yes, exactly. SSH access is also one of the reasons for building the module, as mentioned in the blog post:

> Access: When troubleshooting, we save time not having to look up the instance’s internal IP address for SSH access.


That’s the second part. If I’m troubleshooting by logging into EC2 instances, there is something wrong with my logging infrastructure. That’s actually the larger issue.


Post author here.

SSH access is absolutely a last resort, but can be necessary in certain cases (like when Filebeat breaks...). Turning SSH off completely (i.e. "No SSH") is certainly better for security and something we may pursue.

I mentioned in another comment here that SSH is just one example; we can also easily hit endpoints with curl via the hostname.

The post also mentions that other tools (like Grafana dashboards) expect unique hostnames.


> If I’m troubleshooting by logging into EC2 instances, there is something wrong with my logging infrastructure.

I suppose it's possible to build enough logging to account for an interactive SSH session for debugging problems...but that would be massive.

I ran out of disk space. Why?


If you’re logging to a local disk on ephemeral VMs, that doesn’t make the situation any better.

That’s why you need a central logging facility. If you’re using AWS, you could store your structured JSON logs in S3 and query them with Athena. (https://medium.com/quiq-blog/store-json-logs-on-s3-for-searc...)
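The query side can then be as simple as this sketch (database/table names are illustrative and assume a JSON table has already been defined over the bucket):

    aws athena start-query-execution \
      --query-string "SELECT ts, host, message FROM logs.app_json WHERE level = 'ERROR' LIMIT 100" \
      --result-configuration OutputLocation=s3://my-athena-results/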

Of course there are other ways both using AWS and third party services. Centralized logging is a solved problem.

AWS isn’t going to run out of disk space any time soon. You could also use a lifecycle policy to delete old logs or move them to lower-cost storage, depending on your retention policy.

I’m not saying that I have never had to log on to a VM to troubleshoot, but that’s a sign of the need for better logging.

And if my logging infrastructure isn’t good, how pray tell will I troubleshoot my programs running on Lambda or Fargate?


I never said your disk was full of logs.

> how pray tell will I troubleshoot my programs running on Lambda or Fargate?

That is indeed a big problem running on Lambda and Fargate.

In my experience, Fargate isn't very commonly used and Lambda is used for only relatively simple things.


It’s not a problem at all with Lambda or Fargate. Logging can be as simple as printing to the console, and the logs go to CloudWatch.

It’s the same concept. If your troubleshooting at any point involves needing to log in to an EC2 instance, you might as well have a few bespoke servers called “Web01” and “Web02”. You’re just using ASG to create pets at scale. We run an ASG in production that scales from 2 to 30 instances based on the number of messages in a queue, Lambdas running all of the time, some Fargate tasks, etc. It would be a nightmare to troubleshoot all of those processes without centralized, queryable logs.

> In my experience, Fargate isn't very commonly used and Lambda is used for only relatively simple things.

And that experience is representative of the entire AWS ecosystem?


I agree, and I wouldn't want it any other way nowadays, but back then I had to migrate a lot of legacy systems to AWS under pressure.

For one part, we had a legacy service that needed to connect to the services in the ASG, and the best way to implement that was with round-robin DNS. So the Lambda would update a DNS record containing all the ASG host IPs.
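i.e. a single record carrying every instance IP, roughly like this (zone ID, name and IPs are illustrative):

    aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE --change-batch '{
      "Changes": [{"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "legacy-service.internal.example.com.", "Type": "A", "TTL": 60,
        "ResourceRecords": [{"Value": "10.0.1.12"}, {"Value": "10.0.2.34"}]}}]}'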

Also, we had some semi-stateful legacy instances that were basically lifted and shifted to AWS, but I wanted to have them in an ASG to keep our environment similar until we could refactor them into real cattle.


Just out of curiosity, why not just put the ASG behind a load balancer?


I don't remember exactly. We did use ELBs for all other services, so it was either cost, or it had to do with MX record restrictions, in that you're not allowed to use CNAMEs in MX records.


Or a way to get you more familiar with tagging, or with the various queries and filters on different API results. It's annoying at first, but it leads to less reliance on the console and more effective scripting. (Instead of naming my instances, I just made a script which looks up the instance I want and outputs the IP and username, and put that in an SSH config.)
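That lookup script can be as small as this sketch (the tag filter is illustrative):

    # print the private IP of the running instance with a given Name tag
    aws ec2 describe-instances \
      --filters "Name=tag:Name,Values=$1" "Name=instance-state-name,Values=running" \
      --query 'Reservations[].Instances[].PrivateIpAddress' --output text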


They expect you to front it with a load balancer.



