Change Sets for AWS CloudFormation

voidfunc · on March 30, 2016

CloudFormation is awful. I used it extensively for about a year and will never go back.

buremba · on March 30, 2016

+1

1. We can't test the templates in a local environment; the only way to test a template is to create a real cluster, which usually takes ~15 minutes.

2. The error messages are not helpful enough. It usually fails with a "SETUP_FAILED" error and no proper cause.

3. The rollback may not work if you change the configuration. You need to delete the resources manually, or even get help from Amazon, because the resources are stuck in some failed state.

4. The JSON template is hard to write and read. We have a single fat template file of 2k lines and are scared to change it since we might break something.

5. Not all AWS actions are supported. For example, we need to add custom resources (ELB and RDS instances) to an OpsWorks stack, but that's not possible with CloudFormation.

nunez · on March 31, 2016

1. This is true, but it's also true of other cloud provisioning tools. Some operations just take a long time. Using things like DependsOn and template validation helps decrease provisioning time.

2. If you check the events that happened in the CloudFormation console or by running "aws cloudformation describe-stack-events", you'll see where things went sideways pretty easily (unless you're using nested stacks, in which case debugging can take a bit longer.)

3. I can't say that that's happened too often for us, and we've deployed quite a few stacks so far.

4. Your stacks shouldn't be super long. If you're using Designer to create your template (which I personally don't do), removing the position metadata can save you lots of lines depending on the number of resources you have. Also, you might be mixing in parts of your infrastructure that should really live in their own stacks (i.e. not creating web servers in your VPC/networking stack). Consider using nested stacks to help you organize this.

5. What do you mean? Tagging is supported by most CF resources (and AWS is adding support for more as the product matures)
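To make point 2 concrete, here is a minimal Python sketch of picking the failures out of a list of stack events. It assumes the events are dicts shaped like the entries that "aws cloudformation describe-stack-events" returns under "StackEvents" (Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason). The function name, the sample events, and the reason strings are made up for illustration, and timestamps are plain ISO strings here so the example runs without AWS access.

```python
def failed_events(events):
    """Return (logical id, status, reason) for every *_FAILED event, oldest first."""
    failures = [e for e in events if e.get("ResourceStatus", "").endswith("_FAILED")]
    # ISO 8601 strings in the same timezone sort chronologically as plain text.
    failures.sort(key=lambda e: e["Timestamp"])
    return [
        (e["LogicalResourceId"], e["ResourceStatus"], e.get("ResourceStatusReason", ""))
        for e in failures
    ]


# Hypothetical sample data, listed newest first.
sample_events = [
    {
        "Timestamp": "2016-03-30T10:02:00Z",
        "LogicalResourceId": "WebStack",
        "ResourceStatus": "ROLLBACK_IN_PROGRESS",
        "ResourceStatusReason": "The following resource(s) failed to create: [WebServer].",
    },
    {
        "Timestamp": "2016-03-30T10:01:00Z",
        "LogicalResourceId": "WebServer",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Invalid AMI id",
    },
    {
        "Timestamp": "2016-03-30T10:00:00Z",
        "LogicalResourceId": "WebServer",
        "ResourceStatus": "CREATE_IN_PROGRESS",
    },
]

for logical_id, status, reason in failed_events(sample_events):
    print(f"{logical_id}: {status} ({reason})")
```

The earliest failure is usually the one worth reading first, since later events tend to be the rollback reacting to it.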

zwily · on March 30, 2016

I'm not a CloudFormation superfan, but...

1. Yup.

2. In my experience, the error message is usually fine, but there are many times where it's completely obtuse.

3. They just added a new feature to help with this. When you get ROLLBACK_FAILED, you can now tweak things and retry the rollback. It has recently saved me from having to contact support a few times.

4. 100% agree. I write all my templates in yaml, which helps a lot. I really hope they just add native yaml support. Example[1]

5. You can now create custom resources using Lambda from within your templates, which is pretty cool. Example[1]

edit: I should add that at my company, we mostly use Terraform now. I still like to use CF for projects that will be open-sourced though, as I don't want step 1 of my instructions to be "install terraform".

[1] https://github.com/zwily/sqs-to-lambda-via-lambda

leg100 · on March 30, 2016

I can't counter these points, other than to ask: what else is there for automating AWS infrastructure? Terraform suffers from all these problems, too.

buremba · on March 30, 2016

I don't have experience with Terraform but it should be possible to find a way to fix these issues except the first one.

pmoriarty · on March 30, 2016

Could you please elaborate on your experience? Very interested.

Also, would you mind talking about what you use instead?

voidfunc · on March 30, 2016

I use Terraform right now, but I'm also looking at some other tech because Terraform is a bit too low-level for my current employer's needs (but a fantastic product overall).

CF problems:

* JSON format is awful (practically speaking this can be solved with Troposphere...)

* So slow.

* UPDATE_ROLLBACK_FAILED ... It apparently was fixed very very recently, but what a garbage design decision.

* Debugging CF errors is a nightmare.

* Nested templates have to be stashed in S3.

xorgar831 · on March 30, 2016

Likewise, CF is the leading cause of outages in our infrastructure. We've completely abandoned it in favor of Terraform.

* CF gets deadlocked if resources are changed outside of it, and there's no way to prevent people from doing that either

* No concept of a rolling update to a fleet, and no warning that a change is going to cause an outage unless you dig through the fine print in the docs on every single change; in some cases it just terminates all the EC2 instances in an ASG/ELB

* If you have a template, remember that any change you make to that template has to be applied to every single stack you have; otherwise you'll either damage something or deadlock the stack when updating it later. It can't figure out what's deployed on its own, and it maintains its own state, which can differ from the real state. We had to stop using shared templates altogether and just store a copy of each stack's template in our local git repo.

* If you have more than 100 stacks, searching the list of stacks requires clicking the "show more stacks" button a handful of times before it'll show the one you typed in

* Stupid slow, esp. compared to Terraform

* Doesn't delete Route53 DNS records even after saying it does in the log output

* Can break ELB configurations, such as sticky sessions and cause outages

* Doesn't actually support deploying all AWS resources, SNS/SQS perms/rules aren't complete for example.

smithclay · on March 30, 2016

I've found the Python wrapper troposphere useful for dealing with the JSON.

https://github.com/cloudtools/troposphere

lox · on March 30, 2016

Cfoo (https://github.com/drrb/cfoo) is great, it's simply a yaml-to-json converter with some syntactic sugar for cloudformation. Troposphere seemed like too heavy an approach.

eropple · on March 30, 2016

YMMV, but I'd much rather Cfer (https://github.com/seanedwards/cfer), which I found useful enough to contribute back to. A real DSL instead of relying on comparatively inexpressive YAML is very appealing to me.

jzelinskie · on March 30, 2016

Because we don't want to abstract the cloud provider (and potentially lock us out of any Amazon-specific behavior), we use a small Python script that takes Jinja2-templated YAML and compiles it to JSON. It also handles the uploading to S3 and a whole bunch of convenience macros for generating the more verbose parts of defining infrastructure in CloudFormation.

It took a lot of hair-pulling to write it, but now that it's done, CF isn't so terrible to use.
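For anyone curious what "convenience macros" might look like, here is a rough sketch of the general idea, not the actual script. It swaps Jinja2-templated YAML for plain Python dicts so it runs on the standard library alone, and it skips the S3 upload entirely. The helper names (tags, tcp_ingress, render) and the sample resource are hypothetical.

```python
import json


def tags(**pairs):
    """Expand keyword arguments into CloudFormation's verbose Key/Value tag list."""
    return [{"Key": key, "Value": value} for key, value in sorted(pairs.items())]


def tcp_ingress(port, cidr="0.0.0.0/0"):
    """Build a single-port TCP ingress rule for a security group."""
    return {"IpProtocol": "tcp", "FromPort": port, "ToPort": port, "CidrIp": cidr}


def render(resources, description=""):
    """Wrap resources in a template skeleton and serialize it to JSON."""
    template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
    if description:
        template["Description"] = description
    return json.dumps(template, indent=2, sort_keys=True)


# Hypothetical resource, defined with the helpers instead of hand-written JSON.
web_sg = {
    "Type": "AWS::EC2::SecurityGroup",
    "Properties": {
        "GroupDescription": "Web tier",
        "SecurityGroupIngress": [tcp_ingress(80), tcp_ingress(443)],
        "Tags": tags(Name="web", Env="prod"),
    },
}

template_json = render({"WebSG": web_sg}, description="Generated by a hypothetical macro script")
print(template_json)
```

The output is still ordinary CloudFormation JSON, so nothing about the provider is abstracted away; the helpers only cut down on typing.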

lox · on March 30, 2016

As a counter point, we use Cloudformation extensively at 99designs and whilst it has some quirks, it's also been stable, reliable and incredibly beneficial for automating lots of our infrastructure.

SteveNuts · on March 30, 2016

Were you using CloudFormation only, or did you use something like troposphere?

voidfunc · on March 30, 2016

CloudFormation only.

the_arun · on March 30, 2016

What do you use instead?

_3u10 · on March 30, 2016

Call me when they get rid of UPDATE_ROLLBACK_FAILED; until then it's practically useless.

Whoever thought it was ever acceptable to leave your production stack in a state that requires deleting it is an idiot.

Even if they fixed it I doubt I'd ever trust a service that ever acted like that with my production infrastructure.

andrewguenther · on March 30, 2016

It isn't gone, but you can at least recover from it now.

https://blogs.aws.amazon.com/application-management/post/Tx1...

justinsb · on March 30, 2016

Is that just a retry though? It sounds like the onus is still on you to fix the problem manually so that CloudFormation can complete its rollback, rather than e.g. being able to request a different desired CloudFormation state that might be reachable.

leg100 · on March 30, 2016

This is effectively the 'dry-run' feature that has been sought for a long, long time [1], and whose absence led me to switch to Terraform (which does do this, amongst other key differences). Why did it take so damn long to implement?

I do appreciate it has come about eventually. Terraform has its foibles (and stupid half-declarative language) but it would take further improvements to consider going back to CF.

[1] https://forums.aws.amazon.com/thread.jspa?messageID=563929&#...

zwily · on March 30, 2016

Actually, this isn't quite dry-run. What this does is calculate the diff between two templates, and just provide the operations from that diff. It does not inspect the current state of resources to see if there were any out-of-band changes that would be unwound.

So it gets you closer to a dry-run, but is not as complete as terraform's.
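A toy Python sketch of what a template-only diff amounts to, as described above. This is only an illustration of the idea, not how the service is implemented: it compares the Resources sections of two templates and never looks at live resources, so an out-of-band change would be invisible to it. The function name and the sample templates are hypothetical; the Add/Modify/Remove labels mirror the actions a change set reports.

```python
def diff_templates(old, new):
    """List (action, logical id) pairs for resources that differ between two templates."""
    old_resources = old.get("Resources", {})
    new_resources = new.get("Resources", {})
    changes = []
    for name in sorted(set(old_resources) | set(new_resources)):
        if name not in old_resources:
            changes.append(("Add", name))
        elif name not in new_resources:
            changes.append(("Remove", name))
        elif old_resources[name] != new_resources[name]:
            changes.append(("Modify", name))
    return changes


# Hypothetical templates: one resource resized, one dropped, one added.
old_template = {
    "Resources": {
        "WebServer": {"Type": "AWS::EC2::Instance", "Properties": {"InstanceType": "t2.micro"}},
        "Queue": {"Type": "AWS::SQS::Queue"},
    }
}
new_template = {
    "Resources": {
        "WebServer": {"Type": "AWS::EC2::Instance", "Properties": {"InstanceType": "t2.small"}},
        "Topic": {"Type": "AWS::SNS::Topic"},
    }
}

for action, name in diff_templates(old_template, new_template):
    print(action, name)
```

If someone had resized WebServer by hand in the console, this diff would still report the same three changes, which is exactly the gap compared to a real dry-run.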

kenbreeman · on March 30, 2016

This is a step in the right direction but the lack of consistent error handling and having some resources created that can't be referenced are still huge pain points (e.g. the default security group and route table).

I've had the most success using CloudFormation to create an entirely new stack, test it, cut over, then delete the old one... definitely not optimal.

mryan · on March 30, 2016

References to existing resources, such as the default SG, can be handled with stack parameters.

You can pass in its ID as a parameter to the stack, and refer to this parameter in your launch configs or ingress rules.

An example from one of my client's stacks:

    "Parameters": {
      "DefaultSG": {
        "Type": "AWS::EC2::SecurityGroup::Id",
        "Default": "sg-abc123"
      },

Personally I prefer to create a new SG to replace the default one as it means all of my infrastructure is part of a CF stack, but the parameter method can be used to partially manage (some) non-CF resources.
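The parameter method above could be put together roughly like this in Python. This is a sketch under assumptions: the AllowHttp ingress rule and the undefined_refs helper are hypothetical additions for illustration, and only the DefaultSG parameter comes from the snippet above. The helper is a small local sanity check that every Ref points at a declared parameter or resource.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        # Same parameter as in the snippet above.
        "DefaultSG": {"Type": "AWS::EC2::SecurityGroup::Id", "Default": "sg-abc123"},
    },
    "Resources": {
        # Hypothetical rule attached to the existing, non-CF security group.
        "AllowHttp": {
            "Type": "AWS::EC2::SecurityGroupIngress",
            "Properties": {
                "GroupId": {"Ref": "DefaultSG"},
                "IpProtocol": "tcp",
                "FromPort": 80,
                "ToPort": 80,
                "CidrIp": "0.0.0.0/0",
            },
        },
    },
}


def find_refs(node):
    """Yield the target of every {"Ref": "..."} found anywhere inside node."""
    if isinstance(node, dict):
        if set(node) == {"Ref"} and isinstance(node["Ref"], str):
            yield node["Ref"]
        else:
            for value in node.values():
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def undefined_refs(tpl):
    """Return Ref targets that are neither parameters, resources, nor AWS:: pseudo parameters."""
    declared = set(tpl.get("Parameters", {})) | set(tpl.get("Resources", {}))
    refs = set(find_refs(tpl.get("Resources", {})))
    return sorted(r for r in refs if r not in declared and not r.startswith("AWS::"))


print(undefined_refs(template))
print(json.dumps(template, indent=2))
```

A check like this catches a mistyped parameter name locally, before a stack update gets a chance to fail on it.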