1. We can't test the templates in a local environment; the only way to test a template is to create a real cluster, which usually takes ~15 minutes.
2. The error messages are not helpful enough. It usually fails with a "SETUP_FAILED" error and no proper cause.
3. The rollback may not work if you change the configuration. You need to delete the resources manually or even get help from Amazon, since the resources get stuck in some failed status.
4. The JSON template is hard to write and read. We have a fat single-file template that has 2k lines, and we're scared to change it since we might break something.
5. Not all AWS actions are supported. For example, we need to add custom resources (ELB and RDS instances) to an OpsWorks stack, but it's not possible with CloudFormation.
1. This is true, but it's also true of other cloud provisioning tools. Some operations just take a long time. Using things like DependsOn and template validation helps decrease provisioning time.
2. If you check the events that happened in the CloudFormation console or by running "aws cloudformation describe-stack-events", you'll see where things went sideways pretty easily (unless you're using nested stacks, in which case debugging can take a bit longer.)
3. I can't say that that's happened too often for us, and we've deployed quite a few stacks so far.
4. Your stacks shouldn't be super long. If you're using Designer to create your template (which I personally don't do), removing the position metadata can save you lots of lines depending on the number of resources you have. Also, you might be mixing in parts of your infrastructure that should really live in their own stacks (i.e. don't create web servers in your VPC/networking stack). Consider using nested stacks to help you organize this.
5. What do you mean? Tagging is supported by most CF resources (and AWS is adding support for more as the product matures)
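For point 2, here's a rough sketch of pulling the likely root cause out of the event list. It assumes you've saved the JSON printed by `aws cloudformation describe-stack-events` and pass in its "StackEvents" list; the function names and the sample events are made up for illustration.

```python
def first_failures(events):
    """Return the failed stack events, oldest first.

    `events` is a list of dicts shaped like the "StackEvents" entries
    from `aws cloudformation describe-stack-events`. The API lists events
    newest first, so we sort to get the earliest failure on top.
    """
    failed = [e for e in events if e.get("ResourceStatus", "").endswith("_FAILED")]
    return sorted(failed, key=lambda e: e["Timestamp"])


def summarize(event):
    """One-line summary: which resource failed and the reason given."""
    reason = event.get("ResourceStatusReason", "(no reason given)")
    return (
        f'{event["LogicalResourceId"]} ({event["ResourceType"]}): '
        f'{event["ResourceStatus"]} - {reason}'
    )


# Hand-written sample data (NOT output from a real stack), newest first.
SAMPLE_EVENTS = [
    {"Timestamp": "2016-03-30T10:03:06Z", "LogicalResourceId": "demo-stack",
     "ResourceType": "AWS::CloudFormation::Stack",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS"},
    {"Timestamp": "2016-03-30T10:03:05Z", "LogicalResourceId": "DbInstance",
     "ResourceType": "AWS::RDS::DBInstance",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "cancelled after the first failure (illustrative)"},
    {"Timestamp": "2016-03-30T10:03:00Z", "LogicalResourceId": "WebServerGroup",
     "ResourceType": "AWS::AutoScaling::AutoScalingGroup",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "illustrative root cause"},
    {"Timestamp": "2016-03-30T10:00:00Z", "LogicalResourceId": "demo-stack",
     "ResourceType": "AWS::CloudFormation::Stack",
     "ResourceStatus": "CREATE_IN_PROGRESS"},
]

if __name__ == "__main__":
    for event in first_failures(SAMPLE_EVENTS):
        print(summarize(event))
```

The earliest failure is usually the interesting one; later failures are often just other resources being cancelled. With nested stacks you'd have to repeat this for each child stack, which is why debugging those takes longer.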
2. In my experience, the error message is usually fine, but there are many times where it's completely obtuse.
3. They just added a new feature to help with this. When you get ROLLBACK_FAILED, you can now tweak things and retry the rollback. It has recently saved me from having to contact support a few times.
4. 100% agree. I write all my templates in YAML, which helps a lot. I really hope they just add native YAML support. Example[1]
5. You can now create custom resources using Lambda from within your templates, which is pretty cool. Example[1]
edit: I should add that at my company, we mostly use Terraform now. I still like to use CF for projects that will be open-sourced though, as I don't want step 1 of my instructions to be "install terraform".
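To make point 5 a bit more concrete, here's a minimal sketch of a Lambda-backed custom resource handler. The contract is that CloudFormation sends a request containing a pre-signed `ResponseURL`, and the function has to PUT a JSON body back to it. The "provisioned" message and the default physical ID are placeholders I made up; a real handler would do the actual provisioning work where the comment says so.

```python
import json
import urllib.request


def build_response(event, status, data=None, physical_id=None, reason=""):
    """Build the JSON body CloudFormation expects back from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,  # should explain the failure when status is "FAILED"
        # On Update/Delete the request carries the existing ID; the
        # fallback name here is a made-up placeholder.
        "PhysicalResourceId": physical_id
        or event.get("PhysicalResourceId", "example-custom-resource"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},  # values readable via Fn::GetAtt in the template
    }


def send_response(event, body):
    """PUT the response to the pre-signed URL supplied in the request."""
    request = urllib.request.Request(
        event["ResponseURL"],
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        # Empty content type so urllib does not add a form-encoded default.
        headers={"Content-Type": ""},
    )
    urllib.request.urlopen(request)


def handler(event, context):
    """Lambda entry point for Create, Update and Delete requests."""
    try:
        data = {}
        if event["RequestType"] in ("Create", "Update"):
            # Hypothetical work: call whatever AWS API the template
            # cannot express natively, then report the result.
            data = {"Message": "provisioned"}
        body = build_response(event, "SUCCESS", data=data)
    except Exception as exc:
        body = build_response(event, "FAILED", reason=str(exc))
    send_response(event, body)
```

One thing to watch: if the function never sends a response, the stack just sits there until the operation times out, so it's worth making sure every code path ends in `send_response`.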
I use Terraform right now, but I'm also looking at some other tech because Terraform is a bit too low-level for my current employer's needs (but a fantastic product overall).
CF problems:
* JSON format is awful (practically speaking this can be solved with Troposphere...)
* So slow.
* UPDATE_ROLLBACK_FAILED ... It apparently was fixed very very recently, but what a garbage design decision.
Likewise, CF is the leading cause of outages in our infrastructure. We've completely abandoned it in favor of Terraform.
* CF gets deadlocked if resources are changed outside of it, and there's no way to prevent people from doing that either
* No concept of a rolling update to a fleet, and no warning that a change is going to cause an outage unless you dig through the fine print in the docs on every single change; in some cases it just terminates all the EC2 instances in an ASG/ELB
* If you have a template, remember that any change you make to it has to be applied to every single stack you have; otherwise you'll either damage something or deadlock the stack when updating it later. CF can't figure out what's deployed on its own, and it maintains its own state, which can differ from the real state. We had to stop using templates altogether and just store a copy of each template in our local git repo.
* If you have more than 100 stacks, searching the list of stacks requires clicking the "show more stacks" button a handful of times before it'll show the one you typed in
* Stupid slow, esp. compared to Terraform
* Doesn't delete Route53 DNS records even after saying it does in the log output
* Can break ELB configurations, such as sticky sessions, and cause outages
* Doesn't actually support deploying all AWS resources, SNS/SQS perms/rules aren't complete for example.
Cfoo (https://github.com/drrb/cfoo) is great, it's simply a yaml-to-json converter with some syntactic sugar for cloudformation. Troposphere seemed like too heavy an approach.
YMMV, but I'd much rather use Cfer (https://github.com/seanedwards/cfer), which I found useful enough to contribute back to. A real DSL instead of relying on comparatively inexpressive YAML is very appealing to me.
Because we don't want to abstract the cloud provider (and potentially lock ourselves out of any Amazon-specific behavior), we use a small Python script that takes Jinja2-templated YAML and compiles it to JSON. It also handles uploading to S3 and provides a whole bunch of convenience macros for generating the more verbose parts of defining infrastructure in CloudFormation.
It took a lot of hair-pulling to write it, but now that it's done, CF isn't so terrible to use.
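To give a flavor of what a "convenience macro" can look like, here's a rough sketch. It is not the script described above: to stay standard-library only it swaps Jinja2-templated YAML for plain Python dicts, and every helper name in it is made up. The idea is the same, though: a short call expands into the verbose JSON that CloudFormation wants.

```python
import json


def ingress_rules(ports, cidr="0.0.0.0/0"):
    """Macro: expand a list of TCP ports into verbose ingress entries."""
    return [
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port, "CidrIp": cidr}
        for port in ports
    ]


def security_group(description, vpc_param, ports):
    """Macro: a full AWS::EC2::SecurityGroup resource from three arguments."""
    return {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": description,
            "VpcId": {"Ref": vpc_param},
            "SecurityGroupIngress": ingress_rules(ports),
        },
    }


def render_template(resources, parameters=None, description=""):
    """Compile the pieces into a CloudFormation JSON template string."""
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": description,
        "Parameters": parameters or {},
        "Resources": resources,
    }
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(
        render_template(
            resources={"WebSg": security_group("Web traffic", "VpcId", [80, 443])},
            parameters={"VpcId": {"Type": "AWS::EC2::VPC::Id"}},
            description="Example stack (illustrative)",
        )
    )
```

Because the output is still plain CloudFormation JSON, nothing Amazon-specific is abstracted away; the helpers only save typing.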
As a counterpoint, we use CloudFormation extensively at 99designs, and whilst it has some quirks, it's also been stable, reliable and incredibly beneficial for automating lots of our infrastructure.
Is that just a retry though? It sounds like the onus is still on you to fix the problem manually so that CloudFormation can complete its rollback, rather than e.g. being able to request a different desired CloudFormation state that might be reachable.
This is effectively the 'dry-run' feature that has been sought for a long, long time [1], and its absence led me to switch to Terraform (which does do this, amongst other key differences). Why did it take so damn long to implement?
I do appreciate it has come about eventually. Terraform has its foibles (and stupid half-declarative language) but it would take further improvements to consider going back to CF.
Actually, this isn't quite dry-run. What this does is calculate the diff between two templates, and just provide the operations from that diff. It does not inspect the current state of resources to see if there were any out-of-band changes that would be unwound.
So it gets you closer to a dry-run, but is not as complete as Terraform's.
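A toy illustration of the difference, under my own assumptions: this is not how CloudFormation computes anything internally, just a template-to-template diff of the "Resources" section with made-up names. Note that it never looks at what is actually deployed.

```python
def diff_resources(old_template, new_template):
    """Compare two templates and list (action, logical_id) pairs.

    Only the templates are inspected. A resource changed by hand in the
    console looks identical here, which is the gap described above.
    """
    old = old_template.get("Resources", {})
    new = new_template.get("Resources", {})
    changes = []
    for name in sorted(new.keys() - old.keys()):
        changes.append(("Add", name))
    for name in sorted(old.keys() - new.keys()):
        changes.append(("Remove", name))
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            changes.append(("Modify", name))
    return changes


# Hypothetical templates, trimmed to the parts that matter for the diff.
OLD = {
    "Resources": {
        "Web": {"Type": "AWS::EC2::Instance",
                "Properties": {"InstanceType": "t2.micro"}},
        "Queue": {"Type": "AWS::SQS::Queue"},
    }
}
NEW = {
    "Resources": {
        "Web": {"Type": "AWS::EC2::Instance",
                "Properties": {"InstanceType": "t2.small"}},
        "Bucket": {"Type": "AWS::S3::Bucket"},
    }
}

if __name__ == "__main__":
    for action, name in diff_resources(OLD, NEW):
        print(action, name)
```

If someone had resized `Web` by hand before this ran, the diff would come out exactly the same. A state-aware plan would have to query the live resources as well.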
This is a step in the right direction but the lack of consistent error handling and having some resources created that can't be referenced are still huge pain points (e.g. the default security group and route table).
I've had the most success using CloudFormation to create an entirely new stack, test it, cut over, then delete the old one... definitely not optimal.
Personally I prefer to create a new SG to replace the default one as it means all of my infrastructure is part of a CF stack, but the parameter method can be used to partially manage (some) non-CF resources.