The problem with both is that you quickly accumulate weeks or months of coding time that costs a pretty penny at market freelance rates. Spending a few hundred K on devops is routine for a lot of companies.
My main issue with this is that a lot of that coding is just reinventing the same wheels over and over again. Jumping through many hoops along the way like a trained poodle.
It's stupid. Why is this so hard & tedious? I've seen enough infrastructure-as-code projects over the last fifteen years to know that very little has actually changed in terms of the type of thing people build with these tools. There's always a vpc with a bunch of subnets for the different AZs. Inside those go vms that run whatever (typically dockerized these days) with a load balancer in front and some cruft on the side for managed services like databases, queues, etc. The LB needs a certificate. I've seen some minor variations of this setup, but it basically boils down to that plus maybe some misc cruft in the form of lambdas, buckets, etc.
So why is it so hard to get all of that orchestrated? Why do we have to boil the ocean to get some sane defaults for all of this? Why do we need to micromanage absolutely everything here? Monitoring, health checks, logging systems, backups, security & permissions. All of this needs to be micromanaged. All of it is a disaster waiting to happen if you get it wrong. All of it is bespoke cruft that every project is reinventing from scratch.
All I want is "hey AWS, Google, MS, ... go run this thing over there and tell me when it's done. This is the domain it needs to run over. It's a bog-standard web server expecting to serve traffic via http/websockets. Give me something with sane defaults, not a disassembled jigsaw puzzle with thousands of pieces." This stuff should not be this hard in 2021.
PaaS has existed since the mid-2000s. It turns out people don't want it -- no PaaS ever got more than a minuscule fraction of the market for the boring workloads that 90% of companies use IaaS for. People want knobs & levers. Just look at the popularity of Kubernetes; it is nothing but knobs & levers.
Kubernetes, once deployed, has surprisingly few knobs for the end user to tweak. You might have to pick between a StatefulSet and a Deployment depending on your workload, but that's about it.
Kubernetes cleanly separates responsibility between maintainers of the platform (who have to make decisions about how to deploy it; in cloud environments that's the cloud provider's job) and users of the platform (who use a fairly high-level API that's universal across clusters and cluster flavours). It's usually the former that people complain about being difficult and complex: picking the networking stack, the storage stack, implementing cluster updates, ... That matters if you, for some reason, want to run a k8s cluster from scratch. But given something like a GKE cluster and a locally running kubectl pointing to it, it takes much less effort to deploy a production workload there than on a newly created AWS account. And there are far fewer individual resources and much less state involved.
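As a rough illustration of how small that user-facing surface is, a minimal stateless Deployment via the official Kubernetes Python client (image, labels and replica count are placeholders) looks something like this:

    # Minimal sketch: the user-facing side of a stateless workload on any cluster.
    # Assumes your local kubectl context already points at a cluster (e.g. GKE);
    # image, labels and replica count are placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # reuse whatever cluster kubectl is pointed at

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="web"),
        spec=client.V1DeploymentSpec(
            replicas=3,
            selector=client.V1LabelSelector(match_labels={"app": "web"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "web"}),
                spec=client.V1PodSpec(containers=[
                    client.V1Container(
                        name="web",
                        image="my-org/web:latest",
                        ports=[client.V1ContainerPort(container_port=80)],
                    )
                ]),
            ),
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)

The same manifest works against GKE, EKS or a local kind cluster, which is the portability point being made above.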
Aren't Kubernetes' capabilities something that cloud providers should have made available from the beginning? Meaning its only possible future is no future at all: those capabilities either should have been there from the start, or will be offered natively in the (very near) future.
Different cloud providers did different things. Google's cloud offerings started with things like GAE, a very high-level service that many people ignored because it was too alien. AWS, on the other hand, provided fairly low-level abstractions like VMs (with some aaS services sprinkled in, but distinctly still 'a thing you access over HTTP over the network'). Both offerings reflect the companies' internal engineering culture, and AWS' was much less alien and more understandable to the outside. Now every other provider basically clones AWS' paradigm, as that's where the big enterprise contract money is - not in doing things well but differently.
With Kubernetes we actually have something close to a high-level API for provisioning cloud workloads (it's still no GAE, as the networking and authentication problems are still there, but they can be solved in the long term), and the hope is that cloud providers will implement Kubernetes APIs as a first-class service that allows people to truly not care about underlying compute resources. Automatically managed workloads from container images are effectively the middle ground between 'I want a linux VM' pragmatists and 'this shouldn't be this much work' idealists.
With GKE you can spin up a production HA cluster in a click [1], but you still have to think about how many machines you want (there's also Autopilot, but it's expensive and I have my problems with it). AWS's EKS is a shitshow though; it basically requires interacting with the same networking/IAM/instance boilerplate as any other AWS setup [2].
It might also be the wrong incentives being passed around. I mean, if you're hired and paid to push knobs and levers, you'll choose a tool with knobs and levers. Even with more of them.
GAE did this way back when. It gives you a high-level Python/Java API that works both locally and in prod, and you just push a newer version of your codebase with a command line tool - no need for containers, build steps, dealing with machines and operating systems, configurable databases, certificates, setting up traces or monitoring... No need to set anything up for a particular GAE app; just create a new project, point a domain to it if you're feeling fancy, and off you go.
But in the end, the industry apparently prefers low-level AWS bullshit, where you have to manually manage networks, virtual machines, NAT gateways, firewall rules, load balancers, individually poke eight different subsystems just to get the basic thing running … It’s just like dealing with physical hardware, just 10x more expensive and exploiting the FOMO of ‘but what happens if we need to scale’.
I've been working with AWS CDK for a little while now, and it kind of has some of what you want.
In my case, I wanted a scheduled container to run once per day. Setting it up manually with CF or Terraform would have been a lot of work defining various resources, but CDK comes with a higher-level construct[1] that can be parameterized and which will assemble a bunch of lower level resources to do what I want. It was a pretty small amount of Python code to make it happen.
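For a rough idea of what that looks like (this may not be the exact construct referenced in [1]; names, schedule and image below are placeholders), something like the aws_ecs_patterns.ScheduledFargateTask construct bundles the task definition, IAM roles, log group and schedule rule for you:

    # Rough sketch of a higher-level CDK construct for a daily scheduled container.
    # Names, schedule and image are placeholders; assumes aws-cdk-lib v2 for Python.
    from aws_cdk import Stack
    from aws_cdk import aws_ec2 as ec2, aws_ecs as ecs, aws_ecs_patterns as ecs_patterns
    from aws_cdk import aws_applicationautoscaling as appscaling
    from constructs import Construct

    class NightlyJobStack(Stack):
        def __init__(self, scope: Construct, id: str, **kwargs) -> None:
            super().__init__(scope, id, **kwargs)

            vpc = ec2.Vpc(self, "Vpc", max_azs=2)           # subnets, routing, NAT handled for you
            cluster = ecs.Cluster(self, "Cluster", vpc=vpc)

            # One construct expands into the task definition, IAM roles,
            # log group and EventBridge schedule rule behind the scenes.
            ecs_patterns.ScheduledFargateTask(
                self, "NightlyTask",
                cluster=cluster,
                schedule=appscaling.Schedule.cron(hour="3", minute="0"),  # once per day
                scheduled_fargate_task_image_options=ecs_patterns.ScheduledFargateTaskImageOptions(
                    image=ecs.ContainerImage.from_registry("my-org/nightly-job"),
                    memory_limit_mib=512,
                ),
            )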
The AWS CDK is getting closer to this. Your standard VPC setup is extremely simple now. Like
new Vpc()
simple. The tooling definitely has its quirks, but is steadily improving, and you can drop into CloudFormation from CDK (with TypeScript type checking) when needed.
Although your headline is correct imho, I think there are lots of things that the orchestrator might need to do above this, especially if you want the tool to be cloud agnostic so that everyone can use it, which just makes things a little more complicated.
You might want to add something to the existing system, like another web server. These tools have to handle "add item and wire it up to the load balancer".
You might want to scale up. This might work natively or it might require creating new larger instances and then getting rid of the old ones.
You might want to update the images to newer versions.
You might need more public IPs.
You might be adding something to an existing larger network so you need to reference existing objects.
You might need to "create if not exists"
etc. I think your argument covers the initial use case in most places, but any system used over time will need the other stuff done to it (the "reference existing objects" case is sketched below), hence the "complexity" of the tools. tbf, I don't think Terraform is that complex in itself; I think because it is in config files, it can be more complex to understand and work with.
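For instance, the "reference existing objects" case looks roughly like this with the AWS CDK in Python (the vpc_id and construct names are placeholders, and from_lookup needs an explicit account/region on the stack):

    # Sketch of adding a new piece to an existing network rather than creating everything.
    # vpc_id and construct names are placeholders; assumes aws-cdk-lib v2 for Python.
    from aws_cdk import Stack
    from aws_cdk import aws_ec2 as ec2, aws_elasticloadbalancingv2 as elbv2
    from constructs import Construct

    class ExtraWebTierStack(Stack):
        def __init__(self, scope: Construct, id: str, **kwargs) -> None:
            super().__init__(scope, id, **kwargs)

            # Look up the network that already exists instead of provisioning a new one.
            vpc = ec2.Vpc.from_lookup(self, "ExistingVpc", vpc_id="vpc-0123456789abcdef0")

            # Anything new gets wired into that existing VPC.
            lb = elbv2.ApplicationLoadBalancer(self, "Lb", vpc=vpc, internet_facing=True)
            lb.add_listener("Http", port=80,
                            default_action=elbv2.ListenerAction.fixed_response(200))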
Still, the argument stands. Also, the points you listed should be expected as standard - everybody will sooner or later need scaling or image updates, right?
I agree. And you just described PaaS like Render.com and Heroku, and DIY PaaS like Convox, and I'm sure there are others. I don't understand why companies - especially young ones - mess with all the low-level infra stuff. It's such a time suck and you end up with a fragile system.
> So why is it so hard to get all of that orchestrated? Why do we have to boil the ocean to get some sane defaults for all of this? Why do we need to micromanage absolutely everything here?
This post truly resonates with me; however, I don't think that we appreciate just how many things are necessary to run a web application and do it well. There is an incredible amount of complexity that we attempt to abstract away.
Sometimes I wish that there'd be a tool that could tell me just how many active lines of code are responsible for the processes that are currently running on any of my servers, and in which languages. Off the top of my head, here's what's necessary to ship an enterprise web app in 2021.
RUNTIMES - No one* writes web applications in assembler code or a low level language like C with no dependencies - there is usually a complex runtime like JVM (for Java), CLR (for .NET), or whatever Python or Ruby implementations are used, which are already absolutely huge.
LIBRARIES - Then there are libraries for doing common tasks in each language, be it serving web requests, serving files, processing JSON data, doing server side rendering, doing RPC or some sort of message queueing etc, in part due to there not being just one web development language, but many. Whether this is a good thing or a bad thing, I'm not sure. Oh, and the front end can also be really complex, since there are numerous libraries/frameworks out there for getting stuff rendering in a browser in an interactive way (Angular, Vue, React, jQuery), each with their own toolchains.
PACKAGING - But then there are also all the ways to package software, be it Docker containers, other OCI compatible containers (ones that have nothing to do with the Docker toolchain, like buildah + podman), approaches like using Vagrant, or shipping full size VMs, or just copying over some files on a server and either using Ansible, Chef, Puppet, Salt or manually configuring the environment. Automating this can also be done in any number of ways, be it GitLab CI, GitHub Actions, Jenkins, Drone or something else.
RUNNING - When you get to actually running your apps, what you have to manage is an entire operating system, from the network stack, to resource management, to everything else. And, of course, there are multiple OS distributions that have different tools and approaches to a variety of tasks (for example, OpenRC in Alpine vs systemd in Debian/Ubuntu).
INGRESS - But these OSes also don't live in a vacuum, so you end up needing a point of ingress, possibly load balancing or rate limiting, so eventually you introduce something like Apache, Nginx, Caddy, Traefik and optionally something like certbot for the former two. Those are absolutely huge dependencies as well; just have a look at how many modules the typical Apache installation has, all to make sure that your site can be viewed securely, do any rate limiting, path rewriting etc.!
DATA - And of course you'll also need to store your data somewhere. You might manage your databases with the aforementioned approaches to automate configuration and even running them, but at the end of the day you are still running something that has decades of research and updates behind them, regardless of whether it's SQLite, MariaDB, MySQL, PostgreSQL, SQL Server, S3, MongoDB, Redis or anything else. All of which have their own ways of interacting with them and different use cases, for example, you might use MariaDB for data storage, S3 for files and Redis for cache.
SUPPORT - And that's still not it! You also probably want some analytics, be it Google Analytics, Matomo, or something else. And monitoring, something like Nagios, Zabbix, or a setup with Prometheus and Grafana. Oh and you better run something for log aggregation, like ELK or Graylog. And don't forget about APM as well, to see what's going on in your app in depth, like Apache Skywalking or anything else.
OTHERS - There can be additional solutions in there as well, such as a service mesh to aid with discoverability of services, circuit breakers to route traffic appropriately, security solutions like Vault to make sure that your credentials aren't leaked, sometimes an auto scaling solution as well etc.
In summary, it's not just because of there being a lot of tools for doing any single thing, but rather that there are far too many concerns to be addressed in the first place. To that end, it's really amazing that you can even run things on a Raspberry Pi in the first place, and that many of the tools can scale from a small VPS to huge servers that would handle millions of requests.
That said, it doesn't have to always be this complex. If you want to have a maximally simple setup, just use something like PHP with an RDBMS like MariaDB/MySQL and server side rendering. Serve it out of a cheap VPS (I have been using Time4VPS, affiliate link in case you want to check them out: https://www.time4vps.com/?affid=5294, though DigitalOcean, Vultr, Hetzner, Linode and others are perfectly fine too), maybe use some super minimal CI like GitLab CI, Drone, or whatever your platform of choice supports.
That should be enough for most side projects and personal pages. I also opted for a Docker container with Docker Swarm + Portainer, since that's the simplest setup that I can use for a large variety of software and my own projects in different technologies, though that's a personal preference. Of course, not every project needs to scale to serving millions of users, so it's not like I need something advanced like Kubernetes (well, Rancher + K3s can also be good, though many people also enjoy Nomad).
Edit: there are PaaS out there that make things noticeably easier by taking care of some of the things above for you, but that can lead to vendor lock-in, so be careful with those. Regardless, maybe solutions like Heroku or Fly.io are worth checking out as well, though I'd suggest you read this article: https://www.gnu.org/philosophy/who-does-that-server-really-s...
> There's always a vpc with a bunch of subnets for the different AZs.
It’s funny because you’re already out of touch with how a lot of people would avoid having to do this in 2021. If your stack was simpler, you might not have such infra as code dependencies.
When you say “if your stack was simpler” do you mean “if your problem was trivial”? I’m always interested in simpler ways to do things, but solving distributed system and data governance issues tends to involve putting things in different places.
Deploy everything using a cloud-native architecture and have all services internet-facing; read about "zero trust networks" to understand more about securing such things.
Maybe there are "data governance issues" stopping the internet-facing thing from happening. But if not, that's a more modern approach than three-tier network segmentation.
Managed services offer big benefits over software. With CF, new stacks, change sets, updates, rollbacks and drift detection are an API call away.
Managed service providers offer big benefits over software. With CF and AWS support, help with problems are a support ticket away.
Using a single cloud provider has a big benefit over a multi-cloud tooling. I only run workloads on AWS, so the CF syntax, specs and docs unlocks endless first party features. A portable Terraform + Kubernetes contraption is a lowest common denominator approach.
Of course everything depends.
I've configured literally 1000s of systems with CloudFormation with very few problems.
I have seen Terraform turn into a tire-fire of migrations from state files to Terraform enterprise to Atlantis that took an entire DevOps team to care for.
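To make the "an API call away" point concrete, here's roughly what that looks like with boto3 (the stack name and template file are placeholders):

    # Rough sketch: change sets and drift detection as plain API calls (boto3).
    # Stack name, template file and capabilities are placeholders.
    import boto3

    cfn = boto3.client("cloudformation")

    # Propose a change set instead of applying changes blindly.
    cfn.create_change_set(
        StackName="my-service",
        ChangeSetName="my-change",
        TemplateBody=open("template.yaml").read(),
        Capabilities=["CAPABILITY_IAM"],
    )

    # Drift detection is likewise a single call (plus a status poll).
    detection = cfn.detect_stack_drift(StackName="my-service")
    status = cfn.describe_stack_drift_detection_status(
        StackDriftDetectionId=detection["StackDriftDetectionId"]
    )
    print(status["DetectionStatus"])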
The main use case of Terraform is not portability. Have fun porting your SQS queue, DynamoDB table or VPC config to GCP or Azure equivalents - it won't look similar at all except for the resource name. However, if you are only running containers and virtual machines, sure, you can benefit from portability.
CloudFormation lags features so badly that you end up with hacks or Lambda functions with custom resources. DynamoDB Global Tables took 4-5 years to become available in CloudFormation.
I've also seen wrongly constructed CloudFormation deleting critical databases, hanging (often), timing out, and rollbacks hanging or not succeeding, so it's not always rainbows & sunshine there either. I also don't like Terraform for its excessive use of state files, handling state files with DynamoDB locks, and keeping them on S3.
I won't deny its good features - being managed is a huge plus - but it's so slow and lags so far behind, its YAML is so verbose, and it has stack size limits; it's always a workaround with CloudFormation. My company uses it by abstraction for an internal PaaS and deployment automation, and it takes a long time for trivial changes to complete.
So in short, neither is perfect, but for me Terraform is easier to use, easier to debug, and faster, and its features don't lag nearly as much as CF's. Those are good enough reasons for me to avoid CloudFormation. I also don't like CDK because it's too verbose and it's still CF underneath, and I would rather generate Terraform JSON/HCL myself if I need more logic.
Terraform also helps when you need to configure multiple stacks, e.g. for a service you can have a module that reflects your organizational best practices: a Fargate service for running containers, automatic Datadog dashboards, CloudWatch alarms connected to Opsgenie/PagerDuty, etc.
> hanging (often), timing out, and rollbacks hanging or not succeeding
The timeouts in CF are ridiculous. Especially with app deployments. I can't remember which service it is, but something can wait up to 30min on a failed `apply` and then wait the same on a failed revert. Only then can you actually deploy the next version (as long as it wasn't part of the first deploy; then you get to wait until everything gets deleted as well).
(yes, in many cases you can override the timeouts, but let's be honest - who remembers them all on the first run or ever?)
I've been using CF for a few years now with minimal complaints but I just hit a create changeset endless timeout (2 days to finally time out).
The worst part is that there are no error messages. When it fails and I click "Details", it takes me to the stack page and shows 100% green. Support ticket seems slow to get a response too.
That aside, my overall experience has been positive!
Terraform has its fair share of lag as well. One particular case that irks me is that the "meta" field on Vault tokens is unsupported. Vault of course being another first-class Hashicorp product makes this particularly odd.
That being said the Vault provider is open source and it's quite easy to add it and roll your own.
If I'm remembering correctly, I'm pretty sure the Vault provider for Terraform was originally contributed by an outside company rather than inside HashiCorp. My guess would be that it has encountered what seems to be a common fate for software that doesn't cleanly fit into the org chart: HashiCorp can't decide whether the Vault team or the Terraform team ought to own it, and therefore effectively nobody owns it, and so it falls behind.
Pretty sure that's the same problem with CloudFormation, albeit at a different scale: these service teams are providing APIs for this stuff already, so do they really want to spend time writing another layer of API on top? Ideally yes, but when you have deadlines you've gotta prioritize.
I really don't get why features come so late to CloudFormation - I guess AWS don't use much CloudFormation internally then, but surely they're not stringing together AWS cli calls? CDK is reasonably new too, unless they waited a long time to go public with it.
Many/most teams internally using AWS use CloudFormation; an AWS service I was a part of was almost entirely built on top of other AWS services, and the standard development mechanism is to use CFN to maintain your infra. You only do drastic things like "stringing CLI calls" if there's something missing from CFN and not coming out soon enough, in which case maybe someone writes a custom CFN resource and you run it in your service account.
Depending on how old the service is, the ownership of the CFN resource may be on the CFN service team (from back when they white-gloved everything) in which case there are massive development bottlenecks (there are campaigns to migrate these to service team ownership) or more often the resource is maintained by the service team itself, in which case the team may not be properly prioritizing CFN support. There can be a whole lot of crap to deal with for releasing a new CFN _resource_, though features within a resource are relatively easy.
On my last team, we did not consider an API feature delivered until it was available in CFN, and we typically turned it around within a couple of weeks of general API availability.
CDK is a higher-level (and awesome in my experience) way to just generate CloudFormation specs. In other words, you need both CloudFormation and CDK support for features to become available there.
In terms of getting new features fast CDK is strictly worse than CloudFormation.
Agree. CF is not a magic bullet, but neither is ansible or terraform.
We used ansible heavily with AWS for 2 years. Then we decided to gut it out and do CF directly. Why? If we want to switch clouds, it's not like the ansible or terraform modules are transferable ... So might as well go the native supported route.
I agree with the article, messages can be cryptic, but at the end of the day, I have a CF stack that represents an entity. I can blow away the stack, and if there's any failure or issue, I can escalate my permissions and kill it again. Still a problem? Then it's AWS's fault and a ticket away (though I've only had to do this once in 5 years and > 150,000 CF stacks).
I also would argue, if a stack deletion stalls development, you are probably using hard-coded stack names, which isn't wise. Throw in a "random" value like a commit or pipeline identifier.
I've had far fewer issues with CF than with terraform or ansible. I have yet to see CF break backward compatibility, while I had a nightmare day when I couldn't run any playbooks in ansible because a module had a new required parameter on a minor or patch version bump (which was when I called it quits on ansible; I then re-looked at terraform, and decided to go native).
I will caveat that our use case for AWS involves LOTS of creation and deletion, so I find it super helpful to manage my infrastructure in "stacks" that are created and deleted as a unit. I don't need to worry about partial creations or deletions... like ever... It basically never fails redoing known-working stuff... Only "first time", and usually because we follow least-privilege heavily.
Yes Ansible does have extensions and can be used to provision AWS services.
The approaches of CloudFormation/Terraform/Pulumi and Ansible are entirely different though.
The former are declarative: they define how the end state should look. Ansible is a task runner: you define a set of tasks it needs to execute to get to the end state.
I strongly advise against using Ansible for provisioning resources. It's idempotent by convention only. When I had to reluctantly use it for jobs, it was extremely difficult to get a repeatable, deterministic environment set up. Each execution led to a different state and was just a nightmare to deal with.
CloudFormation/Terraform/Pulumi are much better in that regard, as they generate a graph of what the end state should be, check the current state, and generate an execution plan for how to make the current state look like the target state.
Where Ansible is better than CloudFormation/Terraform/Pulumi is when you have a bunch of servers already set up and you want to change the software installed or the configuration on them. Changing config/provisioning at runtime is a bit of an anti-pattern these days, though. You can change that slightly and use Ansible with Packer to generate pre-baked images, which works OK if you don't mind lots of YAML. That isn't too bad and plays to Ansible's strengths, although these days most people don't pre-bake images thanks to containerization. Also, if you are only using Ansible to provision config on a host, Nix achieves this much more elegantly and reliably.
We already used ansible for other things, so it wasn't too hard to swap over to AWS modules... (Except they were inconsistent and poorly supported, we ultimately found out)
Someone at Hashicorp then convinced mgmt that terraform is almost a write-once system, and we could jump from AWS to Azure or GCP easily "just change the module!"... When actual engineers looked at it, after 3 days there was almost a mutiny and we rejected terraform mostly based on the fact someone lied to our managers to try and get us to adopt it... I know someone who is very happy with terraform nowadays, but that ship sailed for us.
Those were basically the only people in this space, so we started rewriting ansible to CloudFormation. Since we mostly use lambdas to trigger the creation of CF stacks, this really works well for us, since our lambdas exist for less than a second to execute, and then we can check in later to see if there's issues (which is less than 1 in 50,000? 100,000? in my experience... Except for core AWS outages which are irrespective of CF). Compared to our ansible (and limited terraform) setups which required us to run servers or ECS tasks to manage the deploy. We can currently auto-scale at lambda scale-up speed to create up to 30 stacks a second if demand surges (the stack might take 2-3 minutes to be ready, but it's now async). Under ansible/terraform we had to make more servers and worker nodes to watch the processes... And our deployment was .3/.4 stacks per minute per worker (and scaling up required us to make more workers before we could scale up for incoming requests)
If I was building today, I'd probably revisit terraform, but I think the cdk or CF are still what I'd recommend unless there's a need for more-than-AWS... E.g. multi-cloud deployments, or doing post-creation steps that can't be passed in by userdata / cloud-init.. in which case CF can't do the job alone and might not be the right tool.
I'm a big proponent of CF when you are using AWS, but if you are on GCP, don't even bother with their managed tool, just go straight to TF. Their Deployment Manager is very buggy (or at least it was 2 years ago).
CloudFormation/Terraform/etc are also configuration management programs. They just work on the APIs of cloud vendors, rather than a package management tool's command-line options. They've been given a new name because people want to believe they're not just re-inventing the wheel, or that operating on cloud resources makes their tool magically superior.
> We used ansible heavily with AWS for 2 years. Then we decided to gut it out and do CF directly. Why? If we want to switch clouds, it's not like the ansible or terraform modules are transferable ... So might as well go the native supported route.
>
> I agree with the article, messages can be cryptic, but at the end of the day, I have a CF stack that represents an entity. I can blow away the stack, and if there's any failure or issue, I can escalate my permissions and kill it again. Still a problem? Then it's AWS's fault and a ticket away (though I've only had to do this once in 5 years and > 150,000 CF stacks).
Another killer feature is StackSet. I managed to rewrite datadog integration CF (their version required manual steps) to a template that contained custom resources that made calls to DD to do registration on their side.
I then deployed such template through StackSets and bam, every account in specific OU automatically configures itself without any manual steps.
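Roughly, the moving parts with boto3 look like this (the OU id, region, template and names are placeholders):

    # Sketch of deploying one template to every account in an OU via StackSets.
    # OU id, region, template and names are placeholders.
    import boto3

    cfn = boto3.client("cloudformation")

    cfn.create_stack_set(
        StackSetName="datadog-integration",
        TemplateBody=open("datadog-integration.yaml").read(),
        PermissionModel="SERVICE_MANAGED",  # let Organizations handle cross-account roles
        AutoDeployment={"Enabled": True, "RetainStacksOnAccountRemoval": False},
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )

    # Every existing and future account in the OU gets a stack instance automatically.
    cfn.create_stack_instances(
        StackSetName="datadog-integration",
        DeploymentTargets={"OrganizationalUnitIds": ["ou-abcd-12345678"]},
        Regions=["us-east-1"],
    )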
> Managed services offer big benefits over software. With CF, new stacks, change sets, updates, rollbacks and drift detection are an API call away.
>
> Managed service providers offer big benefits over software. With CF and AWS support, help with problems are a support ticket away.
The problem is when those help tickets get responses like “try deleting everything by hand and see if it recreates without an error next time”. They've worked on CloudFormation over the last year or so, but everyone I've known who's switched to tools like Terraform did so after getting tired of unpredictable deployment times or hitting the many cases where CloudFormation gets itself into an irrecoverable state. I can count on no fingers the number of development teams who used CF and didn't ask for help recovering from an error state in CF which required out-of-band remediation.
I believe they've also gotten better at tracking new AWS features but there were multiple cases where using Terraform got you the ability to use a feature 6+ months ahead of CF.
> A portable Terraform + Kubernetes contraption is a lowest common denominator approach.
Terraform is much, much richer than CloudFormation so I'd compare it to CDK (with the usual aesthetic debate over declarative vs. procedural models) and it doesn't really make sense to call it LCD in the same way that you might use that to describe Kubernetes because it's not trying to build an abstraction which covers up the underlying platform details. Most of the Terraform I've written controls AWS but there's a significant value to also being able to use the same tool to control GCP, GitLab, Cloudflare, Docker, various enterprise tools, etc. with full access to native functionality.
Terraform (and Kubernetes) themselves aren't a lowest common denominator; however, I believe the comment alludes to an approach where you try to abstract away cloud features. This can (kind of) reasonably be done with Terraform and Kubernetes by avoiding vendor-specific services such as various ML services, DynamoDB, etc.
However, you can use Terraform just fine while still leveraging vendor-specific services that actually offer added value, like DynamoDB or Lambda. CloudFormation, however, doesn't really offer that much added value (if any) over Terraform, so using Terraform isn't an LCD approach per se.
Yes — that's basically what I was thinking: you could make an argument that using Kubernetes inherently adds an abstraction layer which might not be preferable to using platform-native components but it sounded like the person I was responding to was making the argument that using Terraform requires that approach.
I found that especially puzzling because one of the reasons why we switched to Terraform was because it let us take advantage of new AWS features on average much faster than CloudFormation.
> Managed services offer big benefits over software.
TF can be used as a managed service.
> Managed service providers offer big benefits over software. With CF and AWS support, help with problems are a support ticket away.
The same is true with TF, except 100000% better unless you're paying boatloads of money for higher tiered support.
> I only run workloads on AWS, so the CF syntax, specs and docs unlocks endless first party features.
CF syntax is an abomination. Lots of the bounds of CF are dogmatic and unhelpful.
> I have seen Terraform turn into a tire-fire of migrations from state files to Terraform enterprise to Atlantis that took an entire DevOps team to care for.
CF generally takes an entire DevOps team to care for, for any substantial project.
Sure, but I've never seen that myself. Where TF was used, it was always a self-managed setup at best.
> The same is true with TF, except 100000% better unless you're paying boatloads of money for higher tiered support.
Again, all places I worked had enterprise support and even rep assigned. I think I only used support for CF early on, I don't know if it was buggier back then or I just understood it better and didn't run into issues with it.
> CF syntax is an abomination. Lots of the bounds of CF are dogmatic and unhelpful.
I would agree with you if you were talking about JSON, but since they introduced YAML it is actually better than HCL. One great thing about YAML is that it can be easily generated programmatically without using templates. Things like Troposphere make it even better.
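As a tiny illustration of what "generated programmatically" means here (the resource and output names are placeholders):

    # Sketch: emitting CloudFormation YAML from Python instead of hand-writing templates.
    # Resource and output names are placeholders.
    from troposphere import Template, Output, Ref
    from troposphere.s3 import Bucket

    t = Template()
    bucket = t.add_resource(Bucket("ArtifactBucket"))
    t.add_output(Output("BucketName", Value=Ref(bucket)))

    # Plain CloudFormation YAML comes out; no text templating involved.
    print(t.to_yaml())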
> CF generally takes an entire DevOps team to care for, for any substantial project.
Over nearly 10 years of experience, I've never seen that be the case. I'm currently at a place that has an interesting approach: you're responsible for the deployment of your app, so you can use whatever you want, but you're responsible for it.
So now I'm working with both. And IMO I see a lot of resources that are not being cleaned up (because there's no overview page like CF has, so people often forget to deprovision stuff), and I'm also seeing bugs, like TF needing to be run twice (I think the last time I saw it fail, it was trying to set tags on a resource that wasn't fully created yet).
There are also situations where CF is just plain better. I mentioned in another comment how I managed to get the Datadog integration set up through a single CF file deployed through a StackSet (this basically ensured that any new account is properly configured). If I ended up using TF for this, I would likely have to write some kind of service that listens for events from Control Tower and, whenever a new account was added to the OU, runs terraform to configure resources on our side and makes an API call to DD to configure it to use them.
All I did was write code that generated CF via troposphere and deploy it to a StackSet in the master account once.
Right, your post is mostly "I like the thing that I've used, and I do not like the thing I haven't used". They're apples and different apples.
> Again, all places I worked had enterprise support and even rep assigned
So, again, you've worked at places that were deeply invested in CF workflows.
> but since they introduced YAML it is actually better than HCL. One great thing about YAML is that it can be easily generated programmatically without using templates.
Respectfully, this is the first-ever "yaml is good" post I think I've ever seen.
> Over nearly 10 years of experience, I've never seen that be the case. I'm currently at a place that has an interesting approach: you're responsible for the deployment of your app, so you can use whatever you want, but you're responsible for it.
I'd love to hear more about this.
> And IMO I see a lot of resources that are not being cleaned up (because there's no overview page like CF has, so people often forget to deprovision stuff), and I'm also seeing bugs, like TF needing to be run twice (I think the last time I saw it fail, it was trying to set tags on a resource that wasn't fully created yet).
I guess we're just ignoring CF's rollback failures/delete failures/undeletable resources that require support tickets then?
> There are also situations where CF is just plain better. I mentioned in another comment how I managed to get the Datadog integration set up through a single CF file deployed through a StackSet (this basically ensured that any new account is properly configured). If I ended up using TF for this, I would likely have to write some kind of service that listens for events from Control Tower and, whenever a new account was added to the OU, runs terraform to configure resources on our side and makes an API call to DD to configure it to use them.
Again respectfully, yes, the person that both doesn't like and hasn't invested time into using Terraform at scale probably isn't going to find good solutions for complicated problems with it.
While this is true and AWS support is very responsive and useful, it doesn't mean they solve all the problems. Sometimes their help is: "I'll note that as a feature request, in the meantime you can implement this yourself using lambdas".
CloudFormation can be very flexible, especially with tools like sceptre, it can work very well. A huge issue is that WITHOUT tools like sceptre, you can't really use stacks except as dumb silos. You already need additional tooling (sceptre, CDK, SAM, ...) to make CF workable. I think that most people who despise CF haven't got good tooling.
The issue with CloudFormation is that it lags behind all other AWS services quite often. It seems to be maintained by a separate team. I realize that getting state management for complex interdependent resources right requires time and diligence, BUT it's a factor in driving adoption.
- New EKS feature? Sorry, no knob in CF for MONTHS.
- New EBS gp3 volume type available? Sorry, our AWS::Another::Service resource does not accept gp3 values for months after feature release.
- An AWS API exposes information that you would like to use as an input to another resource or stack. SURPRISE, CloudFormation does not return that attribute to you, despite it being available. SORRY, NOT FIXABLE.
- Try refactoring and moving resources in/out of stacks while preserving state? Welcome to a fresh hell.
- Quality of CloudFormation "modules" varies. AWS::ElasticSearch::Domain used to suck a whole lot, AWS::S3::Bucket as a core service was always very friendly.
- CloudFormation custom resources are a painful way of "extending" CF to suit your needs. So painful that I refuse to pay the cost of AWS not keeping their stuff up to date and well integrated.
These kinds of lag, this kind of incompleteness when it comes to making information from AWS APIs available have driven me to Terraform for things that are best done in Terraform and require flexibility and CloudFormation for things that work well with it.
At the end of the day, CF is a closed-source flavour of Terraform + the AWS provider. I would like to have gone all in, but it just doesn't work, and it costs hacks, time and flexibility.
That being said, if you have no idea how to work with TF, tell the devs to use CF.
The experience of AWS support is probably also very different when it comes to feature requests. An org that spends half a billion with AWS will get their requests implemented ASAP whereas small fish have to swim with the flow and hope it works for them.
Do you understand the difference between the things you can express with Dhall or ML vs Go or TS?
People are so hung up on this; we could do so much better by expressing only valid states, and then you would not need to deploy your infra to figure out that an S3 bucket cannot have a redirect_only flag set and a website index document set at the same time.
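As a toy sketch of that "only valid states" idea (the types and field names below are invented for illustration and don't match the real S3 API):

    # Toy example: make the invalid combination (redirect + index document) unrepresentable,
    # so it's caught by a type checker rather than by a failed deploy.
    # Field names are invented for illustration; the real S3 website config differs.
    from dataclasses import dataclass
    from typing import Optional, Union

    @dataclass
    class RedirectAllRequests:
        target_host: str

    @dataclass
    class StaticWebsite:
        index_document: str
        error_document: str = "error.html"

    # A bucket website is *either* a redirect *or* a static site - never both.
    WebsiteConfig = Union[RedirectAllRequests, StaticWebsite]

    @dataclass
    class Bucket:
        name: str
        website: Optional[WebsiteConfig] = None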
That's right - use AWS CDK instead. You don't have to worry about the low-level CloudFormation syntax and details. I switched a few years ago and haven't looked back. CDK keeps getting better and better, also handling things like asset deployments (Docker images, S3 content, bundling), quick Lambda updates with --hotswap, quick stack debugging with --no-rollback, etc.
I've been developing with CDK for about 6 months now, and it seems to me like a great idea.
However, the implementation is quite lackluster.
- It's very slow. Deploying our app from scratch takes >30 minutes. An update of a single lambda (implementation, not permissions) takes about 3-5 minutes.
- Bad error information. On an error, cdk usually just reports which component couldn't be created. Not why it couldn't be created. That information you 'might' find in the aws cloudformation console, or not.
- It's buggy. Some deployments randomly fail because cdk couldn't find your cli credentials. Sometimes resources can't be created. In each case it works if you deploy again...
I suspect that most shortcomings that I experience are due to cloudformation. I'd really be interested in a 2.0 version that is either highly optimized or polished, or works with a different intermediate abstraction entirely. Updating and validating a tree of resources and policies should not be that difficult, at least not for AWS.
I think that the --hotswap option does what you are hoping for speedier updates. It allows CDK to update some cloud resources directly (e.g. Lambda function code) so that they "drift" away from the CloudFormation template definition. It's good for quick development.
CDK is terrible and way too low level for most apps, and worst of both worlds for code/declarative IaC (verbose & cognitive overhead heavy way to try to author code that generates a CF template - that still may not work).
Use AWS Copilot, it's like a reboot of Elastic Beanstalk complete with good CLI, made for the Fargate age and lets you extend it with bits of CF. (https://aws.github.io/copilot-cli/)
It sounds like your development is focused on containers, and perhaps Copilot is best for that. AWS CDK works very well when your application is based on many different kinds of cloud resources that need to be connected together, utilizing all the serverless and managed features of AWS.
Copilot supports serverless too (App Runner, serverless Aurora, DynamoDB, S3, CodePipeline etc). But if your app is using a huge number of AWS features, you're probably over-engineering it.
Sure it's opinionated enough that it won't fit every use case people have on AWS, but in most cases for new apps it's good.
I agree, and have the same experience. CDK is so much easier, much less verbose, and unit testable (at least to some degree).
Since resource importing is possible in CDK (not nice, but possible) you can even start using it if you already have resources that you do not want to recreate.
I'm always surprised that more people aren't aware of CDK. It's an extremely powerful way to write software, especially once you get good at it. CFN pales in comparison; CDK to me feels like the future of software development.
My comments are my own and don't represent my employer.
Almost every new service built inside Amazon uses CDK. I too am surprised that more people aren't aware of it. And you're right once you get good at it you can spin up infrastructure incredibly quickly with minimal guess work.
Yea the nodejs dependency is an abomination. The cdk also requires a nodejs version newer than what comes on many linux distributions so you end up installing a custom nodejs version just to run cdk.
I don’t know why they picked nodejs over a more sensible language such as python3 which comes installed almost anywhere or golang which is easy to distribute in binary form and relatively reverse compatible.
The aws cli really got it right: just let me install a standalone binary and call it a day.
AWS CDK also works with Python. However, in my experience, TypeScript-based CDK projects are more productive to develop, due to all the built-in error checking and autocompletion features in Visual Studio Code. Node.js also provides a more coherent environment with the package.json structure, which defines all the scripts and dependencies etc. Python-based CDK projects seem to include more "custom hacks". Because of all this, I prefer to use TypeScript for CDK even if the actual application to deploy is based on Python.
Regardless of what language you use to write your cdk stacks, the cdk cli application which transpiles your cdk stacks into cloudformation requires nodejs. So in reality cdk requires nodejs — always — and then whatever language you are writing the stacks in as well. It becomes a pain in the ass when you have a python application which is defined in a python cdk stack, but you still need nodejs to transpile your python cdk stack into cloudformation.
My main gripe is that the cdk cli should be distributed as a standalone binary so I don’t need to install nodejs to deploy my project.
Ok thanks for clarifying. I guess we already have Node.js everywhere where CDK is used, so never noticed it. I can see it's uncomfortable when that is not the case.
Maybe CloudFormation is too entrenched at the moment? I think CDK became publicly available in late 2019 or early 2020.
I haven't used it because I'm currently working on a project using Serverless (thin layer on top of CloudFormation) with no easy way to slip any CDK in.
That's very true. :-( I really don't understand why AWS doesn't optimize the CloudFormation backend. I don't see any reason why it couldn't update resources basically as fast as the APIs allow.
The purpose of CDK is exactly to reduce the complexity of your IaC. It automatically sets up all kinds of defaults, such as required permissions between different cloud resources that need to talk to each other. The underlying complex things are handled by AWS-managed library code.
This doesn't even include one of the most useful features of CDK: truly modular components. I can create a module used 100 times and all you have to worry about is the constructor and inputs instead of the nightmarish lack of reusability in other IaC.
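A hedged sketch of what such a reusable construct can look like in Python CDK (the names, defaults and alarm threshold here are invented):

    # Sketch of a reusable construct: a queue + dead-letter queue + age alarm behind one class.
    # Names, defaults and the alarm threshold are invented; assumes aws-cdk-lib v2 for Python.
    from aws_cdk import Duration
    from aws_cdk import aws_sqs as sqs, aws_cloudwatch as cloudwatch
    from constructs import Construct

    class WorkerQueue(Construct):
        """Queue with a dead-letter queue and an oldest-message-age alarm."""
        def __init__(self, scope: Construct, id: str, *, max_receive_count: int = 3) -> None:
            super().__init__(scope, id)

            dlq = sqs.Queue(self, "DLQ", retention_period=Duration.days(14))
            self.queue = sqs.Queue(
                self, "Queue",
                visibility_timeout=Duration.minutes(5),
                dead_letter_queue=sqs.DeadLetterQueue(queue=dlq, max_receive_count=max_receive_count),
            )
            cloudwatch.Alarm(
                self, "OldMessages",
                metric=self.queue.metric_approximate_age_of_oldest_message(),
                threshold=900,  # seconds before we consider the queue "stuck"
                evaluation_periods=1,
            )

Callers just instantiate the construct wherever they need it and only ever see the constructor and its inputs.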
I came here to say that Pulumi is the answer, but people already mentioned it. It's TF under the hood, some of the docs links even refer to the links in TF documentation [not intentionally, just by copy and paste], but they made the experience so much better. Writing code in Typescript and seeing your infrastructure provisioned with all the goodies Pulumi added on top of TF is pure joy. No more HCL, or other "here's my DSL, learn it" scenarios.
Pulumi is not "Terraform under the hood" in any sense than that some of the providers are forks of Terraform providers.
The engine is entirely separate and, now that they have a foothold with it, they are building out providers that are built around the Pulumi execution model first-class, rather than an adapter layer.
One notable core difference in the Pulumi model is that it doesn't have "plan" in the same sense that Terraform does, where the set of actions is fixed and just executed once you accept it, returning an error if that turns out not to be possible.
"pulumi preview" just runs the program in a dry-run mode, so there's no guarantee that the subsequent apply will match unless you've been careful to write a totally deterministic program that always behaves the same with placeholder values as it does for concrete values.
That's a typical tradeoff for using a DSL vs. a general-purpose language though, and one that seems to pay off for a lot of teams.
Commenting as an employee, with the account created for that purpose? It's not true what you're saying, but it's good that you're going that route. As I've said, I am a user of Pulumi, and you're doing loads of things good, just don't try to discard that you started as "TF under the hood".
I don't and never have worked at Pulumi, but I've read the open source code and can see that the Pulumi engine is distinct from what HashiCorp seems to call "Terraform CLI" or "Terraform Core".
But there are indeed repositories in their GitHub org which started as forks of Terraform providers, as I said. Pretty good strategy to start from all of the generic glue code somebody else already wrote and then adapt it to a new engine, because that's the undifferentiated heavy lifting that needs to be done regardless of the engine design.
I’ve seen teammates write enough impenetrable JS/Python/ etc code that I’m pretty comfortable saying I want none of that near the machinery that provisions potentially expensive amounts of infrastructure.
I like that HCL is declarative and deliberately restricted in what's possible; it means I never have to think about some dev writing “clever” code or over-abstracting things, or writing bugs that spin up too many machines.
HCL being so restricted is the reason why many teams feel the need to implement a layer on top of it.
Once you’re working with HCL templated via Gomplate or what have you, and a pipeline which processes the code base and then spits out the resultant HCL as an artifact to be used by a subsequent step, you’ll pine for the impenetrable JS or Python code. At least that is code which has a reasonable development workflow associated with it, not a monstrosity held together with duct tape and vulgar hacks.
I recently did a comparison of Pulumi and CDK. Pulumi's AWS abstractions are sadly nowhere near as good as CDK patterns, resulting in far more code needing to be written for the same result. I've also noticed that Pulumi seems to have a hard time accurately tracking state for AWS resources.
Pulumi does work very well in the Azure space as Azure resources are already well abstracted.
The Pulumi providers for major clouds (AWS/GCP/Azure) have native versions instead of using the TF providers under the hood. Caveat: some are still in preview.
I stopped arguing about which is better CF or TF. Some people go crazy every time this topic is discussed.
Me, I just hate all this IAC thing. I spend weeks sometimes to deliver a script that deploys an rds database with blah blah blah. Yeah it is nice to have all your infrastructure as code. It would be definitely useful if you are deploying the same structure again and again. or if you are running it frequently.
But if you are deploying a structure once only, do you really need this? Do you have to spend 2 weeks on a script you will re-run once in 6 months? (and then you will find that many objects need to be updated to re-run without errors) It seems a lot of over-engineering for me (I know I will be hated so much for this).
Sometimes when I run TF in production, I start praying that all goes well!
IaC is faster than doing things manually. If you're spending weeks on it, that sounds to me like you either haven't yet practiced enough or you're working in an environment that's already a mess.
Once you have a bit of practice, using IaC tools is just better in every way compared to creating resources manually.
It's faster, it's more consistent and robust, serves as emergency documentation and allows easy and usable change tracking and management because you can put the code in a git repository.
If your Terraform manifests are going out of sync, that means your team is not actually using it and the manual, untrackable change is messing with you.
I wrote the initial set of manifests for one setup (dozens of components and 3 identical environments) about 5 years ago. Since then, a different team has been expanding and managing the system and I really have no idea of everything that's there nowadays, but I do still occasionally need to go in and make changes and I can do that confidently because they kept using Terraform and I can just read the code to know what's new. I did make some mistakes (choosing Chef for CM, it wasn't a good fit in retrospect), but aggressively automating everything was not one of them.
I also have one environment that I wrote which consists of a VPC, an EIP, a single EC2 instance and S3 buckets and policies for cross-account backups. I also automated that, and it definitely resulted in a better result faster than doing it manually. Turns out, even setting up a single server to run a single piece of well-packaged software (plus a local database) has a lot you need to take into account if you want to do everything properly, including backups and monitoring. On top of it, testing my backups was easy because I could just delete the whole environment and run my automation again to rebuild it, so I know that as long as there's a backup in a bucket somewhere, the system can be recovered in minutes.
One thing I can't see mentioned is disaster recovery. We expect the cloud to be largely bulletproof but that has been proven wrong, we could easily lose something for multiple hours, which for a SaaS business is very bad for business. When Azure London went down for 6 hours, or whatever, we should have been able to recreate our microservices cluster in another centre but we simply didn't have it setup like that and had to live with the outage.
If we had to recreate any of our cloud system manually, it would take an enormous amount of time and would probably not work properly since we forgot those registry settings/frameworks/configurations that we did 2 years ago.
Part of the point of IaC is the ability to recreate all or most of what we need at the click of a button. Even if you had everything documented, the physical time would be prohibitive.
> But if you are deploying a structure once only, do you really need this?
Yes. Because "once" is never just once if you're doing something remotely useful. At a bare minimum you need to pull updates for your upstream dependencies and redeploy once in a while, even if you're doing zero extension or replication of actual functionality.
> Do you have to spend 2 weeks on a script you will re-run once in 6 months?
It does not take 2 weeks to write the vast majority of IaC. And even if it does - at least you've saved the next person who needs to update it a world of complications in having to know how to manually curate the system.
They're not though. They talk about deploying multiple times. They also talk about problems with drift, which means the infrastructure is probably changing outside of IaC. The problem is the process at the OP's company, not IaC.
This is why companies like Gruntwork have pre-written production-ready Terraform modules for just about any service/provider you would need.
I have my own set of modules that I've written and use in my consulting work. It's repeatable work that I can have tailored to my clients in under an hour.
I haven't done that yet, but I'm thinking about going that route. My last job was with a start-up. I got to build everything from scratch, and I was 100% focused on infrastructure. I promised myself that every new thing the team needed would be deployed for them manually, then thoroughly researched and redeployed via IaC asap. That meant that I was always delivering work that the team needed. It also meant that I could take their request and build a really useful module from it.
Example: nobody ever "just wanted" an SQS queue. They wanted a queue, various different-sized workers in an auto-scaling group with metrics to trigger scaling events. And logs. And probably an S3 bucket. And a dead-letter queue. And all the security policies to allow those things to communicate with each other safely. Oh, and custom metrics with alarms that notify you when your queues fill up. So I would build a single module that did all that stuff, using sensible and well-documented defaults. It was executed via a for loop that iterated over a YAML structure. When the dev team wanted a new queue+workers, they edited the YAML (Hiera, in this case), added a few lines to the 'sqs-worker-groups:' block, and sent me a merge request. I would make sure they read the module readme, ask a few questions to make sure the foot-gun wasn't cocked, and merge.
I know some people have said that the developers in their company write their own terraform. That's really neat, but do you get solutions like this, where a single person is focused on writing the DRYest possible code that can be re-used by everyone?
When devs write infrastructure-as-code, there are also soooo many foot-guns that they all have to learn about over and over. Things that a seasoned infra engineer will have shot themselves with once and never again (s3 permissions, IOPS, egress costs, using us-east-1 etc). How many developers would bother studying the AWS infra-focused exam material?
I am a developer who migrated more towards doing ops work. I still enjoy writing app code but most of the work I do now is ops related. That means setting up infrastructure and pipelines for companies to help them ship things quicker and safer.
I think for most organizations having a dedicated ops person is worth it. Chances are developers are going to be developing app features. They won't have time to dedicate an entire day on why the nginx ingress for Kubernetes won't allocate an external IP address when you're using KinD locally with the nginx ingress' Helm chart default arguments.
Basically, there's always something to work on. Either infrastructure related improvements or helping developers do things better and faster in development by writing custom scripts to improve their workflows.
So... what happens when the person who originally deployed everything eventually leaves or gets hit by a bus? Hey, it's six months later and you need to redeploy everything again.
You need some automation or a really good how-to document, and hopefully both. It won't save you all the pain of learning how the deployment works but it will save you a lot of time and effort.
> But if you are deploying a structure once only, do you really need this?
Exact reason why I'm building my company. I worked on several teams now where developers would spend weeks figuring out how to get their apps to run in AWS instead of releasing features. At one company, we solved the issue by letting the devs use Terraform modules - but ended up with a different issue: devs would just copy/paste code without fully understanding why they are spinning the infra in that way. It's my honest opinion that developers should not be forced to become cloud experts.
> I spend weeks sometimes to deliver a script that deploys an rds database with blah blah blah
It can take a while if you're doing it the first time. Once you get used to the CF ideas it's a couple hours max. But keep in mind how that compares to: setting it up by hand, documenting the process so others know it, tagging the relevant services according to project, redoing the process in a while when RDS deprecates your currently running version, redoing it again when you need to change the db instance sizes.
Something funny (well, kind of sad) about CloudFormation I noticed this summer was that if you deploy a CloudFormation stack which updates an ECS service and deploys tasks which then fail health checks, CloudFormation will do nothing about this and just let ECS keep killing and restarting tasks for... well, at least several hours. You have to know to go into ECS and drain the tasks manually and then initiate a rollback from CF to get your service back into a good state. The bug reports I found about this went back years.
The upside is that I got really well acquainted with how ECS worked.
The support for CloudFormation across AWS services is crazy bad. For the longest time you couldn't create ALB rules with it. To fill this gap you have to create dreaded custom resources... which can get stuck for 4 hours.
IIRC, the problem was that rollbacks couldn’t start until it agreed that the service deployment had failed, which wouldn’t happen until I manually drained the task (or waited an unknown period for CF to timeout). You could click the rollback button but nothing would happen. I don’t think the alarm would’ve helped, but maybe I’m misunderstanding.
CFn will allow you to specify up to 5 alarms to monitor during stack deployment as well as a duration to monitor them for up to 3 hours.
During deployment, the stack does not transition into the COMPLETE state until the monitor duration has completed without the alarms being triggered. If one of the alarms is triggered, the stack immediately rolls back.
Since failed ECS tasks create CW Events, you can set up alarms on them and use them as the rollback monitor trigger.
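For reference, a rough sketch of wiring that up with the AWS SDK for JavaScript v3 (stack name and alarm ARN are placeholders):

    import { CloudFormationClient, UpdateStackCommand } from '@aws-sdk/client-cloudformation';

    const cfn = new CloudFormationClient({});
    await cfn.send(new UpdateStackCommand({
      StackName: 'my-service',                 // placeholder
      UsePreviousTemplate: true,
      RollbackConfiguration: {
        MonitoringTimeInMinutes: 60,           // up to 180 (the 3-hour cap mentioned above)
        RollbackTriggers: [{                   // up to 5 alarms
          Arn: 'arn:aws:cloudwatch:us-east-1:123456789012:alarm:ecs-task-failures', // placeholder
          Type: 'AWS::CloudWatch::Alarm',
        }],
      },
    }));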
100x this. Prior company committed to doing Infrastructure as Code and CloudFormation worked well except for this hiccup. We didn't even have that many services on ECS but we probably had 1 ticket a week asking support to help us with a 'stuck' stack.
We doubled down on our commitment to CloudFormation so that we could do containers, Lambda, and 95% of any other AWS service with it....
However, in hindsight, using SAM and the ECS CLI probably would have resulted in a more predictable CI/CD process, as we wouldn't have been fighting deploy semantics through the CloudFormation abstraction.
What we ended up doing is mostly decoupling CF and tasks. CF manages the infrastructure. Software deployments / tasks are done by another system which understands issues like this one.
Yes! It drove me nuts! I had a task where I forgot to make GET / return a 200 OK and it just thrashed like crazy. I think once I had the cluster auto-rollback enabled it solved the issue, though (other than ensuring a reliable health check). They should fix this!
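The "cluster auto rollback" here is presumably ECS's deployment circuit breaker; a minimal CDK (TypeScript) sketch, assuming the cluster and task definition already exist:

    import * as ecs from 'aws-cdk-lib/aws-ecs';
    import { Construct } from 'constructs';

    declare const scope: Construct;                            // assumed to exist
    declare const cluster: ecs.ICluster;                       // assumed to exist
    declare const taskDefinition: ecs.FargateTaskDefinition;   // assumed to exist

    new ecs.FargateService(scope, 'Service', {
      cluster,
      taskDefinition,
      // Stop the deployment and roll back to the last healthy task set
      // instead of letting ECS kill and restart failing tasks indefinitely.
      circuitBreaker: { rollback: true },
    });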
You're updating the service to a broken image? I would say ensuring it's healthy falls outside of the scope of CF responsibilities. `aws ecs update-service` won't even do that. ECS should wait till your new image is healthy before draining previous image versions. If you deployed a bad image, it probably needs someone to look at it anyway.
CloudFormation, with its HORRIBLE YAML templating (whatever dsl/language) and arcane error messages is a horror story. I hate it so much that I'd rather quit my job than debug why CloudFormation decided for no reason to update my RDS instance for a PR that was just a README file update.
I spent a bit of time trying to deploy a lambda app with Cloudformation. I wanted to use a relational database, so I needed to handle migrations.
Ok, so apparently I need to write a custom Cloudformation resource to execute a lambda function that will run the migrations prior to deploying the new version of the lambda. Kind of neat that you can do that.
Except I messed up the output of the custom resource lambda and Cloudformation completely locked my deployment up for 3 hours. 3 hours. I couldn't do _anything_ - rollback, update, whatever.
Cloudformation via a CDK is interesting, and I don't hate it, but oh boy if it gets into a weird state it can completely kill your iteration loop. And the docs say something along the lines of "if it's stuck for too long contact support". No thanks.
CF does have a lot of quirks (especially stacks locking up for various reasons, or rollbacks taking hours).
I find it easiest to run migrations when an application is first starting up (with an appropriate transaction lock so other instances won't cause the migration to run more than once), this way you don't have to do a lot of devops magic for it to work.
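For example, a rough sketch of that pattern with Postgres advisory locks (runMigrations is a placeholder for whatever migration runner you use):

    import { Client } from 'pg';

    declare function runMigrations(db: Client): Promise<void>; // placeholder

    export async function migrateOnBoot(connectionString: string): Promise<void> {
      const db = new Client({ connectionString });
      await db.connect();
      try {
        // Blocks until no other instance holds the lock, so only one
        // instance at a time runs migrations; the rest wait and then no-op.
        await db.query('SELECT pg_advisory_lock($1)', [42]);
        await runMigrations(db);
      } finally {
        await db.query('SELECT pg_advisory_unlock($1)', [42]);
        await db.end();
      }
    }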
That’s not a fantastic model, especially for short lived lambdas. In some small environments just running a migration check on boot is passable, especially if you don’t have better tooling to handle this, but I don’t want to double my execution time working out if I need to run a migration or not in a lambda.
I guess I could do the migration check in the lambda startup rather than in the request path, but it’s still not ideal especially when handling rollbacks.
I might have preferred to use SAM, which uses CF under the hood but is much more focused on getting lambdas to work.
I'd pass in DB settings as parameters to the SAM file, rendering them as env. variables for the lambdas. with DB access managed by IAM roles (so you don't need sensitive info in the lambda env).
And depending on size/scope of project/team, might have one CF for the DB infra, and another for the app lambdas.
Using IaC services means I can have both dev and prod infra, so I can test DB migrations.
And I can do this as a solo dev on a side project where I'm also responsible for front-end and the other back-end stuff plus keeping clients happy on evenings/weekends.
The C in IaC stands for Code, and code is best written in proper programming languages. Both CF and Terraform are greatly limited by their use of yaml and hcl. Tools like AWS CDK and Pulumi, especially when used with a typed language like TS, are the future of IaC.
You are arguing about another thing; Terraform also has a CDK. But the question isn't whether to use a CDK, it's "native" vs third party (Terraform & Pulumi).
I think the best solution is K8s and some tool to help you generate the yaml files.
Terraform, CDK, and Pulumi all work similarly. At the end of the day, you declare a DAG that says how you want your infrastructure to work and use that DAG to come up with a list of changes you should make. Then you execute each of those changes via API calls and hope that reality on your cloud provider really does reflect the state DAG that you've built. CDK is better than the rest here, because it's got the best ability to measure the state.
When you work with K8s, on the other hand, you tell your infrastructure what shape it should be in with yaml files and then let your infrastructure figure out the best way to get there. When combined with something like ArgoCD, you've got a really powerful deployment stack.
The obvious downside is you've got to figure out how to get that all working.
"let the tool figure out how best to get there" doesn't really sound any different than the concept of the other systems you were describing.
It sounds like the difference is largely just some implementation details: rather than a one-shot run through a DAG which will abort out if anything doesn't match the plan, I assume the kubernetes approach will just keep retrying until it converges, as if you were running "terraform apply" in a loop constantly until it tells you there are no changes left to make.
Is there something about the kubernetes approach that differs in capability rather than just implementation approach? These underlying APIs are always fallible so I don't really see how you can get away from some sort of converging process where total convergence (without errors) might not actually be possible until you change the configuration.
Both CDK and TF are really bad dealing with several apps spread over different repos that share resources (e.g.: ECS Clusters, VPCs, etc...).
I had a good experience going the submodule route: one parent repo containing a CDK infra submodule and multiple application repositories as submodules, each of which outputs resources (lambda code, etc.) that can in turn be imported more easily by CDK.
k8s is fundamentally just a much better API. YAML on the other hand is awful so I recommend Jsonnet instead to generate the k8s manifests. I currently use Tanka for this.
Writing Terraform scripts for AWS is 70% of my job. I do have some issues with the AWS provider in Terraform. Firstly, there are bugs. I ran into a bug a few days ago where the ARN attribute on a Lambda alias was resolving to the ARN of the Lambda, not its alias. I only figured it out because I found a GitHub issue. Additionally, Hashicorp is often playing catch-up with Amazon. A few days ago AWS released a new instruction set architecture for Lambdas that would save my org a lot of money. However, after I saw the announcement from AWS, I saw tons of different GitHub issues created to add this functionality. So I start editing my files based off the documentation, only for that issue to be closed and pointed to a new one with different syntax. So I start working off the new syntax, only for that issue to close and be pointed to a different one.
I'm ex AWS so I used CloudFormation sort of because I had to (I guess, no one told me not to use Terraform, but it felt wrong not to drink our own champagne). I left AWS to co-found a startup and when I had to pick between the two, I just used what I knew already (CloudFormation, but more specifically CDK). I have to say I am highly tempted to give Terraform a look, but if I do, it will probably be Terraform + CDK: https://github.com/hashicorp/terraform-cdk
For me, troubleshooting, speed, and that punch-in-the-gut feeling when, after 30 minutes of crunching, you see the dreadful "update failed, rollback in progress" are great reasons to give Terraform a try.
AWS should fix this by making the SDK/API be 1:1 with infra as code. The result of a describe call should === the stuff needed to create that thing declaratively.
CDK is an amazing project, their high level constructs are making AWS SAM / Serverless framework / Amplify seem complex. With a line of code I get a best-practice opinionated VPC, an ECS cluster, a Fargate task with an ALB. (off topic - be careful - always ensure tasks return 200 OK on a GET / or it will thrash for hours, as others commented below, it's a known bug, sadly I knew about it only after wasting hours on troubleshooting it)
So I agree perhaps CFN has native issues (speed, troubleshooting) but don't hate CDK just because it uses CFN in the back.
So if you have to choose between CloudFormation or Terraform, I'd choose CDK.
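For context, the kind of "one line of code" construct being described above, sketched with aws-cdk-lib v2 (the container image is just a sample):

    import { App, Stack } from 'aws-cdk-lib';
    import * as ecs from 'aws-cdk-lib/aws-ecs';
    import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

    const app = new App();
    const stack = new Stack(app, 'WebStack');

    // One construct synthesizes a VPC, ECS cluster, Fargate task definition,
    // service, and an internet-facing ALB with reasonable defaults.
    new ecsPatterns.ApplicationLoadBalancedFargateService(stack, 'Web', {
      taskImageOptions: { image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample') },
      publicLoadBalancer: true,
    });

    app.synth();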
The above package supports typescript as a language but not nodejs ... no thank you.
I don't want to see microsoft products (typescript) taking the place of the original, open source products (javascript/nodejs) upon which they are based
Got it. Thanks for the heads-up! My hope is that CloudFormation will keep getting faster and easier to troubleshoot. And that the investment in CDK will pay off... but I'm going to start dabbling with Terraform on the side... (although I'd rather my infrastructure as code be in, well, code; I don't think yaml / json counts as code, but I won't let that distract me...)
Unless things have changed in the meantime, the killer feature of CloudFormation for me is that I don't have to keep track of the state locally. Having to set up tracking of the infra state in Terraform is a huge pain, since it should be stored independently of both the infra code (to allow deploying anything but HEAD) and the infra itself (duh). As long as Terraform doesn't query the existing infra to work out what needs doing I don't want to go back to it.
Their service owners don’t seem to feel that way. Terraform’s AWS provider often adds support for new services and features weeks or months before CloudFormation does.
I've seen this argument everywhere. If this is the web, people would say "use the platform", because apparently any abstraction made on top of that is automatically worse; if this is .NET it would be "use Azure". If this is desktop applications... use C++?
That's just not how reality works. First class citizens can suck, because guess what, they're designed by mortals
Not the case with CDK. I hate IaC in general, Terraform being one of my least favorites due to mostly being declarative, verbose, and the way it manages state. CDK is a large improvement and moves towards just being code.
I find it mildly disturbing to see how willing people seem to be at a personal level to attach themselves skill-wise to single-vendor solutions.
I get that CloudFormation may be a bit better than Terraform in this way or that way. But I have no personal interest in learning it. All that investment of time and effort and when I need to do a project on a different vendor I have nothing, no skills. What's the point?
To be clear: I understand why companies and even specific teams or projects may choose to go all in on vendor-lock-in. It's often the most direct line to the business value they want and as people point out, it's rare to never that most cloud native applications will shift cloud vendors, and even if they do most of a Terraform setup will have to be reworked.
So I get why companies and managers want that. I just don't understand why software engineers in those teams are so happy to go along.
I don’t see how you couldn’t learn both. Terraform is dead simple to learn, and CloudFormation is the best way to define your infrastructure in code in AWS, even more so wrapped by CDK. I don’t think you would go that far career-wise knowing only Terraform for IaC.
I also don’t understand those organisations that use every single AWS managed solution (RDS, Redis, Elasticsearch, EKS) but then they feel the need to manage their infrastructure state themselves with Terraform instead of choosing the managed solution for this.
As a former AWS employee: Terraform is MUCH better at scale for AWS resources than CloudFormation or even AWS CDK. CDK's reliance on CloudFormation underneath means being limited to the exact same problems that are mentioned in the article, and it makes managing resources at enormous scale incredibly painful.
Often, these single-vendor technologies are thin syntactic and nomenclatural veneers over actually-quite-general abstractions. E.g. it's very easy, if you want or if you need, to rewrite a CloudFormation stack as a Terraform project, or even a set of Kubernetes custom-resource-controller resource manifests. They're all analogous technologies, and the "hard part" of an implementation in any of them isn't to do with the things that make them different, but rather the things they share in common — the formal modelling of a digraph of dependent declarative infrastructure resources.
CloudFormation is great because of its transactionality, so it lends itself nicely to deploying multiple services which are versioned together. You either succeed fully, or all services will be rolled back.
This way you can deploy your whole infra with Terraform, and then deploy to, e.g., your ECS cluster using CloudFormation. Works great in practice.
The rollback functionality of CF is a blessing. We use both CF and Terraform at my company, and I vividly recall multiple times where my connection had cut out during "terraform apply" and left the Terraform infrastructure in a half-finished state.
When it works, which is a big caveat: we had far more cases where it failed in a way which required manual remediation, and the gaps in validation meant that you'd be in an “apply / error / rollback” loop requiring 20+ minutes before you could try again. Terraform was always considerably faster, but it was especially the orders-of-magnitude improvement in retry time which convinced most of us to switch.
The CloudFormation team has been working on this so it's possible that experience has improved but the scar tissue will take time to fade.
It's especially interesting from a product design standpoint: everyone knows things break sometimes but when it forces you out of an easy path into something much harder or simply locks you into a lengthy penalty time delay, people will remember that one time FAR more vividly than all of the times it worked normally.
This is especially true for things like CloudFormation where the user is likely on a deadline or in a stressful situation trying to fix a problem or hit a deployment window. Forcing someone to wait an unpredictable amount of time for no reason ramps up the stress level massively.
Rollback doesn't always work with CF. I've noticed so many times that it would mostly delete everything but not certain things once in a while. Then you're left having to play detective to manually figure out what you need to delete while having to delete dependencies by hand in a specific order.
I've spent hours just waiting for CF to fail deleting EKS or RDS related resources then I end up getting billed for $30+ a month sometimes because I forgot to manually delete a NAT gateway.
One of the most amazing things I saw at AWS reInvent was an advanced talk on IaC that provided the code of a lambda function inline in a CloudFormation template. I realize that this is just one talk, and there are plenty of ways to structure things well, but this practice is directly encouraged by the design of CloudFormation[1]. AWS has attempted redefining the lambda deployment story multiple times, there are multiple companies whose primary offering is providing a better way to deploy code to serverless offerings, but this still stands out to me as one of the most terrible ways to do things, and I blame the design of CloudFormation.
I'm going off track here, but Pulumi has a totally mind-bending feature where you can write the code of a lambda function not only inline, but such that it captures the value of variables from the surrounding infra code at the time the function is serialized.
Seeing the specific examples they use it for (AWS infra glue) makes me think that there is room for infrastructure related lambdas to be defined right in cfn or infra code, with very low ceremony, even if you wouldn't want to deploy "applications" like that.
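Roughly, the feature being described, in Pulumi TypeScript (resource names are made up):

    import * as aws from '@pulumi/aws';

    const queue = new aws.sqs.Queue('work');
    const bucket = new aws.s3.Bucket('uploads');

    // The handler is an ordinary closure; Pulumi serializes it into the Lambda's
    // code at deploy time, capturing values like queue.url from the infra code.
    bucket.onObjectCreated('log-upload', async () => {
      console.log(`new object uploaded; work queue is ${queue.url.get()}`);
    });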
I'm working with what I think is a similar problem-space, trying to create a Node.js program that produces code to be run on the browser. The code runs in Node.js at "build time" which essentially means constructing the static web-site, which means some JavaScript must be somehow "deployed" to the website to be then later run at "run time".
So deploying to an AWS etc. Lambda seems conceptually similar to deploying from static-site-generator to a static web-site source-code, which is then what the browser will execute. Or is it?
I found both CloudFormation and Terraform's AWS provider to be too verbose for quickly spinning up a lot of infrastructure as a startup.
So I ended up creating my own high-level wrapper for Terraform's AWS provider: https://provose.com/. I can happily spin up VPCs with AWS ECS containers, RDS databases, EFS and FSx Lustre filesystems, and more.
I really believe in the Terraform ecosystem, but I do have to admit that it is hard to write a Terraform module or provider that provides a significant level of abstraction over low-level cloud provider APIs. I've seen this issue drive a lot of other people to Pulumi or the AWS CDK, but I think Terraform is a better product specifically because Terraform HCL is not a (trivially) Turing-complete programming language.
> With Terraform, your local executable makes rest calls to each service’s REST API for you, meaning no intermediary sits between you and the service you’re controlling. Want an RDS instance? Terraform will make calls directly to the RDS API.
How is this different than CloudFormation making the same calls?
You give CloudFormation a list of instructions. It accepts it and gives you an ID to watch for updates, then it goes off and executes them.
Terraform executes a list of instructions. It executes them in front of you while you wait.
Both are fine until you run into something like this:
I'm pushing an Elastic Container Service task definition change via CDK. A CloudFormation change is submitted, and I wait for it to finish. In the background, it's trying to do the update but the update fails due to some misconfiguration with the new container.
CloudFormation doesn't fail or return an error. It times out after an hour and reverts the change. I have to know to dig into the AWS console to find my failed tasks to view the error.
If I did this update via Terraform, I would get the error back in my console quickly as Terraform is directly telling ECS to make the change. With CDK, the CloudFormation changeset is generated, it is submitted to CloudFormation, then the tool polls the AWS API for progress updates. Sometimes you get specific messages back, sometimes it fails and you need to go in and see what it failed on.
Vanilla cloudformation is bad, but so is terraform (for my use case anyway). We wrap our cloudformation with python, you need something similar for terraform to make it less terrible (cdktf, terragrunt, terrascript).
To be honest, I don't agree with this. Managing infrastructure needs evidence and a trace of how it got created. I've been in the situation a few times: I've been through projects where the Terraform code doesn't match the AWS infrastructure, and we don't know when or how the drift happened. At least CloudFormation has features to detect the difference and help me trace back which commit was actually deployed. CDK makes the job easier for developers because it delivers some convenience and offers more patterns for writing code. I like both.
There is a local company here in Seattle called Pulumi[1]. We've been using their tool extensively over the last 18+ months and it's a pleasure to use.
It's built on top of Terraform, but it simplifies IaC because you can now write your infra components in one of few high-level languages they support (JS, Go, Python, etc). CDK is similar, but biased towards AWS. With Pulumi you can provision your infra stuff in multiple clouds easily.
Personally I like to use both, each for specific jobs.
For example, when we have tools that need to be deployed across 10+ AWS accounts managed by different ops teams, I hand them a CloudFormation template and they can run it, plug in the right parameters, pull the lambda code from the same S3 bucket we have, etc. I totally agree that writing CloudFormation is a pain, but once you have it done, it works consistently and we don't have to worry about the Terraform version, the tfstate, etc. It just works.
I use Terraform for more complicated setups: an environment that we keep adding on top of, share & manage among our team, and need to rebuild/redeploy often and quickly, or an environment we need to spin up for various tasks (e.g. an incident-handling VPC, interview challenges, CTF events...). Terraform and its powerful reusable modules/module registry, which let you spin up these environments in minutes, make it extremely attractive.
Managing the Terraform version and tfstate is still a pain, even with the help of a remote S3 bucket & DynamoDB lock, but it is definitely better than when I first started and we had gpg-encrypted tfstate.
With that said, I work in security and only very occasionally need a big deployment like a scalable Spark cluster or a multi-zone Elasticsearch cluster, etc., so perhaps I don't have enough in-depth knowledge about each tool.
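For reference, the remote-state setup mentioned above (S3 bucket + DynamoDB lock table), sketched via CDK for Terraform since the thread is TypeScript-heavy; in plain HCL it's the equivalent `backend "s3"` block. Bucket and table names are placeholders:

    import { App, TerraformStack, S3Backend } from 'cdktf';
    import { Construct } from 'constructs';

    class InfraStack extends TerraformStack {
      constructor(scope: Construct, id: string) {
        super(scope, id);
        new S3Backend(this, {
          bucket: 'my-tfstate-bucket',        // placeholder
          key: 'envs/prod/terraform.tfstate',
          region: 'us-east-1',
          dynamodbTable: 'terraform-locks',   // placeholder; provides state locking
          encrypt: true,
        });
        // providers and resources go here
      }
    }

    const app = new App();
    new InfraStack(app, 'infra');
    app.synth();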
Terraform
- (good) One tool for multiple service providers is nice.
- (good) Fully open source.
- (good) directly calling service API's makes it more debuggable, and greater control over performance/time of deployments
- (bad) directly calling service API's makes the agent running the tool a SPOF
CloudFormation
- (bad) YAML sucks, not a programming language.
- (good) managed service, doesn't require an agent to run the deployment
- (bad) often slow, no visibility into what it's doing under the hood
- (bad) AWS specific(yes I know about public registry, but it still requires AWS)
Pulumi
- (good) can use an actual real programming language!!
- (bad) directly calling service API's makes the agent running the tool a SPOF
- (bad) need to be connected to, and have credentials for your environment to develop
- (bad) running your code requires bi-directional communication with the pulumi GRPC api
CDK
- (good) can use an actual real programming language!!
- (good) produces a declarative output
- (bad) dependent on cloudformation
- (unknown) CDK for terraform(I haven't tried this, but it sounds promising)
I don't think we yet have the tool we need, but CDK is pretty close: having a real programming language that produces a declarative intermediate format, which is then run by another system, is, I think, the right approach.
If "directly calling service API's makes the agent running the tool a SPOF" is a "bad" for Terraform, it also must be for Pulumi.
The idea that one "need[s] to be connected to, and have credentials for your environment to develop" to use Pulumi is equally true of Terraform, assuming by "credentials" you mean "credentials for the target provisioning platform".
The idea that "running your code requires bi-directional communication with the pulumi GRPC api" is also untrue.
It's important to be accurate when making these kinds of comparisons, lest you fuel memes instead of providing insight.
Both CDK and Pulumi are declarative. The program declares resource desires to an engine, which evaluates a diff and effects change. You cannot instruct the engine what to do, only the desired outcome, which makes it declarative.
You are correct, there is an engine - I mistook that as criticism of the default state backend, which is not required.
Well yeah, I mentioned that at the end, CDK just produces a declarative cloudformation output.
I think this is a good thing, but I wouldn't say that CDK is declarative, as you don't write declarative code.
You can write literally anything you want in a CDK/Pulumi program.
Depending on how you write it, the outcome could be different every time.
The generated outcome, ie a cloudformation template, is absolutely declarative, but the CDK code is not.
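A small illustration of that point: the program below is ordinary imperative TypeScript, but what the engine consumes is the static template that synth emits.

    import { App, Stack } from 'aws-cdk-lib';
    import * as s3 from 'aws-cdk-lib/aws-s3';

    const app = new App();
    const stack = new Stack(app, 'Demo');

    // Imperative code: a loop, conditionals, whatever you like...
    for (let i = 0; i < 3; i++) {
      new s3.Bucket(stack, `Bucket${i}`);
    }

    // ...but `cdk synth` just writes out a CloudFormation template with three
    // AWS::S3::Bucket resources. CloudFormation never runs this code.
    app.synth();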
I think this discussion is going in circles because you are both discussing different parts of the problem as if they are the same part.
A system could be said to have a declarative execution model if its input is a description of a desired result and that system then decides for itself what sequence of side-effects to perform to achieve that result. Terraform, CloudFormation, and Pulumi are all declarative by this definition.
However, there's also the question of how you tell the system how to build that data structure in the first place. CloudFormation and Terraform both do so with some form of DSL, with the effect that the source program perhaps more directly corresponds with the data structure it produces. Pulumi instead uses general-purpose languages to generate the data structure, and so the source code that produces it can, if desired, bear very little direct resemblance to the resulting data structure.
The tradeoff here then seems to be more about how far removed the input shape is from the resulting declarative data structure, rather than about "declarative or not". Some folks prefer the input to more directly match the underlying model, while others prefer to intentionally abstract away those details behind a higher level API.
I don't think either is right or wrong... I actually wish these two approaches were complementary inside the same product rather than two companies trying to kick each other in the shins, but that's capitalism for you, I guess.
The TypeScript CDK is the best use of TypeScript I've experienced. Just having all the documentation inline is hugely helpful, but the type checking also makes development faster by catching errors before you deploy (or even validate.)
I personally prefer aws-cdk over Terraform for the specific use case of writing complex Step Function State Machines. Using code to link and chain steps and logic together is much easier than HCL imo. For everything else I prefer Terraform though.
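For example (aws-cdk-lib TypeScript; the states and the condition are made up):

    import { Duration, Stack } from 'aws-cdk-lib';
    import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

    declare const stack: Stack; // assumed to exist

    const validate = new sfn.Pass(stack, 'Validate');
    const coolDown = new sfn.Wait(stack, 'Cool down', {
      time: sfn.WaitTime.duration(Duration.seconds(30)),
    });
    const succeed = new sfn.Succeed(stack, 'Done');
    const fail = new sfn.Fail(stack, 'Failed');

    // Chain states together in code instead of hand-writing ASL JSON or HCL.
    const definition = validate
      .next(coolDown)
      .next(new sfn.Choice(stack, 'Ok?')
        .when(sfn.Condition.stringEquals('$.status', 'OK'), succeed)
        .otherwise(fail));

    new sfn.StateMachine(stack, 'Pipeline', {
      definitionBody: sfn.DefinitionBody.fromChainable(definition),
    });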
What is the relationship between these thoughts and the recent AWS API called "AWS Cloud Control API", which uses the CloudFormation schema registry under the hood, and which will soon be available as a new AWS Terraform provider: https://www.hashicorp.com/blog/announcing-terraform-aws-clou...
terraform plan has the benefit of showing me whether what I did is at least internally logically consistent before making any attempts to change the state of the world. It doesn't mean things can't go wrong, but it is harder to make silly mistakes that are bound to happen when you try to inject logic into JSON or YAML which are just text.
Sure, the equivalent Terraform would not win a beauty contest either, but this is just an abomination. If someone is paying me well enough, I can live with it, especially if it worked perfectly, but it doesn't. Plus, standard CloudFormation crap Amazon will push on you tends to assume a bunch of things that may or may not be true in an account that has existed for a while before a particular feature became available.
I’m not so sure that’s a great example, since it could be written like this:
HealthCheck:
  Target: !Sub HTTP:${WebServerPort}/
CloudFormation also has changesets which you can inspect to see what will change and why before applying them. `aws cloudformation deploy` is not all of CloudFormation.
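For completeness, a rough sketch of the change-set flow with the AWS SDK for JavaScript v3 (stack and change-set names are placeholders; a real script would wait for the change set to reach CREATE_COMPLETE before describing it):

    import {
      CloudFormationClient,
      CreateChangeSetCommand,
      DescribeChangeSetCommand,
      ExecuteChangeSetCommand,
    } from '@aws-sdk/client-cloudformation';

    const cfn = new CloudFormationClient({});
    const stack = 'my-service';   // placeholder
    const changeSet = 'pr-123';   // placeholder

    await cfn.send(new CreateChangeSetCommand({
      StackName: stack,
      ChangeSetName: changeSet,
      UsePreviousTemplate: true,  // or TemplateBody / TemplateURL
      ChangeSetType: 'UPDATE',
    }));

    // Inspect what would change (e.g. spot an unexpected RDS replacement) before applying.
    const { Changes } = await cfn.send(new DescribeChangeSetCommand({
      StackName: stack,
      ChangeSetName: changeSet,
    }));
    console.log(Changes?.map(c => `${c.ResourceChange?.Action} ${c.ResourceChange?.LogicalResourceId}`));

    await cfn.send(new ExecuteChangeSetCommand({ StackName: stack, ChangeSetName: changeSet }));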
I am not necessarily here to debate canonical or good CloudFormation ... I took one particularly unappealing snippet that does not reveal anything that I should not be disclosing from a bit of AWS provided CloudFormation. Regardless of how you write it, whether you write JSON or YAML, you are writing logic in pure text and the difference between that vs your code going through terraform's solver is night and day IMO.
Plus, in source control, I find terraform diffs much easier to grok than JSON or YAML diffs. Especially since with the latter one misplaced invisible character can pass through without being noticed and cause disaster.
I'm pretty sure that was sarcasm. I disagree with said sarcasm, because CDK takes you one layer away from the actual thing that gets run but gives you a much nicer thing to work with so it can still be a good trade off; writing rust (or whatever) "obfuscates" the underlying CPU instructions but it still turns out to be a good idea.
Terraform doesn't mean that you want the option to go with other cloud providers; it just means that you want to treat your infrastructure the same way, regardless of whether it's a cloud resource in AWS or something that AWS doesn't even offer, such as email, analytics and so on. I don't know if it's still the case, but AWS used to not have a RabbitMQ offering.
We use terraform for our AWS infrastructure, but also for Datadog, PagerDuty, etc. That's where it starts to shine: consistent infrastructure as code across all your (provider-supported) vendors.
Maybe we skip the need and move to https://crossplane.io, to abstract all of this as well as gain day-2 operations natively, and k8s readiness.
(bonus: the closest path to cloud agnostic definition)
Can we just agree that none of these tools (Terraform, CloudFormation, CDK) are particularly good? None of them really achieve the goal of purely declarative, immutable, stateless infrastructure-as-code and all of them have significant disadvantages.
Agreed. We use Pulumi and it's fine, but the state-managing feels brittle.
I think the state-of-the-art is Kubernetes + jsonnet, though you still need to suffer through the learning curve and you'll still need TF and company to provision some of your cloud resources.
While the read was interesting and informative, something about the tone made me search for a disclaimer/disclosure of interest. Are you an "influencer?"
This whole article is a case not to use Terraform.
If you are creating an RDS cluster, CloudFormation will call the RDS REST endpoint for you.
And so AWS takes care of things like retries, and you won't hit AWS's API call limit.
> With Terraform, the error messages you get are usually much more informative
Often the errors are completely cryptic, or even misleading. Anyway, this isn't really a reason not to use CF or TF, it's a reason not to use AWS.
> With Terraform, you need to use valid credentials that grant access to multiple accounts.
This is a security and reliability concern. Not only that most people end up making one credential which can access everything, but that people then end up running Terraform from one place, rather than running it per-account/per-region/per-AZ, which is much more resistant to failure and separates security domains.
> CloudFormation’s pseudo parameters and intrinsic functions are cruel jokes. By contrast, Terraform offers a rich set of data sources and transforming data with Terraform is a breeze.
The more complicated you get with Terraform, the worse it works. This is actually a truism of all DSLs, but when you're using a DSL to talk to a program to talk to an API to talk to a program, it gets even worse. For example, some AWS resources will not work with Terraform properly unless you remember to apply the correct lifecycle policy (they'll go into a create/destroy loop, or never be able to be destroyed, or created, etc). With CF, AWS is taking care of that decision for you.
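For what it's worth, the lifecycle meta-argument being referred to, sketched with CDK for Terraform's prebuilt AWS provider (exact import paths vary by provider package version; in plain HCL this is the `lifecycle { create_before_destroy = true }` block):

    import { App, TerraformStack } from 'cdktf';
    import { Construct } from 'constructs';
    // Import path depends on the prebuilt provider package version.
    import { SecurityGroup } from '@cdktf/provider-aws/lib/security-group';

    class NetStack extends TerraformStack {
      constructor(scope: Construct, id: string) {
        super(scope, id);
        // (an AwsProvider would also be configured here)
        new SecurityGroup(this, 'web', {
          namePrefix: 'web-',
          // Create the replacement before destroying the old one, so in-use
          // resources don't wedge the apply in a destroy/create loop.
          lifecycle: { createBeforeDestroy: true },
        });
      }
    }

    const app = new App();
    new NetStack(app, 'net');
    app.synth();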
> AWS CloudFormation is S-L-O-W.
Someone hasn't tried to apply a Terraform root module that manages hundreds of different resources, I see... Terraform can take dozens of minutes, to hours, depending on how carefully you've crafted your root modules, or your network latency, or even problems with AWS's TF provider. And if you're using STS creds that expire, it can leave you with broken infra / a broken remote state.
> Terraform, on the other hand, will occupy your shell until the directly-involved AWS service coughs up an error. No additional tooling is required.
There's like 5 different hosted cloud services dedicated just to being able to run Terraform for large teams without it breaking or taking forever. On top of that, often Terraform will fail to apply, half way through applying, because Terraform doesn't ask APIs to validate all the operations before it starts performing them, and it often doesn't even check whether the operation it wants to do makes sense. It is very common for 'terraform apply' to break your Production infrastructure because it was literally impossible for you to know it would work until you tried it.
> If you learn AWS CloudFormation, then guess what: you can’t take your skills with you.
This is almost a legitimate criticism, but it's not like using CF stops you from using Terraform too. You can use either, or both (just not on the same resources).
> Useful providers you can run on your own laptop include TLS key generators and random string generators.
I highly recommend not using these. Makes it harder to run TF deterministically, and you want as much determinism as you can get in Immutable IaC.
> Importing Existing Resources
This is a manual nightmare with TF. Terraformer, a completely different tool, makes this easier.