“Who Should Write the Terraform?” (zwischenzugs.com)
222 points by zwischenzug on Aug 9, 2022 | 134 comments



Software Engineer at Spacelift[0] here - a CI/CD platform specialized for Infrastructure as Code (including Terraform).

A pattern we're seeing increasingly often is Platform Engineering teams doing the bulk of the work - all the fundamentals, guidelines, safety rails, and conventions - while Software Engineers only consume those, or write their own simple service-specific Terraform Stacks which nonetheless lean heavily on modules developed by the former.

This also seems like the sweet spot to me, where most of the Terraform code (and especially the advanced Terraform bits) is handled by a team that specializes in it. If you don't have a Platform Engineering team, or another team playing that role (even if it's called DevOps or Ops or SRE), then even in a medium-sized company you'll probably start having as many approaches to your infrastructure as there are teams, complexity will explode, and implementation/verification of compliance requirements will be a chore. Just a few people responsible for handling this will yield huge benefits.

And yes, I can wholeheartedly recommend Spacelift if you're trying to scale Terraform usage across people and teams - and not just because I work there.

Disclaimer: Opinions are my own.

[0]: https://spacelift.io


I think a platform team taking ownership is the correct model, but the early product teams need to have "embeds".

The platform team owning base terraform functionality works well for the product teams that are the 3rd or 4th user of said functionality.

For the early days of the platform, and the early users.. your product is constantly in dependency & priority battles with said platform team. This is where "embeds" help continually unblock while making sure the work is done in a platform centric manner that will be reusable for other product teams.

Simply saying the product teams need to go down into the weeds at this level just puts too much disparate responsibility on product teams who exist to deliver a single product. Similarly it encourages vastly different approaches to similar problems, with all the wasted duplicate & re-work.


I tend to think of embeds as being very similar to the open source contribution model: you want some sort of BDFL entity that drives the overall direction of the platform, but also some sense of community/collaboration where individuals can feel empowered to contribute features to scratch their own itches, or bring up discussions, etc.

Having a team owning the platform doesn't necessarily need to mean shutting yourself in a cave. Granted, promoting cross-functional collaboration is a challenge in and of itself, but similar to OSS, projects that invest in the community aspect are the ones that eventually gain critical mass and set themselves apart from the rest.


Having a single "platform" team per company is a bottleneck as soon as the number of product teams is greater than N.

> ...you'll probably start having as many approaches to your infrastructure as there are teams, complexity will explode, and implementation/verification of compliance requirements will be a chore. Just a few people responsible for handling this will yield huge benefits.

Agree with the centralization of "how infrastructure should be managed/defined". A "platform" team composed of M platform engineers (where each platform engineer works 80% of their time for a given product team) can handle such centralization.


> Having a single "platform" team per company is a bottleneck as soon as the number of product teams is greater than N.

This is my experience as well. Having a single platform team has been a great experience for laying foundations, establishing shared architectures, and centralizing documentation.

As soon as two or more teams need something from the platform team, it becomes a battle of priorities. A good platform team will recognize this and work on a division of labor and coordination strategy that can start to scale. A bad platform team will treat this as an opportunity to claim the company’s wins for themselves and leverage their bottleneck position for political gain.

The company’s management of the platform team is key. I’ve also seen a single platform team abused as the engineers who are expected to own all the hard work while other teams get to walk all over them with demands. This results in a lot of employee turnover, which is the opposite of what you want on a team tasked with holding the core knowledge of the company’s infrastructure.


You can have more than one platform team.

I think reality is more complicated than a one size fits all approach. It's going to be specific to your org, your project, the stage it's at etc. To add to that, the right thing to do is often in flux.

Dedicated capacity is necessary, as is embedding. Not always at the same time or in that order. That's where only the information found inside the walls of your organisation can help you decide what is necessary to solve your problem.


It also creates the unrealistic expectation that one size fits all. An architecture that works well for stateless microservices fails spectacularly when faced with monolithic session-bound legacy telecoms services.

Yet so many people insist that the one is the same as the other, when one is a duck and the other is an elephant wearing two swimming fins on its face.


> Platform Engineering teams doing the bulk of the work, including all the fundamentals, guidelines, safety railing, and conventions, while Software Engineers only use those

So, sysadmins and programmers - but with new 2020s-vintage titles? (and remuneration...)


Basically, yeah, but with the difference that these sysadmins are generalizing and abstracting the patterns they've learned over years.

I personally think of "devops" not quite so much as being about "dev" and "ops" collaborating (though that is a noble and worthwhile goal) as about having "developer-operators", people who know how to do operations effectively and who can turn that knowledge into automated, generalized software systems.

The abstract modules and tools can live in their own repositories (or folders, in a monorepo), and your devoperators can work closely with the product teams to use them (and abstract specific changes to meet projects' individual needs to be more generally applicable).


Seen this at a couple companies and it doesn’t work well. The platform team becomes a bottleneck and the devs don’t want to have to deal with or learn the mess that is terraform.

It’s time for the ecosystem to move beyond the half baked config language known as HCL


Pulumi


What do you suggest instead?


Pulumi seems a lot more sane; trying to pack all the complexity of infrastructure into a config language just doesn't add up at the end of the day. This is why we have general-purpose languages.

We could also probably use more abstractions similar to Pulumi; there's been talk on HN about storing all of the state for applications like this in the tags of the underlying cloud resources. There are some caveats with this approach, but it would provide an interesting set of tradeoffs.


Have a look at GruCloud, an alternative to Terraform/Pulumi/CDK, which generates the infrastructure code automatically from a live infrastructure. Disclaimer, I am the author.


Disclaimer: a kind of disavowal, rejection of responsibility

Disclosure: put hidden info forward, e.g. source of potential bias


I think this would be easier to adopt if it could be plopped into an existing agnostic CI system. We built something like this in-house on top of Gitlab CI and it works really well for us. Locking isn't as much of an issue as you make it seem in the pitch; we just have our infra containers wait to acquire and renew a distributed lease while they're running. Some kinds of failures just release the lock and others panic and stop the world for human intervention.

Presumably your core competency isn't building CI systems or job runners, so why bother? I'm sure at the core of your own infra it's job-agnostic. The value-add is the management plane on top of it.


The semantics of standard CI/CD providers are in practice very ill-suited to more advanced Infrastructure-as-Code use cases (triggering other stacks based on changes, multi-repo workflows, etc.), so building on top of them would add a lot of complexity. I don't want to go too much into it.

Overall, if your setup works for you and you're happy with it - keep using it!

We've seen a lot of companies (many now our customers :) ) try to build their own on top of existing systems (GitHub, Gitlab, Jenkins, etc.) and waste a ton of time and engineering resources, while ultimately not achieving anything that works well.

What Spacelift does is give you a bunch of much better-suited building blocks which let you build your required workflow very quickly.

And it obviously does integrate very deeply with your VCS provider - Commits, Pull Requests, Comments, etc. - everything is supported and customizable using - amongst others - Push Policies[0].

[0]: https://docs.spacelift.io/concepts/policy/git-push-policy


Author here: yup

Pendulum back to the center


Partly yes, but not fully.

The idea is not to go back to the Software Engineer asking the Ops team "Hey, can you provision a Postgres database for me please?" and then waiting a week for it.

It's that the Software Engineer takes a module that was prepared by the Platform team - e.g. "terraform-postgres-mycompany" - which already includes all the requirements the company has for handling databases (think backups, monitoring, encryption, etc.). They can then use it in a small service-specific Terraform configuration, which really is just putting such ready-made modules together.
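To make that concrete, the service-side code might be little more than this (module name, source and inputs are hypothetical, just to illustrate the shape):

    module "orders_db" {
      # hypothetical wrapper module from the Platform team that bakes in
      # backups, monitoring, encryption and naming conventions
      source = "git::https://github.com/mycompany/terraform-postgres-mycompany.git?ref=v1.4.0"

      service_name   = "orders"
      environment    = "production"
      instance_class = "db.r6g.large"
    }

    output "orders_db_endpoint" {
      # the output name is made up too - whatever the module exposes
      value = module.orders_db.endpoint
    }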

The important bit being - the Platform team isn't a bottleneck here.


Sure, but early in this transition it is slow & painful.

At my last org..

The old process of "Hey, can you provision a Postgres database for me please?" was managed in a web ticketing system, change-managed, and had a 24-hour turnaround! As did most other requests - VMs, NFS shares, FTP servers, network/FW changes, etc.

The new process was "Hey, is there a terraform module for Postgres?" followed by weeks/months of prioritization and specification battles with the platform team. Somehow, despite the platform existing for 18 months, we were the first team to need a.. database? Then they didn't want to support Postgres and were forcing everyone onto Aurora. Then it took N months for it to come up in their queue.

Rinse&repeat with every fundamental building block of a cloud offering, ad infinitum.

This is why I strongly prefer the embed model until the platform has proven itself to at least be at the "80% solved" end of the spectrum.


As an Ops, this sounds like a very overworked ops team. I can write a production-worthy Postgres module in less than a day, maybe a few hours. Or... just a subpar team, as much as I hate to say that.


As an Ops who literally replaced Patroni with a Terraform CTS module in about a week... I can say that it would be nearly impossible to do in a non-pizza-sized team due to communication and confirmation biases, alongside the respective anti-patterns.


Mostly overworked, but also not a team of veterans who knew what they were doing already. The Platform Ops to Product App Devs ratio was like 1:30, and they had very little implemented yet, so they were just completely overscheduled and understaffed.


I've been ops for about 15 years and understaffing us is such a huge problem. I just joined a company with 2 other Ops for 80 developers over 8-10 teams; my previous company had 7 for 70 and was one of the few places where I felt like I could take a breath and relax, easily take days off, etc. I felt more like a SWE than ops with regard to work-life balance, on-call rotations, the ability to do creative & research work, etc.

I'm glad those 2 ops at my current spot finally have help, but they were alone for years while the company was generating a ton of revenue, which bothers me but seemingly was their own decision; nobody told them they were overworked until we more senior ops people arrived and went out of our way to pull work away from them.

Having all sorts of different teams (style, culture, language) relying on you to not gate-keep is.. stressful at times. I try to context switch as little as possible but sometimes it just can't be avoided.


> They can then proceed to use it in the small service-specific Terraform configuration, which really is just putting such ready-made modules together.

Now, a new requirement comes in for the product team(s) - a new service which bulk reads/processes another team's data (e.g., an export service).

Turns out this is too slow if the bulk reads use the usual web api route. So direct access to another team's provisioned database is required.

This now falls onto the platform team to produce a method for doing so.

Imagine this, but multiplied by N, where N is the number of different features being worked on at the same time, and all of them need time from the platform team to produce something new for them!


Agreed.


Thank you for the plug. Surprised such an open commercial promotion got top comment.


ITT people arguing for embedding infrastructure engineers into product teams.

Ayyyy, dios mio.

a) If you need to embed, then actually, you need to embed InfoSec, UX, IT, Customer Success, Product, Compliance, etc. etc. for exactly the same reasons. In today's labor-constrained economy, good luck finding qualified people for every role on every team! And if one of them leaves, who ensured that they documented everything for the next guy? Or that you'll find someone to fill the role quickly? If you have a 30 person company, fine, no big deal. 150+ and it starts to become a serious problem.

b) Particularly for infrastructure, you will shoot yourself in the foot on your production cloud bill. If you share no infrastructure with other teams, then you will find no shared efficiency in sharing the same infrastructure. Conway's Law will burn your runway. If you're 100% serverless then this doesn't really apply, but if you're spinning up eight different Kubernetes clusters for eight different teams then you probably need to collaborate a bit better.

Product teams need to own their product top to bottom. Platform teams need to make that easy for them, because modern stacks are huge, it's not possible to staff a single team with all the necessary experts, and all that expertise is a genuine necessity. The lines are drawn in different places in different companies depending on available labor and technical requirements.


> If you need to embed, then actually, you need to embed InfoSec, UX, IT, Customer Success, Product, Compliance, etc. etc. for exactly the same reasons. In today's labor-constrained economy, good luck finding qualified people for every role on every team!

If those people are part of your core value proposition, the thing that's supposed to give you your competitive advantage, then yes (though if you need all of them, you probably don't have a very good value proposition). If not, if they're just a cost center doing commodity-level work, then they don't need to be part of the product team - but in that case you should be looking to minimize or outsource them.

> Product teams need to own their product top to bottom. Platform teams need to make that easy for them, because modern stacks are huge, it's not possible to staff a single team with all the necessary experts, and all that expertise is a genuine necessity. The lines are drawn in different places in different companies depending on available labor and technical requirements.

If the "platform team" are doing something so independent from the products that they don't need to be part of the same team, why are they in-house at all? If you're offering a generic platform, either you're doing it better than Amazon and should be in the business of competing with them, or (more likely) you're doing it worse than Amazon and should just use Amazon.


> why are they in-house at all?

Someone needs to answer to Compliance, to InfoSec, to Finance. Someone needs to make sure that they all understand exactly what production looks like in their language. Compliance wants to know whether we keep EU data in the EU. InfoSec wants to know whether all our code in production passed security review. Finance wants to prevent costs from spiraling out of control and to judge which projects to fund.

Good luck trying to get AWS's "platform" to do any of that as a managed service, out of the box and without any in-house engineering time!


> but if you're spinning up eight different Kubernetes clusters for eight different teams then you probably need to collaborate a bit better

Why? There can be a reasonable scenario for that - say, 8 reasonably separated projects run by 100 people?

Also, I don't see how being serverless means this "doesn't apply". It does apply, because a lot of your infra is security, especially company-wide security configuration.

I understand the deeper meaning of the message, but at the same time devops is a thing because it likely hurt more than the other cases mentioned. But I think the whole thing is often a balance between being integrated and standalone.

Every team and project requires breathing room but also requires a certain level of integration. DevOps was needed and is progressing - try to find an engineer who has no Docker experience today, compared to the past, when engineers often had zero idea of delivery. Other groups may raise their own requests if they feel the need, but they will lose some flexibility from being standalone.


> If you're 100% serverless then this doesn't really apply, but if you're spinning up eight different Kubernetes clusters for eight different teams then you probably need to collaborate a bit better.

This is exactly the situation I'm currently in. The company decided to migrate from big on-prem Kubernetes to AWS. Now every team got their own account and well... good luck, you're on your own now. We're a small team of three developers. Although we have three certifications under our belts (AWS Dev, CKA, CKAD), it took us almost three months to configure AWS, set up the Terraform pipeline and define processes like "upgrading the cluster". The "enabling" part was basically missing from the company's whole cloud strategy. It was more like: good luck, you're on your own now.

In fact we made contact with a neighboring team, only to find out that their use case was so different from ours that collaboration didn't make any sense. For them Kubernetes was not a good fit; for us it was the way to go.

Speaking of sharing a cluster or AWS resources: we figured out that it is not allowed due to billing reasons. Company policy is: one product per AWS account.

If you ask me, I see a shift of paradigms happening here. Now you hear a lot about "enabling teams" instead of a dedicated team for infrastructure providing services (e.g. the Kubernetes Podcast from Google). I'm not convinced yet. I think this is more like kicking responsibility down the chain. And then it feels more like: someone needs to do the dirty work but nobody wants to do it.

It might work if you don't have to provide service-level agreements (in our case, we don't). For us it is just more work to do, and our work shifts from dev to ops. Instead of writing software we're mostly busy configuring cloud resources. This will ease a bit once everything is running. However, I see this whole change more as... uh, strong word... ideologically motivated. Cui bono? Neither our team, nor our users, nor our infrastructure bill.


> it is not allowed due to billing reasons. Company policy is: One product per AWS account.

That's kinda funny because half the reason why AWS has tags in the first place is to get finer granularity into understanding billing. Not to mention products like Kubecost. Sounds like whoever wrote the policy doesn't understand how AWS works.

> Someone needs to do the dirty work but nobody wants to do it.

There are plenty of people willing to do the dirty work, they're just already working for other companies and their salaries are quite high. The labor market is tight.

> Cui bono? Neither our team, nor our users nor our infrastructure bill.

HR benefits. Having open positions that HR is failing to fill is a bad look for HR.


Not everything can be tagged to get the source of the cost - for example, I don't think you can differentiate which of your products generated egress network traffic (which is quite costly as soon as you reach a certain scale). I might be wrong; I switched to separate-account-per-product a while back and never looked back.


It's tough getting lectured by people who aren't holding themselves to the same discipline standards that you are. Infrastructure code usually looks more brittle than production code, because its authors aren't specialists in high-quality general-purpose code. Not as bad as QA code, but not great.

I think the instinct is that if you’re going to take the moral high ground, you’d better walk up the hill and join us first. And the simplest way to do that seems to be the obvious one, which is to combine them under the same org chart and governance.


> you need to embed InfoSec, UX, IT, Customer Success, Product, Compliance, etc. etc. for exactly the same reasons

aka, a full-stack engineer! The idea that you have some specialist take care of each role in a team is just fantasy.

Get a smart person, and train them full-stack. Including customer success (aka sales and after-sales support) and compliance (I mean, GDPR is required understanding now, so it might as well be the engineer who knows it).


Re this segment:

> "There were endless complaints about the time taken to get ‘central IT’ to do their bidding, and frequent demands for more autonomy and freedom. A cloud project was initiated by the centralised DBA team to enable that autonomy. [...] Cue howls of despair from the development teams that they need a centralised DBA service"

The author makes it sound like users didn't know what they wanted. This is not true -- I have seen this play out in practice, and what the author omits is that _it was a different set of people_ who were complaining before and after.

If a dev team has at least 2 engineers who are happy working with infrastructure, then the team will benefit from autonomy. If there is no one like that on the dev team, they will cry in despair.


> _it was a different set of people_ who were complaining before and after.

That's something I've observed as well. It seems to me there are (at least) two developer personas. One kind only wants to deliver their task, specialize in what they do well, and generally couldn't care less if their DB is oversized, or has no maintenance windows set, or no recovery plans, or who has access to it, etc. They usually lack the cloud/platform skills as well, and won't develop much in that regard because they don't care to. Even if they did, they're unlikely to get rewarded much for that effort. They are easy to make happy, and you rarely hear from them other than the occasional "thanks" in some Slack channel.

The other kind is either internally very curious about the subject, or already has the experience, or at least thinks they have it. They want to have full access and invent things anew in what they believe is the "right way". For them there is nothing worse than relying on another team when they could geek out on the subject themselves, and they believe they could do it better. Sometimes they're right about it, and other times they're either oversimplifying the work needed or optimizing locally around themselves/their team/their task. There seems to be no way to make them really happy other than giving them complete freedom to do whatever.

Seems to me most developers (I've worked with) are of the first kind, and they can be made happy after some level of maturity is reached within the company, but the second kind is way more vocal and they won't ever be happy with whatever a central team builds.

More and more I'm getting convinced that the only way to really win both personas is to build two products instead of one. So you build the golden path - the Helm chart or the portal or whatever - for the first kind, and give the second kind ownership, loosely governed with compliance/policy tooling.

This optimizes for the short/mid-term satisfaction, but of course it can also go wrong since team compositions are not set in stone and what one builds may not be maintained properly by the other, and there'll be some duplication of efforts and quality of solutions built might vary between the teams. I guess for some companies this is acceptable, and for others it won't ever be.

tl;dr ¯\_(ツ)_/¯


For the second group, I don't think it is about inventing things anew.. most of the time, it's just an efficient way to get things done.

A lot of the time the centralized ops team is very slow or just not very good. Your tickets may take weeks to be processed, or critical requirements are ignored, or maybe the central ops team only cares about closing tickets and does the minimum possible work to satisfy the letter of the requirement.

If there is no one on your team who can do better.. well, you suffer and work with the central team. But if your team has someone who can do this and the autonomy to proceed, then you can work much faster -- no need to wait weeks to allow the other team to access your data, you can grant the permission yourself in under a day.


I’d probably fall into the latter category of “complainers”, but I actually care very little about doing infrastructure work even though I’m interested in it, I just want a quick turn around on simple requests.

My current place is just awful. Overcomplicated architecture born from a platform team that couldn't be less helpful, so people have worked around it with all sorts of hacks.


> For them there is nothing worse than relying on another team while they could geek out on the subject themselves and they believe they could do it better.

It may not be optimal, but it’s almost invariably faster.

Personally I like the way my company does it, where people have (more or less) full access to the AWS account, but there’s a lot of automated guardrails/scanning that alerts you when you’ve done something stupid (public S3 buckets etc.)


I've started to believe that product engineers should manage their own infrastructure. I think the key ingredient is _isolation_: it's not that they have to figure out how to fit their service into the unholy single production account with 15,000 running instances, it's that they get to start fresh from a basic template and then move on from there. Most services, when isolated from the other microservices, are just not _that_ complicated.

For what it's worth this is how AWS operates, and I think it's the mindset with which they build products. You certainly _can_ go your own way and run something like k8s on top of it and build a mini-cloud in the cloud, but it's incredibly expensive.

It's a mistake I've made repeatedly -- "Oh, I'll just add this little abstraction to make it easier for developers!" But now the poor developer has to understand both the tools I built on top of _and_ whatever I was thinking at the time, and inevitably it's an under-resourced area.

Now, at a certain scale for core services, sure, you'll end up with infrastructure specialized folks. But I'm unconvinced that the place you want to start is, "Okay, I need a new service, better go talk to the beleaguered central team that never has quite enough time for anyone."


It's a constant cost-benefit struggle. AWS can do it because they are a money printer.

Sure, this does not excuse most traditional big corps that have huge internal engineering budget yet force a top-down rigid inefficient structure. (Though again, it takes a very principled way of doing things to be able to scale out and keep things sort of consistent and coordinated.)


I get it -- it's such a trap to say, "well, <tech giant> does it!" But in this particular case, I actually think they're walking the walk of having small teams that act like startups. You end up with small engineering teams owning a tiny little bubble, and it does a LOT to keep complexity down in terms of what one team has to manage.

(This is of course much more true for greenfield projects, early stage stuff. Of course the giant services are large and complicated.)


A bit of ranting...

As for me

1. HashiCorp is forcing enterprise upsales whenever possible, even if it'll hurt Adoption Rates and overall Development Experience

2. Existing TF design issues are ignored, which is causing people state management trouble that is irrelevant for TFE. So, yet again, why fix something that will end up in upsales?

3. The MPL requires the PRs to be available in case someone really does fix something, but it's near impossible to contribute any major design improvements to Terraform.

4. Existing Provider issues are neglected, and Accepting Working PRs takes around 3-4 weeks...

5. Some Providers (helm) are neglected in favour of the New Product Release (Waypoint provider), and there's a Forced Obsolescence Factor alongside Forced Adoption.

Deficient Relationship Marketing is the Key Factor in deciding who Will actually write Terraform (maybe not even HashiCorp), Who will Wrap Terraform and Into What (terragrunt, terraspace, pulumi, crossplane etc or some custom gitops SaaS), and Who will Support the target providers when Hashicorp solutions will magically turn into an abandonware due to upsales.


If you're interested in more of the author's thoughts on DevOps, Kubernetes or writing, check out an interview I did with him recently: https://kubernetespodcast.com/episode/185-writing/


> The development team didn’t want to – or couldn’t – do the Ops work

Most devs I've spoken to are in this camp: they don't want to do any Ops work at all. They want a 9-5 job without evenings or weekends wasted by services failing. No on-call rota and all that jazz - just writing code, that's all.


People used to have the same attitude about testing; in reality you get way better quality, and often velocity, if folks don't just throw it over the fence.


That isn't a good comparison, since testing can still be part of your 9-5. There are good engineers who adamantly never want to work outside 9-5.


Same as there were "good engineers" who adamantly only wanted to write code, not fix bugs or test it, or work on anything infrastructure/tooling related, even if it doesn't involve any on-call.


I could be wrong, but 20 years of experience tells me that company size has a lot to do with this.

Tiny organisms like amoeba can be simple. But as organism size increases, so too does complexity. They eventually need a nervous system, circulatory system, extra sensors, a more powerful brain to process sensory information and handle movement, motion tracking for hunting. Suddenly, packs of these animals will hunt together, so they'll evolve communication: signals, sounds, language...

Well, if you're a 4-person start-up sitting in the same room, decisions can be made quickly; you don't need departments or managers. But as you grow you need to be extremely careful that you build a nervous system, circulatory system, sensors ... a "management brain".

The biggest failures in ops aren't about "who does X?". It's about creating right-sized teams that own functions which are important enough to have specific owners. With further growth, certain functions get more complex, and suddenly you might need dedicated network, database & security teams. And if it gets huge, then you probably need multiple copies of those specific functions embedded inside large subsections of the organisation. And they all need to communicate effectively with each other. It's a constant dance. You can't make a single rule and just stick rigidly to it. You need to keep tabs on complexity, workload, morale, lead times. You need to be ready to refactor your teams.

When I hear stories like "it was taking 8 weeks to get a DB provisioned" I think "if that company makes it to IPO and the CTO gets a few $100M, there's absolutely no justice in the world".


There is good stuff in this article, though I wish more writers would hire editors to help trim these articles (I always hire an editor when I write something this long). I think this is the heart of it, though you have to go pretty far into the article to get to this bit:

"What’s the point of this long historical digression? Well, it’s to explain that, with a few exceptions, the division between Dev and Ops, and between centralisation and distribution of responsibility has never been resolved. And the reasons why the industry seems to see-saw are the same reasons why the answer to the original question is never simple."

It is true that the answer is context dependent. I consult with several startups, and I give different answers to different CTOs, depending on what stage their organization is at and how much they will actually need devops in the future. (I recently consulted for Paireyewear.com, a company that relies on Shopify to provide the public-facing store through which they sell. As such, they will never need much in terms of devops. Instead I brought in Chris Clarke, one of the best devops talents I know, and he consults with them part-time, and that is as much devops talent as they need.)


> and between centralisation and distribution of responsibility has never been resolved.

It's never been resolved because people try to have their cake and eat it too. There are pros and cons to both ways, but people refuse to deal with the cons. I'm dealing with that in my current org, where a decision was made to distribute a specific subset of responsibilities, and as soon as it gets even a little difficult, they start centralizing again (within that subset), even when there are solutions to the problems.

So we end up in this weird kind of Frankenstein organization, and that's the worst of all worlds.


I couldn't agree more. As a recovering BOFH currently working on something like what the author describes as a platform team... the number of times I've had developers bemoan the Sentinel and other guardrails in place while being unwilling to accept responsibility for meeting the requirements of the various regulators and stakeholders is very discouraging.

"plus ça change, plus c'est la même chose"


What a long-winded article to say "it depends" - I liked the history, though.

It got me thinking, here at amazon, we deliver "infrastructure as code" using the Cloud Development Kit: https://aws.amazon.com/cdk/

We expect engineers (not devops) to define their infrastructure in TypeScript and configure it through code. That code gets turned into CloudFormation templates and stands up the whole cloud system for the API you're building.

I think this is a great hybrid approach. Knowing what you want is different from knowing all the intricate details of defining, say, an API Gateway. But the CDK lets me stand up an API Gateway, configure it with a Swagger spec and a security policy, and be done. This lowers the barrier for devs to do devops work, and lets teams own and move fast when making changes.


The Amazon CDK is a breath of fresh air and I love working with it, coming from a CloudFormation background. For myself, I can just `cdk synth` and take a peek at the output to conclude "yes, that's what I would have written and the TypeScript saved me hours of reading CF documentation".

However, for others on my team coming in without the CF background, it feels a little like voodoo and as soon as they tread off the "standard path" I find myself getting pulled in to do the "hard stuff".


The layers of the AWS CDK lead to a lot of brittleness, and that was a huge turn-off for us. I like the thought of building systems in a functional way, but the tooling just isn't there yet. I haven't dived too deep into the Terraform CDK yet, and Pulumi just had too many problems.


I am finding the same. Even a relatively simple deployment built on CDK has no end of issues and headaches, ranging from rollbacks that don't completely roll back to a previous state to dangling resources that aren't cleaned up properly - and the issues go on and on. This is mainly due to CDK depending on CloudFormation, which is, in my most humble of opinions, a non-starter for standing up anything more complex than a single EC2 instance.

Had we built this out in Terraform, state cleanup and tracking would have been more robust, the ability to retry resource creation would have been more stable, and the project overall would have been much more of a pleasure to use. The functional/declarative aspect of Terraform in relation to CloudFormation is so much more polished.

Edit: declarative


What were your issues with Pulumi?


tl;dr: it wraps Terraform providers poorly and can fail to build proper infra diffs from time to time...

https://github.com/pulumi/pulumi-aws-native is nowhere near GA state, just scroll through the issues...

and the respective terraform wrapper https://github.com/pulumi/terraform-provider-aws is somewhat neglected in favour of the tf2pulumi native port above.

Crossplane, on the other hand, does better with the Terrajet codegen, and all the infra drifts are part of the reconciliation cycle, which is very handy on simpler deployments but doesn't work with more complex ones due to the excessive drift-polling model.


Wondering what your issues were with Pulumi also?

I'm leading the engineering at a startup based in the UK and we actively chose Pulumi over CDK, TF etc....

We've been going 12 months now and we're really pleased with the decision so far.


1. Both Pulumi and Crossplane just wrap the Terraform providers as-is on many occasions, and quite poorly. There are a lot of pending issues with the dependency graphs, state refresh and proper state diffs. Although a lot of the most troubling issues have been resolved, it's still a minefield run.

2. Both TFCDK and dagger.io can be used for multistage TF deployments, although I prefer dagger myself...

Terraform has a major state management design flaw that has been ignored by HashiCorp to force TFE upsales. It's impossible to perform multi-stage deployments with a single `terraform apply`. You have to manually identify the deployment targets for every stage; terraform providers do not support a `depends_on` block and they are not part of the resource dependency resolution graph. i.e. you can't deploy Vault and then configure it with the respective provider - terraform will try to perform both deployment and configuration simultaneously and will fail (see the sketch at the end of this comment).

3. This is due to strong Sales Opinion that a Single Plan is of a Positive Product Value for Terraform. While in practice it turned out to be False, the actual Product Value of Terraform is in Single Consolidated Infrastructure state, which can be analyzed by the respective static analyzers (infracost, tfsec, checkov, inframap, driftctl etc). And it's a strong pro compared to both Pulumi and Crossplane...

Having a single state is a blessing for large companies with a tight operational schedule - having multiple states with a single lock can cause conflicts quite often, with volatile outcomes. Yet again, an upsale point for TFE.

Even though Terraform "has more providers", you have to be able to support 'em all by yourself; HashiCorp does not provide a Viable Support Plan for the existing Official Terraform providers (in my experience - maybe someone was luckier).

That's why I'm often saying that DevOps is not a title, it's a methodology... and every DevSecOps guy should be well versed in Go to be able to support, test and extend the respective tools and operators (k8s automation).

* answered in a bit more detail below https://news.ycombinator.com/item?id=32405064
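A rough sketch of the shape of that Vault problem (resource and provider arguments are illustrative only, not a working config):

    # stage 1: stand Vault up
    resource "helm_release" "vault" {
      name       = "vault"
      repository = "https://helm.releases.hashicorp.com"
      chart      = "vault"
    }

    # stage 2: configure it - but a provider block can't depend on the
    # helm_release above, so a single `terraform apply` from an empty state
    # tends to fail: Terraform will try to create vault_mount before Vault
    # is actually reachable
    provider "vault" {
      address = "https://vault.example.com"
    }

    resource "vault_mount" "kv" {
      path = "secret"
      type = "kv-v2"
    }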


The challenge with this is that many orgs have hundreds if not thousands of standards that must be adhered to in this situation. So if an engineer wishes to use a database, they can stand one up with the CDK, no problem. But it's going to fail every audit unless they are aware of these requirements. And even if they are aware of these requirements, doing the integration for things like privileged access management, billing and security monitoring is a pain in the ass that provides no incremental value.

On the flipside, having a single platform team write the IaC components that do all of this grunt work tends to reduce the degrees of freedom that the engineer has to build the application architecture exactly how they want it.


As a SWE, I don't think it's great. I'm more interested in writing product features and fixing bugs. I'm not that interested in writing CDK stacks to create databases or what not.


If your features don't involve utilizing stuff like S3, SNS, SQS, Lambda, DynamoDB, CloudWatch Metrics / Alarms, you're probably missing out on a lot of value. Nobody wants to spend time "writing CDK stacks to create databases or what not," but implementing tough features without these services can easily involve exponentially more work / maintenance / expertise.


That's fine if that's what you want. But I'd suggest that your career might be improved by being more willing to take on harder challenges and grow as an engineer. Product features and fixing bugs are intricately linked to infrastructure, and limiting yourself to only code is going to limit your ability to debug hard problems.

Early in my career, I worked for a business that still used mainframes, and we had this random bug that caused processes not to communicate and to drop messages. Because I was willing to dig into the intricacies of server administration, I was able to diagnose the problem as the IPC queue size being set to the kernel default, and the default only allowed for a few seconds of messages to back up. It was a quick fix - deploy a new kernel parameter to allow for bigger IPC buffers - but if I'd refused to do "devops" work, we'd never have found the problem.

As my old boss said, you're not paid to do only the easy stuff. You're paid to do the hard stuff too.


> But I'd suggest that your career might be improved by being more willing to take on harder challenges and grow as an engineer.

Indeed. That's what I have done (I do infra stuff + product features). I don't like it, but I do it.

> As my old boss said, you're not paid to do only the easy stuff. You're paid to do the hard stuff too.

This intrinsically states "product features == easy", "infra stuff == hard". While I think that certain topics related to infrastructure are hard, certain topics related to product features are hard as well. What I think your boss wanted to say is "I don't want to pay an extra paycheck to an infra engineer. So, I'll pay you 20% more to do the infra stuff instead. And you get to 'grow as an engineer'. It's a win-win!".


The CDK seemed to be an attempt to supplant the Serverless framework and Pulumi even. Are those options barred for use internally or just under promoted?


Amazon loves dogfooding. I don't think they're barred, but fall under the "nobody ever got fired for using CDK" domain.


The author calls out a few reasons why DevOps fails for organizations, all of which I agree with - however, there's one I've never completely understood: regulatory reasons for keeping Ops centralized.

I work in healthcare which I guess should fall under this rule - but in practice I haven't really seen that impeding DevOps. Teams that have the capabilities to build the full stack get handed a subscription to a cloud provider and they go off and do so. They still fill out and track change logs, audit changes and seek approvals - but after that's done, it's still the team who presses "the button".

Anybody in a regulated industry where you've hit hard walls that prevent you and your team from going full on DevOps? If so, what rules were quoted that stopped you.


I am not in a regulated industry, but we have recently gone through the process of getting SOC2/ISO27001 certified.

This is what was cited for us.

ISO27001:2013 A.6.1.2: Segregation of Duties. Conflicting duties and areas of responsibility must be segregated in order to reduce the opportunities for unauthorized or unintentional modification or misuse of any of the organization's assets.


Surely that means that no one individual can push a change they created without involving someone else, but that it is still fine as long as any two people (even if they're on the same team) are involved? You could solve this by e.g. forcing GitHub to require a review.


What exactly is a "Conflicting duty"? What's stopping a company from stating that developing, deploying and supporting software is a single duty?


Nothing ... except compliance.

The idea comes from finance -- to require collusion to execute a fraud. It's not perfect, but it's something.


Maybe I should rephrase that: Is it impossible for a company that defines Dev+Ops as a single responsibility to be compliant?


Not impossible. Even in a prescriptive framework like ISO 27001, adequate SoD is a judgement call between you and the auditor. Generally speaking, if a single dev can push a code change to prod in a way that would escape audit or not require a second pair of eyes, that would not be compliant. So if a dev writing code also manages the deploy environment, that may not pass muster.

But it's not that cut and dried. There are degrees of rigor.


No. Assuming a well-configured continuous deployment type environment, you just need to have peer review on code before it can hit production, and you need to have controls in place over the who, what and when of elevated access to production being granted.


This all breaks down as soon as audit realises the DevOps team is also admin of the CI/CD stack, and therefore all controls put in place to make it harder for a single actor to do bad stuff can be bypassed via this all-powerful system.


It seems like the description is vague enough that this is entirely up to whoever gives you your certification.


There’s another reason that’s a bit older but there’s a line item in section 404 of SOX called “segregation of duties” which many bureaucrats interpreted to mean “developers must not have access to production” when that’s not what the regulatory requirement means. It essentially means checks and balances for accountability and auditability. If nobody can cowboy code their way into prod it’s fine. In fact, rogue ops engineers modifying code in production is an example of how separating ops and dev won’t really protect from insider threat vectors either. What really must happen is that there is a sure way to verify that code is approved by another stakeholder for deployment and tracked at traceability levels appropriate to who can fix it or should be able to view the info.

When people keep yammering on about devops as a principle of people and processes they’ve already lost because processes are meant to replace people, so really all that matters are the processes and the services that fit into the process SLA and OLA.

Note that in a big organization what really matters are your particular regulators and arguing with your regulators claiming to know it better than them is probably one of the fastest, reliable ways to get fired I can imagine that won’t result in a criminal lawsuit against you.


Author here. That's interesting, as I've not worked with healthcare too much.

Others here have cited segregation of duties, which is definitely a factor, but the other one less mentioned in finance is the 'one throat to choke' principle: it's simpler from a management and regulatory perspective to have the responsibility for failures in one place rather than across many teams.


Ah - that makes sense. This might be a bit easier in healthcare as I believe it's pretty common to have many different ops teams each responsible for different parts of the business.

I feel like most of the time "compliance" is blamed when really, it's your first point in that section (Absent an existential threat, the necessary organizational changes were more difficult to make) that is the real holdup.


PCI doesn't _stop_ us from distributing these duties, but it sure does make it harder. Having change management processes in place adds all sorts of additional controls. Sharing code between this system and the main system creates either friction or a lack of DRY code.


Same here - made and make health apps, ISO13485. No requirement for a central ops team.


SOD - Segregation of Duties.


The thought leadership seems to be to get Dev and Ops to work directly together and avoid handoffs by creating a totally separate department called DevOps and having them do all their handoffs with dev and ops. You can call them platform engineering so nobody figures it out, though.


> But despite a lot of effort, the vast majority of organisations couldn’t make this ideal work in practice, even if they tried.

This matches my on-the-ground experience. The teams who lived the dream of DevOps were teams which built their software as cloud native (instead of later trying to migrate to the cloud). This is purely because the PaaS tooling let them efficiently be both Devs and Admins.

When you involve many teams instead of just a smallish group of devs, you have momentum to deal with. Plus, specialization - some of these ops people just don't like coding, or at least not the kind of coding you need to be doing to be effective DevOps engineers.

Indeed this leads to SRE - just because "Buying it" is usually easier than "Building it".


Indeed. I think the other big elephant in the room not mentioned in the article is architecture.

If you've built your project as cloud-native from day one, you'll have a lot less DevOps work to do. You're basically just writing code or templates that apply cloud-based configuration. That's not to say there's no complexity, but it's not unreasonable.

If your org has already gone all-in on microservices and Kubernetes, there is a much stronger case to be made for centralized Ops. The amount of understanding, care, feeding, and training necessary is much higher. You won't be able to get by with one or two contributors occasionally making changes to Terraform templates as necessary. Clusters are expensive, require occasional upgrades, and centralized metrics and logging don't come for free and require their own access control. It's still better than it's ever been, but it's a lot like building and maintaining your own cloud, which quickly becomes a full-time job.


If the team building the application is designing the infrastructure, then that's almost the DevOps ideal that orgs dream about. It's really the stuff that happens once you've been launched for a few months where things start to drift. It might start with a redirect rule that isn't in code, an ask to do cost optimization, a security audit. Maybe one dev raises their hand and a few months later they're the DevOps lead with a DevOps team who become the dumping ground for all the non-coding tasks, on-call, budgeting. Even if your platform is all PaaS and SaaS and serverless, stuff will still break and someone needs to answer the pages.


> The teams who lived the dream of DevOps were teams which built their software as cloud native

Or the total opposite: people who didn't drink the cloud narrative, where the operational simplicity makes devops a no-brainer.

i.e.: either you are truly made for the massive overengineering of the cloud (because you ACTUALLY need that), or you keep things simple enough to fit in a single head.

What is problematic is when you try to be the first, or pretend to be the second.


The immediate following sentence from the one you quoted was:

> This is purely because the PaaS tooling let them efficiently be both Devs and Admins.

It wasn't the cloud which made them successful, it was the level of automation baked into the PaaS solution(s) they built on top of. I thought I was pretty clear on that, but I suppose I could have been more clear.

You could similarly get success if you had an OnPrem solution with really great orchestration layers. And in fact, I've heard of such successes (but not seen up close) of teams doing exactly that with Pivotal Cloud Foundry.


At my old employer, "Ops" and "Security" still had a role in managing fundamental components of the system. In AWS terms, that means deploying AWS accounts in the organization, automation for best practices and compliance detection, IAM roles, VPCs, etc.

Security and network teams also built custom Terraform modules, which the dev teams were forced to use, that acted as guardrails. You didn't use aws_s3_bucket, you used custom_aws_s3_bucket, which mandated certain fields and prerequisites. This was the compromise struck to otherwise allow devs to go ham in their own AWS accounts and self-manage their deploys, databases, and so on.
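A sketch of what the inside of such a wrapper module can look like (the module name is from the comment above, but the mandated fields and resources here are made up for illustration):

    variable "bucket_name" { type = string }
    variable "data_owner"  { type = string } # mandated tag, no default on purpose

    resource "aws_s3_bucket" "this" {
      bucket = var.bucket_name
      tags = {
        DataOwner = var.data_owner
      }
    }

    # company-wide prerequisites baked in: encryption at rest, no public access
    resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
      bucket = aws_s3_bucket.this.id
      rule {
        apply_server_side_encryption_by_default {
          sse_algorithm = "aws:kms"
        }
      }
    }

    resource "aws_s3_bucket_public_access_block" "this" {
      bucket                  = aws_s3_bucket.this.id
      block_public_acls       = true
      block_public_policy     = true
      ignore_public_acls      = true
      restrict_public_buckets = true
    }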


At my company, I do. As a backend engineer. Guided by the devops team.

It is a nightmare of arcane copy pasting.

Terraform is an overengineered mess, a complex enemy I need to beat to deploy my simple changes.


Terraform really takes Conway's Law and manifests it in the real world, but I don't think terraform itself is an overengineered mess. It's quite simple when you boil it down. However, if your team structure and communication is bad, you're probably going to get bad terraform code. I've seen both, large teams writing extremely well-written terraform, and small teams writing a mess of spaghetti. Like anything in software, it mainly comes down to the people.


Quite simple for simple tasks. Automation is not always simple, though. It's like asking someone to write a program that is static and only has inputs and outputs, but doesn't do loops well, or anything else. Terraform is great if you are fine with duplicating things in 100 different places. Once you want to take values and plug them into various modules and run loops, Terraform begins to suck and its weaknesses get exposed pretty quickly.
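For what it's worth, the kind of thing I mean is fanning shared values out into module calls, roughly like this (module source and variables are hypothetical; module-level for_each needs Terraform 0.13+):

    variable "services" {
      type = map(object({
        instance_class = string
      }))
      default = {
        orders  = { instance_class = "db.t3.medium" }
        billing = { instance_class = "db.r6g.large" }
      }
    }

    module "database" {
      # one database per entry in the map
      for_each = var.services
      source   = "./modules/postgres"

      service_name   = each.key
      instance_class = each.value.instance_class
    }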


So duplication is not inherently bad if each team has slightly different requirements and is expected to own its infrastructure. Or in other words, the overhead of communication can easily be higher than the overhead of duplication.


Yes, vanilla terraform can be a bit verbose at times but that doesn't bother me. If DRY and other software engineering-y concepts are important to you, I would recommend a wrapper that assists with that, something like terragrunt.
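As a rough sketch, a per-stack terragrunt.hcl keeps things DRY by pulling shared config from a parent folder (the paths and module source here are hypothetical):

    # live/prod/orders-db/terragrunt.hcl
    include {
      # pulls in shared remote state and provider configuration
      path = find_in_parent_folders()
    }

    terraform {
      source = "git::https://github.com/mycompany/terraform-postgres-mycompany.git?ref=v1.4.0"
    }

    inputs = {
      service_name   = "orders"
      environment    = "prod"
      instance_class = "db.r6g.large"
    }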


Compared to CDK? It's less verbose. The problem with Infrastructure as Code is that it's not code at all. It's just describing infrastructure - something that's easy with typical web console manipulation but becomes complicated when you try to describe every moving piece.


Exactly this. Terraform is not really code, it's configuration. Sure, you have variable substitution and some limited looping.


What is a Platform? An excuse for an executive to adopt a trend and pass the buck to the next exec after he gets promoted for accomplishing his initiative (but before it's apparent that it was all a sham).

How did we get here? The business doesn't want to pay for a well-designed enterprise and the organization is shitty, so hire people who aren't very good at tech to build an unnecessarily complicated engineering organization that [after they waste millions poorly building cloud tech without prior experience, they realize is] still a cost center, and tell them to chase fads.

Factor One: Non-Negotiable Standards. Tell everyone they have to do the same thing, even if it makes no sense for what they're building or supporting.

Factor Two: Engineer Capability. Make sure you put unrealistic deadlines in the hands of amateur engineers and then turn up the scope creep.

Factor Three: Management Capability. Make sure your management can always blame somebody else for why your ridiculous initiative and poorly managed company didn't achieve its goals by its stated deadline. Market timing and "I didn't have enough resources" are good stand-bys.

Factor Four: Platform Team Capability. Pay a million in salaries to some middling full-time engineers, put them in a silo, make them build really basic tech from scratch that 50 different managed service companies sell for pennies. Don't Scrum with the teams that will be forced to use it. Make sure everyone is required to use the platform, even when it's not actually ready to go live, so that building any kind of product at all is mostly infeasible, incredibly slow, and painful.

Factor Five: Time to Market. Do everything you can to avoid value chain analysis, training employees on standard practices, unified communications, or getting stakeholders to work with you on initiatives. When your competitor lands a feature a year earlier than you planned, blame the consultants/contractors you never listened to.

Who should write the terraform? An overworked systems engineer in a siloed team. Definitely not someone working on the product. This way they can write 5 layers of unnecessary module abstractions, be unaware of how non-functional the module is from never actually running it on the product [and watching it fail six ways from Sunday], and still not provide what the business needs.


Developers should be able to do the work.

"we aren't living in 2016 anymore, and the cloud moves fast. Platform teams are expensive and hard to do, offer a mediocre service at best, destroy velocity, and create bad incentives." [1]

[1]https://twitter.com/iamvlaaaaaaad/status/1534489514818686976...


I've had some thoughts around this issue more recently after moving from DevOps -> Software Engineering.

I love the idea of cross-functional teams, but from what I have seen of the most recent implementation of it that I'm working in, there are, as always, issues of definition around what a cross-functional team actually is and should be.

IMO grabbing a bunch of backend SEs and making them handle their own DevOps is a joke of Academy Awards host level proportions. The shit I see as an ex-DevOps dude is horrific. The notion that a bunch of people who've never done the role can somehow figure it out without specific training doesn't work, in my experience.

A cross functional team should actually be cross functional, where you have an engineer, whose specialty is the work you intend for them to complete within that team. Otherwise we're just being overburdened with extra shit that we frankly will never get the time to actually complete in a meaningful way, and it just generates more and more technical debt.


This misses the point a bit. Even if app teams write the terraform, there is no way a security-constrained company will let them deploy it without running a security check (OPA, Checkov).

So, either way, a large organization is going to punt that terraform/cfn/cdk template down a pipeline with a bunch of automated compliance reviews. Whether the App team or Ops team wrote it.


My experience being on a team that owned its infrastructure was that it wasn't really a terrible experience per se, but there was so much time between stories that required infra changes that the context decay was massive. We always managed, but it would take a lot of time to rebuild context and remember where everything was and generally how Terraform worked.


I've been in a team where Platform/Infrastructure Engineers handled everything Terraform and it was great. You just described to them what you wanted and they did it. Developers never touched a .tf file.

Then I moved to a team where Ops write Terraform but also expected developers to contribute. They pitched this as "Developers should be able to make small changes". Turned out we had very different understandings of the definition of "small".

I'm currently in a team with no Ops and developers are fully responsible for managing infra all the way to production. The Terraform implementation is an absolute mess. There is, however, an understanding that it needs fixing and Ops support has been promised.

My answer to "Who should write the Terraform?" is: the Platform Engineers. A developer can maybe optionally pitch in if they feel confident enough, but ultimately Platform Engineers should own the platform.


Wow that was excellent, very thorough but also easy to read and with minimal fluff

My personal take is that DevOps doesn't work (for me, and probably many others) because it amounts to context-switching (recently featured on HN: https://news.ycombinator.com/item?id=32390499). By being responsible for both Dev and Ops, my time (and my brain) gets split 50/50 into two entirely different sets of:

- Concerns

- Languages

- Tools

- Mindsets

This is both super draining, and counter-productive, for me. If Ops can be made so simple (by a platform team or otherwise) that it doesn't amount to a whole separate headspace, then great, I'll manage instances myself. But as long as it's a whole separate domain, trying to have one foot on each side of the fence is just not going to be workable.


I think real DevOps works here just fine as long as your organization allows the time for deep work.

If you are writing application code with 0% of your brain lent to operational concerns, you are writing it in a disconnected dream cloud and your organization will have problems that come from it.

The answer to context switching is to not do it several times per day. The answer is NOT to ignore operational concerns and mindsets.

Many engineers pride themselves on having a wide set of skills and enjoy switching contexts, languages, tools and mindsets for variety and to reduce boredom.


The answer to "Who Should?" anything in my organization is "me". I go from writing ruby, to terraform, to javascript, html, css, to bash scripts, SQL, etc. Oh, and I have to manage people, and do code reviews, and support, and meet with clients...

Help... me...

Anyway, I've got the members of my dev team writing terraform for their changes now too. It's working, more or less. They are excited to do it because it's new and it pads their resumes. But as we continue to increase demands on devs, they need to get paid for the trouble or the responsibilities must be diffused.


On "how we got here" you use "bulleted list" rather than "numbered list". This is important as "If you rearrange the items in a bulleted list, the list's meaning does not change. "

Credit to https://developers.google.com/tech-writing/one/lists-and-tab... which pointed this out and has stuck in my craw ever since.


Good article, I enjoyed it. I agree with the premise that every company is different and they need to adopt what works for them. Ownership alone is a huge issue I've run into in the past.


The answer to this question is that you should never be writing Terraform/CDK from scratch, you are wasting time.

1. Scaffold your infrastructure with simple point & click in web console.

2. Generate terraform/CDK code by scanning your AWS account with typically available tools.

3. Edit and update said Infrastructure as Code as needed, swapping out the parameters for the values you need to change according to CI/CD

The whole "I want to write infrastructure as code from day 1" approach is not only stupid, it's a waste of resources.
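
For what it's worth, steps 2 and 3 map pretty directly onto Terraform 1.5+ import blocks these days (a rough sketch, the bucket name is made up; older workflows used terraform import or third-party scanners instead):

    # Point Terraform at a bucket that was clicked together in the console
    import {
      to = aws_s3_bucket.assets
      id = "my-existing-assets-bucket"  # illustrative
    }

    # Then generate matching HCL and edit it, swapping hard-coded values
    # for variables as needed:
    #   terraform plan -generate-config-out=generated.tf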


> "i want to write infrastructure as code from day 1" is not only stupid , its a waste of resources

I tend to disagree. It depends on the scale... and after you've scaled and grown, the absence of DevSecOps becomes a drag, affecting your delivery cycle and, indirectly, your sales. Proper DevOps defines some of the business lifecycle operations as well, like BI and A/B testing, which essentially helps in validating pending business assumptions. It's something that can help differentiate you in the market and validate actual product viability - prove that your MVP actually has any V in it.

Operations-wise, first and foremost you have to keep track of the issues currently present in AWS solutions and automate workarounds, and there is a lot of security automation and organizational work that can't really be handled efficiently with a "click in the web console".

For instance, setting up a proper EKS cluster by hand, without any hardening, would require at least three hours of clicking through, with all the IRSA roles and EKS-specific IAM permissions. Terraform, on the other hand, has ready-to-use open-source modules shipped by both the community and AWS itself (terraform-aws-modules, aws-ia) that introduce advanced EKS management practices without any added effort. Ten lines of IaC can easily replace half an hour of click-through.
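
To make the "ten lines" concrete, here's a sketch of what that module usage looks like (the version pin, cluster name, and variables are illustrative, not a drop-in config):

    module "eks" {
      source  = "terraform-aws-modules/eks/aws"
      version = "~> 19.0"  # illustrative pin

      cluster_name    = "demo"
      cluster_version = "1.27"
      vpc_id          = var.vpc_id              # assumed to exist elsewhere
      subnet_ids      = var.private_subnet_ids

      eks_managed_node_groups = {
        default = {
          instance_types = ["t3.large"]
          min_size       = 1
          max_size       = 3
          desired_size   = 2
        }
      }
    }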

The cost of integration is nearly zero during the product bootstrap phase, but as you grow, integrating proper organizational management with AWS Organizations and Control Tower, reorganizing your AWS accounts, transferring resources, and hardening security boundaries rises in complexity and costs a lot - especially if you'll ever want to perform proper security audits or need HIPAA/GDPR compliance.

For some Disney companies, for instance, which chose to handle org management by developing custom tools after five years of operation, proper integration with AWS Organizations remained a dream, and their unreasonably tight operational schedule and on-call deficiencies became a drag. The integration cost rose to eight figures.

The cost of DevSecOps hardening basically doubles every quarter, if you're growing fast enough and lack automation.

As for myself, automating everything allowed me to manage Kubernetes complexity and develop a fine-tuned, vertically scalable solution (VPA+HPA on Keda with cluster autoscalers) - about 30 different k8s services deployed in a mix of x86 and Arm instances, with continuous placement and resource limits/requests optimization, completely downscalable. My AWS bill is only 7% of my raw income.

So, if you can hire a DevOps consultancy, can actually measure how much time is wasted on manual operation compared to automated operation, and are able to self-reflect without confirmation bias, do that ASAP.


> The development team didn’t want to – or couldn’t – do the Ops work

But this is just because companies wanted to put the "ops" work onto the shoulders of developers. What should be done is to hire one (or more) specific "ops/platform" engineers per team. Such engineers are the gateway for the team for all platform-related stuff. I'm not talking about SREs here; I think SREs are more about making the products as performant and efficient as possible (while platform engineers per team are more about setting up infrastructure). Sure, both roles (in addition to the SWE role) do their job best if they are working together in the same team.

What I see nowadays in small and mid-size companies is either:

1. There is a "platform" team. They own infrastructure repositories, but they let product teams make PRs to such repos (e.g., the platform team usually creates some kind of guidelines for managing infrastructure, like "How to create a staging Mongo DB"). The "platform" team is in charge of reviewing such PRs and merging them. Now, there are certain aspects of the infrastructure that only the "platform" team can actually work on (because the product teams either don't care about them or don't know about them). This doesn't work because the "platform" team becomes a bottleneck when the number of product teams starts to grow (the #platform Slack channel becomes a nightmare with dozens of requests per day, and many platform engineers end up burned out because they see themselves as "customer service" for developers).

2. Developers push "product features" and at the same time do "infrastructure" stuff. Companies usually call this "you build it, you run it". In reality it's just cheap management (companies don't want to hire infrastructure engineers and they think the developers are excited to learn docker/k8s/aws/gcp/terraform, so let them have fun). This is ultimately a nightmare for many developers because they end up burned out ("I want to work on product features! I don't want to fix GitLab pipelines").

I think the original idea of DevOps is totally valid. Just don't force your SWEs to work on infrastructure stuff. Instead, hire one or more infrastructure engineers for every product team you have. This way SWEs (dev) and infrastructure engineers (ops) can work close together and push stuff faster. Obviously almost no company is doing this because it is more expensive than the alternatives stated above.

Would you let your SWEs design your front page? No. They obviously have a voice in the process of designing the front page, but ultimately the ones who should design it are your Product Designers (obviously, for this to work, both your SWEs and your PDs should work in the same team).


I've thought for a while that sysadmin, operator, "devops engineer", and SRE were all more or less the same job, but always felt like saying it would be silly of me.

In the future, I'll just link to this piece.


If you're at a company that doesn't have a Platform team, but that still struggles with wanting centralized guardrails and best practices, and a consistent set of patterns across services and teams, the answer to this question might be a developer experience platform like what we're building at Coherence (I'm a cofounder). In this case, someone else writes the terraform, and you just tell us how to map it onto your code. This lets us give you nice things like a dashboard to manage deployments, cloud IDEs, branch preview environments, etc. while still giving your dev/devops folks total control and visibility, since it runs in your own cloud...

Would love anyone interested to give it a spin at withcoherence.com and please feel free to ping hn@withcoherence.com with any feedback or issues!


At Terrateam[0] we specialize in Terraform automation with GitHub Pull Requests and GitHub Actions.

We talk to a lot of Terraform users from a lot of different companies. The most popular way of doing things is having your SRE/DevOps team write the bulk of the Terraform modules for your organization. Other members of engineering then consume these modules to create resources for their platform/application/etc. This code can either live in a Terraform monorepo or inside an application-specific repo. We've seen many approaches.
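
On the consuming side, a service repo often ends up with just a few lines pointing at a shared module (a sketch only; the org, module path, and inputs are made up):

    # app-repo/infra/main.tf -- hypothetical service-owned stack
    module "queue" {
      source = "git::https://github.com/acme/terraform-modules.git//sqs-queue?ref=v2.1.0"

      service_name = "checkout"
      environment  = var.environment
    }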

Scaling Terraform inside your organization is incredibly convenient with Terrateam as we leverage many pieces of GitHub.

[0]: https://terrateam.io


As for the history-lesson part, I find this article totally omits the actual technology change where we stopped hand-cranking individual servers and started treating infrastructure as code…


Usually the answer in my circles is "the DevOps team".


No one


Nobody should write terraform


A: “Our Terraform is so bloated it’s starting to rot!”

B: “Hashicorpse strikes again.”


Everything should be in code. I feel for sysadmins, but they had all the time to become software professionals. No more mercy


Yikes!

I presume this self-assured statement comes from ignorance and youth? Can you detail the sysadmins that you've worked with that didn't understand code or the underlying infrastructure they are tasked with supporting better than you or your team?


Sure, it comes from ignorance, but none of these people are young, myself included. Possessing domain knowledge cannot be enough on its own anymore, as it implies making room for limiting solutions like Terraform that only exist to make life easier for those who call themselves tech professionals without knowing how to write code. Everyone should be a software engineer. Infrastructure should be in code, and follow software development practices. I hope that the CDKs will become the norm. This also means that nobody will need to "support" anybody, but rather "work with" a teammate, because they would all be software engineers - no more DevOps/QA/software walls. It is possible: anybody can still be an SME in something while also being a software engineer; people are just lazy because the industry pays six figures even if you only know Terraform and AWS.


Devs writing Terraform means ops doesn't move fast enough. Fix ops instead of forcing the teams to roll their own.


But, as the article points out, centralized ops teams many times do not have the capacity to handle all the requests coming from the dev teams.


So fix that by hiring more ops folks...? Ops should scale with the dev teams, if they aren't you're burning them out and will have even more problems.


Scaling ops teams is a lot like scaling dev teams - stick with the two-pizza plan. That's where you start finding the need for enablement teams (cross-cutting) to ensure compliance, standardization, etc.


I don't think that is fair or accurate. My experience with larger enterprises is you have legacy ops and a whole bunch of people who know how to keep a few things running. They are like the systems administrators, but when it comes to development and automation they don't like change, because that change means for the most part they are irrelevant. With all the segregation in bigger companies, if you asked a systems administrator to do cloud networking, storage, etc., they are going to not so politely tell you to F off. They are comfortable with just knowing how to manage vSphere or SQL Server. More automation makes them feel uneasy because they simply don't want to adapt to the change and learn something new. They'd rather throw a fit and try to make your job more difficult.


No matter how capable and well-staffed ops is, it will always take them much longer to provision my stuff than if i did it myself. I like having more eyes on stuff before it hits prod, but waiting days to get things setup in dev is not just unproductive, but demoralizing.


> waiting days to get things setup in dev

not really an ops thing. This is just the reality of large corps. Microservices are the same way, with devs partitioned off. You put in a Jira ticket and wait a month for a turnaround on a task that would take an hour to do. Eventually management wants it done now and, like always, they bust open and manipulate the process for their own purposes. Rules for thee, not for me.


Who owns it when it goes to production? You, by yourself? What happens when it breaks, you're on call 24/7? You really want that? You're going to stay on top of keeping it up to date, in line with governance/security requirements? When do you have time to work on features and bugfixes?


We don't use Terraform or similar, though we do manage our own VMs. We devs can setup our own stuff for dev, and then ops will do test/staging and production.

Seems like this should solve GP's complaint while allowing ops/support to "own" once it leaves dev.


> Fix ops

Good luck. Let me know how that works out.



