
Software Engineer at Spacelift[0] here - a CI/CD platform specialized for Infrastructure as Code (including Terraform).

A pattern we're seeing increasingly often is Platform Engineering teams doing the bulk of the work, including all the fundamentals, guidelines, safety railing, and conventions, while Software Engineers only use those, or write their own simple service-specific Terraform Stacks, which themselves lean heavily on modules developed by the platform team.

This also seems like the sweet spot to me, where most of the Terraform code (and especially the advanced Terraform bits) is handled by a team that's specialized for it. If you don't have a Platform Engineering team, or one playing that role (even if it's called DevOps or Ops or SRE), then even in a medium-sized company you'll probably start having as many approaches to your infrastructure as there are teams, complexity will explode, and implementation/verification of compliance requirements will be a chore. Just a few people responsible for handling this will yield huge benefits.

And yes, I can wholeheartedly recommend Spacelift if you're trying to scale Terraform usage across people and teams - and not just because I work there.

Disclaimer: Opinions are my own.

[0]: https://spacelift.io




I think a platform team taking ownership is the correct model, but the early product teams need to have "embeds".

The platform team owning base Terraform functionality works well for the product teams that are the 3rd or 4th user of said functionality.

For the early days of the platform, and for the early users, your product is constantly in dependency and priority battles with said platform team. This is where "embeds" help: they continually unblock you while making sure the work is done in a platform-centric manner that will be reusable for other product teams.

Simply saying the product teams need to go down into the weeds at this level puts too much disparate responsibility on teams that exist to deliver a single product. It also encourages vastly different approaches to similar problems, with all the duplicated work and rework that entails.


I tend to think of embeds as being very similar to the open source contribution model: you want some sort of BDFL entity that drives the overall direction of the platform, but also some sense of community/collaboration where individuals can feel empowered to contribute features to scratch their own itches, or bring up discussions, etc.

Having a team owning the platform doesn't necessarily need to mean shutting yourself in a cave. Granted, promoting cross-functional collaboration is a challenge in and of itself, but similar to OSS, projects that invest in the community aspect are the ones that eventually gain critical mass and set themselves apart from the rest.


Having a single "platform" team per company is a bottleneck as soon as the number of product teams is greater than N.

> ...you'll probably start having as many approaches to your infrastructure as there are teams, complexity will explode, and implementation/verification of compliance requirements will be a chore. Just a few people responsible for handling this will yield huge benefits.

Agree with the centralization of "how infrastructure should be managed/defined". A "platform" team composed of M platform engineers (where each platform engineer works 80% of their time for a given product team) can handle such centralization.


> Having a single "platform" team per company is a bottleneck as soon as the number of product teams is greater than N.

This is my experience as well. Having a single platform team has been a great experience for laying foundations, establishing shared architectures, and centralizing documentation.

As soon as two or more teams need something from the platform team, it becomes a battle of priorities. A good platform team will recognize this and work on a division of labor and coordination strategy that can start to scale. A bad platform team will treat this as an opportunity to claim the company’s wins for themselves and leverage their bottleneck position for political gain.

The company’s management of the platform team is key. I’ve also seen a single platform team abused as the engineers who are expected to own all the hard work while other teams get to walk all over them with demands. This results in a lot of employee turnover, which is the opposite of what you want on a team tasked with holding the core knowledge of the company’s infrastructure.


You can have more than one platform team.

I think reality is more complicated than a one-size-fits-all approach. It's going to be specific to your org, your project, the stage it's at, etc. To add to that, the right thing to do is often in flux.

Dedicated capacity is necessary, as is embedding. Not always at the same time or in that order. In the end, only information found inside the walls of your organisation can tell you what's necessary to solve your problem.


It also creates the unrealistic expectation that one size fits all. An architecture that works well for stateless microservices fails spectacularly when faced with monolithic session-bound legacy telecoms services.

Yet so many people insist that the one is the same as the other, when one is a duck and the other is an elephant wearing two swimming fins on its face.


> Platform Engineering teams doing the bulk of the work, including all the fundamentals, guidelines, safety railing, and conventions, while Software Engineers only use those

So, sysadmins and programmers - but with new 2020s-vintage titles? (and remuneration...)


Basically, yeah, but with the difference that these sysadmins are generalizing and abstracting the patterns they've learned over years.

I personally think of "devops" not quite so much as being about "dev" and "ops" collaborating (though that is a noble and worthwhile goal) as about having "developer-operators", people who know how to do operations effectively and who can turn that knowledge into automated, generalized software systems.

The abstract modules and tools can live in their own repositories (or folders, in a monorepo), and your devoperators can work closely with the product teams to use them (and to generalize changes made for one project's needs so they apply more broadly).
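
As an illustration, that split might look something like this in a monorepo (the layout and names are hypothetical):

    infrastructure/
      modules/              # owned by the platform team
        postgres/
        vpc/
        service-baseline/
      stacks/               # service-specific, owned by product teams
        orders/
        billing/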


Seen this at a couple companies and it doesn’t work well. The platform team becomes a bottleneck and the devs don’t want to have to deal with or learn the mess that is terraform.

It’s time for the ecosystem to move beyond the half-baked config language known as HCL.


Pulumi


What do you suggest instead?


Pulumi seems a lot more sane; trying to cram all the complexity of infrastructure into a config language just doesn't add up at the end of the day. This is why we have general-purpose languages.

We could also probably use more abstractions similar to Pulumi. There's been talk on HN about storing all of the state for applications like this in the tags of the underlying cloud resources. There are some caveats with that approach, but it would offer interesting tradeoffs.


Have a look at GruCloud, an alternative to Terraform/Pulumi/CDK, which generates the infrastructure code automatically from live infrastructure. Disclaimer, I am the author.


Disclaimer: a kind of disavowal, a rejection of responsibility.

Disclosure: putting hidden info forward, e.g. a source of potential bias.


I think this would be easier to adopt if it could be plopped into an existing, agnostic CI system. We built something like this in-house on top of GitLab CI and it works really well for us. Locking isn't as much of an issue as you make it seem in the pitch: we just have our infra containers acquire and renew a distributed lease while they're running. Some kinds of failures just release the lock, and others panic and stop the world for human intervention.
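
For reference, vanilla Terraform handles the same concern with backend-level state locking; a minimal sketch using the S3 backend with a DynamoDB lock table (the bucket and table names are made up):

    terraform {
      backend "s3" {
        bucket         = "mycompany-tf-state"       # hypothetical state bucket
        key            = "orders/terraform.tfstate"
        region         = "eu-west-1"
        dynamodb_table = "tf-state-locks"           # Terraform takes a per-state lock here
        encrypt        = true
      }
    }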

Presumably your core competency isn't building CI systems or job runners, so why bother? I'm sure at the core of your own infra it's job-agnostic. The value-add is the management plane on top of it.


The semantics of standard CI/CD providers are in practice very ill-suited to more advanced Infrastructure-as-Code use cases (triggering other stacks based on changes, multi-repo workflows, etc.), so building on top of them would add a lot of complexity. I don't want to go too much into it.

Overall, if your setup works for you and you're happy with it - keep using it!

We've seen a lot of companies (many now our customers :) ) try to build their own on top of existing systems (GitHub, Gitlab, Jenkins, etc.) and waste a ton of time and engineering resources, while ultimately not achieving anything that works well.

What Spacelift does is give you a set of much better-suited building blocks that let you build your required workflow very quickly.

And it obviously does integrate very deeply with your VCS provider - Commits, Pull Requests, Comments, etc. - everything is supported and customizable using, amongst others, Push Policies[0].

[0]: https://docs.spacelift.io/concepts/policy/git-push-policy


Author here: yup

Pendulum back to the center


Partly yes, but not fully.

The idea is not to go back to the Software Engineer asking the Ops team "Hey, can you provision a Postgres database for me please?" and then waiting a week for it.

It's that the Software Engineer takes a module that was prepared by the Platform team - e.g. "terraform-postgres-mycompany" - which already includes all the requirements the company has for handling databases (think backups, monitoring, encryption, etc.). They can then proceed to use it in the small service-specific Terraform configuration, which really is just putting such ready-made modules together.
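
A minimal sketch of what that service-side configuration might look like (the module source, version, and "endpoint" output are all hypothetical):

    # Service repo: just composing the ready-made platform module
    module "orders_db" {
      source = "git::https://github.com/mycompany/terraform-postgres-mycompany.git?ref=v1.4.0"

      identifier     = "orders"
      instance_class = "db.t3.medium"
      # backups, monitoring, and encryption come from the module's defaults
    }

    output "db_endpoint" {
      value = module.orders_db.endpoint  # assumes the module exposes an "endpoint" output
    }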

The important bit being - the Platform team isn't a bottleneck here.


Sure, but early in this transition it is slow and painful.

At my last org...

The old process of "Hey, can you provision a Postgres database for me please?" was managed in a web ticketing system, change managed, and had a 24-hour turnaround! As were most other requests - VMs, NFS shares, FTP servers, network/FW changes, etc.

The new process was "Hey, is there a Terraform module for Postgres?" followed by weeks/months of prioritization and specification battles with the platform team. Somehow, despite the platform existing for 18 months, we were the first team to need a... database? Then they didn't want to support Postgres and were forcing everyone onto Aurora. Then it took N months for it to come up in their queue.

Rinse and repeat with every fundamental building block of a cloud offering, ad infinitum.

This is why I strongly prefer the embed model until the platform has proven itself to at least be at the "80% solved" end of the spectrum.


As an Ops, this sounds like a very overworked ops team. I can write a production-worthy Postgres module in less than a day, maybe a few hours. Or... just a subpar team, as much as I hate to say that.
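
To be concrete, the core of such a module can be tiny - a sketch on AWS RDS with a recent AWS provider (the variable names and specific policy choices are illustrative):

    # main.tf of a hypothetical terraform-postgres-mycompany module
    variable "identifier"        { type = string }
    variable "instance_class"    { type = string }
    variable "allocated_storage" { type = number }
    variable "multi_az"          { type = bool }

    resource "aws_db_instance" "this" {
      identifier                  = var.identifier
      engine                      = "postgres"
      instance_class              = var.instance_class
      allocated_storage           = var.allocated_storage
      storage_encrypted           = true  # company requirement: encryption at rest
      backup_retention_period     = 14    # company requirement: backups
      multi_az                    = var.multi_az
      username                    = "app"
      manage_master_user_password = true  # let RDS manage the master secret
      skip_final_snapshot         = false
    }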


As an Ops who literally replaced Patroni with a Terraform CTS module in about a week... I can say it would be nearly impossible to do on anything larger than a pizza-sized team, due to communication and confirmation biases and the anti-patterns that come with them.


Mostly overworked, but also not a team of veterans who knew what they were doing already. The platform-ops-to-product-app-devs ratio was like 1:30, and they had very little implemented yet, so they were completely overscheduled and understaffed.


I've been in ops for about 15 years, and understaffing us is such a huge problem. I just joined a company with 2 other ops for 80 developers over 8-10 teams; my previous one had 7 for 70 and was one of the few places where I felt like I could take a breath and relax, easily take days off, etc. I felt more like a SWE than ops with regard to work-life balance, on-call rotations, the ability to do creative and research work, and so on.

I'm glad those 2 ops at my current spot finally have help, but they were alone for years while the company was generating a ton of revenue, which bothers me, though it was seemingly their own decision; nobody told them they were overworked until more senior ops people like us arrived and went out of our way to pull work away from them.

Having all sorts of different teams (style, culture, language) relying on you to not gate-keep is... stressful at times. I try to context switch as little as possible, but sometimes it just can't be avoided.


> They can then proceed to use it in the small service-specific Terraform configuration, which really is just putting such ready-made modules together.

Now a new requirement comes in for the product team(s): a new service which bulk reads/processes another team's data (e.g., an export service).

Turns out this is too slow if the bulk reads go through the usual web API route, so direct access to the other team's provisioned database is required.

This now falls onto the platform team to produce a method for doing so.

Imagine this, but multiplied by N, where N is the number of different features being worked on at the same time - and all of them need time from the platform team to produce something new!


Agreed.


Thank you for the plug. Surprised such an open commercial promotion got top comment.



