We have environments with thousands of services and it scales fine. Why would dev (or your lowest integration environment) be broken most of the time?
> except that means the Kafka team
That (a Kafka team, or a DB2 team) is a bit of a red flag for me. Many (but not all) “tech” teams like that are part of the problem. Cross-functional delivery teams work much better, because running tech X in isolation is often not valuable.
The one exception to this is “x as a service” teams, e.g. DBaaS teams, or even a team that offers a pub/sub service. But in those cases such teams are literally measured by uptime SLOs, so if a pub/sub team can’t deliver uptime, they’d need to improve.
Where are you that has thousands of services and it scales fine? Twitter and Facebook famously both don't have a separate staging environment because they have thousands of services and it doesn't scale fine. They do canary releases and feature flags so as to do gradual deployment and testing in prod. If they can't solve the problem but you're somewhere that has, my next question is: are you hiring?
Dev is broken because devs are doing dev on it. I mean, it generally works, but it's the bleeding edge of development, so there's no real guarantee that someone didn't push something that breaks behavior the rest of the company is relying on.
What is the DBaaS or pub/sub team's commitment to uptime in the staging environment? It's staging. If they have to commit to a reasonable uptime, they can't actually use it as staging for themselves. Saying they need to improve is trying to handwave away the fact that they need a staging environment where they get to run experimental DBaaS or pub/sub things.
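To make the "feature flags for gradual deployment" point above concrete, here is a minimal sketch of a percentage-based flag check. It is not how any particular company does it; the function name `in_rollout` and the flag and user IDs are hypothetical. The idea is just that hashing the user into a stable bucket lets you raise the percentage over time without users flipping in and out.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` buckets.

    Hypothetical helper: hashes flag + user into one of 100 stable buckets,
    so the same user always gets the same answer for a given flag.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent
```

Because the bucket is fixed per user and flag, anyone included at 10% stays included when the flag is raised to 50%, which is what makes "ramp it up and watch prod" workable.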
I worked at AWS. Each team/service had their own alpha/beta/delta/gamma development environments, as well as one-box and blue/green production deployment environments, and deployed to waves of regions from smaller groups at the start to bigger groups at the end.
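Roughly, the "waves of regions" idea looks like the sketch below. This is only an illustration under my own assumptions, not AWS's actual tooling: `plan_waves`, `roll_out`, the doubling growth factor, and the `deploy`/`healthy` callbacks are all hypothetical.

```python
from typing import Callable, List, Tuple


def plan_waves(regions: List[str], first_wave_size: int = 1,
               growth: int = 2) -> List[List[str]]:
    """Split regions into waves that start small and grow by `growth`."""
    waves = []
    size = first_wave_size
    i = 0
    while i < len(regions):
        waves.append(regions[i:i + size])  # last wave may be smaller
        i += size
        size *= growth
    return waves


def roll_out(waves: List[List[str]],
             deploy: Callable[[str], None],
             healthy: Callable[[str], bool]) -> Tuple[List[str], bool]:
    """Deploy wave by wave; stop before the next wave if any region is unhealthy."""
    done = []
    for wave in waves:
        for region in wave:
            deploy(region)
            done.append(region)
        if not all(healthy(r) for r in wave):
            return done, False  # halt: later, larger waves are never touched
    return done, True
```

The point of the growing wave size is that a bad change gets caught while it is only in one or two regions, before it reaches the big groups at the end.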