I think unfortunately the conclusion here is a bit backwards; de-risking deployments by improving testing and organisational properties is important, but is not the only approach that works.
The author notes that there appears to be a fixed number of changes per deployment and that it is hard to increase - I think the 'Reversie Thinkie' here (as the author puts it) is actually to decrease the number of changes per deployment.
The reason those meetings exist is because of risk! The more changes in a deployment, the higher the risk that one of them is going to introduce a bug or operational issue. By deploying small changes often, you get to deliver value much sooner and fail smaller.
Combine this with techniques such as canarying and gradual rollout, and you enter a world where deployments are no longer flipping a switch and either breaking or not breaking - you get to turn outages into degradations.
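For illustration, a gradual rollout loop can be as simple as ramping traffic in steps and backing off when the error budget is exceeded. This is only a sketch; `set_canary_weight` and `error_rate` are placeholders for whatever your load balancer and metrics stack actually expose:

```python
import time

def gradual_rollout(set_canary_weight, error_rate,
                    steps=(1, 5, 25, 50, 100),
                    max_error_rate=0.01, soak_seconds=300):
    """Ramp a canary up in steps; shift traffic back if errors exceed the budget.

    set_canary_weight(percent) and error_rate() are placeholders for whatever
    your load balancer / metrics system actually provides.
    """
    for percent in steps:
        set_canary_weight(percent)
        time.sleep(soak_seconds)            # let the canary soak at this weight
        if error_rate() > max_error_rate:   # a degradation, not a full outage
            set_canary_weight(0)            # route all traffic back to the old version
            return False                    # rollout aborted
    return True                             # canary is now serving 100% of traffic
```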
This approach is corroborated by the DORA research[0], and covered well in Accelerate[1]. It also features centrally in The Phoenix Project[2] and its spiritual ancestor, The Goal[3].
> The reason those meetings exist is because of risk! The more changes in a deployment, the higher the risk that one of them is going to introduce a bug or operational issue.
Having worked on projects that did full CD and on projects that had biweekly releases with release-engineer meetings, I can state with full confidence that risk management is a correlated but indirect and secondary factor.
The main factor is quite clearly how much time and resources an organization invests in automated testing. If an organization has the misfortune of having test engineers who lack the technical background to do automation, they risk never breaking free of these meetings.
The reason why organizations need release meetings is that they lack the infrastructure to test deployments before and after rollouts, and they lack the infrastructure to roll back changes that fail once deployed. So they make up for this lack of investment by adding all these ad-hoc manual checks to compensate for the lack of automated checks. If QA teams lack any technical skills, they will push for manual processes as self-preservation.
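As a minimal sketch of the kind of automated check those meetings stand in for, something like the following covers "verify after rollout, roll back automatically if it fails" — the deploy script, health URL, and version names here are hypothetical:

```python
import subprocess
import urllib.request

HEALTH_URL = "https://service.internal/healthz"   # hypothetical health endpoint

def healthy(url=HEALTH_URL, timeout=5):
    """Post-rollout smoke check: does the service answer 200 on its health endpoint?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def deploy_with_rollback(version, previous_version):
    """Deploy, verify, and automatically roll back if verification fails."""
    subprocess.run(["./deploy.sh", version], check=True)   # hypothetical deploy script
    if not healthy():
        subprocess.run(["./deploy.sh", previous_version], check=True)
        raise RuntimeError(f"{version} failed verification, rolled back to {previous_version}")
```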
To make matters worse, there is also the propensity to pretend that having to go through these meetings is a sign of excellence and best practices, because if you're paid to mitigate a problem you obviously have absolutely no incentive to fix it. If a bug leaks into production, that's a problem introduced by the developer that wasn't caught by QA because reasons. If the organization has automated tests, it's hard not to catch it even at the PR level.
Meetings exist not because of risk, but because organizations employ a subset of roles that require risk to justify their existence and lack the skills to mitigate it. If a team organizes its efforts to add the bare minimum checks to verify a change runs and works once deployed, and can automatically roll back if it doesn't, it no longer needs those meetings.
This is very well said and succinctly summarizes my frustrations with QA. My experience has been that non-technical staff in technical organizations create meetings to justify their existence. I’m curious if you have advice on how to shift non-technical QA towards adopting automated testing and fewer meetings.
Hi, senior SRE here who was a QA, then QA lead, then lead automation / devops engineer.
QA engineers with little coding experience should be given simple automation tasks, along with similar example tests, documentation, and people to ask questions. I.e. set up a pytest framework that has a few automated test examples, and then have them write similar tests. The automated tests are just TAC (tests as code) versions of the manual test cases they should already write, so they should have some idea of what they need to do, and then Google / ChatGPT / automation engineers should be able to help them start to translate that to code.
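For example, a first TAC skeleton might look like the sketch below; the `orders` module, `create_order`, and the expected totals are invented purely for illustration:

```python
# test_orders.py -- a "test as code" version of an existing manual test case.
# The orders module and the expected values are invented for illustration.
import pytest
from orders import create_order

@pytest.mark.parametrize("items, expected_total", [
    ([("widget", 2, 9.99)], 19.98),   # two widgets at 9.99
    ([], 0.0),                        # empty cart
])
def test_order_total(items, expected_total):
    order = create_order(items)
    assert order.total == pytest.approx(expected_total)

def test_rejects_negative_quantity():
    with pytest.raises(ValueError):
        create_order([("widget", -1, 9.99)])
```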
People with growth mindsets and ambitions will grow from the support and being given the chance to do the things, while some small number will balk and not want anything to do with it. You can lead a horse to water and all that.
We are in the early stages of something like this in my org. QA has been writing tests in some form for a while, and it's mostly been at a self-led level. We have a senior engineer per application responsible for tooling and guidance, and the QA testers have been learning Java or JavaScript (depending on the application; teams we don't interface with are writing theirs in C#, iirc). With the new year, we are starting a phased initiative to ramp up all of QA to be Software Engineers in Testing - each phase will teach, guide, and impart the skills needed to be fully self-sufficient writing automation tests in tandem with engineers writing features.
It's an interesting and bold initiative imo, as I've often worked at places that let QA do whatever felt best, which is good from the standpoint of letting them work within their comfort zone, but it also means that testing will largely plateau. I haven't seen a real push for automation _not_ come out of the engineering department personally (because I'm the one pushing it every time), though I know this place has at least done some work with various automation systems in the past.
> The main factor is quite clearly how much time and resources an organization invests in automated testing.
For context, I think it's worth reflecting on Beck's background, eg as the author of XP Explained. I suspect he's taking even TDD for granted, and optimizing what's left. I think even the name of his new blog—"Tidy First"—is in reaction to a saturation, in his milieu, of the imperative to "Test First".
I tend to agree. Whenever I've removed artificial technical friction, or made a fundamental change to an approach, the processes that grew around them tend to evaporate, and not be replaced. I think many of these processes are a rational albeit non-technical response to making the best of a bad situation in the absence of a more fundamental solution.
But that doesn't mean they are entirely harmless. I've come across some scenarios where the people driving decisions continued to reach for human processes as the solution rather than a workaround, for both new projects and projects designated specifically to remove existing inefficiencies. They either lacked the technical imagination, or were too stuck in the existing framing of the problem, and this is where people who do have that imagination need to speak up and point out that human processes need to be minimised with technical changes where possible. Not all human processes can be obviated through technical changes, but we don't want to spread ourselves thin on unnecessary ones.
So this seems quantifiable as well - there must be a number of processes / components that a business is made up of, and those presumably are also weighted (payment processing has weight 100, HR holiday requests weight 5 etc).
I would conjecture that changing more than 2% of processes in any given period is “too much” - but one can certainly adjust that.
And I suspect that this modifies based on area (ie the payment processing code has a different team than the HR code) - so it would be sensible to rotate releases (or possibly teams) - this period this team is working on the hard stuff, but once that goes live the team is rotated back out to tackle easier stuff - either payment processing or HR
The same principle applies to attacking a trench, moving battalions forward and combined arms operations.
Now that is of course a “management” problem - but one can easily see how to automate a lot of it - and how other “sensory” inputs are useful (ie which teams have committed code to these sensitive modules recently).
One last point is it makes nonsense of “sprints” in Agile/Scrum - we know you cannot sprint a whole marathon, so how do you prepare the sprints for rotation?
I am really interested in organizations' capacity for soaking up changes.
I live in the B2B SaaS space, and as far as development goes we could release daily. But on the receiving side we get pushback. Of course there can be feature flags, but then it would cause a “not enabled feature backlog”.
In the end features are mostly consumed by people and people need training on the changes.
I think that really depends on the product. I worked on an on-prem data product for years and it was crucial to document all changes well and give customers time to prepare. OTOH I also worked on a home inspection app, and there users gave us pushback on training because the app was seen as intuitive.
> ...there users gave us pushback on training because the app was seen as intuitive
I would weep with joy to receive such feedback! Too often the services I work on have long histories with accidental UIs, built to address immediate needs over and over.
I agree entirely - I use the same references, I just think it's bordering on sacrilege what you did to Mr. Goldratt. He has been writing about flow and translating the Toyota Production System principles and applying physics to business processes way before someone decided to write The Phoenix Project.
I loved The Phoenix Project, don't get me wrong, but compared to The Goal it's like a cheaply produced adaptation of a "real" book so that people in the IT industry don't get scared when they read about production lines and run away saying "but I'm a PrOgrAmmEr, and creATIVE woRK can't be OPtiMizEd like a FactOry".
So The Phoenix Project if anything is the spiritual successor to The Goal, not the other way around.
That's indeed how I wrote it, but I could have worded it better. Very much agree that the insights in The Goal go far beyond the scope of The Phoenix Project.
> By deploying small changes often, you get to deliver value much sooner and fail smaller.
Which increases the number of changes per deployment, feeding the overhead cycle.
He is describing an emergent pattern here, not something that requires intentional culture change (like writing smaller changes). You’re not disagreeing but paraphrasing the article’s conclusion:
> or the harder way, by increasing the number of changes per deployment (better tests, better monitoring, better isolation between elements, better social relationships on the team)
You are not. The conclusion of the article is the same, you "need to expand the far end of the hose" by increasing deployment rate or making more, smaller changes. What was your interpretation?
My reading was that there were two paths the author highlights:
1) Increase deployment capacity (which I'm reading as frequency, and I fully agree with)
2) Increase change capacity per deployment by making it less likely that a set of changes will fail through tests, monitoring, structural, and team changes
#2 is very much geared to "ship more changes in one deployment" which is where my disagreement lies. I think you should still do all those things, but that increasing the size of the bundle is explicitly an anti-goal.
I think you're better off, as a rule of thumb, making fewer changes per deployment if you want to reduce risk.
My reading is that the author posits there is a fixed amount of change that can be safely made in a single deployment. The solution is to make it possible to deploy more frequently. This is hard, so organizations will often instead introduce overhead that slows down changes. Engineers might be tempted to blame the overhead and try to eliminate it, but that won't be successful and may even backfire. They need to tackle the underlying issue of deployment capacity instead.
this isn't even a software thing. It's any production process. The more work-in-progress items there are, and the longer they stay in progress, the greater the risk and the greater the amount of work. Shrink the batch, shorten the release window.
It infuriates me that software engineering has had to rediscover these facts when the Toyota production system was developed between 1948-1975 and knew all these things 50 years ago.
I am trying to expound a concept I call “software literacy” - where a business can be run via code just as much as today a company can be run by English words (policy documents, emails etc).
This leads to a few corollaries - things like “If GPUs do the work then coders are the new managers” or we need whole-org-test-rigs to be clear about the impacts of changes.
This seems directly related to this excellent article - to my mind, if all the decision makers are not looking at the code as the first-class object in a change process (as opposed to Jiras or project plans) then not all decision makers are (software) literate - and this comes up a lot in the threads here (“how do I discuss with non-technical management”) - the answer is you cannot - that management must be changed. This is an enormous generational roadblock that I thought was a problem thirty years ago but naively assumed would disappear as coders grew up. Of course the problem is that to “run” a company one does not need to code - so until not coding is something embarrassing, like not writing is for a newspaper editor, we won't get past it.
The main point is that we need companies that can be run with the new set of self-reinforcing concepts - SOPs, testing, not meetings but systems as communication.
I will try and rewrite this comment later - it needs work
You had me at "whole org test" harness. This is a very, very interesting idea. Especially in conjunction with the concept of corporation as "slow AI" that I don't hear referenced often enough.
I don't see why you call it "literacy," though. I think Maturana & Varela's term "autopoiesis" more closely orbits the kernel, and I'll bet Stafford Beer's Autopoietic Systems would contribute to a good intellectual foundation.
At a certain point, though, I wonder if a purely software "business" doesn't just look like... SaaS?
The organisation will actively prevent you from trying to improve deployments though, they will say things like “Jenkins shouldn’t be near production” or “we can’t possibly put things live without QA being involved” or “we need this time to make sure the quality of the software is high enough”. All with a straight face while having millions of production bugs and a product that barely meets any user requirements (if there are any).
In the end fighting the bureaucracy is actually impossible in most organisations, especially if you’re not part of the 200 layers of management that create these meetings. I would sack everyone but programmers and maybe two designers and let everyone fight it out without any agile coaches and product owners and scrum master and product experts.
Slow deployment is a problem but it’s not the problem.
You sound very defeatist about fighting bureaucracy. If you work at an org with too much management, you can slowly push to move it in the direction you hope for, or leave. If you keep ending up at places that seem impossible to change, perhaps you should ask more questions about this during the interview. I've worked at many small companies where there wasn't crazy bureaucracy because that's definitely what I preferred. I also currently work at a megacorp and yes there is difficulty, but being consistent and persuasive has led to many things slowly heading in the right direction. Things take time. You have to realize why people have made things some way and then find convincing arguments to make things better. Sometimes places do just suck, so don't stick around. But being hopeless doesn't seem helpful.
This is more or less Musk’s approach at Twitter - and ignoring the enormous baggage any discussion with Musk brings (if possible) - I would love to see a real academic case study on the effects of that to Twitter - there will be a lot to unpick but my bias is on your side here.
All of which sounds completely reasonable to me, in many situations.
Jenkins is the WordPress of software development. It's a gigantic state loop that runs plugins with no privilege separation. Giving your Jenkins instance administrative credentials in production might very well be equivalent to giving root keys to that lone guy in Romania who authored that plugin you never audited. I can understand perfectly why that might not be desirable to everyone.
.. which neatly leads on to
> we can’t possibly put things live without QA being involved
If you deploy stuff in production that never passes QA, why do you even have QA? To fix stuff later?
If they are not empowered they will never have the chance to do a good job or have any pride in their work.
> we can’t possibly put things live without QA being involved
> we need this time to make sure the quality of the software is high enough
I've only developed software professionally since 2012, but in that time not only have I never encountered such sentiments, but (and, perhaps, because) it has always been a top priority of leadership to emphatically insist on the very opposite: day one of any initiative is Jenkins to production—often directly via trunk-based development—and quality is every developer's responsibility.
At the IC level, there was no "fighting bureaucracy," although I don't doubt leadership debated these things vigorously, from time to time, especially as external partners and stakeholders were often intimately involved.
> I would sack everyone but programmers and maybe two designers and let everyone fight it out
That works for me! But it doesn't scale. We definitely have to keep at least one product "owner" or "expert" or "manager" to enqueue stakeholder priorities and, while this can be a "hat" that devs and designers trade off, it's also a skill at which some individuals uniquely excel.
All that being said, I don't want to come across as pearl-clutching, shocked Pikachu face about this. I understand that many organizations don't operate this way. The way I've helped firms make this change is via the introduction of a single, experimental team of volunteers dedicated to these practices—one protected (but not dictated to) by a mandate from on high.
A marginally related point, but I do not know if others have faced the following situation: I worked in a place with a CI pipeline of ~25 minutes, with the unit/integration tests (3000+) taking 18 minutes.
When something happened in production we ended up adding more tests; and of course, when things went south, at least 50 minutes were necessary to recover.
After a lot of consideration we decided to relax and simplify some tests and focus on recovery (i.e. get the full pipeline under 5 minutes), combined with canarying as the deployment strategy (instead of rolling updates).
At least for us it was a refreshing experience, even though it felt wrong in some ways.
I’ve often said that it is the speed of deployment that matters. If it takes you 50 minutes to deploy, it takes you 50 minutes to fix a problem. If it takes you 50 seconds to deploy, it takes you 50 seconds to fix a problem.
Of course all kinds of things are rolled up in that speed to deploy, but almost all of them are good.
The reason my boss tends to give is that it's made by AWS, so it cannot possibly be bad. Also, it's free. Which is never given as anything more than a tangentially related reason, but…
This is just anecdotal, but I have found that anytime a network interface is involved, it can slow down the deployment. I had a case where I was deleting lambdas in a VPC, connected to EFS, where the deployment itself was rather quick but it took ~20 minutes for CloudFormation to clean up and finish.
> A bit tangential but why is CloudFormation so slowww?
It's not that CloudFormation is slow. It's that the whole concept of infrastructure-as-code is slow by nature.
Each time you deploy a change to a state as a transaction, you need to assert preconditions and post-conditions at each step. If you have to roll out a set of changes that have any semblance of interdependence, you have no option other than to deploy each change as sequential steps. Each step requires many network calls to apply changes, go through auth, poll state, each one taking somewhere between 50-200ms. That quickly adds up.
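As a back-of-the-envelope sketch of why that adds up — the resource counts, call counts, and poll intervals below are illustrative guesses, not measurements:

```python
def estimated_deploy_seconds(resources=30, api_calls_per_resource=4,
                             call_latency_s=0.2, polls_per_resource=6,
                             poll_interval_s=5):
    """Rough wall-clock estimate for a strictly sequential IaC rollout."""
    per_resource = (api_calls_per_resource * call_latency_s
                    + polls_per_resource * poll_interval_s)
    return resources * per_resource

# ~31s per resource * 30 dependent resources is already ~15 minutes of waiting,
# before the resources themselves do any slow work (e.g. booting instances).
print(estimated_deploy_seconds() / 60)
```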
If you deploy the same app on a different cloud provider with Terraform or Ansible, you get the same result. If you deploy the same changes manually you turn a few minutes into a day-long ordeal.
The biggest problem with IaC is that it is so high-level and does so much under the hood that some people have no idea what changes they are actually applying or what they are doing. Then they complain it takes so long.
> It's that the whole concept of infrastructure-as-code is slow by nature.
> If you deploy the same app on a different cloud provider with Terraform or Ansible, you get the same result.
Nope, Terraform is way faster. Anyone who has switched between them on the same project can attest to this.
Also, Terraform does not get into “UPGRADE_ROLLBACK_FAILED”-style unrecoverable states nearly as easily. This happens to me all the time with Cloudformation/CDK. So my second question after “Why is Cloudformation so slow?” would be “Why is Cloudformation more error-prone when it’s also slower?”
50-200ms per poll is one thing, but realistically we’re talking 30+ seconds for the smallest of changes even on new resources. Why does it take so long to spin up an ec2 instance (when fargate can do it in seconds assuming you’re not rate limited by the API) or lambda can do it also in milliseconds. Those machines are already running, why does it take 3 minutes to deploy Ubuntu or Debian from a blessed AMI?
Fargate runs containers, Lambda runs functions. They use Firecracker microVMs while EC2 uses full VMs. EC2 instances do a lot more setup, use a bigger image, and run user setup. My guess is Firecracker is designed for smaller VMs and can't support the EC2 features that people need.
FWIW, my approach to IaC has been to focus on the “I” with CloudFormation — the networking, storage, IAM, other AWS primitives and etc. This stuff doesn’t change as often, and safe/reliable deployments are more valuable than quick ones.
The behavioral parts (aka. application, stuff running in a VM of some kind or something declarative like EventBridge rules or StepFunctions) I keep separate and prioritize quick turns. CodeDeploy can, for example, update code on EC2s in single-digit seconds.
I’m building systems that are a little more integrated in AWS than most folks, perhaps, which makes this approach a good fit. I do dozens of deployments a day (not an exaggeration — 21 so far today on a light day), including a couple infrastructure updates.
I think the secret here is not buying into meme-like simplifications and instead deliberately designing an approach that works for your goals.
I have personal experience with this in my professional career. Before Christmas break I had a big change, and there was fear. My org responded by increasing testing (regression testing, which increased overhead). This increased the risk that changes on dev would break changes on my branch (not in a code-merging way, but in a complex-adaptive-system way).
I responded to this risk by calling a meeting. I presented our project schedule and set expectations with my colleagues, i.e. if they drop code-style comments on the PRs, those will be deferred to a future PR (and then ignored and never done).
What we needed was fine-grained testing with better isolation between components. The problem is that our management operates at a high level; they don't see meetings as a means to an end, they see meetings as a worthy goal in and of themselves. More meetings means more collaboration, means good. I'd love to see advice on how to lead technical changes with non-technical management.
I think this was the meme before moduliths[1][2] where people conflated the operational and code change aspects of microservices. But it's just additional incidental complexity that you should resist.
IOW you can do as many deploys without microservices if you organize your monolithic app as independent modules, while keeping out the main disadvantages of the microservice (infra/cicd/etc complexity, and turning your app's function calls into an unreliable distributed system communication problem).
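A hedged sketch of what that can look like in practice — the module names are invented, and the point is only that the boundaries are in-process interfaces rather than network calls:

```python
# One deployable artifact; billing and notifications only talk to each other
# through small in-process interfaces, so they stay independently testable.
# All names here are illustrative.

class BillingModule:
    def charge(self, customer_id: str, amount_cents: int) -> bool:
        # real payment logic stays behind this boundary
        return amount_cents > 0

class NotificationModule:
    def send_receipt(self, customer_id: str, amount_cents: int) -> None:
        print(f"receipt for {customer_id}: {amount_cents} cents")

class App:
    """The 'monolith' just wires the modules together with plain function calls."""
    def __init__(self) -> None:
        self.billing = BillingModule()
        self.notifications = NotificationModule()

    def checkout(self, customer_id: str, amount_cents: int) -> bool:
        if self.billing.charge(customer_id, amount_cents):
            self.notifications.send_receipt(customer_id, amount_cents)
            return True
        return False

if __name__ == "__main__":
    App().checkout("c-123", 1999)
```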
An old monolithic PHP application I worked on for over a decade wasn't set up with independent modules and the average deploy probably took a couple seconds, because it was an svn up which only updated changed files.
I frequently think about this when I watch my current workplace's node application go through a huge build process, spitting out a 70mb artifact which is then copied multiple times around the entire universe as a whole chonk before finally ending up where it needs to be several tens of minutes later.
Even watching how PHP applications get deployed these days is painful: it goes through this huge build and takes about the same amount of time to replace all the Docker containers.
I avoid Docker for precisely that reason! I have one system running on Docker across our whole org - Stirling-PDF providing some basic PDF services for internal use. Each time I update it I have to watch it download 700mb of Docker stuff, instead of just doing an in-place upgrade of a few files.
I get that there are advantages in shipping stuff like this. But having seen PHP stuff work for decades with in-place deploys and no build process I am just continually disappointed with how much worse the experience has become.
One approach I've seen rather successfully is to have a container that just contains the files to deploy, and another one for the runtime. You only need to update the runtime container ~ once a week or so (to get OS security updates), and the files container is literally just a COPY command to a volume.
I've only seen that in one place, ever. Most people just do the insane 40 minute docker build -- though I've also seen some that take over 4 hours...
Not sure what you mean about either of those two things? Never had any issues with instance state in our primary production environments, which were several instances of load balanced web servers. No idea what you're referring to as "slow"?
> I think this was the meme before moduliths[1][2] where people conflated the operational and code change aspects of microservices.
People conflate the operational and code change aspects of microservices just like people conflate that the sky is blue and water is wet. It's a statement of fact that doesn't go away with buzzwords.
> IOW you can do as many deploys without microservices if you organize your monolithic app as independent modules, while keeping out the main disadvantages of the microservice (infra/cicd/etc complexity, and turning your app's function calls into an unreliable distributed system communication problem).
This personal opinion is deep within "not even false" territory. You can also deploy as many times as you'd like with any monolith, regardless of what buzzwords you tack on that.
What you're completely missing from your remark is the loosely coupled nature of running things on a separate service, how trivial it is to do blue-green deployments, and how you can do gradual rollouts that you absolutely cannot do with a patch to a monolith, no matter what buzzwords you tack on it. That is the whole point of mentioning microservices: you can do all that without a single meeting.
While there may be some things that come for free with microservices (and not moduliths), the ones you mention don't sound convincing. Blue-green deployments and gradual rollouts can be done with a modulith, and I can't think of any reason that would be harder than with microservices (part of your running instances can run a different version of module X). The coupling can be just as loose as with microservices.
It’s a monkey’s paw solution, now you have 15 kinda slow pipelines instead of 3 slow deployment pipelines. And you get to have the fun new problem of deployment planning and synchronizing feature deployments.
> It’s a monkey’s paw solution, now you have 15 kinda slow pipelines instead of 3 slow deployment pipelines.
Not a problem. In fact, they are a solution to a problem.
> And you get to have the fun new problem of deployment planning and synchronizing feature deployments.
Not a problem either. You don't need to synchronize anything if you're consuming changes that are already deployed and running. You also do not need to synchronize feature deployment if you know the very basics of your job. Worst case scenario, you have to move features behind a feature flag, which requires zero synchronization.
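A minimal sketch of that feature-flag pattern; the flag name and the environment-variable lookup are stand-ins for whatever flag service you actually use:

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Toy flag lookup -- a real system would query a flag service or config store."""
    return os.environ.get(f"FLAG_{name.upper()}", str(default)).lower() == "true"

def render_checkout(cart):
    # Ship the code dark; flip the flag only once everything it depends on is live.
    if flag_enabled("new_checkout"):
        return f"new checkout with {len(cart)} items"
    return f"old checkout with {len(cart)} items"

print(render_checkout(["widget"]))
```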
This sort of discussion feels like people complaining about perceived problems they never bothered to think about, let alone tackle.
> Not a silver bullet; you increase api versioning overhead between services for example.
That's actually a good thing. It ensures clients remain backwards compatible in case of a rollback. The only people who don't notice the need for API versioning are those who are oblivious to the outages they create.
The more/smaller microservices you have, the more frequently your APIs must change. It’s more fruitful to recognize that this is a dimension of freedom rather than a binary decision.
I mean I recommend that you don't go micro right away. But a few well-placed service boundaries that align with your eng org chart pay dividends and help build discipline and rigor. Even if the services are all running on the same box to start and just IPCing over localhost or w/e. I prefer my teams build habits while it's easy and fun rather than once the monolith gets unwieldy and drawing boundaries becomes painful.
You avoid a lot of things in a monolith. But normal services mapped to your org chart tend to be pretty nice. I hate how people spam "micro" service talk as if that's the end game for putting a physical boundary between software.
As long as every team managing the different APIs/services doesn't have to be consulted for others to get access.
You then get both the problems of distributed data and even more levels of complexity (more meetings than with a monolith)
You can do this with a monolith architecture as others point out. It always comes down to governance. With monoliths you risk slowing yourself down in a huge mess of SOLID, DRY and other “clean code” nonsense which means nobody can change anything without it breaking something. Not because any of the OOP principles are wrong on face value, but because they are so extremely vague that nobody ever gets them right. It's always hilarious to watch Uncle Bob dismiss any criticism with a “they misunderstood the principles” because he's always completely right. Maybe the principles are just bad when so many people get them wrong?

Anyway, microservices don't protect you from poor governance, it just shows up as different problems. I would argue that it's both extremely easy and common to build a bunch of microservices where nobody knows what effect a change has on others.

It comes down to team management, and this is where our industry sucks the most in my experience. It'll be better once the newer generations of “Team Topologies” enter, but it'll be a struggle for decades to come if it'll ever really end. Often it's completely out of the hands of whatever digitalisation department you have, because the organisation views any “IT” as a cost center and never requests things in a way that can be incorporated into any sort of SWE best-practice process.
One of the reasons I like Go as a general-purpose language is that it often leads to code bases which are easy to change, thanks to its simplicity by design. I've seen an online bank and a couple of landlord systems (sorry, I can't find the English word for asset and tenant management in a single platform) explode in growth, largely because switching to Go made it possible for them to actually deliver what the business needs. Meanwhile their competition remains stuck with unruly Java or C# code bases where they may be capable of rolling out buggy additions every half year if their organisation is lucky.

Which has nothing to do with Go, Java or C# by the way; it has to do with old-fashioned OOP architecture and design being way too easy to fuck up. In one shop I worked at they had over a thousand C# interfaces which were never consumed by more than one class… Every single one of their tens of thousands of interfaces was in the same folder and namespace… good luck finding the one you need. You could do that with Go, or any language, but chances are you won't do it if you're not rolling with one of those older OOP clean-code languages. Not doing it, especially with C#, is harder because abstraction by default is such an ingrained part of the culture around it.
Personally I have a secret affection for Python shops because they are always fast to deliver and terrible in the code. Love it!
While this is mostly correct it’s also just as irrelevant.
TL;DR: software performance, and thus human performance, is all that matters.
Risk management/acceptance can be measured with numbers. In software this is actually far more straightforward than in many other careers, because software engineers can only accept risk within the restrictions of their known operating constraints and everything else is deferred.
If you want to go faster you need to maximize the frequency of human iteration above absolutely everything else. If a person cannot iterate, such as waiting on permissions, they are blocked. If they are waiting on a build or screen refresh they are slowed. This can also be measured with numbers.
If person A can iterate 100x faster than person B, correctness becomes irrelevant. Person B must maximize upon correctness because they are slow. To be faster and more correct, person A has extreme flexibility to learn, fail, and improve beyond what person B can deliver.
Part of iterating faster AND reducing risk is fast test automation. If person A can execute 90+% test coverage in the time of 4 human iterations, then that test automation is still 25x faster than one person B iteration, with a 90+% lower risk of regression.
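(To make the arithmetic explicit, using the numbers above: if A iterates 100x faster than B, then four of A's iterations take 4/100 of one of B's iterations, so the automated run still finishes 100/4 = 25x sooner than B can complete a single manual pass.)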
I was on a team that went from every 3 weeks to multiple times per day. The number of incidents in production dropped drastically.
But much more important than that drop was that when things went wrong, it was MUCH MUCH faster to find the problem. It was also much safer and easier to roll back, since there were so few changes that would be rolled back. No one wants to back out 3 weeks of work. That's chaos.
That is the opposite of my experience. Slow deploys mean bigger deploys, mean more complexity going live, mean more nervousness and more testing, mean more hesitation, mean more chance of something unforeseen, mean errors that no one understands, mean war rooms.
In my experience, there's very little correlation. I've been on projects with 1 deployment every six weeks, and there were just as many production incidents as projects with daily deployments.
I had a boss who actually acknowledged that he was deliberately holding up my development process - this was a man who refused to allow me a four day working week.
Sounds like a process problem. 2024 development cycles should be able to handle multiple lanes of development and deployment. It's also why things moved to microservices: so you can deploy with minimal impact as long as you don't tightly couple your dependencies.
You don't need microservices to do this. It's actually easier deploying a monolith with internal dependencies than deploying microservices that depend on each other.
This is very accurate - microservices can be great as a forcing function to revisit your architectural boundaries, but if all you do is add a network hop and multiple components to update when you tweak a data model, all you'll get is headcount sprawl and deadlock to the moon.
I'm a huge fan of migrating to microservices as a secondary outcome of revisiting your component boundaries, but just moving to separate repos & artifacts so we can all deploy independently is a recipe for pain.
A network hop isn't needed if you're deploying your microservices appropriately. In Kubernetes you can group tightly coupled containers into the same pod, and an application that depends on another can call the lightweight container sitting next to it in that pod. Containers in a pod know their siblings are there and can reach them over localhost, so the call doesn't traverse the network hardware.
I know microservices and monoliths are a heated topic. However, breaking up complicated code to preserve user experience is sometimes essential. And you can still have machines that contain many services which interact with each other for performance if needed: deploy them to Kubernetes in the same pod and have them call the sibling service inside that pod. This can increase performance and throughput.
[0] https://dora.dev/
[1] https://www.amazon.co.uk/Accelerate-Software-Performing-Tech...
[2] https://www.amazon.co.uk/Phoenix-Project-Helping-Business-An...
[3] https://www.amazon.co.uk/Goal-Process-Ongoing-Improvement/dp...