Limiting breakage with a software deployment checklist

pnevares · on May 7, 2018

The blog post refers to "RCA" three times and "RSA" once, and doesn't seem to define either acronym after reading it once.

Also this?

> And once you fail a build, then every team member in your team has to do deployments and go through the deployment checklist.

Sounds like there's a piece of context missing from the section before it. You have to do the checklist to deploy, and if you fail once, then every member of your team has to do the checklist as well?

zacherates · on May 7, 2018

"RCA" likely means "Root Cause Analysis". I'm not sure how to interpret RSA other than as a typo for RCA.

bigiain · on May 7, 2018

In some contexts, RSA in conjunction with RCA would stand for "Random Safety Audit" - but that doesn't make as much sense here as the assumption that it's just a typo for RCA and it's intended "Root Cause Analysis" meaning.

iandanforth · on May 7, 2018

This is a good place to start but you also need to commit to automating these steps. Copying and pasting URLs is a waste of time, rollbacks should be automatic, contributor lists should be auto-generated etc. Since it takes time to automate each new thing a checklist is still a good idea, but you have to recognize the danger of engineers subverting/rejecting the process if you let the list grow much at all.

sitharus · on May 7, 2018

Agreed. At my job we've integrated all the tools. Developers have to put the ticket number in the commit manually, but then the tooling will attach the commit to the ticket automatically. There's also a bot that checks what you're committing and adds checklists for riskier things.

Then the ticket is picked up on merge to master which creates a change list. Everything gets deployed to QA where it's tested. QAs go over the changes and approve it, then developers push to production. Afterwards a random developer from a different team is picked to do a post-deploy review which spreads knowledge and sometimes picks up other changes.

If something can be automated it should be :)

Annatar · on May 7, 2018

"We don’t restrict a deployment trigger to specific people. As soon as you are done, go ahead."

So they have no change management process in place and are basically hacking on it 'till it works. Very professional.

Does not look like they ever heard of the capability maturity model, either.

ownagefool · on May 7, 2018

Deploying when it's done doesn't necessarily mean there's no change control.

Peer review (as in git workflows with pull/merge requests and audit), automated testing, pipelines, CI & CD are all just a modern form of change control, where we've remove the checklists from humans.

mnd999 · on May 7, 2018

One persons professional is another’s bureaucratic. There’s no one approach right for every team in every situation. It sounds like they’ve find something that works for them for now and that’s great.

Annatar · on May 7, 2018

I disagree. United States department of defense, advanced research project agency and the software engineering institute at Carnegie-Mellon university disagree as well.

I'm very much inclined to defend their methodology and position on this since that is their core area of expertise, and I've experienced it working on a very large scale (tens of thousands of servers) rather than a bunch of hacking efforts at some company on the InterNet.

https://www.amazon.com/Capability-Maturity-Model-Guidelines-...

https://www.amazon.com/Managing-Software-Process-Watts-Humph...

Cthulhu_ · on May 7, 2018

Move fast and break things. If your go-live is fast, any mistakes are fixed fast. Besides, if it's a webapp, chances are anything they break is relatively benign, not catastrophic and hurting their bottom line - that's assuming they have a bottom line.

Plus, if they provide a good service, people will forgive them. Twitter was terrible in the first years it was growing in that regard, yet people kept coming back.

protomyth · on May 7, 2018

On the subject of database indexes, know when you should deploy the indexes. This comes up when adding tables that get populated during deploy. Sometimes, depending on your database, its a really bad move to add the index to the empty table. Its an interesting problem because it might be time critical if you are doing enterprise deploys where you are taking down the whole system for the duration. Probably a less common circumstance these days. Also, getting rid of temporary code needed only for the conversion is a super good thing to remember.

brockers · on May 7, 2018

I know Dev shops don't like processes and forms, but that "checklist" is exactly that. It simply shows that processes and procedures are tools that are useful if that use can be limited.

Cthulhu_ · on May 7, 2018

It's a means of formalizing the deployment before automating it. Which makes sense, when automating you need to understand what you're trying to automate first.

runlevel1 · on May 7, 2018

We do something similar for changes that are risky, complicated, or manual:

https://sendgrid.com/blog/change-management-keep-it-simple-s...

Doing it formally for every deployment seems like it would kill productivity.

spc476 · on May 7, 2018

Where I work, our programs are first installed in QA, then staging and finally production. For each step, there's a web form we fill out, listing the release number, tracking system ID, what program, any config changes, any database changes, any special instructions and what testing have been done. Once submitted, the OPs team pulls the proper program from the build server, updates the config and pushes the stuff automatically (I know they use something like chef or ansible, but I'm not in that department (which is on the other side of the country) so I'm not sure of the exact details).

For the final push into production, the developers have to be online (2:00 am Eastern), along with QA and OPs. QA or the developers can abort the deployment [1] for any reason, and rolling back is trivial. So far in my seven years at The Company, I've had to abort a production deployment once (yes, I noticed an issue and aborted the deployment---it was totally my call).

[1] Our customers are the various Monopolistic Phone Companies. We have scary SLAs. We get approval for deployment from them. Downtime costs us Real Money. I don't get to deploy stuff all that often (stuff that doesn't talk directly with the customers is easier to deploy---unfortunately, most of what I work on talks directly with the customers).

trollopTheJope · on May 7, 2018

what an unfortunate metaphor

dang · on May 7, 2018

Since you're not the only commenter who complained, we've taken that bit out of the title above.