It's called Disaster Recovery and it has a best mate called Business Continuity. If you don't actually have a process for something then it will fail unconditionally, without you even knowing it is going to happen, until it does.
OK, it's easy for me to snipe but nowhere did I see terms like those. Backups are mentioned almost casually and there is this: "but no process was implemented for ElasticSearch databases".
So, not only were critical parts of the system not actually backed up but there seems to have been no attempt to even discuss how to put Humpty Dumpty (1) back together again if the silly sod falls off the wall.
Then we get the meat of the "Processes are to blame, not people" section. It discusses avoiding fucking up by making some small changes to working practices and so on but completely misses the real point. The blogger lacks a process for recovery. It's all very well worrying about avoiding a fuck up involving a DB deletion but how do you recover? That should be only one entry in your DR plan. BC also needs some work ...
I own a small company and I spend quite a lot of time worrying about an awful lot of things. This "contrite" article that implies that the writer has actually learned a lesson worth divulging is extremely concerning to me. I suppose it's a good start but I feel it falls rather short of identifying the real problem, learning from it and implementing a proper ... process.
(1) Humpty Dumpty: UK nursery rhyme nominally about an egg. Bit more to it than that but good enough for this discussion.
> Humpty Dumpty: UK nursery rhyme nominally about an egg.
there's an interesting piece on the web about how HD is not necessarily an egg, and that that idea comes from an illustration added long after the nursery rhyme was well established.
In case of catastrophic failure, would you rather have untested backups or no backups to work with? If you have even a slight preference for the former (I actually have a rather large preference for it), the untested stuff apparently has some value to you.
Understandably, but good luck explaining why you have a week-long outage because you skipped your disaster recovery process on the grounds that the probabilities made you confident enough not to test your backups.
Until you have a provably working restore, the backup is nonexistent for all intents and purposes. The sort of calculation you’d need to perform to justify not doing so borders on alchemy. Unless your infra is extremely exotic this should be rather straightforward and inexpensive, and you’re one failed restore away from this process changing immediately anyway.
Making the business case for solid disaster recovery and business continuity (along with resources for regularly testing it) can be quite a challenge. In an immature or unhealthy culture this is something that can always be pushed to next quarter, and if disaster strikes, more effort will be put into political spin than into engineers doing preventative work.
At some point a company needs to adopt practices like Netflix's Chaos Monkey or Google's DiRT (Disaster Recovery Testing) to purposefully exercise business continuity plans, as well as to recognize the effort that is required to keep things running. Otherwise other incentives will drown out any intrinsic motivation individuals may have to improve reliability.
Processes sometimes exist in dysfunctional orgs because a disaster strikes (possibly one that a process wouldn't have prevented) and people who've been waiting for a chance at some more power jump in to create a process. Then in the future, when it happens again, the process gets the blame, but no one ever questions why they have the process if it doesn't work.
And the bureaucracy expands to meet the needs of the expanding bureaucracy (-:
That's very well put. I would love to hear about your BC and DR processes and infrastructure. Have you written about it somewhere? We are a small company as well and we certainly do backups, but it'll be great to be able to implement BC and DR more formally.
You may start simply by documenting the steps, including roles, and documenting at minimum annual fire-drill results and audits of procedures, dating/documenting any updates along the way once set. Ensure recovery is possible from the worst circumstances, such as an attacker deleting everything your prod account has access to (sometimes prod accounts are able to overwrite backups, etc.).
Think safety first - this has to come from the top, and it is a bit boring until you suddenly find yourself single-handedly rescuing quite a few people's livelihoods in the face of a disaster of some sort.
There are no real shortcuts, but you can build yourself up to a decent position incrementally and erratically, or you can do a formal analysis, create a plan and follow your plan - yeah right!
Start off with the basics: Do you have backups? Actually, do you have enough backups? You should have a complete copy of your data available on site (not a cluster replica) and another copy off site that might be a bit older, depending on your taste for data loss. Really work on evaluating how much data you can afford to lose. You should also have an offsite copy of your data that is immutable - i.e. it can't be deleted or encrypted.
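For the immutable offsite copy, object storage with a write-once retention lock is one common option. A minimal sketch, assuming AWS S3 via boto3; the bucket name, region and retention period are placeholders, not a recommendation of any particular provider:

    import boto3

    s3 = boto3.client("s3")

    # Object Lock has to be switched on when the bucket is created.
    s3.create_bucket(
        Bucket="example-offsite-backups",  # placeholder name
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
        ObjectLockEnabledForBucket=True,
    )

    # COMPLIANCE mode: nobody, not even the account that wrote the object,
    # can delete or overwrite it until the retention period expires.
    s3.put_object_lock_configuration(
        Bucket="example-offsite-backups",
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 35}},
        },
    )

A compromised prod account can still read those objects, but it can't delete or encrypt the locked versions in place, which is the property you want from that last copy.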
If you can get yourself into the safety first mood but don't know how to do it online then get a removable, USB connected disc and use that for your offline backups that you know can always be recovered from.
Now check your backups. Do some recoveries of files.
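To make "do some recoveries" a habit rather than a one-off, it can be scripted as a small fire drill. A rough sketch, assuming a PostgreSQL stack with nightly pg_dump custom-format dumps; the paths, database names and the orders sanity check are all hypothetical:

    #!/usr/bin/env python3
    """Backup fire drill: restore the newest dump into a scratch database
    and run a basic sanity check. Run it on a schedule and alert on failure."""
    import subprocess
    import sys
    from pathlib import Path

    BACKUP_DIR = Path("/backups/postgres")   # assumption: nightly dumps land here
    SCRATCH_DB = "restore_drill"             # throwaway database, never production

    def latest_dump() -> Path:
        dumps = sorted(BACKUP_DIR.glob("*.dump"))
        if not dumps:
            sys.exit("No dumps found - the drill has already failed.")
        return dumps[-1]

    def run(cmd: list[str]) -> None:
        subprocess.run(cmd, check=True)

    def main() -> None:
        dump = latest_dump()
        # Recreate the scratch database and restore into it.
        run(["dropdb", "--if-exists", SCRATCH_DB])
        run(["createdb", SCRATCH_DB])
        run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, str(dump)])
        # Sanity check: the restored data should not be empty.
        out = subprocess.run(
            ["psql", "-tA", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM orders;"],
            check=True, capture_output=True, text=True,
        )
        if int(out.stdout.strip()) == 0:
            sys.exit("Restore completed but the data looks empty.")
        print(f"Restore drill OK: {dump.name}")

    if __name__ == "__main__":
        main()

The point is less the specific commands than that a restore actually happens on a schedule and someone sees the result.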
I don't know how important your company is to you but I suspect it is very important. Take some time out every now and then and do some due diligence "doo dill".
There's a bit of a miscommunication going on here. A more precise (but less punchy) way of saying it would be “When things break dramatically in production, resist the too-easy urge to heap acrimonious blame on whoever was most closely involved. That’s usually unhelpful, and the defensive reaction it produces tends to be counterproductive. Instead, calmly look at the whole causal chain leading up to the incident and figure out which of its links would have been easiest to break. Then, if it seems worth the effort, design tooling or procedures that would have done so if they’d been adopted earlier.”
A lot of people criticize the GDPR, but part of it is about getting some risk assessment of your data handling done and having processes to recover from disaster. Articles 25, 32 and 35, for reference.
But all those processes and certifications are the kind of things people bemoan in big corporations. And start-ups don't have the resources or the time to do it, even though the sooner it's implemented the cheaper it is.
That's IMO where incubators could add some value: have a team of people whose job is to set up and follow those processes for their smaller start-ups.
> Backups are mentioned almost casually and there is this: "but no process was implemented for ElasticSearch databases".
From TFA, next sentence after the one you quoted:
> Also, that database was a read model and by definition, it wasn’t the source of truth for anything. In theory, read models shouldn’t have backups, they should be rebuilt fast enough that won’t cause any or minimal impact in case of a major incident. Since read models usually have information inferred from somewhere else, It is debatable if they compensate for the associated monetary cost of maintaining regular backups
The article discusses how they executed an actual recovery process from the other data sources, but it took longer than it should have (6 days), so...
We ended up with a mix of the two. We refactored the process to a point it went from 6 days to a few hours. However, due to the criticality of the component, a few hours of unavailability still had significant impacts, especially during specific timeframes (e.g. sales seasons). We had a few options to further reduce that time but it started to feel like overengineering and incurred a substantial additional infrastructure cost. So we decided to also include backups when there was a higher risk, like during sales seasons or other business-critical periods.
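For anyone curious what such a rebuild roughly looks like: the sketch below is not their code, just a hedged illustration of repopulating an Elasticsearch read model from a relational source of truth. The index name, table, query and document shape are made up.

    """Rebuild an Elasticsearch read model from the source of truth.
    Purely illustrative: index name, table and fields are hypothetical."""
    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")
    pg = psycopg2.connect("dbname=shop")  # assumption: the source of truth is Postgres

    def source_docs():
        # Named (server-side) cursor streams rows instead of loading them all.
        with pg.cursor(name="rebuild") as cur:
            cur.execute("SELECT id, customer_id, total, status FROM orders")
            for id_, customer_id, total, status in cur:
                yield {
                    "_index": "orders_read_model",
                    "_id": id_,
                    "_source": {
                        "customer_id": customer_id,
                        "total": float(total),
                        "status": status,
                    },
                }

    # Bulk indexing in large chunks (or helpers.parallel_bulk with several
    # workers) is usually where a "days to hours" kind of speed-up comes from.
    helpers.bulk(es, source_docs(), chunk_size=5000)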