It's called Disaster Recovery and it has a best mate called Business Continuity. If you don't actually have a process for something then it will fail unconditionally, without you even knowing it is going to happen, until it does.
OK, it's easy for me to snipe but nowhere did I see terms like those. Backups are mentioned almost casually and there is this: "but no process was implemented for ElasticSearch databases".
So, not only were critical parts of the system not actually backed up but there seems to have been no attempt to even discuss how to put Humpty Dumpty (1) back together again if the silly sod falls off the wall.
Then we get the meat of the "Processes are to blame, not people" section. It discusses avoiding fucking up by making some small changes to working practices and so on but completely misses the real point. The blogger lacks a process for recovery. It's all very well worrying about avoiding a fuck up involving a DB deletion but how do you recover? That should be only one entry in your DR plan. BC also needs some work ...
I own a small company and I spend quite a lot of time worrying about an awful lot of things. This "contrite" article that implies that the writer has actually learned a lesson worth divulging is extremely concerning to me. I suppose it's a good start but I feel it falls rather short of identifying the real problem, learning from it and implementing a proper ... process.
(1) Humpty Dumpty: UK nursery rhyme nominally about an egg. Bit more to it than that but good enough for this discussion.
> Humpty Dumpty: UK nursery rhyme nominally about an egg.
there's an interesting piece on the web about how HD is not necessarily an egg, and that that idea comes from an illustration added long after the nursery rhyme was well established.
In case of catastrophic failure, would you rather have untested backups or no backups to work with? If you have even a slight preference for the former (I actually have a rather large preference for it), the untested stuff apparently has some value to you.
Understandably, but good luck explaining why you have a week-long outage because you skipped your disaster recovery process on the grounds that the probabilities made you confident enough not to test your backups.
Until you have a provably working restore, the backup is nonexistent for all intents and purposes. The sort of calculation you’d need to perform to justify not doing so borders on alchemy. Unless your infra is extremely exotic this should be rather straightforward and inexpensive, and you’re one failed restore away from this process changing immediately anyway.
Making the business case for solid disaster recovery and business continuity (along with resources for regularly testing it) can be quite a challenge. In an immature or unhealthy culture this is something that can always be pushed to next quarter, and if disaster strikes, more effort will be put into political spin than into engineers doing preventative work.
At some point a company needs to adopt practices like Netflix's Chaos Monkey or Google's DiRT (Disaster Recovery Testing) to purposefully exercise business continuity plans, as well as to recognize the effort that is required to keep things running. Otherwise other incentives will drown out any intrinsic motivation individuals may have to improve reliability.
Processes sometimes exist in dysfunctional orgs because a disaster strikes (possibly one that a process wouldn't have prevented) and people who've been waiting for a chance at some more power jump in to create a process. Then in the future, when it happens again, the process gets the blame, but no one ever questions why they have the process if it doesn't work.
And the bureaucracy expands to meet the needs of the expanding bureaucracy (-:
That's very well put. I would love to hear about your BC and DR processes and infrastructure. Have you written about it somewhere? We are a small company as well and we certainly do backups, but it'll be great to be able to implement BC and DR more formally.
You may start simply by documenting the steps, including roles, and documenting at minimum annual fire-drill results and audits of procedures, dating/documenting any updates along the way once set. Ensure recovery is possible from the worst circumstances, such as an attacker deleting everything your prod account has access to (sometimes prod accounts are able to overwrite backups, etc.).
Think safety first - this has to come from the top, and it is a bit boring until you suddenly find yourself single-handedly rescuing quite a few people's livelihoods in the face of a disaster of some sort.
There are no real shortcuts, but you can build yourself up to a decent position incrementally and erratically, or you can do a formal analysis, create a plan and follow your plan - yeah right!
Start off with the basics: Do you have backups? Actually, do you have enough backups? You should have a complete copy of your data available on site (not a cluster replica) and another copy off site that might be a bit older, depending on your taste for data loss. Really work on evaluating how much data you can afford to lose. You should also have an offsite copy of your data that is immutable - i.e. it can't be deleted or encrypted.
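For the immutable offsite copy, object storage with a write-once retention lock is one common option. A minimal sketch, assuming AWS S3 via boto3; the bucket name, region and retention period are placeholders, not a recommendation of any particular provider:

    import boto3

    s3 = boto3.client("s3")

    # Object Lock has to be switched on when the bucket is created.
    s3.create_bucket(
        Bucket="example-offsite-backups",  # placeholder name
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
        ObjectLockEnabledForBucket=True,
    )

    # COMPLIANCE mode: nobody, not even the account that wrote the object,
    # can delete or overwrite it until the retention period expires.
    s3.put_object_lock_configuration(
        Bucket="example-offsite-backups",
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 35}},
        },
    )

A compromised prod account can still read those objects, but it can't delete or encrypt the locked versions in place, which is the property you want from that last copy.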
If you can get yourself into the safety first mood but don't know how to do it online then get a removable, USB connected disc and use that for your offline backups that you know can always be recovered from.
Now check your backups. Do some recoveries of files.
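To make "do some recoveries" a habit rather than a one-off, it can be scripted as a small fire drill. A rough sketch, assuming a PostgreSQL stack with nightly pg_dump custom-format dumps; the paths, database names and the orders sanity check are all hypothetical:

    #!/usr/bin/env python3
    """Backup fire drill: restore the newest dump into a scratch database
    and run a basic sanity check. Run it on a schedule and alert on failure."""
    import subprocess
    import sys
    from pathlib import Path

    BACKUP_DIR = Path("/backups/postgres")   # assumption: nightly dumps land here
    SCRATCH_DB = "restore_drill"             # throwaway database, never production

    def latest_dump() -> Path:
        dumps = sorted(BACKUP_DIR.glob("*.dump"))
        if not dumps:
            sys.exit("No dumps found - the drill has already failed.")
        return dumps[-1]

    def run(cmd: list[str]) -> None:
        subprocess.run(cmd, check=True)

    def main() -> None:
        dump = latest_dump()
        # Recreate the scratch database and restore into it.
        run(["dropdb", "--if-exists", SCRATCH_DB])
        run(["createdb", SCRATCH_DB])
        run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, str(dump)])
        # Sanity check: the restored data should not be empty.
        out = subprocess.run(
            ["psql", "-tA", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM orders;"],
            check=True, capture_output=True, text=True,
        )
        if int(out.stdout.strip()) == 0:
            sys.exit("Restore completed but the data looks empty.")
        print(f"Restore drill OK: {dump.name}")

    if __name__ == "__main__":
        main()

The point is less the specific commands than that a restore actually happens on a schedule and someone sees the result.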
I don't know how important your company is to you but I suspect it is very important. Take some time out every now and then and do some due diligence "doo dill".
There's a bit of a miscommunication going on here. A more precise (but less punchy) way of saying it would be “When things break dramatically in production, resist the too-easy urge to heap acrimonious blame on whoever was most closely involved. That’s usually unhelpful, and the defensive reaction it produces tends to be counterproductive. Instead, calmly look at the whole causal chain leading up to the incident and figure out which of its links would have been easiest to break. Then, if it seems worth the effort, design tooling or procedures that would have done so if they’d been adopted earlier.”
A lot of people criticize the GDPR, but part of it is about getting some risk assessment of your data handling done and having processes to recover from disaster. Articles 25, 32 and 35, for reference.
But all those processes and certifications are the kind of things people bemoan in big corporations. And start-ups don't have the resources or the time to do it, even though the sooner it's implemented the cheaper it is.
That's IMO where incubators could add some value: have a team of people whose job is to set up and follow those processes for their smaller start-ups.
> Backups are mentioned almost casually and there is this: "but no process was implemented for ElasticSearch databases".
From TFA, next sentence after the one you quoted:
> Also, that database was a read model and by definition, it wasn’t the source of truth for anything. In theory, read models shouldn’t have backups, they should be rebuilt fast enough that won’t cause any or minimal impact in case of a major incident. Since read models usually have information inferred from somewhere else, It is debatable if they compensate for the associated monetary cost of maintaining regular backups
The article discusses how they executed an actual recovery process from the other data sources, but it took longer than it should have (6 days), so...
We ended up with a mix of the two. We refactored the process to a point it went from 6 days to a few hours. However, due to the criticality of the component, a few hours of unavailability still had significant impacts, especially during specific timeframes (e.g. sales seasons). We had a few options to further reduce that time but it started to feel like overengineering and incurred a substantial additional infrastructure cost. So we decided to also include backups when there was a higher risk, like during sales seasons or other business-critical periods.
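For anyone curious what such a rebuild roughly looks like: the sketch below is not their code, just a hedged illustration of repopulating an Elasticsearch read model from a relational source of truth. The index name, table, query and document shape are made up.

    """Rebuild an Elasticsearch read model from the source of truth.
    Purely illustrative: index name, table and fields are hypothetical."""
    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")
    pg = psycopg2.connect("dbname=shop")  # assumption: the source of truth is Postgres

    def source_docs():
        # Named (server-side) cursor streams rows instead of loading them all.
        with pg.cursor(name="rebuild") as cur:
            cur.execute("SELECT id, customer_id, total, status FROM orders")
            for id_, customer_id, total, status in cur:
                yield {
                    "_index": "orders_read_model",
                    "_id": id_,
                    "_source": {
                        "customer_id": customer_id,
                        "total": float(total),
                        "status": status,
                    },
                }

    # Bulk indexing in large chunks (or helpers.parallel_bulk with several
    # workers) is usually where a "days to hours" kind of speed-up comes from.
    helpers.bulk(es, source_docs(), chunk_size=5000)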