What we learned after I deleted the main production database by mistake (medium.com/hugo.oliveira.rocha)
152 points by fernandopess1 on Sept 19, 2022 | 126 comments



"Processes are to blame, not people"

It's called Disaster Recovery and it has a best mate called Business Continuity. If you don't actually have a process for something then it will fail unconditionally, without you even knowing it is going to happen, until it does.

OK, it's easy for me to snipe but nowhere did I see terms like those. Backups are mentioned almost casually and there is this: "but no process was implemented for ElasticSearch databases".

So, not only were critical parts of the system not actually backed up but there seems to have been no attempt to even discuss how to put Humpty Dumpty (1) back together again if the silly sod falls off the wall.

Then we get the meat of the "Processes are to blame, not people" section. It discusses avoiding fucking up by making some small changes to working practices and so on but completely misses the real point. The blogger lacks a process for recovery. It's all very well worrying about avoiding a fuck up involving a DB deletion but how do you recover? That should be only one entry in your DR plan. BC also needs some work ...

I own a small company and I spend quite a lot of time worrying about an awful lot of things. This "contrite" article that implies that the writer has actually learned a lesson worth divulging is extremely concerning to me. I suppose it's a good start but I feel it falls rather short of identifying the real problem, learning from it and implementing a proper ... process.

(1) Humpty Dumpty: UK nursery rhyme nominally about an egg. Bit more to it than that, but good enough for this discussion.


> Humpty Dumpty: UK nursery rhyme nominally about an egg.

There's an interesting piece on the web about how HD is not necessarily an egg; that idea comes from an illustration added long after the nursery rhyme was well established.

https://literature.stackexchange.com/questions/1489/how-do-w...


Untested backups and recovery plans are worthless.

If you can't recover in peace time, what makes you think you can do it in war time?


Worthless, or just worth less?


Worthless. The process has a 0% chance of success until proven otherwise.


I get the underlying message, but the statement is not quite how probability works.

Reminds me of Douglas Adams' take on one-in-a-million odds.


Terry Pratchett, not Douglas Adams https://wiki.lspace.org/Million-to-one_chance


It's not a matter of probability. It's a matter of value.

How much should you value untested plans? You should value them as worthless.


In case of catastrophic failure, would you rather have untested backups or no backups to work with? If you have even a slight preference for the former (I actually have a rather large preference for it), the untested stuff apparently has some value to you.

But of course, testing creates more value.


Understandably, but good luck explaining why you have a week-long outage because you skipped the disaster recovery process, since probability made you confident enough not to test your backups.

Until you have a provably working restore, the backup is nonexistent for all intents and purposes. The sort of calculation you'd need to perform to justify not doing so borders on alchemy. Unless your infra is extremely exotic this should be rather straightforward and inexpensive, and you're one failed restore away from this process changing immediately anyway.


Making the business case for solid disaster recovery and continuity of business (along with resources for regularly testing it) can be quite a challenge. In an immature or unhealthy culture this is something that can always be pushed to next quarter and if disaster strikes more effort will be put into political spin versus engineers doing preventative work.

At some point a company needs to adopt practices like Netflix's Chaos Monkey or Google's DiRT (disaster readiness testing) to purposefully exercise continuity-of-business plans as well as recognize the effort that is required to keep things running. Otherwise other incentives will drown out any intrinsic motivations individuals may have to improve reliability.


I think if the right people don't put processes in place and make sure people follow them, then it won't work - it is ultimately about people.


> it is ultimately about people.

... and tailoring. And "agile", when some people are already working on a new version while the old one is not even tested.

It seems that with processes it's like with deaths: a disaster needs to strike before people implement a working process.


Processes sometimes exist in dysfunctional orgs because a disaster strikes (possibly that a process wouldn't have prevented) and people who've been waiting for a chance at some more power jump in to create a process. Then in the future when it happens again the process is to blame, but no one ever questions why they have the process if it doesn't work.

And the bureaucracy expands to meet the needs of the expanding bureaucracy (-:



That's very well put. I would love to hear about your BC and DR processes and infrastructure. Have you written about it somewhere? We are a small company as well and we certainly do backups, but it'll be great to be able to implement BC and DR more formally.


You may start simply by documenting the steps, including roles, and documenting at minimum annual fire drill results and audits of procedures, dating and documenting any updates along the way once set. Ensure recovery is possible from the worst circumstances, such as an attacker deleting everything your prod account has access to (sometimes prod accounts are able to overwrite backups, etc.).


Wot they said and some more!

Think safety first - this has to come from the top, and it is a bit boring until you suddenly find yourself single-handedly rescuing quite a few people's livelihoods in the face of a disaster of some sort.

There are no real shortcuts but you can build yourself up to a decent position incrementally and erratically or you can do a formal analysis and create a plan and follow your plan - yeah right!

Start off with the basics: Do you have backups? Actually, do you have enough backups? You should have a complete copy of your data available on site (not a cluster replica) and another copy off site that might be a bit older, depending on your taste for data loss. Really work on evaluating how much data you can afford to lose. You should also have an offsite copy of your data that is immutable - ie can't be deleted or encrypted.

If you can get yourself into the safety first mood but don't know how to do it online then get a removable, USB connected disc and use that for your offline backups that you know can always be recovered from.

Now check your backups. Do some recoveries of files.

I don't know how important your company is to you but I suspect it is very important. Take some time out every now and then and do some due diligence "doo dill".

I also should do it more often ...
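
To make the "check your backups" step concrete, here's the kind of throwaway spot check I mean - a minimal sketch assuming the backup is mounted on a path; the paths and sample size are made up:

    import hashlib
    import random
    from pathlib import Path

    LIVE = Path("/srv/data")           # assumed live data directory
    BACKUP = Path("/mnt/backup/data")  # assumed mounted backup copy
    SAMPLE = 20                        # arbitrary sample size

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare a random handful of backed-up files against the live copies.
    files = [p for p in BACKUP.rglob("*") if p.is_file()]
    for backed_up in random.sample(files, min(SAMPLE, len(files))):
        live = LIVE / backed_up.relative_to(BACKUP)
        ok = live.exists() and sha256(live) == sha256(backed_up)
        print(("OK      " if ok else "MISMATCH"), backed_up)

Recently changed files will legitimately differ, so treat a mismatch as a prompt to look, not an alarm.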


> Processes are to blame, not people

Well maybe the people that enforced those processes, wouldn't you think?


There's a bit of a miscommunication going on here. A more precise (but less punchy) way of saying it would be “When things break dramatically in production, resist the too-easy urge to heap acrimonious blame on whoever was most closely involved. That’s usually unhelpful, and the defensive reaction it produces tends to be counterproductive. Instead, calmly look at the whole causal chain leading up to the incident and figure out which of its links would have been easiest to break. Then, if it seems worth the effort, design tooling or procedures that would have done so if they’d been adopted earlier.”


A lot of people criticize the GDPR, but part of it is about getting some risk assessment of your data handling done and having processes to recover from disaster. Articles 25, 32 and 35 for reference.

But all those processes and certifications are the kind of thing people bemoan in big corporations. And start-ups don't have the resources or time to do it, even though the sooner it's implemented the cheaper it is. That's IMO where incubators could add some value: have a team of people whose job is to set up and follow those processes for their smaller start-ups.


> Backups are mentioned almost casually and there is this: "but no process was implemented for ElasticSearch databases".

From TFA, next sentence after the one you quoted:

Also, that database was a read model and by definition, it wasn’t the source of truth for anything. In theory, read models shouldn’t have backups, they should be rebuilt fast enough that won’t cause any or minimal impact in case of a major incident. Since read models usually have information inferred from somewhere else, It is debatable if they compensate for the associated monetary cost of maintaining regular backups

The article discusses how they executed an actual recovery process from the other data sources, but it took longer than it should have (6 days), so...

We ended up with a mix of the two. We refactored the process to a point it went from 6 days to a few hours. However, due to the criticality of the component, a few hours of unavailability still had significant impacts, especially during specific timeframes (e.g. sales seasons). We had a few options to further reduce that time but it started to feel like overengineering and incurred a substantial additional infrastructure cost. So we decided to also include backups when there was a higher risk, like during sales seasons or other business-critical periods.


Any book recommendations on DR and BC? Thanks.


This holier-than-thou comment could also have been formulated as an addition to supplement the insights given by the blogger.

Instead it comes across as veiled condescension. It doesn't impress.


> In the fifteen minutes I had before the next meeting, I quickly joined with one of my senior members to quickly access the live environment and perform the query.

Don't do stuff in a rush like this. That's when I almost always make my worst mistakes. If there is a "business urgency" then cancel or get excused from the upcoming meeting so you can focus and work without that additional pressure. If the meeting is urgent, then do the other task afterwards.


And then, after you're done hand-rolling queries with Postman, immediately start planning actual, safe tools to do this kind of thing when the need crops up.


This story is actually about a really smart person making a stupid mistake and humbly admitting it.

Hugo has some other interesting takes on his blog.

I don't know him. But just because you haven't done the exact same thing, or have never been in a position of responsibility where your mistakes can affect many other people, doesn't mean that everyone else doesn't make mistakes.

It just means that not everyone can learn from others' mistakes.


(very late reply, only just saw this)

I'm not claiming to have never made a similar mistake. I have messed up in production a lot. But I would definitely recognise from this episode that the lack of good tooling led to risky, rushed, manual processes that can easily go wrong.


Now this meeting will beget many more urgent meetings.


Which you don't have to attend, because you're the lead IT guy whose job is to solve the crisis, not reassure the meeting people.

I work at an energy company at the moment; the servers are overloaded because of somewhat old-fashioned IT, and there's an ongoing energy crisis tripling people's monthly expenses and putting them into poverty. Everyone's panicking, but I hope they're not trying to force quick fixes or skip due process. We're still doing code reviews and going through our regular deployment processes.


For me, an interesting statement was "However, it took 6 days to fetch all data for all 17 million products." - in my experience of DB systems, 17 million entries is significant but not particularly large; it's something that fits in the RAM of a laptop and can be imported/exported/indexed/processed in minutes (if you do batch processing, not a separate transaction per entry), perhaps hours if the architecture is lousy, but certainly not days.


I think this is a very clear disadvantage of the microservice architecture they chose in this case, and the post does allude to that. To recreate this data they needed to query several different microservices that would not have been able to sustain a higher load.

If I calculated this right, the time they mention comes down to roughly 30 items per second (17,000,000 items over 6 × 86,400 seconds ≈ 33/s). Which is maybe not unreasonable for something that queries a whole bunch of services via HTTP, but is kinda ridiculous if you compare it to directly querying a single RDBMS.

You could probably fix this by scaling everything horizontally, if that is possible. But the real solution would be as you say to have bulk processing capabilities.


Yes, adding a "return X items" mode to the same microservices often is a way to get a significant performance boost with only minor changes; even if your main use case needs only one item, it enables mass processing without incurring the immense overhead of a separate request per item.


This goes beyond microservices. I've done a fair share of optimization in our own code to introduce batch-oriented calls rather than singular. Even though your service is on the same LAN as the DB server, fetching thousands of rows one select at a time is very slow compared to a single, or a few, selects.

In one case, the application was running on a laptop over WiFi, which increased the network latency by 10x. Suddenly a 30 sec job turned into a 5 minute job.

Since one can easily implement a singular version using a batch size of 1, it's a drop-in replacement in most cases.

Also, since one can easily implement a batch-style API using a singular version, you can write the API batch-oriented but implement it using the singular version if that's easier. This allows you to easily swap out the implementation if needed at a later date.
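
To illustrate that last paragraph, a minimal sketch (all names hypothetical) of a batch-shaped API that starts out backed by the singular lookup:

    from typing import Dict, Iterable

    def get_product(product_id: str) -> dict:
        # Stand-in for the existing single-item lookup (one HTTP call or one SELECT).
        raise NotImplementedError

    def get_products(product_ids: Iterable[str]) -> Dict[str, dict]:
        # Naive first implementation: loop over the singular version.
        # Later this body can become one bulk query without touching any caller.
        return {pid: get_product(pid) for pid in product_ids}

    # Callers always go through the batch call; a single item is just a batch of one:
    # product = get_products(["abc-123"])["abc-123"]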


Totally agree. If a frontend wants exactly one of something then implement an API that calls the same batch code, but exposes it as a singular.

The thing I've found for object retrieval (as opposed to search) is that you might want to break GET semantics and have people POST in a list of IDs. Otherwise you might hit the query string size limit. Random tip.


If this is the case, you've done something horrible in your query program.

Threads exist; use them. If you're waiting on 500 full sequential RTTs, that's your fault. Network requests on local fabric can be faster than storage.


Sure, threads exist, but they also make things more complicated. Simple batching can make a single thread go a lot further, was my point.


> Any kind of operation was done through an HTTP call, which you would otherwise do with a SQL script, in ElasticSearch, you would do an HTTP request

There you go.


It's recommended you send ES thousands of records at a time.
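
For anyone unfamiliar: Elasticsearch's _bulk endpoint takes newline-delimited JSON, so batching is mostly a matter of packing documents into bigger requests. A rough sketch with requests (index name, document shape, and batch size are assumptions):

    import json
    import requests

    ES = "http://localhost:9200"

    def bulk_index(docs, index="products", batch_size=1000):
        for start in range(0, len(docs), batch_size):
            lines = []
            for doc in docs[start:start + batch_size]:
                lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
                lines.append(json.dumps(doc))
            body = "\n".join(lines) + "\n"  # _bulk requires a trailing newline
            resp = requests.post(f"{ES}/_bulk", data=body,
                                 headers={"Content-Type": "application/x-ndjson"})
            resp.raise_for_status()
            if resp.json().get("errors"):
                raise RuntimeError("some bulk items were rejected")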


What?


Read his post about over-engineering a microservices architecture. A great read, and it helps explain the context of this problem. https://medium.com/@hugo.oliveira.rocha/handling-eventual-co...


That kind of depends on how big each record is. And it sounds like these records are denormalized from multiple sources, so you probably have several transactions for each record. It's possible to do batching in that situation, but it definitely isn't always easy.


I once worked on an app where the staging database was used for local testing, all devs used the same shared credentials with write access, and you switched environments by changing hosts file entries (!!!). This resulted in me accidentally nuking the staging database during my first week on the job because I ran a dev script containing some DROPs from my corporate Windows system and failed to flush the DNS cache.

I had already called out how sub-optimal this entire setup was before the incident occurred but it rang hollow from then on since it sounded like me just trying to cover for my mistake. The footguns were only half-fixed by the time I ended up leaving some time later.


> I had already called out how sub-optimal this entire setup was before the incident occurred but it rang hollow from then on since it sounded like me just trying to cover for my mistake

That's what blameless post-mortems are supposed to prevent. The only valid things to consider are questions like "is it a cost-effective prevention?", "what is the timeline for implementing this prevention / how should we prioritize it?", and "does it meaningfully overlap with other changes that would prevent this?". Something like "well, person A is new and that's why this happened" is not an acceptable position. Now of course, people can be dishonest or not self-aware enough and come from that position without stating it anyway. That said, that's why you try to focus on the objective evaluation of the recommendation and nothing else - that tends to make the bias irrelevant.

Any EM/tech lead worth their salt would instantly ask "is it actually important that staging and dev databases share creds?" Certainly prod never should. The other follow-up to track down would be "if DNS flushing is so important when changing hosts, how have we automated it so that it happens before every script that's run?" I would also recommend getting rid of (ab)using the hosts file immediately and switching to explicit configuration files that you have to specify (perhaps automatically picking up "<username>.config" as a default so that scripts only ever run on your dev zone by default, and locking down production.config so that only CI has the ability to touch it).
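
A minimal sketch of that default-config idea, assuming a JSON config directory and a CI environment variable (every name here is made up):

    import getpass
    import json
    import os
    import sys

    def load_db_config(name=None):
        # Default to the developer's own zone: <username>.config
        name = name or f"{getpass.getuser()}.config"
        if name == "production.config" and os.environ.get("CI") != "true":
            sys.exit("refusing to load production.config outside CI")
        with open(os.path.join("config", name)) as f:
            return json.load(f)

    # cfg = load_db_config()                      # developer default
    # cfg = load_db_config("production.config")   # only permitted when CI=true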


I absolutely concur with what you wrote but unfortunately not every team, company, or industry shares such a mindset. My teammates meant no offense and my mistake became a running joke but we never even considered doing an RCA for the problem. We did end up fixing the hosts file thing by moving to a different server-side architecture to rectify more pressing DX issues but the shared credentials issue persisted. One or two devs would actually lock the account by trying bad creds and block everybody's access - it took me diving into some RDBMS innards to figure out the responsible party and block them by hostname until they were able to correct the issue.


You don't have to share the mindset, but it means you'll be outcompeted if another team or competitor does. So you should do it first.


This is how most early-stage companies (see: YC) are run.

I’ve now taken 3 companies from horrid MVP (literally just got seed funding) to something a team enjoys working on. Each time I’m amazed that the app went without issues long enough to get funding (and long enough for me to refactor the whole thing.)


This is how I accidentally had a test message sent as a push notification to a few ten thousand people; our test/dev mongodb instance was on the same machine as the production instance, the boss asked me "can you switch it to the test db so we can send a test message", I was like "sure ok it's on test now" but it was not. Whoops.


Oof, that brings up some bad memories for me.


What would be the best practice for that use case?


Don't use shared credentials, don't give schema write access to devs.


And don’t use the hosts file for switching databases - use a proper configuration file in your app!


Man, that's such a bizarre idea...that's gotta be just about the worst example of hosts file abuse I can think of. But then again, I'm almost afraid of what other examples posters here might share, lol


People, please don't post things on Medium, because it wants people to sign up. Use GitHub Pages or anything else, really.


I'm torn on this, honestly.

We want an internet with fewer ads, but good writers deserve to get paid. They can get paid via Medium (though how much, I don't know) through subscriptions. Is that worse than ads or newspapers?


If all it wanted was for me to chip in a dollar I wouldn't care. I pay for plenty of substacks.

Medium is just painful to use.


And what is it with all these websites that somehow disallow zooming so I can't even see what the diagram is? And why does my browser listen? Is there a browser I can switch to that will always allow zooming? I don't care if it breaks the flow or whatever, I just want to read your diagrams!

Not a front-end dev obviously


Yes you can force this in most browsers, generally hidden behind a flag. Be aware it will break websites such as Google Maps and similar stuff.


Oh awesome, thank you for the education. I'll look into that.


Five dollars here, five dollars there. It gets expensive.


My rule is that if I spend an hour a week (on average) using a source, they deserve a bit of money.

There's a videogame streamer I like. I probably watch 4-8 hours of her content, as she live translates Japanese games while playing them. At $5/mo, that is the cheapest source of entertainment available aside from used books.

For blogs I read once in a blue moon, I don't typically contribute unless it's worth supporting.


Ideally they'd be self-hosting a blog and making more money than whatever Medium lets them keep after its fee.


How? Through a membership system or through ads or sponsors?

Lots of people hate the ad system, won't pay for an individual blog, and get tired of hearing/seeing the same 10 generic internet sponsors (NordVPN, ExpressVPN, Skillshare, etc.)


Ask for donations. Say you have a blog with a thousand regular users; what do you get, maybe $1 for that many views of an ad? One user donating $20 a month, OTOH, would make you a lot more money per user on average than relying on ads.


What you say is true, but that doesn't mean it should be posted to HN. The purpose of this site is to discuss articles. This one's behind a paywall. Even if you sign up for a free account, you may have used up your two free articles per month. That invites people to comment without reading the article. That's not why HN exists. (I actually checked the comments hoping someone posted a copy of the article.)


An incognito window is all that is needed for medium.

I've seen other paywalls posted on HN that take more effort to get around.


Medium annoys me far more than e.g. Bloomberg, to the point that I actively avoid Medium links unless it's an author I know or the comments indicate substantive content.


I enjoy good writing, but the only writing I'm willing to pay for is print books (I just bought a copy of J.R.R. Tolkien's "The Fall of Gondolin", the hardcover, illustrated one by HarperCollins). I don't want to pay for newspapers, for investigative journalism, or for long form article magazines like The Atlantic or The New Yorker. Nevermind Medium of all places, because Medium has no barrier of entry. No gatekeeping (and, given how easy it is to merely write a blurb of text, I have rather high standards for what I choose to pay to read). I'd rather consume from the likes of Amazon and have them run these writing platforms (e.g. WaPo) at a loss. Which means I'm paying for writers, in the end, just in a very indirect way. This sits well with me.

But if the choice before me was to pay for writers directly (like Medium), or let non-book writers as a profession disappear, I'd opt for the latter. You may criticize this attitude. I assume the responsibility for that and I'm being honest.


I want a world where I pay creators directly.


Brave browser allows me to send tips to content creators. I like it.

https://support.brave.com/hc/en-us/articles/360021123971-How...


Exactly. Won't read because it requires sign up.


It seems rude to derail an entire thread because of author's choice of CMS.


I too generally avoid Medium, but I wouldn't demand someone else change their choice of publisher to please me.

If, from the comments here or wherever is referring me to an article, I get the impression that it will be interesting or useful to me, I'll maybe go look; if not, then I won't. Or sometimes the discussion the link starts is enough that I get useful references to things to look at in other places instead (or just learn what I want to from the discussion directly). Heck, sometimes just the subject is enough to start searching for other references.


There is also hashnode.com


If you're on Firefox there's an extension to bypass this (only for Medium's free articles) - https://gitlab.com/magnolia1234/bypass-paywalls-firefox-clea...


I'm not keen on playing the browser plugin escalation game with fundamentally UX hostile sites like Medium. They clearly have no respect for the human being at the end of the line trying to simply read a document.


This is extremely melodramatic. They literally just want money so they don't have to run ads


For this article, anyway, they don't want money. At least, the form claims that signing up is free -- so they just want an account created.

(Which I'm not OK doing, because that tells me that they want to track me, presumably for advertising purposes.)


It's worse than that. They took at least $163m in venture money: https://www.crunchbase.com/organization/medium/company_finan...

They've been flailing around for a decade in search of a model that works. If UX hostility is what it takes to have a story that retroactively justifies the investment, UX hostility is what we'll get.


i want money toooooo.


There is also LibRedirect[1] which automatically redirects to an alternative frontend.

[1] https://github.com/libredirect/LibRedirect


archive.ph cuts through the medium paywall too.


As does basic cookie hygiene.

At least I assume that's what's happening, I haven't seen a medium paywall yet.


Words no dev or admin ever wants to hear or think when working with a large database: "Hmmm... This seems to be taking longer than it should. Weird."

I did something like this 23 years ago. I could bore you with specifics, but I'm sure you can guess, or worse, can fill in the details from your own similar experience. It's sorta like grabbing a hot iron skillet with your bare hand - you're not likely to ever do it twice, and like Mark Twain's cat, you won't even pull one out from the cabinet without getting an oven mitt first. Sometimes we have to learn lessons the hard way.


Being one click away from a DELETE vs a GET sounds like a serious foot-gun that I would wrap a check around. “Are you sure? This operation will delete 17M entries.”


This is the Postman HTTP method selection dropdown that you can see on the screenshots on this page (“GET”): https://learning.postman.com/docs/sending-requests/requests/...

Postman doesn’t know that sending a single DELETE request to that URL will delete 17 million records.

Arguably, REST interfaces shouldn’t allow deleting an entire collection with a single parameterless DELETE request.


I work in an environment where Postman is an administrative and testing tool for our developers and it worries me.

How do you produce repeatable test results when you’re just passing around Postman configurations (they don’t want to commit them to GitHub in case there are embedded credentials)? How do you know your services are configured correctly?


I agree. I feel like deleting a collection should require at least two requests, one to get a deletion authorization token, and another to perform the deletion with the token. The RESTful equivalent of "are you sure?". It's terrifying to think millions of records are just one DELETE request away from oblivion.
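
Something like this, as a toy sketch (endpoint names, the header, and the 60-second TTL are all made up):

    import secrets
    import time
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    pending = {}  # token -> (collection, expires_at)

    @app.route("/<collection>/deletion-requests", methods=["POST"])
    def request_deletion(collection):
        token = secrets.token_urlsafe(16)
        pending[token] = (collection, time.time() + 60)  # valid for 60 seconds
        return jsonify(confirm_token=token, expires_in=60)

    @app.route("/<collection>", methods=["DELETE"])
    def delete_collection(collection):
        token = request.headers.get("X-Confirm-Token", "")
        target, expires_at = pending.pop(token, (None, 0.0))
        if target != collection or time.time() > expires_at:
            return jsonify(error="missing or expired confirmation token"), 409
        # ...actually drop the collection here...
        return jsonify(deleted=collection)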


I'm going to remember this approach!


I'd be seriously scared of putting any production credentials with write access into my Postman/Insomnia/whatever. Those tools are meant for quickly experimenting with requests, they don't have any safety barriers.


I mean, it shouldn't really be very easy to even get a read-write token to a production database, unless you're a correctly-launched instance of a publisher service. This screams to me that they're ignorant of, and probably very sloppy with, access control up and down their stack.


This is actually discussed in the article. Basically, at least with older versions of Elasticsearch, without X-Pack Elasticsearch didn't have granular permissions. Either you had access or you didn't.


This is when you put a gateway-type layer on top of a datastore that enforces your own company-specific authn/authz.

In this case, the datastore uses a REST API, so that should be fairly easy to implement. You could even do it in Nginx or Envoy.
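
Not Nginx or Envoy, but just to illustrate the gateway idea, here's a toy sketch as a small Flask proxy; the upstream address, the header name, and the blanket DELETE rule are all assumptions:

    from flask import Flask, Response, request
    import requests

    UPSTREAM = "http://elasticsearch:9200"  # assumed internal address
    app = Flask(__name__)

    @app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE", "HEAD"])
    def proxy(path):
        # Refuse destructive verbs unless an explicit break-glass header is present.
        if request.method == "DELETE" and request.headers.get("X-Confirm-Destructive") != "yes":
            return Response("DELETE blocked by gateway policy\n", status=403)
        upstream = requests.request(
            request.method,
            f"{UPSTREAM}/{path}",
            params=request.args,
            data=request.get_data(),
            headers={"Content-Type": request.headers.get("Content-Type", "application/json")},
        )
        return Response(upstream.content, status=upstream.status_code,
                        content_type=upstream.headers.get("Content-Type"))

A real deployment would do proper authn/authz here instead of a single header check, but the shape is the same.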


Honestly I'd make the case for writing a simple Python script for this kind of thing.

`requests.get(url)` is a lot harder to mis-type as `requests.delete(url)`.

At $dayjob we would sometimes do this sort of one-off request using Django ORM queries in the production shell, which could in principle do catastrophic things like delete the whole dataset if you typed `qs.delete()`. But if you write a one-off 10-line script, and have someone review the code, then you're much less likely to make these sort of "mis-click" errors.

Obviously you need to find the right balance of safety rails vs. moving fast. It might not be a good return on investment to turn the slightly risky daily 15-minute task into a safe 5-hour task. But I think with the right level of tooling you can make it into a 30-minute task that you run/test in staging, and then execute in production by copy/pasting (rather than deploying a new release).

I would say that the author did well by having a copilot; that's the other practice we used to avoid errors. But a copilot looking at a complex UI like Postman is much less helpful than looking at a small bit of code.
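
For what it's worth, a sketch of what such a one-off script might look like - URL, query, and prompt wording are made up, and the point is only that the method and target are spelled out and reviewable:

    import sys
    import requests

    URL = "https://search.internal.example/products/_search"  # hard-coded on purpose
    QUERY = {"query": {"term": {"status": "stale"}}, "size": 0}

    def main():
        answer = input(f"About to POST a read-only query to {URL}. Type 'yes' to continue: ")
        if answer.strip().lower() != "yes":
            sys.exit("aborted")
        resp = requests.post(URL, json=QUERY, timeout=30)
        resp.raise_for_status()
        print(resp.json())

    if __name__ == "__main__":
        main()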


I've written Python scripts like that as well, using requests delete and post actions. I went a step further and hard-coded the URLs, too.

And because the main branch was protected, getting the script to run meant being forced to do a PR review.

But it made things easier to not fuck up. The URL that's being modified is right there, in the code. The action being performed is a delete, it's in the code.

Makes things slightly more inconvenient but adds extra safety checks.

It's funny when the safety checks aren't enough, though. Back in the olden days, I had to drop thousands of records from a prod database because they were poisoning some charts.

Well, I was smart, you see. I first did a SELECT on the records I expected to be safe. I looked at the results, everything is okay, I'm in the right db and this is the right table. Then I did a SELECT of the records I wanted to delete. Only a couple thousand records, everything looks good.

Now all I have to do is press the up arrow key once, modify the SELECT to DELETE, and run the command.

So I pressed the up arrow key but nothing happened. I must've not pressed it hard enough to register so I pressed it again and it worked. I see the SELECT command, change it to delete, run it, aaaand it deleted hundreds of thousands of records.

What must've happened is there was a bit of lag and all my up arrows registered at once, taking me back to the select command where I looked at the good records.

Obviously I wasn't being safe enough, because I should've double checked the DELETE command I was about to run. But I thought I was being safe enough.

I had done a backup before all that, so everything was fine. But I'm still traumatized like 10 years later. I quadruple check commands I'm about to run that will affect things in a major way.


Transactions are your friend! Even inside a transaction deletes scare the hell out of me, but at least I have that extra layer of defense.
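
For example, with psycopg2 (table name, DSN, and the expected count are assumptions) the delete can sit in an explicit transaction that only commits if the row count looks sane:

    import psycopg2

    EXPECTED_MAX = 3000  # roughly "a couple thousand records"

    conn = psycopg2.connect("dbname=prod user=ops")  # placeholder DSN
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            with conn.cursor() as cur:
                cur.execute("DELETE FROM chart_events WHERE quality = 'poisoned'")
                if cur.rowcount > EXPECTED_MAX:
                    raise RuntimeError(
                        f"would delete {cur.rowcount} rows, expected <= {EXPECTED_MAX}")
    finally:
        conn.close()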


For tasks like these I have either an input prompt (“is this testing or production?”, “write ‘yes’ to confirm”), or an explicit flag like --production.

DataGrip also allows you to color-code the entire interface of a connection, so even my cowboy-access read-write connection is brightly red and hard to miss.


Funny to think that the issue here is just a relative of the '--preserve-root' default that rm (now) has: it's easy to let the user apply the same actions equally to the branches of a hierarchy as to the leaves, but should they?

Pretty recently corporate changed something on my work laptop that resulted in a bunch of temporary files generated during the build getting redirected to OneDrive. I went in and nuked the temp files and shortly thereafter got a message from OD saying 'hey noticed you trashed a ton of files, did you mean to do that?'

The developer side of me thought 'of course I did, duh' but I can imagine that's useful information for most users that made an innocent yet potentially costly mistake.


Cases like that make me wonder why there's no wider adoption in our industry of some trivial ideas such as (my first inspiration would be the aviation industry):

* checklists/runbooks with sample commands so that you copy/paste instead of manually typing or relying on your shell's history

* explicit scripts, named in a verbose and clear way, to be executed for dangerous steps (e.g. to delete something or deploy to prod) instead of allowing typing random commands

* for really critical operations, a mandatory code review/a second pair of eyes in some way (the blog seems to mention something of this nature)


We do this in all of my projects. It's a PITA and tedious, but on big enough projects with big enough clients (governments), things like deleting a prod database mean CNN shows up on your front lawn.


Checklists with copy-paste commands are used in some places, but it really requires that the team maintaining the process feels like they are getting value out of them for the checklists to be kept up to date.

One place I worked we had detailed step-by-step checklists with ready-to-copy-paste commands for patching servers. The Windows process was forever drifting and getting out of date while the Linux process was being kept up to date.

The difference was that the Windows admins were all morning people who enjoyed showing up to work at 7am (or earlier if they could justify it) and just doing it from memory, while the Linux team were all night owls who preferred to roll out of bed five minutes before the update window, remote in, and execute the process with as much automation as possible in the hope of getting some more sleep before having to go to work.

Basically the Windows process was only ever actually followed when someone outside of the Windows admins had to cover for them and the Linux process was used every time by the Linux team.


> checklists/runbooks

The last couple places I've worked have called these Standard Operating Procedures (SOPs). Effectively just the manual version of something that can be automated.

From the article, it sounds like it would have helped to formulate the correct command in a non-production environment before moving to production


Because they do exist, and have for 20 years; they just aren't used because, to most cowboy "developers" / startups, that sounds like work that doesn't contribute to a minimum viable product, or like a waste of time.

Testing / separation of dev environments is almost black magic.


Because we have no industry standards for ops. There are certainly IT standards, like ITSM, but the day to day product operations are ignored. And even if there were industry standards, nobody would implement them until they were forced to, same as the rest of the standards. Product people don't want anyone restricting them, and their priority isn't to ensure operations are reliable.


ElasticSearch is not to be used as your primary data store. It's a search tool and you should have a process to rebuild an index from your primary data store.

Outside of that, I highly recommend using a managed service, either AWS ElasticSearch service, or Elastic.co to make recovery and management easy. AWS does a snapshot on every index every couple hours and it's relatively simple to restore a deleted index.

Also, I'm sure the author knows this now, but don't ever run any command on a live production data store without triple checking it.


Elasticsearch provides some great tools to manage backups out of the box. It has built-in snapshot functionality that sends incremental data snapshots to object storage or the file system for you and is really easy to set up.

You can use a tool like elasticsearch-curator or even cron to manage running backups, or use the built-in scheduling (snapshot lifecycle management).
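
For reference, the snapshot APIs are plain HTTP calls; a rough sketch (repository name, path, and snapshot name are made up, and field names may vary by version - check the docs for yours):

    import requests

    ES = "http://localhost:9200"

    # 1. Register a filesystem repository (the path must be whitelisted in path.repo).
    requests.put(f"{ES}/_snapshot/my_backup", json={
        "type": "fs",
        "settings": {"location": "/mnt/es-backups"},
    }).raise_for_status()

    # 2. Take a snapshot of everything and wait for it to finish.
    requests.put(f"{ES}/_snapshot/my_backup/snapshot_2022_09_19",
                 params={"wait_for_completion": "true"}).raise_for_status()

    # 3. Restore it later (typically into a quiet cluster, or with renamed indices).
    requests.post(f"{ES}/_snapshot/my_backup/snapshot_2022_09_19/_restore").raise_for_status()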


> Since it was essentially a read model, it wasn’t the source of truth for any specific information

I agree. And luckily that was the case for them.


And further, because this is the case, he didn’t really delete the primary source of truth, he deleted a cache.


Having an endpoint that can just delete… everything seems kinda risky.

It is interesting they have all that architecture and an endpoint that can just delete everything.


It's an ElasticSearch endpoint that automatically comes with every index. I'm not even sure if there's a way to disable it. Maybe some sort of auth.


That’s wild to me. Seems like an accident(s) waiting to happen.


«We had backups for most databases but no process was implemented for ElasticSearch databases.» - that’s all you need to know


I've been there. In my case it was an authoritative internal MySQL database. I actually remember being annoyed as I was doing it: "Stop prompting me to confirm MySQL, I know what I'm doing!" LOL. Luckily my diligent IT guy had set up automated nightly backups so he was able to restore in about 30 minutes and we only lost a day's work.


An old discussion arose about the need for backups. We had backups for most databases but no process was implemented for ElasticSearch databases. Also, that database was a read model and by definition, it wasn’t the source of truth for anything. In theory, read models shouldn’t have backups, they should be rebuilt fast enough that won’t cause any or minimal impact in case of a major incident. Since read models usually have information inferred from somewhere else, It is debatable if they compensate for the associated monetary cost of maintaining regular backups

My biggest concern about restoring that Elasticsearch backup would be that the restored backup would be inconsistent with the real source of truth and it might be hard to reconcile to bring it up to date.


The backup only needs to last long enough until the production database is rebuilt from the source of truth, and then swapped back to the most recent search database.

In other words, it only has to be good enough for a few days (ideally - hours).


While everything there is true, why not have a backup anyway? I have Elasticsearch backups and even used them once (with success) when I terraformed the index away. The delta was then sourced on the fly.


I still don't have backups; any hints on a proven method? I admit to being lazy, but so far I couldn't imagine how to make a consistent backup across the cluster, something similar to --single-transaction in MySQL.



They already had a reconcile process. They ended up running it against an old schema from a couple of days ago, but it sounds like they probably could have run it against a backup (that was more recent) as well.


The process doesn't seem like it really improved in the end. Obviously I'm being armchair ops right now, but using Postman for business-critical updates of your very important search index seems inherently bad. I understand wanting to run one-off commands, but that's why having a script is super nice. Elasticsearch has libraries in most languages, and the difference between a misclick in Postman and typing out "delete_index" is readily obvious and could be caught more easily in code review. Either way it sucks and hindsight is 20/20, but it was definitely avoidable.


May 17th is Database Awareness Day here, in commemoration of a founding engineer blowing away a production database by accident. We all rejoice and test that our backups can be restored successfully.


This is when doing event sourcing is a godsend. You just replay all the events and rebuild your ENTIRE system from a single table (or wherever you store/back up your events).
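
For readers who haven't seen it, a minimal sketch of what "replay" means here - event types, fields, and the in-memory read model are all hypothetical:

    # Fold an ordered event stream back into a read model.
    def apply(state: dict, event: dict) -> dict:
        kind, data = event["type"], event["data"]
        if kind == "ProductCreated":
            state[data["id"]] = {"name": data["name"], "price": data["price"]}
        elif kind == "PriceChanged":
            state[data["id"]]["price"] = data["price"]
        elif kind == "ProductDeleted":
            state.pop(data["id"], None)
        return state

    def rebuild(events) -> dict:
        state = {}
        for event in sorted(events, key=lambda e: e["sequence"]):
            state = apply(state, event)
        return state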


For all the pain one has to endure to implement event sourcing, one would sure hope it might have at least one or two benefits.

Also, what happens if somebody nukes the event table? How is your precious event sourcing supposed to save you then, huh?


I learned that my boss calls me a plonker when I screw up. So many different ways to cock up things I've found, never to be repeated again.


I learned (many years ago) that I won't be fired on the spot, for some reason.

This made me appreciate how desperate the industry is for software engineers. :)


I'm sure there are more learnings that are bound to show up in your next performance review...



