As a software engineer, I have a couple stories like this from earlier in my career that still haunt me to this very day.
Here’s a short version of one of them: Like 10 years ago, I was doing consulting work for a client. We worked together for months to build a new version of their web service. On launch day, I was asked to do the deployment. The development and deployment process they had in place was awful and nothing like what we have today—just about every aspect of the process was manual. Anyway, everything was going well. I wrote a few scripts and SQL queries to automate the parts I could. They gave me the production credentials for when I was ready to deploy. I decided to run what you could call my migration script one last time, just to be sure I was ready. The very moment after I hit the Enter key, I realized I had made a mistake: I had updated the script with the production credentials right before deciding to do another test run. The errors started piling up and their service was unresponsive. I was 100% sure I had just wiped their database, and I was losing it internally. What saved me was that one of their guys had completed a backup of their database just a couple of hours earlier in anticipation of the launch; in the end, they lost a tiny bit of data, but most of it was recovered via the backup. Ever since then, “careful” is an extreme understatement when it comes to how I interact with database systems—and production systems in general. Never again.
Your excellent story compelled me to share another:
We rarely interact directly with production databases as we have an event sourced architecture. When we do, we run a shell script which tunnels through a bastion host to give us direct access to the database in our production environment, and exposes the standard environment variables to configure a Postgres client.
Our test suites drop and recreate our tables, or truncate them, as part of the test run.
One day, a lead developer ran `make test` after he’d been doing some exploratory work in the prod database as part of a bug fix. The test code respected the environment variables and connected to prod instead of the local Docker database. Immediately, our tests dropped and recreated the production tables for that database a few dozen times.
Ours are not named with a common identifier; a convention like that also needs constant effort to maintain while refactoring, and there's still scope for a mistake.
*Ideally*, devs should not have prod access at all, or their credentials should only grant limited access, without permissions for destructive actions like DROP or TRUNCATE.
But in reality, there's always that one helpful DBA/dev who shares admin credentials with someone for a quick prod fix, and then those credentials end up in a wiki somewhere as part of an SOP.
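For example, in Postgres the day-to-day credentials can simply belong to a role that is never able to do anything destructive. A minimal sketch, with made-up names:

    -- app_rw can read and write rows, but it owns nothing, so it cannot DROP
    -- the tables, and the TRUNCATE privilege is simply never granted to it
    CREATE ROLE app_rw LOGIN PASSWORD 'change-me';
    GRANT USAGE ON SCHEMA public TO app_rw;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_rw;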
I've added a similar safety to every project. It's not perfect, but this last line of defense has saved team members from themselves more than once.
For Django projects, add something like the following to manage.py (os and sys are already imported by the stock manage.py; the protected environment names below are just an example):

    TEST_PROTECTED_ENVIRONMENTS = {"production", "staging"}

    env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
    if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
        raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")
I think runtime checks like this using environment variables are great. What has burned me in the past, however, is that when debugging problems I couldn't tell what the environment actually was at runtime when the logs were produced. So when the set of test-protected environments needed to be updated, I might have a hard time backtracking to it.
And when your last line of defense fires... you don't just breathe a sigh of relief that the system is robust. You also must dig into how to catch it sooner in your previous lines.
For instance, test code shouldn't have access to production DB passwords. Maybe that means a slightly less convenient login for the dev to get to production, but it's worth it.
Just yesterday, I did a C# Regex.Match with a super simple regex (^\d+), and it seemed not to work. I asked ChatGPT and he noted that I had a subtle mistake: the parameters were the other way around... :facepalm:
We had this - 10 years ago. In our case there was a QA environment which was supposed to be used by pushing code up with production configs, then an automated process copied the code to where it actually ran _doing substitutions on the configs to prevent it connecting to the production databases_. However this process was annoyingly slow, and developers had ssh access. So someone (not me) ssh'd in, and sped up their test by connecting the deploy location for their app to git and doing a git pull.
Of course this bypassed the rewrite process, and there was inadequate separation between QA and prod, so now they were connected to the live DB; and then they ran `rake test`...(cue millions of voices suddenly crying out in terror and then being suddenly silenced). The DB was big enough that this process actually took 30 minutes or so and some data was saved by pulling the plug about half-way through.
And _of course_ for maximum blast radius this was one of the apps that was still talking to the old 'monolith' db instead of a split-out microservice, and _of course_ this happened when we'd been complaining to ops that their backups hadn't run for over a week and _of course_ the binlogs we could use to replay the db on top of a backup only went back a week.
I think it was 4 days before the company came back online; we were big enough that this made the news. It was a _herculean_ effort to recover this; some data was restored by going through audit logs, some by restoring wiped blocks on HDs, and so on.
Our test harness takes an optional template as input and immediately copies it.
It’s useful to distribute the test anyway, especially for non-transactional tests.
If database initialisation is costly, that's useful even if the tests run against an empty database, as copying a database from a template is much faster than building one up DDL statement by DDL statement, at least for Postgres.
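For Postgres, that's roughly the following (names made up; the template database must have no active connections while it is being copied):

    -- clone a pre-initialised database instead of replaying every DDL statement
    CREATE DATABASE test_run_42 TEMPLATE app_test_template;
    -- ... run the tests against test_run_42, then throw it away ...
    DROP DATABASE test_run_42;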
At a place I was consulting about 10 years ago one of the internal guys on another product dropped the prod database because he was logged into his dev db and the prod db at the same time in different windows and he dropped the wrong one. Then when they went to restore the backups hadn't succeeded in months (they had hired consultants to help them with the new product for good reason).
Luckily the customer sites each had a local db that synced to the central db (so the product could run with patchy connectivity), but the guy spent 3 or 4 days working looooong days rebuilding the master db from a combination of old backups and the client-site data.
> logged into his dev db and the prod db at the same time in different windows
I am very worried about doing the wrong thing in the wrong terminal, so for some machines I colour-code my ssh windows, red for prod, yellow for staging and green for dev. e.g. in my ~/.bashrc I have: echo -ne '\e]11;#907800\a' #yellow background
About 10 years ago I literally saw the blood drain from a colleague's face as he realised he had dropped a production database because he thought he was in a dev environment.
A DBA colleague sitting nearby laughed and had things restored back within a few minutes....
Anecdote: I ran a migration on a production database from inside Visual Studio. In retrospect, it was recoverable, but I nearly had a heart attack when all the tables started disappearing from the tree view in VS…
…only to reappear a second later. It was just the view refreshing! Talk about awful UI!
Around 15 years ago, I was packing up, getting ready to leave for a long weekend. One of our marketing people I was friends with came over with a quick change to a customer's site.
I had access to the production database, something I absolutely should not have had but we were a tiny ~15 person company with way more clients than we reasonably should have. Corners were cut.
I wrote a quick little UPDATE query to update some marketing text on a product, and when the query took more than an instant I knew I had screwed up. Reading my query, I quickly realized I had run the UPDATE entirely unbounded and changed the descriptions of thousands and thousands of products.
Our database admin with access to the database backups had gone home hours earlier as he worked in a different timezone. It took me many phone calls and well over an hour to get ahold of him and get the descriptions restored.
The quick change on my way out the door ended up taking me multiple hours to resolve. My friend in marketing apologized profusely but it was my mistake, not theirs.
As far as I remember we never heard anything from the client about it, I put that entirely down to it being 5pm on Friday of a holiday weekend.
That's why I always write a BEGIN statement before executing updates and deletes. If they are not instant or don't return the expected number of modified rows I can just rollback the transaction.
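For example (table and values made up):

    BEGIN;
    UPDATE products SET description = 'New marketing copy' WHERE product_id = 42;
    -- the client reports the number of rows affected, e.g. "UPDATE 1";
    -- if that isn't the count you expected:
    ROLLBACK;
    -- otherwise:
    COMMIT;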
Or is the lesson to _always_ attempt such critical changes on a Friday? After all, in this instance the client didn't notice any problems, apparently because they were already off to their weekend.
For me personally the much bigger issue would be harming the client, their business, or our relationship. Doing a few hours of overtime to fix my mistakes would probably only feel like well-deserved punishment...
One place I worked (some 20 years ago) had a policy that any time you run a sudo command, another person has to check the command before you hit enter. Could apply the same kind of policy/convention for anything in production.
That's not good advice IMO, as most sudo commands will mess up just one host, and that's something you should generally be prepared for. You're more likely to develop a culture where engineers think of hosts as critical resources, whereas they should generally be treated as throwaway instances. It's better to identify the hosts that are SPOFs and be cautious on those only.
I can think of a larger blast radius, deleting files on a shared mount point for example, but that's not representative of regular sudo use.
I have a rule when working on production databases: Always `start transaction` before doing any kind of update - and pay close attention to the # of rows affected.
Turn autocommit off in your .psqlrc and then you can never forget the `begin transaction`; every statement is already in a transaction anyway, it's just that the default behaviour is to automatically commit each statement, for some ungodly reason.
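Concretely, the relevant lines for ~/.psqlrc (ON_ERROR_ROLLBACK is an optional companion setting):

    -- ~/.psqlrc
    \set AUTOCOMMIT off
    \set ON_ERROR_ROLLBACK interactive

With AUTOCOMMIT off nothing is permanent until you explicitly COMMIT, and ON_ERROR_ROLLBACK keeps a typo from aborting the whole open transaction you're in the middle of.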
Years ago I hired an experienced Oracle developer and put him to work right away on a SQL Server project. Oracle doesn't autocommit by default, and SQL Server does. You don't want to learn this when you type "rollback;". I took responsibility and we had all the data in an audit table and recovered quickly. I wonder if there are still people who call him "Rollback" though.
That's good from the DBA perspective, but relying on that default as a user is risky in itself, when you deal with multiple hosts and not all are set up this way.
What strikes me as remarkable in all such stories is how, almost always, the person committing the mistake is a junior who never deserved the blame, and how cavalier the handoff/onboarding by the 'seniors' working on the projects is.
Having worked in enough of these, though, I am aware that even they (the "seniors") are seldom entirely responsible for all the issues. It's mostly business constraints that force the cutting of corners, and that ends up jeopardizing the business in the long run.
As I said on Slack the other day in response to a similar story, "If, on your first day, you can destroy the prod database, it's not your fault."
(One of my standard end-of-interview questions is "how easy is it for me to trash the production database?" Having done this previously[1] and had a few near misses, it's not something I want to do again.)
[1] In my defence, I was young and didn't know that /tmp on Solaris was special. Not until someone rebooted the box, anyway.
It gets wiped on reboot. I remember that around 2007 on Gentoo Linux this behavior changed. I was using /tmp as pretty much a "my documents" type folder; I updated, and one day all my stuff was gone! I was flabbergasted. But yeah, it was reckless to store things in a folder that pretty much has "temp" in the name!
The blog post lays it on a bit thick with the $500 million number and the "launch only two weeks away" given that the article itself is illustrated with a photo of the Sojourner flight spare. Spirit had the SSTB1 test rover. If he had actually blown out the entire electrical system, they could have launched it instead. Swapping out the entire vehicle right before launch would have been an awful job, but it's not flat out impossible.
Not rude at all! I appreciate the reply. Only reason I deleted my message was because right after posting, I scrolled down and saw someone asking the exact same question at the top level, so I felt like it was best to conserve effort and not repeat them.
I liked that other people pointed out that risk could have been eliminated by using polarized connectors (I hope they started doing this after the incident), but also made me wonder about "back-EMF" caused by solar flares. In other words, maybe all thick wires and ground/power planes should be hardened against current surges simply due to a solar event hitting mars (which may incidentally cover the case of back-powering the driver circuits).
A friend once had to remotely do an OS update of a banking system. Being cautious, he thought he'd back up some central files, just in case and went "mv libc.so old_libc.so". Had to call some guy in that town to throw in the Solaris CD on prem at 2:30 in the morning...
One way I mistake-proof things in SQL Management Studio is to have different colors for production vs test databases.
To do that, on the "connect to server" dialog, click "options". On the tab "connection properties" in the "connection" option group, check "use custom color". And I pick the reddest red there is. The bottom of the results window will have that color.
edit: my horrible foul-up was restoring a database to production. The "there is trouble" pagers were all Iridium pagers since they loved climbing mountains (where there was no cell service back then). But then that place didn't use source control, so it was disasters all the way down.
As a young consultant, I was once one Enter away from causing a disaster, but something stopped me. I still shudder even though it didn't actually happen. Nothing of the sort in many years since, so a great lesson in retrospect I guess.