As a software engineer, I have a couple stories like this from earlier in my career that still haunt me to this very day.
Here’s a short version of one of them: Like 10 years ago, I was doing consulting work for a client. We worked together for months to build a new version of their web service. On launch day, I was asked to do the deployment. The development and deployment process they had in place was awful and nothing like what we have today—just about every aspect of the process was manual. Anyway, everything was going well. I wrote a few scripts and SQL queries to automate the parts I could. They gave me the production credentials for when I was ready to deploy. I decided to run what you could call my migration script one last time, just to be sure I was ready. The very moment after I hit the Enter key, I realized I had made a mistake: I had updated the script with the production credentials right before deciding to do another test run. The errors started piling up and their service was unresponsive. I was 100% sure I had just wiped their database, and I was losing it internally. What saved me was that one of their guys had completed a backup of their database just a couple of hours earlier in anticipation of the launch; in the end, they lost a tiny bit of data, but most of it was recovered via the backup. Ever since then, “careful” is an extreme understatement when it comes to how I interact with database systems—and production systems in general. Never again.
Your excellent story compelled me to share another:
We rarely interact directly with production databases as we have an event sourced architecture. When we do, we run a shell script which tunnels through a bastion host to give us direct access to the database in our production environment, and exposes the standard environment variables to configure a Postgres client.
Our test suites drop and recreate our tables, or truncate them, as part of the test run.
One day, a lead developer ran `make test` after he’d been doing some exploratory work in the prod database as part of a bug fix. The test code respected the environment variables and connected to prod instead of the local Docker database. Immediately, our tests dropped and recreated the production tables for that database a few dozen times.
Ours are not named with a common identifier; a convention like that also needs constant effort to maintain while refactoring, and there's still scope for a mistake.
*Ideally*, devs should not have prod access at all, or their credentials should only grant limited access, without permissions for destructive actions like DROP or TRUNCATE.
But in reality, there's always that one helpful DBA/dev who shares admin credentials with someone for a quick prod fix, and then those credentials end up in a wiki somewhere as part of an SOP.
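For example, in Postgres the day-to-day credentials can simply belong to a role that is never able to do anything destructive. A minimal sketch, with made-up names:

    -- app_rw can read and write rows, but it owns nothing, so it cannot DROP
    -- the tables, and the TRUNCATE privilege is simply never granted to it
    CREATE ROLE app_rw LOGIN PASSWORD 'change-me';
    GRANT USAGE ON SCHEMA public TO app_rw;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_rw;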
I've added a similar safety to every project. It's not perfect, but this last line of defense has saved team members from themselves more than once.
For Django projects, add something like the following to manage.py (os and sys are already imported by the stock manage.py; the protected environment names below are just an example):

    TEST_PROTECTED_ENVIRONMENTS = {"production", "staging"}

    env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
    if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
        raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")
I think runtime checks like this using environment variables are great. What has burned me in the past, however, is that when debugging problems I couldn't tell what the environment actually was at runtime when the logs were produced. So when the set of test-protected environments needed to be updated, I might have a hard time backtracking to it.
And when your last line of defense fires... you don't just breathe a sigh of relief that the system is robust. You also must dig into how to catch it sooner in your previous lines.
For instance, test code shouldn't have access to production DB passwords. Maybe that means a slightly less convenient login for the dev to get to production, but it's worth it.
Just yesterday, I did a C# Regex.Match with a super simple regex (^\d+), and it seemed not to work. I asked ChatGPT and he noted that I had a subtle mistake: the parameters were the other way around... :facepalm:
We had this - 10 years ago. In our case there was a QA environment which was supposed to be used by pushing code up with production configs, then an automated process copied the code to where it actually ran _doing substitutions on the configs to prevent it connecting to the production databases_. However this process was annoyingly slow, and developers had ssh access. So someone (not me) ssh'd in, and sped up their test by connecting the deploy location for their app to git and doing a git pull.
Of course this bypassed the rewrite process, and there was inadequate separation between QA and prod, so now they were connected to the live DB; and then they ran `rake test`...(cue millions of voices suddenly crying out in terror and then being suddenly silenced). The DB was big enough that this process actually took 30 minutes or so and some data was saved by pulling the plug about half-way through.
And _of course_ for maximum blast radius this was one of the apps that was still talking to the old 'monolith' db instead of a split-out microservice, and _of course_ this happened when we'd been complaining to ops that their backups hadn't run for over a week and _of course_ the binlogs we could use to replay the db on top of a backup only went back a week.
I think it was 4 days before the company came back online; we were big enough that this made the news. It was a _herculean_ effort to recover this; some data was restored by going through audit logs, some by restoring wiped blocks on HDs, and so on.
Our test harness takes an optional template as input and immediately copies it.
It’s useful to distribute the test anyway, especially for non-transactional tests.
If database initialisation is costly, that's useful even if the tests run against an empty database, as copying a database from a template is much faster than building one up DDL statement by DDL statement, at least for Postgres.
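For Postgres, that's roughly the following (names made up; the template database must have no active connections while it is being copied):

    -- clone a pre-initialised database instead of replaying every DDL statement
    CREATE DATABASE test_run_42 TEMPLATE app_test_template;
    -- ... run the tests against test_run_42, then throw it away ...
    DROP DATABASE test_run_42;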
At a place I was consulting about 10 years ago one of the internal guys on another product dropped the prod database because he was logged into his dev db and the prod db at the same time in different windows and he dropped the wrong one. Then when they went to restore the backups hadn't succeeded in months (they had hired consultants to help them with the new product for good reason).
Luckily the customer sites each had a local db that synced to the central db (so the product could run with patchy connectivity), but the guy spent 3 or 4 days working looooong days rebuilding the master db from a combination of old backups and the client-site data.
> logged into his dev db and the prod db at the same time in different windows
I am very worried about doing the wrong thing in the wrong terminal, so for some machines I colour-code my ssh windows, red for prod, yellow for staging and green for dev. e.g. in my ~/.bashrc I have: echo -ne '\e]11;#907800\a' #yellow background
About 10 years ago I literally saw the blood drain from a colleague's face as he realised he had dropped a production database because he thought he was in a dev environment.
A DBA colleague sitting nearby laughed and had things restored back within a few minutes....
Anecdote: I ran a migration on a production database from inside Visual Studio. In retrospect, it was recoverable, but I nearly had a heart attack when all the tables started disappearing from the tree view in VS…
…only to reappear a second later. It was just the view refreshing! Talk about awful UI!
Around 15 years ago, I was packing up, getting ready to leave for a long weekend. One of our marketing people I was friends with came over with a quick change to a customer's site.
I had access to the production database, something I absolutely should not have had but we were a tiny ~15 person company with way more clients than we reasonably should have. Corners were cut.
I wrote a quick little UPDATE query to update some marketing text on a product, and when the query took more than an instant I knew I had screwed up. Reading my query, I quickly realized I had run the UPDATE entirely unbounded and changed the descriptions of thousands and thousands of products.
Our database admin with access to the database backups had gone home hours earlier as he worked in a different timezone. It took me many phone calls and well over an hour to get ahold of him and get the descriptions restored.
The quick change on my way out the door ended up taking me multiple hours to resolve. My friend in marketing apologized profusely but it was my mistake, not theirs.
As far as I remember we never heard anything from the client about it, I put that entirely down to it being 5pm on Friday of a holiday weekend.
That's why I always write a BEGIN statement before executing updates and deletes. If they are not instant or don't return the expected number of modified rows I can just rollback the transaction.
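For example (table and values made up):

    BEGIN;
    UPDATE products SET description = 'New marketing copy' WHERE product_id = 42;
    -- the client reports the number of rows affected, e.g. "UPDATE 1";
    -- if that isn't the count you expected:
    ROLLBACK;
    -- otherwise:
    COMMIT;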
Or is the lesson to _always_ attempt such critical changes on a Friday? After all, in this instance the client didn't notice any problems, apparently because they were already off to their weekend.
For me personally the much bigger issue would be harming the client, their business, or our relationship. Doing a few hours of overtime to fix my mistakes would probably only feel like well-deserved punishment...
One place I worked (some 20 years ago) had a policy that any time you run a sudo command, another person has to check the command before you hit enter. Could apply the same kind of policy/convention for anything in production.
That's not good advice IMO, as most sudo commands will mess up just one host, and that's something you should generally be prepared for. You're more likely to develop a culture where engineers think of hosts as critical resources, whereas they should generally be treated as throwaway instances. It's better to identify the hosts that are SPOFs and be cautious on those only.
I can think of a larger blast radius, deleting files on a shared mount point for example, but that's not representative of regular sudo use.
I have a rule when working on production databases: Always `start transaction` before doing any kind of update - and pay close attention to the # of rows affected.
Turn autocommit off in your .psqlrc and then you can never forget the `begin transaction`; every statement is already in a transaction anyway, it's just that the default behaviour is to automatically commit each statement, for some ungodly reason.
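Concretely, the relevant lines for ~/.psqlrc (ON_ERROR_ROLLBACK is an optional companion setting):

    -- ~/.psqlrc
    \set AUTOCOMMIT off
    \set ON_ERROR_ROLLBACK interactive

With AUTOCOMMIT off nothing is permanent until you explicitly COMMIT, and ON_ERROR_ROLLBACK keeps a typo from aborting the whole open transaction you're in the middle of.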
Years ago I hired an experienced Oracle developer and put him to work right away on a SQL Server project. Oracle doesn't autocommit by default, and SQL Server does. You don't want to learn this when you type "rollback;". I took responsibility and we had all the data in an audit table and recovered quickly. I wonder if there are still people who call him "Rollback" though.
That's good from the DBA perspective, but relying on that default as a user is risky in itself, when you deal with multiple hosts and not all are set up this way.
What strikes me as remarkable in all such stories is how, almost always, the person committing the mistake is a junior who never deserved the blame, and how cavalier the handoff/onboarding by the 'seniors' working on the projects is.
Having worked in enough of these, though, I am aware that even they (the "seniors") are seldom entirely responsible for all the issues. It's mostly business constraints that force the cutting of corners, and that ends up jeopardizing the business in the long run.
As I said on Slack the other day in response to a similar story, "If, on your first day, you can destroy the prod database, it's not your fault."
(One of my standard end-of-interview questions is "how easy is it for me to trash the production database?" Having done this previously[1] and had a few near misses, it's not something I want to do again.)
[1] In my defence, I was young and didn't know that /tmp on Solaris was special. Not until someone rebooted the box, anyway.
It gets wiped on reboot. I remember that around 2007 on Gentoo Linux this behavior changed. I was using /tmp as pretty much a "my documents" type folder; I updated, and one day all my stuff was gone! I was flabbergasted. But yeah, it was reckless to store things in a folder that pretty much has "temp" in the name!
The blog post lays it on a bit thick with the $500 million number and the "launch only two weeks away" given that the article itself is illustrated with a photo of the Sojourner flight spare. Spirit had the SSTB1 test rover. If he had actually blown out the entire electrical system, they could have launched it instead. Swapping out the entire vehicle right before launch would have been an awful job, but it's not flat out impossible.
Not rude at all! I appreciate the reply. Only reason I deleted my message was because right after posting, I scrolled down and saw someone asking the exact same question at the top level, so I felt like it was best to conserve effort and not repeat them.
I liked that other people pointed out that risk could have been eliminated by using polarized connectors (I hope they started doing this after the incident), but also made me wonder about "back-EMF" caused by solar flares. In other words, maybe all thick wires and ground/power planes should be hardened against current surges simply due to a solar event hitting mars (which may incidentally cover the case of back-powering the driver circuits).
A friend once had to remotely do an OS update of a banking system. Being cautious, he thought he'd back up some central files, just in case and went "mv libc.so old_libc.so". Had to call some guy in that town to throw in the Solaris CD on prem at 2:30 in the morning...
One way I mistake-proof things in SQL Management Studio is to have different colors for production vs test databases.
To do that, on the "connect to server" dialog, click "options". On the tab "connection properties" in the "connection" option group, check "use custom color". And I pick the reddest red there is. The bottom of the results window will have that color.
edit: my horrible foul-up was restoring a database to production. The "there is trouble" pagers were all Iridium pagers since they loved climbing mountains (where there was no cell service back then). But then that place didn't use source control, so it was disasters all the way down.
As a young consultant, I was once one Enter away from causing a disaster, but something stopped me. I still shudder even though it didn't actually happen. Nothing of the sort in many years since, so a great lesson in retrospect I guess.