I Accidentally Deleted All Our Data (fausak.me)
97 points by taylorfausak on Nov 18, 2011 | 74 comments



Happy to hear that you had backups and not too much data was lost because you caught it quickly.

But besides the obvious takeaway of always having recent, tested backups, I think it's also important to get out of the habit of ever using a REPL (or SSH for that matter) to connect to production servers. Write a one-off script, document when and why you made it, check it into your VCS of choice, and run it against the target server(s). Of course within the script itself feel free to make any connections you need; my point is to never do it "manually" from the command-line. This also prepares you for when you move past having a single live server for web or db. If you're in Python-land, Fabric is the go-to tool these days for automating this.
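
For example, with Fabric the whole thing can be a tiny fabfile; this is only a sketch, and the host name, paths, and task name are invented:

    # fabfile.py -- one-off data fix, checked into the repo alongside the script
    from fabric.api import cd, env, put, run

    env.hosts = ['app1.example.com']  # the target server(s)

    def fix_duplicate_emails():
        # ship the reviewed script to the server and run it through Django's shell
        put('oneoffs/fix_duplicate_emails.py', '/tmp/fix_duplicate_emails.py')
        with cd('/srv/app'):
            run('python manage.py shell < /tmp/fix_duplicate_emails.py')
Then `fab fix_duplicate_emails` runs it against the listed hosts, and the fabfile plus the script stay in version control as a record of what was run and why.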


I usually write my scripts in a temporary file and run them through Django with something like:

    $ python manage.py shell <~/tmp-script.py
In this instance, I didn't even consider that what I was doing could be catastrophic. But you're right, you want to avoid doing things manually. In the future, I'll track all the scripts I run. And thanks for the tip about Fabric, I'll check it out.


For bonus points, clone your production db locally and test your script against it a few times first.

Or, you know, have a staging server.


Exactly.

This is the reason why enterprises always use staging servers with test suites.

It is irrelevant whether you use NoSQL, SQL, or magic: if you don't have development+staging+production platforms for a business reliant on an information system, you're doing it wrong and it will bite you hard eventually.


The advice isn't about how you do the script.

You should never, ever be running test scripts and code against a production db.


Potential lifesaver for anyone using MySQL: you can start the client with --i-am-a-dummy to prevent DELETE or UPDATE statements without a WHERE clause:

http://sql-info.de/mysql/notes/I-am-a-dummy.html
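
It's just a flag on the normal client invocation (user and database names below are placeholders), and it's an alias for --safe-updates:

    $ mysql --i-am-a-dummy -u app_user -p app_db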


I don't know how to do this for the MySQL client, but a very useful trick for the PostgreSQL client is to turn autocommit off, which basically turns your session into a transaction that you have to manually commit before closing the client. So if you do a "DELETE FROM important_data WHERE condition_that_is_always_true" or something and it says "Removed 9001 rows", you can just ROLLBACK and your data is still safe.
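
In psql that's a one-liner (it can also go in ~/.psqlrc):

    \set AUTOCOMMIT off
After that, the first statement you run implicitly opens a transaction, and nothing is permanent until you COMMIT.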


Oracle works this way, too. In fact, I don't believe there is any such ridiculous feature as "autocommit" in Oracle; that is only something that might happen in a client application. The closest thing I can think of is that sql*plus will perform a commit upon normal exit.


  SET autocommit=0;


Years ago I developed the habit of writing WHERE clauses first and then hitting the home key to write the query.


Why is this not the default, with an option to DISable?


Because it's against the SQL standard (not that MySQL cares very much for that anyway). Just like SELECT * FROM table returns all the rows, DELETE FROM table deletes all the rows.


Let me rephrase: Why isn't the default mode of the mysql command line client to diverge from the SQL standard in the name of safety, with a command line option to enable "100% standard" (aka unsafe) mode?

I understand the SQL standard, and I understand the behavior. Why is the behavior of the software defaulting to "fuck up your life" in deference to some book on a shelf? (Especially considering you could revert to "fuck up your life" mode easily with a flag.)


Ha, this is actually pretty funny in retrospect. There I was, giving a talk on all of this great stuff we do with our data, having no idea that a good portion of that data had just vanished. The universe has many potent mechanisms for keeping us humble.


Something like this happened to me several years back... I accidentally deleted a table full of emails out of a spam appliance when I thought I was in my development window. Thank my stars it happened at 5pm on a Friday afternoon; it took me the entire weekend to reconstruct the database. Ever since then I've set my production terminal backgrounds to bright red so I can't miss what window I'm in :)


Although direct database updates aren't recommended in SAP, for some strange reason one of our projects still needed one to fix a wrong entry made by a functional consultant. One of my fellow programmers was assigned this task, and this is what she did -

Update <table> set <field>=<value>

instead of

Update <table> set <field>=<value> where <condition>

The code was executed on the development server (thankfully) and it created a big mess. A full day of work by 5 people was lost. The project manager banned her from developing any critical programs and told her she would only write ABAP/4 reports from then on.


We always try to wrap documentation, procedure and ceremony around the occasionally necessary direct bulk update. Test in dev. Test in staging. All good. But inevitably there's that one-off special case; it's critical, there are immediate business implications, perhaps the hair of one's biggest customer is on fire. And it's late.

So this happens, and I devise a complex, clever, surgical, just-this-once query. Focused on the correctness of the cleverness, naturally I miss the elided WHERE clause deep within the SQL of a derived table. I run the query, and then spend several reflective hours restoring from backup and writing angry SQL to make things be as they were.

Having learned my lesson, I meticulously document the incident and the recovery steps. I'll certainly never make that mistake again, but I'll be magnanimous and help the next poor fool. A few months later I'm facing the same situation. This time, thank goodness, I have a clear process well documented. There's even big red warning text. A little copying and pasting of that clever SQL, execute the query, and -- what the...?!

Like a virus, the bad query with the missing WHERE clause had worked its way into my documentation of the event -- as the good code. It was convenient to have the restoration queries ready to go, shaving a bit of time off the two hour restoration process. Never assume you're not the next poor fool, I guess.


Sounds like a poor manager to me.


> Sounds like a poor manager to me.

Agreed, sounds like an absolutely abysmal manager. As long as the developer wasn't in the habit of making such mistakes, I would have put her in charge of any future similar changes.


Yes. He was a poor manager. He was actually a hard-core marketing/sales guy with limited people (developer) management skills.


Wait... how can hosing a development environment cause people to lose work? The entire point of the development environment is so that when (not if: when) you hose it, you don't experience much pain at all.


I've done this a few times in my career. I'm smart enough now to always back up before I make any potentially destructive change and to test everything on a staging server.

The worst was executing "DELETE FROM <table>" instead of "DELETE FROM <table> WHERE userid='X'" on a production server.


A buddy of mine showed me a cool trick to avoid this type of problem. What you do is first use SELECT to see the records in question:

   SELECT * from some_table where idx >= 5
Then, once you are staring at the records that came back (and are sure they are the ones you want to delete), change SELECT * to DELETE (and change nothing else) and rerun.

   DELETE from some_table where idx >= 5
Pretty hard to get it wrong with this approach ;-)


Or even better, SELECT COUNT(*) FROM some_table - that way the developer won't be tempted to skip the step because of too many records scrolling through their screen... not that I learned that from personal experience or anything ;-)


I once updated every user's password to "prettyflowers" or something dumb like that while trying to reset one customer's password.

I realized what I did the instant I hit enter, and all the blood drained from my face.

Thank god for backups.


If I were to re-imagine SQL I'd invert the sequence of the clause and statement for this very reason (FROM TABLE WHERE USERID="X" DELETE).

As it is, when I'm writing sensitive SQL to fix an issue, I end up writing the WHERE clause first as a SELECT, then issuing it... check the results, history up, ctrl-a, and replace the SELECT with a DELETE. Harrowing stuff.


I put in a deliberate syntax error first, and then edit in the actual delete or update query. This protects me from hitting enter too early.


In situations where I can't avoid it (I've come into more projects with no development database and no concept of the idea than I'd like to say), I've taught myself that starting any data modification on a live server always begins with START TRANSACTION.
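
Roughly, with a made-up table and values:

    START TRANSACTION;
    UPDATE accounts SET plan = 'free' WHERE id = 42;
    -- check the reported row count (and SELECT if in doubt), then:
    COMMIT;   -- or ROLLBACK;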


I think almost everybody has a story like this. When I was in grade 12, I got a job as the sysadmin of three Linux labs at my high school. I didn't have a lot of sysadmin experience, but part of grade 12 comp sci was to give students a chance to learn something.

One of my first Perl scripts (c'mon, it was 12 years ago) was to adjust something about user accounts. I don't remember what it was, just that the result was deleting the home folder of every single teacher at the school.

A very honest mistake.


"We had a backup from earlier in the week"

A week? That seems kind of off. I'd suggest setting up backups more often than that, especially for production data.

Sorry to hear about that. I destroyed a drive once a long time ago (~1994) as well. Happens to the best of us.


Because of this, we're going to back up parts of our database more frequently. A lot of our data isn't mission critical, so it wouldn't make sense to do daily backups for it. Other things, like family accounts, obviously are. We're going to back those up daily from here on out.


Is making a backup of everything that costly? My experience is you'll screw something up trying to be selective. With storage being so cheap these days, it's hardly worth the risk.

I don't use mongo so it's not 100% relevant, but I found setting up a postgres slave and running my backups from that is a nice solution. There's no additional load on my master and I just dump the backups into S3.


I'm not too paranoid but we have, at least, transaction logs, replicated backup server, nightly offsite backup of transaction logs, and twice monthly full backups. That covers a lot of failure cases and isn't too much to transfer. Transaction logs are a godsend -- you could have even found out what you did wrong.

Even non-mission critical data loss can be bad.


At work, our production databases are incrementally backed up every 2 hours with a full backup daily. The performance impact to the server is negligible and if little has changed the backup is small.


> A lot of our data isn't mission critical, so it wouldn't make sense to do daily backups for it

Disk is cheap enough for daily backups, and it's simpler to back up everything than to make a big project out of slicing and dicing your backup script.


A long time ago I once had a "rm -Rf ..." in a bash script. Missed a space, and yep, it deleted everything. Luckily the DB wasn't owned by the logged-in user so everything was recoverable. One of these "learning the hard way" examples of what NOT to put in a script (this may sound all very obvious but when you're in a hurry, really stupid mistakes do happen).


Postgresql has PITR backup. http://www.postgresql.org/docs/9.1/static/continuous-archivi...

This allows you to go back in time to an earlier snapshot of your data, without having to do a full database backup.

Unfortunately, it's still a pain to setup, even with tools like https://github.com/greg2ndQuadrant/repmgr. Wish there was more convention over configuration.

(I use Heroku's WAL-E to backup the data to S3. https://github.com/heroku/WAL-E)
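
For what it's worth, the archiving half is only a few lines of postgresql.conf (the archive destination below is invented); the fiddly parts are taking base backups and writing recovery.conf when you actually need to restore:

    wal_level = archive
    archive_mode = on
    archive_command = 'cp %p /mnt/backups/wal/%f'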


You should consider having a backup replica with slaveDelay set to an hour or so. If you ever fat finger something in the future you won't necessarily have to go all the way to your backups.
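
A rough sketch of adding one with pymongo, run against the primary (hostnames are invented; the same thing works from the mongo shell via rs.conf() and rs.reconfig()):

    from pymongo import Connection

    conn = Connection('primary.example.com', 27017)
    config = conn.local.system.replset.find_one()
    config['version'] += 1
    config['members'].append({
        '_id': max(m['_id'] for m in config['members']) + 1,
        'host': 'delayed1.example.com:27017',
        'priority': 0,       # delayed members must not be electable
        'hidden': True,      # keep application reads away from it
        'slaveDelay': 3600,  # stay an hour behind the primary
    })
    conn.admin.command('replSetReconfig', config)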


Version data, like source code.

  $ chd version mydb
  $ chd revert -d <txn_id> mydb
Neither replication nor backups protect from accidentally deleting data. And restoring from binary logs requires downtime.

http://chronicdb.com/blogs/undelete_from_whoops


Wowwww! Great to hear you had backups and could mostly restore your loss though!

You said you were working from the interactive shell, so you didn't have history? Now I always use IPython, which does keep a history (which I set to be very long). Isn't there some way you can configure the default Python interactive shell to keep a history as well? Sounds like that would be useful for all sorts of unexpected things. Otherwise, get IPython; it's really good, and it's still just a plain interactive shell, only with extra fancy features.

Glad to hear it turned out mostly all right though. I always feel for these data-loss stories because I can imagine what it'd be like ... that sinking feeling. Brrrr.


Python's default interactive shell has history, but it only keeps it for the current session. Once you kill it, the history is gone forever. Once I realized what I'd done, I checked to see if Python keeps the history anywhere. It doesn't, but I found several wrapper scripts that do (as does IPython).
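
For reference, the usual recipe for persistent history in the plain shell is a PYTHONSTARTUP file along these lines (the history file location is whatever you like):

    # ~/.pythonrc.py, with PYTHONSTARTUP pointing at it
    import atexit, os, readline

    histfile = os.path.expanduser('~/.python_history')
    try:
        readline.read_history_file(histfile)
    except IOError:
        pass
    atexit.register(readline.write_history_file, histfile)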


I added a SAN LUN to a volume that turned out to be in use, and was storing our only copy of a 500GB (sic) backup database tracking 60k+ tapes. After I laid out the partition on the volume, someone noticed that the database had disappeared, and we were able to revert the partition and get the data back. I ended up buying some pretty nice Cognac for shortening the life of a few folks. No idea why they weren't backing up their database though.


I try to always keep in mind that you can deal with any error except data loss. Any bug can be fixed but if data is gone there is absolutely nothing you can do. It's pretty easy to forget about keeping good backups because it's not particularly exciting. Testing regularly to make sure you can recover from backups will give you a lot of peace of mind.

With that in mind, I'm going to do a test restore from our backups right now!


Duplicate values that should be unique is such a common problem if you don't use constraints.

This is a good example of why ACID and declarative constraints are a very good idea for data management. You can still do a bad DELETE FROM query if you don't have the proper safeguards (though the querying ability of SQL lets you preview your changes more easily), but the initial corruption of the data is an easily avoidable problem.


OUCH.

Doesn't django let you check if models are valid without saving them?

In rails I can do:

    Family.all.select {|f| !f.valid?}


Django's ORM does (obj.full_clean()), but he's not using Django's ORM; he's using mongoengine, and I have no idea if it does.


mongoengine does do validation. It can be disabled on a per-model basis though.


Django offers a validate function that does exactly that. Unfortunately, the email uniqueness constraint is handled in our save function, due to some backend complexity. So it wouldn't have helped here. Plus, it was a typo that caused the problem, not the function :)


I'm struggling to think of what kind of backend complexity would make it infeasible to check email uniqueness in validation.

Care to elaborate?


> Unfortunately my screen session doesn’t have enough scrollback to see what I entered earlier in the day.

`defscrollback 100000` is probably the single most important thing in my .screenrc. And if you're working in production, you should really have `deflog on` too.


I wonder if it's possible that it was a MongoDB glitch and not user error. From what I've read, it's notorious for randomly losing records. Maybe the save loop triggered it?


I definitely don't attribute this to MongoDB (or MongoEngine or pymongo or Django). I haven't seen it lose any of our documents to date. Silently dropping documents when saving them without making any changes would be a huge problem. It's astronomically unlikely.


Maybe something in the Mongo-ORM layer went wrong, so you were iterating over empty models, and then saving them back. Could that have created the absences seen later? (The way you talk about your client automatically fixing things on the load/save cycle suggests you might have some magic there.)

It just seems hard to believe you confused 'save' and 'delete'...


I'm pretty sure nothing fishy went on behind the scenes. The client "automatically" fixes things through our API when it tries to log in because it will create an account if it doesn't exist.

I, too, have a hard time believing I confused "save" and "delete", but I'm reasonably sure that's what happened. I think I misused Python's shell history. I'll bet I hit the up arrow, which resurrected "family.delete()" instead of "family.save()".


I had butterflies in my stomach just from reading your post. Congrats on having the backup in place!


Good post! And a great quick fix.


Hey guys, I accidentally all of our data. Is this dangerous?


I originally titled this post "I Accidentally All Our Data", but my girlfriend convinced me to change it. I also considered "Delete ALL the Data", but I went for the meme-free title.


Who works with a prod database in an interactive interpreter? My eyes are literally wide in shock.


It's actually quite common in most Ruby deployments I've seen.


You never connected to your live database using `mysql` (pg, whatever) with a privileged user? Using the framework's shell is almost the same, except that all client-side validations work.


I do this from time to time. But I also type "BEGIN" before anything else.


Watch out for database locks if you do this.

I once opened a transaction, did an ALTER TABLE, and the site hung until I either committed or rolled back the transaction.


Yeah. Changing the schema is not something I'd do with code running against the database.

Also, my database typically refuses to start a transaction if the entire database is locked, and my application handles failing transactions by waiting a while and retrying.


Assuming you have tested the schema-altering deploy script and checked that the table-locking transaction doesn't take too long, there is no reason to take down the application. In my personal experience this is the case for virtually all of our table changes.


Why not?


Timeouts probably.

If you mean why is it a problem at all, it's usually because of database size. On any large-scale deployment (i.e. you have at least a million users) schema modifications will take hours. The only way to do reliable schema modifications is to have extra capacity and do it in stages. Also, your forward changes have to be backwards compatible. (AKA you're not allowed to both add and remove a column at the same time.)

The way to do it is to take some of your slaves out of the request pool and run the alter tables on them. You do this many times depending on your available capacity. (You probably can't just rip out half your slaves, you probably need to do at least 3 batches.) After you've altered all your slaves you can promote one to master and take the master offline to do its own alter. Then you push the code changes to production and add the old master back into the pool as a slave once it's done its alter.

In this scenario you need 3x the time the alter takes. So if the alter takes 6-7 hours (common in mysql if you have a large-ish table) it's going to take you at least 18 hours before you can push your code that depends on a database change.

Doing this manually at scale instead of an automated deployment process is extremely risky and will almost certainly be screwed up often.

This is one of the main reasons people are hoping schemaless databases work out in practice.


Agreed on backwards compatibility, but for application semantics, not for types of schema changes.

There's no reason to not be allowed to both add and remove a column at the same time, or to merge and split whole tables. In your example, these kinds of changes would not be possible.

There's also no reason to not be able to run old and new code at the same time, or to revert a schema change.

With ChronicDB we reduced schema changes to:

  $ chd change -f upgrade_map mydb
Schemaless databases don't solve this, just as an instantaneous ALTER TABLE won't solve this.


A lot of us are guilty of a quick data-munging hack at the command line from time to time, but the idea of altering the schema from a REPL makes me want to cry.


It depends. If you're working with a smaller database, a smaller team, and a smaller client base, there's not as much harm. There could possibly be more harm in having more ceremony for doing simple things.


I disagree with this. Making a script and testing it against a copy of the database is a good idea in 100% of cases. The only reason you don't do that is because you're lazy and stupid. (I've done this because I'm lazy and stupid.)

When you defeat a safety interlock system, expect to be injured.


There is no efficient safety interlock system when modifying data.

Even if you're confident your test passed, what happens when you later realize it hadn't, and at what cost?


No one said it'll work 100% of the time. But it most likely would have saved the OP a lot of pain. Just because something isn't a perfect solution, it doesn't mean it's not worth doing.


You'd be surprised. Ever worked in the trenches?



