A million inserts into a table a year is causing performance problems? Assuming 200 work days and all accesses being in a four hour period that is about one row inserted every three seconds. DBs are not my bag, baby, but that presumably should not be killing you.
At the day job, when we do a soft delete as defined here, we tend to create a view of the active rows in the table. Accessing through the view rather than the table prevents many of the "Whoopsie, missed a WHERE clause, now I'm summing over deleted records" errors. I'm told it also improves performance but take anything I say about DBs with a grain of salt.
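A minimal sketch of that view pattern using SQLite (table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount INTEGER NOT NULL,
        is_deleted INTEGER NOT NULL DEFAULT 0
    );
    -- The view hides soft-deleted rows, so queries through it
    -- can't forget the WHERE clause.
    CREATE VIEW active_orders AS
        SELECT id, amount FROM orders WHERE is_deleted = 0;
    INSERT INTO orders (amount, is_deleted) VALUES (100, 0), (250, 0), (999, 1);
""")
total = conn.execute("SELECT SUM(amount) FROM active_orders").fetchone()[0]
print(total)  # 350 -- the deleted row never enters the sum
```

Summing through `active_orders` instead of `orders` makes the "missed a WHERE clause" mistake impossible by construction.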
We use soft deletes in our system (primarily because we have to maintain all data), and we use views to retrieve results. The views also flatten the data a bit, joining appropriate tables, which again simplifies queries.
For example, in our job board application, we might store open, expired and filled listings using a table-per-class strategy: [image of some class hierarchy]
That ain't no man's soft delete, boy. You want to know what soft delete is? Well that's too bad, cuz there ain't no delete. Move from one table to another? What in tarnation?
Boy, you need yourself the sequence of actions that produce some kind of result. You don't store the result itself, y'all just go an' fuck it up or something later. "Posted a job offer for Tractor Operator at 2009-11-21.12:44:26." See, that there's an action. "Accepted job offer for Tractor Operator." There's another action. Now what's the status of that job? You get your list of actions, then figure it out. Yes, it's taken, you would have just injured yourself on that tractor anyway.
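Dialect aside, the idea above is to store actions and derive status by replaying them. A toy sketch (the action names and log shape are invented for illustration):

```python
from datetime import datetime

# Hypothetical action log for one job listing: you record what happened,
# never the current status itself.
actions = [
    ("posted",   datetime(2009, 11, 21, 12, 44, 26)),
    ("accepted", datetime(2009, 11, 23,  9,  5,  0)),
]

def current_status(actions):
    """Replay the actions in time order; the latest one determines status."""
    status = "unknown"
    for kind, _when in sorted(actions, key=lambda a: a[1]):
        if kind == "posted":
            status = "open"
        elif kind == "accepted":
            status = "filled"
        elif kind == "withdrawn":
            status = "closed"
    return status

print(current_status(actions))  # filled
```

Nothing is ever deleted or updated in place; "filled" is just a fact you compute from the history.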
Dingwall lumps a number of separate issues under "soft delete": undo, audit trails, soft create, and performance in the presence of historical data. He presents several solutions, not all of which I would buy.
The is_deleted column is a pretty simple solution that we all use, and there are a number of ways to solve the problem of retrieving only the active rows, such as views.
Audit trails and performance are more interesting. A while back I worked for a web analytics company and we had the problem of the storage and performance costs of historical data. Only 5% of the data was of any real interest, but the 95% that was historical made writes slow because of the large number of indexes. We adopted the solution of historical tables with fewer indexes on cheaper drives.
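A rough sketch of that archival pattern with SQLite (the schema and cutoff date are made up; in a real system the history table would also live on cheaper storage):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hits (id INTEGER PRIMARY KEY, day TEXT, url TEXT);
    CREATE INDEX idx_hits_day ON hits(day);
    CREATE INDEX idx_hits_url ON hits(url);
    -- Historical table: same shape, but no secondary indexes,
    -- so archived rows stop taxing every write.
    CREATE TABLE hits_history (id INTEGER PRIMARY KEY, day TEXT, url TEXT);
    INSERT INTO hits VALUES (1, '2009-01-01', '/a'), (2, '2009-11-01', '/b');
""")
# Periodic job: move everything older than the cutoff out of the hot table.
conn.execute("INSERT INTO hits_history SELECT * FROM hits WHERE day < '2009-06-01'")
conn.execute("DELETE FROM hits WHERE day < '2009-06-01'")
hot = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
cold = conn.execute("SELECT COUNT(*) FROM hits_history").fetchone()[0]
print(hot, cold)  # 1 1
```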
I like the solution of serializing historical, deleted, and audit data and storing them in a NonSQL database of your choice. Then you can bring them back as individual undos, or into a data mining database for scenario playing.
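Something like this, where the key-value dict is a stand-in for whatever document store you actually use (the function names and key scheme are invented):

```python
import json

# Stand-in for a NoSQL/document store; swap in Redis, Mongo, etc.
graveyard = {}

def soft_delete(table, row):
    """Serialize the row and park it outside the relational schema."""
    key = f"{table}:{row['id']}"
    graveyard[key] = json.dumps(row)

def undo(table, row_id):
    """Bring one deleted row back as a dict, ready to re-insert."""
    return json.loads(graveyard.pop(f"{table}:{row_id}"))

soft_delete("jobs", {"id": 7, "title": "Tractor Operator"})
restored = undo("jobs", 7)
print(restored["title"])  # Tractor Operator
```

The live tables stay lean, and the serialized copies can be replayed one at a time for undo or bulk-loaded into a mining database.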
I don't particularly like his suggestion of creating separate tables for each state of an element. I think that's needless complexity.
We were dealing with terabytes of data and tens of billions of writes per month. Most tables had 4 to 6 indexes, resulting in multiple updates for each write. If your data sets and indexes are huge there is little locality of data. Each row returned may have several disk hits. If you can keep your table sizes small enough that the indexes can remain in memory, performance is much better.
Even worse were the autogenerated queries that joined a dozen tables together. I saw some that were 1500 lines long.
Separating one table into multiple tables based on status has caused me a lot of pain over the years. Now I have multiple tables I need to join together more often than I would hope. Is this really the best way?
I read this article and the entire time I was thinking how easy this problem is solved with a good ORM, like in Django.
All I do is set up two managers: one the default, so in the admin I can browse my records without any filtering, and then a second one which I use exclusively in my views that simply adds .filter(hidden=False) to the get_query_set method. So simple, and I have never run into any problems.
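A dependency-free sketch of that two-manager pattern (this mimics what Django's managers do, without requiring Django; the class and field names are invented):

```python
class Manager:
    """Framework-free stand-in for a Django-style model manager."""
    def __init__(self, rows, only_visible=False):
        self.rows = rows
        self.only_visible = only_visible

    def all(self):
        # The filtered manager bakes the hidden=False filter into every query.
        if self.only_visible:
            return [r for r in self.rows if not r["hidden"]]
        return list(self.rows)

rows = [{"id": 1, "hidden": False}, {"id": 2, "hidden": True}]
objects = Manager(rows)                     # default manager: admin sees everything
visible = Manager(rows, only_visible=True)  # filtered manager: used in views

print(len(objects.all()), len(visible.all()))  # 2 1
```

View code only ever touches `visible`, so forgetting the filter isn't possible there, while the admin keeps full access through `objects`.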
So Richard basically claims that one of the big problems with soft delete is that, from now on, all queries against the table have to add a WHERE clause excluding soft-deleted rows; he claims this is some sort of tax.
Is he that ignorant of what a reasonable ORM (or at least, Ruby on Rails) will do for you? eg named_scope, etc...
Yes, he is. I could implement the soft-delete feature in the database, write one class (that sets up the queries chained off of it to be aware of deletions), and then substitute that class for the original via the dependency injection system. Total effort? 15 minutes. Total changes to the application logic and the code that actually queries the database? 0.
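A toy sketch of that substitution (the repository and method names are invented; the point is that `report`, standing in for the application logic, never changes):

```python
class Repo:
    """Original repository: returns every row."""
    def __init__(self, rows):
        self.rows = rows

    def find_all(self):
        return list(self.rows)

class SoftDeleteRepo(Repo):
    """Drop-in replacement that is aware of deletions; nothing else changes."""
    def find_all(self):
        return [r for r in super().find_all() if not r.get("deleted")]

def report(repo):
    # Application logic: untouched, it just asks whatever repo it was given.
    return [r["id"] for r in repo.find_all()]

rows = [{"id": 1}, {"id": 2, "deleted": True}]
print(report(Repo(rows)))            # [1, 2]
print(report(SoftDeleteRepo(rows)))  # [1]
```

Wiring `SoftDeleteRepo` in place of `Repo` is the one-line dependency-injection change; no query site needs editing.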
If you are writing PHP, where you hard-code a query and then print out HTML as you iterate over the result set, then sure, you're fucked. But not because of soft deletes.
That only works if everyone you work with religiously uses a single ORM and dynamic programming language, because it makes calls to that ORM the only (cumbersome) query language that gives correct results. It doesn't fix the ad hoc SQL query your marketing guys are pasting out of email from a developer who went to another startup last year (yes, I have seen this happen).
So your response basically boils down to "If you do everything else exactly right (and have since the application was designed), this won't bite you"? I'll agree that that's certainly true...
Even a decent ORM will only partially solve the problem.
For example, if you have an :active named_scope in rails that respects an object's active state, you still need to remember to do MyObject.active.find everywhere you want to exclude inactive results (meaning it's still just as easy to forget it). This got a little bit better with default_scope, but now you have the problem of trying to jump through hoops in the 1 or 2 cases where you do want to bring back inactive objects -- which, in my opinion, still imposes some sort of tax (albeit a slightly more readable one).
Granted, I'm still fairly new to Rails, so if there really is a way to just be able to do MyObject.find in all cases where I want to ignore inactive records and MyObject.include_inactive.find in the 1 or 2 edge cases where it's needed, then I will readily concede the absence of said tax.