DELETE is expensive at a fundamental level that we don’t think about much in computer science, because we are usually more worried about losing data. The article is about Postgres but it generalizes. We don’t actually have any computer science for DELETE-optimized databases. I’ve idly looked into delete-optimization in databases as a thought experiment, since there isn’t much in the way of literature on it, and it is far more difficult than I think many people intuit. The article is partly a manifestation of this reality.
I think the nature of DELETE is one of the more interesting open problems in computer science. It is one of those things that, when we require precision, turns out to be very difficult to define.
For example, in physics, the paradox of Maxwell's Demon is resolved when you consider the cost of deleting data:
"In 1982, Charles Bennett showed that, however well prepared, eventually the demon will run out of information storage space and must begin to erase the information it has previously gathered.[8][12] Erasing information is a thermodynamically irreversible process that increases the entropy of a system."
It is also difficult for humans to delete information. In my humble and only slightly facetious opinion, this is one of the main drivers behind the endless stream of new "To Do" apps: the existing app gets full because deleting is too hard, so we start fresh with a new app. The app isn't the point, the starting fresh is.
The underlying reason is that the cost of maintaining small (to-do) notes can be greater than the value of the note itself, which is one of the reasons we still use little scraps of paper and other mechanisms that effectively auto-delete.
Understanding the micronote lifecycle: improving mobile support for informal note taking
Every day I’d have more and more state accumulating on my machine - open apps, unsaved changes, open tabs. I’ve tried many methods for preventing this over the years, but the only truly effective solution, which I’ve been using for the last year, is a script I wrote that just quits every browser tab and every open app (leaving apps with unsaved changes still running) every evening. I wake up and the machine is new and fresh. It’s amazing and has benefited my productivity so much. It changes the state game: if I have a specific task I want to resume tomorrow, I have to make myself a clear note to read the next day, and if I have tabs open that I care about, I have to bookmark them. What I’ve found is that the actual info that needs to be transferred between days is very small.
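For what it's worth, a minimal sketch of what such a nightly reset can look like - this assumes macOS and AppleScript-able apps, and the app list is purely illustrative; schedule it via cron or launchd:

```python
import subprocess

# Apps to quit each evening; purely illustrative.
APPS_TO_QUIT = ["Google Chrome", "Safari", "Slack", "Preview"]

def quit_app(name: str) -> None:
    # AppleScript's `quit` asks the app to close politely, so anything with
    # unsaved changes will prompt (or stay open) instead of losing work.
    subprocess.run(["osascript", "-e", f'tell application "{name}" to quit'],
                   check=False)

if __name__ == "__main__":
    for app in APPS_TO_QUIT:
        quit_app(app)
```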
I have an endless reminders app and todo list. I wonder if something similar (items expire automatically unless you flag them as permanent or something) would help keep a clearer list. Sometimes ephemerality is best!
> Every day I’d have more and more state accumulation on my machine - open apps, unsaved changes, open tabs.
I resonate with your comment.
I grew up during a time when PC state was ephemeral (DOS days). Unsaved changes essentially meant lost data. Open apps were gone once the computer was shut down (and the computer was shut down daily - a 250W power supply had an expensive power draw; in contrast a Mac Mini only sips 1W when sleeping). This helped me develop a habit of keeping only necessary things open, bookmarking what I want to keep, and of habitually pressing Ctrl+S to save data. I never keep any tabs open (my browser loses all tabs upon close anyway), and my inbox is zero.
The cost I pay for this is context recovery -- every day, I have to essentially set everything up again. I do write notes or leave comments in code to remind myself where I left off, but I essentially start fresh. There is an upside to this, though: I start each day from an uncluttered slate, which leads to clarity in my head. When context is ephemeral, I'm less likely to be beholden to an existing state.
This actually helps me increase the quality of my writing and my code. It's akin to the heuristic of "throwing away your first draft and rewriting" to achieve higher quality. Write your code once, delete it, and write it again from scratch. The next version takes a quarter of the time to write but can be twice the quality, because you've prototyped it in your head and you are not bound to your crufty first attempt. There was an HN discussion on this a while ago.
If you save state, you can resume more quickly, which is great if you're on a roll, but it's also an accumulation of more clutter that blocks certain more creative thoughts.
I like the idea of a todo list that comes with a built in auto-delete.
You either do your to-dos, or it auto-deletes them for you. No worry about it getting full, but also some pressure to actually get them done or they'll be wiped.
And if you're happy they're wiped, then you probably didn't need to do it at all.
You can already do this with Apple’s native Shortcuts and Reminders apps.
You can set up an automation to delete all reminders after a fixed time has passed with the ‘Remove Reminders’ action.
Otherwise, you can just use cron jobs with a small Python snippet that parses your reminders, decides which ones (by label) to delete after X, Y, or Z days, and hits the APIs of most major todo apps like Todoist, Asana, etc. to delete those tasks. Heck, it works with your local markdown-based todo lists too.
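For the local-markdown case, a rough sketch of the auto-expiry idea - the file name, the `@added(YYYY-MM-DD)` date tag, and the `@keep` opt-out are all invented for illustration:

```python
import datetime
import pathlib
import re

TODO_FILE = pathlib.Path("todo.md")   # hypothetical local todo list
MAX_AGE_DAYS = 14
DATE_TAG = re.compile(r"@added\((\d{4}-\d{2}-\d{2})\)")

def still_fresh(line: str, today: datetime.date) -> bool:
    if "@keep" in line:                # explicit opt-out from expiry
        return True
    match = DATE_TAG.search(line)
    if not match:                      # undated lines are left alone
        return True
    added = datetime.date.fromisoformat(match.group(1))
    return (today - added).days <= MAX_AGE_DAYS

today = datetime.date.today()
lines = TODO_FILE.read_text().splitlines()
# Anything stale simply disappears; the list never "fills up".
TODO_FILE.write_text("\n".join(l for l in lines if still_fresh(l, today)) + "\n")
```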
This was a pretty nice feature for me in the now-defunct Arc browser. It simply deletes tabs that you don't interact with for 12 hours. If it's important I'll open it again!
This is the crux of what made Google Inbox so good. The UX encouraged "deleting" and starting fresh. This was made possible not just through the controls, but also through the promise that undelete would be possible. People want to start fresh, but they also don't want to lose anything; that's the conundrum.
What about LSM trees? Something like RocksDB is very efficient at deleting. A delete operation is a tiny write (a tombstone) and then the actual deletion is done via compaction in the background with entire tablets being freed at once when the live data is evacuated. It's actually most efficient when deleting large ranges - when just replacing data it's not so efficient due to the write amplification.
That said, I agree with you in general. An under-used technique is to simply encrypt everything and then use a key store that is guaranteed capable of deleting data. This makes it easy to comply with legal deletion requests even across backups, though of course, you need to manage the keys very carefully and ensure that they are also backed up.
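A toy sketch of that crypto-shredding idea, with in-memory dicts standing in for the key store and the database; in practice the key store would be a KMS/HSM that can guarantee key destruction:

```python
from cryptography.fernet import Fernet

key_store = {}       # stand-in for a KMS/HSM with guaranteed key deletion
encrypted_rows = {}  # stand-in for the database (and every backup of it)

def write_record(customer_id: str, plaintext: bytes) -> None:
    key = key_store.setdefault(customer_id, Fernet.generate_key())
    encrypted_rows.setdefault(customer_id, []).append(Fernet(key).encrypt(plaintext))

def read_records(customer_id: str) -> list:
    key = key_store[customer_id]
    return [Fernet(key).decrypt(token) for token in encrypted_rows.get(customer_id, [])]

def crypto_shred(customer_id: str) -> None:
    # Destroying only the key renders every copy of the ciphertext unreadable,
    # including copies sitting in old backups.
    del key_store[customer_id]

write_record("acme", b"order history ...")
crypto_shred("acme")
# read_records("acme") now raises KeyError: the data is effectively gone.
```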
Encryption that allows precise deletion of records in databases is quite famously pathological for performance and cost. Databases that work this way have existed for many decades and almost no one uses them because the performance is unacceptably terrible.
The reason may not be obvious. At the limit, database data structures and algorithms are a collection of data compression algorithms, though we typically don’t think of databases this way. Using encryption in front of that compression renders it worthless, and most database performance is predicated on the ability to use compressive representations of the data. Encrypting the data forces the most naive and inefficient data structures imaginable.
Yes, encrypted databases aren't that useful for fine-grained data, especially if you get into academic computable-encryption schemes.
For document DBs where you're storing files like big JSON structures or PDFs, or for archival data, it can work out. Though mostly it's not worth it because key management is too hard.
Well, tombstoning is fundamentally punting the operation: the data is still there, taking up space and computation, as long as the flagged entry has not actually been removed and query plans at various levels still have to account for it.
I agree that it meets the requirements for batched DELETE, and that's likely the best we can make it.
But I wonder if there is a better way. I know there are research DBs out there that experimented with reusing tombstone entries for new INSERT/UPDATE operations, but these suck when you want to do a batched INSERT/UPDATE on a range, since the entries are scattered all over the table and you lose ordering + monotonic properties.
The way tombstones work in a sorted KV store like RocksDB is that queries walk up the levels, and the moment a tombstone is hit the walk stops, because it's known the keys are deleted. Then when the levels are compacted the live data is evacuated into a new tablet, so it's like generational GC. The cost scales with live data, not how much data there is in total. The problem, of course, is that you pay that cost over and over again.
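A toy model of that read path and compaction, with plain dicts standing in for the levels; note that dropping the tombstone itself is only safe when compacting into the bottom level:

```python
TOMBSTONE = object()

levels = [
    {"user:42": TOMBSTONE},                  # newest level: the key was deleted
    {"user:42": "alice", "user:7": "bob"},   # older level still holds the old value
]

def get(key):
    # Walk newest to oldest; the first hit wins, and a tombstone means
    # "deleted", so the walk stops without touching older levels.
    for level in levels:
        if key in level:
            value = level[key]
            return None if value is TOMBSTONE else value
    return None

def compact_into_bottom(newer, older):
    # Live data is rewritten ("evacuated"); tombstoned keys are dropped
    # entirely, so the cost scales with what survives, not with history.
    merged = {**older, **newer}
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}

assert get("user:42") is None
assert get("user:7") == "bob"
levels = [compact_into_bottom(levels[0], levels[1])]
assert "user:42" not in levels[0]
```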
+1 This was our strategy at TempoIQ for our ts storage engine (built on top of rocks).
Very efficient and effective at managing tons of data ingestion (and deletion) at scale.
It's not an easy out-of-the-box tech to build on top of, though, when you have to build all the analytics/management pieces that something like PG gives you, so I get the lack of public examples.
There was a common thread through all of my algorithms and data structures class: Hash maps, b-trees, etc. are all beautiful structures until you add a delete operation and have to start dealing with all those little holes... Removing data complicates everything.
> We don’t actually have any computer science for DELETE optimized databases.
Depending on how much deleting and when, there might be engineering if not science for this.
If everything is deleted on a schedule, partitioned databases and dropping whole partitions as they expire is a well worn path. Soft delete and compaction also works pretty well if most, but not all things will be deleted. A generational garbage collection kind of thing.
As others said, fixed sized records are easier to manage deletion/replacement with, too.
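A sketch of the partition-dropping path, using Postgres declarative partitioning via psycopg2; the table, columns, and date ranges are made up for illustration:

```python
import psycopg2  # assumes a reachable Postgres and the invented `events` table

DDL = """
CREATE TABLE IF NOT EXISTS events (
    user_id  bigint,
    created  timestamptz NOT NULL,
    payload  jsonb
) PARTITION BY RANGE (created);

CREATE TABLE IF NOT EXISTS events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE IF NOT EXISTS events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    # Expiring January is a metadata operation: no per-row DELETE,
    # no table bloat, and nothing for vacuum to clean up afterwards.
    cur.execute("DROP TABLE events_2024_01;")
```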
It's absolutely true that we don't think about it much. When I was first taught balanced binary trees, deletion wasn't even a topic that needed to be learned. Same thing later when I was taught B-trees, and again with hash tables. It's an operation that's overlooked in CS education.
Generational GCs optimize for both. They assume that most objects die young, so they choose to relocate the live objects and just mark the entire evacuated region as empty. That is a very efficient way to delete data.
If someone dumped that project in my lap and said fix it (and if I were more used to low-level programming), I’d probably start by refreshing myself on the last 10+ years of GC advances since I stopped reading SIGPLAN, particularly multithreaded sweep. Essentially you want to decouple delete from free so you don't have to do 100% of the bookkeeping work in the middle of time-sensitive operations - but without falling as far behind as Postgres can with its vacuuming albatross.
In a way, deletion is a form of eventual consistency. The user loses access to the data but the system still knows about it for a while.
Just off the top of my head, I would think that for LSM systems you would resort to snapshotting as the edit history became much larger than the retained row count, and as you delete old snapshots you could compare the old and the new (two GC roots) and drop everything that didn’t survive. You only have to finish well before the next snapshot interval, and if you maintain a queue you only have to process them on average as fast as the snapshot interval.
And for BTree systems you can amortize the deletes across every insert, the way some realtime systems clean up a few free pointers on every allocation.
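A toy sketch of that amortize-on-insert idea (not any particular engine's implementation; `None` stands in for a tombstone, so this toy can't store `None` as a real value):

```python
class AmortizedTable:
    """Each insert also reclaims a couple of tombstoned slots, so cleanup
    cost is spread across writes instead of happening in one big pause."""

    CLEANUPS_PER_INSERT = 2

    def __init__(self):
        self.rows = {}        # key -> value, or None as a tombstone
        self.tombstones = []  # keys waiting to be physically removed

    def delete(self, key):
        if self.rows.get(key) is not None:   # ignore missing/already-deleted keys
            self.rows[key] = None
            self.tombstones.append(key)

    def insert(self, key, value):
        self.rows[key] = value
        for _ in range(min(self.CLEANUPS_PER_INSERT, len(self.tombstones))):
            dead = self.tombstones.pop()
            if self.rows.get(dead) is None:  # still a tombstone, not re-inserted since
                del self.rows[dead]
```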
> For example, deleting 1 million rows in a single transaction is a textbook case of what not to do. Instead, splitting the operation into smaller batches, such as deleting 10,000 rows across 100 iterations, is far more effective.
Why do I as a user have to do that? Why can't the database implement batching internally and automatically transform my 1-million-rows query into an appropriate list of batched queries?
(Edit: Thanks a lot for the answers, that makes more sense - in particular the point that this would also lock one million rows at once)
Before you execute the query, you should be able to query any of the data, and after you execute the query, none of the data should be available. The mechanisms that make a database transactional are tricky.
Some databases, like CockroachDB, provide built-in TTL capabilities.
But also - if you are having to delete huge ranges of data and do not care about consistency, you are probably looking at an analytical workload, and there are databases better suited for that, like ClickHouse.
As written, that's not required. The data should not be retrievable by query, but ACID only specifies what happens at the client-server boundary, not what happens at the server-storage boundary (Durability prescribes that the server must persist the data, but not how to persist it). A database that implements DELETEs by only tombstoning the row and doesn't discard the data until the next re-index or vacuum operation would still be ACID-compliant.
One reason I can think of is that the database needs to maintain atomicity and isolate effects of any given operation (the A and I in ACID).
By manually batching the deletes, you are telling the database that the whole operation does not need to be atomic and other operations can see partial updates of it as they run. The database wouldn't be able to do that for every large delete without breaking its guarantees.
I think that gp’s comment can be reinterpreted as: why should this landmine exist when databases could address the issue in an explicit way, rather than leaving it to a note in the manual, for example:
DELETE FROM t WHERE … BATCH 100
Which would simulate batched queries when called outside of a transaction. This would remove the need for a client to be connected (or at least active) for the duration of this lengthy operation.
If DELETE is so special, make special ways to manage it. Don’t offload what is your competence onto a clueless user; it’s a recipe for disaster. Replace DELETE with anything and it’s still true.
ALTER DATABASE d SET UNBATCHED DELETE LIMIT 500000
I know a guy (not me) who deleted rows from an OLTP table that served a country-level worth of clients and took it down for two days. That is completely the database’s fault. If its engine was designed properly for big data, it should have refused to do so on a table with gazillions of rows and suggested a proper way to do it.
Rather than batching, I would want a "NO ROLLBACK DELETE" sort of command. The real expensive part of the delete is rewriting the records into the transaction log so that a cancel or crash can undo the delete.
If you've gone to the effort of batching things, you are still writing out those records, you are just giving the db a chance to delete them from the log.
I'd like to save my ssds that heartache and instead allow the database to just delete.
In MSSQL in some extreme circumstances, we've partitioned our tables specifically so we can use the 'TRUNCATE TABLE' command as delete is just too expensive.
Yes the commercial databases make it easier to handle this.
One simple way in Oracle is to take a table lock, copy the data you want to preserve out to a temporary table, truncate the target table, copy the data back in.
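A Postgres-flavoured sketch of the same copy-out/truncate/copy-back move (in Postgres, TRUNCATE is transactional, so the whole swap can be a single transaction; the table and retention window are invented):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:  # `with conn:` commits on success, rolls back on error
    cur.execute("LOCK TABLE orders IN ACCESS EXCLUSIVE MODE;")
    cur.execute("""
        CREATE TEMP TABLE orders_keep ON COMMIT DROP AS
        SELECT * FROM orders WHERE created >= now() - interval '30 days';
    """)
    cur.execute("TRUNCATE orders;")  # cheap: no per-row delete, no per-row WAL
    cur.execute("INSERT INTO orders SELECT * FROM orders_keep;")
conn.close()
```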
So why does it need to be copied into the WAL until vacuum runs?
And vacuum is not expected or required to be atomic, since it deletes data that was necessarily unreferenced anyway, so it also shouldn't need to copy the old data into WAL files.
Many DBMSs with index-oriented storage (MySQL, Oracle, MSSQL) use undo logging for a transaction's MVCC, so that for deletion the old version is put into the undo log of that transaction and referred to as an old version of the record (or page, or ...), immediately cleaning up space on the page for new data while the transaction is still going on. This is great for short transactions and record updates, as a page only has to hold one tuple version at a time, but that comes at the cost of having to write the tuples that are being removed into a log, just in case the transaction needs to roll back.
The space isn't immediately cleaned up because of Postgres's version-based MVCC. It should only need to record that it marked the row as deleted, and the vacuum shouldn't need to record anything because it isn't atomic.
You kinda have that already for certain databases[1] with DELETE TOP 100. We have a few cleanup tasks that just run that in a loop until zero rows are affected.
That said, I agree it would be nice to have a DELETE BATCH option to make it even easier.
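Postgres has no DELETE ... LIMIT, but the same loop-until-zero pattern works with the ctid-subquery idiom; a rough sketch (the table and predicate are illustrative):

```python
import psycopg2

BATCH_SQL = """
DELETE FROM events
WHERE ctid IN (
    SELECT ctid FROM events
    WHERE created < now() - interval '90 days'
    LIMIT 10000
);
"""

conn = psycopg2.connect("dbname=app")
while True:
    with conn.cursor() as cur:
        cur.execute(BATCH_SQL)
        deleted = cur.rowcount
    conn.commit()        # each batch is a short transaction: small lock footprint,
    if deleted == 0:     # bounded WAL per commit, and vacuum can reclaim as you go
        break
conn.close()
```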
Because they mean different things. Deleting all records in a SINGLE transaction means something different than many smaller transactions. Even if for most use cases, you want the latter, it may be the case that you need to lock the DB down to perform the deletions in a single step.
And a follow up question: would the current best way to handle this be to "mark records as deletable" and then do the batched deletion operations when convenient?
Create a column called MarkedForDeletion. Create a job that starts in the off hours and detects how many locks are present on the table; if the count is low, delete X records, else wait Y minutes. Put this in a loop, and if an error is detected, break out of the loop.
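A rough sketch of that job against Postgres - the table and column names, the lock threshold, and the sleep interval are all placeholders:

```python
import time
import psycopg2

LOCK_COUNT_SQL = """
SELECT count(*) FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE c.relname = 'records';
"""
REAP_SQL = """
DELETE FROM records
WHERE id IN (SELECT id FROM records WHERE marked_for_deletion LIMIT 10000);
"""

conn = psycopg2.connect("dbname=app")
try:
    while True:
        with conn.cursor() as cur:
            cur.execute(LOCK_COUNT_SQL)
            if cur.fetchone()[0] > 5:   # table looks busy: back off and retry
                conn.rollback()
                time.sleep(600)
                continue
            cur.execute(REAP_SQL)       # delete a batch while contention is low
            done = cur.rowcount == 0
        conn.commit()
        if done:
            break
except psycopg2.Error:
    conn.rollback()                     # on error, break out and wait for the next run
finally:
    conn.close()
```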
You're probably thinking of the --safe-updates option [1] for the `mysql` CLI, also available as the memorable alias --i-am-a-dummy. This requires UPDATE and DELETE to have either a WHERE clause or a LIMIT clause. Under the hood, the command-line client option just manipulates the sql_safe_updates session variable [2] to enforce the UPDATE and DELETE requirement, as well as a couple other unrelated variables to prevent overly-huge SELECTs.
It's not enabled out of the box, but some companies do override their configuration to enable it for new sessions; iirc Facebook did this.
That just wouldn’t fly where you have business customers; many insist on data deletion at the end of contracts. In practice, though, partitioning or using separate databases can be a better strategy for dealing with that, as otherwise dealing with backups is challenging.
Sybase SQLAnywhere as well, and not unlikely MSSQL too, given their shared ancestry.
Delete with WHERE is sufficiently slow in MSSQL we have to do batched deletes, but I can't recall offhand if that holds for whole table deletion as well.
That is why you always get in the habit of wrapping your stuff in “BEGIN TRANSACTION”. Then if and when you fuck up you can issue a rollback and be all good.
> Unlike DELETEs, UPDATEs don’t trigger cascaded actions - they only involve triggers that are explicitly defined.
That's not entirely true. ON UPDATE CASCADE is a thing for foreign keys, at least in PostgreSQL (which this article is talking about), meaning that the foreign row referencing the row gets updated. Though, personally, I would never use ON UPDATE CASCADE, as it seems kind of funky.
> That's not entirely true. ON UPDATE CASCADE is a thing for foreign keys, at least in PostgreSQL (which this article is talking about), meaning that the foreign row referencing the row gets updated.
For this to work with soft-deletes, wouldn't you have to include the soft-delete column in a compound foreign key? That sounds funky indeed.
While I overall agree, there have been cases where I have found it handy.
In one case, we had a table of invoices and a table of goods items, and each goods item should point to an invoice.
If one wants to use the natural key, invoice number, and have a foreign key to ensure the goods items can't point to an invoice that doesn't exist, then ON UPDATE CASCADE was needed in case the user had to change the invoice number (due to mistyping it or similar).
Of course, if one does not use a natural key then this isn't such an issue.
Another case was where we had to merge databases after two companies merged. If they had overlapping primary key ranges, then the easiest fix was to switch to ON UPDATE CASCADE and then renumber the existing rows before inserting the other rows.
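A minimal sketch of the natural-key case above (invented names), where fixing a mistyped invoice number cascades to the line items:

```python
import psycopg2

SCHEMA = """
CREATE TABLE invoices (
    invoice_no text PRIMARY KEY
);
CREATE TABLE goods_items (
    item_id    bigserial PRIMARY KEY,
    invoice_no text NOT NULL REFERENCES invoices (invoice_no)
               ON UPDATE CASCADE   -- correcting the invoice number follows through
               ON DELETE RESTRICT
);
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(SCHEMA)
    cur.execute("INSERT INTO invoices VALUES ('INV-1001');")
    cur.execute("INSERT INTO goods_items (invoice_no) VALUES ('INV-1001');")
    # The fix cascades: goods_items.invoice_no becomes 'INV-1010' automatically.
    cur.execute("UPDATE invoices SET invoice_no = 'INV-1010' WHERE invoice_no = 'INV-1001';")
```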
It does if your key is an auto-increment or random unique identifier. But if you had a key that is also data, say a "Genre" column, it starts to make sense that you'd want to cascade updates.
Personally, I like to be explicit and in control. In the application layer, I may be far away (at least mentally speaking) from the constraints in the database, and if I update/delete something, I don't want it to "magically" cascade through the database. For those reasons, I always prefer RESTRICT, both for ON DELETE and ON UPDATE. This forces me to clean up before I make the actual change I'm interested in, and anyone reading the code later, can uncover that behaviour quickly.
That being said, I can see the arguments for ON DELETE CASCADE, particularly in a codebase where there is a lot of functionality in the database itself (in the form of stored procedures, and what have you), since you are always mentally closer to the action. But ON UPDATE CASCADE feels weird, because why are you updating the primary key (which is usually what foreign keys reference) of your rows? That feels like something that needs a good explanation.
Though, I do recognise, you need to jump through a lot of hoops to modify your primary key values with ON UPDATE RESTRICT, because you basically need to cover all your bases in a large convoluted common table expression (depending on the number of foreign keys, of course), when an ON UPDATE CASCADE would do that for you. But I'd also rather be blocked from updating primary key values entirely, since it feels like the wrong thing to do. (Yes, I know that foreign keys don't have to reference primary keys, and there may be niche cases for this, but personally, I'd just avoid it altogether.)
> if I update/delete something, I don't want it to "magically" cascade through the database.
This is the big reason I don't like triggers either. I would use them in only one case: if the database didn't support an auto-incrementing "identity" type, I might use a trigger to simulate that. But just as often I would prefer a stored procedure that got the new ID from a sequence, and then did the insert, especially if there were other things I needed a stored procedure to deal with.
The first case I was thinking about was a merger of two accounts. Second case would be a db-wide update of one or more keys for some reason. That’s why I tend to leave it ON UPDATE CASCADE.
Although both of these cases are more diffuse muscle memory than rational foresight.
For many of the most painful deletion questions, the root problem is that when the software was first made the stakeholders/product-org didn't think about use-cases for deleting things. At best, they assume a "do not show" property can be placed onto things, which falls apart when you get to legal issues that compel actual removal.
That depends on the context. In some cases you're not allowed to physically delete because you need to provide an audit trail for 10 years or so. You could move those rows into separate audit tables instead. However, that requirement should not come as a surprise.
But, just like this article's use of «physically deleted» when in practice that's not the case (the bits are just freed to be overwritten some unknown amount of time later), does legal compliance just completely ignore this question of actual physical deletion?*
(AFAIK it takes several passes of overwriting bits with random bits on magneto-mechanical storage to not be able to retrieve any significant fragments of the original data, and things are even worse on transistor storage, which casually makes copies of the data for wear leveling reasons.)
*With the exception of state secrets of course, where we know that storage is mechanically destroyed «with extreme prejudice».
> software was first made the stakeholders/product-org
Practically everything we build, physical or software, is made for a purpose first; the process is mainly an obstacle that needs to be minimized. A piece of software is trying to solve a problem just like a door is. It's driven by economics, where the recipients don't want to pay any more than they need to and someone is always willing to undercut you by cutting corners.
They are difficult, but the cost of trying to keep consistency with the alternatives shouldn’t be underestimated either.
I sometimes think that people ask the wrong question on this sort of thing - rather than thinking “what technical solution should I come up with” you should be thinking “what is the business requirement here, and what consistency guarantees are needed?”
In many cases you want to soft delete anyway and mark rows as stale rather than deleting them wholesale. Cascade deletes need to be very carefully thought about, as while they’re very handy they can be quite destructive if the relationships are not mapped out.
Personally having spent some time now in the microservices hole, I miss all the power SQL databases give you for this sort of thing. I think everyone should spend some time reading and digesting ‘Designing Data Intensive Applications’ and evaluating the trade offs in detail.
My only recommendation would be, no matter which strategy you go with, cover it with tests to make sure the right information stays and the right information gets physically (or marked) deleted, and that data marked for deletion is invisible to the UI except via admin access.
But, indeed, proper deletion is surprisingly difficult, especially when you consider cascades on a complex table containing many defined foreign-key relationships.
Why do DBs perform DELETE operations online? Wouldn’t it be better to soft-delete (at the table-space level) and then run a scheduled task to clean up the table spaces?
Similar to git. When you “delete” files they are just removed from the tree. It isn’t until later, when all refs and reflog entries have expired and gc is run, that the objects are actually removed.
Most databases I used have a Status column we could mark as active, inactive, or deleted. That way, you can see what records were marked as deleted and change them back in case of accidental deletion.
Keep record retention with a Date_Modified column so you can use a SQL DELETE to remove those deleted records that are older than a year or so.
I do something similar, but instead keep a "date_deleted" column null by default, and the "active" column as a boolean.
That way, I kill two birds with one stone: I have a dedicated column for when the deletion happened (instead of updating a record that is supposedly deleted), and the status is just a boolean instead of some enum, integer, or string.
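A sketch of that layout, plus a partial index so live-row queries don't pay for the dead rows, and a retention sweep; the table name and one-year window are illustrative:

```python
import psycopg2

SCHEMA = """
CREATE TABLE notes (
    id         bigserial PRIMARY KEY,
    body       text NOT NULL,
    active     boolean NOT NULL DEFAULT true,
    deleted_at timestamptz              -- NULL while the row is live
);
-- Partial index: the common "live rows only" queries never touch soft-deleted rows.
CREATE INDEX notes_live_idx ON notes (id) WHERE deleted_at IS NULL;
"""

SOFT_DELETE = "UPDATE notes SET active = false, deleted_at = now() WHERE id = %s;"
UNDELETE    = "UPDATE notes SET active = true,  deleted_at = NULL  WHERE id = %s;"
PURGE       = "DELETE FROM notes WHERE deleted_at < now() - interval '1 year';"

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(SCHEMA)
    cur.execute("INSERT INTO notes (body) VALUES ('draft') RETURNING id;")
    note_id = cur.fetchone()[0]
    cur.execute(SOFT_DELETE, (note_id,))   # reversible via UNDELETE until the sweep runs
    cur.execute(PURGE)                     # hard-delete anything soft-deleted > 1 year ago
```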
This is a “soft delete”. As the author notes, depending on the nature of the data being stored, a soft delete does not meet the requirements of many data privacy laws and compliance regulations (like GDPR’s right to erasure).
I've had cases where the rows in question absolutely could not be hard deleted, because of legacy foreign key relations. But the PII in those rows had to go. So we did a kind of "firm delete" by setting all columns (except the PK and a few necessary flags) to their default values and/or null.
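A sketch of that kind of "firm delete" - the table, columns, and flag are invented; the row and its foreign-key relations survive, the personal data does not:

```python
import psycopg2

SCRUB_SQL = """
UPDATE customers
SET name        = NULL,
    email       = NULL,
    phone       = NULL,
    address     = NULL,
    is_scrubbed = true   -- so downstream code knows why the fields are empty
WHERE id = %s;
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(SCRUB_SQL, (42,))
```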
One solution for performance degradation with soft deletes is to partition the table by some field like `created`, monthly. The main downside is that queries then need to include `created`.
Depends on the workload. In the past, I've worked on several workflow-based systems that performed lots of OLTP operations to drive live workflows forward, but once a workflow was done the operational data became a lot less interesting. So there it made sense to (say) partition the operational data and table by month, and roll off partitions after 3-6 months.
If data isn’t actually removed until vacuuming, then are systems that perform SQL DELETES actually GDPR compliant? Because technically the private data is still there on disk and could be recovered. “Until the autovacuum process or a manual VACUUM operation reclaims the space, the “deleted” data remains.”
Even vacuuming wouldn't actually destroy the data right? Because filesystems don't guarantee they will overwrite or wipe any particular disk blocks. And even if they did, SSDs still wouldn't promise that the blocks aren't remapped instead of being wiped & reused.
> Because filesystems don't guarantee they will overwrite or wipe any particular disk blocks.
Some filesystems have a richer interface to the underlying storage device, allowing them to invoke commands such as ATA TRIM or SCSI UNMAP - either incrementally as blocks are freed, or on demand - which request that the underlying storage device forget the block contents.
So the necessary interfaces exist and are widely available, and even if imperfect they improve the situation.
> Some filesystems have a richer interface to the underlying storage device, allowing them to invoke commands such as ATA TRIM or SCSI UNMAP
No, that's not a guarantee of data erasure. Not just because it's just a request that the device can disregard, but also because filesystems play tricks (like storing small bits of data inline, or logging data in various places, etc.) and they don't clear all those blocks just because you wanted to clear a couple bytes.
Yea, the only way to be sure that data is gone is through mechanical destruction (shredding) of the drives. Sometimes you can write something to an SSD and then not be able to delete it due to a hardware fault, but the data can still be read.
I wonder if a GDPR nation has made a ruling on the extent of data erasure? Surely you cannot expect a company to shred a SSD every time someone asks for their data to be deleted.
With our current understanding of physics you cannot destroy information outside of maybe throwing something in a black hole — and even then you may still be able to get the information back from hawking radiation after many eons — so the question is how much should we scramble information before it is considered “deleted”?
> I wonder if a GDPR nation has made a ruling on the extent of data erasure?
My understanding (based on a couple random conversations, take it with a grain of salt) is that at least some entities are taking the sheer difficulty of true compliance with the letter of the law to imply that softer deletion methods have to be reasonably acceptable, and their stance is basically "if you disagree, well, take us to court and we'll figure it out."
GDPR supervisory authorities disagree on what to do about data in backups. France has said you don't have to delete data from backups. The Danish authorities have said that you have to delete from backups where it is technically possible. The UK (which still has GDPR) has said that you must put the data "beyond use" which most have taken to mean that you have to make sure that if you ever restore data from the backup you will omit the data that is supposed to be forgotten.
I don't know what other supervisor authorities have said--those three are just the ones that tend to show up when searching on this topic. I would not be surprised if there are at least a dozen other different opinions from the rest.
Yes. GDPR allows for delays when complying with deletion requests. You should ideally document it and factor the delay into any deadlines you might be bound to.
You’d need to make sure the process is somewhat predictable, like running the vacuum on a set schedule so you know for sure what maximum amount of time a deletion request will take.
If vacuum runs at least once per day, seems pretty GDPR compliant to me. Even if it runs once every two or three days.
Now if your database runs vacuum once every 6 months, yeah, DELETE might not actually be a delete. But is it really a GDPR issue? What's really going on in this system?
I don't think any EU agency is going to fine your company if the data you say you deleted survived 6 or even 60 hours after deletion, if that is the end of it.