Hacker News new | past | comments | ask | show | jobs | submit login
How to Corrupt an SQLite Database File (sqlite.org)
152 points by naftaliharris on Oct 5, 2013 | hide | past | favorite | 63 comments



The amount of engineering rigor that has gone into sqlite over the years boggles the mind. It's a project that has been held up as a sterling example of successful software development many times, rightfully so. I especially like this part of their license, from the top of sqlite3.h

"The author disclaims copyright to this source code. In place of a legal notice, here is a blessing: May you do good and not evil. May you find forgiveness for yourself and forgive others. May you share freely, never taking more than you give."


It really is rock solid, heres an outline of their testing: http://www.sqlite.org/testing.html


It was originally designed to run on guided missile destroyers in the US Navy. Not sure if it was ever used for that, but it explains the scrupulous attention to testing.


Replying to myself, there is a good discussion on Reddit about what that actually meant: http://www.reddit.com/r/programming/comments/1eyfk0/til_sqli....


> It was originally designed to run on guided missile destroyers in the US Navy.

Wouldn't that be in contrast to the license?


I dunno ... sharing frequently, never taking more than you give ... could be describing some kind of firefight ;)


That would be in contrast to the blessing, not the license.


If you are building a guided missle or the destroyer it launches from, you are doing so because you believe it to be good and not evil to have a strong military defense. And the best defense is a good offense or something like that, I don't really follow sports.


It wouldn't be in contrast to anything. Destroying a guided missile is probably an act of self defence, hence likely to be considered good by the author.


No. A "guided missile destroyer" is a type of ship that sends out guided missiles to hit targets, not the other way around.

More info: https://en.wikipedia.org/wiki/Destroyer


It was not intended for the weapons system, which is a completely different beasts. There are many other things for a destroyer (which is a type of ship) to do. Good discussion here: http://www.reddit.com/r/programming/comments/1eyfk0/til_sqli....


Unfortunately the FSF holds the view that a clause like that makes software "non-free". See http://www.gnu.org/licenses/license-list.en.html#JSON


It's a blessing, not an obligation or a demand or a condition of use.

Unlike the JSON license clause "The Software shall be used for Good, not Evil", the SQLite blessing doesn't put any restriction on your "freedom to run the program, for any purpose (freedom 0)" or on any of the other freedoms listed by the FSF.

It's no different from putting this in a library:

# This work is dedicated to the public domain.

# I hope you enjoy using it!

No one could argue that if you didn't enjoy using the library (maybe the API is a confusing mess) you would somehow be in violation of the license.

p.s. I upvoted your comment because I think you raised an interesting point!


> It's a blessing, not an obligation or a demand or a condition of use.

That's how we interpret it. The problem is that we don't know what the rights-holder has in mind, and what that entity's willing to sue over.

Lawyers can be comically risk-averse. Comical, that is, until you see the kinds of things people actually sue over.



The quote explicitly "disclaims copyright" and includes the blessing "in place of a legal notice". It would be truly absurd even for a lawyer to believe that the blessing constitutes a copyright license anyway.


See this 3 minute presentation fragment about the "Do not use JSLint for Evil" license clause: https://www.youtube.com/watch?v=-hCimLnIsDA


It's so idiotic. Who of those who do evil in my view, do agree that what they are doing is evil?


> That's how we interpret it.

No. That follows from the fact that the author gives up copyright. No copyright -> no license terms.


Fortunately we have the freedom to not care about what FSF thinks.


Fine, but the FSF doesn’t think that SQLite is non-free software http://directory.fsf.org/wiki/SQLite#tab=Overview


SQLite doesn't _mandate_ "do no evil".


Interesting, although the FSF example uses "shall", not "may". A lawyer might argue that only the prior is binding.


The FSF has a very narrow definition of freedom.


I wouldn't call this license free either. In many jurisdictions there is no concept of "public domain", in some you cannot even disclaim copyright at all. In those cases the work would be under copyright still, and in absence of a license permitting you to copy and change and distribute it, you can't (theoretically. no one's going to sue you).


The copyright page mentions that they will provide a commercial license through their employer if requested.


I'm not calling this license free. I'm just pointing out that the FSF's definition of free is a very specific one of many.


I know SQLite "does testing right" and they rightly get credit for this.

They also "do documentation right" and this is another accessible, clear example of that.

It's also all public domain. I've used their documentation as an example in the past when driving through documentation improvements and i'll no doubt point to them again!


I just wish they made their C test suite available as FOSS. Instead you are left with the significantly less portable TCL tests. The C test suite costs many thousands of dollars. Gotta earn a living though, right? =)


Am I missing something about this "do documentation right" as while I've got sqlite working a few different times, generally speaking it wasn't their docs but SO that actually enabled me to do so. As far as I remember it is definitely anti-noob. The very first page in the documentation repeatedly uses KiB instead of Kb for example. The first example is in TCL. The 2nd example is in C.

Not Java, not Python, not Ruby. TCL & C.

Documentation done right for a small percentage of programmers elbow deep in neckbeards maybe. Alienating and weird for the rest of them.

Still a fantastic product ofc!


This is a pretty weak argument. Using "KiB" instead of "KB" is precise, not "anti-noob". And I don't think that there's a serious reason for a C library to provide a non-C tutorial on the off chance that you don't know C and have to bind it to a language without libraries--if you're using almost anything modern, you already have a SQLite library and shouldn't need to do the binding yourself.

Even assuming the lack of a library, however, I still find the "but it's all in C!" argument unpersuasive. The idea that C is some deep "neckbeard" stuff is just silly. Part of understanding programming is understanding how your language of choice interacts with the lingua franca that is C, even if you don't know C yourself (which you really should, even if you're working in Ruby or Python or Java on a regular basis). So learn what you need to learn before using SQLite if you have to write your own bindings. That is not a big deal. If it "alienates" you, learn more. It's all out there for you.


Complaining that a C native library has its examples in C is like going to China and complaining that the signs are in Chinese.


Re: Tcl and C -- SQLite originated as a Tcl extension (written in C). Richard Hipp (the principle author) also used to sit on the Tcl Core Team and though doesn't officially fill that position anymore, is active in the Tcl community. Tcl has been (and is) a very important part of SQLite. This is why you'll see examples in Tcl, and the test-suite written in Tcl.

Hipp himself admits SQLite wouldn't have been possible without Tcl.


"As far as I remember it is definitely anti-noob."

If you are a noob, you probably shouldn't be delving directly into the SQLite documentation. You should be reading the documentation that explains the interface for your specific programming language. Python has sufficient documentation for SQLite.


Python has its own built in sqlite wrapper. Just import sqlite3 and then you're 3 lines away from creating the database, opening it and executing queries. Why would sqlite bother about documenting this when python already does it.

Besides, most languages have a predefined and fixed interface for opening databases of any kind and running queries on them. The only thing that differs is the connection string or the constructor of the database/connection/reader-object. For example in python this would be PEP249.


I understand that a lot of programmers start with Java, Python, or Javascript these days, but saying that C is for neckbeards? Are you for real?


Well i definitely would never hire you.


That's why the D in ACID confused me for a long time. There's no way to guarantee durability. They say "durable" but they mean "durable, kinda, we think, we hope, if nothing is lying to us, if your storage array doesn't lose power without a battery to flush the cache, if your system doesn't power down while your drives are lying about their committed write status, or if your write succeeded then an elf ran away with your platters."


In days of old from ACID was dreamt of, computers were built head to tail by giant engineering firms largely with three letter names. In those days, when you wrote to disk, it was on disk. If you handed it to a subsystem to write to disk, you could believe the answer because your people built that subsystem and it had better do what you wanted.

Now, computers are built with subsystems from a half dozen different companies, none of which can be bothered to fully document their product's behavior. Product life cycles are so fast it isn't worth working out what a device actually does, because by the time you do you either won't be able to buy it anymore, or someone's new iteration will be 20% cheaper/faster/bigger and you will get killed in the market if you don't switch to it.

PS: We were also happy to have a whole MIP of Vax equivalent (or identical) processing power, and I was slightly reknowned for my ability to code cleverly enough to get 42 disk IO/second out of our main disk drive, so I'd rather not go back, think you very much. I'll get along as best I can with the miraculous and inexpensive rubbish we build our systems from today.


> when you wrote to disk, it was on disk.

So, if the drives didn't lie about a flush that problem is solved. But if it's on disk at time t, that's no guarantee that you can read it back at time t+1. The drives can physically fail. These "days of old" are before my time, but I really doubt they had magic disks that never physically degraded.


The D is "durable", not "will survive the apocalypse along with the cockroaches". I've got plenty of things I'd describe as durable while being capable of breaking.


Right. I guess I was just trying to point out that most instances of an sqlite corruption are probably not going to be the drive lying about a flush, or an interrupted write. It's going to be a disk that gives you back different bits than you (successfully) put in.


Well it means the system tries the best it can do make the value durable. It will do a sync after every write perhaps. If the disk lies, well there is not much sqlite can do. If disk has a capacitor and battery maybe it can still lie that it wrote but keeps it in its internal buffer and if someone pulls the cord it has enough energy to flush everything it "lied about" to disk. Oh but what if you shoot a proton beam through it and blow away its capacitor and battery, well then you are out of luck.

As someone mentioned, this is contrast to say how MongoDB was shipping not too long ago. They had turned of any acknowledgement for writes. So doing a db write was more like a throw over the fence and pray operation. So it didn't even try to be durable, by design.


Sqlite guarentees Durability by waiting and checking the disk to make sure it is stored correctly before reporting a success. This default behavior is rather slow and can be turned off.

This is in stark contrast to something like MongoDB, which barely parses the request before reporting a success, and doesn't make any guarantees of when or even if it will ever save the data (though it usually does).


What happens if the disk or kernel lie. At both levels, I can see a potential performance enhancement by buffering changing while transparently acting as if they have actually been committed. My guess would be that hard-drives have explicit instruction(s) than ask for what actually happened; but I have never worked with this type of thing.


> What happens if the disk or kernel lie

My understanding is that the kernel (for Windows and Linux values of "kernel") will never lie to you. They will accurately report what the storage drivers told it, and the lying occurs at that level.

Storage drivers seem to be a big source of corruption for MSSQL [1]:

> The most common cause of database corruption (more than 95% of all corruption cases) that we in PSS encounter turn out to be caused by a platform issue, which is a layer below the SQL Server. The most common individual cause is a 3rd party driver or firmware bug.

And as well

> My guess would be that hard-drives have explicit instruction(s)

They do, but the problem is that the storage drivers lie about what actually happened. E.g. Basically they implement write caching for the "flush the write cache" instruction.

[1] http://blogs.msdn.com/b/suhde/archive/2009/04/08/introductio...


It's same story with PostgreSQL aka Postgres. "Almost all corruption issues are related to hardware problems, and most of the rest are due to documented-unsafe configurations like disabling fsync."

But of course doing that is up to the user. Using write-back caching with long sync interval and disabling fsync. It's really nice, gives better than SSD performance with regular HDD. Until you shut it down uncleanly, then you're screwed. But of course any sane person would use this only for temporary or other really non-important data which can be regenerated or lost without problems in such situation.

I'm using such configuration with ERP,BI/ETL (Extract, Transform, Load) tasks. When I start the task, I anyway drop and recreate any tables required for the task. SO I don't really mind if data gets corrupted. That's just life. Doing safe commits would make task very slow.

Only good question is how to balance smartly, in application data / caching, database engine caching and file system caching. In cases where database runs on same server as the processing application.


I'd say that this is a good task for tablespaces and intelligent disk partitioning. Set up a partition for the data you really don't care about with all the speedup options, and create a tablespace that points to it.



If you do happen to corrupt any SQLite database file, it can usually be repaired... unless it is on an Android phone and the app tries to re-open it before you can make a copy:

Android automatically deletes corrupt SQLite files!

http://stackoverflow.com/questions/7764943


I managed to corrupt the SQLite database that Lightroom uses last December. Around that time my old Windows system was having some hardware issues (random crashes) which led me to upgrade to a Mac Mini. However, after the upgrade, the problems started reoccurring and eventually Lightroom wouldn't read in the database at all and wouldn't repair it. I suspect that something bad happened during one of the system's random overhear or whatever shutdowns.

As it turned out, I was able to fix the problem by dumping the database and creating a new one. Oddly enough, Adobe's repair function apparently didn't try this fairly straightforward (once you knew to do it) repair process.


Great stuff and all but can I ask why SQL folks make it hard for an application to get simple read access to a SQLite database? Especially, I have an application which wants to use the Firefox places-bookmarks-etc db file. I can copy the file to a different directory and then run a query on it but can't open it in place for reading, for just reading. WTF is the problem with that? Is reading a file a way to corrupt it?


Use sqlite3_open_v2 with the SQLITE_OPEN_READONLY flag. It's been there for several releases now.


Try using pragma read_uncommited=1; For my stuff I use the sqlite3_backup api to "copy" the database into another, or in memory - then read from memory normally.


This also serves as a clear illustration of SQLite internals and DBMS internals in general without thick clots of rubbish jargon. Nice. If you're interested in database implementation this neatly covers a multitude of topics.


I can tell you one of the easiest ways to corrupt an SQLite db file is to not let a library specifically designed for file operations, handle file operations. And of course, not using provided libraries for db access (E.G. PDO if you're using PHP).

I ran a forum on SQLite a while back, and things worked extremely well as long as I didn't pretend I knew better than the library.

There's no good excuse to not use PDO and there hasn't been one for years now. Before that, I've implemented a write queue, which seems redundant in retrospect, in case the file lock issue came about, but it never did even though I did hit the write queue a few times.

As for crashes, periodic snapshots of the db and journals (that's very important) is usually the best way to avoid recovery issues.


I stumbled onto this article a couple of days ago when I was trying to find a library to integrate sqlite with a cloud service like dropbox. Does anyone know of one that has:

* Read

* Write

* Callback (to sort out paradoxes when two or more devices update the store independently offline and need to merge)

It's possible to do this with CoreData, but poorly, with a high burden on the developer to learn the entirety of Apple's APIs and no way to alert the user as to what it is doing under the hood, which causes the app to hang for minutes or even forever until the managed object context says it's ready.


This is one of the main reasons CouchBase was created. Its community has fragmented as the company tries to figure out how to make money, but it would be worth your time to check it out. Here the Xamarin folks explain their new CouchBase Lite wrapper: http://blog.xamarin.com/simplifying-persistence-with-a-docum...


This is just exactly what I was looking for, thank you! I especially like that it's document oriented and that replications/merges are handled in the simplest way possible, by presenting the most recent revision ID and giving the application the option to inspect previous revision IDs and use custom logic or ask the user which revision to use.

I think where they perhaps missed the boat is that nearly every app that needs to synchronize across devices needs this, but I'm having trouble finding a simple SAAS plan. When I was younger I was interested in hosting my own database but now I'm just not. I want to pay a few bucks and have someone else do it.

So on that note a possible startup idea is to host couchbase and charge for it. I think the core of the problem is that app sales are one-time, so a subscription model may not be appropriate. But the bandwidth will typically be so small that it won't matter. So that puts the total value per user maybe in the 25 cent range. How many million new users per year would it take to gross a million dollars.. I can see their dilemma. But, I think there is something to this. Whoever pulls it off could be the next dropbox but for databases. If this all clicks for someone, look me up!


Tcl (the test harness) at work.


The message I got was that POSIX advisory locking needs to be ripped out.


On seeing this first thought was "oh, another Ess Kyoo El person" because reading a "An See Kwul" instead of "A See Kwul" just grated. Pardon my attempt at phonetic spelling.


The official pronunciation of "SQLite" is "ess kyoo el ite". That's straight from the author.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: