If you haven't tried SQLite, please do. For years I ignored SQLite and used MySQL (it does the job), but once you see how fast SQLite is, and the advantages of having a DB contained in a single file... just go play around with SQLite instead of ignoring it for years like me. It's neat.
Counterpoint: I over-used SQLite because it was the first database I encountered and spent waaaay longer working around its shortcomings than I eventually spent porting to postgres.
Long version: I couldn't get bulk insert performance above absolutely miserable levels. I tried tricks like deleting and recreating indices but without luck. The perf tooling wasn't there to quickly figure out where the problem was (this was 10 years ago, not sure if things have improved) so I wound up building a version of SQLite with debug symbols and profiling it with a C profiler. The problem turned out to be a default setting that made spill-to-disk very aggressive and basically guaranteed that any workflow like mine would grind along with miserable slowness and no outward indication of what to do about it. I found an email thread where someone in effectively the same situation made some constructive suggestions and got turned away on the principle that even casual users ought to just know performance knobs like this one. Yikes. I am probably munging some of the details, but it made me angry enough to learn postgres and port my code over despite having a fix for my immediate problem.
Like everyone else, I'm going to offer tuning tips even though that very much wasn't your point :-)
What you are describing sounds almost exactly like PRAGMA synchronous = FULL [1] (which is the default). That pragma controls when fsync occurs. Depending on your application, you might have got away with NORMAL or even OFF. Again depending on your application, you could have set journal_mode = OFF or increased the mmap_size (both also discussed on that page). Yes, those are fairly magical hacks, but synchronous and journal_mode at least are always worth considering for a new SQLite database (and before building it with debugging symbols!).
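For reference, these pragmas are set per connection; here's a minimal sketch using Python's sqlite3 (the file name and mmap size are illustrative, not recommendations):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tuned.db")  # any writable path works
conn = sqlite3.connect(path)

# NORMAL fsyncs far less often than the default FULL; OFF is faster still
# but risks losing the most recent transactions on power failure.
conn.execute("PRAGMA synchronous = NORMAL")
# WAL is usually the fastest journal mode that still protects the file;
# journal_mode = OFF drops crash safety entirely.
conn.execute("PRAGMA journal_mode = WAL")
# Let SQLite memory-map up to 256 MiB of the database file.
conn.execute("PRAGMA mmap_size = 268435456")
```

Which combination is safe depends entirely on how much durability your application can give up.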
Even without those tweaks, the really key thing is to use fairly large transactions. I'm surprised that alone wouldn't have got you decent performance.
One option you didn't have at the time but might help today is write-ahead mode [2] with journal_mode = WAL (but still presumably not as fast as journal_mode = OFF!). I believe the only reason it isn't enabled by default is backwards compatibility. According to that article, it was introduced on 2010-07-21 and improved to better handle large transactions (>100MB) on 2016-02-15.
I've been there. It doesn't work. I applied all those tips and tricks; they didn't improve performance much, and I lost data along the way.
Sqlite is full of locks, and any writes are single-threaded stop-the-world, and there's no way around it. It's Sqlite's philosophy. Write-heavy databases just need something like MVCC, and Sqlite won't have that.
I think you make a point which is very true and valid but not relevant to this particular comment thread.
You are talking about multiple different processes/threads heavily and concurrently writing to a database. In that case, you're absolutely right, the point has come to switch to a client/server database like PostgreSQL or MySQL.
But the parent comment was not about that (or at least they didn't explicitly mention concurrency, and their mention of "a default setting that made spill-to-disk very aggressive" rather than locks suggests that concurrency wasn't the problem). They seemed to be talking about a single writer inserting at a high rate, which is something I'd expect SQLite to cope with very well given the right tricks (mostly batching multiple inserts into transactions; I acknowledge their point that the defaults are unfortunate). Yes, even with a single process there are locks, but an uncontended lock is not normally a problem.
There are disadvantages too. It does not implement SQL; queries need to be built manually on top of various indices in the tables. The DB is much more than a single file. The API is way more complicated than SQLite's. The databases are portable from older to newer versions of Windows with automatic upgrades, but not the other way.
The JET database continually corrupts itself. It’s terrible.
The Windows search functionality uses this database and the only way Microsoft was able to make this feature somewhat reliable is by having it run very thorough checks at startup and tossing the database and reindexing at the first hint of trouble.
I don’t know why it is so terribly unreliable but I know it is.
Interesting, I have opposite experience. Once I had lots of experience with AD and Exchange. Later I shipped desktop software based on ESENT, and supported it for a few years. Was generally happy with the technology.
Are you sure these issues are caused by the DB engine? Other possibilities include your PC (like interference from crappy AV software), or Microsoft's code in these search indexing services.
This happens on large numbers of PCs, and I don't think the way Microsoft uses its own database engine should be an excuse for it corrupting its data.
I think the difference is that Windows Search is always running, including when the computer crashes, is turned off unexpectedly or doesn’t complete waking up from sleep or hibernation. I think JET just isn’t that robust against that. It’s quite conceivable people manually closed your software when they shut down their computer.
I remember Exchange failing the same way if its database disks suddenly disappear due to network issues.
Microsoft is large and software quality varies. For instance, look at Skype: they failed to use their own GUI frameworks and are using Electron, i.e. Google Chrome, to paint a few controls.
> I think JET just isn’t that robust against that.
Some of them are just the unavoidable hardware failures, for others they suggest ‘Deploy the OS on server-class hardware’. Or the always helpful ‘restore from backup’. Not quite reasonable for a database containing a search index on a consumer device.
> Not quite reasonable for a database containing a search index on a consumer device.
Before I switched to MS Outlook, I was using Windows live mail (now discontinued) as an e-mail client for a decade or so, it used ESENT for everything.
When I run Process Explorer and search for "esent.dll", it finds a dozen system services using ESENT databases, many of them critical, like CryptSvc.
I try to buy good hardware, but not server-grade components. I don't use ECC RAM or a UPS, and I suffer brief power outages a couple of times a year. If ESENT corrupted databases when the power went out suddenly, I would have noticed.
> Sqlite is full of locks, and any writes are single-threaded stop-the-world, and there's no way around it. It's Sqlite's philosophy.
SQLite is not the silver bullet of the DB world, but it's extremely useful in certain scenarios. Need a DB for a desktop app? Check. Want a portable file to carry a complex data structure? Check. Want to test your webapp during development, or provide a single-user version for users? Check.
Concurrent-writing, multi-user, multi-producer scenarios need something bigger: MySQL Embedded, MySQL, Postgres, MSSQL, Oracle... the list goes on.
SQLite makes databases accessible and useful in many more scenarios. I hated databases until I found SQLite, because I simply didn't see the point of running a big server designed to handle much bigger data sets just to store 250KB of text tables.
I think they are referring to an OLTP-type scenario where you have lots of inserts and reads, but the reads only look at a couple of rows at a time, often only a single row. MVCC is good for such cases because you can avoid the overhead of taking lots of locks but still get fine-grained concurrency control.
You are right that the Sqlite approach actually works quite well for bulk operations since you only require a single lock. However, it's usually still better for bulk inserts without updates to use fine grained locking or MVCC since you can often avoid acquiring any locks at all beyond the basic ones guarding fundamental DB data structures (these aren't locks as far as SQL is concerned since they cannot cause deadlock, it's a big pet peeve of mine when people think lock-free = no use of mutexes).
As a side note, don't do the following pattern: "BEGIN TRAN; INSERT INTO T ...; SELECT max(id) FROM T; COMMIT;". I used to do this, but this is a very bad habit that may be incorrect (assuming that id is an autoincrementing column). It's only correct on systems with true serializability, when you have opted into full serializability, and where such systems consider autoincrementing IDs to be part of serializability. When I tested this, Postgres and MSSQL handled this as expected while MySQL allowed the select to return a different row. I just tested Sqlite, and it does seem to work there regardless of WAL since it only allows concurrent readers plus a single writer. Use last_insert_rowid() or the equivalent for your database[1].
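To make the safe pattern concrete, here's a quick sketch in Python's sqlite3 (the table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

cur = conn.execute("INSERT INTO t (name) VALUES ('alice')")
# last_insert_rowid() is scoped to this connection, so unlike
# SELECT max(id) it cannot pick up a concurrent writer's row.
rowid = conn.execute("SELECT last_insert_rowid()").fetchone()[0]
assert rowid == cur.lastrowid  # the driver exposes the same value directly
```

Most drivers expose this as a cursor attribute (`lastrowid` here), so you rarely need the SQL function at all.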
> As a side note, don't do the following pattern: "BEGIN TRAN; INSERT INTO T ...; SELECT max(id) FROM T; COMMIT;". I used to do this, but this is a very bad habit that may be incorrect (assuming that id is an autoincrementing column). It's only correct on systems with true serializability, when you have opted into full serializability, and where such systems consider autoincrementing IDs to be part of serializability.
And even when it works, it'll create a lot of unnecessary conflicts.
> Use last_insert_rowid() or the equivalent for your database[1].
See what you need to do is stand up a Redis cache cluster in front of your app, and then a cassandra cluster to cache the writes and compact them, before synchronizing the data with your single-page SQLite datastore as the master data.
Obviously for your SQL queries we crank up a large SparkSQL cluster.
One trick of `pragma journal_mode = WAL` is that even with synchronous normal, SQLite syncs when you close the last connection [1]. So re-using a connection is key to getting the full benefits if you're executing commands from a single thread/process.
Out of curiosity, just how many rows were you trying to insert, for this to be a problem? My memory is a bit fuzzy, but on SQLite even standard INSERT statements can scale to hundreds of thousands per second, if you do them in one transaction. Just curious about the scenario here.
This is quite a generalised point, so it's difficult to really respond to, but some frameworks do provide a mechanism to check whether a transaction has been closed. The stateless nature of web development also makes working with transactions easier, as you have a clear entry and exit point for each request. However, I appreciate not all uses of RDBMSs are stateless, let alone web applications. But a similar approach can sometimes be applied to stateful applications, where you can carve up the transactional operations into logical code paths that are easier to mentally visualise. As always though, it really depends on how your code is structured and the specific problems you're trying to solve.
Where you know code boundaries are an issue I've found functional designs tend to work a little better than OOP with regards to managing transactions but a lot of that could just be down to how my brain is wired (while I'm not a FP evangelist I do tend to favour breaking code down to stateless functions rather than stateful classes).
It's fair to say spaghetti code will be a problem on most reasonably mature code bases but there are approaches and frameworks that help somewhat with managing transactions across code boundaries -- just as there are tools that make working with transactions harder. But in my experience there are much harder problems to solve than working with transactions.
> but it's not like they are a zero-overhead feature.
Is there such a thing as a zero-overhead feature? (I say this semi-flippantly.)
I mean, I have to deal with this crap at $dayjob, but I genuinely can't believe the terrible code I see that borrows a resource (connection, transaction, file handle) and then gives it back only on the happy path. I desperately wish languages would make it a compile error when not all code paths lead to the resource being freed.
The only thing that should ever stop you from returning a resource is a malicious scheduler.
I can see the obvious parallels between memory-management lifetimes and transaction management. Does Rust have explicit features for extending lifetimes to resources besides memory?
My point is that using transactions should be drilled into people who work on databases and where consistency is a requirement because transactions turn multiple complex SQL requests into one atomic operation:
I've never worked at a place that doesn't use them. What if the 3rd insert in a series fails, and the 4th update consequently writes down the wrong thing?
That's just my experience so I guess it may be meaningless but I'm surprised to hear it may not be often used.
I've seen all kinds of wonders during my life so this one is no surprise. If however someone is doing stupid things it is their problem. They're free to complain to themselves.
Tangentially - it may be their problem, but I hesitate to say it's their fault. I'm continually dismayed at how spotty and superficial education about how to use an RDBMS can be. Even in formal education on the subject.
It does not take a PhD or rocket science to figure out that some operations must be bunched together and executed as a single unit that succeeds or fails as a whole. That would come in as a business requirement. For a curious person, it would not take much searching to discover and read about transactions.
You hardly ever need transactions if you track validity explicitly in your schema.
Suppose T1 has a 1-many relationship with T2. Declare in your assumptions that any rows in T2 with no corresponding valid row in T1 are not valid.
Additionally, have an is_valid field on T1 so selecting all valid data from T2 is done with "select * from T2 inner join T1 on T2.t1id = T1.id where T1.is_valid".
To insert data, insert a row into T1 first but initially have is_valid be false. Then insert all necessary data into T2. Finally, do an update and change the original row in T1 to have is_valid be true.
For deletions to T1, just do an update and set is_valid to false. Thanks to the validity logic, this has the effect of also invalidating all T2 rows.
Updates are trickier, but you can allow them to work without transactions by having two ids for every table. The first id is the one we worked with before, used for joins. The second id is used by applications to look up explicit records. Therefore, never do any updates aside from the one setting is_valid to true (which is really storage logic and not application logic). Instead, just insert a new row into T1 whenever you want to update something in T1. The final update now just needs to flip the is_valid bit for the old row and the new row, and will also need to verify that the old row is valid, as well as any other rows the current update relies on (basically, turn it into a complex CAS).
All of this is pretty messy, but it does let you have CRUD without any transaction support from your DB. Also, even if your DB has transactions, this scheme has the advantage of being lock-free so your application cannot deadlock.
If you have many updates/deletes, you can do garbage collection either by allowing the GC to use a transaction or by adding a check on insertions to T1 that verifies the number of associated rows in T2 before setting is_valid.
Unfortunately, while updates and inserts with GC can still be lock-free, they are not wait-free since an insert or update can fail. If you never do updates or GC though, this is actually wait-free and guarantees that every create, read, and delete operation will succeed in the absence of hardware/network failures.
Still, this overhead probably isn't worth it unless you already need to track the history explicitly for auditing or something. At the company where we used this, we didn't have an is_valid column; we had valid_from and valid_to, which were timestamps.
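A minimal sketch of the insert path described above, with hypothetical tables T1/T2, in Python's sqlite3 (schema and payloads are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (id INTEGER PRIMARY KEY, is_valid INTEGER NOT NULL DEFAULT 0);
CREATE TABLE T2 (id INTEGER PRIMARY KEY, t1id INTEGER NOT NULL, payload TEXT);
""")

# 1. Insert the parent row as not-yet-valid.
t1id = conn.execute("INSERT INTO T1 (is_valid) VALUES (0)").lastrowid
# 2. Insert all child rows; readers ignore them because T1.is_valid = 0.
conn.execute("INSERT INTO T2 (t1id, payload) VALUES (?, 'a')", (t1id,))
conn.execute("INSERT INTO T2 (t1id, payload) VALUES (?, 'b')", (t1id,))
# 3. Flip the validity bit: a single-row UPDATE is atomic on its own,
#    so both children become visible at once, with no transaction needed.
conn.execute("UPDATE T1 SET is_valid = 1 WHERE id = ?", (t1id,))

rows = conn.execute(
    "SELECT T2.payload FROM T2 JOIN T1 ON T2.t1id = T1.id WHERE T1.is_valid = 1"
).fetchall()
```

The whole trick is that step 3 is the only write anyone else can observe, and it is a single atomic statement.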
Same as @mikeyjk! Can you point toward more information on this topic? The applications I'm dealing with in my day job are constantly plagued by deadlocks in the database surrounding bulk transactions. I've theorized that a "T1 1-many T2" refactoring of the tables, like what you've described, might be a solution, but haven't settled on how to implement it. Would absolutely love to learn more on this esoteric, hard-to-google topic.
All of this sounds like a lot of additional work just to provide some of the functionality transactions provide out of the box. And I say "some" because you've not addressed rollbacks.
That said, it's still an interesting thought experiment so glad you shared.
Wow, this has broken my brain but has really piqued my interest - thanks. Do you have any reading on this topic / strategy to help a mouth breather like me out?
Another trick, if you have multiple processes (users) accessing the DB and you don't want to lock the DB for a long time, is to insert into a temp table (possibly within a transaction, although the time savings are not as dramatic with temp tables), then copy the temp table into the main one (INSERT INTO ... SELECT ... FROM ...). The advantage: say you are reading in a bunch of items from somewhere else; that takes a chunk of time beyond just the DB time. By going through a temp table you aren't locking the target table for anyone else while gathering the data. Combine this with flushing the temp table every X rows or X seconds, and you get a stream of efficient updates to the table without long lock times.
Also have WAL mode on to get multi-user access going.
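The staging-then-flush pattern looks roughly like this in Python's sqlite3 (table names, batch size, and row count are all made up):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "staging_demo.db")  # throwaway db file
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("CREATE TABLE items (x INTEGER)")
# Temp tables live in a separate per-connection schema, so filling one
# doesn't contend with other writers on the main database.
conn.execute("CREATE TEMP TABLE staging (x INTEGER)")

def flush():
    # The write lock on the main database is held only for this short copy.
    with conn:  # BEGIN ... COMMIT
        conn.execute("INSERT INTO items SELECT x FROM staging")
        conn.execute("DELETE FROM staging")

for i in range(10_000):
    conn.execute("INSERT INTO staging (x) VALUES (?)", (i,))
    if i % 1_000 == 999:  # flush every 1000 rows (or on a timer)
        flush()
flush()  # flush any remainder
```

In a real multi-process setup the slow producer (parsing, network, compression) runs between flushes, which is exactly when you want the target table unlocked.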
If you have wal-mode enabled then the automatic locks are table level. So one process can be updating one table and another one can work on a different one. Also, you only need to enable wal mode on the DB once (pragma journal_mode=wal), it "sticks" for each connection.
In my application that uses SQLite (Snebu backup), as data comes in (as a TAR format stream) I have one process extracting the data and metadata, then serializing the metadata to another process that owns the DB connection. This process dumps the metadata to a temp table, then every 10 seconds "flushes" the metadata to the various tables that it needs to go to. This way I can easily have multiple backups going simultaneously, as each process spends a small amount of time (relatively) flushing the data to the permanent tables, and a greater part of the time compressing and writing backup data to the disk vault directory.
I've been working with this for the past 8 years or so, and have picked up a few tricks on keeping as much as possible batched up in transactions, but also keeping the transaction times short relative to other operations. So far seems to work out fairly well.
Note that in addition to journal_mode=wal, you need a busy handler defined that retries transactions indefinitely, with a 250 ms delay between retries.
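A fully custom retry-forever handler needs the C API's sqlite3_busy_handler, but from Python the built-in timeout-based busy handler is the usual substitute; a sketch (the path and timeout are arbitrary):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared.db")
# The `timeout` argument installs SQLite's built-in busy handler, which
# retries for up to this many seconds before raising "database is locked".
conn = sqlite3.connect(path, timeout=30.0)
conn.execute("PRAGMA journal_mode = WAL")
# The same knob, in milliseconds, via a pragma:
conn.execute("PRAGMA busy_timeout = 30000")
```

It backs off rather than polling at a fixed 250 ms, but for most workloads that distinction doesn't matter.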
Edit: On further review of the docs, I'm not sure if wal mode enables table-level locking, it may be that when writing to a temp table, that temp tables are part of a separate schema (or are otherwise separate from the main DB) -- which makes sense, as temp tables are only visible to the process that owns them. So a temp table can be locked in a transaction, while the rest of the DB is writable.
It seems like the locking is page-level in "wal" and "wal2" modes:
> Usually, SQLite allows at most one writer to proceed concurrently. The BEGIN CONCURRENT enhancement allows multiple writers to process write transactions simultaneously if the database is in "wal" or "wal2" mode, although the system still serializes COMMIT commands.
> When a write-transaction is opened with "BEGIN CONCURRENT", actually locking the database is deferred until a COMMIT is executed. This means that any number of transactions started with BEGIN CONCURRENT may proceed concurrently. The system uses optimistic page-level-locking to prevent conflicting concurrent transactions from being committed.
> When a BEGIN CONCURRENT transaction is committed, the system checks whether or not any of the database pages that the transaction has read have been modified since the BEGIN CONCURRENT was opened. In other words - it asks if the transaction being committed operates on a different set of data than all other concurrently executing transactions. If the answer is "yes, this transaction did not read or modify any data modified by any concurrent transaction", then the transaction is committed as normal. Otherwise, if the transaction does conflict, it cannot be committed and an SQLITE_BUSY_SNAPSHOT error is returned. At this point, all the client can do is ROLLBACK the transaction.
The page also mentions:
> The key to maximizing concurrency using BEGIN CONCURRENT is to ensure that there are a large number of non-conflicting transactions. In SQLite, each table and each index is stored as a separate b-tree, each of which is distributed over a discrete set of database pages. This means that:
> Two transactions that write to different sets of tables never conflict
+1. I was not aware of the 'stickiness' of the wal pragma. Thank you for the tip. I will be defaulting to this going forward. For a quick confirmation, I used:
The code is on github.com/derekp7/snebu if you want to take a look. The next task on my list is to write up developer / contributor documentation on how the whole thing is put together along with various tricks (and simplified demonstration code).
But specifically, look in "snebu-main.c": that is where the opendb function is (so you can see the pragma statements), and there is a busy_retry function that gets referenced (all it does is sleep for .1 seconds). I believe you don't need the busy-retry function if you use the built-in busy handler, but I'm not really sure and don't want to take a chance and break working code.
For the temp tables, look in snebu-submitfiles.c -- the function at the top handles the DB operations, one towards the bottom handles the non-DB tar file consumption operations, and there is a circular queue in the middle to handle buffering so the data ingestion can keep going while the data is getting flushed (these three run as separate processes). I should learn threads, as there may be more flexibility in that, but not comfortable enough with thread programming yet.
I think parent means that for some careful selection of N (where N > 1) insertions per bulk transaction you can scale up to hundreds of thousands of insertions per second, rather than putting hundreds of thousands of insertions in a single transaction.
No, the opposite. In SQLite, starting and ending transactions that write to the db is a relatively expensive operation, and running queries outside a transaction is effectively the same as running each of them in an independent transaction.
If you need to do a lot of inserts (or updates, etc), the slowest possible way to do them is to do them outside of a transaction. The fastest way to do them is to wrap them all into a single transaction.
> If you need to do a lot of inserts (or updates, etc), the slowest possible way to do them is to do them outside of a transaction. The fastest way to do them is to wrap them all into a single transaction.
This doesn't seem to contradict the comment you're replying to. They're suggesting wrapping operations into transactions in batches e.g. (just making some numbers up) if you have 100,000 inserts maybe you'd do 100 transactions of 1000 inserts each. I wouldn't call that "the opposite" of your one mega-transaction suggestion. In fact I'd expect it to still have most or all of the speed benefit of using one single transaction, or potentially even be slightly faster.
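A sketch of that batching pattern in Python's sqlite3 (row count and batch size are made up; tune them for the workload):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

rows = [(i,) for i in range(100_000)]
BATCH = 1_000

# One transaction per batch: nearly all of the speed of one huge
# transaction, without holding the write lock for the whole load.
for start in range(0, len(rows), BATCH):
    with conn:  # BEGIN ... COMMIT around each batch
        conn.executemany(
            "INSERT INTO t (x) VALUES (?)", rows[start:start + BATCH]
        )
```

The worst case is the implicit one-transaction-per-statement you get with no explicit transaction at all; anything batched is orders of magnitude faster.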
Oh fascinating, you actually put hundreds of thousands of statements in a single SQLite transaction in an online CRUD app (as opposed to offline processing)? I've never done more than a couple hundred and even then usually they're "logically batched," both because I'm worried about forcing unnecessary read to write transaction promotions for concurrent reads and thereby increasing busy errors, but also because that affects durability to have a transaction open that long (it's not great to let your HTTP response hang for a second before responding as you keep your transaction open).
For serialized writers in any system I'm sure keeping a transaction open as long as possible is the ideal case for throughput, but there's other problems with that in a CRUD app no?
Well, that was my original question to the parent poster, too. You can achieve 100k+/sec, but they went looking for better performance, apparently.
But to answer your question at face value, one common use for SQLite is as a file format for complex desktop apps, or perhaps a game save format for certain types of games (mostly the non-realtime types). One great advantage to this is that if you use a DB migration library, you ensure backwards compatibility with previous versions of your saved files. However, it's easy to imagine a fair bit of data getting inserted into such a new file each time it is created. It might only be 5000 rows, but I'd prefer to not have to wait 5 seconds just to persist it, if a single change can make it 0.05 seconds.
My main app syncs data across ERPs, and its main use case is batch loading of data. This means I need to nearly mirror a SQL Server/Oracle/COBOL/Firebase/etc. database into SQLite, clean it, then upload it to PostgreSQL.
I have more trouble fast-loading into PG than into SQLite (not saying I didn't have trouble with SQLite in the past!), and SQLite is very, very fast for me.
I was building a small Rails project that is mostly reads, but here and there it has inserts that can involve 1000+ entries with related models. Things seemed fine when clicking around, but when I used JMeter to test what the server could handle, I found it terrible with simultaneous requests.
Adding two workers made things even worse, and this is with ~10% writes and the remaining going to reads.
I quickly changed the db to postgres for comparison purposes, and from 3-4 requests per second with sqlite, it jumped to 60-70 as it did scale linearly with the number of workers.
I am by no means an expert in database optimisations, but out of the box this behaviour somewhat limited the usability of SQLite for a webapp.
Postgres is pretty bad at plain bulk INSERTs. You need to use COPY (feeding it an in-memory CSV stream) or mess around with unnest() to get decent performance.
`COPY` yes, that's what bulk insert is. How does having the feature make Postgres bad at it?
"In-memory CSV": no. You don't need to create an in-memory CSV file to load in bulk. Your language binding should provide a way to stream data into a `COPY` operation.
I can't imagine how unnest is related to bulk loading. It's a query feature to unnest arrays.
The relationship between bulk loading and unnest is as follows.
A single insert of an array using an unnest can insert many rows. The performance is worse than copy to, but in the same ballpark. But there are use cases where you'd like to bulk load through a stored procedure for a variety of reasons, and now calling the procedure with an array and using unnest internally is a straight win.
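In case it helps, the unnest() shape looks roughly like this (the items table and array parameters are hypothetical; the client binds one array per column):

```sql
-- A single statement inserts the whole batch: each bound array becomes
-- a column, and unnest() expands the arrays into a row set.
INSERT INTO items (id, name)
SELECT * FROM unnest($1::int[], $2::text[]);
```

Inside a stored procedure the arrays arrive as parameters and the same SELECT FROM unnest(...) does the work.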
Thanks for sharing! Did you have performance benefits using PostgreSQL out of the box, or did you have to tune specific settings? How much effort did you have to spend to do this migration?
If you like SQLite, then DuckDB[1] is probably worth looking at. Very similar in many ways, but DuckDB is column-oriented rather than row-oriented, so it does have some performance advantages for analytics. It is quite new, so I might not go all in for mission-critical production yet, but it is worth exploring for analytics work.
Question for people using DuckDB: are the use cases similar to what you'd use Apache Arrow for, but with the benefit of working in SQL, or are they meaningfully different?
I'm not currently using any of those, mind you, still on a pandas/dask* dataframe basis, but I'm trying to wrap my head around where the ecosystem is moving
*I know Dask is already using Arrow behind the scenes
I don't use Apache Arrow, so I'm in a poor position to compare it with DuckDB.
My use case for DuckDB is effectively querying R dataframes with SQL. DuckDB has the functionality to register virtual tables, with data from existing dataframes.
As I know SQL reasonably well, using DuckDB to query dataframes means I don't need to learn a bunch of new dplyr verbs or data.table constructs. There are some other R packages which also support this use case - sqldf and tidyquery are two I am aware of. Both of these follow a different approach, where they parse the SQL query. Using a DuckDB virtual table lets the database handle all of the SQL.
I've found so far that through using DuckDB, performance is much better than sqldf and tidyquery, nowhere near as quick as data.table and can be quicker than dplyr, depending on query complexity. I haven't really looked at anything approaching big data sizes though.
> Flexible typing is considered a feature of SQLite, not a bug. Nevertheless, we recognize that this feature does sometimes cause confusion and pain for developers who are accustomed to working with other databases that are more judgmental with regard to data types. In retrospect, perhaps it would have been better if SQLite had merely implemented an ANY datatype so that developers could explicitly state when they wanted to use flexible typing, rather than making flexible typing the default. But that is not something that can be changed now without breaking the millions of applications and trillions of database files that already use SQLite's flexible typing feature.
It is possible to create CHECK constraints that do the type checking. Actually, if they shipped some predefined ones and said "if you want strong type checking, do this" (say, syntax to automatically populate the table columns with type-check constraints), it would IMO be perfectly backward compatible.
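To show what that looks like today, here's a sketch of hand-written type-checking constraints via typeof(), driven from Python's sqlite3 (the users table is made up; note that SQLite 3.37+ also offers STRICT tables for this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Emulate strong typing with CHECK constraints on typeof().
conn.execute("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT    CHECK (typeof(name) = 'text'),
        age  INTEGER CHECK (typeof(age) = 'integer')
    )
""")

conn.execute("INSERT INTO users (name, age) VALUES ('alice', 30)")  # passes

rejected = False
try:
    # 'old' can't be coerced to an integer, so typeof(age) stays 'text'
    # and the CHECK constraint rejects the row.
    conn.execute("INSERT INTO users (name, age) VALUES ('bob', 'old')")
except sqlite3.IntegrityError:
    rejected = True
```

One caveat: column affinity still silently coerces values it can convert (inserting the string '30' into age passes), so this is strictness about storage class, not about what the client sent.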
When I started a project with SQLite, the available data types struck me as odd (coming from MySQL), but as I learned I started asking the question: are there any more fundamental data types other than null, int, real, text and blob? For example dates are just a facade for an integer of some kind, JSON is really just text adhering to certain formatting rules, booleans are usually stored as some kind of byte anyway so why not drop that abstraction?
With this limited set of datatypes it really makes you think harder about the data you are processing, because in the end all your data is one of these types anyway.
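Those five storage classes are easy to see directly with typeof():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every SQLite value is one of exactly five storage classes.
row = conn.execute(
    "SELECT typeof(NULL), typeof(1), typeof(1.5), typeof('x'), typeof(x'00')"
).fetchone()
print(row)  # ('null', 'integer', 'real', 'text', 'blob')
```

Everything else (dates, booleans, JSON) is a convention layered on top of these.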
I don't really see it as such a problem. When interacting with the database, you either use an ORM with your strongly-typed business objects, or you write the SQL and use parameters (again strongly typed).
> When interacting with the database you either use an ORM with your strongly-typed business objects
If you have strongly-typed business objects, then why not have strongly typed storage? If your code can enforce the constraints that a correct data type ensures, then why have "strongly typed business objects" to begin with? Why not store everything in your code as strings as well?
I see this misconception all the time. The database (and its data) lives way longer than most applications. And it's also wrong to assume that there is always only one application accessing the database. Bulk loads are a typical case of secondary applications.
Not choosing the proper data type in a relational database is a really bad decision, and we see questions on Stack Overflow and similar sites on a weekly (if not daily) basis asking how to fix invalid data in those "un-typed" columns.
I'm not saying it's useless to have type constraints in a database. Just that with a bit of care it's not such a big deal.
In the end there will always be some business rules that are not constrained by the database, so you always have to be a bit careful about what you store in it. Indeed, not being careful and hoping that your types, constraints and triggers are going to save you is riskier.
> Not choosing the proper data type in a relational database is a really bad decision
Well, then rejoice: you can't make this bad decision in SQLite, because everything is more or less a number or a string.
SQLite just does what JavaScript and other weakly typed languages do. The developer is responsible for what they input. If type checks are a must, SQLite is just not for you, easy as that.
I don't see how having "NaN" or "undefined" in a number field makes any difference to just putting the value there. If you are using a strongly typed language, this will not be a problem.
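The behavior being debated is easy to see for yourself (a quick illustration via Python's sqlite3; the table name is invented): a column declared INTEGER has only an integer *affinity*, not an integer constraint, so text that can't be coerced is simply stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")  # INTEGER is an affinity, not a constraint
conn.execute("INSERT INTO t VALUES (?)", ("not a number",))  # happily accepted
row = conn.execute("SELECT n, typeof(n) FROM t").fetchone()
print(row)  # ('not a number', 'text') despite the INTEGER declaration
```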
SQLite made a design choice in favor of simplicity. It's also missing basic date functions altogether. The only way to compare dates is by using the Unix epoch.
SQLite's ISO-8601 date strings were carefully designed so that collation order is identical to temporal order. Which is convenient!
If you need interval logic, though, SQLite won't help you, and epoch is the better choice. It's possible to solve some queries with a regex, but you won't love it.
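The collation-equals-temporal-order property is easy to check (a small sketch in Python's sqlite3; table and values are made up): a plain lexicographic ORDER BY on ISO-8601 text is also chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (happened TEXT)")  # ISO-8601 date strings
conn.executemany("INSERT INTO events VALUES (?)",
                 [("2021-12-31",), ("2021-02-01",), ("2021-11-05",)])
# No date type involved: ordinary string collation sorts these chronologically
rows = [r[0] for r in conn.execute("SELECT happened FROM events ORDER BY happened")]
print(rows)  # ['2021-02-01', '2021-11-05', '2021-12-31']
```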
Chill out. This is a design choice to make it more viable to be used in dynamically typed languages. It's maybe not the design choice everyone would make, but it's not, like, insane: this is how all variables in Python/JavaScript/Ruby/whatever work.
I mean, you can represent everything under the sun as a string (or a very big int) but that's a level of operating where I find my brain starts to get fatigued for absolutely no real gain at all.
Yes, of course. That's why I said I'm sure we could get it to work. But then it has to be converted to kg for actual usage, either directly in the queries or in views.
Having a fixed-point decimal type allows us to not think about it once the table is created.
Same issue with money. Most of the time we need to store monetary values with two decimal places (cents).
What kind of application are you developing where there is so little middleware between the database storage layer and the user that converting to/from kilos adds significant additional complexity, but your users aren't technical enough to just deal with having weights in grams?
These are electronic versions of official forms. The user interface and reports should show what is actually being sent or has been sent.
Sure, we could deal with it, but there are quite a number of tables due to the different forms and messages, and then there are all the reports. Many custom ones, thanks to local officials wanting data from a certain customer in a certain way...
Bureaucratic forms have a tendency to change unpredictably. Today they want weight in kilograms, tomorrow they'll want it in grams, and the day after that in standardized intergalactic weight units. I get what you're saying (and I don't like writing unnecessary mapping boilerplate too), but there should be a middle layer to protect the application from craziness like that.
Thankfully, things in our sector are fairly static in that regard, so this is seldom an issue. That sounds quite horrible indeed.
And yes, a fatter middle layer would be nice. Our next generation software will probably have more of that, this code base is over 20 years old at this point...
I realize this response turned out a lot more harsh than I intended. I was in a slight rush so didn't formulate myself well, didn't mean to be abrasive.
Yeah, for being so uptight about just about everything, the officials seem quite relaxed about values slipping between the cracks due to lack of decimal places.
Like, total invoice value can only be specified with two decimal digits, typically in foreign currency. Yet we also have to specify per-line value in local currency, also with only two digits. And then the per-line values are used to calculate taxes and whatnot...
With this sort of perspective, why use a database at all? You can store literally any digital object as some sequence of bytes, and that's what file systems are for.
At some point, you'd actually like the software you use to have meaningful features.
But dates are integers (all databases store dates as an integer that is an offset from some fixed date), and JSONs are strings. SQLite has the functions to work with these datatypes (dates, JSON, etc...) but fundamentally they are not different from integers or strings.
And integers are bytes and strings are bytes and tables are bytes and indexes are bytes. If you're not happy with databases being intelligent, why stop with integers and strings, when you can do everything you want with bytes?! Memcached, or LevelDB, or even a file system is the ultimate database if you follow your arguments to their logical conclusion.
But data representation isn't data. You can't look at a serialized sequence of bytes and know what it represents without context - without a serialization scheme. For relational databases, the table definition is the serialization scheme - it is the context. For example, you can't know what date an integer is representing without contextual information like the offset. By storing dates as dates in your database, that context is baked in.
It is helpful to have that full context inside your database because it allows you to operate that database more efficiently (by carefully indexing on the properties of the data, not the data representation), but it also allows you to use that database in ways that are not tightly coupled to your application, such as analytics, because you are storing data itself and not just the bare minimum required to represent that data.
Having rich data types allows a lot of things without having to write your own layer. Spatial indexes are extremely efficient, and I can't query geometries efficiently when I'm storing geometries as a blob. The same goes for dates, json, XML, ranges, etc.
The problem with reserving specialized logic for the application layer is that it limits you to simplistic indexing schemes and you end up doing excessive IO and filtering in memory to get what you actually want.
The idea that databases shouldn't have specialized datatypes is really only an idea that works in simplistic crud apps. The world is much bigger than that.
I tried to switch from MySQL to SQLite for my Postfix/Dovecot installation but I ran into a very annoying problem: every now and then I lose an email because the "database is locked".
This is a show-stopper for me because the problem happens very rarely, only once every couple of days, but it's catastrophic: when this happens, the incoming message is not bounced, it is actually lost. The only reason I even realized it was happening is because I noticed there were emails in the root account, which is the error-reporting mechanism of last resort.
I've searched the web in vain for a solution. If you have any suggestions, I would love to be able to stick with sqlite, but at the moment I am about to begin migrating back to MySQL. :-(
This sounds like a problem with Postfix or Dovecot. Postfix shouldn't ack or Dovecot shouldn't delete the email until it is safely stored.
That being said there are many use cases where having high availability is critical and in the face of multiple writers SQLite isn't the best option for that.
I don't think this is a multiple-writer problem. Postfix is only reading. I am running a milter that is writing, but I control the code for that so I have it set up to retry if it fails. So the error is being generated by postfix itself, and so it must be happening on a read (because that is all postfix does).
I was hoping to find some kind of global switch that would make sqlite always wait for locks rather than throwing an error. But I've scoured the web for such a solution without success :-(
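For readers hitting the same symptom: SQLite does ship a knob close to what's described here, PRAGMA busy_timeout (sqlite3_busy_timeout() in the C API), which makes a connection wait and retry on a locked database instead of immediately returning "database is locked". Whether it would have rescued this particular Postfix setup depends on which component opens the database, so treat this as a hedged sketch (the database name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect("mailboxes.db")  # hypothetical database name
# Wait up to 5 seconds for a competing writer's lock to clear
# instead of raising "database is locked" immediately.
conn.execute("PRAGMA busy_timeout = 5000")
```

Python's sqlite3 also exposes this directly as `sqlite3.connect(..., timeout=5.0)`.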
I don't understand enough about your specific problem to know if this will actually help you, just sharing a tidbit I encountered working on a comparable issue.
It's terrific right until you need multiple processes writing to the same database. It's no accident that SQLite is fast for many small queries; it's not doing a lot of the work required to, say, be a good database backend for multiple web frontends.
Good observations from a MySQL perspective. Any thoughts from the other end, where the alternatives are JSON or XML or ZIP? SQLite tries hard to convince you to use it as an application file format, but it looks like a giant black box of overkill: why incorporate its 200k SLOC when the alternatives are a fraction of the size?
Is 200kLOC really a lot? Lots of software nowadays has hundreds of megabytes of dependencies, and people seem to be fine with that. Not that I think having tons of dependencies is really a good thing, but for an application that needs a file format, SQLite is a very sensible dependency.
Neither XML, JSON, nor ZIP solves the problems SQLite does, though; if you use plain old files, you need to make sure any changes you make actually end up on the disk, consistently. This is not easy to do. SQLite also solves any consistency issues that might stem from someone reading the data while you're writing it.
On top of being just better at persistence, having a relational model for your data gives you much more freedom to use said data; you'll be able to do things efficiently that might otherwise require restructuring your JSON or XML format. Personally, I love SQLite-based application formats because I can explore them with SQL, which is often much easier than trying to make sense of a custom JSON or XML schema.
Yes 200k SLOC is huge (modern development practices notwithstanding). SQLite creates temporary files at whim - nine different kinds! https://sqlite.org/tempfiles.html
I know how to atomically write a JSON file. But when I read, for example:
"The temporary files associated with transaction control, namely the rollback journal, super-journal, write-ahead log (WAL) files, and shared-memory files, are always written to disk. But the other kinds of temporary files might be stored in memory only and never written to disk. Whether or not temporary files other than the rollback, super, and statement journals are written to disk or stored only in memory depends on the SQLITE_TEMP_STORE compile-time parameter, the temp_store pragma, and on the size of the temporary file..."
My eyes have completely glazed over. If I add this to my app, what will it actually do? How can I even know?
Are you sure? I’ve had a lot of trouble getting that to work reliably myself across multiple OSes. (In hindsight I wish I’d used SQLite!) This article gives a good explanation of the many difficulties:
> My eyes have completely glazed over. If I add this to my app, what will it actually do? How can I even know?
Well, fundamentally it’s very hard to get it exactly right, and I imagine that’s why the implementation is a little involved.
But you could a) read through those docs, lengthy though they are, and/or b) trust the many testimonials saying SQLite is very, very robust and reliable.
Unless you're on nfs. Remote file locking is hard, and I don't think that any nfs implementation has gotten to the point where you can trust SQLite on it.
SQLite does updates in place, which I would trust far less than a rename call.
I don’t think you can atomically append more than one byte to files in unixes (the write call can return after having written some but not all requested bytes)
(Haven’t googled, but if that’s possible, I don’t see why write would have that limitation)
Yeah, and eventually we reach the best-effort bedrock. Maybe the file is on a NFS mount, you call write(), it goes over the wire, who knows what happens!
> If I add this to my app, what will it actually do? How can I even know?
Be really, really, *really*, unambiguously sure about whether your data was written or not, AND have high confidence that I/O errors (e.g. power loss) in the middle of a batch of deletes won't scramble (or truncate) existing data.
What you're looking at is the complexity required to solve the wonderful tornado of "but is my data really written???". But you don't have to deal with SQLite's implementation details for it to do its thing, which is what makes it so awesome (given its public domain status, what's more!).
File formats based on ZIP files only work for small files. For big files you have huge overheads; opening and saving a moderately sized document takes seconds (vs. milliseconds for writing changes to an SQLite database). There's a reason why Excel files are limited to a million rows, while Access databases aren't.
The complexity of including SQLite is trivial for practical purposes; it's already available on many systems, and if not you can include it by adding a single C file to your project.
Setting up a workflow for Google Protocol Buffers (another popular alternative for document file formats) is a lot more complex than building or linking with SQLite, and it doesn't stop people from using them.
One thing that speaks for SQLite is the quality of the project; it's one of the best maintained Open Source projects with fantastic quality assurance and support for almost every OS. This means that you are unlikely to run into issues compiling or working with SQLite, like you might have with alternative libraries like libxml2 or jsonc (which are still great libraries!!).
EDIT:
The big downside of SQLite is that it's unsuitable for documents that are exposed to the user because of the temporary files (like the WAL). If you have a ZIP based file format that you atomically rewrite from scratch on every save, it's almost impossible to corrupt. Your users can just take the file and email it and nothing bad will happen. I'm not sure what happens if you email an SQLite database file that is currently being used. I've done that in the past and have been surprised that some data seemed to be missing, but I don't recall the details.
Hence SQLite is often used for application data files that are not directly exposed to the user.
You could get SQLite to work as document files exposed to the user so long as you use sessions[1]. When a file is opened, copy the DB to a temporary file or to memory and write all changes during operation to this new DB, recording them all in a session. When the user explicitly saves a document, apply the session to the real DB.
Doesn't that just combine the drawbacks of both approaches? You end up with slow open/save operations, and the writes to the DB still aren't necessarily atomic (e.g. if a user copies the file while changes are being applied).
JSON/XML quickly stop being alternatives as soon as you need any sort of index, a memory-mapped/on-disk data structure that doesn't have to be loaded into memory, transactional or even just incremental writes. ZIP is not even directly comparable.
It's not limited to phone users. I've been using computers since the 1980s (C64), and I appreciate not needing to habitually keep pressing "save" every few seconds in Google Docs or macOS Notes.
#1 best JetBrains IDEA feature IMO: save on focus lost. Just alt-tab into your app, or into your terminal to git commit, no worrying about "did I remember to ctrl-s".
> That’s not normally what you need for an application format though is it.
Well, it depends what you mean by “need”. But continuous, incremental updates generally provide a much better user experience, either instead of or in addition to active “save” actions.
So, yeah, I think it's exactly something that is commonly desirable in a file format for maintaining application state, even if there is a different interchange format that the application produces/consumes as a static input or output.
During the alpha, Minecraft divided the world into 16x16x128 grids of blocks called chunks. Each chunk was its own file. Large worlds suffered from very poor performance because there were tens of thousands of files in a single folder.
Some random modder basically just put multiple chunks into one file so that each file is 2MB. If Notch had just put the game world into an SQLite database, he wouldn't have had to reinvent the wheel. There are games that did that, such as the alpha of Cube World, and they work just fine.
Heck, notch went one step further and invented NBT aka named binary tag which is basically a weirdo binary file format that stores JSON like data.
> Large worlds suffered from very poor performance because there were tens of thousands of files in a single folder.
It was using subdirs for the chunks, two levels iirc, one was chunkX % 36, the next level chunkY % 36. So there weren't that many files per directory. The slowness came from the overhead of opening, read/write and closing so many files all the time.
> Some random modder basically just put multiple chunks into one file so that each file is 2MB.
Almost, it wasn't limited by file size, it was putting 32*32 chunks into one file that was similar to a simple file system. The format of the individual chunks within that file stayed almost the same. Yet it performed much better.
NBT is indeed a little weird but fairly straightforward overall, I guess designing and implementing it just scratched an itch. It was a hobby project after all.
I'm actually currently working on a user mode FS using Dokan for Windows that saves everything to a SQLite file for similar reasons. NTFS just doesn't do well at all with lots of small files.
> My understanding is that SQLite doesn't impose any format either?
It's true that you can stuff any kind of string/blob data into any column of any table, so, yes, you still have to determine the data schema with SQLite much as you do with JSON, XML, or even CSV. I mean, I could have a CSV where each element is a base64-encoded ZIP containing SQLite database files that are each a single table with a single column of JSON files, each of which contains a JSON array of strings with XML documents in them.
But that's usually not what people would mean if they said their app was using CSV as its data storage format, nor is the version with the CSV stripped off the top what people would mean if they say they are using SQLite.
With ZIP, you have to decide the format(s) for the file(s) in the ZIP, their hierarchical structure, and, if the files aren't themselves the atomic data elements, the schema applicable to each file.
Furthermore, in discussion of performance characteristics and other aspects of suitability, ZIP adds overhead, but you still also need to consider the access properties of the contained files.
> Of course SQLite imposes a file format... That format is a SQLite database
Well this is in a context that rejects zip as being a format. Do you do that? If the answer is no then skip the rest of my post and just note that they're talking about a different definition of 'format'.
-
But in that context:
The amount of structure imposed on you by the sqlite database format is not much more than the structure imposed on you by a zip. I think it's fair to rate them similarly as formats. A zip file is basically a key-value store.
"Zip full of csvs", while awful to use, would impose about the same amount of structure as sqlite does: not much. And zip+csv is not much more elaborate than zip on its own.
A configuration like that comes from the program using sqlite. Just adding sqlite into a system doesn't set up any data formatting like that. Sqlite itself gives you a blank canvas. And a blank canvas is not much of a data format.
The SQLite data format includes the schema, in plain ASCII. This self-documenting nature makes it an excellent data format, I've taken advantage of it numerous times in making use of SQLite-based application file formats.
SQLite is put forth as a basis for an application file format, and by definition it must be sufficiently flexible to accommodate any application. But by including the schema, it is self-documenting as to what the structure is, which ZIP isn't and can't be. QED.
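That self-documenting property is concrete: every SQLite file carries its own CREATE statements in the sqlite_master table. A small sketch in Python's sqlite3 (the table here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, created TEXT)")
# Any tool that can read SQLite can recover the schema; no external docs needed
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)
    print(sql)
```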
I hope to make the modest point that, if you package your data in a SQLite file or a zip file, then you have both packaging and data problems.
MyCoolSQLApp may read and write a SQLite file with its own schema, but it can't handle an arbitrary SQLite file. Likewise MyCoolZipApp can't handle an arbitrary zip file.
I'm not clear on the point here. You can make an application that handles arbitrary SQLite files (for example: the SQLite shell, or a GUI database browser), or you can make an application that only uses SQLite files with certain content/schema. Does this mean that SQLite does or does not define a file format?
If not, what do you call the specification of how data is stored in a SQLite file besides a 'file format'?
It all depends on how much/how complex data you have. SQLite is a database after all, you can query it with SQL and do lots of fancy stuff that might be hard to do with regular file formats like JSON or XML.
If you just need a config file or only have a small amount of data you can use XML/JSON files that you parse yourself. If you are going to have loads of data that needs some structure (for example messages in a messaging app) i would use SQLite.
There's an enormous amount of comments and tests in that codebase. As installed on my Mac, sqlite comprises a 1.3MB command line utility and a 1MB shared library. That's absolutely tiny given the functionality it provides.
Go ahead and use text files and then have fun with data corruption issues. We use CSV for sending commands to IoT devices and it's an issue. If this had been done with SQLite, there would at least be no data corruption issues. One could even use SQLite as a storage container for JSON if one wishes to do so. They even have an extension that aids in this with a useful set of functions:
The JSON features of SQLite are extremely robust and performant. There is no reason to use raw JSON as the storage when you can just shove it into SQLite and lose almost nothing.
Not only that, it enables us to do CRUD operations, list the commands, sort them by time, do limits, pagination, bundle a bunch of commands that enable a certain functionality in a transaction, etc. SQLite has all of those and more and also avoids data corruption issues by design. Anyway, the path we took was to move everything to MySQL just because most of the other data is also in a MySQL database. Otherwise we would have definitely used SQLite.
ZIP files are not a database: they are more like a directory hierarchy. But maybe all I need is named blobs: no query language parser, optimizer, indexing, etc.
SQLite positions itself as an improvement over ZIP for application file formats: https://www.sqlite.org/appfileformat.html . But minzip is so much smaller, easier to understand, debug and ship. So why use SQLite for an app if ZIP suffices?
If you’re talking about like cbr archives, you’re right. It’s comparing against usages like word/excel, which store a bunch of XML in an archive and call it a day.
If you’re not reading and writing out application state, then yes, you don’t need something to manage your non-existent state
Depends on your use case. XML and JSON are great for applications with simple data stores, having done this myself. But if you foresee a need for complex queries or locking and threads then SQLite might be a good choice.
The main challenge there is: how do you ensure your database is resilient to machine or datacentre outages? I.e. what happens if the one server with the database is in a datacentre that loses Internet connectivity?
SQLite "merely" assumes that the problems that come with distributed systems are handled at the application layer. You'll have to solve those problems for yourself, sure, but in practice I have rarely (I think never, actually) had data loss through a fault of SQLite.
Also, did you know you can use in-memory instances (and share them across threads!) with the right incantation? And that you can back up your on-disk instance to an in-memory one, do your expensive transactions without hitting the disk, then back the modified instance up to disk, even in-place if you want!
Sqlite is amazing when you don't expect the DB to do replication or failover on its own.
No I did not know that, I've looked for a long time for a way to convert a sqlite3 database to an in memory database and then back again. Do you mean that there's support in sqlite3 for this? Could you point me in the right direction?
We both learned something new today. Looks like this is what you want in combination with using an in memory database.
I've been doing a handrolled in memory cache layer to speed data access, but with this, I can just call the db directly and then periodically sync to disk, redis rdb style. Sqlite is a staggeringly good piece of technology!
In Python you just have to open the special file name “:memory:” to get a memory-based db. I don’t remember what the raw SQLite incantation is (or if it’s different). Also, pay attention to “ATTACH” - it’s the way to use multiple databases (file and/or memory) while still letting SQLite handle it all (e.g. join a memory db to a file db, insert result into 3rd file db - all without having to look at records in your own code)
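A minimal illustration of both tricks together (Python's sqlite3; file and table names are made up): an in-memory database, an ATTACHed file database, and a cross-database INSERT ... SELECT with no records touching application code.

```python
import sqlite3

# Pure in-memory database
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE scratch (v INTEGER)")
mem.execute("INSERT INTO scratch VALUES (1)")
mem.commit()  # ATTACH is not allowed inside an open transaction

# ATTACH a file database and move rows across in one statement
mem.execute("ATTACH DATABASE 'archive.db' AS disk")
mem.execute("CREATE TABLE IF NOT EXISTS disk.history (v INTEGER)")
mem.execute("INSERT INTO disk.history SELECT v FROM scratch")  # memory -> file
mem.commit()
```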
I feel like the sibling comments here are basically just saying "yep, that's the main challenge!" without providing useful tips. I personally haven't used it, but I'm aware that this library exists to help resolve this challenge. https://litestream.io
Just open your SQLite database in read-only mode :)
SQLite works really well for static or semi-static data. For example, a blog where you have a small number of users writing and many users reading from the DB. If the authors are content to use one server to edit the DB then you can easily push that DB to the servers handling the reads.
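Opening read-only is one URI flag away (a Python sketch; blog.db is a placeholder name): readers open the same file with mode=ro, and any stray write attempt is refused by SQLite itself.

```python
import sqlite3

# Writer / publishing side
rw = sqlite3.connect("blog.db")
rw.execute("CREATE TABLE IF NOT EXISTS posts (title TEXT)")
rw.execute("INSERT INTO posts VALUES ('hello')")
rw.commit()

# Read replicas open the same file read-only; writes are refused
ro = sqlite3.connect("file:blog.db?mode=ro", uri=True)
print(ro.execute("SELECT title FROM posts").fetchone()[0])  # hello
try:
    ro.execute("INSERT INTO posts VALUES ('nope')")
except sqlite3.OperationalError as e:
    print("refused:", e)
```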
Yes this can work, however you are mostly relying on the operating system's file system cache for speed. Other databases will try harder to keep their own cache. But true, there is lots of room where SQLite works nicely.
TBF: you don't. The moment you care about any shortcoming of SQLite, move away.
One of the cool things about it is that SQLite is very lax about what it accepts (mostly in the datatype area). You can write your SQL statements targeting whatever database you think you'll move to later and they'll work while you're still on SQLite. I believe having this migration work seamlessly towards PostgreSQL is one of the advertised features.
I'm using SQLite at the moment; on the one side there's a 'legacy' (read: poorly written 2012) application, on the other there's the new and rebuilt version. The old one was not built very well, it does not use foreign keys or any kind of database constraints (it references other entries by name in a column of comma-separated values) and it runs like trash. But the performance problem is not in the dozen queries it runs to load the data, it's in the fact that it converts the query result to XML (via string concatenation, because of course) and that is converted to JSON; the conversion is at least 60% of each request. The other problem is that it writes and re-queries the data whenever you leave one of the hundreds of form fields in the application.
I'm rebuilding the application in a modern tech stack, still using SQLite but properly this time, along with Go and React. API requests take 20-40ms instead of 300-1500ms, and there are far fewer of them.
The main downside to using SQLite is that it does not support "proper" database migrations; you cannot alter a column. You can add columns to an existing table, but you can't change existing columns. The database abstraction I'm using at the moment, Gorm (a different subject entirely), works around this by moving stuff to a temp table, recreating the table with the updated columns and moving stuff back, I believe.
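The workaround being described is roughly this rebuild pattern (a sketch; the table is invented, and note that newer SQLite versions do support RENAME COLUMN and DROP COLUMN, but changing a column's type still requires the rebuild):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age TEXT)")  # wrong type
conn.execute("INSERT INTO users VALUES (1, '42')")

with conn:  # do the whole rebuild in one transaction
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, age INTEGER)")
    conn.execute("INSERT INTO users_new SELECT id, CAST(age AS INTEGER) FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")

print(conn.execute("SELECT age, typeof(age) FROM users").fetchone())  # (42, 'integer')
```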
I reach for SQLite if I need persisted state for a local application or custom file format but why use it for things that may need more write concurrency like web server?
Postgres is basically just as easy to use and backup.
> Postgres is basically just as easy to use and backup.
SQLite is way ahead on this: no daemon to run, no user / database to create, manage and administrate, no authentication to set, no socket connection to manage… backup is as easy as it gets: (copy one or two files).
Well, backing up by copying doesn't necessarily result in a consistent state if there are writes to the database. For that you have to use the SQLite `.backup` command (or the backup API https://sqlite.org/c3ref/backup_finish.html), after which the backup database has to be copied over to backup storage (or backup storage has to be mounted on the production system, which is dangerous).
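In application code the same consistent snapshot is available without shelling out; Python (3.7+) exposes it as Connection.backup (file names here are made up):

```python
import sqlite3

live = sqlite3.connect("production.db")
live.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")
live.execute("INSERT INTO t VALUES (1)")
live.commit()

# Takes a transactionally consistent snapshot even while `live` stays in use
snapshot = sqlite3.connect("backup.db")
live.backup(snapshot)
snapshot.close()
```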
Use "rsync" instead of "cp". After it finishes copying, it will check to see if the file has changed since it started copying, and will restart the copy if so (with a limited number of retries).
If copying the entire database is faster than your average update rate, this will converge very quickly and deliver a consistent copy.
For many small applications, this is perfectly fine. It's not much harder to just run `sqlite3 "$file" ".backup $backupfile"` (that's literally all it takes, and what you should do) and guarantee consistency. But it's nice to know that a simple rsync is sufficient for slowly-updating uses - e.g.
And as for the other side of backup, you know -- restore -- sqlite shines brighter than everything else. You can just take a good copy and put it back. You can examine the file everywhere, on a read only system, etc - without configuring anything if needed.
SQLite loses most of its edge in concurrent write scenarios, but its read performance is difficult to beat.
A lot of it comes from what TFA says: there's no network roundtrip, but a function call. Even in a local machine, a unix socket query will carry at least a couple of system calls with potential context switches, and that makes regular RDBMS lag behind when you do tons of sequential and small queries.
Of course, when you have large results or complex queries that eat a bigger chunk of the time cake and that technical advantage wanes. After that, which RDBMS has the performance lead is largely workload-dependent.