> The key point is that SQLite is very forgiving of the type of data that you put into the database. For example, if a column has a datatype of "INTEGER" and the application inserts a text string into that column, SQLite will first try to convert the text string into an integer, just like every other SQL database engine. Thus, if one inserts '1234' into an INTEGER column, that value is converted into an integer 1234 and stored. But, if you insert a non-numeric string like 'wxyz' into an INTEGER column, unlike other SQL databases, SQLite does not throw an error. Instead, SQLite stores the actual string value in the column.
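A quick session makes the docs' point concrete (the table name is just for illustration):

sqlite> CREATE TABLE t (x INTEGER);
sqlite> INSERT INTO t VALUES ('1234'), ('wxyz');
sqlite> SELECT x, typeof(x) FROM t;
1234|integer
wxyz|text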
I'm often exploring data where either there's no defined standard or the use of the data has drifted from the standard. Now, I could go over this line by line, but instead my go-to has been "Hey, let's throw this into SQLite, then run stats on it!" See what shakes out. SQLite kindly and obediently takes all of this mystery data, which ends up being nothing like what I was told, and just accepts it. Then I can begin prodding it and see what is actually going on.
This is something that has come up for me for at least a decade: chuck it in SQLite, then figure out what the real standard is.
Right, what's the point of declaring a column to have type INTEGER if it really means "this column MAY contain integers, but may contain other kinds of data as well"?
It's especially great for large, important legacy datasets that similarly did no enforcement. You can then interact with the data and clean it up with a few commands, instead of trying to work with massive datasets in Excel, which sometimes just crashes the entire app.
I do exactly the same. You import from CSV, everything ends up as a string, and then you copy the data to a STRICT table so you can understand what's really going on.
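For reference, the workflow looks roughly like this in the sqlite3 shell (the file, table, and column names here are invented):

sqlite> .import --csv mystery.csv staging
sqlite> CREATE TABLE clean (id INTEGER, name TEXT, amount REAL) STRICT;
sqlite> INSERT INTO clean SELECT id, name, amount FROM staging;

Anything that can't be losslessly coerced to the declared type raises an SQLITE_CONSTRAINT_DATATYPE error, which is exactly where you learn what the data really contains.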
What do you mean? I don't think this process ends up with less info than the CSV file format itself.
Short of doing that, cleaning very dirty data has no satisfactory solution, I think. Optional typing is a nice middle ground between untyped and slow (R, Python) or strictly typed and tedious (all other DB engines).
Maybe I misunderstand, but if you take a data type that is a union of two datatypes (string and int) and push it into something that converts all the data to strings, you lose the information of whether a string that looks like an int was originally a string or an int.
Maybe instead you mean to use the database functionality to identify problems like that in the original source data. If somebody hands me a string "11101100101001" I would not attempt to interpret it by parsing it into a binary number, but I think that's because I really like strong, simple typing.
I can't answer for the person you're commenting to, but I use sqlite in a similar way and the loss of information isn't an issue. When you're exploring data, just trying to make sense of what someone has sent you (say, migrating data from an existing system to a new system) you probably have a pretty good idea if the column is supposed to have integers or strings in it. In my own use of it, it's very unlikely that the loss of information will cause a problem. Eventually the data will be type-checked -- not in sqlite (I haven't experimented with the new-ish "strict tables" feature yet), but in my case the data will eventually flow through a typed programming language.
Anyway, different ways to use the tool, for sure! And in some cases one would definitely need to be attuned to the issue you're raising. In the kind of situation I'm thinking of the data is usually so dirty anyway, a bit of string->integer conversion won't hurt (probably).
I get it, you don't have typing control over your data exchange. Somebody might send you an unquoted-string CSV with multiple data types in the same column:
id,location,name
1,90210,Tori Spelling
2,Schitt's Creek,Eugene Levy
and you use sqlite to quickly explore the loaded data to check the set of types used in a column, and maybe even glean what the meaning is.
Exactly. The thing I'm talking about is necessarily a semi-manual process (data exploration, usually prior to migrating data from one system to another).
In the example you give, when you've done all the exploration you need, there's a program interpreting that CSV that ensures the location column values are strings. At least, that's how I do it!
I've done it for formal ETL pipeline code, too. I make it tighter, but not too tight. Sadly, the spec given me was very ... optimistic? Bright-eyed? Many things that were not supposed to happen did happen, data that shouldn't have gotten near me did, and so on. SQLite swallowed it and kept on truckin', because when it comes to ETL, one dumb field entry shouldn't stop the show. Catch it, yes, deal with it when you note it in the logs, absolutely, but one must continue, and for that SQLite is most excellent.
Were SQLite not such a stellar piece of software, I would complain, and ask that this kind of flexibility be disabled by default, with an option to explicitly enable it if necessary.
> There should be an option to have strict tables, foreign key checks and other modern features to be automatically enforced.
I know to enable foreign key checking, and I learned strict tables from this thread. What are the "other modern features" you were referring to?
EDIT: In TFA, section 8 is relevant here I guess: SQLite accepts double-quoted strings as a historical misfeature, which can be disabled with the appropriate C API functions. This is one of the "other modern features" I guess; TIL.
The STRICT keyword, per that doc, already taints a table for pre-3.37.0 (2021-11-27) libraries. (There is a way around that for pre-3.37 per the doc, but it mainly amounts to enabling the PRAGMA used for db recovery operations, which treats the STRICT at the end as an error and ignores it to "facilitate recovery".)
So the solution you're after would be some additional calls when initializing the database to enable the FK checks (alongside any other app-related PRAGMA calls like checking data_version or configuring the journal or page size), and ensuring any tables created have STRICT applied.
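As a rough sketch of that initialization (the PRAGMA values and the table are only illustrative):

PRAGMA foreign_keys = ON;  -- must be re-enabled on every new connection
PRAGMA journal_mode = WAL;
PRAGMA page_size = 4096;   -- only takes effect before the first page is written, or after VACUUM

CREATE TABLE IF NOT EXISTS users (
id INTEGER PRIMARY KEY,
email TEXT NOT NULL
) STRICT;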
Unfortunately I found strict mode not very usable, because it only allows the six basic datatype names and doesn’t do column affinity determination <https://www.sqlite.org/datatype3.html#determination_of_colum...>. This thwarts one of the more practical forms of strictness in querying libraries, automatic column type mapping (e.g. mapping DATE (which has NUMERIC affinity) to your language’s date type, or conceivably some other convention like mapping TEXT_Foo to your own Foo type). I’m perplexed as to the rationale for this restriction which feels just gratuitous and unnecessary.
That feels like a bit of a hack, tbh. Playing off a mechanism that, I think, was made to make porting from other DB syntaxes easier?
Is SQLite converting an ISO 8601 string to a timestamp, or would it just store the value as given when type coercion fails due to dashes and spaces? From memory, it's the latter.
It would be slightly more brittle, but surely this metadata should be at app/orm layer rather than the database?
> surely this metadata should be at app/orm layer rather than the database?
I cannot fathom why you would say this, because the way I see it of course it belongs in the database as part of the table definition, as it’s obviously part of the logical schema. Sure, you can’t actually enforce invariants for individual types so that they would be just conventional aliases for the underlying affinities, but that doesn’t mean you should avoid specifying meaningful types.
Which of these would you prefer:
CREATE TABLE example (
id BLOB PRIMARY KEY,
created INTEGER,
data TEXT
);
CREATE TABLE example (
id uuid PRIMARY KEY,
created timestamp,
data json
);
I know that I want the latter, because it makes my life much easier when reading the schema, and lets code automatically use the right types based on inspecting the schema (though you will need to define a mapping of SQL type names to your programming language’s types, since they are still only informational rather than structural like in most SQL databases). Ideally you might be able to define your own datatype affinities (SQL even defines a suitable syntax: CREATE TYPE uuid AS BLOB), but it’s not so bad leaning into the built-in rules with BLOB_uuid, INTEGER_timestamp and TEXT_json (… though on reflection I admit this is rather perilous due to the precise affinity determination rules, shown in the example “TEXT_point” which would be INTEGER due to containing “int”, so maybe it is actually better that strict mode doesn’t blithely use the current affinity determination rules on the expressed type).
(Actually on the DATE thing I was forgetting and thinking that was a regular feature but it’s actually just the fallback affinity where a type is specified but not matched by any other rule, NUMERIC. Strike out that example as a canonical definition or anything, then. But the rest of the point still stands.)
Each and every biostatistician on this planet. Especially those touching clinical data. Personally, I was saddened to learn that DuckDB did not include dynamically typed, or at least untyped, columns. Happily my data loads are usually small enough for a row-oriented data store.
CHECK constraints, now in conjunction with STRICT tables, are the best invention since sliced bread! If I could improve on one thing, it would be to remove any type notion from non-STRICT tables.
Fwiw the ANY type is valid in strict tables, and does what it says.
Slightly more so than in non-strict tables in fact: it will store exactly what you give it without trying to reinterpret it e.g. in non-strict, a quoted literal composed of digits will be parsed and stored as an integer, in strict, as a string.
Strict mode also doesn’t influence the parser, so e.g. inserting a `true` in a strict table will work, because `true` is interpreted as an integer literal.
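A quick illustration of that difference (table names are made up):

CREATE TABLE loose (v ANY);
CREATE TABLE tight (v ANY) STRICT;
INSERT INTO loose VALUES ('123');
INSERT INTO tight VALUES ('123');
SELECT typeof(v) FROM loose; -- integer: the quoted digits were coerced
SELECT typeof(v) FROM tight; -- text: stored exactly as given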
This always put me off a bit too. I have memories of mySQL supporters back in the early days defending it storing dates like 2022-02-31 and other such type/range/other issues with the standard refrain of “but SELECT * is blazing fast!”…
As of last year there is an option to make things more strict (https://www.sqlite.org/stricttables.html) though
As SQLite doesn't have real date types, unless you are using one of the integer storage options, code could insert invalid dates like MySQL used to allow.
--
EDIT: having actually read the linked article, it explicitly mentions the date type issue also.
Even in the absence of strict tables, you can also add a CHECK constraint whenever limits are needed on the value types. Here is one example from a recent discussion in SQLite forum threads: `check(datetime(x, '+0 seconds') IS x)`.
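As a sketch of that embedded in a table definition (table and column names are invented):

CREATE TABLE events (
id INTEGER PRIMARY KEY,
starts_at TEXT NOT NULL CHECK (datetime(starts_at, '+0 seconds') IS starts_at)
);
INSERT INTO events (starts_at) VALUES ('2024-02-29 12:00:00'); -- ok
INSERT INTO events (starts_at) VALUES ('2024-02-31 12:00:00'); -- CHECK constraint failed

The trick is that datetime() re-renders the value in its canonical 'YYYY-MM-DD HH:MM:SS' form, so anything that isn't already a valid, canonical datetime string fails the IS comparison.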
An important proviso on the strict tables is that they will not be observed on older SQLite versions lacking this feature.
SQLite has been rock-solid since attaining DO-178B, and commonly isn't ever upgraded in many installations.
CentOS 7 is using 3.7.17. Since the v3 database format is standardized, the older version can utilize a STRICT database file, but will not have the capability to alter datatype behavior.
For these cases, implementing both STRICT and the relevant CHECK constraints is advisable.
It's even more fun in a boolean column, where some databases accept 't'/'f', 'true'/'false', etc. SQLite accepts some of those, but treats other ones as text.
I do, for JSON columns. I store UTF-8 strings in SQLite, so that it is easy to see JSON values with a plain `SELECT *`. And I load blobs, because I code in Swift and the standard JSON decoder eats raw UTF-8 memory buffers, not strings.
This avoids two useless conversions:
- first from a C string loaded from SQLite to a Swift Unicode string (with UTF8 validation).
- next from this Swift Unicode string to a UTF8 memory buffer (so that JSONDecoder can do its job).
SQLite is smart enough to strip the trailing \0 when you load a blob from a string stored in the database :-)
SQLite is an excellent app database for local things. It doesn't compete with PostgreSQL or MySQL, for a huge number of reasons. So, to make it fit the competition model, perhaps a bunch of things get glossed over.
Does anyone see a massive pro-SQLite movement going on? Sort of like what happens in the JS ecosystem. Everyone is bandwagoning on it. Criticism of SQLite is very welcome, specifically criticism that spells out what its role is and which use cases it serves really well.
SQLite needs this, because changing column types is such a pain in the ass. The answer to "I want to move from INT primary keys to TEXT/UUID, how do I change the column type?" is "just insert the data into the column because SQLite allows everything".
And the reason changing column types is so hard is because, uh, SQLite stores its schema as human readable plaintext (easier to keep compatibility between versions) and not normalized tables like other databases.
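For anyone who hasn't done it, the workaround is the create-copy-drop-rename dance, something like this (table and column names are invented):

BEGIN;
CREATE TABLE users_new (
id TEXT PRIMARY KEY, -- was INTEGER
name TEXT
);
INSERT INTO users_new (id, name) SELECT CAST(id AS TEXT), name FROM users;
DROP TABLE users;
ALTER TABLE users_new RENAME TO users;
COMMIT;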
What does it mean for a database to have primary keys of different types? In particular, primary keys are frequently used for clustering and other performance enhancements and I would expect that they would need a total ordering that made sense.
As much as I love sqlite, its table model is really confusing to me coming from a postgres mentality:
"WITHOUT ROWID is found only in SQLite and is not compatible with any other SQL database engine, as far as we know. In an elegant system, all tables would behave as WITHOUT ROWID tables even without the WITHOUT ROWID keyword. However, when SQLite was first designed, it used only integer rowids for row keys to simplify the implementation. This approach worked well for many years. But as the demands on SQLite grew, the need for tables in which the PRIMARY KEY really did correspond to the underlying row key grew more acute. The WITHOUT ROWID concept was added in order to meet that need without breaking backwards compatibility with the billions of SQLite databases already in use at the time (circa 2013)."
Setting aside some of the technical choices, I wish we saw more product builders publish transparent perspectives on the 'shortcomings' of their product. SQLite also has a list of reasons why you would or wouldn't want to use SQLite that I've always enjoyed: https://www.sqlite.org/whentouse.html
I hit annoyances when using sqlite when dealing with some corner-cases. I wanted to implement a lock and share some small amount of data (who did what when) between two different linux users on the same machine.
I figured "sqlite is better than fopen, let's use that!", but between directory permissions, all the WAL files, and probably the sqlite3 python lib and Diskcache (https://pypi.org/project/diskcache/) not helping things, it was a real pain: under various race conditions we would regularly get permission-denied errors. I managed to paper it over with retries on each side, but I still wonder if there was a missing option or setting I should have used.
I feel you! I had an ASP.NET Core Web API using a SQLite file for data storage. It worked perfectly fine in development. When we released it, as soon as more than 10 users were throwing requests at it, it got deadly slow. Obviously caching and some tweaking helped a lot. But it was far from easy to find, understand and test (!) all the required options and optimizations.
Not only that, but the FK enforcement must be enabled on a per-connection basis. Unlike the WAL, you can’t just set it once and know that your FKs will be checked, every client must set the pragma on every connection. Such a pain in the ass.
The error reporting on FK constraint errors is also not great (at least when using implicit constraints via REFERENCES, sqlite just reports that an FK constraint failed, with no mention of which or why; good luck, have fun).
More generally, I find sqlite to have worse error messages than postgres when feeding it invalid SQL.
Anyone can share their experience with the somewhat new STRICT mode? Does it help? I tend to use Postgres when available, primarily for the added strictness, but I'd surely prefer SQLite in more scenarios as it's easier to operate.
I use strict tables. I’m now realising it did help when I migrated tables from one data type to another, because I’d missed updating some of the code.
I didn’t realise because it was just what I’d expected, but without strict tables I’d have had to debug strange errors on retrieval rather than the type error on insertion I got.
We have extremely heavy SQLite usage throughout. Strict mode would only cause trouble for us. Most of our SQLite access is managed through a domain-specific C# library we developed. It handles all of the parsing & conversion internally. We don't touch the databases directly in most cases.
I've run into situations where I need a db, but all I have is Python, so the embedded SQLite is my best option. It would be ideal if the interface was fully compatible with postgres (or a practical subset), even if SQLite was handling storage under the hood.
- It is annoying to have it per table and not database
- No alter table (to convert existing tables)
- No backward support (maybe through connection string or something)
Like Oracle, SQLite lacks boolean (bit) columns, but it does support bitwise operators, so multiple boolean values can be packed into an integer with powers of 2.
Setting a value with an update is a bitwise or, and checking a value is a bitwise and.
$ sqlite3
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select 2 | 1;
3
sqlite> select 3 & 1;
1
Oracle only has a "bitand" function, but "bitor" has been posted on Oracle user sites:
create or replace function bitor(p_dec1 number, p_dec2 number) return number is
begin
  if p_dec1 is null then
    return p_dec2;
  else
    return p_dec1 - bitand(p_dec1, p_dec2) + p_dec2;
  end if;
end;
/
That isn't necessary in SQLite.
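Concretely, the set-with-OR / check-with-AND pattern looks something like this in SQLite (the table, column, and bit assignments are invented):

CREATE TABLE accounts (id INTEGER PRIMARY KEY, flags INTEGER NOT NULL DEFAULT 0);
-- 1 = active, 2 = admin, 4 = verified (powers of two)
UPDATE accounts SET flags = flags | 2 WHERE id = 1; -- set the admin bit
SELECT id FROM accounts WHERE (flags & 2) != 0;     -- find all admins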
Searches on these will likely require full table scans, a definite performance disadvantage.
Most DBs don't (and won't ever) support that because the results become undefined. You could get different values out for the same query in different versions of the DB (due to query planner changes) or even within the same version and the same database over time (as the relative sizes of your tables change so the planner choses a different route, or index options are changed, so the planner is forced to go with a different route or has the option of a better one).
Though in a world where “eventual consistency” is accepted, may be “eeny, meeny, miny, moe” will be more generally considered OK at some point :)
At least SQLite tries to be consistent (taking the value from the row where the aggregate found its value) where that is possible (which it often isn't), which MySQL (which also allows non-grouped, non-aggregated columns in the projection list) does not.
Use of that group by behavior is a bug, nearly every time I've seen it. (Though usually I'm seeing it in MySQL.)
Columns that are not part of the group, an aggregate of the group, or a functional dependency of the group aren't stable — there are multiple rows that could supply the value, and they needn't have the same value.
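A made-up illustration of the hazard (table and column names are invented):

SELECT dept, name, COUNT(*) FROM employee GROUP BY dept;
-- `name` is neither grouped nor aggregated, so any row in the department
-- could supply it, and the chosen row can change between runs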
Yeah, I've seen that bug in a monitoring graph that was showing data for say every 5 minutes but the (MySQL) database table had data per minute; so you'd see a spike in the graph, but refresh a minute later and voila no spike anymore.
exactly, SQL being a declarative language, you should always prefer explicit over implicit
SQL bugs are very hard to detect when the query returns a result that looks right, and because the language is declarative it's easy to make those mistakes.
SELECT max(salary), first_name, last_name FROM employee;
This returns one row! AFAIK all other databases would return one row per record in the table, where first_name and last_name would be from that row while max(salary) would be the value from the row with the max salary. Is this ANSI SQL compatible?
<SQLite's behaviour is> "Be liberal in what you accept". This used to be considered good design - that a system would accept dodgy inputs and try to do the best it could without complaining too much. But lately, people have come to realize that it is sometimes better to be strict in what you accept, so as to more easily find errors in the input.
I encountered a case a few years ago where a query failed because the database had been opened read only but it needed to recover a journal. The solution was to just never open the database read only... but this seems less than optimal. Anyone else seen this?
Once I worked with a codebase that used SQLite to mock MySQL for tests. It mostly worked, but they're different enough that things slipped by around nasty edge cases.
Though there is an issue that some of sqlite's own functions are unaware of this and will end early if a NUL is encountered: https://www.sqlite.org/nulinstr.html
There are other encodings, yes, where any byte string is valid within that encoding. The Unicode encodings (UTF-*) don't have that property, however.
The SQLite docs say this about text values or columns, though it's a bit muddy which is which. (But it doesn't really matter.)
> TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
But it's the best reference I have for "what are the set of values that a `text` value can have".
E.g.,
sqlite> PRAGMA encoding = 'UTF-8';
sqlite> CREATE TABLE test (a text);
sqlite> INSERT INTO test (a) VALUES ('a' || X'FF' || 'a');
sqlite> SELECT typeof(a), a from test;
text|a�a
Here we have a table storing a "text" item, whose value appears to be the byte sequence `b"a\xffa"`¹. That's not valid UTF-8, or any other Unicode. The replacement character here ("�") is the terminal saying "I can't display this".
Presumably for this reason, the Rust bindings I use have Text value carrying a [u8], i.e., raw bytes. It's easy enough to turn that into a String, in practice. (But it is fallible, and the example above would fail. In practice, it gets rolled into all the other "it's not what I expect" cases, like if the value was an int. But having a language have a text type that's more or less just a bytestring is still regrettable.)
¹borrowing Rust or Python's syntax for byte strings.
"Similarly, SQLite allows you to store a 2000-character string into a column of type VARCHAR(50). Other SQL implementations would either throw an error or truncate the string. SQLite stores the entire 2000-character string with no loss of information and without complaint."
That's not flexible typing, that's user-hostile behavior.
In PostgreSQL a varchar is also a text type column but with an added length check. So it's actually just slower to insert and not more space efficient, as far as I know. So SQLite is not alone in this.
What part of varchar is a “critical optimisation” exactly?
If it’s ensuring that the data is never larger than X, then you can use a CHECK constraint. It has no impact on storage in sqlite, or in postgres for that matter.
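For instance, a sketch of the CHECK approach (table and column names are invented):

CREATE TABLE posts (
id INTEGER PRIMARY KEY,
title TEXT NOT NULL CHECK (length(title) <= 50)
);
-- a 2000-character title now fails with "CHECK constraint failed"
-- instead of being silently stored in full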
wtf. who would ever want that?