SQLite Is Serverless (sqlite.org)
501 points by alexellisuk on Jan 26, 2020 | 440 comments



I think a good, under-appreciated use case for SQLite is as a build artifact of ETL processes/build processes/data pipelines. It seems like a lot of people's default, understandably, is to use JSON as the output and intermediate results, but if you use SQLite, you'd have all the benefits of SQL (indexes, joins, grouping, ordering, querying logic, and random access) and many of the benefits of JSON files (SQLite DBs are just files that are easy to copy, store, version, etc. and don't require a centralized service).

I'm not saying ALWAYS use SQLite for these cases, but in the right scenario it can simplify things significantly.

Another similar use case would be AI/ML models that require a bunch of data to operate (e.g. large random forests). If you store that data in Postgres, Mongo or Redis, it becomes hard to ship your model alongside updated data sets. If you store the data in memory (e.g. if you just serialize your model after training it), it can be too large to fit in memory. SQLite (or another embedded database, like BerkeleyDB) can give the best of both worlds-- fast random access, low memory usage, and easy shipping.


I have been using SQLite as a format to move data between steps in a complicated batch processing pipeline.

With the right pragmas it is both faster and more compact than JSON. It is also much more "human readable" than gigabytes of JSON.

I only wish there was a way to open an http-fetched SQLite database from memory so I don't have to write it to disk first.


> I only wish there was a way to open an http-fetched SQLite database from memory so I don't have to write it to disk first.

The sqlite3_deserialize() interface was created for this very purpose. https://www.sqlite.org/c3ref/deserialize.html
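For what it's worth, Python's stdlib has exposed this since 3.11 as Connection.deserialize(). A rough sketch (the URL is a placeholder):

    import sqlite3
    import urllib.request

    # Fetch the database bytes over HTTP (placeholder URL).
    data = urllib.request.urlopen("https://example.com/db.sqlite").read()

    # Wraps sqlite3_deserialize(); requires Python 3.11+.
    conn = sqlite3.connect(":memory:")
    conn.deserialize(data)

    for row in conn.execute("SELECT name FROM sqlite_master"):
        print(row)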


If the language's sqlite bindings don't offer a way to load a database from a string, if you're on a modern linux kernel (3.17+) you can make use of the memfd_create syscall: it creates an anonymous memory-backed file descriptor equivalent to a tmpfs file, but no tmpfs filesystem needs to be mounted and there's no need to think about file paths.
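From Python, that trick might look like this (os.memfd_create needs Python 3.8+ and Linux; untested sketch, placeholder URL):

    import os
    import sqlite3
    import urllib.request

    data = urllib.request.urlopen("https://example.com/db.sqlite").read()

    # Anonymous memory-backed file descriptor: no tmpfs mount, no path cleanup.
    fd = os.memfd_create("db.sqlite")
    os.write(fd, data)

    # SQLite can then open the descriptor through its /proc alias.
    conn = sqlite3.connect(f"/proc/self/fd/{fd}")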


You can use the memvfs module to load in-memory databases if you're using the C API. I'm not sure how many higher-level APIs support it though.

[1] https://stackoverflow.com/a/53453338/3063 [2] https://www.sqlite.org/loadext.html#example_extensions [3] https://www.sqlite.org/src/file/ext/misc/memvfs.c


A very interesting approach is sqltorrent (https://github.com/bittorrent/sqltorrent): the sqlite file is shared in a torrent, and all queries will touch a specific part of the file, which is downloaded on-demand.

Also check https://github.com/lmatteis/torrent-net


Incredibly odd, but so awesome


  $ mount -t tmpfs none /some/path
  $ cp db.sqlite /some/path/db.sqlite
  $ sqlite3 /some/path/db.sqlite

We've been abusing tmpfs for more than 10 years to get around the IO layer's failings. It's probably still a valid pattern.


This is amazing, I think you may have just solved and headed off a huge number of odd problems for me.

Could you talk more about what pragmas you've been using and why?


Not the OP, but I find `PRAGMA synchronous = OFF` makes the creation of DBs vastly faster ...
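A fuller set I've seen used when building throwaway artifacts (unsafe if the process dies mid-write, but fine when the output can simply be rebuilt):

    import sqlite3

    conn = sqlite3.connect("artifact.sqlite")
    # Trade crash-safety for speed; acceptable for rebuildable outputs.
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA temp_store = MEMORY")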


> I only wish there was a way to open an http-fetched SQLite database from memory so I don't have to write it to disk first.

Ramfs?


tmpfs is the better-behaved option should you run out of resources, see:

https://www.jamescoyle.net/knowledge/951-the-difference-betw...

I'm still remembering old-school ramdisks under Linux which were finite in both number and size, both to quite small extents. I think there were 8 (or 12 or 16?) total ramdisks available, of only 2-4 MB each, configurable with LILO boot options.

That's now ... mostly taking up valuable storage in my own brain for no useful effect.


It looks like a good intro, thanks. I wasn't aware of these technologies, but I knew it was possible to build an FS in RAM. So I just put these two keywords together.


FWIW, I learned a few things researching my answer.

(A prime validation for answering questions, BTW.)

My first read was that the old-school ramfs / ramdisk limitations still held. I can't actually even find documentation on them, though I'm pretty sure I'm not dreaming this.

Circa the 2.0 kernel, IIRC, possibly earlier.

OK, some traces remain, see:

https://www.tldp.org/HOWTO/Bootdisk-HOWTO/x1143.html

Note that this is OBSOLETE information.


What pragmas do you use? It sounds amazing!


Using SQLite in my ETL processes is something I have done for over a decade. It's just so convenient and, at the end, I have this file that can be examined and queried to see where something might have gone wrong. All of my "temporary" tables are right there for me to look at. It is wonderful!


Yes! Along these lines I heartily recommend `lnav` ^1, a fantastic, lightweight, scriptable CLI mini-ETL tool w embedded sqlite engine, ideally suited for working with moderately-sized data sets (ie, millions of rows not billions) ... so useful!

1. https://lnav.org


I have used it to inspect, say, the history of a user's requests on a load-balanced server. I like to permanently store the results of the logfile excerpt to a DB table for posterity and future reporting.

Figuring out how to enter "sql" mode in lnav, generate a logfile table, and then persist it from an in-memory sqlite db to a saved-to-disk sqlite db .... was frustratingly annoying.

It boils down to:

    :create-logline-table custom_log
    ;ATTACH DATABASE 'test02.db' AS bkup;
    ;create table bkup.custom_log as select * from custom_log;
    ;detach database bkup;
If I recall, you cannot call sqlite commands like ".backup" in lnav's SQL mode. So lnav's interjection into the sqlite command processing is annoying (I'm actually very familiar with sqlite).


Would you mind elaborating on your ETL process a little more? I'm a junior DE and curious about how I would implement this.


It's pretty straightforward, really.

I construct the .sqlite database from scratch each time in Python, building out table after table as I like it.

Some configuration data is loaded in from files first. This could be some default values or even test records for later injection.

The input data is loaded into the appropriate tables and then indexed as appropriate (or if appropriate). It is as "raw" as I can get it.

Each successive transformation occurs on a new table. This is so I can always go back one step for any post-mortem if I need to. Also, I can reference something that might be DELETEd in a later table.

Often (and this is task-dependent), I will have to pull in data from other server-based databases, typically the target. They get their own tables. Then I can mark certain records as not being present in the target database, so they must be INSERTed. If a record is not present in my input and is there in the target, that would suggest a DELETE. Finally, I can compare records where some ID is present in my input and my .sqlite, they might be good for an UPDATE. All of this is so I can make only the changes that need to be made. Speed is not important to me here, only understanding what changes needed to be made and having a record of what they were and why.
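In SQL terms, that comparison is just a few outer joins. A sketch (table and column names here are illustrative, not my actual schema):

    import sqlite3

    conn = sqlite3.connect("etl_run.sqlite")
    conn.executescript("""
        -- rows in my input but not the target: candidates for INSERT
        CREATE TABLE to_insert AS
            SELECT i.* FROM input i
            LEFT JOIN target t ON t.id = i.id
            WHERE t.id IS NULL;

        -- rows in the target but not my input: candidates for DELETE
        CREATE TABLE to_delete AS
            SELECT t.* FROM target t
            LEFT JOIN input i ON i.id = t.id
            WHERE i.id IS NULL;

        -- rows in both where the payload differs: candidates for UPDATE
        CREATE TABLE to_update AS
            SELECT i.* FROM input i
            JOIN target t ON t.id = i.id
            WHERE i.payload <> t.payload;
    """)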

I am happy to say that an ETL process I wrote using this general method back around 2009 is probably still running. I haven't had to touch it in years. Occasionally I will receive questions as to "why did this happen?" and I can just start running queries on the resultant .sqlite database file, kept with the logs, for answers.

Similarly, I can use these sorts of techniques when I am analyzing other datasets. The value here is that I can just refresh one table when the relevant data comes in, rather than having to run the ingest process for everything all over again. This can save me a lot of time.


Awesome - elegantly simple using very common technologies.


I am not a very talented programmer so I stick very close to what is common, standard, and easy to understand. It usually means I am on the downslope of the hype cycle and it limits some opportunities but I have become okay with that.

I have gotten some CS students who were about to shoot flies with various cannons turned on to SQLite. I kept a couple of the decent books about it nearby and would shove it into their hands at that point. Usually a week later they would be raving about it.


Do you still have the titles of those books at hand? I'd love to take a look at them.


They are The Definitive Guide to SQLite by Mike Owens and Using SQLite by Jay A. Kreibich. I am quite sure they are more book than I needed, I only plumbed a fraction of SQLite's immense capabilities.


Do you generate the file from scratch every time or do you modify the previous one as new data arrives?


Depends on what you want... if you have a separate db project, you can have the output of that project be a clean database for testing other things, or a set of migration scripts for existing deployments.

I've been working on doing similar with containerized database servers for testing, while still having versioned scripts for prod (multiple separate deployments).


It is a bit of a hybrid.

In the early stages of development of whatever the ETL process is, I keep the database and just empty it out each time. As I got more of a sense of what I needed, I started DROPing my TABLEs more often and remaking them. Eventually I would make the whole database from scratch once I was along the way and had most everything fleshed out.


Ok. So each export is a full dump, not a delta on a previous one.

Do you anticipate hitting a wall at some point where the total time becomes a problem?


Well, it depends on the process. Some were full dumps, some were deltas pushed up to the final database, sometimes both (this product in particular had a load-from-file capability that you were supposed to use, but it had some edge cases that were not well addressed).

No, the time never grew significantly.

For one of the analysis projects, just one step of the analysis was quite time consuming but it would have been that way no matter what. SQLite allowed me to let it grind away overnight (or even over a weekend) on a workstation without tormenting production servers.


We do something like this; one of the outputs of the data pipeline is an sqlite file that's deployed nightly along with code to App Engine. The sqlite stuff is all read only, read/write data for the app is stored in firestore instead.

We initially used json but ran in to memory issues; sqlite is more memory efficient and being able to use SQL instead of the wild SQL-esque is both faster and more reliable.


Yes, I have been doing same thing, only with LMDB.

I do not think LMDB could load from an in-memory-only object (as it has to have a file to memory-map), however.

But same design reasons, I wanted something that

a) I can move across host architectures

b) something that can act as key-val cache, as soon as the processes using it are restarted (so no cache hydrating delay)

c) something that I can diff/archive/restore/modify in place

We tested SQLite for the above purpose at the time, and on writing speed and ( b ), LMDB was significantly faster.

So we lost the flexibility of SQLite, but I felt it was a reasonable tradeoff, given our needs.

I also know that one of Intel's Python toolkits for image recognition/AI uses LMDB (optionally) to store images, so that processing routines do not have to incur the cost of directory lookups when touching millions of small images. (forgot the name of the toolkit though)…

Overall, this is a very valid practice/pattern in data processing pipelines, kudos to you for mentioning it.


"wild SQL-esque" should have been "wild SQL-esque thing I wrote to query the JSON"


I've wondered about this too, but have not gotten around to trying it yet.

We get a gnarly csv log file back from our sensors in the field, which is really a "flattened" relational data model. What I mean by that is a file with "sets" of records of various lengths, all stacked on top of each other. So, if you open it in Excel, (which many users do), the first set of 50 rows may be 10 columns wide, the next 100 rows will be 20 columns wide, the next 45 wide, etc. And, the columns for each of these record sets have different names and data types.

Converting to JSON is obvious, but I've thought about just creating a SQLite file with tables for each of the sets of records. Then, as others have said, you can use any number of tools to easily query/examine the file. Also can easily import into a pandas data frame.

One concern is file size. Any comments on this? I can try it, but wonder if anyone knows off the top of their heads if a large JSON file converted to a SQLite file would be a lot larger or smaller?

edit: clarity


Yes, it is great for that.

You only have to read the CSV file once, and after that you have a nice set of tables you can query any which way you want.

I use SQLite as an intermediate step between text files and static HTML, for example.
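The ingest step is only a few lines of Python (a sketch; file, table, and column names are invented):

    import csv
    import sqlite3

    conn = sqlite3.connect("data.sqlite")
    conn.execute("CREATE TABLE readings (sensor TEXT, ts TEXT, value REAL)")

    with open("readings.csv", newline="") as f:
        rows = ((r["sensor"], r["ts"], float(r["value"]))
                for r in csv.DictReader(f))
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()

    # From here on, query any which way you want:
    for sensor, n in conn.execute(
            "SELECT sensor, COUNT(*) FROM readings GROUP BY sensor"):
        print(sensor, n)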


I was under the impression SQLite files were not supposed to be moved across architectures.


Thankfully that's not the case:

> The SQLite file format is cross-platform. A database file written on one machine can be copied to and used on a different machine with a different architecture. Big-endian or little-endian, 32-bit or 64-bit does not matter. All machines use the same file format. Furthermore, the developers have pledged to keep the file format stable and backwards compatible, so newer versions of SQLite can read and write older database files.

https://www.sqlite.org/different.html


How does this work with container/ephemeral services such as typical K8s deployments? Can I trust the file system mounting via resources like StatefulSets or FS mounts? For that matter, Heroku, App Engine, Cloud Functions, whatever?

Our current setup is having all our services in kubernetes but our databases in stateful VMs. I do occasionally stuff job-reports and similar data into postgres rows since it's already there, but I've been unhappy with our ETL setup and would be interested in hearing techniques to improve it.


ETL workers themselves are typically ephemeral, plumbing batches between remote storage systems like Postgres, S3, and Hive. You might use local disk as scratch space during the batch, but not as a sink.


From experience, and supported by the SQLite docs, I can tell you that trying to run sqlite on files on an NFS-mounted filesystem will not work. See section 2.1 of this document [1] and the related HN discussion [2]

[1] https://www.sqlite.org/howtocorrupt.html

[2] https://news.ycombinator.com/item?id=22098832


Use a volume container mounted against a persistent storage engine on the node and do pod mounting from those containers. Stateful VMs are often a better choice for production imo.

I'm in favor of leveraging ISP dbaas and persistence offerings over trying to home grow something. It just depends on where you are coming from and/or what you are trying to do... K8s alone avoids so much lock in, and as long as whatever storage option (container mount) or dbaas you use is portable, I don't think it's so bad in either case.


I don't know what "ETL" means here, although SQLite does include a JSON extension to read/write JSON data too, so you can use SQL and JSON together if necessary.


It's an acronym for munging some data

ETL = extract, transform, load


Yes, if that is what you are trying to do, I think SQLite is good. The SQLite command shell also has a .import command to read data from a file, and you can also import into a view and use triggers to process the data (this is something I have done). There are also functions and virtual tables for JSON, and you can load extensions (written in C) to add additional functions, virtual tables, collations, etc. So for many cases, SQLite is useful.


This is inspiring, I cannot believe I had not considered this before!


I’d love a SQLite to macOS Excel (or any macOS spreadsheet application) workflow so less technical users can do analysis. Has anybody pulled this off?


You mean like the .excel command?

"... causes them to accumulate output as Comma-Separated-Values (CSV) in a temporary file, then invoke the default system utility for viewing CSV files (usually a spreadsheet program) on the result. This is a quick way of sending the result of a query to a spreadsheet for easy viewing"


Or load it in Metabase (as a macOS app).


You can do that with Power Query.


Beware opening SQLite files you didn't create: https://research.checkpoint.com/select-code_execution-from-u...


Terrific! Data pipelines I've built have had JSON as their intermediary steps which I'm growing weary of.


> Of those that are serverless, SQLite is the only one known to this author that allows multiple applications to access the same database at the same time.

IIRC, MS Access allowed that, which explained a lot of its popularity.


This was a standard feature of flat file databases in the early 90s. There were many products. They often had an ODBC driver, which provided a SQL front end. dBase .dbf files were often used for storage. The arcane file locking in Windows is intended for exactly this kind of application.

Apart from quality (!), SQLite's main advantage over these products is broad platform support. And continued existence.


We still have a multi user desktop application that uses the Borland database engine and dbf files over a windows file server. The BDE is no longer supported but it still works


Unix-like operating systems also allow locking files for on-disk databases (like mbox files), it's just not what every application does by default.


Used to love dbm files. 1980s serverless NoSQL. They're still totally usable although we have LevelDB nowadays too.


In the old days of DOS I could do that with Clipper 5.2. Two or more PCs using NetBIOS shared directories could work on the same database, provided that they wouldn't write to the same record. That wasn't a problem because the environment (can't recall for sure if it was Clipper or an external library) allowed single-record locking, so I enclosed the lock attempt in a timed spinlock-like block which attempted once per second for 10 or 15 seconds before obtaining the lock or failing with a record-busy error. No SQL involved however, and indexes had to be rebuilt by hand like twice a day to be safe. But those were the days when a 486 could crunch a 100,000-record db with multiple indexes very easily.


This is the first time I've heard Clipper mentioned in a looong time. I was just talking the other day how useful it was in those days that there were tools that included the database and UI all in one. Everything was a bit easier because the language assumed a database, rather than just using a library to interface with one.


Pretty much all dBase derivatives had both file-based and record-based locking, that was specifically designed to work in a networked environment with file shares (although NetWare was much more common than NetBIOS back in the day). RLOCK() was the function you had to use for records.


MS Access “allowed” it but depending on the version it could be quite problematic. We had a use case where Tableau connected to an Access file read-only (but which another program used as a data store and wrote to often) on a Windows file share, and once in a while the lock files would get screwy and we would have to manually delete them to get things working again. Deleting lock files could be a huge chore because you have to figure out stuck processes and kill them. Task Manager wasn't up to the task, so we had to use Sysinternals tools to help with that.

Access is really meant for single-user scenarios I feel. Maybe the locking mechanism has gotten better but for multiuser access I tell people to use a real SQL database.


It was ok pre-Windows 2000. I had Access 97 on 9 workstations on NT with never a problem! The moment Office 2000 and Windows 2000 came out it went to hell. Moved to SQL Server then. I am not sure what changed.

It was quite frankly the most productive custom business software package I have ever used. Something custom that would literally take 10 people a week, you could do in an afternoon with 1 person in Access. I suspect the same is true now.


We had a much heavier multiuser setup on Access 97, and it worked fine. We had the same breakdown as you did, and had to do a registry edit on all users' machines to keep it limping along until we could move to SQL Server, which was the right thing to do anyway.


Interesting. Nice to know it wasn't just me. There was no one to talk to about it back then so you had to think on your toes.

SQL Server was a crap load easier to back up reliably


> Access is really meant for single-user scenarios

In a lot of cases, I've found it to be the best tool for a temporary or one-off (preferably smaller scale) data mining/massaging project. The query-building interface was the way I originally learned the basics of relational databases, and it also helped me get a better grasp of SQL-- the ability to flip back & forth from the GUI query builder to the SQL it generates is nice.

On the multiuser side, however, I have found a couple workarounds in the past. If you have everyone operate locally and space out their central database connections to intermittent, automated burst queries, you can get more concurrent users than you might expect. It helps to have fewer users per table, as well, and of course it really helps if they don't need to see the most recent adds/changes in real time.


My recollection was that the automated wizard loved to nest statements, and it was a huge chore to manually massage stuff after the fact; almost incomprehensible to parse. But I agree that it was a real godsend for people trying to branch out from the limitations of Excel.


Yeah, Access' Jet Engine and many of its issues are very well known.

[https://en.wikipedia.org/wiki/Microsoft_Jet_Database_Engine]


And all xBase based programming languages, Clipper, FoxPro, Visual Assist, and their competition Paradox.

I loved the Clipper 5 OOP capabilities, sadly Visual Objects tried to be too much like Visual Basic and some of the easiness was lost.


Since Access 97 or so it could actually use MS SQL as a backend, which largely removed those issues; the Access "database" was then merely a VBA GUI for SQL Server. This works pretty well.


I am not sure about MS Access. From what little I know, two people opening it from a network drive would mostly result in the database being corrupted.


MS Access "allowed" it. SQLite actually works.


Access is actually designed to work in a multiuser situation over a LAN for concurrent read and write. SQLite isn't afaik. As long as the LAN was cabled I never saw any issues. The only reason it was necessary to move to a server type database was because people were insisting on wifi networking.


It might have been designed for it. It was poorly designed for it. It was a clusterfuck of corruption when actually utilized.


You basically had to engineer around the corruption risks for anything actually in production. Data redundancy, old-school 'Save' buttons (your work isn't properly saved until you click it, because it sends copies of that work to multiple backends) and isolating users as much as possible to their own 'shards' was part of how I saw it kludged through in the real world.

There was still a lot of weird behavior, though, and while you could reduce data loss you couldn't eliminate it.


Penny wise, pound foolish. I had a president waste two to three hours a day instead of using a 40 dollar service to take care of the incredibly simple and repetitive task. Eventually, when I was planning to leave anyway, I asked him what his time was worth per hour. I guess it was less than minimum wage at the time ($5.25 an hour).

I'm probably going to have nightmares filled with screens of the access database corruption dialogs


What's anyone's time worth? Might have kept him from doing something foolish.


Valid point. He did defraud the federal government for 600k. If he had more time on his calendar he could have fucked over the American taxpayers ever more.


It depended. It suited a small cabled office with 10 to 15 computers fine. I saw such setups work fine for 12 years without corruption. But when wifi came along people started connecting that way, sometimes unintentionally, and corruption became an issue. So then we bit the bullet and moved the backend to a server database. Not a single database corruption since.


But this is the difference between supporting concurrent access or not. The fact that the feature relied so much on network reliability means that race conditions still existed.


I'm not arguing with you, but the fact remains that unlike SQLite, Access doesn't lock the database file so that only one user can write at a particular moment; it's more granular than that. It's also the fact that this worked very well for databases on small cabled LANs where no more than a dozen-ish computers might be interacting with the database at any one time. It was never designed for use over the internet (having said which, I have maintained a forum running an Access mdb file as its backend since forever without a hitch; the forum software process does the database edits, so it's like a database server process in a way, I suppose).


I'm not sure I understand what the physical transport has to do with this. Can you please expand a bit?


Maybe disconnects and reconnects cause a Windows network share to lose locks or something like that.


Yes, for Access (and sqlite?) it is the client computers that do the edits directly to the database file, rather than via a mediating server side process, so they are particularly vulnerable to network interruptions while in the middle of an edit. Cable is far more reliable than wifi and so far fewer interruptions. I think that's it.


OK, but you can't really say you support network access if you rely on it to be flawless. Working fast, maybe; but not working reliably when your network is not super reliable is not really support.


It was a different time.


Back in the day, which was a Wednesday


Race conditions not triggering when the transport is fast enough?


I understand the distinction now :), thank you. Funny.


Good point.

Berkeley DB also supports multiple processes accessing a database concurrently, as far as I know.

I was wondering if the authors were referring to SQL-like databases, but MS Access seems to be one?


Yes, MS Access supports both SQL and non-SQL APIs, and not mentioning it isn't very professional of the author.


Is there anyone actually choosing MS Access for new projects in 2020?


They were indeed referring to SQL servers. The paragraph starts with

> Most SQL database engines are client/server based.

I skipped it for brevity.


It did allow it, but getting it running reliably was basically getting punched in the face on a daily basis. Act! was similar, except they included getting kicked in the balls two to three times a week.

The reason access was popular was it made any middle manager that could wield an Excel spreadsheet think they could build a database.


I have a sales guy buddy with a 50K contact database in Act! that runs a Win95 VM to maintain access to this data. I've tried to figure out how to migrate it, but it is more than a few hours and who has that kind of time?


Didn't prevent it is closer to my recollection. My first job was on a system that used this idea pretty heavily - what a nightmare! We only got that code stabilized once we removed any sharing (and later removed Access completely).


Not open source, but Raima supports multiple processes on a single db file as well. https://raima.com/


FoxPro/Foxbase as well. In the Foxbase days predating FoxPro, there was even a “multi-user” version that, unsurprisingly, cost more than the standard single user.

And plenty of other file-based DBs such as Borland’s Paradox and the db engine, dBase; pretty sure FileMaker, too.


HSQL does too I believe.


JET databases are what MS Access uses underneath, and attempting multiple concurrent usage on JET databases is a good way to corrupt them.


Waaaaait wouldn’t that mean the file system is the server, with some binary API and responsible for handling concurrent access and locks for the entire file? LOL.


Serverless in this situation means that you don't really have to provision or set up an actual server to handle the database; the client itself just needs the ability to read and write a SQLite file.


Because many of the locking things are outsourced to the file system, which acts as a server for concurrent threads.

A server is something that listens for commands from various clients and executes them.


That's just a circle jerk. There is an agreed-upon definition of serverless.

If we reduce things to the absurd, we stop being able to reason about things.

https://en.m.wikipedia.org/wiki/Serverless_computing


>There is an agreed upon def of serverless.

Sorta but not really. The fact people have worked backwards from marketing names to try and constructively define inherently self-contradictory branding (rather than create a descriptive category into which we place questionable names and ignore them) is an embarrassment for everyone except the marketing departments.


I honestly dislike all this naming fad too, and feel the internet's been taken over by the management and marketing folk .. but still, I try not to hyperbolize about it - too bad, but it's sorta OK, and it's ultimately irrelevant for the job; the dbConnection is simply remote.


It's called a buzzword, nothing more nothing less, it's like "cloud computing", eventually it becomes meaningless.


We now have Kubernetes though, and OpenShift. And I actually believe in Docker swarms and compose battalions.

I'll give you that it's hard to discern when a thing is a real change and when it isn't, as the titans of industry try to peacock around.

I'd just examine what they actually mean, for good measure. They're trying to sell things they have no idea about - but that's what the hierarchy of commerce is for.


Thank you for saying this. People seem to argue semantics to appear clever and it really pollutes communication and reasoning.


That is not the definition of serverless in question.


If you examine it closely, you'll see that they have a lot in common, and that the author defined the term neo-serverless to attempt to address this - both definitions share the fact that multiple applications (in clientless form) can access the database. He even gives Amazon S3 as an example of neo-serverless.

I agree with you that it takes a bit to marry both, but the stretch isn't far.

I'm also avoiding criticizing the author for not sticking to the main def, since I'd then be red-herringing the post.


> both definitions share the fact that multiple applications (in clientless form) can access the database

That is like the least relevant thing on that entire page, to be frank. "SQLite is Serverless" is specifically referring to SQLite being an embeddable library that runs in the same process (and same thread, even) as your application vs. the client-server architecture (database in another process, with communication via a port) that DBMSes like MySQL and co have.

> Im also avoiding criticising the Author for not sticking to the main def

The "main definition" (i.e. the web dev buzzword) came into being years after this post was written.


You are right


Well, you effectively offload some of the workload normally handled by the server to the OS's filesystem layer, that's true. In particular you rely heavily on the FS locking working correctly. Calling the FS a "server" is a bit of a stretch though.


> It is important to understand these two different definitions for "serverless". When a database claims to be "serverless", be sure to discern whether they mean "classic serverless" or "neo-serverless".

It's really not important to understand that distinction, because this author seems to be the only one making it. Everyone knows what "serverless" means at this point, and it's not an embedded DB.


I agree! Also the cryptography people should stop calling their hobby "crypto", everybody knows crypto means bitcoin and stuff. Snobs.

(here's SQLite's "Serverless" page the way it was in 2007: https://web.archive.org/web/20071115173112/https://www.sqlit...)


I'm not sure if this comment missed a /s somewhere...


it wouldn't be wrong to include one, but it didn't necessarily need it. the link in parentheses serves the same purpose.


nitpick: it would be wrong to include one, because sarcasm isn't sarcasm if you say it's sarcasm.

personally i feel like even including the archive.org link was pushing it over the edge :)


This is such an air-headed comment to make. You must realize that the page describing how SQLite is “serverless” has been up probably longer than your entire adult life. It is important in its context; they are not trying to “redefine” (lol) the term.


> (This section was added on 2018-04-02)


That’s exactly the point, the section was added to note the new definition of serverless vs the one that has been there for over a decade.


It doesn’t matter. It is not the generally accepted definition of serverless. The meaning of a word is based on how it is generally used, and almost nobody means this when they use the word “serverless”.

At this point trying to use the word in this way just creates a bunch of unnecessary confusion. Call it something else so we can move on to more important (and clear) discussions.

(Also there are a lot of assumptions and snarky comments in these responses about my age, which is quite rude and pretty elucidating.)


Their usage of serverless predates all of these cloud application offerings. Just because a lot of people in this particular echo chamber like to think of serverless as meaning one thing, there are plenty of others who have been using the same word to mean something different for quite a bit longer.

Moreover, the cloud as a service definition isn’t even accurate. There still is a server — it’s just not one the developer has to worry about.


The page was written before that usage of the term serverless existed, and anyone reading it would have known exactly how it was meant. The "neo- vs classical-" note was added in 2018 because the distinction was being made.

Is your argument that they should replace "serverless" with "runs without a server?" That seems like a strange position to me.


No doubt you're a little surprised that hacker news doesn't contain that many posts about people breaking into computer information systems, amirite?


For sure. Everyone knows hackers are criminals now. So why are we all here on a public criminal forum discussing our crime?


I think hosting providers marketing their managed servers as "serverless" is what has created confusion.

Imagine if Uber called itself a "carless" taxi.


Although clearly different in your world, to me serverless meant that it can run without a central server (it was used in peering systems, e.g. games in the '90s, and co-distributed systems with workers sharing info - a bit like blockchain in the '00s). Occasionally, it was used when the app was working offline.

I think it highly ironic that the marketing hype just upended what's really going on. The new stuff's 100% server bound, as most people realise.


Everyone knows that “serverless” anything doesn't run on a server, be it AWS or Azure instances or what have you. That would be both ironic and silly, like someone used the wrong word or something.

Serverless databases and what have you have been around longer than the current batch of folks trying to redefine things (or, more charitably, who ran out of names to call things). Like it or not, there is a distinction, even if the old definition was before your time.


Nobody called them serverless DBs, and nobody does.

There is a vocal minority of reductionists here on HN who dismiss the accepted definition of serverless because the literal meaning doesn’t make sense. It’s just noise though. “Serverless” does have a specific meaning and it’s not “there are no servers anywhere”.

I was around when there was no serverless. Things change, new words arise. Time for us to get with the times.


I remember Firebird called "serverless" sometime around 2006, in the same meaning that SQLite uses. It was pretty common terminology for RDBMS.


nah


> I was around when there was no serverless.

We used to call it /cgi-bin/ ;)


>Everyone knows what "serverless" means at this point, and it's not an embedded DB.

Serverless is a marketing term at this point (as it was when it started). This post brings some welcome definitions and expands it to something that has many of the same attributes but wasn't appreciated as such.


Serverless refers to platforms with specific and well-understood properties, such as managed servers, autoscaling, usage-based pricing, etc.

What parts of SQLite have these properties?


And for others the word “serverless” literally means without a server. As in... a library. It’s been a well used definition for many years.


In the case of SQLite, it doesn't have a server process.

In the case of cloud "serverless" it doesn't have a physical ser-- Wait did you just mention "managed servers"?


SQLite has specific, well-understood pricing and scaling properties.


This so much!

But is it web-scale?! /s


The "neo-serverless" thing is strange. But then, so is "serverless". "Serverless" systems are just time-shared servers. But only if they use new, cool technology. Shared hosting with vendor managed MySQL doesn't count, apparently.


After this article has been uncovered by HN, that may just be about to change.


Exactly. If anything, it seems like the author wants to take the word "embedded", which everyone understands, and somehow redefine it to "serverless" which everyone also understands, and which this is not.


Yes, of course, the author took a time machine to 2007 to try and co-opt your buzzword.


Meh. If we allow serverless to make REST calls, is accessing a file system any different?

I had more trouble with the assertion that the embedded DB would be maintenance-free. I started a top level comment asking for someone to explain how that would work. The part of my brain that protects me from scams is screaming “something for nothing”.


> "I had more trouble with the assertion that the embedded DB would be maintenance-free. I started a top level comment asking for someone to explain how that would work."

When is the last time a certed-up DBA had to do any maintenance on your Firefox install's places.sqlite database? That's what's meant by maintenance free; you can reasonably employ sqlite databases on users' computers, without users having the foggiest idea of what a database even is.


In case SQLite is not enough and you need redundant servers or clustering, there's also database servers that use SQLite as storage engine: http://rqlite.com/ https://dqlite.io/


So you run an app with sqlite on one server and sync to sqlite on another server? What would be the benefit compared to using a separate db server, as in the 'neo-serverless' setup?


No, in this case you always have to use rqlite/dqlite because they manage the network synchronization. They use SQLite as storage engine (one SQLite database per server instance).


I understand that in those cases rqlite/dqlite is used. But that is just a technical detail. My point is that I am running two servers: one with the app and Xqlite and another one with Xqlite.

In case of a neo-serverless setup, I also have two servers: one with the app, the other one with the db server.

So what are the benefits of the Xqlite setup? I looked into that before and for one thing, Xqlite is slower (obviously) than just sqlite. So speed is not a key benefit. I also will have to manage both servers myself.

At least for a separate db server I have the benefit that I can buy that as a service, incl. management, backups and such.


Not the author, or knowing of all the technical details... simplistic replication structure and redundancy/failover without an expensive or more complex RDBMS solution while still self-hosting the service.

There are still a lot of instances where you cannot use a cloud provider for your app or database.

To be honest, I'd probably lean more towards a nosql database that has an in-the-box, relatively easy replication strategy, though that might mean 3+ db servers for good performance. (RethinkDB, MongoDB, Cassandra/ScyllaDB, Cockroach). Just depends on the budget and resources.


Well, ScyllaDB is free and open source, so that should help the budget. (Though we do have an enterprise version, base price is FREE!)


I'd probably reach for Scylla wherever I would consider Cassandra (or similar), though the similar bigtable/columnstore-like services in most cloud hosts is generally still going to be easier.

The performance over C* is surprisingly good.


A bit tangential, but when do you move from using in-application data structures (maps, trees, vectors/arrays) to using a database? Is it basically when the data doesn't fit in memory? I've been programming for almost a decade and I've never come across needing a database... (for context, it's ten years without anything web related) I'm interested in them and I'd love to learn SQL but I can't even think of a use case outside of managing users in some web-based application.


SQL queries tend to be much smaller than their equivalent data-structure traversal procedures. This can be beneficial even when you still want the data in your process's address space, hence embedded database engines like sqlite. Libraries like Linq can also provide the same expressive power over a programming language's native objects and collections.
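A toy illustration, using an in-memory SQLite database:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 30.0), ("bob", 15.0), ("alice", 20.0)])

    # One declarative statement replaces a hand-rolled group-and-sum loop:
    totals = conn.execute("""
        SELECT customer, SUM(amount) AS total
        FROM orders GROUP BY customer ORDER BY total DESC
    """).fetchall()
    print(totals)  # [('alice', 50.0), ('bob', 15.0)]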

As to why you'd want a separate DB server process: long lived mutable structures tend to drift into unexpected states, which is why the standard computer troubleshooting procedure since forever is to restart. Database management systems specialize in not having that problem. You deploy one that's widely used and hardened over many years, change it infrequently, and operate it with great care. Or pay a cloud provider to do that.

Then everything else gets to outsource the burden of persistence to it, and the vast majority of the workloads you develop and operate are stateless. These are drastically more convenient and more resilient to mistakes. They're effectively "restarted" with each API request, and you can stand them up, tear them down, migrate them to new machines, etc. as much as you want without fear of data loss.


"long lived mutable structures tend to drift into unexpected states"

Why do those "tend to drift"? As an example, I wrote a sort of game server for one of my applications. Internally it has those exact forever-lived mutable structures. I've never observed it drifting into any unpredictable state. Works like a charm and has been running for many months. I only reboot it when I need to update it to a new version.

The only catch here is that all data must fit into RAM. With the amount of RAM modern computers can be stuffed with, I do not really see my server ever running out of it.

Of course it's backed by a database, but that merely serves as a persistence layer for this particular application.


Consider why 3rd normal form exists to begin with.

In the real world, let's say you have a shipping address and a billing address for a customer, and they are usually (but not always) the same.

Eventually, a customer moves, changing both of their addresses. But the user forgets to change their billing address with their delivery address.

A proper database would have a 'billing address is same as delivery address' logic, following the principle of DRY (don't repeat yourself).

----------

There are lots of examples here of what can go wrong when you repeat yourself in a database application. The user may have an error when repeating themselves over the dataset (delivery address is correct, but zip code on billing address has a typo).

Dealing with these issues at scale, with hundreds of thousands of customers, is certainly a problem. Normal forms can formalize these issues and help the business owner avoid the problems.

Where do you verify the existence of zip codes and cities? Where do you check for typos? How do you prevent contradictions on the submitted information?

Your human customers will make many mistakes. Your logic must hold up even in the presence of faulty data.


"A proper database would have a 'billing address is same as delivery address logic"

The database does not have logic like this; it has to be implemented by a stored procedure. When I have an application server, all such logic (if applicable) is handled by code in a much more performant way. No data goes to the database directly. Everything passes through the app server, along with validation, data transformation, etc. As already said, the database in this particular case is nothing more than a persistence layer.

Again, we can make all kinds of theoretical speculations, but as I already said, my particular server does not have data drifting into faulty states.


>> "A proper database would have a 'billing address is same as delivery address logic"

> The database does not have logic like this.

Of course it wouldn't have 'logic', databases are just stores of data.

You'd have one 'address' table, with probably a int-primary/surrogate key. Then the 'delivery' and 'billing' address would be an int, pointing to the address table.

Furthermore, the billing and delivery address would be foreign keys, so the internal database logic would keep the tables in sync with no application code required.

With the data organized in this manner, the application code becomes logic free and braindead easy to write. Or at least, corner cases become easier to handle and more explicit. (Say two customers share the same address, do you allow repeats in the address table? Or do you allow customers to tie address information together? Either way, your decision rests on how you define the primary key)
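Sketched as SQLite DDL (names invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript("""
        CREATE TABLE address (
            id     INTEGER PRIMARY KEY,
            street TEXT NOT NULL,
            zip    TEXT NOT NULL
        );
        CREATE TABLE customer (
            id               INTEGER PRIMARY KEY,
            name             TEXT NOT NULL,
            delivery_address INTEGER NOT NULL REFERENCES address(id),
            billing_address  INTEGER NOT NULL REFERENCES address(id)
        );
    """)

A customer whose billing address is the same as their delivery address simply has both columns pointing at the same address row, so a single UPDATE to that row moves both.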

> When I have application server all such logic (if applicable) is handled by code in much more performant way.

The most performant way is no logic at all. A proper database removes a lot of checks, the storage format itself naturally creates logic free code.

Some application logic is necessary of course. But you can minimize the logic needed by thinking about data layout.


Hmm, it seems to map to my workflow with Clojure lately. But there I'm massaging and generating new data sets (even though it's stateless). I can see that if you have a very settled data format then this would work in a language-agnostic data-interface kind of way. I could see, say, measurement/time-series data being stored in SQL; accessing it that way would be much cleaner than opening and closing CSV files for all sorts of datasets.


I'm sure people will give you the orthodox answer, so let me give you mine: when other things become more important than performance, ease of development, and code clarity.

Many years ago I happened to meet the creator of Prevayler, an open-source persistence framework that provided ACID guarantees and was thousands of times faster than a database as long as your data fit in RAM. I tried it out for a project and we loved it. Our hundreds and hundreds of unit tests ran in a few seconds. Our pages rendered in ~5 milliseconds. It was structured around a log of actions, so if you wanted to know how the data ended up like it did, every change was logged.

What people who haven't worked this way often miss is that a database doesn't make things simpler, it just makes certain things easier. Once you add a traditional database to your project, you're importing a million lines of mysterious code into your project, and demanding to pay a serialization/deserialization penalty any time your code wants to look at data. When it works, it can be swell, but when you have a problem, suddenly things can get hairy. Database performance optimization is a murky art in a way that just isn't true if all your data is right there in RAM.

Prevayler of course didn't catch on. It was just too weird for most people, who had grown up on databases and for whom data structures were something that they hadn't really thought about since their last CS exam. But I sometimes dream of the world where it did catch on. At the time, fitting in RAM was a big limitation. But now I can get an off-the-shelf desktop with 768 GB of RAM, and Amazon's servers go up to 24 TB of RAM. If you're going past working sets at those levels, traditional databases have anyhow fallen out of favor, displaced by big data tools. But it would have been a much better fit for today's world of microservices and distributed systems.


This is exactly the style in which I build applications these days. Most of the time I just describe it as “hella caching” but it is a different paradigm from treating the database as the primary state engine. The speed and simplicity of working on in-memory structures are great. When you outgrow a single server’s RAM capacity, you can use Kafka or another durable message queue as your application’s WAL and shard your data across multiple servers.


Have you written up the systems you've built? I'd love to read more about the practical details. Feel free to email me or DM me on Twitter if that's better.


I haven’t written up any of the production work I’ve done in this vein, but here’s a demo application I built as a hiring challenge (apologies for the broken demo link, it was hosted by the now-defunct Hyper.sh): https://github.com/notduncansmith/agree/blob/master/README.m...

Given no firm deadline, I timeboxed to 12 hours so it’s not fully fleshed-out but I like to think it illustrates the concept well.


Ah, that's cool! I like your writeup; it makes the advantages of your approach clear. I hope you got the job!


Thanks! I did, though now working at another co and spending my free time generalizing this into a library that makes this paradigm easier to adopt (basically Redux-ifying your backend but with end-to-end encryption on the stored event logs). The initial version should be ready to publish in the next few weeks :)


If you remember, please email me or contact me on Twitter when it comes out! I'd love to check it out. And I'll definitely pass it along to the Prevayler community, who I'm sure will be tickled.


Prevayler sounds a lot like Redis, unless I’m missing something.

I’m a big fan of Redis, but I was bitten many years ago when I tried to use it as a replacement for an RDBMS. There were two reasons for this: 1) lack of development libraries and operational tools, and 2) lack of data integrity checks.

1 has changed these days, but 2 is still very much the case (and rightly so, IMHO). Perhaps Prevayler had this?

But yep, it was fast at a time when our competition’s software was slow and clunky. It made a difference.


Not quite. Redis is still an external database. Prevayler was much simpler, just a library.

With Prevayler, all data is kept in RAM, reachable from one root object, which Prevayler holds. All changes to the data model must be expressed as command objects. Each object is handed to Prevayler, which serializes it to a log and then executes it. Once in a while, you can snapshot the data out and start a new log. If there's a crash, you just load the latest snapshot and replay the log.
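A toy sketch of the shape of it in Python (nothing like Prevayler's actual API):

    import json

    class Prevalent:
        def __init__(self, log_path="commands.log"):
            self.state = {}            # the in-RAM root object
            self.log_path = log_path
            try:                       # recover by replaying the log
                with open(log_path) as log:
                    for line in log:
                        self._apply(json.loads(line))
            except FileNotFoundError:
                pass

        def execute(self, command):
            # Serialize the command to the log *before* applying it.
            with open(self.log_path, "a") as log:
                log.write(json.dumps(command) + "\n")
            self._apply(command)

        def _apply(self, command):
            if command["op"] == "set":
                self.state[command["key"]] = command["value"]

    db = Prevalent()
    db.execute({"op": "set", "key": "answer", "value": 42})

The real thing also snapshots the root object periodically and fsyncs the log, so recovery is the latest snapshot plus a replay of the log tail.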

You got exactly as much data integrity as you wrote into your objects and your commands. Which without much work could be quite a bit, because you get a lot of data integrity by not writing things. E.g., if a kind of object should never be deleted, you just don't write any deletion code. If, say, account balances should never be changed directly, but only as part of properly structured credits and debits, then you just write your code like that.

Most databases, Redis I think included, are made for arbitrary operations on data. Developers add integrity and security later, hopefully. And that is often duplicative of the code base, so that one ends up having integrity checks both in the code and in the database.

I should say that made a lot of sense for the era databases came out of. Databases were a huge step forward in the 1970s and 1980s. My dad was a developer in that era and it was a big relief not to have to get a bunch of programmers to all follow the same conventions for exactly which record was stored in exactly which spot on their precious and expensive disk drives. Not having to know the minutia of the hardware let a lot of people just get in there and build business reports. But if we were starting fresh today, I don't think we would do anything like a SQL database. Redis was definitely a step away from that era, and I look forward to many more.


If I didn't need the log, would I need to use Prevayler at all? It sounds to me like a plain old "keep objects in RAM", which is what most programming languages already natively do.


What Prevayler got you was the ACID guarantees: https://en.wikipedia.org/wiki/ACID

So if you didn't need persistence, you certainly didn't need it. Ditto if data integrity wasn't important.

It would also have made distribution much easier. Since each mutation was already serialized to disk before execution, you could also send it over the wire to read-only replicas and hot spares for the master.


I have found that you can go for a disturbingly-long period of time using primitive schemes like LINQ<->Objects<->JSON to persist your business data before things start to get hairy. Personally, I would say 10 megs of persisted data is about the upper limit before I am going to start reaching for SQLite. If you start to get clever with schemes like one file per serialized entity, you could potentially avoid using a 3rd party database indefinitely. I have found that the technical cost of using SQLite from the start is so low that I just start out using it by default even if the file will never exceed 1MB.


> If you start to get clever with schemes like one file per serialized entity, you could potentially avoid using a 3rd party database indefinitely

Most file systems will start to get slow at some point with too many files in a folder.


Will you store large objects like images or audio files directly in SQL?


In most cases I will. Pulling a blob out of a row is a lot faster than opening an additional file handle. There aren't really any downsides to this either unless you are running table scans. SQLite does support indexes (even full-text) and they do work miracles so do use them when necessary.
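A minimal sketch of the pattern:

    import sqlite3

    conn = sqlite3.connect("assets.sqlite")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assets (name TEXT PRIMARY KEY, data BLOB)")

    with open("texture.png", "rb") as f:  # any binary asset
        conn.execute("INSERT OR REPLACE INTO assets VALUES (?, ?)",
                     ("texture.png", f.read()))
    conn.commit()

    (data,) = conn.execute("SELECT data FROM assets WHERE name = ?",
                           ("texture.png",)).fetchone()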

I would go so far as to argue that SQLite could be used to store all of the assets for any large piece of software (i.e. a AAA game). This would probably wind up faster and more reliable than most alternative solutions out there today.


Can confirm SQLite is used in AAA game engines as well as VFX production pipelines. I know because I put it there.


So - is SQLite a good file system? Should I just use it for everything?


I think hard drive manufacturers should just embed SQLite directly into their devices and optimize the SQLite VFS implementation to be flash aware at the controller-level. Who needs a file system when your disks speak SQL?


SQLite has a page measuring performance of internal blobs vs external files: https://www.sqlite.org/intern-v-extern-blob.html


Anytime you want persistence: sqlite is a replacement for fopen(). https://www.sqlite.org/whentouse.html


> when do you move form using in-application data structures (maps, trees, vector/arrays) to using a database?

The killer feature of databases, including no-sql or older ones like BerkeleyDB, is multiprocess synchronization.

If two programs (or two copies of the same program) need to coordinate their data, a database is often far easier to use than writing your own transaction layer through files, mutexes, and other primitives.

Web applications, such as forums or discussion boards, have many users and processes adding data simultaneously. That's why databases are so popular in web backends.


Persistence, portability, and durability. SQLite is an actual file on disk, can be easily read by multiple programs/languages, and has write-ahead logging to keep your data safe from crashes.


Is all your data on disk read-only and the data in RAM can be forgotten when your application finishes/closes? Then you don't need any of this.

If your state consists of a single data structure, you could write it to disk into a new file and move-replace the old file. This works as long as you have a single instance of your application.
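A sketch of that move-replace pattern, assuming a JSON-serializable state object and a hypothetical path; os.replace() is an atomic rename, so readers never see a half-written file:

    import json, os, tempfile

    def save_state(state, path="state.json"):
        # Temp file in the same directory, so the rename stays on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the swap
        os.replace(tmp, path)     # atomically replace the old file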

For anything more complex SQLite is a good way to keep your data in a consistent state. If you store it on a local drive, the performance is stellar.

Thinking about going multi-process or multi-user? SQLite can still give you an easy head-start because it performs OK in most situations with a database stored on a network drive.


Any medium-sized or larger application either uses SQL or recreates a shitty version of it. You just can't write complex code without relations.


Structured transactions between applications are one place.

You can get by with files, but they're slower. A DB is the right choice.

Structured data anytime you're hitting multi-user - web, networked games, collab programs, live maps, etc.

Anything that's massively stateful. A DB hands you a lot of guarantees. The filesystem has some of them, but is much slower.


That's a weird question, I don't really understand.

When you want to persist data beyond the lifespan of your process?


How do you persist state across a restart of the app or device? Do you dump state to a file? Do you not need to persist state ever in 10 years of software development?


> when do you move from using in-application data structures?

as soon as you need any of these:

* a relational model (because you need to model your data that way, or because someone wants to consume it with Tableau).

* concurrently read/write data between multiple instances.

* a standard way of doing backups.

Moving from in-application data structures to a database at that point would probably equal a rewrite.


It does allow multiple applications to access the same database at the same time, but doing so really hurts performance. I noticed this when I wrote a web crawler in Go and used SQLite as the backend.

As soon as I connect using the command line interface, it slows down significantly.

Just something to bear in mind if you want to use it with multiple processes!


I use it for real-time data (around 22G/month, compressed) and I can do all sorts of filthy stuff with the database while the real-time processes are running.

Increasing the retry timeout can help, and using WAL can help. I'm sure you've tried all this though.
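For reference, a minimal sketch of those two knobs using Python's stdlib bindings (the file name is invented):

    import sqlite3

    conn = sqlite3.connect("crawler.db", timeout=30)  # busy/retry timeout, in seconds
    conn.execute("PRAGMA journal_mode=WAL")    # readers stop blocking the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # a common companion setting for WAL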


Multiple processes sharing a single SQLite database is an instant anti-pattern. I would stop and reconsider your approach before trying to build this solution using it. The way I see it, there are 3 options:

1) Implement another process which will have exclusive ownership of the shared SQLite database, and then use some IPC scheme to delegate database operations from multiple processes.

2) Give each process its own copy of a SQLite db if there is no effective shared state between these processes (e.g. you are just map-reducing web crawler results). Upon completion of each process you could aggregate them into a final combined db.

3) Use a hosted database solution such as Postgres.

The bigger question for me would be what are you going to do with this data once you collect it. If you plan on having another series of processes that then use the SQLite db to provide reporting views or execute business logic, I think a hosted solution might be a better option. If scalability is a serious concern, option 2 is probably your best bet.


It's ironic how using a serverless database requires your app to be the server and do top-level access management for the database.


Yes, I thought the same - so you have to write a server. What about locking? Roll your own, I suppose. Multi-user - roll your own? What about hot backups? Rollbacks?

I always conclude these things are advocated by people who have no experience in large multi-user systems. The same as the NoSQL movement. They'll eventually build a database server. They build a system using the cool thing, which works fine when they test it on their single-user system. Go live - aagghh, what's happening, why are all these people trying to access my data simultaneously? And so on.

One I'll always remember was when XML was the next big thing - they decided to store the raw XML in a database. It was a commercial product, and we were interfacing to it from our system. Once we found this out we started asking questions - no no, it works fine, we were told, laughing at us old database guys. Went live, couldn't handle 5 TPS - what a surprise; it never worked as far as I'm aware. There is this continuous circle of: databases are bad; no no, do this; you don't need to do this; no, things have changed - what do you database guys know. It's entertaining to watch if nothing else. My advice - learn SQL, and some database tuning; it's not that hard, at least compared to writing your own database engine.


It may be ironic, but there are benefits to rolling your own. Having your persistence layer talk in terms of your business models instead of raw SQL could be seen as a benefit in some contexts. The process which is the "server" can be written to allow for relaxed consistency based on the specific business operation being performed (e.g. no need for transactions around log entries). This could be leveraged for substantial performance gains. The benefits afforded by guarantees of exclusivity between application and database are difficult to overstate.

These advantages of purpose-built functionality would also extend into the arena of handling replication and clustering, e.g. multiple distributed processes, each with an independent database, synchronized via some custom protocol that operates in terms of specific business models and processes.


You can just turn off these things if you so desire - in reality the overhead is a lot less than you'd think. You'd be way better off spending the time you'd use writing half a database on optimising your application, and just use a database. Or maybe you don't really need a database.


Nothing of the sort is required. If you want to write a simple single-threaded synchronous program, SQLite works great for that.


You don't need to write a server to use it. You need to recognize you're using the wrong tool to solve your problem.

I've used SQLite as an embedded database and as a log file; it's even possible to use it as a virtual filesystem in Tcl starkits. But a high-demand, multi-access data solution it is not. Yes, you can make it work, but you need to justify the costs of doing all that server work when you could just use a SQL server that already meets your needs.


Centralise your writes so they are only done through a single thread/process, and then you can read from as many threads/processes as you like with no noticeable difference in performance (at least for the several hobby dataset-importing projects I tried).
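A minimal sketch of that centralised-writer pattern in Python (table and file names are invented): one thread owns the connection and drains a queue, and every other thread just enqueues writes:

    import queue, sqlite3, threading

    write_q = queue.Queue()

    def writer(path="data.db"):
        conn = sqlite3.connect(path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT)")
        while True:
            item = write_q.get()
            if item is None:  # sentinel to shut down
                break
            sql, params = item
            with conn:  # one transaction per write
                conn.execute(sql, params)
        conn.close()

    threading.Thread(target=writer, daemon=True).start()
    # Any thread can submit a write without touching the connection:
    write_q.put(("INSERT INTO pages (url) VALUES (?)", ("https://example.com",)))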


«just write your own server on top of serverless embedded SQLite to get to acceptable performance without needing a client/server database»

(Honestly, once you are at the point where concurrency causes performance issues with SQLite, you are better off moving to databases designed to handle concurrency rather than trying to cobble together your own workaround - you have reached the point where the drawbacks of SQLite's architecture outweigh its advantages, and the advantages of other databases' architectures outweigh their drawbacks.)


Sure, I don't disagree. I've ditched SQLite for Postgres many times.

It's just that for 99% of the projects I ever worked on, writes are maybe 100x rarer than reads, so wrapping the writes in a queue of sorts has been quite okay and performant.

It can and does come to a point where it's easier to use a full-blown database server, too.

I was simply pointing out that for a lot of classic workflows wrapping/centralising writes works quite fine.


Depends on what you're up to, I suppose. Having a particular chunk of data wrapped in a microservice has other benefits. If that's the direction one is headed anyhow, then wrapping SQLite + single access can be more tractable than allowing multiple access that you'll eventually have to rip out again. I know it works for plenty of people, but for me "database as integration layer" has always been an antipattern.


Depends on what you're doing... If you're working on discrete projects that need to be run and then archived, using an RDBMS vs a service over SQLite is an actual consideration... if you have a remote API interface that talks to different SQLite db files on a per-project basis, then backup and archival become trivial matters... if you're using a classic RDBMS then it can become much more complicated.

Aside: adjusting the schema over time also becomes easier, as archived projects don't need to be updated; they just continue to exist with the older schema.


As you quoted, people in this discussion are saying that writing a dummy program that owns the SQLite database and does nothing but apply the messages it receives from all the other processes on the system that need to talk to that SQLite file results in much better performance than accessing the file "directly" from the separate processes.

So if everyone's saying this, is there such a standard dummy program?


> As you quoted, people in this discussion are saying that writing a dummy program that owns the SQLite database and does nothing but apply the messages it receives from all the other processes on the system that need to talk to that SQLite file results in much better performance than accessing the file "directly" from the separate processes.

And use what protocol for inter-process communication with the daemon managing SQLite? At the end of the day you've just created the equivalent of a server anyway...


Any protocol. People are saying this is a common solution.

So is there a standard dummy tool like that?


> So is there a standard dummy tool like that?

My point is that by that time SQLite doesn't fit "serverless" as defined by the linked article anymore.


True


There shouldn't be such a standard program. It's up to you to implement it in-process so you reap all the benefits of SQLite's speed and lack of a dedicated server process.

In the case of Erlang/Elixir, where I've worked most actively over the last few years, it's really easy to centralise write access (a dedicated actor receiving messages). With other languages it should also be fairly easy.


Given the extensions that people compile into SQLite, I don't think one can even say that... there's mention of ActorDB in another thread that seems to cover this...

Wrapping a typical API in either gRPC or even plain HTTP isn't too hard... you take in a parameterized query with parameters using library X and return stringified JSON to the caller... you may want more specific interfaces, or even GraphQL over the database for that matter... it really depends on your needs.
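As a toy sketch of the HTTP flavour (the endpoint, the file name and the pass-raw-SQL design are all assumptions for illustration; accepting arbitrary SQL from clients is obviously not production advice):

    import json, sqlite3
    from http.server import BaseHTTPRequestHandler, HTTPServer

    conn = sqlite3.connect("shared.db", check_same_thread=False)

    class QueryHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            rows = conn.execute(body["sql"], body.get("params", [])).fetchall()
            payload = json.dumps(rows).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)

    # HTTPServer handles one request at a time, so writes are serialised for free.
    HTTPServer(("127.0.0.1", 8080), QueryHandler).serve_forever()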


Presumably multiple readers are fine, with only multiple writers being an issue?

I vaguely recall trying to use multiple threads to write to an SQLite DB some years ago, and I think it actually locked the entire file for writes. I might be remembering wrongly, but I think I switched to reader/writer locks in C# instead, and saw a huge perf boost.


WAL mode means readers do not block writers and a writer does not block readers. If it locked the entire database you likely weren’t using WAL.


I was definitely using WAL - I know I mentioned using reader/writer locks, but that was so the C# locking mechanism worked the way I wanted; I realise that SQLite doesn't block readers on writes.

My point, not well made :), was that a lightweight locking mechanism worked much faster than SQLite's file-based locking mechanism. This was on Windows, mind, so things might be very different on Linux.


Even with WAL enabled?


Let's say there are concurrent writes to a single table with a few indexes. The process of updating said table and indexes eventually boils down to updating a single B+ tree from multiple threads. In general, until serialized, that is physically impossible without corrupting the tree. Sure, there might be other specially crafted table implementations more friendly to concurrent writes, but there will be trade-offs. There are no miracles in the world.

As for the particular case of WAL, all it does in this scenario is act as a queue. If your database load is spiky then it can even out the load and give an impression of faster response. Under constant load it will not speed things up, and will internally serialize all actual updates.


If you need concurrent/parallel writes, SQLite is not the right tool for you. You may as well lament that hammers are no good for driving screws.


Can you please point me to where I said anything about SQLite being the right tool for concurrent writes? My point was that true concurrent writes to a generic database table with indexes are physically impossible in the general case, regardless of what database it is.


Some concurrent writes are "a transaction can wait for others to finish, up to how long it takes for a person to grow bored" and some concurrent writes are "if a transaction cannot finish without waiting the application is already struggling with a severe performance problem" (for example, because it needs multiple concurrent writes on different disks to keep up with incoming data).


Yes, using WAL. I tuned all of the settings that I could at the time.


SQLite is amazingly prevalent as well. Your own phone probably has hundreds of SQLite databases on it. One challenge with SQLite, though, is you have to download the whole thing to make use of it.

Shameless plug, but I made a fun side project that allows Amazon Athena to read SQLite databases from S3. https://github.com/dacort/athena-sqlite


What I really want: a SQLite storage backend driver for S3/GCS. No need for disks then. I haven't been able to find such a solution though, and am not technically proficient enough in C (the language SQLite is written in) to do so myself.


As part of a fun side project to make a SQLite driver for Athena, I made a read-only storage driver for S3.

https://github.com/dacort/athena-sqlite/blob/master/lambda-f...

Implemented the VFS side in Python, thanks to the awesome apsw library.


The challenge with this is that S3 and friends are object stores, meaning you upload or download the whole file each time. As you can imagine, this will cost you tremendous bandwidth even to insert one row.

Furthermore, it doesn't solve the multiple writers problem, because (afaik) there's no way to lock a file on S3.


> As you can imagine, this will cost you tremendous bandwidth even to insert one row.

Is this the cost of network transfers charged by AWS or GCP? What if I'm hosting my app on e.g. EKS or GKE respectively?


It costs roughly $5/million PUTs and $0.40/million GETs on S3 in addition to the bandwidth and storage you use.

S3 objects are also immutable. Once they’re written, they can’t be updated.

A read-only version of this might be useful, but probably wouldn’t work in-place.

Something that might be of interest is S3 SELECT support that lets you query a single (optionally compressed) CSV, JSON or Parquet file server-side at the same cost of a regular S3 GET.

https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-co...

And if you really want relational (i.e. JOINs, aggregations and sub-queries) semantics on a bucket full of CSV, JSON, Parquet, ORC or regular-expression-describable files in a cost-effective way that has great performance on buckets containing hundreds of TBs of data, definitely look at Athena, which is only $5/TB of data scanned during a query.
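For a sense of what S3 SELECT looks like from code, here's a hedged boto3 sketch (bucket, key and column names are invented):

    import boto3

    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket="my-bucket",
        Key="data/events.csv",
        ExpressionType="SQL",
        Expression="SELECT s.user_id, s.amount FROM s3object s WHERE s.amount > '100'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )
    # The response is an event stream; records arrive in chunks.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode())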


I am working on RediSQL[1] and I am about to launch a managed version.

The interface will be either HTTP or the Redis protocol; you create your database, and I will back it up daily to S3.

(If interested, you can subscribe for updates here: https://simplesql.carrd.co/)

[1]: https://redisql.com/


RediSQL looks pretty sweet! To be frank, and this is only my personal opinion, I don't think I would want to pay for an API, but rather run my own. The business model of providing everything OSS but sending telemetry seems rather intriguing; as a hobbyist user I am OK with such telemetry being collected. If I were to run it in production for a business app though, I wouldn't even bother considering the unpaid version, for the following reasons:

1. I do not want my production instance to shut down for WHATEVER reason. This is just not an acceptable risk for most businesses. The only time a DB can go down is when something goes wrong.

2. As an engineer, I understand that 3 counters that are not accurate aren't a big deal. I can even look into the source and see that they really do as you say. Justifying this to a security org will be a complete nightmare, as most security orgs in enterprises are staffed with barely technical folks masquerading as "security".

So, it seems like a pretty good way to coerce enterprises to pay up while letting hobbyists continue using it. Very smart, I wish you the very best!


Thanks! That was exactly my reasoning.

Make it available and simple for hobbyists and small companies, and ask for money from those who can afford it, to sustain the development.


This looks very interesting... can you please post some benchmarks in your docs for reference?


Benchmarks are always tricky, but sometimes useful, so yes I should post some of them.

Right now I am busy with releasing the v2, but after that I should definitely do some more marketing.

Anyhow, to give you an order of magnitude, with in-memory data storage we reach ~80k inserts per second, on a machine with 1 vCPU and 3GB of RAM - a $15/month box from DO.


That would be cool! As an interim step, if your data is small enough, perhaps you could run an in-memory SQLite db and periodically back it up to a permanent S3 file?

APSW exposes the SQLite backup API, so you could do the backups online without shutting down the database.
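Python's stdlib bindings expose the same online backup API (Connection.backup, available since Python 3.7), so a periodic snapshot might look like this sketch (paths are invented):

    import sqlite3

    src = sqlite3.connect(":memory:")
    # ... live writes happen against src ...
    dest = sqlite3.connect("snapshot.db")
    with dest:
        src.backup(dest)  # online copy; src stays fully usable throughout
    dest.close()
    # The snapshot file could then be uploaded to S3, e.g. with boto3.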


I love SQLite, but people approach it from a classic RDBMS angle which confuses them.

Here's the deal: SQLite is a file format with a nice API that uses SQL as the paradigm for reading/writing to the file.

That's it. Stop overthinking it.

Can you write a microservice that stores its data in a big JSON file that you've built some code around to read/write to? Yes. It's just a file, but you have to build all the read/write methods. SQLite is not really any different, except the read/write work is already done, and you use SQL to format the data values and encode the read/write logic instead of the language you are working in.

The file format has some cool extensions like text indexing, geospatial, etc. But it's no more an RDBMS than reading and writing a JSON file is.

"But there's indexes!" Yes, just liked you might build an index on your JSON file and read that before reading the JSON file to know where things are faster -- and then you have to write all the code to do that. SQLite is just a file format, where you can also build indexes and all the code for that is already done for you.

It's just a file format with an API that's similar to the ones you'd use for any regular old RDBMS and uses SQL as the domain language to read/write data.

It's just a file format. Anything you can or can't do with a file format, you can or can't do with SQLite.

It's just a file format. A nice convenient one that you are probably better off using than most other things for most purposes.

It's just a file format.

edit

I'm glad this comment is getting such a response. I'm not trying to be mean, just help clarify thinking.

Here's two thought experiments:

1) If SQLite didn't require you to use SQL as the read/write logic and was called "datalite", and instead just forced you to use C function calls, exactly like you would if you were working with literally any other file format on the planet, would you still be confused as to what it is?

2) Do you consider reading and writing to any other file format anywhere in the hierarchy of RDBMSs? Consider Python's csv module. Is that an RDBMS? Let's move away from tabular data, how about python-docx?


I've spent half an hour or so looking at this thread, and my takeaway is that you're largely just confusing matters even further.

> It's just a file format.

This is clearly incorrect. It does encompass a file format, but it also contains code to manage that file. The existence of sqljet does not change this, it's merely a different database management system that uses the same file format.

You also seem to mostly ignore that the data it's managing is a relational database, not some other form of data. This is why you can't compare it to Python's csv module or python-docx; neither of these does anything to restrict you to a relational data model, nor do they provide a system to query them as if they were a relational database. On the other hand, if for some unknown reason you rewrote the storage engine of PostgreSQL to use either the SQLite file format or csv/docx, I assume you agree it would still be an RDBMS.

Ultimately I think you're just adding to the confusion by saying that a project with 139,000 lines of C code is "just a file format".


You can read and write SQLite files entirely without the library. The file format is available on the SQLite site. SQLJet proves the point that you don't need 139,000 lines of C code to use SQLite files.


I'm sure you could make a tool which could read and write SQL Server's mdf files. That wouldn't change that SQL Server is an RDBMS.


It's a file format with a very mature and nice library around it. At the end of the day it's just fwrite() and fread() with SQL syntax.


And perhaps python is just glorified assembly, but it seems to me to be a "difference in degree that leads to a difference in kind."


I'm not sure I understand your comment. What's the significance of it being "just a file format"?

What are the aspects of "approaching it from a classic RDBMS angle" that are incompatible with it being "just a file format"?

I've always seen it as "just a file format" myself, and I think I've always approached it "from a classic RDBMS angle", but I've never felt confused, and I don't see what I'm overcomplicating.


> and I think I've always approached it "from a classic RDBMS angle"

Have you tried to figure out where to install the server or asked what the system requirements were for it?

Have you grown concerned that once the system moves into production the O&M team won't know how to operate "yet another database"?

Do you spend agonizing hours trying to figure out if it supports multithreaded connection pools for multi-user writes?

Have you wondered if your organization has the budget to add another DBA to the team if you add SQLite to your tech stack?

If the answer is no to all of the above, you aren't approaching it from a classic RDBMS angle. Believe it or not, there are tens of thousands of questions about SQLite from people struggling to figure out the answers to the above.

The people asking these questions are not stupid, they're just approaching the technology from the wrong direction.

This post is no different than "CSV is serverless" or "JSON is server-less" with a blog post about classic vs neo-serverless JSON technologies.


> Have you tried to figure out where to install the server or asked what the system requirements were for it?

Not exactly, but asking "where to install the server & client and what are the system requirements for each" is not that different from asking "where to install the client and what are the system requirements for it", even when there is no server.

> Have you grown concerned that once the system moves into production the O&M team won't know how to operate "yet another database"?

Yes. Because the production concerns of SQLite are not nil.

> Do you spend agonizing hours trying to figure out if it supports multithreaded connection pools for multi-user writes?

Not agonizing hours, but it is just a slightly rephrased version of a valid question about SQLite w.r.t. concurrent file access (as with locks on file access for any file). Other commenters have brought this up in terms of multi-user access slowing down applications, and setting up intermediary DB access processes using IPC to facilitate this.

> The people asking these questions are stupid

I disagree

> no different than "CSV is serverless" or "JSON is server-less"

CSV and JSON lack any protocol or queryable interface: unless you're using some ancillary tool like `jq` as a comparison, CSV and JSON as filetypes are both "serverless" and "clientless" so not particularly comparable. An article on those would be quite different.


bane said "not stupid"


That was a later Edit. The original did not have the word "not".

I make these errors sometimes too and need to edit. HN's method beats Twitter, but it would be nice if we could see a versioned history of each message...


The original document is from the SQLite documentation. I think it's fair for them to make a case for why sometimes a file database > a server database. People asking these questions are not stupid. We all have to start somewhere.


Please check it again, I wrote "not stupid".

The SQLite creators have a great deal of documentation that's targeting database administrators and users and trying to explain what this thing is, when I think they really should have just targeted people who need a nice file format and a clean API.

But hell, there's like a trillion SQLite files in use so what do I know?


If you are not confused, this might be an indication that the comment will not provide any new information for you. If the comment provided no new information to you, it might be an indication you're not confused (and it was therefore not farted in your general direction)


SQLite implements all of the features you would expect from a relational database, including indexes, transactions, write-ahead logging, and consistency, and provides a SQL wrapper on top of its file API. _This is what makes it a database and not just a file format_. To the client, whether the API happens over a socket or a locally linked library is irrelevant. How connection pooling works is also not relevant to whether it is a database or just a file format.

Calling SQLite just a file format is kind of like calling Python "just a syntax specification." That's part of it, but we're talking about the actual implementation of it (probably CPython).



Consider: it’s totally possible to strip down Postgres until all you have left is an embedded RDBMS of the style of SQLite. (I’m not sure why nobody has done this yet, actually.) Would you call the result “just a file format”?

Such an instance of “embedded Postgres” would still have a huge sprawling catalog (PG_DATA) directory attached to each use of it, so it wouldn’t be contained to a single file. But neither is SQLite contained to a single file—SQLite maintains a journal and/or WAL in a second file.

And, yes, this “embedded Postgres” would require things like vacuuming. But... so does SQLite. Have you never maintained an application that maintains a long-lived “project” as a single SQLite file, where changes are written into this project file repeatedly over a long period? (Think: the “library” databases of music/photo library management software.) SQLite database files experience performance degradation from dead tuples too, and need all the same maintenance. Often “database version migrations” of such software is written to either rewrite the SQLite file into a clean state, or—if it has the possibility of being too big for that to be a quick task—to call regular VACUUM-like SQL commands to clean the database state up.

——

Now, I get what you’re trying to say; the point that you’re trying to make—that SQLite might be a relational database, but it’s not a relational database management system in the sense of sitting around online+idle where it can do maintenance tasks like auto-vacuuming. Unlike an RDBMS, SQLite doesn’t have its own “thread of execution”: it is a library whose functioning only “happens” when something calls into it. By analogy, regular RDBMSes are like regular OS kernels, while SQLite is like a library kernel or exokernel.

But that doesn’t mean that SQLite is a file format! It can be used as one, certainly, but what SQLite is is exactly the analogy above: the kernel of an RDBMS, externalized to a library. As long as you “run” said kernel from your application, and your application is a daemon with its own thread of execution, then your application is an RDBMS.

This can be seen most directly in systems like ActorDB, that simply act as a “transaction server” routing requests to SQLite. ActorDB is, pretty obviously, an RDBMS; but it achieves that not due to its own features, but 99% due to embedding SQLite. All it does is call into SQLite, which already has the “management system” part of an RDBMS built in, just not called unless you use it—just like exokernels often already have things like a scheduler, just not called into unless you as the application layer do so.


Great comment! Respectful of GP and constructively critical.

The other thing I would mention is that SQLite can operate totally in memory which makes it useful without even using it to persist data (say you have a language with a slow dataframe API, just use SQLite in memory to process your data).
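As a sketch of that in-memory use (the data and column names are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # nothing is ever written to disk
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("west", 4.5), ("east", 7.25)],
    )
    totals = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC"
    ).fetchall()  # [('east', 17.25), ('west', 4.5)]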


FirebirdSQL is pretty close to what you're talking about... its library can do either embedded use or you can run a shared server instance. It's really pretty neat, but on the one side PostgreSQL is probably better, and on the other SQLite is nicer.

I worked on a project a few years ago where I chose Firebird so I could use literally the same database on potentially offline sites that regularly sync up to a main office (shared) deployment. It worked pretty well and was still a lot of work.


SQLite is a file format with a familiar API and uses SQL as the logic for searching/adding data to the file.

Approach it exactly the same way you'd approach using a CSV file and all the confusion and overthinking about it goes away. Approach it as a stripped down RDBMS and you end up with all kinds of questions about support for this or that RDBMS familiar service.

You can write your own SQLite file reader/writer. Here's the specs (includes the specs for the Journal and WAL files and semantics as well) https://www.sqlite.org/fileformat.html

Here's an example of somebody who's done this: https://sqljet.com/ - this is not a wrapper on the SQLite C code, this is a re-implementation of that code that is binary compatible with SQLite files.
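To make the "you can read it without the library" point concrete, here's a sketch that checks two fields documented in the file format spec (the 16-byte magic string, and the page size stored big-endian at offset 16), with no SQLite code involved:

    def read_sqlite_page_size(path):
        with open(path, "rb") as f:
            header = f.read(100)  # the spec reserves the first 100 bytes
        assert header[:16] == b"SQLite format 3\x00", "not a SQLite database file"
        page_size = int.from_bytes(header[16:18], "big")
        return 65536 if page_size == 1 else page_size  # spec: 1 encodes 65536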

The Journal file only exists as a temporary file until transactions complete. The .sqlite file you make is the entire atomic file that follows the SQLite file format. The Journal has its own file format. Same goes for the Write-Ahead log.

RDBMSs also manage connection queues, account management, rights and permissions, and so on. Many overcome various OS limitations by providing their own entire file management, fopen(), virtual memory and other subsystems that are tuned to their workloads.

SQLite is a file format. SQLite uses familiar relational paradigms to make it easy to read/write data to the format without having to learn yet another API and domain language. The API code is extraordinarily well tested, and it makes simple the complex logic of transaction journaling, indexing and so on.

>Consider: it’s totally possible to strip down Postgres until all you have left is an embedded RDBMS of the style of SQLite.

No! SQLite is not an embedded RDBMS. It's a file format.

If there was a library you could import, and it provided methods to read and write directly to files that PostgreSQL could read/write to and there was nothing else to install, no runtime, no daemons, no servers, etc., then we could pass around self-contained PostgreSQL files to each other. Then PostgreSQL files would be a file format as well.

Have you ever used a library to read/write from a CSV, JSON, JPEG? It's no different than doing so for a SQLite file!

SQLite is a file format.


File formats don't have an API.

> SQLite is a file format with a familiar API and uses SQL as the logic for searching/adding data to the file.

That description is for a library, not a file format. SQLite is a library that saves to a convenient format and allows you to query the file using SQL syntax. File formats don't have "logic".

You are arguing the equivalent that Word is a file format. While there is a Word file format, Word itself is an application.


> File formats don't have an API.

So when you read/write to any other file format, you just read/write bytes directly to/from disk and re-implement the parsing and read/write logic in your own code every time?

> File formats don't have "logic".

Every file format has logic; otherwise it's just random entropy in an arbitrarily long byte stream on a disk. How to read/parse and interact with that format depends entirely on the logic and scheme for that file. For example, many file formats have an index of some kind that you must read and parse before you can figure out where the other data lives, compressed file formats often store a dictionary, and image formats often have compression/decompression logic that must be followed for reading/writing them.

> You are arguing the equivalent that Word is a file format. While there is a Word file format, Word itself is an application.

.docx is the file format for Word documents. There are many APIs and programs that can read/write to/from .docx files.


> No! SQLite is not an embedded RDBMS. It's a file format.

You keep saying it's a file format, but it's quite possible to use SQLite without persisting anything to a file at all.

    rc = sqlite3_open(":memory:", &db);


You can read/write CSV files, JSON, JPEG, WAV, MP3, MP4, etc. into memory as well. That doesn't make any of them an RDBMS.

SQLite is a file format. It has a nice API and uses SQL as the domain language for read/write logic. If it didn't use SQL for the logic, would you still be confused?


Being a file format is only one aspect of what SQLite is. SQLite describes itself as "SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine." I think that is a better, more encompassing, description than "it's a file format"

Edit:

If SQLite lacked any ability to persist data to disk, it would still be very useful as an in-process SQL engine for many sorts of problems. Certainly not as useful as it currently is, but nevertheless still useful.

I'd say the file format without the SQL engine, or the SQL engine without the file format, would be like peanut-butter without jelly. Certainly not pointless, but the real magic comes from the combination of the two.


Sure, I can get behind that. The logic that handles the read/writing and SQL parsing is all part of the library for sure.

But that library is absolutely not required, nor is SQL. One could build their own library that reads/writes SQLite files entirely without any of that if they wished. Some people have done ground-up rewrites in languages other than C, but have more or less stuck with the same internal logic and the use of SQL as the read/write logic.


All other RDBMS products also store their information inside files. I can take a mdb file from SQL Server and copy it to another server and attach it there.

SQLite is obviously a library that stores its data to a file, like millions of other libraries. SQLite is to its data file what LAME is to MP3. LAME is not the file format; it's the library.


> All other RDBMS products also store their information inside files.

No they don't. There are many RDBMSs (and other DBMSs) that do not store their information inside files.

> I can take a mdb file from SQL Server and copy it to another server and attach it there.

Yes, SQL Server is an RDBMS.

MDB files are files defined by a file format. This format defines something called a database. There are many libraries and other pieces of software than can read/write mdb files (https://jackcess.sourceforge.io/)

SQLite files are files defined by a file format (https://www.sqlite.org/fileformat.html). You can write code that reads and writes to a SQLite database file without using the library. So long as you follow the format specification, you will produce, or be able to read from, an arbitrary SQLite file that is produced by any code that implements the specification.

The SQLite library is a reference implementation of the specification as well as some sample logic for reading/writing to the files and using SQL to define the interactions with the file. The SQLite library is not required for this, nor is SQL, nor is the logic for ACID compliance.

SQLite is not an RDBMS and offers almost nothing that an RDBMS (or DBMS) might offer. If you wish to use SQLite files with an RDBMS you either have to build the RDBMS yourself, or find somebody else who's done so.

The more you consider SQLite a file format, the easier it becomes to work with and understand. The more you try to consider it an RDBMS, the less it makes sense.

Just because SQL is involved, doesn't mean it's an RDBMS.


SQLite is a library that implements an embedded database engine. The file that it stores this data in is an artifact, not the interface.


SQLite is a whole library with complicated locking protocol options etc. and works with multiple files of different formats (e.g. separate transaction log).


Since the file format specification is available, you can write your own code that directly reads/writes SQLite files. You don't even have to use the logs and journaling options. You don't even have to use SQL.


That’s the SQLite database file format. SQLite itself is a library that can deal with multiple files in different formats (transaction log and database file). It also reads .sqliterc and maybe others.


> That’s the SQLite database file format.

You got it! SQLite is a file format.

> SQLite itself is a library that can deal with multiple files in different formats (transaction log and database file). It also reads .sqliterc and maybe others.

Yup you also got it! SQLite is not an RDBMS and shouldn't be approached that way.

The more you try to ram RDBMS ideas into what SQLite is, the more it won't be that. The more you try to treat it like a file format, the more it will be what you want.


In their own words: "SQLite does not compete with client/server databases. SQLite competes with fopen()."

[1] https://www.sqlite.org/whentouse.html


I think most RDBMS systems are just a file (or set of distributed files) when you strip away all the helper functions. The fact that SQLite squirrels everything into a single archive doesn't mean many of the abilities housed within a more comprehensive database aren't there. You can implement JSON columns and full-text searching and numerous other fancy systems.

The main argument I hear for why SQLite deserves second-class status in the DB marketplace is the difficulty in handling multiple writes simultaneously.

To that point I'd say it's more of a simplicity-in-design choice. Search for 'DB race conditions' and you'll see that every database struggles with handling multiple writes nested inside complex transactions. SQLite avoids the whole mess and requires the programmer to think through I/O instead of offloading all that logic to the RDBMS software.


> The main argument I hear for why SQLite deserves second-class status in the DB marketplace is the difficulty in handling multiple writes simultaneously.

SQLite is not in the DB marketplace. It's in the file format marketplace. It handles multiple writes in exactly the same way CSV handles multiple writes. If you want to handle multiple writes with SQLite, you handle it the same way as CSV.


I disagree. CSVs don't have write-ahead logging and are not able to selectively lock portions of the file for writes.

And I'd say that not only is SQLite in the DB marketplace (albeit for a specific subset of database application types), it's one of the largest players.


Why not? Just roll your own write-ahead log when you are writing to CSVs! That's all the SQLite code is doing.

SQLite does not selectively lock portions of the file for writes. It locks the entire file using the Operating System's own file handling services.

SQLite is a file format. CSVs, XML and JSON are also huge in the DB marketplace. That doesn't make them RDBMSs.


A collection of CSVs can be used as a database.


Yes! If a database is defined as a place where I can read/write data, then any byte stream that I can read/write to is probably a database of a kind.


I'm no SQLite expert but by this logic aren't all single-host databases that tick the "Durability" ACID checkbox "just file formats" in the sense that yeah, the bytes we care about exist somewhere in the filesystem?

Moreover I'm having trouble coming up with things that I'd associate with a RDBMS and not "just a file format" that SQLite doesn't support. Transactions? SQLite has them. Relational constraints? SQLite has those too. Could you elaborate on some of the confusion that you've seen around this?


Look at the other comments just in this post.


Sorry for the previous terse reply.

Longer answer:

SQLite files do not guarantee ACID compliance. You can write code tomorrow that produces SQLite files, and so long as you follow the specification (https://www.sqlite.org/fileformat.html) it will be readable by any other code that implements the specification (e.g. https://sqljet.com/)

An RDBMS is not a database, nor is it SQL, nor is it data. It is a kind of DBMS software that manages relational databases, and access to the data (such as users and user rights). Most modern RDBMSs run as servers and offer network connectivity, connection pooling, advanced buffering options, various memory usage schemes. Many have their own memory allocation and file handling routines that are separate from the OS. Some offer clustering, partitioning and so on. SQLite does not offer any of these things. If you were to write a comprehensive list of things that Oracle, MS SQL Server, DB2, PostgreSQL, MySQL and SQLite offer, SQLite would offer almost none of the features that the rest do.

A relational database uses the relational model to store data. SQL is the most common language for describing what you want to put into or retrieve from the relational database, but it is not required.

There are many kinds of databases. Some of them store data in memory, in a file, in multiple files, and so on. Some of them follow various models, some of them are unique. If you have the file format for a database that stores its data in files, you can read/write to the file freely without any management system and without ACID compliance. SQLite files are examples of a kind of database file that stores data using a relational model. So are MDB files that Microsoft Access uses.

Conflating a file format with an RDBMS is like conflating a fork with a restaurant, or a chair with a house.

ACID compliance is not something guaranteed by the file format. SQLite files do not guarantee ACID compliance. If you write some code tomorrow that can read/write SQLite files based on the spec, you haven't created an ACID-compliant SQLite file, nor is your code ACID compliant.

The SQLite library implements the properties that make SQLite ACID compliant. It does so by various clever means like a journal file format, and a write-ahead-log file format and various other well thought out approaches. If you were to write your own code that implemented the SQLite file spec, and you wished your code to also offer ACID compliance, you would have to implement those things yourself -- and you are under no obligation to use the SQLite journal and WAL file formats nor the internal logic that the SQLite library uses. You can do it entirely your own way!


edit - removed


The R in RDBMS stands for Relational, not Remote.


MySQL, Postgres, or MongoDB still store your data in files. So they are also "just a file format". You do have one extra step - the db server process - to access the files.


Yes! You've defined the difference between an RDBMS (the server process and whatever else it does) and the file format the data is stored in.

SQLite is just a file format. If you want it served up over some kind of server, you have to build your own (and most people do), or use a server that somebody else has built for you (there's a couple out there).


SQLite is not "just a file format" anymore than say MS SQL Server is a file format. SQLite is a RDBMS in the form of a library and uses a particular file format for persistent storage. But it can also run purely in-memory: https://www.sqlite.org/inmemorydb.html without any file at all.

Being a RDBMS is not defined by whether the engine runs in-process or as a server in its own process.


Sure. Or a light RDBMS where the main not supported case is concurrent writes.


No, SQLite provides no RDBMS functionality. It is not an RDBMS. It is a file format.

Saying that a program that opens and reads/writes a file through a file format API is a light RDBMS turns almost every program in history into an RDBMS.

If SQLite didn't force you to use SQL as the read/write logic, absolutely nobody would confuse it for an RDBMS. That's because it's a file format.


> No, SQLite provides no RDBMS functionality.

I think your comment would be more comprehensible if you gave some examples of the kind of functionality you think is missing.

I don't know what you mean by "RDBMS", but JSON and XML don't do joins, don't do views, don't do efficient query plans, and so on. It's either ignorance or obstinacy to say SQLite is just a file format.


> ...but JSON and XML don't do joins, don't do views, don't do efficient query plans, and so on. It's either ignorance or obstinacy to say SQLite is just a file format.

Sure they do. If you write the logic to do so and put it behind a nice API, you can make all of this come true. In fact, millions of people do joins with JSON and XML in their code every day. You can probably just use Apache Drill as the "library" in this sense to facilitate joins and whatnot. The creators of the SQLite library simply built that stuff into their library for you.

SQLite is a file format. It has a nice library full of wonderful utility functions for reading/writing that file format and a simple to use API that is operated by sending SQL to it.

It is exactly the same as reading and writing any other file format with any other API and library. The more you understand SQLite as a file format with a nice reference API implementation, the more it makes sense.

It is not the same as using an RDBMS and offers almost none of the things an RDBMS might offer. The more you try to figure out how it's not like PostgreSQL or Oracle or MongoDB, the more confused you'll make yourself.

It's no more an RDBMS than a .docx file is.


First of all, you still haven't answered the question: What is it that an RDBMS has that SQLite doesn't have?

> It's no more an RDBMS than a .docx file is.

Thanks for the idea. Your argument is like saying this:

Microsoft Word is not a word processor -- it's a file format.

I mean, yes, Word has a file format; but it's far more than just a format specification.

> Sure they do. If you write the logic to do so,

Right, but you don't have to write the logic if you're using SQLite. That's the point. SQLite is a library, which provides a way to do SQL operations on data. Like Word, SQLite has a file format, but it is far more.

I just don't get where you're coming from. Do you not know that the SQLite library can actually do complex SQL queries on data? Or do you think that people shouldn't do that for some reason? Or do you just value SQL queries so little that you don't see any difference between being able to do complex queries and doing `file.Write(json.Marshal(data))`? What is it you're trying to accomplish with this line of argument?


> First of all, you still haven't answered the question: What is it that an RDBMS has that SQLite doesn't have?

An RDBMS is a well-defined thing; it is literally what the acronym expands to mean. This is very old technology with an interesting history, and I really implore you and anybody reading this to go read up on it. It's not just whatever we assume it to be, or some kind of data bucket with SQL.

> Microsoft Word is not a word processor -- it's a file format.

No, don't be obtuse. I'm saying that .docx is a file format.

Word is both an application for editing documents and contains a reference implementation for reading/writing .docx formatted files. There are many libraries that can read .docx files and some of them are also part of document editing software.

> Right, but you don't have to write the logic if you're using SQLite. That's the point. SQLite is a library, which provides a way to do SQL operations on data. Like Word, SQLite has a file format, but it is far more.

> I just don't get where you're coming from. Do you not know that the SQLite library can actually do complex SQL queries on data? Or do you think that people shouldn't do that for some reason? Or do you just value SQL queries so little that you don't see any difference between being able to do complex queries and doing `file.Write(json.Marshal(data))`? What is it you're trying to accomplish with this line of argument?

Precision of thought. People don't go around calling fish oceans, or forks restaurants. The SQLite library does what you've described to SQLite files. But you don't need the SQLite library to work on SQLite files. You don't even need SQL, e.g. https://sqljet.com/

Just because a library offers SQL as a convenient tool to read/write data into its file format, everybody loses their minds and starts to think the library is some kind of feature-reduced Oracle cluster. Go back to my first post. People are approaching what SQLite is from the wrong direction (RDBMS) and it's confusing the fuck out of everybody who gets near it.

This is important. IT departments and governments make very large, very expensive decisions based on whether people know that SQLite is closer to CSV files than to Oracle databases.

I literally sat in a meeting last week where a senior decision-maker at a client wouldn't accept delivery of some software because it used SQLite and didn't want to add maintenance of yet another database to their overworked DBA staff and didn't want to hire a dedicated person to manage it. So now, instead of just taking delivery of the software, some of it has to be rewritten to use the client's RDBMS system, which in turn actually will add workload to the overworked DBA staff and will also perform worse.

SQLite IS A FILE FORMAT with a really nice library for reading/writing to that format.


> SQLite IS A FILE FORMAT with a really nice library for reading/writing to that format.

You keep repeating that, but it is just not the case. SQLite is the name of the library, not the file format. Just see the definition on Wikipedia:

> SQLite is a relational database management system (RDBMS) contained in a C library. In contrast to many other database management systems, SQLite is not a client–server database engine. Rather, it is embedded into the end program.

It is really that simple.


> > Microsoft Word is not a word processor -- it's a file format.

> No, don't be obtuse. I'm saying that .docx is a file format.

I'm afraid I'm not the one being obtuse. That statement is a mirror; please have a look.

> Precision of thought.

Which is why "SQLite is only a file format" is a false statement, and you shouldn't be making it.

> But you don't need the SQLite library to work on SQLite files. You don't need even need SQL.

That's like saying btrfs isn't a filesystem, because grub knows how to read it. The core functionality of SQLite is the query system. The fact that it's got a well-defined file specification which other projects can read is one of its features, not the sum of everything that SQLite is.

> I literally sat in a meeting last week where a senior decision-maker at a client wouldn't accept delivery of some software because it used SQLite and didn't want to add maintenance of yet another database to their overworked DBA staff and didn't want to hire a dedicated person to manage it. So now, instead of just taking delivery of the software, some of it has to be rewritten to use the client's RDBMS system, which in turn actually will add workload to the overworked DBA staff and will also perform worse.

Finally, something remotely concrete, rather than a repetition of the same false statement.

So the problem you're trying to solve is that people see "SQL" and think "Oracle": A massive installation which requires separate resources, both in terms of servers and manpower to maintain it.

I can see why you want to try to correct that false belief. But your solution seems to be to introduce another false belief. Imagine you're successful in getting people to accept that "SQLite is just a file format". Five years from now, someone else will be posting this to HN:

"I literally sat it a meeting last week where a senior decision-maker at a client wouldn't accept delivery of some software because it used SQLite, and he said SQLite is just a file format like JSON; and they need advanced SQL querying, safe transactions, and safe access by multiple accounts. So now, instead of taking the delivery of the software, some of it has to be rewritten to use the client's RDBMS system."

You're not going to fix one misconception by introducing another. One better thing to say would be the truth:

"SQLite allows us to embeds database functionality into your application, so there's no need for a separate stand-alone database."

Or, in fact, to do what this article does, and try to hijack current hype around "serverless":

"SQLite is a serverless database -- you don't need to install and maintain a new RDBMS; it's embedded inside the application itself. No additional maintenance necessary."


It can open multiple files (the transaction log).


Almost any modern programming language can open multiple files.


At a minimum it is multiple file formats. But really it is multiple file formats + a fairly intricate library for dealing with them along with locks etc.

According to their home page the SQLite database file format is a file format, and SQLite is a library.


No, but close. A SQLite file is a single format. The journal file and the WAL file are different formats used for bookkeeping by the library in its attempt to be ACID compliant. The library implements some complex logic to ensure this, but reading/writing a SQLite file does not require any of it.

You could write your own code tomorrow that reads/writes SQLite files but does not produce, read, or write WAL or Journal files. So long as the resultant SQLite file follows the specification, it can be read by any other piece of software that implements the specification, such as the SQLite library or SQLJet (https://sqljet.com/).


> SQLite is a file format with a nice API that uses SQL as the paradigm for reading/writing to the file.

<insert-your-favourite-relational-database-management-system-here> is a collection of bits with a nice interface that uses a query language as the paradigm for reading/writing data.


No, an RDBMS is a piece of software that manages databases in the relational model and access to those databases (such as users and permissions), and provides services such as a server, connection pooling, and so on.

SQLite provides almost no RDBMS features.

This isn't just semantics.

A car is not an engine. A fork is not a kitchen. A SQLite file is not a DBMS.


https://en.wikipedia.org/wiki/Database#Database_management_s...

>Connolly and Begg define database management system (DBMS) as a "software system that enables users to define, create, maintain and control access to the database".[24]

>The functionality provided by a DBMS can vary enormously. The core functionality is the storage, retrieval and update of data. Codd proposed the following functions and services a fully-fledged general purpose DBMS should provide:[25]

[x] Data storage, retrieval and update

[x] User accessible catalog or data dictionary describing the metadata

[x] Support for transactions and concurrency

[x] Facilities for recovering the database should it become damaged

[ ] Support for authorization of access and update of data

[ ] Access support from remote locations

[x] Enforcing constraints to ensure data in the database abides by certain rules

Under this definition SQLite - the library - clearly is an RDBMS that leaves out some common features that do not make sense within its niche but is otherwise fully functional, and under this definition the files that SQLite manages are the database, not merely a file format.


The zip utility.

[x] Data storage, retrieval and update

[x] User accessible catalog or data dictionary describing the metadata

[ ] Support for transactions and concurrency

[x] Facilities for recovering the database should it become damaged

[x] Support for authorization of access and update of data

[ ] Access support from remote locations

[x] Enforcing constraints to ensure data in the database abides by certain rules

Congratulations, apparently zip files are as much of an RDBMS as SQLite. If I bundle zip with ssh (Access support from remote locations) and Linux (Support for transactions and concurrency), did I just create a new RDBMS?

How many checkboxes do I have to tick in order to call anything an RDBMS? Is SSH an RDBMS (access support from remote locations)? Can I just put a catalog in a .txt file and check that box? Is XML an RDBMS because it enforces constraints and supports storage, retrieval and update? Are chmod and chown an RDBMS because they support authorization of access and update of data?

> SQLite manages are the database, not just merely a file format.

It turns out databases can be just files. Those files must follow a described file format. SQLite files are relational databases that are instantiations of the file format specification for SQLite files. The SQLite library implements that file format as well as some clever logic to support SQL and ACID compliance. Some SQLite libraries do not support these things.


I think you’re stuck in 90/10 rule territory here. But even so, SQLite was 220,000 lines the last time they measured, which was five years ago. You can pack a lot of functionality into 22 kloc (the 10%), even ignoring the other 90%, which you shouldn’t.

Lodash, for instance, is much smaller than 22k lines, and it “just” manipulates objects and lists.

If you downplay others like this, I wonder how you feel about your own work. Have you been working hard for years on something that “just” accomplishes a straightforward task? Are you happy? I know I wasn’t.


I'm not downplaying anybody's work. Many people approach SQLite as something in the RDBMS territory. It's not. Almost all of the confusion I've ever seen related to SQLite comes from starting from that basis. If one simply thinks of it as an alternative to fopen() then it makes very simple and intuitive sense.

The people in this thread seem to be very resistant to this simple clarity of thought, but whatever, they can stay confused and keep coming up with feature comparisons of SQLite vs Redshift vs Elasticsearch or some such.

If one were to draw a spectrum:

   file-format:<-x----------------------------->:DBMS
SQLite is the x on this line, and .txt files are about the only thing that's any further left on it.


There is no such spectrum. SQLite is a piece of software that implements some but not all commonly expected RDBMS features. Software is not a file format; software may be written with the expectation that a given file follows the requirements of a certain format, and it may be written so that it produces files that follow that format's specification. Since SQLite - the software - is an RDBMS, the files it produces can be considered to be databases.


I just drew the spectrum. It exists now.

SQLite is a file format. You can read/write SQLite databases without the SQLite library. You can write your own custom reader/writer/creator. You don't have to use SQL. You don't have to be ACID compliant. Right now I could make you a SQLite database that never touched any SQLite software, put data into it, and you could open it with another piece of software that implements the SQLite file format specification.

Likewise, you can use the SQLite library software to create a SQLite database file, put data into it, and I can read it/update it using any other software that implements the SQLite file format specification.

The SQLite library offers some very very basic features, such as ACID compliance, and so on, but those are not part of or guarantees of the file format or the database files. The software that you write that implements the SQLite file format specification does not have to do any of these things to work with or produce a valid SQLite file.

An RDBMS is a kind of DBMS for managing relational databases and providing access to the databases (for example users and permissions). Modern RDBMSs offer extensive features (look at an Oracle or MSSQL Server spec sheet) that are not even hinted at with the SQLite library software.

This is because SQLite is not an RDBMS, it's a file format.


The impression given here is that you only use it as a dumb store of data. My experience is that it's more like:

  file-format:<------------------------x------>:DBMS


Why, because it offers SQL support? That just makes it a relational database that supports SQL. MS Access supports SQL.

If you were to draw up a feature list of Oracle, PostgreSQL, MS SQL and SQLite, SQLite would have almost none of the features of any of the actual RDBMSs.

Here's some examples from MS SQL:

- Support to PMEM devices and bypassing OS storage mechanisms for optimal file read/write access

- Availability Groups and synchronous replica pairs

- Users and permissions

- Secure Enclaves

- Certificate management functionality

- BI tools

- Database tuning advisor

- Machine Learning services

- Service Broker

- Replication services

- Analysis Services

- Reporting services

- Notification services

- Integration services

and so on.

Draw up a set of features for JSON files and jq and compare to SQLite. Is it closer to MS SQL or JSON?


I've seen several software projects that are built for an RDBMS let you use SQLite. It works.


They're using SQLite as the file format for persisting data. It's a great file format for this. You could even build an RDBMS on top of uncompressed WAV files if you wanted. It doesn't make WAV files RDBMSs.


But SQLite has RDBMS features. I remember being able to show tables and do SQL queries in SQLite DBs.


Those are relational database features, not RDBMS features. The SQLite file format specifies a way to organize, store and retrieve fairly arbitrary data using a relational database model.

The library knows how to handle SQL to describe the work being done. The SQL is optional, one can, with the specification, read/write SQLite files in many other ways.

There are almost no RDBMS features in SQLite. There are many many other file formats that store data that offer features that are very similar to SQLite files: indexes, journaling, write logging, etc.

Thought experiment: you can ask a .tar file to give you a listing of what files and directories are stored in it. Are .tar files RDBMSs? Consider:

1 - If you consider each file in the .tar file a "table" you can get list of tables.

2 - If each file follows a regular format, say JSON, you can search the "tables" by extracting the file and grepping it or using jq or whatnot.

3 - You can store a special file that is an index of some kind that lets you know in which file some data is, or even where in the file it is.

4 - You can build logic such that when you want to do other CRUD operations you can record a journal and a write ahead log to help build in ACID compliance.

5 - You can build buffer logic to support Write-ahead-logging, transactions and what not to improve performance.

Are .tar files RDBMSs? Trivially no.

But maybe, if you do all these things, you've invented a terrible database and database engine.

However, you need to build a server, user access controls, connection pooling, import/export tools, partitioning, clustering, etc. before you start to arrive at an RDBMS that uses this engine.


You're making some basic assumptions that do not make sense.

The database can be physically stored in any arbitrary format. One can build an RDBMS that stores all its data in tar or JSON files, no matter how inefficient, as long as software exists that manages the database.

>Are .tar files RDBMSs?

This question doesn't make sense because you are asking if databases can be management systems which is obviously false by definition.

If the 5 steps you have described are implemented in software then that software would be considered an RDBMS and the .tar file clearly would be a database. There is no confusion.

>However, you need to build a server, user access controls, connection pooling, import/export tools, partitioning, clustering, etc. before you start to arrive at an RDBMS that uses this engine.

Those features are not necessary for a piece of software to be called RDBMS but most industry standard RDBMS do indeed support these features and SQLite clearly is an RDBMS that pursues a certain niche that only makes sense in certain situations.


> This question doesn't make sense because you are asking if databases can be management systems which is obviously false by definition.

Yes! And by extension the tar utility is not a DBMS even if it checks some of the boxes for one. And thus SQLite files are not RDBMSs. Looks like you and I agree.

> The database can be physically stored in any arbitrary format. One can build an RDBMS that stores all its data in tar or JSON files. No matter how inefficient as long as software exists that manages the database.

Sure! One can come up with all kinds of very terrible software. But what's the distinction between some random software that just allows CRUD operations on a file format and an RDBMS by your definition? Because you've defined something close to 100% of all software as an RDBMS which makes the distinction between software and RDBMSs meaningless.

There has to be something more than just that to be an RDBMS doesn't there?


Whilst .tar files are not an RDBMS, I can assure you that SQLite more than qualifies as an RDBMS.

Even the popular vote says it is: https://www.google.com/search?q=sqlite+rdbms


If there's no inter-process communication and it doesn't use the OS to write to the filesystem... how does an application write to a SQLite db?

I'm a noob and just curious.


It uses the OS to write to the filesystem. It just does some clever management of the data that needs to be written. You can build your own if you are clever enough to read/write your own file format too.
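
To make that concrete, here's a minimal sketch using Python's built-in sqlite3 bindings (the db name is made up); every call below is just library code doing ordinary file I/O inside your own process:

  import os
  import sqlite3

  # No daemon to start or socket to connect to: connect() simply
  # opens (or creates) a file, and all I/O happens in this process.
  conn = sqlite3.connect("app.db")
  conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
  conn.execute("INSERT OR REPLACE INTO kv VALUES ('greeting', 'hello')")
  conn.commit()
  print(conn.execute("SELECT v FROM kv WHERE k = 'greeting'").fetchone())
  conn.close()

  print(os.path.getsize("app.db"), "bytes on disk -- just a file")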

If you want to build an RDBMS using SQLite as the core, you can. You can also do it using uncompressed WAV audio files if you are clever and hate yourself enough.

To use SQLite in such a scenario you simply have to write the entire RDBMS minus the file handling routines. This includes a connection pooling mechanism and a single process to isolate the connection to the SQLite file so that the OS doesn't get angry when you try to have multiple things writing to it.


Does this mean that you can hack some database storage (w/ sqlite) together on frontend only hosting platforms like Github or Netlify?

I think not, but I wonder if some hack is available by virtue of it simply being a file that you can read (and somehow write to).

The best I came up with: let's say you have a toy project, and you call the GitHub API and replace the file upon every write. Implementing a read is easier as you know where the file is located. This hack shows that you really need direct write access to get any kind of performance out of it, because this hack is super slow.
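
Roughly, the hack would look like this (a sketch against the GitHub contents API; the owner/repo/token values are made up, and every write re-uploads the whole database, which is exactly why it's so slow):

  import base64
  import requests  # third-party: pip install requests

  OWNER, REPO, PATH = "me", "toy-project", "data.db"  # hypothetical
  URL = f"https://api.github.com/repos/{OWNER}/{REPO}/contents/{PATH}"
  HEADERS = {"Authorization": "token <personal-access-token>"}

  def replace_db(local_path, message="update db"):
      # The API needs the current blob's sha to overwrite an existing file.
      resp = requests.get(URL, headers=HEADERS)
      sha = resp.json().get("sha") if resp.ok else None
      with open(local_path, "rb") as f:
          body = {"message": message,
                  "content": base64.b64encode(f.read()).decode()}
      if sha:
          body["sha"] = sha
      requests.put(URL, headers=HEADERS, json=body).raise_for_status()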


> Does this mean that you can hack some database storage (w/ sqlite) together on frontend only hosting platforms like Github or Netlify?

You can run SQLite in the client via WebAssembly and therefore open a SQLite file in the browser to query it, yes; you just can't write anything to it and expect it to persist somehow on the static hosting service itself.

> The best I came up with: let's say you have a toy project, and you call the Github API and replace the file upon every write. Implementing a read is easier as you know where the file is located. This hack shows that you somehow need write access to get any form of performance out of it, because this hack is super slow.

No, you'll need a server for that; you can't make arbitrary HTTP requests to any server from the browser, because of CORS/same-origin policies.

"SQLite is serverless" is meaningless buzzword. It just means that SQLite is equivalent a flat file where you'd shove some data, just that you can use SQL to query that file instead of having to index data in it yourself.


"SQLite is serverless" is by no means a meaningless buzzword. The term had a clear meaning before it was coopted by the current web dev fad. SQLite does not operate on a client-server architecture the way e.g. MySQL or PostgreSQL do.


Just to back that up, it had a clear meaning because -less is a valid suffix to append to English words. When grandma runs out of cookies she is cookieless (when a website doesn't use cookies it is also cookieless). It doesn't have to be "in the dictionary" to make sense in conversation.

A quick search of usenet shows "serverless" being used in 1994. It wasn't a term or a buzzword, it wasn't common, it was just English: https://groups.google.com/d/msg/comp.os.linux.misc/r76oNl98C...


Yes, this is what I was trying to say. It wasn't a thing people would throw around like a buzzword but if someone used it in conversation and especially in context (like the SQLite page, which was written over a decade ago, does) people would understand what you meant.


> does not operate on a client-server architecture

That was never a common usage of the term serverless.


The linked article was written in 2007, for your information.

Software development didn't begin or end with web dev and the cloud.


You could make it persist if you have it uploaded somewhere. But then I guess thats not serverless.


"Serverless" is a garbage marketing term in general. It sounds sexy to nontechnical management folks who are used to hearing a bunch of expensive costs and the word "server" associated with them in some fashion. From that view, anything that gets rid of those pesky "servers" sounds like a win.


It is, I did Azure Functions, which is their "serverless" solution. It's essentially a type of CGI 'trigger', which is not that impressive; however, a 'function' can also be triggered by database actions or file uploads in their Blob Storage. So it's handy if you're all in on Azure, and probably on other clouds too, because you can have code run immediately when events happen for certain services. Outside of that, I find it wasteful.

The other buzzword is microservices. I was working a gig that involved running like 10 on a MBP with 16GB of RAM and it froze up the laptop. Java is not a language for micro anything, at least not in regards to memory. Was fun but dreadful to work with more than 3 local services at once. The creators of the framework suggested to mock services, but it was just messy to do that too.


> The other buzzword is microservices. I was working a gig that involved running like 10 on a MBP with 16GB of RAM and it froze up the laptop. Java is not a language for micro anything, at least not in regards to memory. Was fun but dreadful to work with more than 3 local services at once. The creators of the framework suggested to mock services, but it was just messy to do that too.

This seems like a strange criticism. It appears that you want a full deployment of the platform on your local machine. But Microservices aren’t optimized to run on your local machine; they’re meant to be deployed to kubernetes/docker swarm/mesosphere.

I’m also no fan of Java but I don’t think it’s the language that is preventing you here? All speculation since I don’t know the details but it sounds suspiciously like the JVM is getting stuck trying to allocate heap during startup (this is just a guess).


And when you need to interact with other services in a disconnected way? You have to run them somewhere... and local should generally be an option, even if that's to a local single-node (mini)kube cluster.

There's nothing wrong with needing to run more than one thing while developing/testing and interacting. There are other options, but it's not a clearly bad approach.

That said, I would definitely lean towards Go or Rust myself if given the option (even though I have limited exposure to either) over Java. C#/.Net Core is in the middle imho. Even node works pretty well, but the tons of files thing bugs me sometimes. We're leveraging more node and C# where I work and I definitely prefer node... but for more services I do think that Go and Rust are probably better options.


In general, staging environments are the solution to cross service testing for Microservices. The expectation that a Microservice must be testable end to end locally is not one I would like to encourage. The whole point of splitting out a service this way is that individual services have very narrowly scoped, independent function and can be tested on their own, perhaps with mocked data.

One wouldn’t expect to run a company's entire pipeline locally on a single laptop. Expecting that for microservices is similar.


Java has been used on servers since 256 MB of RAM was common; Java is not a limiting factor in how lightweight you can get on modern Intel hardware, though it might need some tuning.


The bloat was in the framework used, not as much on Java itself.


Not true. While the term is hyped, it is done so for good reasons. It’s not just that management doesn’t want the pesky servers, it’s that most developers themselves don’t want or care for it. Devs want to use platforms that let them quickly deploy their code that implements business logic, and servers/kubernetes and other abstractions aren’t things many devs give a fuck about.


This is overly reductive. “Serverless” might be overhyped, but amongst devs and ops it is a perfectly good way to succinctly describe a very specific architecture with real merit. Paying for exactly the compute time i use is pretty sexy to me when I want to deploy and scale a side project.


I would say (X)aaS (whatever as a service) is probably a better term... like DBaaS in this case, vs Serverless as in in-process, which is probably a closer and better use of the term.


What was wrong with "managed"? It's pretty generic and transfers across many domains and also, more fairly descriptive of what's going on.


"managed" is an AWS instance or a digital ocean droplet. "serverless" is an entirely different thing.


"Serverless" is another level of abstraction and management, that's it. Are you proposing that a term collision adds too much confusion?

"Managed" hosting has been around for decades as have "managed" services, in general, long before Amazon even existed. EC2 brought a new level/layer of automation and management ontop of what colo datacenters used to do, provide scaling etc. To be clear I'm not saying it's simple but "managed" describes much of it.


I would argue that the architectural implications of using a "run individual functions on a remote pool of computing resources and build your client application around this pattern and only pay for the exact amount of compute time you use" is not sufficiently covered by the term "managed". I think there could have perhaps been a better name than "serverless" chosen, but that's the name enough people have agreed on to use when talking about this specific architecture that it would be difficult to use another one.


Agreed, I'm not big in devops, but I can see the usefulness of it if you're already part of a cloud provider, but like I said I only know of the capabilities of it in Azure, I'm not familiar with the capabilities elsewhere.


This perspective has always confused me. Would you say that structured programming was a garbage marketing term, since even though the programming language didn't have goto the compiled machine code has jumps and branches? I would hope not, because the point is that an interface is provided on top of the underlying structure so that you don't have to think about gotos outside of exceptional cases.

Not having to think about what servers in which datacenter are serving a request is pretty convenient. We can argue about whether the pricing for most serverless computing is bad, but as far as the term and the technology itself I don't see how this is different from any other abstraction.


The criticism is on the nomenclature, not what's being provided. "Structured programming" is... pretty well named IMHO. Serverless is a misnomer at best and borders on fraud/false advertising from the name alone.

The technologies are fine and there are potentially useful cases for it. It's no silver bullet though and is overhyped IMHO.


I’ve done that, by persisting to an S3 backend... but there’s no concurrent access and it is all one big race condition.


> Does this mean that you can hack some database storage (w/ sqlite) together on frontend only hosting platforms like Github or Netlify?

The meaning of that 'serverless' and this 'serverless' is incompatible, since the web inherently needs servers.

I think the best you can do is upload the SQLite file to a static hosting platform and decode it in the browser. You can then use it as a file store with DB capabilities. (Which is basically what SQLite really is.)


I will shamelessly plug myself again.

I am the main author of RediSQL [1] and I am about to launch a managed service for it.

It will let you write SQL (SQLite dialect) over HTTP or the Redis protocol; to make things clearer: https://simplesql.carrd.co/

Eventually it will upload your database to an S3 bucket for backup.

[1]: https://redisql.com/


Looks pretty cool!

I can imagine it would be useful for data-science projects, if the pricing is right.


What pricing would feel right to you?


I'm far from an expert on such things, but I like the adjustable fixed-price model (like digital ocean).

Whenever I see a price like $0.0002 per hour, I feel like they're trying to mess with my intuition.

That's especially applicable to data-science, because you're not worried so much about automatic scaling, you just don't want to be surprised at the end of the month.

I don't know if you can match digital ocean's prices (they offer managed postgres, for a fair comparison), but if you can get close then you have a chance.


I was thinking something based on size or performance.

1 req per second, free.

50 req per second, 5€.

Unmetered, 30€.

Maybe not counting per second but per hour or day, so as to accommodate bursts.

Another option would be pricing on features. But that will require time.


If you have full support for SQL, it means some queries can run for a few seconds, possibly even minutes (even SQLite supports some form of BFS, via recursive CTEs). So will you just time out those requests? That would make a lot of possible uses suddenly impossible.
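
(For what it's worth, SQLite itself gives a service a knob for exactly this: a progress handler that can abort a statement mid-flight. A rough sketch of a per-query time budget via Python's bindings:)

  import sqlite3
  import time

  conn = sqlite3.connect(":memory:")
  deadline = time.monotonic() + 2.0  # two-second budget

  # The handler runs every N SQLite VM instructions; a nonzero
  # return value aborts the currently executing statement.
  conn.set_progress_handler(
      lambda: 1 if time.monotonic() > deadline else 0, 10000)

  try:
      conn.execute("WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL "
                   "SELECT x + 1 FROM c) SELECT count(*) FROM c")
  except sqlite3.OperationalError as e:
      print("query aborted:", e)  # typically 'interrupted'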

I might consider the unmetered option, if the performance gain is worth the 6x cost of other providers.

For web and games, maybe your existing model could work, but then always having a recent backup also becomes more important (and for most sites, the lack of ACID is a deal-breaker).

The nice thing about data-science is that you interact with the workspace yourself, so you know exactly when you want to save a snapshot.


Hmmm, what do you mean by data science?

Like running analysis in a Jupyter notebook?

Indeed I am not sure this case is a good fit...


Yes, that's what I meant. And possibly for ML preprocessing.

Just out of curiosity, who do you imagine your users will be?


The problem with data science is that you usually have relatively big datasets, you care more about throughput than latency, and you work in a secure environment where you can definitely have access to the database credentials.

Streaming the result of a big select over the network is not ideal; moreover, I believe data scientists prefer to work with common technologies. I mean that there are already adapters for SQLite or PG or MySQL, while for RediSQL it won't be as straightforward.

I am thinking developers on the JAM stack would be interested in this sort of API. Or people who want a database without having to think too much about it.


Well, many choose to do their data science over AWS and similar, so I'm not sure there's a big difference. I see your point about throughput and network load, but part of DS is data analysis, where the work is mostly exploratory: finding connections in existing data, and working heavily with aggregated data and previews, rather than just using it as a pipeline for other systems.

I think for most JAMs the network is a bigger hindrance than the query time. So, I hope you know what you're doing.

Anyway, cool project. I'll make sure to check back in a while and see where it went. I'm working on an "adapter" (so-to-speak) that queries SQL, so maybe I'll add yours too when the time is right.


The advantage of using an API like this on the JAM stack would be that very sophisticated applications could be written completely client side. Which is quite interesting IMHO.

What is your project?


My project is an interpreted query language that compiles to SQL (with support for several backends).

It is more capable than ORMs, and provides a layer of abstraction that SQL direly needs but lacks, as well as a shorter syntax, that is in-line with other popular languages.

Here is a very early version of it: https://github.com/erezsh/preql

I've kept working on it, but privately, and I'm trying to make it into a product.

I will probably release it as open-source when it's ready. I still need to figure out the right license, financial model, etc.


This document is golden: serverless means the application reading from and writing to the database directly, in-process.


I find this article valuable. I made a mental note to check if SQLite is enough every time I consider one of the “serverless” options.


Choose your own definitions... but SQLite is not serverless by common parlance.

Author creates two definitions for serverless which don't match the common usage. Serverless is more about DevOps / deploy experience than how the program leverages OS processes internally.

Apparently MS and AWS are ISPs?

Maybe SQLite could be serverless if you defined it as incapable of running as a server on its own?


Serverless used to have a very clear definition: not having or using a server, so SQLite is a perfect example of serverlessness. I was extremely confused when the newfangled definition (called neo-serverless in TFA) showed up, or your definition for that matter. Who ever thought up these confusing meanings for a word that used to be perfectly clear?

You don't need an ISP to have a server. Any computer or program that listens to a network port is a server.


Or is it any program that _responds_ to comms on a network port? The "serve" part of "server" ;)


MS and AWS are indeed ISPs, but not because they have servers.

They have their own IPs and global networking infrastructure.


I don't think we fully appreciate just how pervasive SQLite is in our computing devices. Apple recognized its power and utility and embedded it in macOS and iOS, and Google apparently followed suit in Android. The universality of SQLite is due in no small measure to its stellar quality and the trustworthiness of its author, D. Richard Hipp.


Serverless, like all embedded databases ;-)

In the sense you don't need to provision an additional piece of infrastructure to power your application :-)


Yep SQLite is in fact serverless. We based our open source web based IDE on SQLite and it works fantastically. The best part is that you can even take a SQLite database and use sql.js and run it offline in the browser!


"Microsoft Azure Cosmo DB and Amazon S3 are examples of a neo-serverless databases."

Can we back away a bit from the bandwagoning of misused terminology? Serverless literally means "running your apps on somebody else's server". S3 is not a server you run your apps on, it is SaaS that you manipulate through an API - you don't put your apps on it. If S3 is serverless, then literally every network service of any kind is serverless.


I always preferred XaaS nomenclature myself. "XaaS with autoscaling"... is probably better than "Serverless"


The main problem that I see with using SQLite is exactly that it is 'Classic Serverless'. Because how does one keep an up-to-date backup? Deploying to Heroku, Dokku, AWS Lambda and such means the SQLite file will be lost on a crash or new deploy. Even a VM can crash. Export to S3 on every write? Maybe if changes do not happen often, so only for specific use cases (and actually I think you should just generate static HTML in that case).

I use sqlite for local tests for cases where the live app has a 'neo-serverless' database. It is very, very fast so the tests run almost instantly.


> Because how does one keep an up te date backup?

https://www.sqlite.org/backup.html

There's a ".backup" command:

* https://sqlite.org/cli.html#special_commands_to_sqlite3_dot_...

Alternatively, given that it's ACID, you could just take a snapshot of the file system/volume in question, and do a recovery on restore.

Edit: SQLite also has WAL files, so presumably one could just use tar/rsync to create the backup, and only the last file would be 'corrupted', so you'd lose the last (few) transaction(s):

* https://www.sqlite.org/wal.html
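
For example, Python's standard library has exposed the online backup API since 3.7, so a safe hot backup (equivalent to the CLI's ".backup" dot-command) is only a few lines; the db names here are made up:

  import sqlite3

  src = sqlite3.connect("app.db")  # hypothetical live database
  dst = sqlite3.connect("app-backup.db")
  with dst:
      # Copies the database page by page within a consistent snapshot,
      # even while other connections keep reading and writing the source.
      src.backup(dst)
  dst.close()
  src.close()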


Ding! Correct answer. In addition, backups can run in the background, simultaneous with active use in other threads. When the backup is initiated, it is a transaction, so it maintains knowledge of which portion of the database is new since the backup began and does not include that data. Which is what you want for backup integrity.


>how do you back up a file


It's a valid concern insofar as any standalone database setup you would use at a serverless provider already has backups taken care of (for example, AWS Lambda with AWS RDS has backups built into RDS). If I deploy SQLite instead, I have to take care of backups myself, and that might not be trivial to get right.


It's just a single file. Easier to backup than any other database.

Also, I don't think any cloud provider/database provides an always up-to-date backup other than a standby replica (which isn't exactly a backup either).


If I remember the "ways to corrupt SQLite databases" page discussed a few days ago, backing up SQLite isn't entirely trivial. By default there are up to three files that have to be copied simultaneously, else you risk corruption. The optimum is to run the backup command to create a copy, but that requires realizing this exists.


If you do find the link, could you please share it here? I think I've been hit by it once before.

Also, it's comparatively simpler than other DBMSs like Postgres or MySQL.



Thanks. Useful information.

But I've been copying after taking a shared lock on the SQLite db, and I think that's supported, as mentioned on https://www.sqlite.org/backup.html.

The online backup API is nice but it has to be done in-process or through a dedicated application. While the lock and file copy is very easy to do using whatever shell the OS provides.
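
For illustration, the same pattern from Python (a rough sketch; it leans on the fact that a deferred transaction takes the shared lock on its first read, and assumes rollback-journal mode rather than WAL, where readers no longer block writers):

  import shutil
  import sqlite3

  conn = sqlite3.connect("app.db", isolation_level=None)  # autocommit mode
  conn.execute("BEGIN")  # deferred: no lock taken yet
  conn.execute("SELECT count(*) FROM sqlite_master")  # first read -> shared lock
  try:
      # Writers cannot commit while we hold the shared lock.
      shutil.copy("app.db", "app-backup.db")
  finally:
      conn.execute("COMMIT")
      conn.close()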


I like the definitions of classic and neo serverless. Does anyone know of more articles about serverless in the classic sense?


> Neo-Serverless: The database engine runs in a separate namespace from the application, probably on a separate machine, but the database is provided as a turn-key service by the hosting provider, requires no management or administration by the application owners, and is so easy to use that the developers can think of the database as being serverless even if it really does use a server under the covers.

I’m having a failure of imagination here.

What would I use a database for that I could reasonably assert to my peers and superiors that no maintenance whatsoever is required? Backups and restores count as maintenance. Multi region is now common, if not pervasive.

Are there public datasets that are so common that it would be worth it to provide it as a service? What other service would behave like S3 but look like SQLite?

I get that one might be safe to assume that “serverless” isn’t just pure functions. It could reach out to other services that are not serverless and still not consume (further) resources on a set of machines while not in use.

But a serverless database... I’d have to have something aggressively read-mostly, written to S3 at intervals and read from serverless processes. But is that a new thing, or just reading data from S3?


I integrated my database with the app server, but I still open a socket internally to access it.

So my system is also "serverless" in that meaning.

Acronyms used to be confusing, this is just ridiculous.

http://root.rupy.se


CSV is also serverless...


But usually not multi-user-write with transaction support.


CSV does not traditionally involve a server. RDBMS do.


CSV doesn't have a write-ahead log.


The file system it is stored on does.


I don't think SQLite had WAL when that page was created.


I think bob means transactions.


Transactions, durability and acceptable performance with concurrent access.


Very serverless. The new thing.


I use SQLite as a backend for my website and absolutely love it. Not having to set up a full DBMS just for a small blog is a dream, and backups are as easy as copying a file.


I currently use it to build a read-only REST service for the GeoNames gazetteer. Each instance contains a copy of the database, which means it's really linearly scalable.


My understanding of serverless = easily scalable, managed service.

But somehow the word annoys people. Maybe we should find a better word?


> My understanding of serverless = easily scalable, managed service.

There is already a term for that concept: managed services. There is no such thing as a managed service that's designed not to be scalable. Some implementations may be better at scaling than others, but that's it.

The serverless buzzword is pure marketing.


On the one hand, the term serverless is really frustrating to me.

On the other hand, there's a certain amount of amusement I get from seeing "the cloud" become a marketing buzzword in the mid-late '00's, and now seeing the same thing happen with "serverless."

When you look at the implementation of the two "technologies", they're about 95% the same. Yet somehow they're pitched as these big revolutions.

In another decade when terminals or p2p become popular again we'll be hearing about some new buzzword like "Terran" computing, or "social" architecture or something.


For me, serverless is not just "easy" scaling or a managed service, but completely automatic and elastic scaling with metered billing. If you can go from 1 to millions of requests without touching your infrastructure or dropping requests, then that's serverless. If you have to provision larger instances or even wait 10 minutes for your autoscaling policy to spin up new VMs so you can serve those requests, then it's not.

Some services labelled serverless really do reach very close to this ideal (S3, for instance), while others fall short in various ways. "Serverless" Aurora, for example, can't scale writes beyond a single instance, so while it can take you quite far (a 96 core db instance can handle a lot of writes), past a certain point it's no longer really serverless anymore since you'll have to figure out some kind of sharding strategy to keep scaling writes. With S3 or DynamoDB, this doesn't happen. While even those services do have some sanity check limitations, they can scale seamlessly up to the point where you start to approach the scale of AWS itself.


I agree, yours is definitely a better description.


I always explain serverless as "Instead of one server you have countless servers, but they are managed by someone else".

Yes, a better word would be great. I think that ship has sailed though...


'Instanceless'. Because the difference is that you outsource and forget any questions about the instance(s) of the service (and whatever supports it, like the os process, the kernel, the vm, the real machine, the datacenter).


My understanding was that cloud = easily scalable, managed service.


That was mine too. It’s possible that this previously applied to primitives (cpu, memory, network, storage) and now refers to applications (keyvalue store, SQL database, message queues, etc.)


Serverless means no server, yet everything uses servers. Yep, we should find a better expression.


Peer-to-peer networking in its most pure form does not need a dedicated/central server. Each machine hosts a client/server instance locally.


I never found things like DBaaS annoying at all.


"There's no server process" vs "there's no server on your itemised bill".


I don’t get it; then any db can be classic serverless if it runs on the same computer as the app?


>any db can be classic serverless if they run in the _same computer_ as the app?

Not just the "same computer" ... it's the same process id (PID).

Extract of relevant text from that webpage:

>Classic Serverless: The database engine runs within the same process, thread, and address space as the application. There is no message passing or network activity.

In other words, when you compile and link "sqlite3.c" into your own executable, the same PID (process) that handles text input and paints pixels on the screen is the same PID that writes to the SQLite database file. It's all the same process. That's what they mean by "classic serverless".

In contrast, if you make a Go executable that writes to MySQL/PostgreSQL db and make them both run on the same physical computer, that's not "classic serverless". It's because when you enter "ps -aux" to list all running processes, you see separate PIDs for the Go executable and the MySQL db engine.

Other jargon used might be "in-process" vs "out-of-process" or "embedded" vs "external". SQLite is sometimes characterized as "in-process embedded database" but MySQL is an "out-of-process" db.
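
You can see the "same PID, same thread" point for yourself with a little sketch (Python bindings here): register an application-defined SQL function and watch the engine call back into the very process and thread that issued the query:

  import os
  import sqlite3
  import threading

  def whoami():
      return f"pid={os.getpid()} tid={threading.get_ident()}"

  conn = sqlite3.connect(":memory:")
  conn.create_function("whoami", 0, whoami)

  print("application:", whoami())
  print("db engine:  ", conn.execute("SELECT whoami()").fetchone()[0])
  # Both lines print the same pid and tid: the "engine" is just
  # library code running on the caller's own stack.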


No, there’s still some form of IPC going on then. SQLite is an embedded database - everything happens in the memory space of the same application. Of course you can still throw a web front-end in front of your application but there’s still no extra hop to the database, per se.


Yes, technically, although the distinction they make is that it's in the same process.

It's a bad definition. I would stick with the terms "embedded" or "in-process" which have been around for decades and are well-understood.


Isn’t it easier to say “embedded” than “serverless”?


Not in this thread apparently. I guarantee nobody here has ever used the term "serverless" over "embedded" or "in-process" in their entire career, but apparently the purity and nostalgia of SQLite overrides it.


This article was written in 2007, so yes, people have been using the word "serverless" before managed service providers decided to take it up as a buzzword.


Nobody used it before. Even the about page says: "SQLite is an embedded SQL database engine."

"serverless" is a marketing term, then and now. Not sure why there's such a big defense of it. If you have to argue this much over the provenance of a term that clearly isn't used the same way today then it's a good sign that it's not very useful.


> Nobody used it before.

The archive.org links already provided to you prove that you're flat wrong about that. SQLite used the term many years before it was a webshit buzzword. They used it in the intuitive, straightforward English sense; somebody who has no hat is hatless, somebody who has no home is homeless, and a database system that has no server process is serverless. In 2007, when it was written, nobody would have batted an eye at this term; the meaning would have been immediately clear to anybody who had any familiarity with databases.


Just because a single page on the internet used it does not mean the term was used in the industry. It also doesn't matter if it technically makes sense, although it's a stretch (because there is a server process, you just share it).

It's a marketing term, and a poor one at that. That's why SQLite even describes itself as an embedded, in-process database without client/server architecture. Because that's the common jargon.

I'm surprised at the endless defense of a marketing buzzword and the argument over which marketing definition is the "real" one. This is bikeshedding at its finest.


Nobody was saying it was a buzzword in the industry back then (actually...[1]). It was being used as a normal English word, not some misleading marketing drivel like the new use of the term. Somebody who has no shirt is shirtless. An MRE heater that doesn't catch fire is flameless. Somebody who isn't witty is witless. Somebody who doesn't have a clue is clueless.

Are you starting to notice the pattern here? "less" is a suffix that can be applied to nearly any noun to describe the trait of lacking that noun. When this is done, the meaning is clear to native English speakers who've never heard that combination before. When this article was written in 2007, the meaning was clear. SQLite didn't muddy anything, didn't redefine anything.

[1] The term 'serverless' was in fact in use in the decade and a half prior to 2007: https://books.google.com/ngrams/graph?content=serverless&yea... I don't know if the term was being used primarily in a technical context, but I have little doubt the term was being used to describe something that did not have a server. In fact I'm just about certain that spike was the term being used in a tech context. I've found you an example of the term being used in 1995 (DOI 10.1145/224057.224066):

> A serverless network file system distributes storage, cache, and control over cooperating workstations. This approach contrasts with traditional file systems such as Netware, NFS, [...] where a central server machine provides all file system services.

This is not precisely the same way that SQLite has used the term. Rather, they're using the term in an intuitive natural way that fits their context, just as SQLite used it in 2007.

Here is an example from 2007 (DOI: 10.1109/TNET.2006.886289):

> Abstract—We explore exploits possible for cheating in real-time, multiplayer games for both client-server and serverless architectures.

Here we have the term "serverless" clearly being contrasted with "client-server", which seems very similar to the way in which that SQLite document used it.


This is not new or unknown information, why is this on the front page?


Because other people deemed it interesting.


The fact that SQLite is serverless is common knowledge and not interesting unless you're brand new to databases


I think a better name (coined back in the 1990s) is embedded db engine. MySQL, Firebird (InterBase), etc. can also be linked into a binary and run in the same process, without any kind of sockets.


Two definitions of serverless?

I've heard many, but what they're describing isn't one of them.

As far as I can tell SQLite was always called an embedded database, and never a serverless one.


The page is more than 10 years old. And it uses "serverless" in the sense of server-less, aka not having a server. Which is pretty damn sensible terminology, as opposed to the newer "serverless" meaning "runs on some utility server outside of your care or control".


I know, but have you heard anyone calling "embedded databases" "serverless databases" in the last 10 years?

It's like someone found this page and felt very smart about it, because it sticks it to the serverless crowd.


"embedded database" does not imply "serverless database". Other embedded RDBMSes run servers; they just run the server as a separate thread rather than a separate process. SQLite is different in that there is no separate thread of control. SQLite runs in the same thread as the application that calls it. There is no separate thread hanging around to clean up or handle background tasks after an SQLite function call returns.


> SQLite is an example of a classic serverless database engine. With SQLite, there are no other processes, threads, machines, or other mechanisms (apart from host computer OS and filesystem) to help provide database services or implementation. There really is no server.

Well, it's bending the overall consensus defining serverless as a managed and/or stateless service.

I never saw anyone using the term "serverless" to mean "embedded".

Using this definition, anything and everything that does not require a specific server to be served/distributed can be described as serverless.


The SQLite page dates back to 2007 at least: https://web.archive.org/web/20071115173112/https://www.sqlit...

> Using this definition, anything and everything that is not requiring a specific server to be served / distributed can be described as serverless.

The distinction is useful to make when similar systems traditionally rely on a client/server model. Which many DBMS do, to say nothing of RDBMS. That it is server-less is a distinctive feature of SQLite as an RDBMS.


> Well, it's bending the overall consensus defining serverless as a managed / and or stateless service.

That assertion is quite the stretch because there is no consensus on what serverless actually means. The only thing that exists is that the concept of function-as-a-service is being forced as a placeholder for serverless, but some vendors try to manipulate the definition to include their managed services offerings.


There's a section added in 2018 to deal with the apparent confusion.

It'd be a better idea to just delete the page. It may have been written long before the meaning changed, but it's pointless to fight a losing and completely insignificant battle over language.


Hard disagree. The "neo-serverless" version has always been extremely confusing to me. I expect serverless to mean the absence of a server.

In SQLite's particular case, it's subverting the expectation that has persisted since the beginning of time (of databases) that a database must be managed by a server.


Why even write this? It's yet another convoluted use of the term and meaningless for SQLite of all things.

Re: downvotes - what are people disagreeing with? That the term is not convoluted? That it's actually useful? That it helps to have more sub-definitions in an industry known for overloaded terms? I guarantee not a single person here has used "classic serverless" over "in-process" or "embedded" in their entire career.


It describes an important architectural feature of SQLite.

Also, this page was first written in 2007 (or perhaps earlier) [1], long before 'serverless' was applied to things like Amazon Lambda.

[1]: https://web.archive.org/web/20071115173112/https://www.sqlit...


Adding an original timestamp to the page would've been more helpful than the updated section, or just deleting the page entirely. Or just title it as "SQLite is not client/server".

Nobody goes around talking about "classic serverless" instead of "in-process" which has been around for decades.


The article is from 2007, and explicitly calls out that the "neo" usage of the term is different than the original. If you scroll to the fifth line or so, you'll find the update. I didn't have to scroll on my device, though I can imagine there are some smaller devices that may not have that line on the screen when opening the page.


Where does it say 2007? Regardless it's vague and irrelevant material.

Classic serverless has always been known as "in-process". The attempt to distinguish them seems like adding marketing fluff rather than just removing the page entirely.


It doesn't say 2007 anywhere, but the page is clearly at least that old.

https://web.archive.org/web/20071115173112/https://www.sqlit...

I also found material from 2004 with the term.

https://www.tcl.tk/community/tcl2004/Presentations/D.Richard...

> Regardless it's vague and irrelevant material

It's not vague at all, the term makes sense to distinguish "in-process" from client/server models. The page also includes an explanation on the very next line.

> Adding an original timestamp to the page would've been 100x more helpful

Not everyone reading about a database product is a developer. A timestamp would not help anyone unfamiliar with the history of the serverless term. The alternative is to rewrite all instances of "serverless" in the documentation which is a waste of time. Modern "serverless" is a stupid buzzword and this page clearly serves as a protest.


Complete history of the document in question is here: https://www.sqlite.org/docsrc/finfo?name=pages/serverless.in...

It was, indeed, written in 2007, but based on ideas that predate that.


I would love to know who uses "serverless" instead of "in-process". Why add a new term at all?

And if the definition has since been muddied, then all the more reason to avoid using it instead of creating even more niche definitions.


> I would love to know who uses "serverless" instead of "in-process". Why add a new term at all?

"In-process" is meaningless to non-IT people, they don't even know what a process is. The SQLite dev probably created the term for marketing purposes, i.e. the exact same reason cloud providers adopted it 10 years later.

> And if the definition has since been muddied, then all the more reason to avoid using it instead of creating even more niche definitions.

I disagree. The term in relation to SQLite is clearly defined and predates the modern version, there's really no need to go back and change it. You're also disregarding the statement made by the SQLite dev by keeping this page and updating it with a clarification.


SQLite is meaningless to non-IT people.


Those people wouldn't end up on a random documentation page for SQLite then, making this argument superfluous.

> I would love to know who uses "serverless" instead of "in-process". Why add a new term at all?

This discounts which term came first. Back in 2007 it was just fine to talk about this as being serverless, the marketing term gained popularity years later. They even talk about the more recent definition on the page, I really don't get why people in this sub thread get triggered by some random documentation page written over a decade ago. There's no need to further change or delete that page because this discussion is lacking any practical relevance.


I didn't make the argument, the other poster did.

It doesn't matter what definition came first. It was bad back then when embedded and in-process already existed. Now it's even worse.


Non-IT people make IT decisions all the time. If you can sell your IT product to C-levels, they'll force IT to use it. Look at how companies misuse things like blockchain, ML and AI just because of the hype around those words.


Yea, that's not how SQLite has ever been sold. All the other comments on this post have since shown just how useless the "serverless" label is for this.


I have seen that page for around 10 years or so (except the section that was added in 2018). The term has a very clear meaning and consistent use within the docs (and is consistent with my common-sense understanding of the word too). It is not SQLite's fault that AWS and co muddy the waters with their usage of the term.


It's not their fault. It's not anybody's fault. Language evolves.

But if you wouldn't write it this way today, you should just change it, instead of drawing your readers into a pointless fight over semantics.


> another convoluted use of the term and meaningless

Just as the term serverless itself?

I find this, the old version, superior.


> It's yet another

Actually seems more like the original than yet another. And also it makes more sense “serverless” as in “there is no server” and not as in “someone else manages the server for you”.


The same reason supermarkets put "gluten free" on things like butter, water and vegetables. While it may seem stupid and obvious to the informed and educated, there's a point in everyone's life where they don't know anything about a subject and they need a first step. Hopefully, it's a first step toward a deeper understanding and education on the subject at hand.

I'm sure people will google stuff like "Is SQLite serverless?". There are no such thing as stupid questions, you're only stupid if you choose not to learn.


Wouldn't it be better to say "SQLite is an embedded database and not client/server"? Easily understood using very well-known and absolutely clear terms.

Saying "serverless" is trying to sum up that definition into a single word. It wasn't that useful back then and now has been further overloaded. Who is going to google "serverless" today and learn about SQLite's meaning of it?


I mean, I don't really like the term serverless at all anyway, as its meaning isn't really literal; it's more literal and correct in the context of SQLite, IMO. Unfortunately that's just the world we live in. It also annoys me that the word "literally" now has a secondary definition in the dictionary meaning figuratively; you just gotta get used to it, despite my dislike for it.

Meanings and definitions of words change over time as they get used. Another prime example would be the word "Hacker".


It is not meaningless, but perhaps unclearly expressed.

There is a fundamental point to the argument: you can distribute a database without segregating it.

Three methods to building a 1M+ users webapp:

1. A centralized database, eg. PostgreSQL. Typically it has a single-writer beefy machine.

2. A decentralized NewSQL database. Tables are automatically sharded for writes among a set of database-only servers. Typically offered as cloud services: CosmosDB, Cloud Spanner, Aurora.

3. A distributed system segregated per app. Each user has a dedicated sqlite file.

The third option would be simpler to code for, since it won’t have substantial scaling issues.

It is also easier for a lambda-like platform to provide: load the SQLite file corresponding to the authenticated user, plus the lambda code, and execute the code in a sandbox.
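
A sketch of what option 3 might look like in application code (the directory layout and schema are made up):

  import sqlite3
  from pathlib import Path

  DATA_DIR = Path("data")  # hypothetical one-file-per-user directory
  DATA_DIR.mkdir(exist_ok=True)

  def user_db(user_id: str) -> sqlite3.Connection:
      # Each authenticated user gets a dedicated database file, so
      # there is no shared write contention between users.
      conn = sqlite3.connect(str(DATA_DIR / f"{user_id}.db"))
      conn.execute("CREATE TABLE IF NOT EXISTS notes"
                   "(id INTEGER PRIMARY KEY, body TEXT)")
      return conn

  db = user_db("alice")
  db.execute("INSERT INTO notes (body) VALUES ('hello')")
  db.commit()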

Although one negative aspect for the AWS of this world would be lack of lock-in. It is relatively easy to migrate to another cloud service, or to mix cloud services.


The terms "in-process" or "embedded" have been used far longer than this page has existed.

"serverless" is a marketing term, then and now.



