
These days, for a new Web2 type startup, I would use SQLite.

Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.

Another option worth taking a look at is to use no DB at all. Just writing to files is often the fastest way to get going. The filesystem offers so much functionality out of the box. And the ecosystem of tools is marvelous. As far as I know, HN uses no DB and just writes all data to plain text files.

It would actually be pretty interesting to have a look at how HN stores the data:

https://news.ycombinator.com/item?id=32408322



I sometimes see recommendations like this and I just don’t get it. It’s not like setting up a PostgreSQL database is hard at all? And using files? That just sounds like a horrible developer experience and problems waiting to happen. Databases give you so much for free: ACID transactions, an easy DSL for queries, structure to your data, etc. On top of that every dev knows some SQL.

What am I missing?


So I am also "really people, use PostgreSQL", because you have way fewer tricks you have to play to get it working compared to SQLite, in serious envs. However, some challenges with Postgres, especially if you have less strict data integrity requirements (reddit-y stuff, for example):

- Postgres backups are not trivial. They aren't hard, but well SQLite is just that one file

- Postgres isn't really built for many-cardinal-DB setups (talking on order of 100s of DBs). What does this mean in practice? If you are setting up a multi-tenant system, you're going to quickly realize that you're paying a decent cost because your data is laid out by insertion order. Combine this with MVCC meaning that no amount of indices will give you a nice COUNT, and suddenly your 1% largest tenants will cause perf issues across the board.

SQLite, meanwhile, one file per customer is not a crazy idea. You'll have to do tricks to "map/reduce" for cross-tenant stuff, of course. But your sharding story is much nicer.

- PSQL upgrades are non-trivial if you don't want downtime. There's logical upgrades, but you gotta be real fucking confident (did you remember to copy over your counters? No? Enjoy your downtime on switchover).

That being said, I think a lot of people see SQLite as this magical thing because of people like fly posting about it, without really grokking that the SQLite stuff that gets posted is either "you actually only have one writer" or "you will report successful writes that won't successfully commit later on". The fundamentals of databases don't stop existing with a bunch of SQLite middleware!

But SQLite at least matches the conceptual model of "I run a program on a file", which is much easier to grok than the client-server based stuff in general. But PSQL has way more straightforward conflict management in general
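The one-file-per-customer idea mentioned above, with a manual "map/reduce" across tenant files, might be sketched roughly like this (the `orders` table and the `tenants/` directory layout are invented for illustration):

```python
import glob
import sqlite3

def total_orders(db_dir="tenants"):
    """Sum a per-tenant COUNT across one SQLite file per customer.

    The "map" step runs the same query against each tenant database;
    the "reduce" step just adds the results up in the application.
    """
    total = 0
    for path in sorted(glob.glob(f"{db_dir}/*.db")):
        conn = sqlite3.connect(path)
        (n,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        conn.close()
        total += n
    return total
```

The upside is that each tenant's data stays small and isolated; the downside, as noted, is that any cross-tenant query has to be stitched together by hand like this.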


SQLite backups aren't quite that easy.

Grabbing a copy of the file won't necessarily work: you need an atomic snapshot. You can create those using the backup API or by using VACUUM INTO, but in both cases you'll need enough spare disk space to create a fresh copy of your database.

I'm having great results from Litestream streaming to S3, so that's definitely a good option here.
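For instance, with Python's built-in sqlite3 module, an atomic snapshot via the online backup API might look like this (file names are invented):

```python
import sqlite3

# Source database with some data.
src = sqlite3.connect("app.db")
src.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, title TEXT)")
src.execute("INSERT INTO posts (title) VALUES ('hello')")
src.commit()

# Atomic snapshot via SQLite's online backup API: pages are copied
# incrementally, so other connections keep working while the copy runs.
dst = sqlite3.connect("app-snapshot.db")
src.backup(dst)
dst.close()
src.close()
```

On SQLite 3.27+ you could instead run `VACUUM INTO 'app-snapshot.db'`, which also produces a consistent copy (and compacts it in the process).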


That is a very good point! Obviously a lot of stuff is easier if you just allow yourself downtime, but at one point the file gets big enough to where this matters.


Just use zfs snapshot to backup. Let go of the old ways of dump commands.


"SQLite is just that one file"

How do you get enough time to back up that file when it's in use? Lock the file until you've copied it? zfs?


They actually address this problem.

https://www.sqlite.org/backup.html

> The Online Backup API was created to address these concerns. The online backup API allows the contents of one database to be copied into another database file, replacing any original contents of the target database. The copy operation may be done incrementally, in which case the source database does not need to be locked for the duration of the copy, only for the brief periods of time when it is actually being read from. This allows other database users to continue without excessive delays while a backup of an online database is made.


From the user perspective, then, a SQLite backup doesn't seem materially less complicated than a postgres backup. I understood GP's comment to mean that it's easier to back up SQLite because it's just a file [which you can just copy].


It is just a file that you can copy when nobody is modifying data. In most simple cases you can just shut down the webserver for a minute to do database maintenance. Things you already know how to do. You don't have to remember any specific database commands for that. So the real difference is in the entry barrier.


I'm quite confident that if you can snapshot the vm/filesystem (eg VMware snapshot, or zfs snapshot) - you will have a working postgresql backup too...


BTRFS or ZFS sounds just fine to me; you could even create a small virtual disk for it on a host FS (less integrity, and one ought to snapshot the whole system anyway, but hypothetically you could do this just for atomic snapshots, as long as the rest of the system is cattle and those snapshots are properly backed up / replicated)


BTRFS recommends disabling CoW for DB files as it destroys performance.


I recommend that all developers at least once try to write a web app that uses plain files for data storage. I’ve done it, and I very quickly realized why databases exist, and learned not to take them for granted.


Exactly. I'd written my now popular web app (most popular Turkish social platform to date) as a Delphi CGI using text files as data store in 99 because I wanted to get it running ASAP, and I thought BDE would be quite cumbersome to get running on a remote hosting service. (Its original source that uses text files is at https://github.com/ssg/sozluk-cgi)

I immediately started to have problems as expected, and later migrated to an Access DB. It didn't support concurrent writes, but it was an improvement beyond comprehension. Even "orders of magnitude" doesn't cut it because you get many more luxuries than performance like data-type consistency, ACID compliance, relational integrity, SQL and whatnot.


What made my implementation disastrous turned out to be that there was no enforcement of string encodings, and users started to see all kinds of garbled text.


One of my best decisions was to start out ASCII-only. It helped me manage that mess longer. :)


> write a web app that uses plain files for data storage ... very quickly realized why databases exist, and not to take them for granted

You haven't lived until you've built an in-memory database with background threads persisting to JSON files. Oh, the nightmares.


> What am I missing?

Different requirements, expertise, and priorities. I don't like generic advice too much because there are so many different situations and priorities, but I've used files, SQLite and PostgreSQL according to the priorities:

- One time we needed to store some internal data from an application to compare certain things between batch launches. Yes, we could have translated from the application code to a database, but it was far easier to just dump the structure to a file in JSON, with the file uniquely named for each batch type. We didn't really need transactions, or ACID, or anything like that, and it worked well enough and, importantly, fast enough.

- Another time we had a library that provided some primitives for management of an application, and needed a database to manage instance information. We went for SQLite there, as it was far easier to set up, debug and develop for. Also, far easier to deploy, because for SQLite it's just "create a file when installing this library", while for PostgreSQL it would be far more complicated.

- Another situation, we needed a database to store flow data at high speed, and which needed to be queried by several monitoring/dashboarding tools. There the choice was PostgreSQL, because anything else wouldn't scale.
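The first of these patterns, dumping batch state to a uniquely named JSON file, might look something like this (the function names and directory layout are made up for illustration):

```python
import json
import time
from pathlib import Path

def dump_batch_state(batch_type, state, out_dir="batch_state"):
    """Write internal state to a JSON file named per batch type and run time.

    No transactions, no ACID: each run just produces its own file,
    which is enough when all you want is to compare runs afterwards.
    """
    Path(out_dir).mkdir(exist_ok=True)
    path = Path(out_dir) / f"{batch_type}-{int(time.time())}.json"
    path.write_text(json.dumps(state))
    return path

def load_batch_state(path):
    """Read back a previously dumped batch state."""
    return json.loads(Path(path).read_text())
```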

In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.


Thanks for the reply. All the situations that you describe sound very reasonable. I definitely wasn't trying to say "use PostgreSQL always", I too have used files and SQLite in various situations. For instance, your second example is IMO a canonical example of where SQLite is the right tool and where you make use of its strengths. My comment was more directed at the situations where PostgreSQL seems to be the right tool for the job, but where people still recommend other things.


>In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.

We're responding to this below. The requirement is a Web2 app with "millions in revenue".

>Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.


> The requirement is a Web2 app with "millions in revenue".

Millions in revenue doesn't really say anything about the performance required.


>Millions in revenue doesn't really say anything about the performance required.

The assumption is that this is a Web2 company which usually means user-generated content. I assumed that there'd be high reads, decent amount of writes.

We're not trying to write a detailed spec here. Just going off a few details and making assumptions.


It's "kick-the-can-down-the-road" engineering. If the server can't hold the database in RAM anymore, they buy more RAM. If the disks are too slow, they buy faster/more disks. If the one server goes down, the site is just down for an hour or two (or a day or two) while they build a new server. It works until it doesn't, and most people are fine with things eventually not working.

This has only been practical within the last 6 years or so. Big servers were just too expensive. We started using cheap commodity low-resource PCs, which are cheaper to buy/scale/replace, but limits what your apps can do. Bare metal went down hard and there wasn't an API to quickly stand up a replacement. NAS/SAN was expensive. Network bandwidth was limited.

The cloud has made everything cheaper and easier and faster and bigger, changing what designs are feasible. You can spin up a new instance with a 6GB database in memory in a virtual machine with network-attached storage and a gigabit pipe for like $20/month. That's crazy.


It's easy to install k8s and curl a helm chart too; doesn't mean you should.

What you're missing is complexity and second order effects.

Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

It's not for everyone. People have varying levels of experience and needs. I'll almost always set up a MySQL db, but I'd be lying if I said it never resulted in navel gazing and premature db/application design I didn't need.


>Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

Like what? What could be easier than going to AWS, click a few things to spin up a Postgres instance, click a few more things to scale automatically without downtime, create read replicas, have automatic backups, one-click restore?

I feel like it's the opposite. Trying to put SQLite on a VM via Kubernetes will likely have a lot of secondary decisions that will make "scaling to millions" far harder and more expensive.


> Like what? What could be easier than going to AWS, click a few things to spin up a Postgres instance, click a few more things to scale automatically without downtime, create read replicas, have automatic backups, one-click restore?

Not do that and not have to maintain that? Just a regular AWS instance with SQLite will be far easier to set up, with regular filesystem backups which are easier to manage and restore.

> Trying to put SQLite on a VM via Kubernetes will likely have a lot of secondary decisions that will make "scaling to millions" far harder and more expensive.

The issue is assuming that "scaling to millions" is a design goal from the start. For a lot of projects, the best-case-scenario does not require it. For another big set of them, they will never get to it. For the few that do, it's possible that the cost of maintaining the more complex solution while it isn't needed plus the cost of adapting over time (because, let's be honest, it's almost impossible to get the scaling solution right from the start) is more than the cost of just implementing the complex solution when it's actually needed.


>Not do that and not have to maintain that? Just a regular AWS instance with SQLite will be far easier to set up, with regular filesystem backups which are easier to manage and restore.

You'd have less maintenance using RDS than you would with an SQLite hosted on a VM.

>The issue is assuming that "scaling to millions" is a design goal from the start.

I was responding to the OP who said you can scale to "millions of revenue" on SQLite. This part is true. You could. But it'd be easier to do it on a managed Postgres instance.


> You'd have less maintenance using RDS than you would with an SQLite hosted on a VM.

Considering that the maintenance I need to do for SQLite databases is practically zero...

> I was responding to the OP who said you can scale to "millions of revenue" on SQLite.

Millions of revenue doesn't necessarily need a high performance database. You could have millions on a page that deals with less than one request per second.


>Considering that the maintenance I need to do for SQLite databases is practically zero...

I also do zero maintenance on AWS RDS. Didn't have to set it up either. Just a few clicks, copy and paste the connection string, done. No VM to configure. No SSL to configure. No backup to configure.

Two clicks for a read only replica. A few more clicks and I can have multi-node failover. You'd want a failover if you have "millions in revenue" no? How do you plan to set that up with SQLite on a VM?


If you're at millions of revenue and haven't made the switch to a more robust setup I'd have questions.

The question isn't about what you do when you get there; the question is whether you get there, or whether you flounder about deciding between self-managed MySQL, Postgres, RDS, MongoDB, GCP, AWS or Azure.


>If you're at millions of revenue and haven't made the switch to a more robust setup I'd have questions.

Companies with more revenue than "millions" are still running AWS RDS.

https://aws.amazon.com/rds/customers/


I think you misread me or I didn't word it well enough.

I meant switch off SQLite by the time you're running a real business-critical application off your db.


That's a cop-out answer. You're making the claim that simply using the filesystem compared to going with Postgres somehow has less complexity and fewer (or lesser) second order effects, but you don't even indicate how come. So here's my question:

how come?


> simply using the filesystem compared to going with Postgres somehow has less complexity

Postgres is also using the file system, it’s just got some extra layers in front like networking and authentication.

People often refer to those “extra layers” as complexity.

> how come?

Why is it that more code is more stuff that can go wrong? That’s just how it is.


You are also missing the point and in the same way.

You are at A, you want to go to C. We are debating which B to pick between:

1. a filesystem + implementing library over that to cover the interface between A and C.

2. Postgres (which is the same as above - just not invented here)

Now, both of you are imagining a world where #2 somehow implies more complexity and also more second order effects than #1 would.

Nobody uses all of Postgres, but the cost of not using more than 1% of what's available is hardly larger than implementing that 1% yourself for most use cases. And please don't make the mistake of thinking that I'm trying to sell Postgres over any other database product - this is all to do with the worse-is-better idea that so many of you cultivate.


> We are debating which B to pick between:

No we are not.

Postgres isn't a webserver that can handle 1m http GET requests per second; it doesn't distribute data from 30 event collectors to a reporting database without configuration. *Other stuff needs to be written to satisfy the business.* That's life.

> both of you are imagining a world where #2 somehow implies more complexity and also more second order effects than #1 would.

Most studies into product resolution agree that there's zero difference (from a pass/fail perspective) between writing it yourself or trying to reuse third-party applications. Here's one from some quick ddging: https://standishgroup.com/sample_research_files/CHAOSReport2... and see the resolution of all software products surveyed (page 6).

That means that if you can, it is still cheaper to build new software and test it than to try to use Postgres and scale it, because once software is written the cost is (effectively) zero, whereas the cost to support Postgres is a body -- that I might only be able to use 1% of, and that I might need to hire two so that one can go on holiday sometime, with a real salary that we need to pay.

Now maybe you're working for a company whose software never gets finished. In that case, by all means use Postgres in your application, since you can share the churn on that part with every other person using Postgres in the world, but I'm going kayaking today.


Choosing a DB is not premature optimization; it eliminates a whole host of data-related problems and smooths development.


> Choosing a DB is not premature optimization

That's your opinion.

> it eliminates a whole host of data related problems and smoothens development

I don't find this to be true, but maybe you have different problems than I do.


I built a gallery web app so I could tag images, for my own use.

At first I was writing to a large JSON file. Each time I made a change, I would write the file and close it.

It worked pretty well. I now use SQLite + dataset and the port was trivial.

I guess there are better ways to do it than open(), but using dataset is quite simple.

Maybe creating folders and duplicating data works, but it seems more complex than using SQLite and dataset.


You aren't missing anything.

Cloud SQL postgres on GCP with <50 connections is like ~$8/month and has automated daily backups.

The differences in SQL syntax between SQLite and Postgres, namely around date math (last_ping_at < NOW() - INTERVAL '10 minutes'), make SQLite a non-starter imho... you're going to end up having to rewrite a lot of sql queries.
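To illustrate that difference: where Postgres writes `NOW() - INTERVAL '10 minutes'`, SQLite spells it with `datetime()` modifiers. A small sketch (the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (name TEXT, last_ping_at TEXT)")
conn.execute("INSERT INTO hosts VALUES ('stale-host', datetime('now', '-1 hour'))")
conn.execute("INSERT INTO hosts VALUES ('fresh-host', datetime('now'))")

# Postgres: WHERE last_ping_at < NOW() - INTERVAL '10 minutes'
# SQLite equivalent (timestamps compare lexicographically in this format):
stale = conn.execute(
    "SELECT name FROM hosts WHERE last_ping_at < datetime('now', '-10 minutes')"
).fetchall()
print(stale)  # [('stale-host',)]
```

Each such query has to be rewritten when porting between the two, which is the pain point being described.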


> On top of that every dev knows some SQL.

Where "some" may include "none" ;)


> What am I missing?

only that hn is full, FULL of trolls who give the absolute worst advice possible. it's a sport maybe or it's entirely innocent and related to the Dunning Kruger effect but nothing brings it out more than database threads.


You can get a managed Postgres instance for $15/month these days. Likely cheaper, more secure, faster, better than an SQLite db hosted manually on a VM.

>Because it requires no setup and has been used to scale typcial Web2 applications to millions in revenue on a single cheap VM.

Plenty of setup. How would you secure your VM? SSL configuration? SSL expiration management? Security? How would you scale your VM to millions without downtime? Backups? Restoring backups?

All these problems are solved with managed database hosting.


I recently hopped on the SQLite train myself - I think it's a great default. If your business can possibly use SQLite, you get so much performance for free - performance that translates to a better user experience without the need to build a complicated caching layer.

The argument boils down to how the relative performance costs of databases have changed in the 30 years since MySQL/Postgres were designed - and the surprising observation that a read from a fast SSD is usually faster than a network call to memcached.

[1] https://fly.io/blog/all-in-on-sqlite-litestream/ [2] https://www.usenix.org/publications/loginonline/jurassic-clo...


my brother in Christ, SQLite is not an RDBMS. it runs on the same machine as your application. the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with, but they do not affect SQLite.


> it runs on the same machine as your application.

Better, it runs in the same process.


> the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with

You also aren’t getting that for $15/mo, either.


https://www.digitalocean.com/pricing/managed-databases

$15/month for Postgres managed hosting. I would consider DO as competent professionals.


And are they going to deal with that scary stuff for me for $15/mo? Or are they going to turn it off and on and tell me I need a bigger boat if it’s slow?


What scary stuff are you dealing with?


I love SQLite and use it whenever I can, but if you’re building the next Reddit, you obviously can’t live with the lack of concurrent writes. HN is fine as the write volume is really low, just a couple posts per minute, plus maybe ten times as many votes.


> couple posts per minute

There are hundreds of posts that people are reading at every moment. I don't think it's that low.


You don't have to guess. Just check https://news.ycombinator.com/new and https://news.ycombinator.com/newcomments for everything that's posted here.

Or take a quantitative approach. HN item IDs are sequential. Using the API:

  $ curl 'https://hacker-news.firebaseio.com/v0/maxitem.json'
  32409711
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/32409711.json" | jq .time
  1660125598
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/$(( 32409711 - 14400 )).json" | jq .time
  1660031940
  
so, 14400 items in ~26 hours, fewer than 10 items per minute on average. At the peak you'll see somewhat more than that.


What about one sqlite db per subreddit?


Where would you store the user data? Duplicate it across millions of SQLite databases?


Using files sounds insane. 100% chance you end up with a nasty data race.


Why SQLite and not Postgres? Every SQLite tutorial or documentation or anything at all that I've seen says not to use this tool for a large database.


https://remoteok.com and https://nomadlist.com both use SQLite as the database. 100+ million requests a month and there's been no problems.


For what definition of "large"? SQLite surely has no problem holding gigabytes of data or millions of rows. In many cases it's much faster than most other kinds of databases. So, size is not the problem. Throughput with one active process on the same VM is not the problem. Having multiple processes on one VM is in most cases no problem either.

But if you spread your compute across many machines that access a single database, it can get iffy. If you need the database accessible over the network, there are server extensions which use SQLite as the storage format, but then you're dealing with a server and could probably use any other database without much difference in deployment & maintenance.


This all sounds good until you consider high-availability, which IMO is absolutely essential for any serious business. How do you handle fail-overs when that cheap VM goes down? How do you handle regional replication?

You could cobble something together with rsync, etc., but then you have added a bunch of complexity and built a custom, unproven system. Another option is to use one of the SQLite extensions popping up recently, like Dqlite, but again, this approach is relatively unproven for basing your entire business on.

Or you could simply use an off the shelf DBMS like Postgres or even MySQL. They already solve all of these problems and are as battle-tested as can be.


> high-availability, which IMO is absolutely essential for any serious business

Depends very much on the business. You can have downtime-free deploys on a single node, and as long as you've set up a system to automatically replace a faulty node with a new one (which typically takes < 10min) then a lot of businesses can live with that just fine. It's not like that cheap VM goes down every day, but just in case you can usually schedule a new instance every week or so to reduce the chance of that happening.

> How do you handle regional replication?

For backup purposes, you'd use litestream. Very easy to use with something like systemd or as a docker sidecar.
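A minimal Litestream config for that setup might look something like this (the paths and bucket name are made up):

```yaml
# /etc/litestream.yml
dbs:
  - path: /var/lib/app/app.db
    replicas:
      - url: s3://my-backup-bucket/app-db
```

Run `litestream replicate` under systemd (or as a sidecar) and it continuously streams WAL changes to the replica, from which you can restore a recent copy of the database.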

For performance purposes, if you do need that performance you'd obviously use something else. Depending on the type of service you have, though, you can get very far with a CDN in front of your server.

> Or you could simply use an off the shelf DBMS like Postgres or even MySQL.

If you need it, sure.


TBH, I've used Hypersonic SQL before. People thought it was crazy, but there was no concurrency, and backups were just copying a couple of files. Fixing them if they failed was easy too.

People get too caught up in assumptions without knowing the use case. There are a million use cases where a tool like sqlite would be bad, but also a million where its likely the easiest and best tool.

Wisdom is knowing the difference.


> These days, for a new Web2 type startup, I would use SQLite.

There's a whole page in the docs explaining why this might be a bad idea for a write-heavy site:

https://www.sqlite.org/howtocorrupt.html

Section 2 is especially interesting. SQLite locks the whole database file - usually one file per database - and that is a major bottleneck.
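For context, the whole-file locking can be relaxed somewhat with WAL mode, which lets readers proceed concurrently with the single writer; writes themselves are still serialized, one at a time per database file, so the write-heavy concern stands. A minimal sketch (file name invented):

```python
import sqlite3

conn = sqlite3.connect("app-wal.db")
# WAL mode: readers no longer block the writer and vice versa.
# Writes remain serialized - only one writer per database file.
conn.execute("PRAGMA journal_mode=WAL")
```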

> Another option worth taking a look at is to use no DB at all.

We tried that in a project I'm currently in as a means of shipping an MVP as soon as possible.

Major headache in terms of performance down the road. We still serve some effectively static stuff as files, but they're cached in memory, otherwise they would slow the whole thing down by an order of magnitude.


If you're doing .Net Core you could also take a look at LiteDB. Use SQL queries in a local (optionally encrypted) file using a single embedded nuget package. Also creates schemas on the fly when you first persist collection types as per traditional NoSQL systems.

I'm using it to create an embeddable out of the box passwordless auth system that has no external dependencies. Works great.

Admittedly I've not tried it under extreme load yet, but I'm happy to migrate it onwards if something I create goes massive.

(Edit: to be clear it is NoSQL, but you can query it using almost standard SQL which is what I do).


Sure and I would recommend they use C to code the backend, they should not use a CDN either to save some bucks. And on the front-end there is no sense to use javascript as everything can be done in the backend


Exactly! No separate machine required just for the db. You can also back up the SQLite db asynchronously.


Completely disagree with the suggestion to just write files. The file system APIs are incredibly easy to misuse. A database with good transaction semantics and a schema is worth so much.



