
These days, for a new Web2 type startup, I would use SQLite.

Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.

Another option worth taking a look at is to use no DB at all. Just writing to files is often the fastest way to get going. The filesystem offers so much functionality out of the box. And the ecosystem of tools is marvelous. As far as I know, HN uses no DB and just writes all data to plain text files.

It would actually be pretty interesting to have a look at how HN stores the data:

https://news.ycombinator.com/item?id=32408322



I sometimes see recommendations like this and I just don’t get it. It’s not like setting up a PostgreSQL database is hard at all? And using files? That just sounds like a horrible developer experience and problems waiting to happen. Databases give you so much for free: ACID transactions, an easy DSL for queries, structure to your data, etc. On top of that every dev knows some SQL.

What am I missing?


So I am also "really people, use PostgreSQL", because you have way fewer tricks you have to play to get it working compared to SQLite, in serious envs. However, some challenges with Postgres, especially if you have less strict data integrity requirements (reddit-y stuff, for example):

- Postgres backups are not trivial. They aren't hard, but well SQLite is just that one file

- Postgres isn't really built for many-cardinal-DB setups (talking on order of 100s of DBs). What does this mean in practice? If you are setting up a multi-tenant system, you're going to quickly realize that you're paying a decent cost because your data is laid out by insertion order. Combine this with MVCC meaning that no amount of indices will give you a nice COUNT, and suddenly your 1% largest tenants will cause perf issues across the board.

SQLite, meanwhile, one file per customer is not a crazy idea. You'll have to do tricks to "map/reduce" for cross-tenant stuff, of course. But your sharding story is much nicer.

- PSQL upgrades are non-trivial if you don't want downtime. There's logical upgrades, but you gotta be real fucking confident (did you remember to copy over your counters? No? Enjoy your downtime on switchover).

That being said, I think a lot of people see SQLite as this magical thing because of people like fly posting about it, without really grokking that the SQLite stuff that gets posted is either "you actually only have one writer" or "you will report successful writes that won't successfully commit later on". The fundamentals of databases don't stop existing with a bunch of SQLite middleware!

But SQLite at least matches the conceptual model of "I run a program on a file", which is much easier to grok than the client-server based stuff in general. But PSQL has way more straightforward conflict management in general
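The one-file-per-customer idea mentioned above, with a manual "map/reduce" across tenant files, might be sketched roughly like this (the `orders` table and the `tenants/` directory layout are invented for illustration):

```python
import glob
import sqlite3

def total_orders(db_dir="tenants"):
    """Sum a per-tenant COUNT across one SQLite file per customer.

    The "map" step runs the same query against each tenant database;
    the "reduce" step just adds the results up in the application.
    """
    total = 0
    for path in sorted(glob.glob(f"{db_dir}/*.db")):
        conn = sqlite3.connect(path)
        (n,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        conn.close()
        total += n
    return total
```

The upside is that each tenant's data stays small and isolated; the downside, as noted, is that any cross-tenant query has to be stitched together by hand like this.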


SQLite backups aren't quite that easy.

Grabbing a copy of the file won't necessarily work: you need an atomic snapshot. You can create those using the backup API or by using VACUUM INTO, but in both cases you'll need enough spare disk space to create a fresh copy of your database.

I'm having great results from Litestream streaming to S3, so that's definitely a good option here.
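For instance, with Python's built-in sqlite3 module, an atomic snapshot via the online backup API might look like this (file names are invented):

```python
import sqlite3

# Source database with some data.
src = sqlite3.connect("app.db")
src.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, title TEXT)")
src.execute("INSERT INTO posts (title) VALUES ('hello')")
src.commit()

# Atomic snapshot via SQLite's online backup API: pages are copied
# incrementally, so other connections keep working while the copy runs.
dst = sqlite3.connect("app-snapshot.db")
src.backup(dst)
dst.close()
src.close()
```

On SQLite 3.27+ you could instead run `VACUUM INTO 'app-snapshot.db'`, which also produces a consistent copy (and compacts it in the process).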


That is a very good point! Obviously a lot of stuff is easier if you just allow yourself downtime, but at one point the file gets big enough to where this matters.


Just use zfs snapshot to backup. Let go of the old ways of dump commands.


"SQLite is just that one file"

How do you get enough time to back up that file when it's in use? Lock the file until you've copied it? zfs?


They actually address this problem.

https://www.sqlite.org/backup.html

> The Online Backup API was created to address these concerns. The online backup API allows the contents of one database to be copied into another database file, replacing any original contents of the target database. The copy operation may be done incrementally, in which case the source database does not need to be locked for the duration of the copy, only for the brief periods of time when it is actually being read from. This allows other database users to continue without excessive delays while a backup of an online database is made.


From the user perspective, then, a SQLite backup doesn't seem materially less complicated than a postgres backup. I understood GP's comment to mean that it's easier to back up SQLite because it's just a file [which you can just copy].


It is just a file that you can copy when nobody is modifying data. In most simple cases you can just shut down the webserver for a minute to do database maintenance. Things you already know how to do. You don't have to remember any specific database commands for that. So the real difference is in the entry barrier.


I'm quite confident that if you can snapshot the vm/filesystem (eg VMware snapshot, or zfs snapshot) - you will have a working postgresql backup too...


BTRFS or ZFS sounds just fine to me; you could even create a small virtual disk for it on a host FS (less integrity, and one ought to snapshot the whole system anyway, but hypothetically you could do this just for atomic snapshots, as long as the rest of the system is cattle and those snapshots are properly backed up / replicated)


BTRFS recommends disabling CoW for DB files as it destroys performance.


I recommend that all developers at least once try to write a web app that uses plain files for data storage. I’ve done it, and I very quickly realized why databases exist, and learned not to take them for granted.


Exactly. I'd written my now popular web app (most popular Turkish social platform to date) as a Delphi CGI using text files as data store in 99 because I wanted to get it running ASAP, and I thought BDE would be quite cumbersome to get running on a remote hosting service. (Its original source that uses text files is at https://github.com/ssg/sozluk-cgi)

I immediately started to have problems as expected, and later migrated to an Access DB. It didn't support concurrent writes, but it was an improvement beyond comprehension. Even "orders of magnitude" doesn't cut it because you get many more luxuries than performance like data-type consistency, ACID compliance, relational integrity, SQL and whatnot.


What made my implementation disastrous turned out to be that there was no enforcement of string encodings, and users started to see all kinds of garbled text.


One of my best decisions was to start out ASCII-only. It helped me manage that mess longer. :)


> write a web app that uses plain files for data storage ... very quickly realized why databases exist, and not to take them for granted

You haven't lived until you've built an in-memory database with background threads persisting to JSON files. Oh, the nightmares.


> What am I missing?

Different requirements, expertise, and priorities. I don't like generic advice too much because there are so many different situations and priorities, but I've used files, SQLite and PostgreSQL according to the priorities:

- One time we needed to store some internal data from an application to compare certain things between batch launches. Yes, we could have translated from the application code to a database, but it was far easier to just dump the structure to a file in JSON, with the file uniquely named for each batch type. We didn't really need transactions, or ACID, or anything like that, and it worked well enough and, importantly, fast enough.

- Another time we had a library that provided some primitives for management of an application, and needed a database to manage instance information. We went for SQLite there, as it was far easier to set up, debug and develop for. Also, far easier to deploy, because for SQLite it's just "create a file when installing this library", while for PostgreSQL it would be far more complicated.

- Another situation, we needed a database to store flow data at high speed, and which needed to be queried by several monitoring/dashboarding tools. There the choice was PostgreSQL, because anything else wouldn't scale.
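The first of these patterns, dumping batch state to a uniquely named JSON file, might look something like this (the function names and directory layout are made up for illustration):

```python
import json
import time
from pathlib import Path

def dump_batch_state(batch_type, state, out_dir="batch_state"):
    """Write internal state to a JSON file named per batch type and run time.

    No transactions, no ACID: each run just produces its own file,
    which is enough when all you want is to compare runs afterwards.
    """
    Path(out_dir).mkdir(exist_ok=True)
    path = Path(out_dir) / f"{batch_type}-{int(time.time())}.json"
    path.write_text(json.dumps(state))
    return path

def load_batch_state(path):
    """Read back a previously dumped batch state."""
    return json.loads(Path(path).read_text())
```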

In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.


Thanks for the reply. All the situations that you describe sound very reasonable. I definitely wasn't trying to say "use PostgreSQL always", I too have used files and SQLite in various situations. For instance, your second example is IMO a canonical example of where SQLite is the right tool and where you make use of its strengths. My comment was more directed at the situations where PostgreSQL seems to be the right tool for the job, but where people still recommend other things.


>In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.

We're responding to this below. The requirement is a Web2 app with "millions in revenue".

>Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.


> The requirement is a Web2 app with "millions in revenue".

Millions in revenue doesn't really say anything about the performance required.


>Millions in revenue doesn't really say anything about the performance required.

The assumption is that this is a Web2 company which usually means user-generated content. I assumed that there'd be high reads, decent amount of writes.

We're not trying to write a detailed spec here. Just going off a few details and making assumptions.


It's "kick-the-can-down-the-road" engineering. If the server can't hold the database in RAM anymore, they buy more RAM. If the disks are too slow, they buy faster/more disks. If the one server goes down, the site is just down for an hour or two (or a day or two) while they build a new server. It works until it doesn't, and most people are fine with things eventually not working.

This has only been practical within the last 6 years or so. Big servers were just too expensive. We started using cheap commodity low-resource PCs, which are cheaper to buy/scale/replace, but limits what your apps can do. Bare metal went down hard and there wasn't an API to quickly stand up a replacement. NAS/SAN was expensive. Network bandwidth was limited.

The cloud has made everything cheaper and easier and faster and bigger, changing what designs are feasible. You can spin up a new instance with a 6GB database in memory in a virtual machine with network-attached storage and a gigabit pipe for like $20/month. That's crazy.


It's easy to install k8s and curl a helm chart too; doesn't mean you should.

What you're missing is complexity and second order effects.

Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

It's not for everyone. People have varying levels of experience and needs. I'll almost always set up a MySQL db, but I'd be lying if I said it never resulted in navel gazing and premature db/application design I didn't need.


>Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

Like what? What could be easier than going to AWS, click a few things to spin up a Postgres instance, click a few more things to scale automatically without downtime, create read replicas, have automatic backups, one-click restore?

I feel like it's the opposite. Trying to put SQLite on a VM via Kubernetes will likely have a lot of secondary decisions that will make "scaling to millions" far harder and more expensive.


> Like what? What could be easier than going to AWS, click a few things to spin up a Postgres instance, click a few more things to scale automatically without downtime, create read replicas, have automatic backups, one-click restore?

Not do that and not have to maintain that? Just a regular AWS instance with SQLite will be far easier to set up, with regular filesystem backups which are easier to manage and restore.

> Trying to put SQLite on a VM via Kubernetes will likely have a lot of secondary decisions that will make "scaling to millions" far harder and more expensive.

The issue is assuming that "scaling to millions" is a design goal from the start. For a lot of projects, the best-case-scenario does not require it. For another big set of them, they will never get to it. For the few that do, it's possible that the cost of maintaining the more complex solution while it isn't needed plus the cost of adapting over time (because, let's be honest, it's almost impossible to get the scaling solution right from the start) is more than the cost of just implementing the complex solution when it's actually needed.


>Not do that and not have to maintain that? Just a regular AWS instance with SQLite will be far easier to set up, with regular filesystem backups which are easier to manage and restore.

You'd have less maintenance using RDS than you would with an SQLite hosted on a VM.

>The issue is assuming that "scaling to millions" is a design goal from the start.

I was responding to the OP who said you can scale to "millions of revenue" on SQLite. This part is true. You could. But it'd be easier to do it on a managed Postgres instance.


> You'd have less maintenance using RDS than you would with an SQLite hosted on a VM.

Considering that the maintenance I need to do for SQLite databases is practically zero...

> I was responding to the OP who said you can scale to "millions of revenue" on SQLite.

Millions of revenue doesn't necessarily need a high performance database. You could have millions on a page that deals with less than one request per second.


>Considering that the maintenance I need to do for SQLite databases is practically zero...

I also do zero maintenance on AWS RDS. Didn't have to set it up either. Just a few clicks, copy and paste the connection string, done. No VM to configure. No SSL to configure. No backup to configure.

Two clicks for a read only replica. A few more clicks and I can have multi-node failover. You'd want a failover if you have "millions in revenue" no? How do you plan to set that up with SQLite on a VM?


If you're at millions of revenue and haven't made the switch to a more robust setup I'd have questions.

The question isn't about what you do when you get there; the question is whether you get there, or whether you flounder about deciding between self-managed MySQL, Postgres, RDS, MongoDB, GCP, AWS or Azure.


>If you're at millions of revenue and haven't made the switch to a more robust setup I'd have questions.

Companies with more revenue than "millions" are still running AWS RDS.

https://aws.amazon.com/rds/customers/


I think you misread me or I didn't word it well enough.

I meant switch off SQLite by the time you're running a real business-critical application off your db.


That's a cop-out answer. You're making the claim that simply using the filesystem compared to going with Postgres somehow has less complexity and fewer (or lesser) second order effects, but you don't even indicate how come. So here's my question:

how come?


> simply using the filesystem compared to going with Postgres somehow has less complexity

Postgres is also using the file system, it’s just got some extra layers in front like networking and authentication.

People often refer to those “extra layers” as complexity.

> how come?

Why is it that more code is more stuff that can go wrong? That’s just how it is.


You are also missing the point and in the same way.

You are at A, you want to go to C. We are debating which B to pick between:

1. a filesystem + implementing library over that to cover the interface between A and C.

2. Postgres (which is the same as above - just not invented here)

Now, both of you are imagining a world where #2 somehow implies more complexity and also more second order effects than #1 would.

Nobody uses all of Postgres, but the cost of not using more than 1% of what's available is hardly larger than implementing that 1% yourself for most use cases. And please don't make the mistake of thinking that I'm trying to sell Postgres over any other database product - this is all to do with the worse-is-better idea that so many of you cultivate.


> We are debating which B to pick between:

No we are not.

Postgres isn't a webserver that can handle 1m http GET requests per second; it doesn't distribute data from 30 event collectors to a reporting database without configuration. *Other stuff needs to be written to satisfy the business.* That's life.

> both of you are imagining a world where #2 somehow implies more complexity and also more second order effects than #1 would.

Most studies into product resolution agree that there's zero difference (from a pass/fail perspective) between writing it yourself or trying to reuse third-party applications. Here's one from some quick ddging: https://standishgroup.com/sample_research_files/CHAOSReport2... and see the resolution of all software products surveyed (page 6).

That means that if you can, it is still cheaper to build new software and test it than to try to use Postgres and scale it, because once software is written the cost is (effectively) zero, whereas the cost to support Postgres is a body -- that I might only be able to use 1% of, and that I might need to hire two so that one can go on holiday sometime, with a real salary that we need to pay.

Now maybe you're working for a company whose software never gets finished. In that case, by all means use Postgres in your application, since you can share the churn on that part with every other person using Postgres in the world, but I'm going kayaking today.


Choosing a DB is not premature optimization; it eliminates a whole host of data-related problems and smooths development.


> Choosing a DB is not premature optimization

That's your opinion.

> it eliminates a whole host of data related problems and smoothens development

I don't find this to be true, but maybe you have different problems than I do.


I built a gallery web app so I could tag images, for my own use.

At first I was writing to a large JSON file. Each time I made a change, I would write the file and close it.

It worked pretty well. I now use SQLite + dataset and the port was trivial.

I guess there are better ways to do it than open(), but using dataset is quite simple.

Maybe creating folders and duplicating data works, but it seems more complex than using SQLite and dataset.


You aren't missing anything.

Cloud SQL postgres on GCP with <50 connections is like ~$8/month and has automated daily backups.

The differences in SQL syntax between SQLite and Postgres, namely around date math (last_ping_at < NOW() - INTERVAL '10 minutes'), make SQLite a non-starter imho... you're going to end up having to rewrite a lot of sql queries.
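To illustrate that difference: where Postgres writes `NOW() - INTERVAL '10 minutes'`, SQLite spells it with `datetime()` modifiers. A small sketch (the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (name TEXT, last_ping_at TEXT)")
conn.execute("INSERT INTO hosts VALUES ('stale-host', datetime('now', '-1 hour'))")
conn.execute("INSERT INTO hosts VALUES ('fresh-host', datetime('now'))")

# Postgres: WHERE last_ping_at < NOW() - INTERVAL '10 minutes'
# SQLite equivalent (timestamps compare lexicographically in this format):
stale = conn.execute(
    "SELECT name FROM hosts WHERE last_ping_at < datetime('now', '-10 minutes')"
).fetchall()
print(stale)  # [('stale-host',)]
```

Each such query has to be rewritten when porting between the two, which is the pain point being described.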


> On top of that every dev knows some SQL.

Where "some" may include "none" ;)


> What am I missing?

only that hn is full, FULL of trolls who give the absolute worst advice possible. it's a sport maybe or it's entirely innocent and related to the Dunning Kruger effect but nothing brings it out more than database threads.


You can get a managed Postgres instance for $15/month these days. Likely cheaper, more secure, faster, better than an SQLite db hosted manually on a VM.

>Because it requires no setup and has been used to scale typcial Web2 applications to millions in revenue on a single cheap VM.

Plenty of setup. How would you secure your VM? SSL configuration? SSL expiration management? Security? How would you scale your VM to millions without downtime? Backups? Restoring backups?

All these problems are solved with managed database hosting.


I recently hopped on the SQLite train myself - I think it's a great default. If your business can possibly use SQLite, you get so much performance for free - performance that translates to a better user experience without the need to build a complicated caching layer.

The argument boils down to how the relative performance costs of databases have changed in the 30 years since MySQL/Postgres were designed - and the surprising observation that a read from a fast SSD is usually faster than a network call to memcached.

[1] https://fly.io/blog/all-in-on-sqlite-litestream/ [2] https://www.usenix.org/publications/loginonline/jurassic-clo...


my brother in Christ, SQLite is not an RDBMS. it runs on the same machine as your application. the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with, but they do not affect SQLite.


> it runs on the same machine as your application.

Better, it runs in the same process.


> the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with

You also aren’t getting that for $15/mo, either.


https://www.digitalocean.com/pricing/managed-databases

$15/month for Postgres managed hosting. I would consider DO as competent professionals.


And are they going to deal with that scary stuff for me for $15/mo? Or are they going to turn it off and on and tell me I need a bigger boat if it’s slow?


What scary stuff are you dealing with?


I love SQLite and use it whenever I can, but if you’re building the next Reddit, you obviously can’t live with the lack of concurrent writes. HN is fine as the write volume is really low, just a couple posts per minute, plus maybe ten times as many votes.


> couple posts per minute

There are hundreds of posts that people are reading at every moment. I don't think it's that low.


You don't have to guess. Just check https://news.ycombinator.com/new and https://news.ycombinator.com/newcomments for everything that's posted here.

Or take a quantitative approach. HN item IDs are sequential. Using the API:

  $ curl 'https://hacker-news.firebaseio.com/v0/maxitem.json'
  32409711
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/32409711.json" | jq .time
  1660125598
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/$(( 32409711 - 14400 )).json" | jq .time
  1660031940
  
so, 14400 items in ~26 hours, fewer than 10 items per minute on average. At the peak you'll see somewhat more than that.


What about one sqlite db per subreddit?


Where would you store the user data? Duplicate it across millions of SQLite databases?


Using files sounds insane. 100% chance you end up with a nasty data race.


Why SQLite and not Postgres? Every SQLite tutorial or documentation or anything at all that I've seen says not to use this tool for a large database.


https://remoteok.com and https://nomadlist.com both use SQLite as the database. 100+ million requests a month and there's been no problems.


For what definition of "large"? SQLite surely has no problem holding gigabytes of data or millions of rows. In many cases it's much faster than most other kinds of databases. So, size is not the problem. Throughput with one active process on the same VM is not the problem. Having multiple processes on one VM is in most cases no problem either.

But if you spread your compute across many machines that access a single database, it can get iffy. If you need the database accessible over the network, there are server extensions which use SQLite as the storage format, but then you're dealing with a server and could probably use any other database without much difference in deployment & maintenance.


This all sounds good until you consider high-availability, which IMO is absolutely essential for any serious business. How do you handle fail-overs when that cheap VM goes down? How do you handle regional replication?

You could cobble something together with rsync, etc., but then you have added a bunch of complexity and built a custom, unproven system. Another option is to use one of the SQLite extensions popping up recently, like Dqlite, but again, this approach is relatively unproven for basing your entire business on.

Or you could simply use an off the shelf DBMS like Postgres or even MySQL. They already solve all of these problems and are as battle-tested as can be.


> high-availability, which IMO is absolutely essential for any serious business

Depends very much on the business. You can have downtime-free deploys on a single node, and as long as you've set up a system to automatically replace a faulty node with a new one (which typically takes < 10min) then a lot of businesses can live with that just fine. It's not like that cheap VM goes down every day, but just in case you can usually schedule a new instance every week or so to reduce the chance of that happening.

> How do you handle regional replication?

For backup purposes, you'd use litestream. Very easy to use with something like systemd or as a docker sidecar.
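A minimal Litestream config for that setup might look something like this (the paths and bucket name are made up):

```yaml
# /etc/litestream.yml
dbs:
  - path: /var/lib/app/app.db
    replicas:
      - url: s3://my-backup-bucket/app-db
```

Run `litestream replicate` under systemd (or as a sidecar) and it continuously streams WAL changes to the replica, from which you can restore a recent copy of the database.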

For performance purposes, if you do need that performance you'd obviously use something else. Depending on the type of service you have, though, you can get very far with a CDN in front of your server.

> Or you could simply use an off the shelf DBMS like Postgres or even MySQL.

If you need it, sure.


TBH, I've used Hypersonic SQL before. People thought it was crazy, but there was no concurrency, and backups were just copying a couple of files. Fixing them if they failed was easy too.

People get too caught up in assumptions without knowing the use case. There are a million use cases where a tool like sqlite would be bad, but also a million where its likely the easiest and best tool.

Wisdom is knowing the difference.


> These days, for a new Web2 type startup, I would use SQLite.

There's a whole page in the docs explaining why this might be a bad idea for a write-heavy site:

https://www.sqlite.org/howtocorrupt.html

Section 2 is especially interesting. SQLite locks the whole database file - usually one file per database - and that is a major bottleneck.
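For context, the whole-file locking can be relaxed somewhat with WAL mode, which lets readers proceed concurrently with the single writer; writes themselves are still serialized, one at a time per database file, so the write-heavy concern stands. A minimal sketch (file name invented):

```python
import sqlite3

conn = sqlite3.connect("app-wal.db")
# WAL mode: readers no longer block the writer and vice versa.
# Writes remain serialized - only one writer per database file.
conn.execute("PRAGMA journal_mode=WAL")
```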

> Another option worth taking a look at is to use no DB at all.

We tried that in a project I'm currently in as a means of shipping an MVP as soon as possible.

Major headache in terms of performance down the road. We still serve some effectively static stuff as files, but they're cached in memory, otherwise they would slow the whole thing down by an order of magnitude.


If you're doing .Net Core you could also take a look at LiteDB. Use SQL queries in a local (optionally encrypted) file using a single embedded nuget package. Also creates schemas on the fly when you first persist collection types as per traditional NoSQL systems.

I'm using it to create an embeddable out of the box passwordless auth system that has no external dependencies. Works great.

Admittedly I've not tried it under extreme load yet, but I'm happy to migrate it onwards if something I create goes massive.

(Edit: to be clear it is NoSQL, but you can query it using almost standard SQL which is what I do).


Sure and I would recommend they use C to code the backend, they should not use a CDN either to save some bucks. And on the front-end there is no sense to use javascript as everything can be done in the backend


Exactly! No separate machine required just for the db. You can also back up the SQLite db asynchronously.


Completely disagree with the suggestion to just write files. The file system APIs are incredibly easy to misuse. A database with good transaction semantics and a schema is worth so much.



