Well, here are some good reasons: your OS doesn't want you to store tons of small pieces of data in many different files. If you're storing a 100-byte record per file, you're doing it wrong because:
1. Your FS has a minimum block size, often around 4k, so you'll waste disk space with small files.
2. Your OS mmaps/FS-caches pages at its page size, ~4-8k, so you'll waste RAM with small files. (MongoDB also makes this mistake.)
3. Reading/writing files directly requires syscalls; a DB can cache a lot in memory and minimize syscalls.
4. Atomic updates are hard with plain files; a DB gives you real transactions.
5. Distributed operation is hard. Take a look at databases like Couchbase, Elasticsearch or S3, where data is replicated across machines and datacenters in a highly reliable way.
6. Consistent backups of a live app are hard. Using Postgres or another MVCC DB? Just perform a dump; it will be a true snapshot.
7. Need to coordinate writes across multiple app instances (e.g. every web app ever)? You'll have to figure out your own record-locking system. Need a high-performance counter atomically incrementable from multiple clients? Easy in SQL (UPDATE foo SET counter=counter+1 WHERE id=1); try doing that in a simple way with plain-text files.
The list just goes on and on. The kicker is that even if you understand these issues, it's hard enough building a non-buggy implementation of these basics. Oh, and lastly, even if you do build your own mediocre DB, future devs will have to learn your crazy system rather than leveraging all the existing SQL knowledge they currently have.
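For a sense of what the counter from point 7 costs with plain files, here is a rough sketch in Python using advisory flock on POSIX (the file name is hypothetical; a real system would also need crash recovery):

```python
import fcntl
import os

def increment_counter(path):
    """Roughly what UPDATE foo SET counter=counter+1 costs with a flat file:
    open, lock, read, parse, rewrite, fsync, unlock. Every client must
    cooperate, because flock is only advisory."""
    if not os.path.exists(path):
        # Note: this create step is itself racy between two fresh clients.
        with open(path, "w") as f:
            f.write("0")
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)     # advisory exclusive lock
        value = int(f.read() or "0")
        f.seek(0)
        f.truncate()
        f.write(str(value + 1))
        f.flush()
        os.fsync(f.fileno())              # try to make the write durable
        fcntl.flock(f, fcntl.LOCK_UN)
    return value + 1
```

Even this sketch ignores crash-mid-write corruption, NFS lock quirks, and the racy create, which is roughly the point being made.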
I noticed when a discussion like this comes up people separate themselves into two camps that mock each other's assumptions while both talking about the same thing. It would be nice to talk more concretely with benchmark numbers.
1. What about a database saves you disk space? Whatever it is, the filesystem can do the same thing. Perhaps you had a specific filesystem and OS in mind when you wrote this.
2. Your database MMAPs/caches pages with its custom page buffer too, and that wastes ram with small records.
3. Reading/writing data from a database requires syscalls too. An OS can cache a lot in memory and minimize disk writes. Arguably the database is duplicating the caching logic of the OS and slows things down.
4. Filesystems have atomic transactions too: you can use flock to lock files, and rename is atomic. It's a rather fitting hack for Hacker News, which "atomically" renames files to update them.
5, 6 and 7: You can get distribution, replication, fault-tolerance, availability, and snapshots for free out of a filesystem like GlusterFS too. (Also note SQL is crap; it's a bad serialization of a data structure.)
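The atomic-rename trick mentioned above can be sketched in a few lines of Python; the function name is mine, and the fsync of the containing directory (needed for full durability on some filesystems) is only noted in a comment:

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Replace the contents of `path` all-or-nothing via the rename trick:
    write a temp file in the same directory, fsync it, then rename over
    the original. Readers see either the old file or the new one, never
    a half-written mix."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)   # same FS, so rename is atomic
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # without this, a crash can leave a truncated file
        os.replace(tmp, path)      # atomic on POSIX
        # Strictly, the directory entry should be fsynced too for durability.
    except BaseException:
        os.unlink(tmp)
        raise
```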
I wonder if the real reason we have better databases rather than better filesystems is that people didn't want to program for the kernel, which is where filesystems have typically run.
Getting people to grasp that you can store data on a filesystem is probably one of the most valuable things a new programming language could have going for it. It would be a staggering reduction in complexity if a language made data manipulation easy, perhaps to the dismay of programmers stripped of the amusing complexity of using this separate thing called a database.
Could you clarify point 2? There's of course per-row overhead in a DB, but many rows get put into the same page, so I'm not sure how that's remotely as wasteful as a hypothetical filesystem that uses a minimum of a page per file. Of course, that's not a requirement for FS behaviour; it's just that historically a lot of them have worked that way.
> Reading/writing data from a database requires syscalls too
Not for stuff that's in the DB's page cache.
> Arguably the database is duplicating the caching logic of the OS and slows things down.
That's a pretty biased assessment. A 'real' database like Postgres will use calls that bypass the OS page cache, so there's no duplication there. It also understands its usage patterns better, so it can generally do better than a general-purpose cache.
> and rename is atomic
Assuming your file system doesn't atomically give you a truncated file post-system-crash :-).
> (Also note SQL is crap; it's a bad serialization of a data structure.)
Agreed that SQL is not the best language, but is this an argument against set-based query languages, or just SQL specifically?
Databases make some things that are really hard less hard. Managing concurrent read/writes over sets of files (say) is currently difficult. Sure, your FS could have deadlock detection, MVCC and so on built into it, but that would be turning the file system into even more of a database system than it already is. Database systems are complicated because people often want to do complicated things.
A way to execute set-based queries is a major point you bring up here. It's also the most baffling. When tens of filesystems with sophisticated features completely omit one of the most common features database users rely on, presumably it's not because filesystem designers are incapable of implementing it but because of some more mundane reason.
> Not for stuff that's in the DB's page cache.
The database has to log what was written; it does use syscalls to write the write-ahead log. Agreed it can save read syscalls. Then again, one process sending a query to another process to read something generates even more syscalls than having everything (app + DB) in the same process.
You are right about 2. I'm playing devil's advocate here, jumping around assumptions. I could ask why you assume an OS will use a page per file; you could say Linux does this; I could ask why your OS is so heavy and point to Exokernel; you could say that's side-stepping the issue. I could propose a user-space filesystem with its own page cache; in the end we are talking about the same thing. In the end: the fewer layers, the better.
When I look at the kernel I see something like the Berlin wall. A barrier that was there since you were born and which you never asked for. A kernel being hard to hack and monolithic is bound to push developers away, but it won't stop developers from building what they want in the end.
"future devs will have to learn your crazy system rather than leveraging all the existing SQL knowledge they currently have."
A super important point that sometimes trumps everything else when considering any new technology, and one that people often forget. Some of the first servers I purchased were SGIs running Irix. Turns out the person who advised that didn't consider that while the Suns weren't "as good", there were so many more packages available for Suns, as well as sysadmins, as well as information (at that time) on the net.
What I've found over time is that unless there is a compelling reason to do otherwise you go with the ubiquitous solution to a problem.
Back in the day: or the solution that there was an O'Reilly book on, or a row of books on the bookshelf explaining it (when that was essentially the knowledge base).
I think your points are valid only for very different data scenarios.
1-2 scenario: many many small files.
3: continuously accessing your data
4: concurrent connections
5: distributed application
6: never-stopping application
7: concurrent writes to mutable data
You may need a DB if you want several of these at once. This might be true for many applications, but for many other applications it is equally untrue.
More often than you'd expect, using a DB makes understanding an application _harder_, because the data in the application tends to get structured as in the database: only ids, ints, strings and maybe dates; no graphs or algebraic data types such as sum types, optional types, etc.
That's a fair point; for many scenarios this stuff doesn't matter, like a video game. However, most apps fall into either the "serialize a bunch of crap to disk" category or the "multi-user data changes all the time" category. Nearly everything in the second category works best with a proper database.
There are plenty of databases with support for more complex types, and in any case, it's not like the filesystem understands those either, so if you want to store something to disk, you'll need some kind of translation/serialization anyway.
Nothing very specific; I was mostly thinking about graph databases, since you mentioned them. There are also domain-specific datatypes, such as those implemented by PostGIS.
That link talks about putting the entirety of very small files into their inodes. I don't see anything about combining the tails of multiple files in a single block.
That's true, but the OP talked about storing '100 byte records' in a file. For large files (several times the block size), wasting a block for the tail is usually negligible.
1) 100% yes
2) yup
3) Frequently read/written files will be stored in ram
4) This still affects DBs, as they are still a layer above the FS.
5) GPFS, GFS2, ceph, cleversafe, tahoe-lafs, all make this go away.
6) ZFS has snapshots. Snapshots are amazing. Granted, they are not as nice as a SQL dump.
7) If you're doing this with files, you're in trouble.
Having said that, your last point is 100% valid. Don't reinvent the wheel. SQL is good for most things. Even SQLite is super awesome for slightly busy websites (depending on how you use it).
The subtext of the largely upvoted answer is that databases > text files. I may be preaching to the choir BUT:
"1. You can query data in a database (ask it questions)."
If you use your file system as a "database" you can ask it questions too with file paths
"2. You can look up data from a database relatively rapidly."
Given that lines of text can match some problems quite well, a text file can be MUCH faster.
"3. You can relate data from two different tables together using JOINs."
Denormalization makes this unnecessary.
"4. You can create meaningful reports from data in a database."
I work as a data analyst and make meaningful reports from event logs all the time. To be fair, I do prefer them being in a DB usually. ;)
"5. Your data has a built-in structure to it."
Just because it has a structure doesn't mean an RDBMS is the right place for it to be codified. Also let's be clear the poster of this answer is clearly referring to an RDBMS.
"6. Information of a given type is always stored only once."
It's important to note, this only matters if your data is not immutable. There are many applications where this is the case, but also many where it is not.
"7. Databases are ACID."
Meh. This matters, but after the advent of the NoSQL era I now realize it matters much less.
"8. Databases are fault-tolerant."
There are tradeoffs made in this name as well. Sometimes you can do better by letting your app be fault tolerant of your data instead.
"9. Databases can handle very large data sets."
Files on disk can't? No, I think the valid point here is that a database can span multiple disks and make them appear cohesive. That's a very nice bonus. For 95% of the world (worst case), though, that's not a concern.
"10. Databases are concurrent; multiple users can use them at the same time without corrupting the data."
Again there are tradeoffs made in the name of this in an RDBMS (like locking). Reading immutable data scales across multiple users just fine.
"11. Databases scale well."
CDNs have shown that files can scale even better. Again immutability is a constraint but I think it's an interesting constraint to explore for many systems.
Just to be clear I'm not saying databases are usually the wrong choice. Not at all. Instead they're just all too often the wrong first choice. Start with files, iterate from there.
You provide a nice counter-argument in general, but let me nitpick a single concern.
After the advent of NoSQL, ACID matters not much less, but in fact a lot more. Startups lose data all around the world because their shiny new database doesn't support the basic principles of databases.
Okay, so ACID all the way, then! No concept or implementation can ever save your data from corruption or disk failure, which is why you need all the guarantees you can get in all layers.
"If you use your file system as a "database" you can ask it questions too with file paths"
You'll be much more limited in the kind of queries you can reasonably make. You'll have to decide on them ahead of time. This might be what you want, but it might not.
"Denormalization makes this unnecessary."
Denormalization means making your data match your application rather than the other way around. You can make stuff faster this way sometimes, but again, you're limiting the kinds of queries and access patterns you can make in the future.
"It's important to note, this only matters if your data is not immutable. There are many applications where this is the case, but also many where it is not."
Wouldn't a flat file which continually accrues immutable data technically be called a log? You can do analysis on flat log files, but it will get slow if your files are sufficiently large. Indexing can help here.
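One simple form of that indexing is an in-memory map from record id to byte offset, so lookups seek instead of scanning. A toy sketch in Python, assuming a hypothetical one-JSON-record-per-line log with an "id" field:

```python
import json

def append_record(log_path, record, index):
    """Append one immutable JSON record per line and remember its byte
    offset in an in-memory index keyed by the record's 'id' field."""
    with open(log_path, "ab") as f:
        offset = f.tell()                 # append mode: position is end-of-file
        f.write((json.dumps(record) + "\n").encode("utf-8"))
    index[record["id"]] = offset

def read_record(log_path, index, record_id):
    """Seek straight to a record instead of scanning the whole log."""
    with open(log_path, "rb") as f:
        f.seek(index[record_id])
        return json.loads(f.readline().decode("utf-8"))
```

A real system would persist the index too and rebuild it by scanning the log on startup; this is, of course, the first step down the road of writing your own database.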
"CDNs have shown that files can scale even better. Again immutability is a constraint but I think it's an interesting constraint to explore for many systems."
What? CDNs improve performance by placing the point of origin for data physically closer to the client, thus reducing lag. What does this have to do with the difference between files and databases? Do you mean that static or cached webpages serve faster and use fewer resources than ones generated from sql queries at the time of request? Duh, but that doesn't really contribute to the discussion.
"Duh, but that doesn't really contribute to the discussion."
The idea that storing often-requested data in something other than a database could be simpler than tuning the database was a revelation to me when I learned it. The overarching point is that a database is not a panacea for data storage.
I agree so much with this. I run a small service that looks like it would be a perfect use-case for a database. Even though all the data is beautifully normalized and instantly searchable, using a database directly would be impossible. Instead it starts as a series of flat text files. From these flat text files, the database is automatically generated. Three big gains here:
Version control on the text files works perfectly as expected.
Creating ad-hoc and one-off columns (while figuring out how to best normalize unfamiliar data) is trivial and painless.
Comments! It is so easy to write in-line notes. This alone is the killer feature of text files. Normalizing and curating data requires making a lot of choices and comments let me keep track of these.
Of course there are downsides. You must edit the text file to add data. (With some exceptions, such as updating live prices/inventory.) If you have equal numbers of writes and reads, this system is not for you. But if you have infrequent (daily) writes then it is no problem.
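The "text files are the source of truth, DB is generated" workflow described above can be sketched in a few lines. The record format here ('name|price' per line, '#' comments) is invented for illustration; the point is that comments live in the file and the SQLite table is disposable:

```python
import sqlite3

def build_db(text, conn):
    """Regenerate a throwaway SQLite table from a flat text file's contents.
    Assumed toy format: one 'name|price' record per line; '#' starts a
    comment; blank lines ignored. The text file stays the source of truth."""
    conn.execute("DROP TABLE IF EXISTS items")
    conn.execute("CREATE TABLE items (name TEXT, price REAL)")
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # comments survive in the file, not the DB
        if not line:
            continue
        name, price = line.split("|")
        conn.execute("INSERT INTO items VALUES (?, ?)", (name.strip(), float(price)))
    conn.commit()
```

Since the database is regenerated from scratch, ad-hoc columns and schema experiments cost nothing; version control and diffs operate on the text file.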
On #7: ACID matters just as much now as it ever did, but I think a lot of people who assumed ACID was always essential have realized it might not be. I wouldn't say an RDBMS is uniformly better than text files, but in some applications I wouldn't have it any other way.
When dealing with crucial data where robustness and correctness are paramount, an RDBMS may not be a bad fit. I cringe at the thought of an insurance claims system being done in NoSQL or JSON instead of a more robust system. Things are probably different in the startup world, but I'd rather take a huge hit to performance than a multi-million fine and audit for losing records from my DB.
Well here's one good reason: If your data cannot all fit in memory at once, you cannot simply serialize and deserialize it from disk. You will have to create some kind of indexed solution involving reading and writing to files piecemeal and by doing so you'll have created an impromptu database by default.
Other problems include losing data if your application crashes and isn't able to save its data, performance on very large datasets, lack of concurrent access, etc.
Generally, flat files are fine for low-availability applications with small datasets. As your application grows, however, you generally need to find more robust methods of persisting its data. Picking a persistence engine/database which matches your application will be essential at some point. It need not be an RDBMS, as 'NoSQL' databases and object databases may fit your application better.
One more thought: a lot of the fuss about 'NoSQL' vs RDBMSes comes from people trying to treat RDBMSes strictly as persistence engines. That is, they start with an application that runs well in memory and then realize that they need to deal with one or more of the issues I raised above. They pull out an RDBMS and are annoyed that they now have to map the data from the perfectly fine data structures they were already using into a normalized tabular structure.
The reason this happens is that RDBMSes were created to solve a different class of problems than simple persistence. They are for when you are starting out with data that needs to be housed and organized, as opposed to an application that simply needs to save its data.
No, you'll have created a paging system, and it'll probably fight with your OS's pager.
The disk paging system is but one of the components of a database, but what makes a database a database is the leverage that you get over your data, versus just a blob on disk.
Actually, what I find to be the core question in deciding whether to use a database is whether all data can be loaded into memory at program start. Loading all data at start is pretty simple and efficient. But that raises another problem: getting the changes from memory back to disk. Rewriting all the data to disk, let alone overwriting existing data in place, is out of the question, so you need a system which saves the changed data incrementally. Is there such a "database", which: a. loads all the data at start, plus the incremental changes, and b. writes the incremental changes to disk transactionally, without messing up the OS paging? I wonder if such a "persistent paging file" library is available.
As far as I know, no; all of them are lazy and load only what you query into memory. But if your data fits in memory, it'll stay there, and subsequent queries will be faster.
Also, you can take the quotation marks off "database": you have some quite generic requirements, and most databases implement all of them except the load-everything-at-startup one.
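The store asked about above can be sketched as a dict plus an append-only journal: replay the journal at startup, append each change as it happens. This is a toy (no checksums, no compaction, no real transactional guarantees), but it shows the shape of the idea:

```python
import json
import os

class JournaledDict:
    """Toy version of the load-everything-at-start store: all state lives
    in memory, every change is appended to a journal file, and startup
    replays the journal to rebuild the state. A real system would also
    checksum entries and periodically compact into a snapshot."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}
        if os.path.exists(journal_path):
            with open(journal_path, "r", encoding="utf-8") as f:
                for line in f:                   # replay incremental changes
                    key, value = json.loads(line)
                    if value is None:
                        self.data.pop(key, None)
                    else:
                        self.data[key] = value

    def set(self, key, value):
        with open(self.journal_path, "a", encoding="utf-8") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())                 # make the change durable
        self.data[key] = value
```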
There was a great short talk by (I think) Richard Hipp - or possibly Ted Ts'o? - about mistakes people make trying to 'just save your data to disk'. I can't find the one I remember, but this does cover a lot of the same material (in a less snappy way):
https://www.youtube.com/watch?v=TxXROT_nHqo
The meat, on the implementation of ACID in SQLite, is around 14:20 in. "More complexity than you want to think about" is the answer. You could do the same without a DB as such, of course.
Anyone remember the talk I mean? It would have been around the same time - 2009ish.
Well-structured relational databases are an awesome tool. I have built and used hundreds. But they are a bit of cargo-cult architecture. Deciding to use one means adding lots of code. It adds a lot of complexity around testing. All of it multiplies. It's another big artifact to maintain, and you may have to decide how you will scale it, replicate it, make it highly available, etc. NoSQL was the first crack in the seemingly unassailable assumption that you need to use a DB. Now I think a few of us are questioning whether to use a datastore at all. After all, once you've dropped SQL for some functions, why stuff them into yet another opaque datastore? You might end up having to use a relational database anyway at some point, and then you're stuck with another piece of junk and all its conceptual overhead and extra code. Start with files and introduce a relational database only when necessary.
Formats are straightforward these days. Just write out JSON, XML, HTML5 with microformats, or even YAML. The art is in the "schema", or how you store the files. Choose wisely with some knowledge of your requirements and it seems like it can work well.
Files on disk mean you can use all your great UNIX tooling to do all sorts of complex operations that would take you many many hours of skilled development to do with a database. Then there are all the possible things you could do with Git or (imagine!) ZFS! Versioning everything for free? Keep an activity log within a hierarchy. What kind of cool stuff can you do with that? I aim to find out.
Take storing project management data as JSON files.
Suddenly the boss bursts in: he needs to know, right now, how many hours Jeremy has spent working on these three projects.
Given that your JSON hierarchy runs project->person and not person->project, you will now have to traverse your entire database to give an answer.
The deep magic of relational algebra or relational calculus is that they privilege no one view of the data over the other.
This is, incidentally, one of the causes of OO/relational mismatch. A class hierarchy locks in a particular view of the problem domain. Relational has many potential views of the problem domain.
The cost and inconvenience of traversing trees for ad hoc queries is one of the big reasons that RDBMSes swept away network and hierarchical databases.
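The boss's question can be made concrete. With a hypothetical project->person nesting (the structure and numbers below are invented), answering a person-centric question means walking every project:

```python
# Hypothetical project->person nesting, as in the example above.
projects = {
    "apollo":   {"members": {"jeremy": {"hours": 12}, "dana": {"hours": 7}}},
    "borealis": {"members": {"jeremy": {"hours": 5}}},
    "calypso":  {"members": {"dana": {"hours": 9}}},
}

def hours_for(person, project_names):
    """Answering a person-centric question against a project-centric
    hierarchy means scanning every requested project. In SQL this would
    be roughly: SELECT SUM(hours) FROM work WHERE person=? AND project IN (...)."""
    total = 0
    for name in project_names:
        member = projects.get(name, {}).get("members", {}).get(person)
        if member:
            total += member["hours"]
    return total
```

Three projects is nothing; the pain is that the traversal cost grows with the whole hierarchy, while the relational query engine can use an index on the person column regardless of how the data was originally "nested".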
I don't think pg understands the issue at a gut-feel level.
There is a lot of benefit to processing your data using a declarative language, which is what SQL is. It's basically Prolog, with less recursion and more separation of code and data.
You get all of that power for all of the data manipulation code that you run inside the database. If you have the misfortune to be writing in VB or Java, that's a huge power-up that you gain by using an RDBMS.
But pg is already using a declarative language, because he programs in Lisp. So the advantages are fewer in his case.
So I doubt pg feels significantly empowered by an RDBMS.
I never thought of SQL as declarative or like Prolog. I think it's more like a filter-writing language, with set functions like join.
Agreed that the distance between Lisp and SQL is a lot shorter, so an RDBMS doesn't gain you very much. In functional languages, it's normal to write filters and apply functions to lists. You also end up performing your joins manually, and end up "optimizing the query" as you code.
In SQL, filters and apply (update) are easier to write, but it's also easy to perform a join that is slow.
> I never thought of SQL as declarative or like Prolog.
Look at the different forms of relational calculus (SQL is intended to be basically syntactic sugar on top) or look at some research papers by Fariba Sadri - hell, look at the earliest papers on relational databases by EF Codd. RDBMSs were meant to be logic systems right from the start.
> it's also easy to perform a join that is slow
Have you been using MySQL? Use a real database with a working query optimizer. I know from experience that MS SQL will do its best even when you give it a dog of a query. For example, it will re-arrange your "filter" and build a temporary index to make your join go at the fastest possible speed.
If you have a specific use case nailed down you can always make it go faster outside of an RDBMS much the same way that you can always make something go faster if you hand-code it in machine code instead of using a compiler.
But if you can't put that much effort into hand-optimizing things, say because your use case might change so you'd have to change the code, the query optimizer/compiler will do a better job.
Declarative often means opaque and hard to debug. SQL is just that over huge datasets. With properly partitioned text files, at least you know you're doing a full scan of a small partition that is practically real-time.
I don't think this is true at all ... you use the same query irrespective of the size of the data in the tables. So the size of the datasets is irrelevant except for surfacing performance problems.
In the example you gave you shouldn't have problems if you put the right index on the table.
What really causes the opacity is either (a) the complexity of the query, or (b) the guy writing the query clearly doesn't know how to write declarative code, so he tries to write, e.g., procedural code in SQL, with cursors and triggers and other horrors.
If you are in scenario (a) and you really do need to do something complex, I'd pick SQL over VB to do it in any day - the non-declarative style leads to huge code with side effects and more opportunities for bugs to creep in.
It has been very helpful to me to consider SQL being an approximation of the prolog model. It made the understanding and creation of schemas and queries much easier.
It might be very easy to code with a database, but the DB still has to do a lot of behind-the-scenes work, so you might not want them updated in real time anyway. (n.b. this is all theoretical, I have no idea about the workings of HN)
You could argue that this is another plus for flat files; you have a better appreciation for what operations are hard work, whereas a 'simple' one line select/join can hide a huge performance blocker...
I think this is a really good question, and I'm surprised by the answers; many of them are not particularly accurate.
I won't tear into the wrongness of the answers; instead I'll highlight the rightness of the question: why use an incredibly complex data management system when you can store your data simply? IMHO the answer usually is "because that's the way everyone else does it", and that's not really a wrong answer, but it's not really all that interesting either.
If you are storing a dataset that's write-once, read-many, then not using a database is great and you're likely to save a lot of work. However, if you end up doing complex things with your data, it's probably not that great to re-invent the wheel. You should also look at the cost of training other people on your storage system; if it's not actually simple, you waste a lot of time getting people up to speed.
> why use an incredibly complex data management system when you can store your data simply
Databases haven't been incredibly complex for years. Anyone can be up and running with MariaDB or psql or even MongoDB in literally minutes. Actually using them is absolutely no more difficult than filesystem operations.
Sorry, that's not my model of complex :) If I have to store one line of text, i.e. "This is a line of text", in a database, the amount of overhead required to do that is huge. To store it simply in a flat file, I just put it in the file; that's a low-complexity system.
You're right. It's harder to start with an RDBMS; the threshold of action is higher.
RDBMSes are like that because in return they make certain guarantees about their behaviour. Most of the time, the gratification from those guarantees is highly delayed and so, humans being humans, we heavily discount their value.
Your original post referred to it as "incredibly complex". It isn't at all complex. Is it more complex -- at least to start -- than an fopen/fwrite/fclose? Sure, but it's still easier than virtually any other part of any solution worth doing.
Worse still, avoiding such non-complexity early on almost certainly dooms you to serious complexity later on, because soon enough silly things like concurrency pop up -- the sorts of issues that databases solved decades ago.
Clearly Stack Exchange is showing pro-DB bias. I'm a huge fan of unstructured data in filesystems as well; that's how the service behind my prototypesapp.com is built.
I also happen to think MySQL-backed websites are the bane of the internet, further testament to the pro-db bias of the average web developer.
A properly built SQL store (or, increasingly, NoSQL) is definitely a necessity for a large service. But what is "large"?
I once heard the rule of thumb that a db is necessary when the size of the data you store in it becomes larger than the size of the db code. I kind of like it.
It really depends on your structure I think. In the days of PHP3 I saw a lot of developers that built .txt-backed websites, with complicated parsing that would break when you looked at it.
In my opinion, they would have been better off with a database, or a more clever design like yours probably is.
A lot of my first work was rewriting PHP3 sites that used text files into PHP4 with MySQL. Some of them had 200MB articles in text files, and for every request they parsed this file and didn't know why it was slow.
Unstructured data is as bad as unstructured code. Typically, your data will become more complex over time, and structure helps a lot with ensuring that this is not a painful thing to do. Relational databases are well-studied and present a well-understood abstraction, with design patterns (normal forms) that significantly improve the maintainability of databases. Sure, relational DBs are not the only solution; I personally find Lisp s-expressions to be an easier to use Q&D store (of course, I am also a Lisp programmer, and most of my code does not have to deal with untrusted data), and lots of people are using YAML these days (and of course, there is XML).
I recently did experiments with storing documents (blobs) as files named by UUID. I stored documents within a hierarchy a few directory levels deep to avoid having too many files in a single directory (UUID abcdefg was stored as a/b/c/d/abcdefg). Performance was surprisingly low. Reading documents resulted in a lot of disk seeks. Also, it seems that directory-entry lookup is an O(n) operation in standard ext4. I also did experiments with simple databases like GDBM. The result was also disappointing. In the end I started to really appreciate RDBMS design and all the theory behind it. Creating a highly performant system using raw I/O operations is HARD.
What was your usage pattern? This doesn't sound that different from static web file serving, which can be extremely fast. The filesystem relies heavily on the buffer cache to be fast, and I know there are some options to tune that, although I'm not an expert.
Do you mean that dir-entry lookup is, say, O(10,000) when you have 10,000 files in a single dir? If so, then I imagine that is the reason git has a hierarchy, i.e. it looks in .git/objects/[0-f][0-f]/.
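The fan-out scheme is simple enough to sketch; function name and parameters are mine (git uses a single two-hex-character level, but the depth is adjustable here):

```python
import os

def shard_path(root, uuid_hex, levels=2, width=2):
    """Git-style fan-out, e.g. .git/objects/ab/cdef... Spreading millions
    of files across many small subdirectories keeps any one directory's
    entry list short, which helps filesystems with slow directory lookup."""
    parts = [uuid_hex[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, uuid_hex[levels * width:])
```

Note that modern ext4 with dir_index (hashed directories) handles large directories much better than the worst case, but the fan-out costs little and works everywhere.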
The question is like asking "why do I need an operating system"? Or why live in a society with government services and laws instead of living alone off the land?
Databases are most valuable when there's more than one user at the same time, doing UN-anticipated things, involving multiple resources, whose uses may conflict with each other.
Databases provide certain services that allow things to run smoothly as concurrency, amount of data, types of data, and unanticipated usage begins to scale up. In exchange, databases impose a tax that includes extra complexity, increased latency, reduced throughput, potential license fees and general bother.
For coders who are just beginning a project, who are short of time and don't feel comfortable with a database, I see no good reason to insist on immediate database use.
If their project scales to many users, a lot of data, and lots of unanticipated queries WITHOUT using any kind of database, the coders would have made a HUGELY interesting discovery - at least as interesting as map-reduce.
What databases seem to be REALLY bad at, though, is storing large blocks of CDATA/"unsorted data" with metadata attached.
A small business has grown up using nothing but files. These files store large blocks of text which cannot be altered, only have metadata pulled out. (They're flat files, between 10 KB and 5 MB each, basically like dumping large char[] arrays into files, and unfortunately they need to be kept in this format because the formats vary and for incompatibility/flexibility reasons.)
Anyway... Basically they have 30+ GB of these flat files. Literally millions of them. All in *.dat files. It works quite well like that. But the downsides as discussed in the link crop up and cause problems.
So we tried to put them into a few SQL databases as LONGBLOB, and after the first couple of gigabytes the thing just became unmanageably slow; badly formed queries would cause it to lock up, and it took up way more space on the filesystem than the files did (and was WAY harder to manage, given the performance issues mentioned).
So I guess my point is, SQL databases are great for sorted data. They're freaking nightmares for CDATA/BLOBs. Just a complete waste of space. Almost no database is designed to handle that amount of data and continue to work.
Not that the filesystem isn't a valid option, but there are more databases than SQL databases. That case seems great for a key-value store (which is essentially what a filesystem is).
True there are, and I'd be lying if I said I had tried everything.
But in general even key-value stores are designed to store sorted/structured data. Not really massive BLOBs with a little meta data sprinkled in to explain it.
You just run into serious performance issues when you try to store/access >2 GB of "junk" data via some clever database management engine, which is really more accustomed to storing tiny little columns of explainable data you can write clever queries against.
We've wound up in a situation now where we just use databases to store the meta-data and have left the actual data in files, which from a programming point of view makes things harder (since you're hooking into two methods of data access as opposed to one).
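For what it's worth, that metadata-in-DB/blobs-on-disk split can be wrapped behind one small interface. This is a hypothetical sketch (the table name, columns, and content-hash file naming are all my own invention, not anyone's actual system): SQLite holds the queryable metadata rows, while the large immutable blobs stay as plain files addressed by hash.

```python
import hashlib
import os
import sqlite3

def init(db_path: str) -> sqlite3.Connection:
    """Open the metadata database, creating the table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS docs (
        sha1 TEXT PRIMARY KEY, source TEXT, created TEXT, size INTEGER)""")
    return conn

def store(conn: sqlite3.Connection, blob_dir: str,
          data: bytes, source: str, created: str) -> str:
    """Write the blob as a content-addressed file; index metadata in SQLite."""
    digest = hashlib.sha1(data).hexdigest()
    path = os.path.join(blob_dir, digest)
    if not os.path.exists(path):  # immutable content: write once
        with open(path, "wb") as f:
            f.write(data)
    conn.execute("INSERT OR IGNORE INTO docs VALUES (?, ?, ?, ?)",
                 (digest, source, created, len(data)))
    conn.commit()
    return digest
```

Queries run against the small `docs` table; only once you have a hash do you touch the filesystem, so the DB never has to page multi-megabyte blobs through its buffer pool.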
Well, PostgreSQL supports Large Objects, which are designed to store large amounts of data (outside the normal tables) and reference it using special IDs. It may work for your use case.
I don't know if these are inappropriate uses of databases, or if these databases could perform well, but are misconfigured in some way. But I have seen SQLite cause performance problems on desktop apps so often that I think there must be something hard about it, that's not hard about other data storage mechanisms.
I know you've acknowledged the limitations of your knowledge in your comment, but it's tremendously limiting to take an attitude of 'I don't like databases' in the presence of such limited knowledge and on the basis of a single implementation.
Maybe I was too glib. I should have said, I think that databases are overused.
There's a bias in favor of databases in the comments here and in the StackOverflow page, which isn't properly tempered by negative experiences. I also see a lot of misinformation, for example, "A database is needed if you have multiple processes modifying the data."
I have seen some desktop apps with terrible performance, that issue disproportionately huge amounts of disk IO, and where backtraces show SQLite functions. I conclude that using SQLite correctly in a desktop app is harder than people say. (Other RDBMSs aren't really in the running on the desktop.)
I have also seen a lot of dog-slow WordPress sites, including my own. Wouldn't you agree that many of them would be faster and more robust if they used a flat-file generator like Jekyll? Mine sure was!
So if databases are overused, what is the explanation? Well, look at the top voted reply: use a database and now your project is fast, ACID, fault-tolerant, can handle very large data sets, is concurrent, and can scale well! And when should you use files? The second reply answers that: it's when you "don't care about scalability or reliability."
If you believe these things, it will lead you to use databases in inappropriate contexts, and make the mistake of thinking that now your I/O performance is good. If anything, it's the opposite: databases abstract I/O, so properly evaluating performance means you need to work harder to understand what's really going on.
I think my own answer would be that a database system is very often the right solution if you know what you're doing with it. In that situation, diagnosing performance problems is quite easy (thanks to built-in diagnostics), concurrency is certainly easier, and scaling can be easier too, because problems like concurrent file writes are hard in a flat-file setup and easy in a database, and that's exactly the kind of problem you often run into when you scale.
Unfortunately, when you don't know what you're doing with them, they're basically a black box where magic happens, and you don't know how to fix any problems that occur. The thing is, that applies to many things. When you don't know what you're doing with concurrent file access, you'll run into even bigger problems than you will on an RDBMS - just last week I was in fact rewriting an otherwise-competent colleague's code, because concurrent read/write to files caused massive unrecoverable corruption issues.
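For reference, the standard way to avoid that kind of torn-write corruption with plain files is to write a temp file in the same directory and rename it over the target. A minimal Python sketch (the function name and error handling are mine):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` atomically: readers see either the old complete
    file or the new complete file, never a partially written one."""
    # Temp file must be on the same filesystem for rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make data durable before the rename
        os.replace(tmp, path)     # atomic rename (POSIX; os.replace also covers Windows)
    except BaseException:
        os.unlink(tmp)            # don't leave stray temp files behind
        raise
```

This protects concurrent readers from seeing half-written files; it does nothing for multiple concurrent *writers*, which still need locking — exactly the kind of thing an RDBMS handles for you.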
I don't think your wordpress comparison is apples to apples - wordpress is known to be dog-slow, and if you tried similar access patterns over file systems it would be at least as bad. There's little incentive for them to fix it, though, because wordpress sites are read-mostly, and all you have to do to fix all your problems is stick one of the wordpress caches in front of it. Frankly it boggles my mind that they don't ship with a cache on by default.
There's absolutely a place for no-db - if your content is static or write-once, doesn't need indexing over, and you can guarantee that those requirements won't change. In this case, there's no need at all to use a db, and it will simplify your life not to. I'm writing such a site as we speak, in fact! If you have dynamic content, well, I'd use a DBMS personally.
tl;dr: understanding how to use RDBMSs will make the average software engineer's life much easier.
edit: Just a note on SQLite specifically: SQLite can be slow compared to the common file system trick of write new temporary file -> rename, because the file system trick (on ext3 at least) doesn't require an fsync. It does mean that the change is not guaranteed durable, but often for browser data that's OK! An equivalent in database-land would be not writing every single transaction to the transaction log when it commits. That means that on a crash you lose everything that hasn't hit the disk yet, but you maintain consistency. Before 3.7, SQLite didn't support this because it didn't use write-ahead logging, which meant you had to fsync to maintain consistency. I believe in WAL mode this behaviour is easily available now though!
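In current SQLite that trade-off is a couple of pragmas away. A rough sketch (the file name and table are just examples): WAL mode plus synchronous=NORMAL means SQLite no longer fsyncs on every commit, so a crash can lose the most recent transactions, but the database stays consistent.

```python
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("PRAGMA journal_mode=WAL")    # write-ahead logging (SQLite >= 3.7)
conn.execute("PRAGMA synchronous=NORMAL")  # skip the fsync-per-commit; consistent but
                                           # not durable through a power failure

conn.execute("CREATE TABLE IF NOT EXISTS prefs (key TEXT PRIMARY KEY, value TEXT)")
with conn:  # batch into one transaction: one WAL append, not 1000 synced commits
    for i in range(1000):
        conn.execute("INSERT OR REPLACE INTO prefs VALUES (?, ?)",
                     (f"k{i}", str(i)))
```

Batching the inserts matters as much as the pragmas: one transaction per row is the classic way to make SQLite look pathologically slow.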
> I don't like databases, and the reason I don't like them is that they seem to cause terrible performance problems in web browsers.
Disk I/O is slow, so any kind of disk storage will be slow. I'm sure storing web browser's data in some kind of ad-hoc files would be much slower than SQLite. And the fsync() issues are orthogonal to using flat files/DB.
I believe ad-hoc files would handily have outperformed SQLite.
One reason is that SQLite does not use optimal disk access patterns. I know this because I have watched it issue nothing but preads and pwrites for literally hours, on a database that was about 2 GB. This is obviously terribly pathological behavior, and it was in shipping products. (incrVacuumStep is the bane of my existence.)
The second reason is that SQLite provides strong data integrity guarantees by default, at the cost of performance. This has proven to be a bad tradeoff for many applications, including Firefox.
> And the fsync() issues are orthogonal to using flat files/DB
They are not orthogonal, because SQLite calls fsync a lot by default, and it takes work to understand what you're doing that causes it, or even to disable it. See https://bugzilla.mozilla.org/show_bug.cgi?id=421482 for some of the pain this caused.
Notice some of the timing differences - one user reported that disabling SQLite async IO reduced his shutdown times from 1m40s to 5 seconds. This puts the lie to your "any kind of disk storage will be slow" claim: disk access patterns can have enormous impact, and empirically, it's easy to use SQLite in a way that destroys your performance.
Depends entirely on what you're doing. If you're just saving large sets of data to the disk, and doing nothing else with them, then yeah, don't use a DB. If you're running lots of random queries on piecemeal parts of several dozen sets of those files, then use a DB.
There's a reason it's called SQL: Structured Query Language. If you're interrogating the data, you want SQL (or something like it). If you're just loading stuff / saving stuff to disk, like a Word file, and not performing any sort of random search or query on it, then yeah, using a DB doesn't make sense.
This was my first reaction, as well. This goes especially for large datasets. If you are making linear passes over very large databases, I don't think I see much advantage in using a DB beyond possible benefits of shared storage.
It really depends on the problem. Using a DB well also requires some expertise.
I'm writing a small web app where users each have a specific list of items. I started with a DB, but it's so much more complex than a simple file per user. Besides, you have single big indexes over all the user data while there is a clear split of information between users, and I didn't know how to handle that efficiently. I went back to one file per user, with the content loaded and cached in memory.
A database is a sophisticated form of compression. Text compression is not so good after all. A database allows you to keep only indexes and access them directly. Text compression is naive and inefficient; database software guides programmers to optimize their data types themselves. It's either that or storing everything in binary files.
We use databases because they allow us to organize a binary file. It's a standard, but it's only relevant if you need to use those kinds of data patterns.
You also have to understand that regex doesn't scale at all.
I'm thinking about moving some of my data out of my postgres database, and into S3 objects. I think for "documents" that will only be accessed by a single user (so record locking etc. doesn't matter), it will help things scale more easily. The web app will be hosted on AWS, so I wouldn't think the S3 retrieval time should be too bad. Anyone have some good guidelines on this?
Ah nostalgia... I recall wondering about this myself back in the early days of my career.
However, that was before the current wave of document-based DBs (Redis, Couch, Mongo etc.) so the meaning of a "database" was more targeted at RDBMSs. These days a lot of the reasons listed in the answers are equally applicable to, say, JSON files stored in Redis.
The same reason you divide your programs into functions, classes, etc. A well-designed database can change as the data you store needs to change. You also get a nice abstraction for accessing/querying data (e.g. relational algebra/SQL) that is well-studied and widely applicable.
"Eventually" will very likely be "immediately" for any system that's remotely serious in any way, and being created by developers who consider themselves professionals in even the slightest sense.