You can tell a lot about a developer by their preferred database.
* Mongo: I like things easy, even if easy is dangerous. I probably write Javascript exclusively
* MySQL: I don't like to rock the boat, and MySQL is available everywhere
* PostgreSQL: I'm not afraid of the command line
* H2: My company can't afford a database admin, so I embedded the database in our application (I have actually done this)
* SQLite: I'm either using SQLite as my app's file format, writing a smartphone app, or about to realize the difference between load-in-test and load-in-production
* RabbitMQ: I don't know what a database is
* Redis: I got tired of optimizing SQL queries
* Oracle: I'm being paid to sell you Oracle
This might be a stupid question, but surely no one thinks of RabbitMQ as a database, right? I used it extensively from 2012 to 2018, including using things like shovels to build hub-and-spoke topologies, but not once did I think of it as anything other than a message broker.
I once worked on a system for notifying customers of events by posting to their APIs. Events came in on a Rabbit queue and got posted.
If a customer's API was down, the event would go back on the queue with a header saying to retry it after some time. You can do some sort of incantation to specifically retrieve messages with a suitable header value, to find messages which are ready to retry. We used exponential backoff, capped at one day, because the API might be down for a week.
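A minimal sketch of that requeue-with-backoff pattern in Python with pika (the header names here are made up for illustration, not the ones we actually used):

    import time

    import pika  # assumes the pika AMQP client; any client with header support works similarly

    MAX_DELAY = 24 * 60 * 60  # cap the backoff at one day


    def requeue_with_backoff(channel, queue, body, headers):
        # Hypothetical header names: x-attempt counts retries, x-retry-at says
        # when the consumer may process this message again.
        attempt = headers.get("x-attempt", 0) + 1
        delay = min(2 ** attempt, MAX_DELAY)  # exponential backoff, capped
        new_headers = dict(headers, **{"x-attempt": attempt,
                                       "x-retry-at": time.time() + delay})
        channel.basic_publish(
            exchange="",
            routing_key=queue,
            body=body,
            properties=pika.BasicProperties(headers=new_headers, delivery_mode=2),
        )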
I didn't think of RabbitMQ as a database when I started that work, but it looked a lot like it by the time I finished.
RabbitMQ stores your data, right? Then it's a database! That's pretty much all it takes. A flat file, memory-store, SQL DB, Document store, any of them can be databases if that's where you stick your data!
But also no: RabbitMQ and Kafka and the like are clearly message buses, and though they might also technically qualify as a DB, it would be a poor descriptor.
If memory serves, the original EToys.com code treated the filesystem as a tree-structured database using atomic operations (though no transactions). It worked just fine; then the rewrite with an RDBMS, which should have been stabler and faster, resulted in the famous meltdowns. Admittedly this is cheating a bit, since you can name folders & files with semi-arbitrary or internally structured string keys. By 1997 standards, pure disk access without having to walk the filesystem hierarchy was blazingly fast compared to many of the databases I was using.
[Source: I was friends with the guy who wrote it as well as other EToys employees. God that was a trainwreck.]
I don't think anyone posted about their particular system, but it's not unknown now. If you google "filesystem as a database" there are some relevant hits. One super simple, probably not ideal, but at least balanced version uses a hash of some primary key (like a customer row id) as the file index, then partitions the items into directories based on successive parts of the hash, with all permutations at each level (or only the populated ones). For example, an item key that hashes to a32c4214585e9cb7a55474133a5fc986 would be located somewhere like this:
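A sketch of that layout in Python, assuming two-character hash chunks and three levels of nesting (the chunk size, depth, and root directory are illustrative choices, not what eToys actually used):

    import os

    def item_path(root, key_hash, levels=3, chunk=2):
        # Successive chunks of the hash become nested directory names, so the
        # full path is known up front without scanning any directory.
        parts = [key_hash[i * chunk:(i + 1) * chunk] for i in range(levels)]
        return os.path.join(root, *parts, key_hash)

    print(item_path("/data", "a32c4214585e9cb7a55474133a5fc986"))
    # /data/a3/2c/42/a32c4214585e9cb7a55474133a5fc986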
The advantage of this kind of structure is that you never need to manually scan a directory, since you know exactly what path you're trying to open. You still incur the OS lookup time for the inode-equivalent in the directory entry, but a deeper hierarchy keeps each directory small enough that those lookups stay fast. You can trade off the time to traverse the hierarchy against the number of entries in the final directories by adjusting the length of the hash chunk you use at each level. Two characters will put vastly fewer entries at a given level, but vastly increase your directory depth.
Basically, if you're manually scanning the hierarchy for anything but a consistency check or garbage collection, you've already lost.
One important note: make sure you carefully consider using atomic renames and such for manipulating the files! Overwrite in place is a great way to end up with a corrupted item if something goes desperately wrong and you're not protected by COW or data journaling.
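A minimal sketch of the write-to-temp-then-rename approach in Python (the function and file names are illustrative):

    import os
    import tempfile

    def atomic_write(path, data):
        # Write to a temp file in the same directory, then rename it over the
        # target. os.replace() is atomic on POSIX, so a reader sees either the
        # old contents or the new ones, never a half-written file.
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # make sure the bytes hit disk before the rename
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise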
Unfortunately eToys imploded a couple of years later (2001), and there were only a few people involved at that stage, so it's possible none of them are in the industry anymore. You might start by looking at email servers; I believe there are a few that use a deeply nested directory hierarchy for much the same reasons. IIRC Apple also does something similar with the interior of the sparsebundles used in Time Machine backups, but I don't know if any of that code is open source.
I had to work on a tool that shows what's wrong with an assembly line: missing parts, delays, etc... So that management can take corrective action. Typical "BI" stuff but in a more industrial setting.
The company went all out on new technologies. Web front-end, responsive design, "big data", distributed computing, etc... My job was to use PySpark to extract indicators from a variety of data sources. Nothing complex, but the development environment was so terrible it turned the most simple task into a challenge.
One day, the project manager (sorry, "scrum master") came in, opened an Excel sheet, imported the data sets, and in about 5 minutes showed me what I had to do. It took me several days to implement...
So basically, my manager with Excel was hundreds of times more efficient than I was with all that shiny new technology.
That experience made me respect Excel and people who know how to use it a lot more, and modern stacks a lot less.
I am fully aware that Excel is not always the right tool for the job, and that modern stacks have a place. For example, Excel does not scale, but there are cases where you don't need scalability. An assembly line isn't going to start processing 100x more parts anytime soon, and one that does will be very different. There are physical limits.
I think you drew the right conclusion from your experience, but I also want to point out that building the first prototype is always anywhere from one to three orders of magnitude easier than building the actual product.
The devil is in the details, and software is nothing but details. The product owner at the company I work for likens it (somewhat illogically, but it works) with constructing walls. You can either pick whatever stones you have lying around, and then you'll spend a lot of time trying to fit them together and you'll have a hell of a time trying to repair the wall when a section breaks. Or you can build it from perfectly rectangular bricks, and it will be easy to make it taller one layer at a time.
Using whatever rocks you have lying around is like building a prototype in Excel. Carefully crafting layers of abstraction using proper software engineering procedures means taking the time to make those rectangular bricks before building the wall. The end result is more predictable when life happens to the wall.
Well, in these situations, the implicit ask of your company (I've been there myself) is to basically rebuild Excel, but trade away some of Excel's power/flexibility for safety, and move the risk of error away from front-end users (i.e. onto the back-end developers).
Unfortunately, which specific features of Excel are acceptable to remove is unknown until you have already way over-invested in the project.
The best I've seen this done is having Excel as a client for your data store, where read access is straightforward and writes are done via CSV upload (with heavy validation and maybe history rollback; see the sketch after this comment).
That way the business can self-service every permutation of dashboard/report they need and only when a very specific usecase arises do you need to start putting engineering effort behind it.
I suppose you can also supplement the Excel workflow with a pared-down CRUD interface for the inevitable employee allergic to Excel.
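A minimal sketch of the validate-before-write step for that CSV upload path, in Python (the column names and rules are hypothetical):

    import csv
    import io

    REQUIRED = {"id", "name", "amount"}  # hypothetical schema

    def validate_csv(payload):
        # Reject the whole upload if the header or any row is bad; only a
        # fully valid file ever reaches the data store.
        reader = csv.DictReader(io.StringIO(payload))
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            raise ValueError("missing columns: %s" % ", ".join(sorted(missing)))
        rows = []
        for lineno, row in enumerate(reader, start=2):  # header is line 1
            if not row["id"].isdigit():
                raise ValueError("line %d: id must be numeric" % lineno)
            rows.append(row)
        return rows  # hand these to the writer, keeping history for rollback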
> Kafka and the like are clearly message buses and though they might also technically qualify as a DB
ksqlDB is actually a database on top of this.
The thing is that they have an incrementally updated materialized view that acts as the table, while the event stream is similar to a WAL (more of an "ahead-of-write log" in this case).
Because eventually you can't just go over your entire history for every query.
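A toy illustration of that idea in Python (nothing to do with ksqlDB's actual implementation): fold the event stream into a table as events arrive, so a query reads the view instead of replaying the whole history.

    from collections import defaultdict

    # The "materialized view": current balance per account, updated incrementally.
    balances = defaultdict(int)

    def apply(event):
        if event["type"] == "deposit":
            balances[event["account"]] += event["amount"]
        elif event["type"] == "withdrawal":
            balances[event["account"]] -= event["amount"]

    # The event stream plays the role of the log; the dict is the queryable table.
    for event in [{"type": "deposit", "account": "a", "amount": 100},
                  {"type": "withdrawal", "account": "a", "amount": 30}]:
        apply(event)

    print(balances["a"])  # 70, without rescanning the history at query time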
Oh ho ho ho. What weird things we use as databases. I remember when I first started out as a consultant developer, we were using a CMS as our data repository because someone thought that was a good idea. (It wasn't.) The vendor was flown in from the States to help with troubleshooting. I will never forget how he looked at me when I had to explain why we had made so many nodes in the content tree: it was because we were using the CMS as a repository.
It's both. It's best used when it's being used as a message broker, but any sufficiently advanced message broker will need many of the features of a database – durability of messages, querying in various ways, etc. I think it's reasonable to think of it as a very specialised database.
I interpret it as they'd probably not call it a database, but they might use it in places where a database would be better suited, and effectively store data in it.
As someone who chose MySQL and provides direction to developers who really like Postgres, and who also uses Postgres for fun, I do find myself having to both defend MySQL as a prudent option and convince them that I know anything at all about Postgres or computer science. :)
I've heard MySQL (well, MariaDB, really) has improved a lot in recent years, but I still can't imagine why I'd ever choose it over Postgres for a professional project. Is there any reason?
It used to be that bargain basement shared-hosting providers would only give you a LAMP stack, so it was MySQL or nothing. But if you're on RDS, Postgres every time for my money.
I used to use PgAdmin 3, but after... I dunno how many years now, PgAdmin4 is still a buggy mess.
It's really sad, because all the contributors to Postgres have made an AMAZING database that's such a joy to work with. And then there's PgAdmin4, where it's almost like they just don't care...
I don't feel I'm smart enough to contribute anything to PgAdmin4 to try to make it better. So I stick to DataGrip and DBeaver.
For MySQL, I haven't found anything that beats SequelPro. For Postgres, I haven't found anything that comes close to parity, but my favorite is Postico.
I know people that swear by IntelliJ for their db stuff, it just never hit home for me personally though.
Does that mean they fixed “utf-8” or that everyone is just supposed to know that it’s fucking bullshit and always has been?
You can’t cut corners like that without inviting questions about the character of the primary committers. The pecking order in software is about trust.
People don’t let that stuff go easily, which is why you still see people harping on MongoDB. Once someone is labeled a liar and a cheat, everything they say that doesn’t add up is “guilty until proven innocent.”
The utf-8 situation is on top of a bed of half truths. Things like publishing benchmarks with backends that don’t support isolation. A cornerstone of a good DB is handling concurrent access efficiently and correctly. Drawing attention to other benchmarks is a lie by omission. Better than just being incorrect for a decade, certainly, but still sketchy.
MyRocks is a little bit janky in my experience - it doesn't support some transaction isolation levels, fails to handle some workloads that rely on locking (such as a job queue), has failed to upgrade MariaDB minor versions [0], has very sparse documentation, and overall has given me some amount of unexpected behavior.
Though I'm willing to put up with it due to its incredible compression capabilities...
I prefer PostgreSQL, but MySQL provides a better clustering experience if you need more read capacity than a lone node can provide.
Oracle is great if and only if you have a use case that fits their strengths, you have an Oracle-specific DBA, and you do not care about the cost. I have been on teams where we met those criteria, and I genuinely had no complaints within that context.
Given both my experience and prior research, I don't believe you that Oracle is ever better than half the stuff on the above list, and I think it's worse than Postgres on every metric.
Every time I need to work with an Oracle DB it costs me weeks of wasted time.
For a specific example, I was migrating a magazine customer to a new platform, and all of the Oracle dumps and reads would silently truncate long textfields... The "Oracle experts" couldn't figure it out, and I had to try 5 different tools before finally finding one that let me read the entire field (it was some flavor of JDBC or something). To me, that's bonkers behavior, and is just one of the reasons I've sworn them off as anything other than con artists.
My day job involves developing for / customizing / maintaining two separate third-party systems that rely on SQL Server (one of them optionally supports Oracle, but fuck that).
I gotta say, as much as I hate it with a passion, and as often as it breaks for seemingly silly reasons (so many deadlocks), it's at least tolerable (even if I feel like Postgres is better by just about every metric).
I've been working with a partner company that is using Datomic to back a relatively impressive product - but I don't really see much written about it. What has been your experience?
Most people have not in fact realized weak typing is not a good idea. I myself vastly prefer strongly typed languages and think they are superior. However there are a huge number of people I work with and know professionally who prefer dynamically typed languages. Weak versus strong typing is a highly subjective opinion. Each one has different costs and benefits and which camp you land in depends in large part on what you value personally.
It's not a type constraint. It's a hint to SQLite to try and coerce values when it can. Here's what the link the parent posted says:
> As far as we can tell, the SQL language specification allows the use of manifest typing. Nevertheless, most other SQL database engines are statically typed and so some people feel that the use of manifest typing is a bug in SQLite. But the authors of SQLite feel very strongly that this is a feature. The use of manifest typing in SQLite is a deliberate design decision which has proven in practice to make SQLite more reliable and easier to use, especially when used in combination with dynamically typed programming languages such as Tcl and Python.
It's intended behavior that's compatible with the SQL spec.
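A quick illustration with Python's sqlite3 module: the declared column type is an affinity, not a constraint, so SQLite coerces the value when it can and otherwise stores it as-is.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (n INTEGER)")

    # '123' can be coerced to an integer, so it is stored as one.
    conn.execute("INSERT INTO t VALUES ('123')")
    # 'hello' cannot be coerced, so it is stored as text -- no error is raised.
    conn.execute("INSERT INTO t VALUES ('hello')")

    for row in conn.execute("SELECT n, typeof(n) FROM t"):
        print(row)
    # (123, 'integer')
    # ('hello', 'text')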
I admit I was kind of thinking that, even though I appreciated the humor. :) I imagine an awful lot of web sites out there would do just fine with SQLite as their back end.
Can you elaborate? I've seen benchmarks, and from their website what I understood is that it can handle really massive reads and writes, tens (maybe hundreds) of thousands of ops per second, but I've personally never tested it to that extent.
We're using it in Quassel, and as soon as you go over ~3-4 parallel write/read threads, it starts locking up completely, sometimes taking 30 seconds for simple queries that should really take milliseconds.
The big issue is that SQLite does full-database locking, so during any write you can't easily read at all.
This can be fixed with WAL mode, but WAL mode was broken in its early versions, and new versions of SQLite aren't in all distros yet, despite being out for almost a decade. And even WAL mode gets abysmal performance.
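For reference, turning on WAL mode is a one-line pragma; a sketch with Python's sqlite3 (the filename is illustrative, and whether it helps depends on the SQLite version your distro ships):

    import sqlite3

    conn = sqlite3.connect("quassel-backlog.db")  # illustrative filename
    # WAL lets readers proceed while a single writer appends to the log,
    # instead of the default rollback journal's whole-database locking.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")  # common companion setting, optional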
It really can (LXD cloud setup from personal experience), the problem is that if you don't serialise your writes then yeah, fun times to be had. There are compromises for all databases. People just like telling others their opinion as fact, and how wrong everybody is apart from themselves of course.
How far does SQLite scale? Obviously not good for anything public facing with thousands of concurrent users, obviously good enough for something you only use yourself, but what about internal tools with a couple hundred users total (few of them concurrent) - where's the limit when it starts slowing down?
Expensify aren't really scaling SQLite in the way that people would expect. To say it's scaling SQLite is not exactly wrong, but probably gives the wrong impression. The users of their database likely wouldn't see it as SQLite, and they don't use the stock SQLite code.
They have their own layer on top that happens to use SQLite as the storage format on disk[1]. This layer means they aren't using full SQLite at the application level, but rather using their custom database in the application, and SQLite within their custom database.
Further, they've customised the SQLite codebase as far as I can tell to remove much of the functionality that SQLite uses to ensure that multiple instances can safely edit the same file on disk together, then they memory map the file and just have many threads all sharing the same data.
[1]: FoundationDB also does this, and scales to thousands of nodes. The trick is that it's essentially _many_ separate, very simple SQLite databases, each being run independently.
MySQL is actually amazing: it scales better than PGsql, supports JSON, and is available everywhere. I see no reason to use any other DB for 90% of the use cases you need a DB for.
I can tell you this emphatically as I spent 6 months trying to eke out performance with MySQL (5.6). PostgreSQL (9.4) handled the load much better without me having to change memory allocators or do any kind of aggressive tuning to the OS.
MySQL has some kind of mutex lock that stalls all threads; it's not noticeable until you have 48 cores, 32 databases, and completely unconstrained I/O.
You're comparing tech from two different eras... redo the benchmark today and I'll be surprised if you come to the same results. PGsql even has a wiki page where they discuss implementing MySQL features and changing their architecture so they can scale. https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FM...
They were both the latest and greatest at the time
> redo the benchmark today and I’ll be surprised if you come to the same results.
I would, but it was not just a benchmark, it was a deep undertaking including but not limited to: optimisations made in the linux kernel, specialised hardware along with custom memory allocators and analysing/tracing/flamegraphing disk/memory access patterns to find hot paths/locks/contention. (and at different scales: varying the number of connections, transactions per connection, number of databases, size of data, etc)
It was 6 months of my life.
> PGsql even has a wiki page where they discuss implementing MySQL features and changing their architecture so they can scale.
Just because MySQL has some good ideas doesn't mean it scales better. I know for a fact that it didn't in 2015. I doubt that they have fixed the things I found, but I could be wrong. It would have to be a large leap forward for MySQL, and PostgreSQL has had large performance improvements since then too.
Also, I read that page and it says nothing about scaling, just that some storage engines have desirable features (memory tables are very fast, and PGsql doesn't support them; archive tables are useful for writing to slower media; you can do this with partitioning but it's not intuitive).
How is MongoDB dangerous or less consistent than PG? I have one for you: I can't use PG or MySQL, because if the master is down my app goes down and then the entire backend fails. How do you do HA with default PG?
Perhaps that’s because some other message brokers are now being touted as databases[0][1]; I remember seeing a thread about it on HN a couple of days ago.
Kafka is much more like a distributed file system that has queuing semantics baked in than it is an ephemeral queue that implements some level of persistence.
The fact that you put Kafka and RabbitMQ in the same category sort of makes me feel like you're out of your element, Donnie.
It seems like you have the only good use case for it pegged down. I've worked at multiple companies that really, really didn't understand that putting something into the DB comes with some probability that it'll never come out. The arguments were "but it's a dataBASE, it stores data. They'd never sell this as a product if it LOST data; then it wouldn't be a database..."