Not to bikeshed, but what about Postgres makes the big scalability story any different than the MySQL story, unless you're talking about a commercial distributed RDBMS built on top of Postgres like EnterpriseDB?
In terms of scalability in the small, many of the MySQL 5.6 replication features skew this comparison in MySQL's direction: multi-threaded (optionally row-based) replication, global transaction ids, and the like make it easier to improve availability, add read-slaves, and host multiple databases/shards on a single machine (replication afaik is no longer per-server, it is now per-database).
There's also been work on tuning innodb for SSDs (which is a near certain recommendation for your dataset -- it gives plenty of breathing room to forestall horizontal partitioning).
I'd also look very heavily into performance with large buffer caches, compression (a great way to reduce IOPS and squeeze the most out of SSD or memory space), etc. I am curious to see if any of these were compared between MySQL and Postgres. As far as I understand, InnoDB's compression is somewhat more advanced than Postgres's, but running either on ZFS is an even better bet for this.
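For concreteness, the kind of InnoDB knobs I have in mind for SSD-backed machines look roughly like this; the values are illustrative, not recommendations, and the flush method and buffer pool size are static settings that have to go in my.cnf:

    -- Illustrative values only; tune against your own hardware and workload.
    -- (innodb_flush_method = O_DIRECT and innodb_buffer_pool_size belong in my.cnf,
    --  since they aren't dynamic in 5.6.)
    SET GLOBAL innodb_io_capacity     = 2000;  -- let the flusher use the IOPS an SSD can actually deliver
    SET GLOBAL innodb_flush_neighbors = 0;     -- neighbor-page flushing only helps spinning disks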
On the other hand, Postgres has a better query optimizer, supports more complex data and relation formats, and so on. However, most of these features aren't going to be used at scale (I am presuming you're talking about OLTP workloads).
Honestly this is a bit of an unusual decision -- I'd probably start a project with Postgres and avoid using MySQL until later, but choosing Postgres for scalability reasons seems a bit odd. I'm curious to know why!
Maybe it was not only for scalability, but also for reliability or embracing the Do the Right Thing thing?
From my experience with PostgreSQL and MySQL, it is a bit like the difference between Python and PHP: one is a great work of craftsmanship that is reliable, predictable, coherent, enjoyable to work with and minimises the WTF/min rate (which is the best quality measure in software), while the other is a bunch of hacks knit together to make things work ASAP, and its WTF/min rate is skyscraping.
To put it bluntly after managing both for years, Postgres doesn't scare the shit out of me like MySQL does. I've had quite a few moments with MySQL doing stupid things that aren't intuitive or right, particularly in the backup/restore space. For example, backing up one schema and restoring it onto a later version doesn't always work.
Postgres has the same feel as *BSD i.e. deterministic. It is well documented and does exactly what the manual says and is devoid of surprises. It feels engineered and I can provide reliably engineered solutions because of this.
Basically I can sleep at night.
I really give less of a crap about scalability. I'd throw a bigger box at the problem or use heavy caching. I've built much larger ecommerce solutions, with orders of magnitude more hits/orders than SparkFun, on much smaller kit.
I have to say, my experience with Postgres is the opposite. Sometimes the query planner will suddenly decide to pick a bad execution plan.
I have been woken up in the night multiple times and found that the problem is that Postgres has arbitrarily decided to stop using an index and changed to do a full table scan and the database has ground to a halt under the load.
Doing an ANALYZE sometimes fixes it. Increasing the statistics target for the table in addition sometimes fixes it. One time we had to add an extra column to an index to make it decide to use it even though it shouldn't have needed it (and didn't a few hours before!).
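For reference, those first two fixes look roughly like this (table, column and target value are made up):

    -- Hypothetical table and column names.
    ANALYZE orders;                               -- refresh the planner's statistics
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
    ANALYZE orders;                               -- re-sample with the larger target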
The developers seem dead set against allowing you to override the query planner to add determinism.
I'm not saying MySQL is any better, as I have not used that in production before.
Well, if you grep for MySQL gotchas you'll find a bunch of them. One bit me hard one day. I wanted to add a simple check constraint on a column, as I had often done on PG. After a quick glance at the docs I sent the ADD CONSTRAINT command and, no news being good news, I thought everything was all set and good. But in fact the beast had simply ignored my command! After a careful check of the docs, it turns out that in some cases constraints are ignored because they are not implemented.
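In case it helps anyone, here is roughly the behaviour I mean, with a made-up table; on the MySQL versions I'm talking about the CHECK clause is parsed and then silently discarded:

    -- Older MySQL: this parses fine, but the CHECK clause is silently ignored.
    CREATE TABLE prices (
        item_id INT PRIMARY KEY,
        price   DECIMAL(10,2),
        CHECK (price >= 0)
    );
    INSERT INTO prices VALUES (1, -5.00);  -- succeeds; no error, no warning

    -- The same schema on PostgreSQL rejects that insert with a check violation.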
That's crazy. How would you rate a shell that said nothing and copied nothing when asked to copy an important backup file somewhere? To me that is software hell: not even telling you that something is wrong.
A vessel is in danger when the captain doesn't know her position, but in far greater danger when he doesn't even know that he doesn't know where she is.
You have to run your ANALYZE commands... I've run into this issue on very volatile tables (lots of inserts/deletes). With these, the table profile will change and without statistics the query planner might go on the wrong path.
That said, it's unfair to compare it with MySQL's planner, which, although it exists, is so infantile that by comparison you could say MySQL has no planner at all. Example: invariant subselects are run for every row of the outer query instead of being stored in a temporary table; think SELECT a FROM foo WHERE foo.b IN (SELECT b FROM bar). Another basic one: SELECT a FROM foo ORDER BY b LIMIT 5, where an index exists on b, actually sorts the entire table instead of walking the index and stopping when it hits the LIMIT.
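For what it's worth, the workaround people usually reached for on older MySQL versions was to rewrite the IN subselect as a join against a derived table, along these lines:

    -- The dependent-subquery form re-evaluates the inner SELECT per row of foo...
    SELECT a FROM foo WHERE foo.b IN (SELECT b FROM bar);

    -- ...so you rewrite it as a join against a derived table (DISTINCT in case
    -- bar.b isn't unique) and let MySQL use an index on the join key instead.
    SELECT foo.a
    FROM foo
    JOIN (SELECT DISTINCT b FROM bar) AS bar_b ON bar_b.b = foo.b;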
My experience with MySQL (admittedly none of it is recent, so things may have changed a lot since) suggested that it didn't have a planner at all: it just attacked things in the order of your FROM/JOIN clauses, using indexes where they were obvious from the WHERE clauses and linking ON definitions. It wouldn't, for instance, see circumstances where it could hit the second table of a join first and thereby need to consider less of the first, larger table, so reordering inner joins in a query could make a significant difference.
I use MSSQL a lot in my day job and sometimes the query planner there surprises us by being fairly bright about this and other optimisations. Of course it sometimes gets things very wrong and you have to use index hints to push it in the right direction, but generally speaking if your queries are complex enough to hit that problem you might need to consider refactoring your design for other reasons too.
But the difference with MySQL is that you can easily forcibly make MySQL use the indexes you want it to, as long as it is possible.
With Postgres, while it usually seems to do a better job, when it refuses to use your indexes it can be a massive pain to coax it into doing what you want it to, even when you know it's making a bad choice.
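For illustration (table and index names are made up): MySQL lets you name the exact index to use, while in Postgres the closest you get is discouraging a whole plan type for the session:

    -- MySQL: a one-line hint, assuming an index named idx_foo_b exists.
    SELECT a FROM foo FORCE INDEX (idx_foo_b) WHERE b = 42;

    -- Postgres has no per-query hints; the blunt instrument is a session setting.
    SET enable_seqscan = off;
    SELECT a FROM foo WHERE b = 42;
    RESET enable_seqscan;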
Postgres has a number of issues like that, where ideological purity has a negative impact on usability. I still mainly use Postgres these days, but I regularly swear at it over things I know would be far simpler in MySQL for those reasons (replication used to be a huge one, though that is now finally getting there, 8 years after I set up my first "large" replicated MySQL setup).
PostgreSQL's EXPLAIN command, while cryptic, actually contains the full explanation behind the planner's decision. With that info, it's quite easy to force it onto a given path.
Bad execution plans are almost always the result of a missing or incompatible index, bad table statistics (corrected by running ANALYZE) or bad server configuration (corrected by adjusting memory parameters and cost parameters).
The query planner evaluates multiple query plans and chooses the one with the lowest estimated cost. If that results in the wrong plan, you should attempt to correct the query planner's input (that is, the statistics and the server configuration, and possibly the tables and their indices) so that it can better evaluate the possible query plans.
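Concretely, that usually means comparing the plan's estimates with reality and then fixing the inputs, along these lines (table names and values are illustrative, not recommendations):

    -- Compare the planner's estimates with what actually happened:
    EXPLAIN ANALYZE SELECT a FROM foo WHERE b = 42;

    -- Typical knobs when estimates are off on SSD-backed, well-cached boxes:
    SET random_page_cost = 1.5;        -- the default of 4.0 assumes spinning disks
    SET effective_cache_size = '16GB'; -- tell the planner how much cache it can count on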
This is perhaps one of the more succinct ways of summarizing our core reasons for the switch. There's a lot more to it than that, of course, but in a nutshell we saw the upcoming scaling of our data footprint and complexity and wanted to use (what we evaluated to be) the better tool for the job.
I don't know the answer in this case, but a few thoughts:
* Why don't you think the optimizer is a consideration? That's a big deal, because it helps the database choose the right algorithms as your data changes. Continuing gracefully while the input data is changing sounds like scalability to me.
* Postgres is generally considered to scale to many concurrent connections and many cores. Useful for OLTP.
* Sometimes the right feature or extensibility hook can make a huge difference allowing you to do the work in the right place at the right time. For instance, LISTEN/NOTIFY makes it easier to build a caching layer that's properly invalidated -- not trivial to get right if the feature is missing, which might mean that you're not getting as much out of caching as you could be.
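As a minimal sketch of that last idea (table, column, channel and trigger names are all made up), a trigger can push an invalidation message that cache workers pick up with LISTEN:

    -- Hypothetical trigger: tell listeners which row changed so they can evict that key.
    CREATE OR REPLACE FUNCTION notify_cache_inval() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('cache_inval', TG_TABLE_NAME || ':' || NEW.id);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER products_cache_inval
        AFTER UPDATE ON products
        FOR EACH ROW EXECUTE PROCEDURE notify_cache_inval();

    -- Each cache worker just runs:
    LISTEN cache_inval;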
Tuning for SSDs, compression, buffer caching, etc., are the antithesis of scalability. Those things are pure performance issues. Scalability is about a system that adapts as input data and processing resources change. (Your point about replication does concern scalability, of course.)
> * Postgres is generally considered to scale to many concurrent connections and many cores. Useful for OLTP.
But MySQL scales trivially to many servers. Postgres replication is finally getting there, but it's been incredibly slow coming, while setting up huge MySQL installations with a dozen+ replicas, with selective sharding etc. has been effortless for about a decade.
> ... while setting up huge MySQL installations with a dozen+ replicas ...
If you're careful about your function calls, sure. Include a UUID() or SYSDATE() in your INSERT statements, and the replication goes to pieces as the function calls are run separately on each slave (with differing results).
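To make that concrete (table name is made up):

    -- Under statement-based replication the INSERT is replayed as text on each slave,
    -- so UUID() and SYSDATE() are evaluated again there, with different results:
    INSERT INTO events (id, created_at, note) VALUES (UUID(), SYSDATE(), 'hi');

    -- Safer: compute those values in the application, or switch to row-based
    -- replication so slaves receive the actual row images instead of the statement:
    SET SESSION binlog_format = 'ROW';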
Doing compression on the ZFS level is significantly worse than InnoDB compression.
InnoDB has a lot of really smart optimizations which make it much better than just zipping things up. Included are the modification log (so you only have to re-compress occasionally) and a dynamic scaling ability to keep compressed pages in memory rather than always decompressing. These optimizations are really only possible with an understanding of the data.
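For reference, it's a per-table declaration, roughly like this (table name and block size are illustrative; it needs innodb_file_per_table and the Barracuda file format):

    CREATE TABLE order_history (
        id      BIGINT PRIMARY KEY,
        payload TEXT
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;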
I would only consider ZFS for something like an append-only data warehouse type system.
I'm sure you would want the compression done by the database and not the filesystem, since there are many ways to do compression to fit specific applications, and the database knows what it's trying to do. I read a little bit of the MySQL docs regarding how it uses compression, and it sounded pretty different than general-purpose compression.