Evernote blog: WhySQL? (evernote.com)
150 points by grifaton on Feb 23, 2012 | 73 comments



Does anyone have a link to a decent comparison between MySQL and PostgreSQL? I'm really wondering why so many people use MySQL, even though it supports a lot fewer SQL features than PostgreSQL.


EnterpriseDB (who sell a commercialized version of Postgres) have a couple of MySQL vs Postgres white papers on their site, but hidden behind a registration wall: http://www.enterprisedb.com/resources-community/whitepapers-...

Robert Haas, a Postgres committer, occasionally blogs about comparisons: http://rhaas.blogspot.com/search/label/mysql

There's also http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL

Historically, MySQL has been more widely available on low-end web hosting plans, so it's what a lot of people first use when they start using databases, and a lot of web apps, such as WordPress, support it exclusively.

Until a year or two ago, only MySQL had built-in (if occasionally fragile) replication, which made it popular for that reason alone. Postgres now has robust replication, with new features coming down the pipeline soon: http://www.depesz.com/2011/07/26/waiting-for-9-2-cascading-s....

I prefer Postgres, but oddly enough, under Oracle there have been some interesting features added to MySQL, which is good for both.


Postgres is also not (yet, hopefully?) available in Amazon's RDS, probably again due to the replication topic.


Pure inertia. Over time, a lot of people ended up on the LAMP stack, and they continued to use MySQL out of familiarity.

Also, MySQL had a commercial entity behind it in its early days, which promoted it a lot. In addition, it worked on all platforms, including Windows, while Postgres only got there in the last couple of years.


I posted this elsewhere in the thread, but I ran some benchmarks on schema changes between the two on 5 million row tables (when deciding to switch), and PostgreSQL completely clobbered MySQL performance-wise: https://gist.github.com/1620133. As we recently switched to PostgreSQL, one gotcha is that PostgreSQL uses a separate process instead of a thread for each connection, so establishing new connections is slower and more memory-intensive; connection pools held application-side (such as PgBouncer) are incredibly important.


Simple reliable replication has been a huge differentiator for a long time; enough so to put up with a lot of the other faults of MySQL. I have not revisited Postgres replication in a long time, but I have seen that it has been worked on. Anyone with recent experience in both care to explain how their replication stacks up in recent versions?


Simple reliable replication

I cringe every time I read that. MySQL replication is many things, but it is not reliable (as anyone who has used it at scale will confirm).

I think the only reason this myth prevails is because hardly anyone ever actually verifies if their master/slave are in sync. A table checksum can be a real eye-opener here, especially on a deployment that's been running for a while and undergone schema changes, restarts, network splits, etc.
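
For example, a minimal sketch (using a hypothetical notes table): run the same statement on the master and on each slave, then compare the results.

  -- MySQL: checksum the table contents; run on the master and on
  -- each slave. Any mismatch means a slave has silently drifted.
  CHECKSUM TABLE notes EXTENDED;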


<rant>Simple, but not reliable. I've seen admins enable statement-based replication without understanding it, and trash the db. Which is generally my gripe with MySQL: it has some popular features that only work if you don't look at them too closely; starting with support for the SQL standard.</rant>

PostgreSQL's built-in replication is pretty easy to set up[1] and provides a writable master and a cascade of slaves. Slaves can be synchronous or asynchronous, and the synchrony can be turned off per transaction.

[1] http://www.depesz.com/2011/01/24/waiting-for-9-1-pg_baseback...
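
A minimal sketch of that last point (PostgreSQL 9.1+, assuming a synchronous standby is already configured, and a hypothetical audit_log table): a single transaction can opt out of waiting for the standby.

  -- Wait for the local WAL flush, but not for the standby,
  -- for this transaction only; everything else stays synchronous.
  BEGIN;
  SET LOCAL synchronous_commit TO local;
  INSERT INTO audit_log (event) VALUES ('page_view');
  COMMIT;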


In addition to the other factors listed here, Postgres's default configuration was tuned to a dramatically underpowered machine for many years. Yes, that meant that the occasional user who did have that kind of machine saw acceptable performance out of the box, but the other 9 out of 10 systems burned a lot of the DBA's time tuning the configuration. I think that's a big reason why Postgres has a reputation for difficulty in some quarters.


For a long time there was no PostgreSQL for Windows, so Windows developers had to use MySQL.

MySQL is good enough that there is no need to migrate applications for a few additional features.


MySQL was very, very easy to get started with, getting you up to cruising speed quickly. PostgreSQL offered more of a learning curve.

In the Windows world the same is true of SQL Server -- the setup, connectivity, and basic usage are so incredibly easy that it became the first choice of many teams.

This seems incredible -- that products are chosen on such an irrelevant-in-the-long-term basis -- but it has proven true across almost all of the computing market, even targeting highly skilled developers. PHP has few competitive merits, yet it was the default option for many because it was so easy to make something basic in.

There's a lesson in that.


This seems incredible -- that products are chosen on such an irrelevant-in-the-long-term basis

I don't think there is anything too incredible in that. If you want to throw together an idea quickly, get it out there, and test the response, then use whatever technology gets the job done quickest. You can always change later.

Why waste huge amounts of time setting up a technically perfect database for a product it turns out no-one wants?


The difference between getting competent with PostgreSQL versus MySQL was just a few hours. On the scale of a project such a difference dissolves into complete irrelevance, yet it was enough to sway many to use MySQL when it was severely deficient comparatively (though with its adoption it saw love that brought it up to, if not beyond, parity).

The same is true with many technologies and approaches. Projects that consume thousands or tens of thousands of hours, with a toolset chosen because it represented an upfront savings of single-digit hours.

Changing in the future is seldom as easy as it seems in those early days.


The difference between getting competent with PostgreSQL versus MySQL was just a few hours.

Yes and no. There are vastly more hosting options that provide MySQL vs Postgres. So it's not an issue of getting competent with the DB system, it's an issue of using an existing LAMP stack or having to roll your own.


Very interesting to see them bucking the trend; I love the closing line:

"But we’re relatively satisfied with sharded MySQL storage for Evernote user account metadata, even though that’s not going to win any style points from the cool kids."

Indeed, hipsters beware!


There are other reasons to choose NoSQL. For example, when Craigslist was using MySQL and had to change their schema, it took MONTHS to propagate the change across all their slaves. You can also use a mixed strategy of both an RDBMS and NoSQL to get consistency while staying flexible to architecture changes. Lastly, have you looked at total overall cost? Setting up a large cluster with MySQL has a large operational cost, and it may not be partition-tolerant, so if the wrong servers go down, the failure may cascade to your whole data store.


Doing data migrations up front in a NoSQL system means you've changed the fields around in your document types, and then you're done. Now 95% of your documents are wrong, missing those changes. Maybe you have linkages between different types of documents (akin to a foreign key); now a lot of those might be pointing to nothing.

In this model, it's the application's job to anticipate and work around these inconsistencies. Assertions that check for data integrity, if you have them, have to be modified to work around this. For a lot of the web applications we talk about these days, who cares - it's people's lists of friends and TODO notes. It's simple data and some dangling records aren't going to hurt anyone.

In the SQL world, we instead write migration scripts that migrate the structure and data all at once. This is a little more work up front, but as long as you stick to a mostly normalized form and use good tools (which we now have), it is not a big deal ("MONTHS" to migrate across slaves sounds like some very wrong decisions were made earlier on). The application can move straight to supporting the new structure and entirely forget that the old one existed. In this world, we can also have really complex relationships between fields, like when we're storing accounting or medical data linked to temporal records of changes. The application can consume this structure without worrying about consistency.
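
A minimal sketch of such a script (PostgreSQL flavor, with a hypothetical notes table whose free-text status column is being normalized into a lookup table); because DDL is transactional here, the structure change and the data backfill commit atomically:

  BEGIN;
  -- New lookup table, populated from the existing data.
  CREATE TABLE statuses (id serial PRIMARY KEY, name text UNIQUE NOT NULL);
  INSERT INTO statuses (name) SELECT DISTINCT status FROM notes;
  -- Point notes at the lookup table, backfill, drop the old column.
  ALTER TABLE notes ADD COLUMN status_id integer REFERENCES statuses (id);
  UPDATE notes SET status_id = s.id FROM statuses s WHERE s.name = notes.status;
  ALTER TABLE notes DROP COLUMN status;
  COMMIT;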


I think the problem is that when you change a table structure in MySQL, it takes quite a bit of time to do the actual writes. Adding a column tends to force a rewrite of the table. Braindead, I know...

I think the problem is that MySQL isn't really a standard SQL-world db. It has some of the advantages of one, but not all of them and some very annoying gotchas.


> The application can move straight to supporting the new structure and entirely forget that the old one existed.

Not always. Changing the schema can break an application, in particular when the database supports multiple applications.

> It's simple data and some dangling records aren't going to hurt anyone.

It depends on the application. In healthcare, people can literally die due to dangling, inconsistent records.

The problem with anticipating and working around these inconsistencies is that the workaround is added ad hoc in code, rather than through a model defining the data change. You need a model: http://chronicdb.com/preparing_schema_changes


PostgreSQL can do the majority of migrations you would need instantaneously (adding nullable columns, adding/dropping indexes, etc.). MySQL suffers roughly O(rows) time when changing the schema, but with PostgreSQL now at feature parity for almost everything else, there's little reason to go NoSQL over something like Postgres if you are worried about schema changes. With respect to fault tolerance, this is only a problem if your dataset outgrows a single master (pretty rare) and you haven't figured out a proper way to shard (most places shard on users; it's pretty hard to believe anything is un-shardable).
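
For instance (a sketch with a hypothetical notes table), this is a catalog-only change in PostgreSQL and returns immediately regardless of row count, while MySQL of this era rewrites the whole table for the same statement:

  -- PostgreSQL: adding a nullable column touches only the system
  -- catalog, so it completes in milliseconds even on huge tables.
  ALTER TABLE notes ADD COLUMN archived_at timestamp;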


I don't want to presume that I'm smarter than Craigslist's engineers, but why is this so hard? Schema changes should be captured in migrations, which are programs, and could be coordinated to run in parallel on thousands of machines in a moment with a shell script over SSH.


The fact that ALTER TABLE locks on most databases is not the primary reason changing a schema is hard. The primary reason is that a schema change breaks the application [1].

Migrations are programs, but your existing live application is also a program. Changing the schema breaks your live application to prepare for the new version of the application.

[1] - http://chronicdb.com/blogs/change_is_not_the_enemy


This is true for NoSQL too. If you change the format of your documents, your applications might need to be modified to support both the old and the new format at the same time.


MySQL had some historic issues with schema changes -- adding a column to a table with a few million records could take hours. While it does have a lot of work to do -- it is restructuring every page of the table -- in competitive products such a change takes a minute fraction of the time (limited only by I/O performance).


But altering a table would still block it? And possibly even other tables as well? And take a long time?

I've only ever done this in MySQL, where you really can't alter big, live tables. Well, Facebook made a utility for it: http://www.facebook.com/notes/mysql-at-facebook/online-schem...


Most changes would still block, yes, but at least in PostgreSQL many changes will only block for a short time (since they do not require a full table rewrite, only altering the table description), and in my experience rewriting a table is generally quick for a decent-sized database (tables with a handful of millions of rows).


I see. That makes much more sense. Yes, I suppose I'm used to DB servers that are more nimble in that department.


NoSQL does not support flexibility of schema change.

http://chronicdb.com/blogs/nosql_is_technologically_inferior....

Flexibility of schema definition and flexibility of schema change are two different things. Defining schemas only involves data. But changing schemas involves not just data, but code too.


There are other reasons to choose NoSQL. For example, when Craigslist was using MySQL and had to change their schema, it took MONTHS to propagate the change across all their slaves.

That isn't a reason to choose NoSQL. That's a reason to-

-doubt MySQL -- many of the purported downsides of SQL solutions (for instance the ridiculous "avoid joins" meme that has zero bearing on any good database product) are actually MySQL problems. Or rather, they were -- the product has made some pretty incredible strides.

-understand and embrace normalization. Most of the "we keep trying to change our schema and SQL is just so restrictive" stories could often be described as "the problems with denormalization".


It helps that they have a perfectly shardable product, I guess.


Yes, if you're Google (where any page can link to any other) or Facebook (where a person can friend any other), you wouldn't have this. But lots of businesses that provide software solutions do usually have something that is very localised.


It's a bit funny that you use Facebook as an example since they use sharded MySQL as their primary data store.


tldr - it works, we don't care about "being cool"


The true hacker style


Can anyone explain the following bit: "They're cleanly partitioned into 20 million separate data sets, one per user."

Does it mean they have a database per user? That can't be right, can it?


It means that the data from one user has no relations to the data from another user. So most, if not all, of their queries only touch the data of one user.

This is really useful for things like sharding, where you split a database table across more than one machine, because few queries will stall fetching data from one machine to another.
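
A minimal sketch (hypothetical notes table sharded on user_id): since every query carries the shard key, the application can route each one to exactly one machine.

  -- The user_id predicate is the shard key, so this query is
  -- answered entirely by the single shard holding user 12345.
  SELECT id, title, updated_at
  FROM notes
  WHERE user_id = 12345
  ORDER BY updated_at DESC;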


Why not have a database for each user? Evernote's data is partitioned perfectly for that. Notebooks and notes are accessible to one user or are public. There is no sharing of notes between users.


There is sharing of notes between users though - I have several shared notebooks, each holding shared notes.


Actually, his idea of a database for each user could still work, even though there is sharing of data between users. Take each database, and turn it into an executable object, which reads/writes its own data, and which communicates with other objects for sharing. It's like taking the actor model of computation, and orienting it for database use. I don't know of any working example of where this has been done, but I don't see why it wouldn't be feasible.

http://en.wikipedia.org/wiki/Actor_model



Their SQL is executed per user, so they only touch around 1/20M of the size of the database for any request.


Not likely, but they could very easily have a database holding all users whose ID starts with 'a'.


That's a really bad method of sharding. Names do not distribute evenly over the alphabet.


I get where they're coming from, but I do find this to be a little smug :)

See, they haven't run into problems with their setup; as they put it, MySQL 'just works' for them.

What would be interesting and educational (for me anyway) would be hearing from folks who ran into serious problems with their SQL setup despite doing the 'right things', and persevered where conventional wisdom would have them switch to a NoSQL solution.

tldr; dog-bites-man article. Would love to hear from someone who actually struggled with a SQL solution and soldiered on.


The notebook/note example is weak - in a NoSQL database you need to design your data structure appropriately to get the level of atomicity you require.

Storing an entire notebook in a single document would be the most obvious. I use Postgres all the time and SQL is great, but pooh-poohing NoSQL because it wouldn't work with your relational structure is not the best idea. Also - I have found a hybrid between NoSQL (MongoDB) and SQL (Postgres) is ideal - who says you need to use a single database?


What about when you want to find all notes that were made on a specific day last month (say, for a report)?

Traverse all notebook documents and look at each note's date? Good luck with that.


Many NoSQL databases support queries. If you are using one of those, then you are in a better place for features like that than you are with heavily sharded SQL.


If you were to use MongoDB, wouldn't this just be a case of adding an index to the field of the nested child document (i.e. the note within a notebook) you were interested in and then querying on it? e.g. db.notebooks.ensureIndex({"notes.date": 1});


Well, taking a consistent snapshot for backups is easier when it's in a single source. I know there are ways around this (ZFS!), and not everyone needs backups synced to the millisecond, but it can complicate backups (or, more to the point, restores).


Storing an entire notebook in a single document would be the most obvious.

The cost, of course, being that a change to any note in a notebook yields a save of the entire document. Not a problem in simple cases, but that sort of mass write amplification can kill you (talk to Digg about that).

Also - I have found a hybrid between NoSQL (MongoDB) and SQL (Postgres) is ideal - who says you need to use a single database?

Simplicity. Coherency. Maintainability. And so on. Sure, it might make sense, but if you already have your toes in the "SQL" world, it is usually worthwhile to dunk your whole foot in. Many SQL products also offer the document functionality of MongoDB, for instance. SQL Server, as an example, lets you store XML documents to your heart's content, which you can index and intelligently query upon, etc. Your schema is boundless, and on and on.
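
For instance, a minimal T-SQL sketch (hypothetical notes table): documents live in an xml column and can still be indexed and queried in place.

  -- SQL Server: store free-form documents in an xml column...
  CREATE TABLE notes (id int PRIMARY KEY, doc xml);
  -- ...and still reach inside them with XQuery methods.
  SELECT id, doc.value('(/note/title)[1]', 'nvarchar(200)') AS title
  FROM notes
  WHERE doc.exist('/note[@starred="true"]') = 1;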


If the replication is asynchronous, then SQL databases are not durable. So the most important feature of SQL databases generally isn't one.


I can imagine that while Evernote has a lot of data to store, it doesn't have the massive amount of concurrent reads that might occur with an equally large web app. Do they publish numbers on read/write usage?


The biggest news to me was that they are using MySQL.


It's amazing how when you focus on proven (but supposedly boring or old) technology that just works, and works very well, you can devote a lot of other resources to the actual product and usability.

Maybe it's the 30-year-old in me showing, but I'm sticking with the 'it just works' crowd, until some other approach provides a staggeringly overwhelming reason to switch. I find scaling up with MySQL to be ridiculously easy, allowing me to focus my time elsewhere. RAM, bandwidth, and fast storage have gotten substantially cheaper in the last few years, making it that much easier and more cost-effective to throw hardware at scaling up. For 99.9% of the Web, those hardware resources are expanding in value much faster than traffic is increasing.

(It's understood other developers find it just as easy to take a different approach)


It's amazing how when you use recent technology that has been designed and engineered from the ground up for the demands of modern web apps, you can devote a lot of other resources to the actual product and usability.


Modern web apps are applications. Applications, over time, run into many problems that SQL database systems were designed to address:

* The occasional join.

* Complex queries.

* Reporting.

* Schema changes.

* Transactions, with options for trading off strictness and performance.

* Performance problems that are easily addressed using stored procedures.

* Performance tuning with minimal code changes (e.g. adding an index to change the execution plan of a complex query; see the sketch below).

* Enforcement of invariants regardless of application code.

Why would you give these up for scalability problems you won't have, with 99.999999% probability?
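
A minimal sketch of the index point (hypothetical orders table): one DDL statement changes the plan of an existing query, with zero application changes.

  -- Before the index, EXPLAIN shows a full table scan.
  EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
  -- One statement later, the same query uses an index scan.
  CREATE INDEX orders_customer_idx ON orders (customer_id);
  EXPLAIN SELECT * FROM orders WHERE customer_id = 42;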


"Why would you give these up for scalability problems you won't have, with 99.999999% probability?"

I'm not saying people shouldn't use RDBMSs, but there are many reasons for using something else:

  - The new databases/stores have different features and use cases. Many include features RDBMSs do not have.
  - Having to use tables/columns for everything can be quite unnatural and tiresome.
  - RDBMSs are battle-tested and their pros & cons are well known. But they might also be based on legacy models and truths that simply no longer hold.
  - At least for me, using & learning something new is a big motivation booster :)
Use whatever is right for the job. But one should at least have a general knowledge of what's out there: features, pros & cons, etc.


"The new databases/stores have different features and use cases. Many include features RDBMS' do not have."

True. My view is that the tradeoff comes out overwhelmingly on the RDBMS side most of the time.

"Having to use table/column's for everything can be quite unnatural and tiresome."

Unnatural in the sense that it is different from the first language you learn, yes. What I find tiresome is low-level, record-at-a-time programming when doing what could be expressed concisely as, say, a 5-way join with aggregation.
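
As a concrete sketch (PostgreSQL flavor, hypothetical users, notebooks, notes, note_tags, and tags tables), the kind of question that takes a page of record-at-a-time code is one statement in SQL:

  -- Note counts per user and tag over the last month: a 5-way
  -- join with aggregation, replacing a nest of application loops.
  SELECT u.name, t.name AS tag, count(*) AS notes
  FROM users u
  JOIN notebooks nb ON nb.user_id = u.id
  JOIN notes n ON n.notebook_id = nb.id
  JOIN note_tags nt ON nt.note_id = n.id
  JOIN tags t ON t.id = nt.tag_id
  WHERE n.created_at > now() - interval '1 month'
  GROUP BY u.name, t.name;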

"RDBMS' are battle-tested and their pros & cons are well known. But they might also be based on legacy models and truths that simply no longer holds."

I really think you need to learn something about RDBMS fundamentals. Set theory certainly "holds" to this day, and is certainly not legacy. The relational model is clumsy for some kinds of data (e.g. time series), but saying that the foundation is obsolete is just objectively false.

"At least for me, using & learning something new is a big motivation booster"

No argument there. But now take the next step and do an objective comparison between old and new. I have no problem with NoSQL systems. I even built one. But for nearly all applications, when you account for a wide set of requirements (not just what is needed to get the MVP up and running), I think an RDBMS is the right tool.


"What I find tiresome is low-level, record-at-a-time programming when doing what could be expressed concisely as, say, a 5-way join with aggregation"

Again, that depends on what you are doing. Converting back and forth between rows/tables and, say, data stored in hashes, is tiresome. It's also unnatural when the data doesn't "fit" in the relational model. Of course it works, but writing all this banal SQL is not fun.

"I really think you need to learn something about RDBMS fundamentals...."

Right.. I was talking about the software. Maybe "models" is the wrong word.

"But now take the next step and do an objective comparison between old and new"

I believe it's pretty objective already.


The funny thing is hierarchical databases were all the rage in the 1960s (IBM's IMS, for example), and the world moved on to the RDBMS.

I chuckle to myself when kids who wouldn't know a "legacy" app if it jumped up and bit them on the nose, go through the same thought process over and over again.


Is it possible that

a) not all code written is for web apps

b) said recent technology also only fits a niche for modern web apps?

I think your view on what constitutes modern web apps, or software in general, is biased in favor of some examples that benefit from using mongo/couch/redis or whatever. (And are very vocal about it.)

[EDIT: spelling, and to add that I take tremendous joy in playing with couch, redis, and node lately. But at the same time I try to stay critical: what are the real trade-offs when I employ these tools?]


the demands of modern web apps

Hmm, let's see. What does a "modern web app" do? Well, there is a screen with fields for users to enter something, and there is another screen where things that (other) users have previously entered are formatted nicely for display. Behind the scenes, this data is exchanged between machines that do some other processing on it.

They were doing this in the '60s.


Er... you seem to be implying that people are using NoSQL solutions like Redis, Tokyo Tyrant, Elastic Search, etc. for fun, but that's not the case.

Sometimes you just pick the simplest thing that works, and that can be a key-value store, data structures in memory, or a Lucene interface. It's a better solution, and less fragile, than trying to force everything into a relational model.


No, I think he's implying that people are using NoSQL solutions because they've been hyped a lot and are fashionable. And that's most certainly the case (certainly not the only reason in all cases, but most certainly a reason in many cases).


I must admit that I picked MongoDB (+ Node) for my latest project for fun. But I then discovered that being able to dump JSON straight into a datastore without having to create tables etc. first was a great benefit in quickly building something out.

So something I did for fun turned out to be the simplest thing which worked. Only I didn't know that when I did it. I think developers who don't occasionally pick something new for fun on a side project are doomed not to know when that new thing could actually be applicable on a project that matters.


MongoDB isn't the best example to use as a generalization of NoSQL.

Many NoSQL databases have a structure you have to conform to; it's just not a relational structure.


I'd actually like to do that too, but have a hard time wrapping my head around the notion of building a website with Node + NoSQL alone. My understanding is that Node is best for realtime chat-like apps... Is your project web-related, might I ask?


Node is very handy for high-transaction, short-transfer, high-connection-count applications like chat. That's because of the non-blocking, single-threaded architecture, low-overhead connections, and little bloat (small footprint).

However, that doesn't mean it isn't well suited for building a website. You'd probably want to include some add-ons for that, though (e.g. Express).

Note that even for realtime chat-like apps you'll probably need some kind of datastore, and in fact that example sounds like a great use case for NoSQL as a datastore.


Thank you for mentioning express, I'm now excited to use it in my next project!


Hi, sorry for the slow reply. As mentioned express is a good way to go for getting a website up with node.

Since you asked, my project is over here http://jobstractor.com if you want to take a look. My email is on the site so feel free to drop me a line if you have any questions. I'm planning on writing up my experience and a bit about how I've structured things so some questions might help me think through that.


That was the worst "why we still use mysql"-post ever.


This article is very weak. They insist a lot on ACID, but that is a concept completely orthogonal to SQL. Most NoSQL products are ACID.

Also, the example itself is very weak, as bitdiffusion pointed out below.


Good point about SQL not having a lock on ACID. CouchDB in particular is very proud of its out-of-the-box ACID compliance.

People sometimes conflate the DB access approach (document vs relational) with the storage approach (transactional vs warehouse). This may be because the NoSQL poster child, MongoDB, at one point defaulted to a non-ACID mode of operation. But you can have a relational DB that's not ACID (MySQL 3) and an object DB that is (Couch).

I was surprised so much of the original article focused on ACID as though it were the biggest selling point for an RDBMS. It seems like the biggest win (right now) is the sheer number of things a typical RDBMS does for you -- not just ACID but also data integrity (foreign key constraints), automatic index creation (mostly), and automated schema changes across many records (ALTER TABLE). The cost, of course, is the up-front effort of fitting your data and app to the relational model.



