Re: Why Uber Engineering Switched from Postgres to MySQL (ayende.com)
215 points by GordonS on July 28, 2016 | 67 comments



Explanation: Ayende is a well-known programmer in the .NET community. Among other things he works on RavenDB (an open-source NoSQL database for .NET) and is currently working on a new storage engine for it called Voron. In this post he discusses Uber's blog post about Postgres and MySQL and compares the low-level workings of both with what his team is doing with RavenDB and Voron.


I was recently informed [0] about Marten [1,2], which is like RavenDB but uses PostgreSQL for the backend.

[0]: https://github.com/jpsingleton/ANCLAFS/pull/12

[1]: https://jasperfx.github.io/marten

[2]: https://github.com/jasperfx/marten


Also, he has published many reviews of database code. In particular, he has 18 blog entries on the leveldb C++ code base: https://ayende.com/blog/161410/reviewing-leveldb-part-i-what...


Thank you for explaining what Voron is.


BTW - "voron" is raven in Russian.


The article talks about Postgres mostly delegating caching to the OS and dealing with it via system calls (which it describes as "expensive"), whereas InnoDB has its own implementation and is therefore faster.

I wonder if part of the problem is really Linux, not Postgres, and whether another way they could have addressed this would be to try FreeBSD (which is what a lot of people run Postgres on).


>I wonder if part of the problem is really Linux, not Postgres

I'm not a PostgreSQL expert, but it seems like it has a different philosophy regarding OS filesystem buffering.

Using Oracle and MS SQL Server as counterexamples: you can configure Oracle to bypass OS filesystem buffers by using "raw" disk I/O. MS SQL Server opens data files with CreateFile(,,,FILE_FLAG_NO_BUFFERING) to bypass the Windows NTFS caching. (Microsoft recently ported SQL Server to Linux, so I'm not sure what combination of techniques they are using there, since CreateFile(FILE_FLAG_NO_BUFFERING) is a Win32 API and not relevant to Linux.)
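A minimal sketch of what such an unbuffered open might look like on Win32 (the wrapper name is mine; FILE_FLAG_NO_BUFFERING requires sector-aligned buffers, offsets and lengths):

    /* Hedged sketch: opening a data file with the OS cache bypassed, as the
       comment above describes for SQL Server on Windows. The wrapper name is
       made up; the flags are the real Win32 ones. */
    #include <windows.h>

    HANDLE open_unbuffered(const wchar_t *path)
    {
        return CreateFileW(path,
                           GENERIC_READ | GENERIC_WRITE,
                           0,                    /* no sharing */
                           NULL,                 /* default security */
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                           NULL);
    }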

The closest approximation PostgreSQL has is the "shared_buffers" parameter, but from the documentation[1] I've read, the best practice for that setting is 25% to 40% of main memory. In contrast, Oracle and MS SQL Server, with their self-managed buffer pools, are designed to use almost 100% of main memory.

Conclusion: PostgreSQL relies on the OS filesystem cache, and its own buffer pool does not supersede the OS buffers. Other database engines' buffer pools are designed to bypass the OS filesystem cache.

Installing PostgreSQL on FreeBSD instead of Linux wouldn't drastically change the recommended range for "shared_buffers".

[1]https://www.postgresql.org/docs/9.1/static/runtime-config-re...


>Microsoft recently ported SQL Server to Linux so not sure what combination of techniques they are doing on there

Open options:

        int f = open("your file", O_DIRECT|O_DSYNC|O_RDWR);
O_DIRECT signals that a file descriptor should bypass kernel-level caching and write directly to the device.

O_SYNC signals that calls on a file descriptor should not return until all data and metadata have been synced to disk.

O_DSYNC does the same as O_SYNC but doesn't force all metadata to be synced before the call returns, only the data and whatever metadata is needed to read it back.

Related Stack Overflow https://stackoverflow.com/questions/5055859/how-are-the-o-sy...

Related LWN article https://lwn.net/Articles/457667/
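To make the alignment requirement concrete, here is a hedged sketch of writing one page through such a descriptor; the 4096-byte alignment and the helper name are assumptions (the real requirement depends on the device and filesystem):

    /* Sketch only: O_DIRECT I/O must use suitably aligned buffers, and usually
       aligned offsets and lengths as well, hence posix_memalign. */
    #define _GNU_SOURCE           /* O_DIRECT on Linux/glibc */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int write_page_direct(const char *path, const void *src, size_t len, off_t off)
    {
        int fd = open(path, O_DIRECT | O_DSYNC | O_RDWR);
        if (fd < 0)
            return -1;

        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0) {   /* aligned staging buffer */
            close(fd);
            return -1;
        }
        memcpy(buf, src, len);

        ssize_t n = pwrite(fd, buf, len, off);        /* write at an explicit offset */

        free(buf);
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }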


Ironically, in another "universe", StackOverflow doesn't have either part of this problem - they use the DB recommended by Microsoft and the OS recommended by Microsoft :)

http://nickcraver.com/blog/2016/02/17/stack-overflow-the-arc...


A userspace I/O cache is always going to be faster than the Linux one, but that is not a problem with Linux per se. The userspace implementation not only has far more context about the workload with which to make decisions, but the design of the underlying I/O cache itself can also be perfectly tuned for the expected and desired behaviors a priori. The Linux buffer cache can't reasonably be designed for a single application. This has large performance effects.

By analogy, it is similar to how a purpose-designed hash map will almost always significantly outperform a generic hash map. There are enough design knobs that perfect foreknowledge of the use case allows significant optimization of the implementation.
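For a concrete illustration of that analogy, here is a minimal sketch of such a purpose-built map; the fixed capacity, integer keys and hash constant are all invented for the example:

    /* Sketch only: when the key type, maximum size and access pattern are known
       up front, a fixed-capacity open-addressing map with a power-of-two size
       (mask instead of modulo, no resizing, no allocation) can beat a generic
       one. Caller must zero-initialize the struct. */
    #include <stdint.h>

    #define CAP 1024                      /* fixed power of two */

    struct fixed_map {
        uint64_t keys[CAP];
        uint64_t vals[CAP];
        uint8_t  used[CAP];
    };

    static int fixed_map_put(struct fixed_map *m, uint64_t key, uint64_t val)
    {
        uint32_t h = (uint32_t)(key * 0x9E3779B97F4A7C15ull);   /* cheap integer hash */
        for (uint32_t i = 0; i < CAP; i++, h++) {
            uint32_t slot = h & (CAP - 1);                      /* mask, not modulo */
            if (!m->used[slot] || m->keys[slot] == key) {
                m->used[slot] = 1;
                m->keys[slot] = key;
                m->vals[slot] = val;
                return 0;
            }
        }
        return -1;                        /* full: capacity was sized wrong */
    }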


I would say this is not just a case of being faster, but of being more aware.

Linux does not have the same metadata knowledge that the native cache has. There is a layering problem which makes it hard to add this. For example: after some page splits and fragmentation, pages are no longer stored in physical order. How can logical read-ahead work at the Linux level?


The article mentions Postgres using many lseek and read calls. Does anyone know why Postgres doesn't use pread to cut the number of context switches in half (if, as the author says, the number of context switches is such a bottleneck)? On systems where pread doesn't exist, presumably they could use an inline function that emulates pread with lseek and read.

Yes, mmap would presumably be preferable in most situations, but pread would seem to be universally preferable to an lseek and read. I assume the kernel keeps per-fd statistics on sequential reads anyway, so for long-lived fds, lseek presumably wouldn't provide any better read-ahead hinting than pread.
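A hedged sketch of the fallback suggested above (the function name is mine); note that the real pread() is atomic with respect to the file offset, while this emulation is not:

    #include <sys/types.h>
    #include <unistd.h>

    /* Emulate pread() with lseek()+read() on platforms that lack it:
       one extra system call, but the same interface. */
    static ssize_t my_pread(int fd, void *buf, size_t count, off_t offset)
    {
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
            return -1;                     /* seek failed */
        return read(fd, buf, count);       /* read at the requested offset */
    }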


The problem is fundamental to anything related to system calls. The only way to solve it is to somehow make system calls less expensive than they are now.


I believe that's what the parent comment was implying - that FreeBSD system calls are faster than Linux system calls.

I have no idea if this is true or not though.


Pretty much every modern DB bypasses the OS for file caching, so why would Postgres on FreeBSD behave differently?


He is claiming FreeBSD provides a better environment than Linux, not that PostgreSQL behaves differently.


> a lot of the pain the Uber is feeling has to do with the way Postgres has implemented MVCC. Because they write new records all the time, they need to update all indexes, all the time

This is not true in general: there are circumstances where Postgres can skip updating indexes when a tuple is updated, assuming the update only changes values in non-indexed columns. See:

http://pgsql.tapoueh.org/site/html/misc/hot.html

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...


The Uber article referenced makes it extremely clear they are working on indexed columns - the examples are all about updating the secondary indexes.


No: the example updated an indexed column (which frankly is likely rarer in practice but makes for a simpler example), but the serious write amplification comes because of the other indexes that had to be updated; in their example they change the birth year (so obviously the index has to change!) but then the name index also gets updated due to the new row location. HOT mitigates this issue.


I'm curious about the particular use case that requires slamming indexed columns that often.


Imagine the Uber `Rides` table, indexed on `status`

Every time a new order is created the status flows from

* unassigned

* assigned

* passenger on board

* completed

That should really be an event stream, but if you want to find all `unassigned` orders, an index would be good.

We had a similar problem with this kind of table, at much lower volume than Uber.

But only when there was a two-column index:

`status, pickup time`

or

`status, created at`


> That should really be an event stream, but if you want to find all `unassigned` orders, an index would be good.

No, an index is not good for that, because it only has 4 values (at most 4 times faster than a full table scan), and it hammers the index on every update, whether you use PostgreSQL or not.

Same for the two-column index: single-column indexes on pickup time and creation time are highly selective, and their values are basically never updated, whereas status has low selectivity and is updated all the time. Combining them, you don't significantly improve selects on the two columns, and you limit the performance edge of selects on time alone.

A status index only performs well when you select one of the values that is significantly underrepresented in the column (e.g. everything except completed). The best-performing approach, without doing black magic, is to split it into columns with separate indexes: Assigned Time, Pickup Time, Completed Time. If you want the unassigned ones, you select where Assigned Time IS NULL, and the same for the other columns. Update performance on most RDBMSs is much better, as index updates are now distributed among several indexes, allowing higher concurrency and overall throughput.


Discussion on original blog post was yesterday: https://news.ycombinator.com/item?id=12166585



More info: a presentation by Matt Ranney (Senior Staff Engineer at Uber), "Matt Ranney at All Your Base 2015": https://vimeo.com/145842299 (30 min)

Keywords: Uber + PostgreSQL + Chaos Monkey-style failure testing

"WHAT WILL I LEARN? After this talk, you'll be able to better assess the risk from the different failure modes of databases in your system's architecture."


If you want performance and the data is not critical (like a Facebook post), you need to use MySQL. But if you deal with money, like a bank or an ERP, then the way to go is PostgreSQL. This should be a known fact among software developers.


As someone unfamiliar with the inner workings of both, why is that the case?


Because guaranteed transactions matter.

Moving $10 from Bob's account to Jack's involves:

1. Inserting a transaction to remove $10 from Bob's account

2. Inserting a transaction to add $10 to Jack's account.

Both 1 and 2 must succeed together; if either of them fails, both must fail.

You can imagine the disaster in banking apps, ERPs and other financial systems when 1 succeeds but 2 fails and the database is happy with that imbalanced state.
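A minimal sketch of that "both or neither" property using libpq from C; the table and column names (transfers, account, amount) are made up for illustration:

    #include <libpq-fe.h>

    /* Run one statement and report whether it succeeded. */
    static int exec_ok(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);
        int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
        return ok;
    }

    int transfer(PGconn *conn)
    {
        if (!exec_ok(conn, "BEGIN"))
            return -1;

        if (!exec_ok(conn, "INSERT INTO transfers (account, amount) VALUES ('bob', -10)") ||
            !exec_ok(conn, "INSERT INTO transfers (account, amount) VALUES ('jack', 10)")) {
            exec_ok(conn, "ROLLBACK");     /* neither insert becomes visible */
            return -1;
        }

        return exec_ok(conn, "COMMIT") ? 0 : -1;   /* both inserts or none */
    }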


MySQL has transactions, so why does this example prove that it's unsafe for critical applications?


MySQL does not add many new features because it was bought by Oracle, but they do improve performance. It is an easy-to-use, minimal database. The InnoDB engine is ACID-compliant, but MySQL's original engine, MyISAM, is not. Why not? To improve performance.

On the other hand, PostgreSQL has been ACID-compliant from the beginning, and it adds new features every few months. Some features that do not exist in MySQL are: an IP address data type, serial fields (multiple auto-increments), row-level permissions, geographical data types, inheritance, an XML data type, custom data types, and range data types. There are more I don't know about, but I do know it provides many features of the SQL language such as INTERSECT, EXCEPT and merge joins. It is like a free Oracle.

ACID: https://en.wikipedia.org/wiki/ACID


I get why those features are useful, but I don't see how that supports the claim that you should only use MySQL for things that aren't critical (implying that it will randomly produce incorrect results or similar).


Here is an explanation https://youtu.be/emgJtr9tIME


Both of these articles fail to mention that as of Postgres 9.5, you can use logical replication rather than WAL. This is why Amazon's RDS now supports multi-region slaving with Postgres.


Why is there a reference and link to a Wikipedia article about a sci-fi novel in the second-to-last paragraph? Did I miss something? Is this an inside joke?



Are we supposed to know who Ayende is? Please change the title.


I don't know who he is, but I can't think of a clearer title to convey that it's not a repost while also leaving the namespace open for other people to make similar posts (changing it to the post title "re: <title>" doesn't convey any extra information).


For people who do not know who Ayende is, the presence or absence of his name in the title provides no additional information. For people who do know that he is an experienced and talented database developer, his name in the title is likely to make them want to read the article more. Therefore, removing his name from the title is a net loss as it eliminates information that is useful for the many people who know who Ayende is while doing nothing for people who do not know who Ayende is.


Almost everybody in the .NET community knows who he is.


I agree with @dev360. I think it has to do with RavenDB being more common with .NET developers (at least from my experience).


Hm, isn't RavenDB a .NET solution? As such how could it not be "more common with .NET developers"?


RavenDB is just a server application. There's nothing stopping you from connecting to it via, say, Python.

It's just much more commonly used in .NET apps.


The original client was written in C#, but I believe there are Java and JavaScript clients now. It can also be queried via HTTP requests.


Then count me in the category that requires you to use the qualifier "almost". I've been working in .NET since it was invented and I've never heard of him.


I think the title was to differentiate from the actual Uber post regarding the switch.


The title would've been fine without the name drop, which would also make the title equal to the linked article's title.


I think "Ayende" should be in the title somewhere not because of any Hollywood "name drop" but simply to differentiate it from yesterday's submission. I don't want those 2 submissions to have the exact same title. Yesterday's submission was a blog directly from uber.com. Today's submission is a blog from ayende.com with his own commentary.

Therefore, maybe a reasonable compromise is to put "Ayende" at the end like this:

"Why Uber Engineering Switched from Postgres to MySQL (Ayende's comments)"


I think a quick look at the domain name has the same effect.


I agree the greyed-out, hard-to-see "(ayende.com)" in parentheses can have the same effect, but I assume everybody's eyes scan the page differently[1]. I'm sure a lot of readers in a hurry would ignore the grey text and assume it's an exact duplicate of yesterday's post. Having "Ayende" in the title seems like a minor expansion of words to counteract that effect.

From the back & forth comments, I'm getting the vibe that the submitter broke some hidden HN principle??? As if putting a person's name in the title is trying to manipulate the HN readers into hero worship?? Or trying to impress HN readers with Hollywood name dropping?? I read the "Ayende" using a perfectly neutral interpretation but it's obvious others are really turned off by it. I can't see any other reason for the bikeshedding about the title.

[1]https://www.google.com/search?q=eye+tracking+gui&source=lnms...


HN guidelines regarding titles are to include the article's original title if it's not clickbaity or misleading, and to omit the website's name from the title because it appears in the domain name next to it. Both guidelines seem not to have been followed.

I agree that OP did it to differentiate this response article from the original one, but this was the first time the article was posted on HN, so we don't really know if it needed to be differentiated (i.e. if it had been posted without getting to the front page, then it would've been safe to assume a different title was necessary).

Either way, I think "A response to" instead of "Ayende on" would've done the same job without risking falling into any of the pitfalls you mentioned. But enough nitpicking; the article itself is interesting.


That is true, however it assumes that most people read the domain name.

I don't, and I don't know how many do. Do you?


The guidelines state that a website's name should be omitted from the title because the domain name gives that away.

I always check the domain name because I know despite a catchy title, some domains are not worth checking.


Nit: it was not the domain that was included in the title but the author's name; they happen to be the same.

Whilst on reddit and other sites I will check the domain, on HN they are posted so infrequently that I spare myself the strain.


>It was not the domain that was included in the title but the authors name, they happen to be the same.

Yes, I understand that that was the intention, but let's suppose the process of detecting the domain name in the title had been automated. The post would've been flagged.

>on HN they are posted so infrequently

I take it you don't browse HN's newest page, it's full of garbage links.


OP here, yes, that's exactly why I put 'Ayende' in the title, since the original was posted yesterday. If I hadn't, I figured people would skip it assuming it was a dupe, or I'd get the usual crowd bleating about it being a dupe without even bothering to read it.


Perhaps you should let the crowd answer the first part of your post before making a demand re: something you know nothing about in the very next sentence?


I didn't know who Ayende is, now I do. Please keep the title.


As someone else mentioned, this is a repost from yesterday - mods?


This isn't a repost, it's a blog post by Ayende in response to yesterday's article.

I thought it worthy of posting since Ayende heads up RavenDB and knows a thing or two about databases and their low-level implementation.


Oh my gosh you're totally correct! Sorry for my mistake there.


I gather reposts or re-bumps-to-the-front-page are common these days in case good articles don't get the attention they deserve. Which articles deserve attention is, of course, entirely up for debate.


Ah, actually I totally misunderstood the link and thought it was a genuine re-post from yesterday!


Very nice write-up! Hope this helps them spam the heck out of people who never opted into their advertisement in the first place.

/sarcasm

I only wish LE would treat CAN-SPAM seriously and put more resources into criminal enforcement.


Please don't do this again here. We ban accounts that repeatedly derail threads with off-topic comments.


I mean, sure, it's written to advertise his product, but it's still content that is of interest.


I don't understand your comment...


HN's own r/SubredditSimulator/ I presume.



