Explanation: Ayende is a well-known programmer in the .NET community. Among other things he works on RavenDB (an open-source NoSQL database for .NET) and is currently working on a new storage engine for it called Voron. In this post he discusses Uber's blog post about Postgres and MySQL and compares the low-level workings of both with what his team is doing with RavenDB and Voron.
The article talks about Postgres mostly delegating caching to the OS and accessing it via system calls (which they describe as "expensive"), whereas InnoDB has its own caching implementation and is therefore faster.
I wonder if part of the problem is really Linux, not Postgres, and whether another way they could have addressed this is by trying FreeBSD (which is what a lot of people run Postgres on).
>I wonder if part of the problem is really Linux, not Postgres
I'm not a PostgreSQL expert, but it seems to have a different philosophy about OS filesystem buffering.
Using Oracle and MS SQL Server as counterexamples... you can configure Oracle to bypass OS filesystem buffers by using "raw" disk I/O. With MS SQL Server, it opens datafiles with CreateFile(,,,FILE_FLAG_NO_BUFFERING) to bypass the Windows NTFS caching. (Microsoft recently ported SQL Server to Linux, so I'm not sure what combination of techniques they are using there, since CreateFile(FILE_FLAG_NO_BUFFERING) is a Win32 API and not relevant to Linux.)
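For illustration, the closest Linux-side analogue to that Win32 flag is O_DIRECT. A minimal sketch; the file name is invented, and real O_DIRECT I/O must use suitably aligned buffers and offsets:

    import os

    # Rough Linux analogue of CreateFile(..., FILE_FLAG_NO_BUFFERING):
    # O_DIRECT (on platforms that define it) asks the kernel to bypass
    # the page cache, so the engine's own buffer pool is the only cache.
    # The file name is made up.
    fd = os.open("datafile.ibd", os.O_RDWR | os.O_DIRECT)
    try:
        pass  # the engine would issue aligned reads/writes against fd here
    finally:
        os.close(fd)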
The closest approximation PostgreSQL has is the "shared_buffers" parameter, but from the documentation[1] I've read, the best practice for that setting is 25% to 40% of main memory. In contrast, Oracle and MS SQL Server db engines with their self-managed buffer pools are designed to use almost 100% of the main memory.
Conclusion: PostgreSQL relies on the OS filesystem cache, and its own DB buffer does not supersede the OS buffers. Other DB engines' buffer pools are designed to bypass the OS filesystem cache.
Installing PostgreSQL on FreeBSD instead of Linux wouldn't drastically change the recommended range for "shared_buffers".
Ironically, in another "universe", StackOverflow doesn't have either part of this problem - they use the DB recommended by Microsoft and the OS recommended by Microsoft :)
A userspace I/O cache is always going to be faster than the Linux one, but that is not a problem with Linux per se. The userspace implementation not only has far more context about the workload with which to make decisions, but the design of the underlying I/O cache itself can also be perfectly tuned for the expected and desired behaviors a priori. The Linux buffer cache can't reasonably be designed for a single application. This has large performance effects.
By analogy, it is similar to how a purpose-designed hash map will almost always significantly outperform a generic hash map. There are enough design knobs that perfect foreknowledge of the use case allows significant optimization of the implementation.
I would say this is not just a case of being faster, but of being more aware.
Linux does not have the same meta-data knowledge that the native cache has. There is a layering problem which makes it hard to add this. For example: after some page splits and fragmentation, pages are no longer stored in physical order. How can logical read-ahead work at the Linux level?
The article mentions Postgres using many lseek and read calls. Does anyone know why Postgres doesn't use pread to cut the number of context switches in half (if, as the author says, the number of context switches is such a bottleneck)? On systems where pread doesn't exist, presumably they could use an inline function that emulates pread with lseek and read.
Yes, mmap would presumably be preferable in most situations, but pread would seem to be universally preferable to an lseek followed by a read. I assume the kernel keeps per-fd statistics on sequential reads anyway, so for long-lived fds, lseek presumably wouldn't provide any better read-ahead hinting than pread.
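As a minimal sketch of that fallback (the function name is invented; where available, os.pread does the same job in a single call):

    import os

    # Hypothetical fallback: emulate pread() with lseek() + read().
    # This costs two system calls per read and isn't safe if another
    # thread shares the file descriptor's offset -- exactly the overhead
    # a real pread() avoids.
    def pread_compat(fd, count, offset):
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, count)

    # Where pread() exists, one positioned read replaces both calls:
    # data = os.pread(fd, count, offset)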
The problem is fundamental to anything related to system calls. The only way to solve it is to somehow make system calls less expensive than they are now.
> a lot of the pain the Uber is feeling has to do with the way Postgres has implemented MVCC. Because they write new records all the time, they need to update all indexes, all the time
This is not true in general: there are circumstances where Postgres can skip updating indexes when a tuple is updated, assuming the update only changes values in non-indexed columns. See:
No: the example updated an indexed column (which frankly is likely rarer in practice but makes for a simpler example), but the serious write amplification comes because of the other indexes that had to be updated; in their example they change the birth year (so obviously the index has to change!) but then the name index also gets updated due to the new row location. HOT mitigates this issue.
> That should really be an event stream, but if you want to find all `unassigned` orders, an index would be good.
No, an index is not good for that, because it only has 4 values (at most 4 times faster than a full table scan), and it hammers the index on every update, whether you're using PostgreSQL or not.
Same for the two-column index: single-column indexes on pickup time and creation time are highly selective, and their values are basically never updated, whereas status has low selectivity and updates are common. Combining them, you don't significantly improve selects on the two columns, and you limit the performance edge of selects that filter on time alone.
A status index only performs well when you select one of the values which is significantly underrepresented in the column (e.g. all except completed). The best-performing approach without doing black magic is to split it into columns with separate indexes: assigned time, pickup time, completed time. If you want unassigned ones, you select where assigned time IS NULL, and the same for the other columns. Update performance on most RDBMSs improves dramatically, as index updates are now distributed among several indexes, allowing higher concurrency and overall throughput.
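A rough sketch of that layout (table, column, index, and connection names are all invented):

    import psycopg2

    # One nullable timestamp per state, each with its own small index,
    # as suggested above. Everything here is illustrative only.
    conn = psycopg2.connect("dbname=orders_demo")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE orders (
                id           bigserial PRIMARY KEY,
                created_at   timestamptz NOT NULL DEFAULT now(),
                assigned_at  timestamptz,
                picked_up_at timestamptz,
                completed_at timestamptz
            )""")
        cur.execute("CREATE INDEX orders_assigned_idx  ON orders (assigned_at)")
        cur.execute("CREATE INDEX orders_picked_up_idx ON orders (picked_up_at)")
        cur.execute("CREATE INDEX orders_completed_idx ON orders (completed_at)")

    # "All unassigned orders" then becomes a simple IS NULL filter on one column:
    with conn, conn.cursor() as cur:
        cur.execute("SELECT id FROM orders WHERE assigned_at IS NULL")
        unassigned = cur.fetchall()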
If you want performance and the data is not critical, like a Facebook post, you need to use MySQL. But if you deal with money, like a bank or an ERP, then the way to go is PostgreSQL. It should be a known fact for software developers. Think of a money transfer, which involves:
1. Inserting a transaction to remove $10 from Bob's account
2. Inserting a transaction to add $10 to Jack's account.
Both 1 and 2 must succeed together, or fail together if either of them fails (or both do).
You can imagine the disaster in banking apps, ERPs and other financial systems when 1 succeeds but 2 fails and the database is happy with that imbalanced state.
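A minimal sketch of that all-or-nothing behavior (table, column, account, and connection names are invented):

    import psycopg2

    # Both inserts are applied atomically: if either fails, the whole
    # transaction is rolled back and neither row is visible.
    conn = psycopg2.connect("dbname=bank_demo")
    try:
        with conn:  # commits on success, rolls back on any exception
            with conn.cursor() as cur:
                cur.execute("INSERT INTO transactions (account, amount) "
                            "VALUES ('Bob', -10)")
                cur.execute("INSERT INTO transactions (account, amount) "
                            "VALUES ('Jack', 10)")
    except psycopg2.Error:
        # Neither insert was applied; the imbalanced state never exists.
        pass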
MySQL does not add new features because it was bought by Oracle, but they do improve performance. It is an easy-to-use, minimal database. The InnoDB engine is ACID compliant, but MySQL's original engine, MyISAM, is not. Why not ACID compliant? To improve performance.
On the other hand, PostgreSQL has been ACID compliant from the beginning, and it adds new features every few months. Some of the features that do not exist in MySQL are: an IP address data type, serial fields (multiple auto-increments), row-level permissions, geographical data types, inheritance, an XML data type, custom data types, range types, and things that I don't know about; but I do know that it provides many features of the SQL language like INTERSECT, EXCEPT and merge joins. It is like a free Oracle.
I get why those features are useful, but I don't see how that supports the claim that you should only use MySQL for things that aren't critical (implying that it will randomly produce incorrect results or similar).
Both of these articles fail to mention that as of Postgres 9.5, you can use logical replication rather than physical WAL shipping. This is why Amazon's RDS now supports multi-region slaving with Postgres.
Why is there a reference and link to a wikipedia article about a sci-fi novel in the second to last paragraph? Did I miss something? Is this an inside joke?
I don't know who he is, but I can't think of a clearer title to convey that it's not a repost and also leave the namespace open for other people to make similar posts (changing it to the post title "re: <title>" doesn't convey any extra information).
For people who do not know who Ayende is, the presence or absence of his name in the title provides no additional information. For people who do know that he is an experienced and talented database developer, his name in the title is likely to make them want to read the article more. Therefore, removing his name from the title is a net loss as it eliminates information that is useful for the many people who know who Ayende is while doing nothing for people who do not know who Ayende is.
Then count me in the category that requires you to use the qualifier "almost". I've been working in .NET since it was invented and I've never heard of him.
I think "Ayende" should be in the title somewhere not because of any Hollywood "name drop" but simply to differentiate it from yesterday's submission. I don't want those 2 submissions to have the exact same title. Yesterday's submission was a blog directly from uber.com. Today's submission is a blog from ayende.com with his own commentary.
Therefore, maybe a reasonable compromise is to put "Ayende" at the end like this:
"Why Uber Engineering Switched from Postgres to MySQL (Ayende's comments)"
I agree the greyed-out, hard-to-see "(ayende.com)" in parentheses can have the same effect, but I assume that everybody's eyes scan the page differently[1]. I'm sure a lot of readers in a hurry would ignore the grey text and would assume it's an exact duplicate of yesterday's post. Having "Ayende" in the title seems like a minor expansion of words to counteract that effect.
From the back & forth comments, I'm getting the vibe that the submitter broke some hidden HN principle??? As if putting a person's name in the title is trying to manipulate the HN readers into hero worship?? Or trying to impress HN readers with Hollywood name dropping?? I read the "Ayende" using a perfectly neutral interpretation but it's obvious others are really turned off by it. I can't see any other reason for the bikeshedding about the title.
HN guidelines regarding titles are to include the article's original title if it's not clickbaity or misleading and to omit the website's name from the title because it already appears as the domain next to it. Both guidelines seem not to have been followed.
I agree that OP did it to differentiate this response article from the original one, but this was the first time the article was posted on HN, so we don't really know if it needed to be differentiated (i.e. if it had been posted without getting to the front page, then it would've been safe to assume the necessity of a different title).
Either way, I think "A response to" instead of "Ayende on" would've done the same job without risking falling into any of the pitfalls you mentioned. But enough nitpicking; the article in itself is interesting.
>It was not the domain that was included in the title but the authors name, they happen to be the same.
Yes, I understand that was the intention, but let's suppose the process of detecting a domain name in the title had been automated: the post would've been flagged.
>on HN they are posted so infrequently
I take it you don't browse HN's newest page, it's full of garbage links.
OP here, yes, that's exactly why I put 'Ayende' in the title, since the original was posted yesterday. If I didn't, I figured people would skip it assuming it was a dupe, or I'd get the usual crowd bleating about it being a dupe without even bothering to read it.
Perhaps you should let the crowd answer the first part of your post before making a demand re: something you know nothing about in the very next sentence?
I gather reposts or re-bumps-to-the-front-page are common these days in case good articles don't get the attention they deserve. Which articles deserve attention is, of course, entirely up for debate.