If I recall correctly, this absolutely killed FreeBSD in benchmarks while Linux looked amazing. I recall the editors having to add a note explaining that Linux wasn't actually committing I/O while FreeBSD was, to account for the huge difference. At least I think this is the situation I'm remembering. Hmm..
The fsync vs I/O errors bug didn't have any performance impact in the absence of failing disks; it was about what happened when you tried to fsync again after your storage stack reported a write error.
No, since most OSes had (and still have) the "incorrect" behaviour. This was (and still is!) a generalised issue with fsync itself, since the state of the buffers is undefined in case of failure.
The assumption that you could retry fsync on error was always effectively incorrect; per the postgres wiki on the subject, openbsd, netbsd, and macos had essentially the same behaviour (invalidate buffers on failure). FreeBSD diverged from the rest because they rewrote brelse() long after separating from the others.
After this issue surfaced, openbsd updated their implementation to ensure the call would keep failing, but only if the file is kept open, which postgres had not guaranteed (I don't know if it does now).
Linux had (has?) the additional issue of wonky error reporting, but that would not save your ass.
"with possibly disastrous consequences for data durability/consistency (which is something the PostgreSQL community really values)."
Do they though? Well, sure, they might value it, but do they prioritize it? E.g. checksums for data pages were only introduced in PostgreSQL 9.3 and they are still disabled by default today in PostgreSQL 14. That might be the right choice if your underlying filesystem provides checksums for content (e.g. ZFS) and memory is protected from corruption via other means (likely given on a DB server holding data someone cares about), but should that assumption be the default?
20 years ago I was trying to decide between the two. But SQL sounded hard to learn, so I wrote my own database. The joys of being a jr programmer with no oversight.
if i recall, it rose to prominence against a backdrop of frustration with myisam not supporting transactions and stored procedures, among several other things.
early postgres had its own complications though. myisam was limited, but save for dumps it basically ran itself. postgres required tuning and manually run maintenance operations. it was not a drop-in replacement for myisam like innodb was...
What is the issue? What are the implications? Is there a written summary some place?
These people do great work, bless them. I love watching these sorts of videos when I have some time to kill.
But I would like to know the implications of this, and a forty-five-minute video might be of some help - or might not, in which case that's forty-five minutes erased from my life....
The tl;dr (and blunt truth) is that Linux doesn't implement POSIX's semantics of fsync() correctly. The standard says "the fsync() function is intended to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk". Instead, Linux merely provides this guarantee for "all data since the last* call to fsync(), whether successful or not", which means data that failed to flush once will continue to fail to get flushed on subsequent calls to fsync(). But nobody I've seen has been willing to admit this is a broken implementation yet. People instead go through extreme contortions to justify the status quo, blaming Postgres for not using a broken fsync() "correctly".
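To make that concrete, here's a minimal C sketch of the "retry fsync() until it succeeds" pattern being criticised (the filename and the retry loop are just illustrative, not anyone's real code). On a healthy disk it behaves fine; the point is that after a writeback error, Linux has already marked the failed pages clean, so the retry has nothing left to flush and reports success without the data ever reaching the disk:

    /* Naive "retry fsync until it succeeds" pattern. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("example.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char buf[] = "some data we would like to be durable\n";
        if (write(fd, buf, sizeof buf - 1) < 0) { perror("write"); return 1; }

        /* Keep calling fsync() until it reports success. */
        while (fsync(fd) != 0) {
            perror("fsync");   /* e.g. EIO after a failed writeback */
            /* On Linux the failed pages were already marked clean, so the
             * next fsync() has nothing to write and returns 0, and this
             * loop exits "successfully" with the data silently lost. */
            sleep(1);
        }

        close(fd);
        return 0;
    }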
Given how it took some 2 decades for Linux folks to change their minds on other issues, like realizing the notion of a pidfd is a useful one, I'm guessing they might eventually get around to this too, but it might take another couple of decades for views to change.
I thought the conclusion was that the POSIX spec was not clear on this specific scenario, while the Linux kernel team advised that the only real solution was for Postgres to move to direct I/O (DIO) instead, something that is still ongoing work?
Well, I mean, they can claim it's unclear, but to me they're obviously wrong—and you don't have to take my word over theirs; just go read the spec [1]. It's perfectly clear what "all data up to the time of the fsync() call" means. You obviously can't just read that phrase and claim it's in any way synonymous with "all data up to the time of the fsync() call since the previous call to fsync(), even if it failed (!)". "All data until X event" just means all data before X—plain, simple, and unambiguous. It just doesn't mean "all data until X since Y"!
"#1: Existing file systems do not handle fsync failures uniformly. In an effort to hide cross-platform differences, POSIX is intentionally vague on how failures are handled. Thus, different file systems behave differently after an fsync failure (as seen in Table 1), leading to non-deterministic outcomes
for applications that treat all file systems equally. We believe
that the POSIX specification for fsync needs to be clarified
and the expected failure behavior described in more detail."
People can say whatever they want (even in papers!), and people go through contortions to justify arbitrary things, but like I said, you don't have to take my word any more than you have to take theirs when the spec is right there. You can just read the spec yourself and witness that the meaning is pretty clear.
"...Next, the reason why fsync() has the behaviour that it does is one
ofhe the most common cases of I/O storage errors in buffered use
cases, certainly as seen by the community distros, is the user who
pulls out USB stick while it is in use. In that case, if there are
dirtied pages in the page cache, the question is what can you do?
Sooner or later the writes will time out, and if you leave the pages
dirty, then it effectively becomes a permanent memory leak. You can't
unmount the file system --- that requires writing out all of the pages
such that the dirty bit is turned off. And if you don't clear the
dirty bit on an I/O error, then they can never be cleaned. You can't
even re-insert the USB stick; the re-inserted USB stick will get a new
block device. Worse, when the USB stick was pulled, it will have
suffered a power drop, and see above about what could happen after a
power drop for non-power fail certified flash devices --- it goes
double for the cheap sh*t USB sticks found in the checkout aisle of
Micro Center.
So this is the explanation for why Linux handles I/O errors by
clearing the dirty bit after reporting the error up to user space.
And why there is not eagerness to solve the problem simply by "don't
clear the dirty bit". For every one Postgres installation that might
have a better recover after an I/O error, there's probably a thousand
clueless Fedora and Ubuntu users who will have a much worse user
experience after a USB stick pull happens.
I can think of things that could be done --- for example, it could be
switchable on a per-block device basis (or maybe a per-mount basis)
whether or not the dirty bit gets cleared after the error is reported
to userspace. And perhaps there could be a new unmount flag that
causes all dirty pages to be wiped out, which could be used to recover
after a permanent loss of the block device. But the question is who
is going to invest the time to make these changes? If there is a
company who is willing to pay to comission this work, it's almost
certainly soluble..."
and this part of the paragraph:
"...But again, of the companies who have client code where we
care about robustness and proper handling of failed disk drives, and
which have a kernel team on staff, pretty much all of the ones I can
think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try
to make buffered writes and error reporting via fsync(2) work well..."
I conclude that:
1) Fixing fsync() would cause other issues and might be technically challenging, and some within the Linux community don't have it as their highest priority.
2) The onus falls mostly on the Postgres team to re-implement their code so as to use the "better" technical solution, as implemented on Linux for similar scenarios by Oracle, Google and others...
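For reference, here is roughly what that direct I/O path looks like at the syscall level. This is only a sketch: the file name is made up, the 4 KiB alignment is an assumption, and real code would query the device's logical block size and handle short writes.

    /* Bypassing the page cache with O_DIRECT (a Linux-specific flag). The
     * buffer, offset and length must be aligned, typically to the logical
     * block size of the device. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096;   /* assumed block size */
        int fd = open("example.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        void *buf;
        if (posix_memalign(&buf, align, align) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 'x', align);

        /* The write goes straight to the device; an error here refers to
         * this request, not to some earlier unrelated writeback. */
        if (write(fd, buf, align) != (ssize_t)align) { perror("write"); return 1; }

        /* File metadata (size, allocation) is still buffered, so a sync
         * call is still needed if that must be durable too. */
        if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }

That alignment bookkeeping is part of why the comments further down call O_DIRECT a can of worms.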
> In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak.
FreeBSD keeps the dirty buffer (always marked as dirty) in memory as long as the device exists:
Honestly even if a device disappears it's probably worth keeping around those pending writes for several seconds. It might come back, and if it comes back fast enough it probably hasn't been altered by any other machine.
Super interesting (I hadn't seen it before), thanks for sharing! I feel like there might be lots of potential solutions they're ignoring though.
> Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak.
Why not just get rid of the pages after the underlying medium is removed? Especially since, after that point, you wouldn't be able to guarantee much about the device's contents anyway, even if it were plugged back in.
Also, even if I buy that this is a trade-off with a permanent memory leak (which I don't, for lots of reasons like the above), isn't that better than outright corruption...? At least you can reboot to get rid of a leak. There's no guarantee you can get back corrupted data!
Looking at it from another side... the whole Direct I/O path looks like a worse can of worms... maybe it's easier to take another look at fsync :-)
"...The exact semantics of Direct I/O (O_DIRECT) are not well specified. It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications..."
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." -- Linus
Or (3) companies that make $$$ or save $$$ from using Postgres could hire kernel developers to implement a better solution for buffered I/O. The problem is that the companies who are serious about I/O recovery use Direct I/O, so engineers employed by those companies have plenty of other improvements to work on, such as io_uring, improving general storage performance, adding support for inline encryption engines to improve performance on mobile devices, etc., etc., etc.
People seem to forget that Open Source does not mean that users get to demand that unpaid volunteers will magically do work for their pet feature requests. It just means that the source code is available and people are free to improve the code to make it better fit their use case. A proprietary OS is like a car whose hood is welded shut, and only the dealer is allowed to service it. An open source OS means that you can take the car to whomever you like, or even service the car, or improve the car, yourself. It does not mean that you get to have service or improvements to your car engine for free.
The other thing to note here is that Postgres was issuing fsync(2) calls from different processes, and some of those processes were ignoring the error return from fsync(2). If there is an I/O error, fsync(2) will tell userspace about it. However, there is nothing in POSIX which states that once a file has an I/O error associated with it, the fsync(2) system call will return errors forever and ever, Amen. So Postgres was being a bit dodgy with error returns as well, and was demanding something that POSIX clearly never promised.
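To illustrate the multi-process shape of the problem, here's a small C sketch with two descriptors in one process standing in for a backend and the checkpointer (segment.dat is a made-up name). Whether the second fsync() actually surfaces an earlier writeback error depends on the kernel version, which was exactly the contentious part:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* "Backend": dirties the file through its own descriptor. */
        int writer = open("segment.dat", O_WRONLY | O_CREAT, 0644);
        if (writer < 0) { perror("open writer"); return 1; }

        const char buf[] = "page image\n";
        if (write(writer, buf, sizeof buf - 1) < 0) { perror("write"); return 1; }

        /* "Checkpointer": opens the same file separately, later on. */
        int checkpointer = open("segment.dat", O_WRONLY);
        if (checkpointer < 0) { perror("open checkpointer"); return 1; }

        /* If writeback of the buffered data failed in the meantime, older
         * kernels may have reported (and cleared) the error elsewhere or
         * dropped it entirely, so this can return 0 even though the data
         * is gone. Checking the return value is necessary but, on those
         * kernels, not sufficient. */
        if (fsync(checkpointer) != 0)
            perror("fsync");

        close(writer);
        close(checkpointer);
        return 0;
    }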
You're quoting a sentence fragment of a non-normative section of the spec, in other words you're half quoting a part of the spec that is not intended to be interpreted formally, literally, or even required to be adhered to. The rationale section of the POSIX spec is strictly informative.
The fact that you left out the part of the sentence that says "The fsync() function is intended to..." suggests that you may be intentionally hiding this fact to support your point.
Even if we go by the non-normative text, that section explicitly states that the spec allows for, and I quote "It is explicitly intended that a null implementation is permitted."
At any rate, the normative section of the spec states only the following:
"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined."
Emphasis mine.
It would be unwise to dismiss the wealth of resources that have been provided to you in other comments to better understand this issue, all of them quite extensive and written by experts on this subject, just because of an informal section of the standard.
Setting aside the appeal to authority and the outright condescension, POLA still can't be dispelled using a "nasal demons" argument.
In practice the intentional vagueness is indistinguishable from obfuscation, and doing so to please every vendor for conformance's sake is straightforward intellectual dishonesty. In this regard, the informational section is tantamount to an extended reprimand from the editors.
If you really want to trade language-lawyer broadsides, then leaning on the phrase "implementation-defined" was a tragic mistake, because this makes the Linux documentation normative for the circumstances, yet the behaviour occurring is not described in the Linux fsync(2) manpage, nor are there cross-references to anything FS-specific. It gets worse: the engineering solution to this schmozzle is to switch to direct IO, but the documentation of Linux's open(2) then describes the behaviour of O_SYNC|O_DIRECT in terms of fsync(2) (and similarly for O_DSYNC). All this simply highlights that the whole line of thinking is an unproductive pissing contest; the root cause of the postgres issue is vendor politics.
> The nature of the transfer is implementation-defined.
Does "nature of the transfer" cover "the extent of the data set"? To me, the nature of the transfer refers to the technical implementation of the writeout mechanism. Especially given that the rationale explicitly mentions "since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract" it seems to me that "nature of the transfer" should not be interpreted as "the data to be written out".
But even going by the formal Description only, it does not follow to me that an implementation can choose to selectively ignore part of the data. These are the phrases used in that section:
- The fsync() function shall request that all data for the open file descriptor [..] to be transferred to the storage device
- [It] shall force all currently queued I/O operations [..] to the synchronized I/O completion state
- All I/O operations shall be completed
Each sentence contains the word "all".
The return value description reads "Upon successful completion, fsync() shall return 0". Based on this thread, the Linux kernel should never return 0 if a previous fsync failed, because the kernel has no intention of honouring the request.
Leaving aside that the normative part of the spec is still very clear that fsync() applies to all data for the open file descriptor, your line of reasoning fails for another reason. If we accept that fsync() as implemented in Linux is correct, what is its use? Is there any way to use fsync() "correctly" in the presence of IO errors?
As far as I understand, the answer is "definitely not" on Linux before 4.13, since failed async writes to the file could clear the dirty bit long before you called fsync(). There are some improvements since then, so arguably fsync() can now be used for something meaningful, though you have to take great care, and the only recourse on error is probably to start rewriting the file from scratch, even for transient errors (since the buffers may or may not be lost after the error, but future calls to fsync() will report success regardless).
Whether this behavior is a good idea or not is the only question that matters, since POSIX compliance hasn't mattered to Linux anyway for who knows how many years.
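Given all that, the only defensible userspace pattern seems to be the one sketched below: check fsync() once and treat failure as fatal, then recover from a log that is already known to be durable. This is just a sketch (durable_write and table.dat are made up), but if I remember right the fix Postgres eventually shipped is in this spirit: promote fsync failure to a PANIC and replay from the WAL.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write and flush, aborting on any failure instead of retrying. */
    static void durable_write(int fd, const void *data, size_t len)
    {
        if (write(fd, data, len) != (ssize_t)len) {
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            abort();            /* data not even queued; give up */
        }
        if (fsync(fd) != 0) {
            /* Do NOT retry: the kernel may already have dropped the dirty
             * pages, so a second call proves nothing. Crash and let
             * recovery replay from a log that is known to be on disk. */
            fprintf(stderr, "fsync failed: %s\n", strerror(errno));
            abort();
        }
    }

    int main(void)
    {
        int fd = open("table.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        durable_write(fd, "row\n", 4);
        close(fd);
        return 0;
    }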
> The fact that you left out the part of the sentence that says "The fsync() function is intended to..." suggests that you may be intentionally hiding this fact to support your point.
No, that was not my intention. I just started the sentence "the job of fsync() is" and then went and looked for the portion of the quote that fit. I don't see a semantic difference, but I edited that back in, since you think I'm pulling a fast one on you.
> You're quoting a sentence fragment of a non-normative section of the spec, in other words you're half quoting a part of the spec that is not intended to be interpreted formally, literally, or even required to be adhered to. The rationale section of the POSIX spec is strictly informative.
Would be nice if you could include which part of the standard claims the rationale is non-normative. Moreover, the justifications against it are themselves based on "history" rather than anything in the text, so even if we assume it's non-normative, you can't really use this as an excuse to dismiss it as if it's somehow less valid than arbitrary tidbits from history. At that point you could try to agree that the (purportedly) non-normative text in fact does contradict the current behavior, but argue that you should have the latitude to ignore it. That's quite a bit different from denying its meaning in the first place.
> "The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined."
The "nature of the transfer" is implementation-defined, yes. They don't care if you do it all at once, in parallel or sequentially, in bursts or uniformly, by boat or by rocket, or whatever. The question of which data is to be transferred is completely unrelated to that. If I say "please take my family to the airport, the nature of the transportation is up to you", you don't get to leave half my family behind and then justify it by claiming I left the choice of family members up to you. I did not leave that up to you. I just said you can choose the nature of the transport. Not the whether or whom of the transport!
> It would be unwise to dismiss the wealth of resources that have been provided to you in other comments to better understand this issue, all of them quite extensive and written by experts on this subject, just because of an informal section of the standard
If experts claimed the standard was clear but perhaps inconsistent or self-contradictory, that would be very different than what's going on right now. Or if they argued "we agree this behavior is actually inconsistent with the rationale, but we don't believe the rationale is normative", then that would also be a different scenario. But when people are just saying the standard says something other than what it actually says, what am I supposed to do? Agree with them because they're "experts", despite the fact that the standard says the opposite in plain English?
> The fact that you left out the part of the sentence that says "The fsync() function is intended to..." suggests that you may be intentionally hiding this fact to support your point.
It's very difficult to read this accusation as having been made in good faith, since the person you're accusing of intentionally omitting part of a quote already quoted that very text in their previous comment:
>>>> The tl;dr (and blunt truth) is that Linux doesn't implement POSIX's semantics of fsync() correctly. The standard says "the fsync() function is intended to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk".
>...the person you're accusing of intentionally omitting part of a quote already quoted that very text in their previous comment:
That was edited back in after I made my comment (this is admitted to by OP themselves).
To the extent that I make an accusation, it's that it's absurd to think you can dismiss the wealth of information on this subject and come to a remotely comprehensive understanding of how to implement reliable IO by simply reading a couple of sentences of the POSIX spec, while ignoring the numerous other sources that go into very fine detail on it.
This is not a subject where you read a few sentences and then make bold claims that Linux doesn't implement fsync properly, implying that the experts who have gone to great lengths to implement fsync on Linux just don't understand it, and that all the papers on this subject, the numerous mailing list posts, and the submission itself can readily be dismissed as a simple failure to read what is as clear as day, as if none of the people discussing this issue have ever bothered to read the spec.
> […] while the Linux kernel team advised the only real solution was for Postgres to move to direct I/O (DIO) instead, something that is still ongoing work?
Or the Linux folks could have done what McKusick did on FreeBSD back in 1999:
> Don't throw away the buffer contents on a fatal write error; just mark the buffer as still being dirty. This isn't a perfect solution, but throwing away the buffer contents will often result in filesystem corruption and this solution will at least correctly deal with transient errors.
to be fair, nothing ever really perfectly conformed to POSIX, now that i think of it.
who remembers the myriad choices in synchronization primitives that could be used for the main loop in preforking apache? (to work around all the different flavors of broken in unix derivatives) i remember one desperate choice falling all the way through to flock() on a sentinel file on disk...
macOS is a certified unix, and after the postgresql people took a look at the situation they found out that it also loses data in that situation. So do OpenBSD and NetBSD.
Only FreeBSD would avoid data loss on fsync() failure. AFAIK that remains the case, the only major change I'm aware of is that OpenBSD's fsync is now "locked" into failure after the first failure, at least until you close the file (postgres also played silly bugger on that front).
The only certified UNIX you've listed is macOS, and it doesn't surprise me in the slightest that their frankenkernel doesn't fully comply with the POSIX spec. Besides that, I don't know what to tell you. The whole business of "true UNIX" has pretty much fallen by the wayside these days anyway; the list of server-ready POSIX-compliant OSes is tiny, and their market share is even smaller.
The short of it is that fsync on Linux used to be really bad at reporting errors (POSIX APIs are still bad, but we've now sorted out some ways to work around this particular source of pain)
IIRC, after an IO error you could get into a situation where on the next fsync postgres thought things were back on track, but Linux had actually just thrown away some data.
> What is the issue? What are the implications? Is there a written summary some place?
The gist of the issue is that you cannot retry fsync(), unless you're on FreeBSD.
There are no defined guarantees as to the state of the buffers on fsync() failure; most OSes will just invalidate the buffers regardless. So if you retry fsync() on failure, the first call fails and discards your data, and the second trivially succeeds since there's no buffered data left.
FreeBSD is the only one which keeps the buffer as-is on failure. Since this issue was uncovered, OpenBSD was also modified such that an fsync() failure would be sticky (so retrying is useless, but at least it's not misleading).
Linux had the additional issue that it would commonly fail to report errors entirely, but that is not the main problem.
PostgreSQL used fsync incorrectly for 20 years - https://news.ycombinator.com/item?id=19119991 - Feb 2019 (307 comments)