> When we look at the fsync() and fdatasync() man pages, we see that those system calls only guarantee to write data linked to the given file descriptor. With ext4, as a side effect of the filesystem structure, all pending data and metadata for all file descriptors will be flushed instead. This creates a lot of I/O traffic that is unneeded to satisfy any given fsync() or fdatasync() call
Does it mean that, under ext4, a call to fsync is essentially the same as a call to sync(2)?
Yes, this was a source of many problems when Firefox moved its history and bookmarking engine to SQLite (I think), which used fsync for its consistency guarantees. That led to situations where making a bookmark might cause your entire computer to pause as disk writes spun up.
Last year I had a problem where I had a function that was logging and a sync() call in another process was blocking the logging for greater than 10 seconds and this caused a timeout aborting the operation my code was performing. I have now moved all logging into a queue that logs on its own thread. This was a gotcha I never anticipated.
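For anyone wanting to try the same thing, here is a minimal sketch of logging through a queue on its own thread, using Python's standard `logging.handlers`. The function name `start_queued_logging` and the logger name `"app"` are made up for illustration, and this is not necessarily how the code described above was structured.

```python
import logging
import logging.handlers
import queue


def start_queued_logging(path):
    """Route log records through an in-memory queue to a background thread."""
    # Unbounded queue: callers never wait on the disk, but memory can grow
    # without limit if the writer thread stalls.
    log_queue = queue.Queue()
    file_handler = logging.FileHandler(path)

    # The listener owns the thread that performs the actual file writes.
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger("app")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Callers only enqueue records; they never touch the file themselves.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener
```

Call `listener.stop()` at shutdown so that queued records are written out before the process exits.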
Unless you have infinite memory, at some point you want a task to slow down, block, or whatever in the face of resource exhaustion.
It's not a bad idea to maintain a local buffer that gives you a certain amount of cushion. I recently helped a team resolve the exact problem you had, with a similar solution. But excessive, unnecessary use of non-pageable memory is one of the things which induce early I/O contention, causing these stalls to begin with. (Consider an overloaded or errant process generating and buffering a lot of logging noise, precisely because the overtaxed system is under heavy I/O contention.)
To reiterate: you want backpressure, which means that you want a process which is exhausting limited resources to slow down or block. And you want that to transitively slow down or block upstream requests. Too many developers don't understand this and insert hacks to solve their immediate problem (e.g. closing a ticket complaining about intermittent SLA latency misses) without appreciating the broader issues, which at the end of the day just contributes to these problems.
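As a rough illustration of backpressure (a sketch only, with the hypothetical names `run_pipeline`, `sink`, and `max_pending`), a bounded queue makes the producer block once the consumer falls behind, instead of buffering without limit:

```python
import queue
import threading


def run_pipeline(items, sink, max_pending=4):
    """Feed items to a possibly slow consumer through a bounded queue."""
    # The maxsize bound is what creates backpressure.
    pending = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel telling the consumer to stop

    def consume():
        while True:
            item = pending.get()
            if item is done:
                return
            # Sketch limitation: if sink raises, this thread dies and the
            # producer can block forever on a full queue.
            sink(item)

    worker = threading.Thread(target=consume)
    worker.start()
    for item in items:
        # Blocks while the queue is full, which slows the producer down.
        pending.put(item)
    pending.put(done)
    worker.join()
```

If the caller of `run_pipeline` is itself serving requests, the blocking `put` is what propagates the slowdown upstream.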
One of the alternatives people attempt is to insert a gazillion knobs to permit dedicated resource allocation. But now you just have two problems, the second being figuring out what the magic values should be--a never ending and often intractable problem. This rarely ends well except for highly specialized tasks--e.g. a dedicated DB administrator who spends all day attending to and tuning a database instance.
That said, in the old days you mounted /var (and if you were super fancy, /var/log) on different disks to minimize unrelated I/O contention.
Yes. I consider it one of my worst bugs in the Linux kernel and in big servers you pretty much can't have any programs that call fsync() because it will cripple your performance.
That paragraph caused a whiplash of emotions while reading, "Cool!!!... what???? Ug."
Yes, however when you're running a DBMS in your server, I guess it will be doing fsync() all the time. I wonder, if other filesystems like XFS or ZFS have a different behavior here, how much performance gain for a typical database load they can achieve in comparison to ext4.
Realistically, when can we expect this on fresh installs of Debian or Fedora? Will desktop users notice a perceptible performance gain when performing operations on large files?
Not sure how it relates to Fedora, but the article also points out that you need to run mkfs to build the new structure, and the patch to build it hasn't hit e2fsprogs mainline yet.
For me personally, I am waiting for ENOSPC to be 100% solved and understood. It may or may not be (maybe it's theoretically impossible to solve who knows), but until then I'm content with what I have now.
> I am waiting for ENOSPC to be 100% solved and understood
It's understood, there's no black magic. When there's not enough contiguous space to perform a metadata copy (it's a copy on write filesystem, no in-place overwrites!) then it returns an error, even if there's still some free but fragmented space or some free space in the data block groups. A rebalance when metadata blockgroup total size diverges significantly from actual use should solve this.
For small filesystems there also is the option to choose mixed block groups, then there won't be the situation where there's still free space in the data groups while the metadata groups are full.
Do folks typically turn on journaling at the filesystem layer when running a database?
The database itself contains journaling, so one might choose to run with data=writeback or even directly against the block device if they were concerned about performance.
You definitely need both, these are two completely different kinds of journalling:
- Filesystem journalling is making robust changes to the data structures describing directories, files, and where files live, in units of atomic filesystem operations. For example, the filesystem journal may record "CREATE FILE", which translates to "update directory entry 1234 in directory block 5678, then allocate and initialize extent descriptor 9999, then write an inode at array entry 74234"
- Database journalling is making robust changes to the data structures describing the actual file contents, in units of atomic logical application operations. For example, a DB journal may record "INSERT ROW", which translates to "update block 123 of this index file, and 234 of this data file", application-specific relationships like that cannot be captured by the filesystem on UNIX.
(Note: NTFS is transactional on Windows. It's entirely possible to correlate independent writes and make them atomic, so on Windows at least, in theory a DB could exist without a separate journal. I don't know if this is used in practice). Even if it were in use, it places severe limits on the kinds of concurrency optimizations a database system could otherwise perform, because all of that stuff moves behind the curtain of the OS interfaces.
data=writeback does not disable the journal completely. It only removes ordering of the data writes relative to the metadata journaling. The metadata journaling itself remains active.
You can create the file, preallocate space, fsync the inode and the directory to ensure that it will be visible after a crash and then begin using the allocated space as a journal. Then you only have to fdatasync or sync_file_range whatever part of your journal needs to be persisted and those syncs can now be unordered relative to the filesystem's metadata journal without risk of data loss.
So data=writeback can be used safely, but you have to be very, very careful about getting the syscall sequence right. Most applications are not implemented with sufficient paranoia, and so are better served by stricter ordering modes and auto_da_alloc.
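The sequence described above (create, preallocate, fsync the file and its directory, then fdatasync each write) might look roughly like this in Python on Linux. The function names `create_journal` and `append_record` are hypothetical, and this sketch does not show whether a given filesystem really keeps the later fdatasync calls clear of its metadata journal.

```python
import os


def create_journal(directory, name, size):
    """Create a preallocated journal file and make it durable."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # reserve the blocks up front
        os.fsync(fd)  # persist the inode and its allocation
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)  # persist the directory entry
        finally:
            os.close(dir_fd)
    except BaseException:
        os.close(fd)
        raise
    return fd


def append_record(fd, offset, record):
    """Write into the preallocated space, then flush only this file's data."""
    # Sketch limitation: a short write is not retried here.
    written = os.pwrite(fd, record, offset)
    os.fdatasync(fd)
    return offset + written
```

The file size never changes after `create_journal`, which is the reason for preallocating: later writes only overwrite space that already belongs to the file.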
Probably not much -- the main win is not having to fsync files unrelated to your work, which is great for desktops which run multiple unrelated tasks (browser, terminals, rss readers, email clients, steam/game updates, package downloads). But I have to imagine SQL databases are typically run on systems dedicated to that singular task.
On the other hand, DBs call fsync() very often, which would presumably trigger this frequent "stop the world, write the data" behaviour vastly more than otherwise, which could make your I/O very stuttery.
My solution has always been to just add a RAID controller with a battery-backed write cache, and then to disable barriers on ext4 and switch to data=writeback. The data gets written to the cache "instantly", meaning less risk of data corruption (either at DB or filesystem level) from crashes or power outages. Saves having to worry about any of this sort of stuff.
> On the other hand, DBs call fsync() very often, which would presumably trigger this frequent "stop the world, write the data" behaviour vastly more than otherwise, which could make your I/O very stuttery.
But that's because they want to stop the world and write the data. Or more specifically, write the data they wrote. The issue is that more data is getting written than POSIX deems required, and that's the perf hit.
Depending on your use case, libeatmydata or battery-backed RAID can help, but I'd feel a bit skeptical of 'we want to lie about fsync() to our DB' for many workloads.
Funnily enough, BTRFS and ZRAM out-of-the-box are the reasons I moved to Fedora for all but my main Gentoo/ZFS system. To each their own c:
Would you mind sharing what other features you've found pushing you away from Fedora and why? For my part, I'll admit, BTRFS and ZRAM both work against hibernation out-of-the-box, which is my most major gripe.
Neither Btrfs nor zram-based swap inhibits hibernation. There is a kernel lockdown policy that inhibits hibernation when UEFI Secure Boot is enabled.
Since the hibernation image is neither encrypted nor signed, it's a potential attack vector to subvert the UEFI Secure Boot mechanism. Because UEFI Secure Boot systems are common (Windows hardware certification requires it exist and be enabled by default), it was decided to, by default, save the space on disk for disk-based swap and use zram-based swap instead.
It is straightforward to create a disk-based swap partition during installation, and it is configured to support hibernation, while still remaining subject to the lockdown policy.
The Fedora installer has let you choose between ext4 (and ext3 before it), XFS, and Btrfs for like, more than a decade? This should not really influence your distro choice one way or another.
It's defaulting to btrfs, which is a poor decision. If Fedora is making such poor decisions on basic things like default filesystems, that gives me more than enough influence to choose one way or another.
They were all aboard with Stratis, but here we are, btrfs moving back in.
> One of the things that I did discuss with Harshad was using some heuristics, where if there are two "unrelated" applications (e.g., different session id, or process group leader, or different uid, etc. --- details to be determined later), we would not entangle writes to unrelated files via fsync(2), while forcing files written by the same application to share fate with one another even if only one file is fsync'ed.
Ugh. This is why we can't have nice things. I really don't want the kernel's filesystem performance to depend on the number of different UIDs writing to the filesystem. That is insanity!
Ted Ts'o is just wrong here: performance should take priority over preserving the behavior of applications that rely on non-contractual implementation details of the Linux kernel. fsync should sync only the indicated file, and that's that. We can add a mount option to let users opt into the older, safer behavior, but we shouldn't suffer for essentially an eternity because somewhere, someone might have written an application that depends on an ext4 implementation detail.
I think you have to see that in the light of all the complaints about data loss when people moved from ext3 to ext4. Ted IIRC pushed back against that with arguments like yours, but eventually gave up (leading to a heuristic for detecting such cases, cf the auto_da_alloc mount option).
Back then, people could point to specific examples of applications that depended on the sync-on-rename feature. What specific examples is he able to cite now? The entire discussion seems like a discussion of hypotheticals.