Why `fsync()`: Losing unsynced data on a single node leads to global data loss (redpanda.com)
72 points by avinassh on May 16, 2023 | 45 comments



I'm pretty sure Redpanda has identified a real issue here, but my only response to any situation where you need to be absolutely, positively, 100% sure that `fsync()` has done The Right Thing would be: "Good luck with that!"

Even in a relatively simple modern storage stack, say a single NVMe Flash device on a local PCIe controller, the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer without an NDA with the disk manufacturer and a hefty dose of proprietary IOCTLs (and even then, I would not bet my life on it).

In a typical 'Enterprise' stack, which would extend the above with a healthy dose of Fibre Channel over Ethernet, load-balancing switches, distributed RAM caches, and a bunch of other complexity, I would not even bet the life of my goldfish on it.

The article mentions Byzantine systems as a possible solution, but my only knowledge about those comes from reading https://www.usenix.org/system/files/login/articles/13_micken..., which makes me not exactly optimistic about their real-world applicability.

One thing that comes to mind when reading things like this is 'Distributed journaling for distributed systems', but since nobody seems to be doing that, it's probably a silly idea. I'll probably stick to SQLite write-ahead logs, at least I can wrap my mind around those...


" the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer"

Well, kind of. fsync()'s job is only to ensure caches are written to durable storage. Its job is not to ensure the integrity of durable storage. Your idea of fsync()'s "Right Thing" is not quite correct, because these types of durable storage failures can happen outside of the context of a write -- so it's a bit silly to point a finger at fsync().

For example, you can write bytes to disk today and the drive might experience corruption on that sector next week.

"(and even then, I would not bet my life on it)"

All hardware fails, so of course you shouldn't bet your life on it. That's why we have both onsite and offsite backups. The durable media itself can always fail and lose data -- even if there aren't any writes occurring. Data loss can even occur on a powered off system.

Again, fsync()'s job is only to ensure the data is moved to durable storage. Solving the problem of data loss is done in other ways -- for example raid5 addresses the problem of durable storage failing by using parity and extra copies.
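
To make the division of labor concrete, here is a minimal sketch (mine, not from the article or this comment) of what the write-plus-fsync contract actually covers; note that nothing in it protects against the media failing later:

    /* Illustrative helper, not anyone's production code: append a record and
     * flush it. fsync() asks the kernel to push its caches to durable storage;
     * it says nothing about that sector surviving next week. */
    #include <unistd.h>

    int append_record(int fd, const void *buf, size_t len) {
        ssize_t n = write(fd, buf, len);
        if (n < 0 || (size_t)n != len)
            return -1;               /* short or failed write */
        if (fsync(fd) != 0)
            return -1;               /* caches not confirmed on durable storage */
        return 0;
    }

Everything past that return -- bit rot, controller failure, a drive dying while powered off -- is the territory of RAID, replication, and backups, not of fsync().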

The real takeaway from the blog is perhaps that Raft as a protocol is inadequate, in the same way a raid1 mirror system is inadequate in terms of protecting from data corruption on durable storage (which of the two mirrored drives has the correct version of the data?). I'm frankly surprised that this situation is undetectable. It's a solved problem in other storage layers.

"I'll probably stick to SQLite write-ahead logs, at least I can wrap my mind around those"

Well, keep in mind: those have the same problem in the event of durable storage failure.


> Again, fsync()'s job is only to ensure the data is moved to durable storage.

The problem is that every layer in the stack lies. They say "yes it is definitely written to durable storage" when it is just in some cache layer and about to be written.


"The problem is that every layer in the stack lies. "

Not enterprise gear. Reliable storage does exist. I have tested many vendors myself, and gone through spec sheets under NDA (as mentioned above).

"They say "yes it is definitely written to durable storage" when it is just in some cache layer and about to be written."

Enterprise hardware contains batteries specifically designed so that caches can still be written out to durable storage in the event of power loss. Have you ever dealt with managing battery learning cycles on a Dell PERC?


> it is just in some cache layer and about to be written

Definitely not. I get that sometimes we find things that do lie, but lying about this is a huge P0 bug and everyone with an interest in data storage knows to be on the lookout for such brokenness. E.g. such people do not buy SSDs that lack some sort of power-fail flush protection.


> One thing that comes to mind when reading things like this is 'Distributed journaling for distributed systems', but since nobody seems to be doing that, it's probably a silly idea

That's exactly what Redpanda, BookKeeper, LogDevice, Pravega, and countless other proprietary systems at big tech companies do.


There is no way to know for sure that a Byzantine system is actually operating in a way that replicated copies of your data are actually safely written. That's both because of the exact same issue with the drives themselves, and because Byzantine system software is liable to have a variety of bugs and invalid states that will keep it operating as normal even though the nodes are actually in a fault mode. (The problem is twofold: 1) the node and/or system not refusing writes when in a fault state, 2) the system not actually knowing that it's in a fault state.) Even if you do all kinds of Jepsen simulation and mathematical proofs of the software (including the operating system!), you still can't trust the drives.

I think the only way to solve the problem is new storage firmware and hardware that is open and guarantees a write is done. I'm sure some companies may claim such functionality but we need an open source architecture and code to be sure.

In the meantime I think synchronous writes to multiple nodes is the safest option. Avoids complexity and bugs in fancy software, and the hardware is what it is.


> In the meantime I think synchronous writes to multiple nodes is the safest option.

This is what is being proposed though, right? Like no one is saying just use fsync, it's to use fsync across multiple systems, no?


> There is no way to know for sure that a Byzantine system is actually operating in a way that replicated copies of your data are actually safely written.

Isn't this how Bitcoin adds to the ledger though? Using a merkle tree and slowing things down significantly with those 6+ confirmations.


Technically Bitcoin can work without ever safely writing to disk. It's potentially all magical caches until the firmware decides to put bits to media.


That's actually a really good point. It isn't dependent on the hardware, it depends on the math.


That’s right. Raft uses sync writes, which is what Redpanda uses.

Complexity and industrial level reference impls are crucial


> Even in a relatively simple modern storage stack, say a single NVMe Flash device on a local PCIe controller, the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer without an NDA with the disk manufacturer and a hefty dose of proprietary IOCTLs (and even then, I would not bet my life on it).

The answer is easy in all cases: No.

Media can always fail without notice.


Media can fail, but the parent there is saying they can fail with byzantine faults.

Most of us don't protect against such faults. (Neither Redpanda nor Kafka do, but that's not the point of the OP; the OP's point is that Kafka is not protecting against non-byzantine faults in the disk.)


Media failures after the data is written seem out of scope.


Using fsync() to flush data to disk is, generally speaking, the wrong way to do things. fsync() is like using a piece of wood to hammer a nail into the ground. It'll work under very limited circumstances, but will fail horribly to do the heavy lifting required in other cases. fsync()'s semantics are complex and poorly defined, doubly so if multiple threads or processes are involved. Who gets the I/O error for a failed write? Nobody knows! fsync() is probably one of the worst API warts in Linux/Unix.

Developers competent in matters of storage generally use other methods/APIs, like open()ing a file with O_SYNC and then checking the results of write() or pwrite(), which will clearly identify when data is or is not successfully written to disk. If you want more performance, use async writes and it'll perform just fine. Seeing fsync() in code is a sign that the code has a bad smell and probably needs to be thoroughly reviewed.
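
For what it's worth, a rough sketch of the O_SYNC pattern being described (my example, not the commenter's code): each write is synchronous, so an I/O error is attributable to that specific write rather than to "everything since the last fsync()".

    #include <fcntl.h>
    #include <unistd.h>

    /* Illustrative only: with O_SYNC, pwrite() doesn't return until the data
     * (and required metadata) has been handed to durable storage, so any error
     * belongs to exactly this write. */
    int write_record_sync(const char *path, const void *buf, size_t len, off_t off) {
        int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
            return -1;
        ssize_t n = pwrite(fd, buf, len, off);
        int rc = (n == (ssize_t)len) ? 0 : -1;
        close(fd);
        return rc;
    }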


There may be performance reasons to still use fsync with O_DIRECT instead of only O_SYNC.

https://news.ycombinator.com/item?id=15539828


> Using fsync() to flush data to disk is, generally speaking, the wrong way to do things.

I think this is an old "statement" due to the Linux ext4 sync issue from quite a while ago. As far as I know it has been fixed. I doubt it is discouraged on the BSDs.

But, curious, what is the Right Thing to do? There are cases where you want to be sure the data reaches the platter.

On a mini I used to program on, if a filename started with an '@' sign, all writes to it went directly to disk. Based upon that we were able to create roll-forward recovery for our custom applications. So, as far as I know, fsync(2) should accomplish the same thing using the open descriptor.


I stand by my position. The problem with fsync() is the question of when and where errors get reported. If code checks errors on completion of each write, it becomes significantly easier to know which write has failed. Error reporting happens at the granularity of the write being performed.

In contrast, fsync() has to report errors for everything that has happened since the last fsync() call. If you are doing anything complicated like a database, which is virtually every single modern application, you need to have some idea of which writes have failed to figure out how to recover. fsync() gives you none of that.

In kernel you get results for each i/o performed, so filesystems can make the appropriate decision about how to respond to a failure.

If you're building a high performance application, fsync() is a disaster. Unrelated writes will get bundled up when different threads call fsync() on the same file descriptor. Having multiple file descriptors open on the same file is silly and causes issues with applications that need to have many file descriptors open. The Linux kernel goes to great lengths to ensure that filesystems are high performance for writes on multiple threads.

fsync() sucks for anything more complex than "I have a single file to write out".


it tends to be true. we do things at the application tier to minimize these things.

for example, assume you have to write A, B, C ... in syscalls it would be

write(A), flush(), write(B), flush(), write(C), flush()

so if you add a debounce of say 4ms you get

write(A), write(B), write(C), flush()

and saved 2 flush()-es

This is common with protocol aware storage applications.
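
A rough sketch of that debounce / group-commit shape (illustrative names, not Redpanda's actual code): buffer the writes that arrive within the window, issue them, pay for a single flush, and only ack the batched requests once the flush has completed.

    #include <unistd.h>

    struct pending {
        const void *buf;
        size_t      len;
        void      (*ack)(int err);   /* callback to answer the client */
    };

    /* Write everything gathered during the debounce window, then one fsync().
     * No request is acknowledged before the shared sync completes. */
    void flush_batch(int fd, struct pending *reqs, int n) {
        int err = 0;
        for (int i = 0; i < n && !err; i++)
            if (write(fd, reqs[i].buf, reqs[i].len) != (ssize_t)reqs[i].len)
                err = -1;
        if (!err && fsync(fd) != 0)
            err = -1;
        for (int i = 0; i < n; i++)
            reqs[i].ack(err);
    }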


And I guess the idea, furthermore, is that just because you don't fsync after every write doesn't mean that you haven't fsync-ed before responding to the user request saying the data is stored durably. I assume that you do actually guarantee the flush before returning a success to the user.


Yes, this is exactly how it works. It effectively batches the sync part of multiple user requests, so none of those user requests are ack-ed until the sync completes. It is a low-latency analogue to checkpointing storage write-backs to minimize the number of round-trips to the I/O syscalls.


Exactly. You improve latency and throughput at the same time. It's kinda cool. By delaying the responses just a small bit you get huge benefits.


Wrong. Use something like aio or io_uring to submit 4 asynchronous writes in a single system call and you'll get way better performance. The kernel has all kinds of infrastructure that tries to coalesce and batch things that the write()+fsync() syscalls make horribly inefficient, and in the modern world of really fast NVMe drives, you want to make your calls into the device driver as efficient as possible. You'll burn far fewer CPU cycles by giving the kernel the whole set of I/Os in a single go, and burn less on synchronization. It really is better to avoid write() + fsync().
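
For reference, a sketch of what that batched submission could look like with liburing (assumptions: the writes target one already-open fd, and error handling is abbreviated). The writes and a trailing fsync are linked and submitted in a single system call.

    #include <liburing.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Illustrative only: queue n writes plus a linked fsync and submit them in
     * one syscall. IOSQE_IO_LINK makes each op wait for the previous one, so
     * the fsync runs after all writes have completed. */
    int submit_batch(int fd, struct iovec *iov, int n, off_t off) {
        struct io_uring ring;
        if (io_uring_queue_init(n + 1, &ring, 0) < 0)
            return -1;

        for (int i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_write(sqe, fd, iov[i].iov_base, iov[i].iov_len, off);
            io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
            off += iov[i].iov_len;
        }
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

        io_uring_submit(&ring);                       /* one syscall for the batch */

        int rc = 0;
        for (int i = 0; i < n + 1; i++) {             /* reap and check every completion */
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0) { rc = -1; break; }
            if (cqe->res < 0)
                rc = -1;                              /* a write (or the fsync) failed */
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return rc;
    }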


What makes you think that we haven't tested this? Seastar's I/O engine defaults to io_uring... There are about 10 things here to comment on. On optimized kernels we disable block coalescing at the kernel level; second, we tell the kernel to use FIFO, etc. This low-hanging fruit is something we've done for a very, very long time.


Right. But that line of thinking gets you to a place where one is like “how does anything work” haha.

In general this was a response to Confluent attempting to dismiss fsync() as a neat trick rather than an actual safety problem, and why, when we benchmarked, we showcased the numbers we did.


Which numbers? The ones on the blog that shows Redpanda significantly lagging behind open source Kafka?

https://jack-vanlightly.com/blog/2023/5/15/kafka-vs-redpanda...


this is a burner account created at the time this post was up.


> Even in a relatively simple modern storage stack, say a single NVMe Flash device on a local PCIe controller, the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer without an NDA with the disk manufacturer and a hefty dose of proprietary IOCTLs (and even then, I would not bet my life on it).

This would limit throughput quite a bit and might significantly shorten the lifespan of the disk, but you could put the disk on a separate power supply that the computer could control. Don't consider data as written until you've power cycled the disk and been able to read that data back. :-)


> you need to be absolutely, positively, 100% sure that `fsync()` has done The Right Thing would be: "Good luck with that!"

But it's safe to assume the right thing is not done if we haven't even called `fsync` to start with and just hope and pray that the dirty page flushing mechanism does it for us.


I'm confused here. If I'm reading this correctly, the following is happening:

- Node 3 is the leader.

- Node 1 is isolated (failure #1).

- 10 records are written. They are successful because a majority is still alive (persisted to Node 2 and 3).

- Node 3 loses data (failure #2).

- Node 1 comes back up again.

However, isn't this two failures at the same time? Kafka with three nodes can only guarantee a single failure, no?


Strongly consistent protocols such as Paxos and Raft always choose consistency over availability, and when consistency isn't certain they refuse to answer.

Raft & Paxos: any number of nodes may be down; as soon as a majority is available, a replicated system is available and doesn't lie.

Kafka as it's described in the post (*): any number of nodes may be down, at most one power outage is allowed (loss of unsynced data); as soon as a majority is available, a replicated system is available and doesn't lie.

The counter-example simulates a single power outage.

(*) https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...


It just feels like two widely different scenarios we're talking about here.

https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-... talks about the case of a single failure and it shows how (a) Raft without fsync() loses ACK-ed messages and (b) Kafka without fsync() handles it fine.

This post on the other hand talks about a case where we have (a) one node being network partitioned, (b) the leader crashing, losing data, and coming back up again, all while (c) ZooKeeper doesn't catch that the leader crashed and elects another leader.

I think definitely the title/blurb should be updated to clarify that this is only in the "exceptional" case of >f failures.

I mean, the following paragraph seems completely misleading:

> Even the loss of power on a single node, resulting in local data loss of unsynchronized data, can lead to silent global data loss in a replicated system that does not use fsync, regardless of the replication protocol in use.

The next section (and the Kafka example) is talking about loss of power on a single node combined with another node being isolated. That's very different from just "loss of power on a single node".


We can't ignore or pretend that network partitioning doesn't happen. When people talk about choosing two out of CAP the real question is C or A because P is out of our control.

When we combine network partitioning with a single local data-suffix loss, it either leads to a consistency violation or to the system being unavailable despite the majority of the nodes being up. At the moment Kafka chooses availability over consistency.

Also, I read the Kafka source and the role of network partitioning doesn't seem to be crucial. I suspect that it's also possible to cause a similar problem with a single-node power outage and unfortunate timing: https://twitter.com/rystsov/status/1641166637356417027


For what it’s worth, this form of loss wouldn’t be possible under KRaft since they (ironically?) use Raft for the metadata and elections. Ain’t nobody starting a new cluster with Zookeeper these days.


You are right, this is a failure of both ZooKeeper and the leader node, so two independent failures at the same time.


Exactly my thoughts.


I don't know what consistency Kafka is targeting, but usually with 'consistent' systems, in a situation where the cluster cannot serve valid data, it should serve failure responses instead of giving back bogus data. If your system is 'eventually consistent' then that's a different game.



> A system must use cutting-edge Byzantine fault-tolerant (BFT) replication protocols, which neither of these systems currently employ.

Cutting-edge? pBFT (Practical Byzantine Fault Tolerance) was published in 1999. The first Tendermint release was in 2015. With few exceptions, almost all big proof of stake blockchains are powered by variations of pBFT and have been for many years.


Yep, I still consider them to be cutting edge. Paxos was written in 1990 but the industry adopted it only in the 2010s. For example, I've looked through pBFT and it doesn't mention a reconfiguration protocol, which is essential for industry use. I've found one from 2012, so it should be getting ripe by now.


Protocol Aware Recovery is really needed if you want Raft to tolerate disk corruption. https://www.usenix.org/conference/fast18/presentation/alagap...


> Protocol Aware Recovery is really needed if you want Raft to tolerate disk corruption

I believe the OP does not make any claims about arbitrary log corruption. Neither Raft nor the Kafka protocol can handle it. It is about losing the tail of the log due to fsync failures, or rather a lack of fsyncs.


The blog post mentions that it applies to any protocol that is not BFT.


I wish articles such as this would also talk about the risk-adjusted cost of absolute data safety or provide treatment discussing the cost of ensuring the last 0.00001% of data is unconditionally guaranteed in the event of black swan events. For example, what's better for the business? Double EC2 operating costs from $100K/mo to $200K/mo or pay out $10K in credits/yr as a direct loss to the business plus whatever the reputational or regulatory cost? Sometimes you need 100%. Sometimes you don't. Not acknowledging this tradeoff makes the article very FUD'y and is a disservice to engineers who get sucked into black-and-white thinking.



