
> So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have fewer bugs? In fact, I would expect it to have significantly more bugs because Cockroach is the only company using Pebble.

We're only worried about functionality in Pebble used by CockroachDB. RocksDB has a huge number of features that sometimes have bugs due to subtle interactions. There is a very stable subset of RocksDB: the configuration and specific API usage patterns used internally by Facebook. That precise combination has seen extreme testing. But that isn't the subset of RocksDB used by CockroachDB. I would guess that the most significant testing of the subset of RocksDB used by CockroachDB is the testing we do at Cockroach Labs. Now that testing is being directed at Pebble along with the Pebble-specific testing detailed in the post.

> For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.

The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
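
For anyone who wants to see the shape of this, here's a minimal Go sketch of the write-then-sync ordering a WAL relies on. The names are illustrative, not Pebble's actual code, and fdatasync via golang.org/x/sys/unix is Linux-specific:

    package main

    import (
        "log"
        "os"

        "golang.org/x/sys/unix"
    )

    // appendAndSync appends a record to the WAL and only treats it as durable
    // once fdatasync returns successfully; until then the bytes may exist only
    // in the OS page cache.
    func appendAndSync(wal *os.File, record []byte) error {
        if _, err := wal.Write(record); err != nil {
            return err
        }
        // fdatasync flushes the file's data (plus any metadata needed to read
        // it back, such as the new file size) to stable storage.
        return unix.Fdatasync(int(wal.Fd()))
    }

    func main() {
        f, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if err := appendAndSync(f, []byte("set k=v\n")); err != nil {
            log.Fatal(err)
        }
    }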




> The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).

For anyone unfamiliar, fsync/fdatasync are infamous for all sorts of subtle sharp edges: https://www.usenix.org/conference/atc20/presentation/rebello

Having synchronous replication via paxos/raft can mitigate a lot of this.


As far as I'm aware, the fsync/fdatasync sharp edges are around what happens after an fsync/fdatasync failure. My understanding is that you can't rely on anything. The only sane option is to crash the process and attempt recovery on restart. Even that is fraught because data can be in the OS cache but not synced to disk. Pebble (and RocksDB) both take a fairly pessimistic view of what can be recovered. Sstables that were in the process of being written are discarded. The WAL and MANIFEST (which lists the current sstables) are truncated at the first sign of data corruption. Getting all of this right definitely takes time and effort.
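
To make the "crash and recover on restart" stance concrete, a tiny Go sketch (hypothetical helper, not Pebble's code) of treating a failed fsync as fatal rather than retrying it:

    package main

    import (
        "log"
        "os"
    )

    // mustSync aborts the process if fsync fails. After a failed fsync the
    // kernel may already have dropped the dirty pages, so a later "successful"
    // fsync proves nothing about this data; recovery happens on restart by
    // replaying the durable prefix of the log.
    func mustSync(f *os.File) {
        if err := f.Sync(); err != nil {
            log.Fatalf("fsync %s failed: %v; shutting down", f.Name(), err)
        }
    }

    func main() {
        f, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if _, err := f.Write([]byte("set k=v\n")); err != nil {
            log.Fatal(err)
        }
        mustSync(f)
    }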

From the Rebello paper:

> However, on restart, since the log entry is in the page cache, LevelDB includes it while creating an SSTable from the log file.

Pebble and RocksDB both inherited this behavior. The nuance here is that the sstable is then synced to disk, and no reads are served until the sync is successful. If the machine were to crash before the sstable was synced, upon restart we'd roll back to the durable prefix of the log.
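
As a self-contained illustration of that rollback, here's a toy Go sketch (a made-up record framing, not Pebble's actual log format): each record carries a length and a CRC32, and replay stops at the first torn or checksum-failing record, discarding the tail:

    package main

    import (
        "bufio"
        "encoding/binary"
        "fmt"
        "hash/crc32"
        "io"
        "os"
    )

    // replay returns the records in the durable prefix of the log. A truncated
    // header, a short payload, or a CRC mismatch all mark the end of that
    // prefix; everything after it is discarded.
    func replay(path string) ([][]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()

        var recs [][]byte
        r := bufio.NewReader(f)
        for {
            var hdr [8]byte // [4-byte length][4-byte CRC32]
            if _, err := io.ReadFull(r, hdr[:]); err != nil {
                return recs, nil // clean EOF or torn header: stop here
            }
            n := binary.LittleEndian.Uint32(hdr[0:4])
            sum := binary.LittleEndian.Uint32(hdr[4:8])
            payload := make([]byte, n)
            if _, err := io.ReadFull(r, payload); err != nil {
                return recs, nil // torn payload: discard the tail
            }
            if crc32.ChecksumIEEE(payload) != sum {
                return recs, nil // corrupt record: discard the tail
            }
            recs = append(recs, payload)
        }
    }

    func main() {
        recs, err := replay("wal.log")
        if err != nil {
            fmt.Println("open failed:", err)
            return
        }
        fmt.Printf("recovered %d records\n", len(recs))
    }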


Have you considered using direct IO for the log?


Yes. So far, performance has been worse in experiments, and the durability improvements are questionable because it is extremely difficult to get a clear understanding of the durability semantics of direct IO. If you can find a pointer to clear documentation of what those semantics are, I'd be extremely interested in reading it.
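
For context on why those semantics are hard to pin down, here's roughly what direct IO for a log involves on Linux (a sketch under assumptions, not something Pebble does): O_DIRECT bypasses the page cache but requires aligned buffers, offsets, and lengths, and by itself says nothing about the device write cache, so you still need fdatasync (or O_DSYNC) for durability:

    package main

    import (
        "log"
        "unsafe"

        "golang.org/x/sys/unix"
    )

    const alignment = 4096

    // alignedBuf over-allocates and re-slices so the returned buffer starts on
    // an alignment boundary, as O_DIRECT requires.
    func alignedBuf(size int) []byte {
        raw := make([]byte, size+alignment)
        off := alignment - int(uintptr(unsafe.Pointer(&raw[0]))%alignment)
        return raw[off : off+size]
    }

    func main() {
        fd, err := unix.Open("direct.log", unix.O_CREAT|unix.O_WRONLY|unix.O_DIRECT, 0o644)
        if err != nil {
            log.Fatal(err)
        }
        defer unix.Close(fd)

        buf := alignedBuf(alignment) // writes must be a multiple of the block size
        copy(buf, "set k=v")
        if _, err := unix.Write(fd, buf); err != nil {
            log.Fatal(err)
        }
        // O_DIRECT alone does not guarantee the data is on stable media.
        if err := unix.Fdatasync(fd); err != nil {
            log.Fatal(err)
        }
    }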


RocksDB supports using Direct IO for flush and compaction (use_direct_io_for_flush_and_compaction); enabling that improved write throughput for my workload in RocksDB. Any plan to do that in Pebble?


Direct IO is on our radar, though when I experimented with enabling direct IO in RocksDB it only hurt CockroachDB benchmarks. This was several years ago. I believe newer releases of RocksDB have made improvements in this area.



