
So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have fewer bugs? In fact, I would expect it to have significantly more bugs because Cockroach is the only company using Pebble.

They mention briefly how they are going about randomized crash testing:

> The random series of operations also includes a “restart” operation. When a “restart” operation is encountered, any data that has been written to the OS but not “synced” is discarded. Achieving this discard behavior was relatively straightforward because all filesystem operations in Pebble are performed through a filesystem interface. We merely had to add a new implementation of this interface which buffered unsynced data and discarded this buffered data when a “restart” occurred.

but this seems to only scratch the surface of possibilities that can come up with a crash. For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
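To make that concrete, here's a rough sketch of the kind of buffering file wrapper the quoted post describes, extended so a simulated crash keeps some, all, or none of the unsynced data (hypothetical names, not Pebble's actual vfs interface):

```go
package crashfs

import (
	"math/rand"
	"os"
)

// crashFile buffers writes until Sync so a test harness can decide how much
// unsynced data "survives" a simulated crash. Hypothetical sketch only.
type crashFile struct {
	f        *os.File
	unsynced []byte
}

func (c *crashFile) Write(p []byte) (int, error) {
	// Writes go to the buffer, not the disk; they are not durable yet.
	c.unsynced = append(c.unsynced, p...)
	return len(p), nil
}

func (c *crashFile) Sync() error {
	// Sync makes everything buffered so far durable.
	if _, err := c.f.Write(c.unsynced); err != nil {
		return err
	}
	c.unsynced = c.unsynced[:0]
	return c.f.Sync()
}

// Crash simulates a power loss: some, all, or none of the unsynced bytes
// make it to disk, which is exactly the ambiguity described above.
func (c *crashFile) Crash() error {
	keep := rand.Intn(len(c.unsynced) + 1)
	if _, err := c.f.Write(c.unsynced[:keep]); err != nil {
		return err
	}
	c.unsynced = nil
	return c.f.Sync()
}
```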

Bugs in this area are still regularly found in e.g. Postgres, so I'm having a hard time seeing how Cockroach is making sure Pebble doesn't have similar problems.




> So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have fewer bugs? In fact, I would expect it to have significantly more bugs because Cockroach is the only company using Pebble.

We're only worried about functionality in Pebble used by CockroachDB. RocksDB has a huge number of features that sometimes have bugs due to subtle interactions. There is a very stable subset of RocksDB: the configuration and specific API usage patterns used internally by Facebook. That precise combination has seen extreme testing. But that isn't the subset of RocksDB used by CockroachDB. I would guess that the most significant testing of the subset of RocksDB used by CockroachDB is the testing we do at Cockroach Labs. Now that testing is being directed at Pebble along with the Pebble-specific testing detailed in the post.

> For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.

The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
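For anyone unfamiliar with the pattern, the guarantee being relied on is roughly this (a hypothetical sketch, not Pebble's actual WAL code): nothing is acknowledged until the log file has been fsync'd.

```go
package wal

import "os"

// appendRecord writes a record to the log and only reports success once the
// data has been synced (os.File.Sync wraps fsync), so an acknowledged commit
// survives a crash. Hypothetical sketch; real code batches and frames records.
func appendRecord(f *os.File, record []byte) error {
	if _, err := f.Write(record); err != nil {
		return err
	}
	// Without this sync the record may sit in the OS page cache and be
	// lost on power failure even though Write returned success.
	return f.Sync()
}
```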


> The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).

For anyone unfamiliar, fsync/fdatasync are infamous for all sorts of subtle sharp edges: https://www.usenix.org/conference/atc20/presentation/rebello

Having synchronous replication via paxos/raft can mitigate a lot of this.


As far as I'm aware, the fsync/fdatasync sharp edges are around what happens after an fsync/fdatasync failure. My understanding is that you can't rely on anything. The only sane option is to crash the process and attempt recovery on restart. Even that is fraught because data can be in the OS cache but not synced to disk. Pebble (and RocksDB) both take a fairly pessimistic view of what can be recovered. Sstables that were in the process of being written are discarded. The WAL and MANIFEST (which lists the current sstables) are truncated at the first sign of data corruption. Getting all of this right definitely takes time and effort.
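"Crash the process" in code terms is roughly the following (a hypothetical sketch of the stance, not actual Pebble code):

```go
package wal

import (
	"log"
	"os"
)

// syncOrDie treats a failed fsync as unrecoverable: after an fsync error the
// page cache may hold data the kernel will never retry writing, so the only
// safe move is to crash and recover from what is durably on disk.
func syncOrDie(f *os.File) {
	if err := f.Sync(); err != nil {
		log.Fatalf("fsync failed, shutting down for recovery: %v", err)
	}
}
```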

From the Rebello paper:

> However, on restart, since the log entry is in the page cache, LevelDB includes it while creating an SSTable from the log file.

Pebble and RocksDB both inherited this behavior. The nuance here is that the sstable is then synced to disk and no reads are served until the sync is successful. If the machine were to crash before the sstable was synced, upon restart we'd roll back to the durable prefix of the log.


Have you considered using direct IO for the log?


Yes. So far, performance has been worse in our experiments, and the durability improvement is questionable because it is extremely difficult to get a clear understanding of the durability semantics of direct IO. If you can find a pointer to clear documentation of what those semantics are I'd be extremely interested in reading it.
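For the curious, even just issuing direct IO from Go is awkward; a Linux-only sketch (hypothetical helpers, and real code also has to keep every write block-aligned and block-sized):

```go
package directio

import (
	"os"
	"syscall"
	"unsafe"
)

const blockSize = 4096

// openDirect opens a file with O_DIRECT, bypassing the OS page cache.
// Every subsequent Write must use an aligned buffer whose length is a
// multiple of blockSize.
func openDirect(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_WRONLY|os.O_CREATE|syscall.O_DIRECT, 0o644)
}

// alignedBuf returns an n-byte slice whose backing array starts on a
// blockSize boundary, as O_DIRECT requires (n should be a multiple of
// blockSize).
func alignedBuf(n int) []byte {
	raw := make([]byte, n+blockSize)
	off := int(uintptr(unsafe.Pointer(&raw[0])) & (blockSize - 1))
	if off != 0 {
		off = blockSize - off
	}
	return raw[off : off+n]
}
```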


RocksDB supports using direct IO for flushes and compactions (use_direct_io_for_flush_and_compaction), and enabling it improved write throughput for my workload. Any plans to support that in Pebble?


Direct IO is on our radar, though when I experimented with enabling direct IO in RocksDB it only hurt CockroachDB benchmarks. This was several years ago. I believe newer releases of RocksDB have made improvements in this area.


Well, I think what they're saying is that they'd rather have bugs in code they've written than in code that is written by other people and in another language, and for which they don't control the patching pipeline.

If RocksDB had had no bugs, they wouldn't have needed to write Pebble.


I'm sure 'not having to cross the cgo boundary' is significant when debugging, as well.


That's an argument for them using it, but it's also basically arguing why nobody else should.


Avoiding cgo would be a selling point for anyone else using go. Presumably other pure go kv stores like bbolt/badger/goleveldb would also solve that problem, but I don't know enough about them to understand the trade-offs.


Yes, but I think CockroachDB is also in a position most other people are not: they are a database company, so data storage is not only their expertise, it's their reason for existing. They have the people, the expertise, and it's core to their business. Most people don't have that, so they can't justify writing their own storage engine.


Hi, I'm on the team that works on Pebble. Partially synced WAL records are easier to detect, as they would just appear as corrupt records and we can stop WAL replay at that point. Non-WAL writes are even easier to handle as SSTable files are immutable once fully written and synced. We rely pretty heavily on fsync/fdatasync calls to guarantee that "all" the data in a given range made it.
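To sketch what "stop WAL replay at that point" looks like (hypothetical record framing, not Pebble's actual log format): each record carries a length and checksum, and replay keeps only the valid prefix.

```go
package wal

import (
	"encoding/binary"
	"hash/crc32"
)

// replay scans length-prefixed, checksummed records and returns the payloads
// of the valid prefix. A torn or partially synced record fails the length or
// CRC check, and everything from that point on is ignored (and can be
// truncated away).
func replay(buf []byte) [][]byte {
	var records [][]byte
	for len(buf) >= 8 {
		n := binary.LittleEndian.Uint32(buf[0:4])
		sum := binary.LittleEndian.Uint32(buf[4:8])
		if uint64(len(buf)) < 8+uint64(n) {
			break // record was cut off mid-write
		}
		payload := buf[8 : 8+n]
		if crc32.ChecksumIEEE(payload) != sum {
			break // corrupt record: stop replay here
		}
		records = append(records, payload)
		buf = buf[8+n:]
	}
	return records
}
```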

In addition to randomized crash tests, we have a suite of end-to-end integration tests on top of Cockroach, called Roachtests, that put clusters under a combination of node crash/restart scenarios and confirm data consistency.


> We rely pretty heavily on fsync/fdatasync calls to guarantee that "all" the data in a given range made it.

Reminds me of fsync gate: https://news.ycombinator.com/item?id=20491965 and https://news.ycombinator.com/item?id=19119991 (not implying Pebble uses fsync incorrectly).


Hi! Awesome work.

One of the cool things about the LevelDB codebase is the `Repository contents` section: a brief description of the most relevant modules so people can get familiar with the code base more quickly. As someone very interested in storage engines, I would love to see something similar here. Are you guys planning on adding some extra documentation to the project?


Fewer features and fewer lines of code, and those LOC are written in Go, which is the language in which CockroachDB is written and for which, presumably, their team and tooling are best optimized. It's a reasonable thesis.



