So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have less bugs. In fact, I would expect it to have significantly more bugs because Coackroach is the only company using Pebble.
They mention briefly how they are going about randomized crash testing:
> The random series of operations also includes a “restart” operation. When a “restart” operation is encountered, any data that has been written to the OS but not “synced” is discarded. Achieving this discard behavior was relatively straightforward because all filesystem operations in Pebble are performed through a filesystem interface. We merely had to add a new implementation of this interface which buffered unsynced data and discarded this buffered data when a “restart” occurred.
but this seems to only scratch the surface of possibilities that can come up with a crash. For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
Bugs in this area are still regularly found in e.g. Postgres, so I'm having a hard time seeing how Coackroach is making sure Pebble doesn't have similar problems.
> So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have less bugs. In fact, I would expect it to have significantly more bugs because Cockroach is the only company using Pebble.
We're only worried about functionality in Pebble used by CockroachDB. RocksDB has a huge number of features that sometimes have bugs due to subtle interactions. There is a very stable subset of RocksDB: the configuration and specific API usage patterns used internally by Facebook. That precise combination has seen extreme testing. But that isn't the subset of RocksDB used by CockroachDB. I would guess that the most significant testing of the subset of RocksDB used by CockroachDB is the testing we do at Cockroach Labs. Now that testing is being directed at Pebble along with the Pebble-specific testing detailed in the post.
> For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
> The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
As far as I'm aware, the fsync/fdatasync sharp edges are around what happens after an fsync/fdatasync failure. My understanding is that you can't rely on anything. The only sane option is to crash the process and attempt recovery on restart. Even that is fraught because data can be in the OS cache but not synced to disk. Pebble (and RocksDB) both take a fairly pessimistic view of what can be recovered. Sstables that were in the process of being written are discarded. The WAL an MANIFEST (which lists the current sstables) are truncated at the first sign of data corruption. Getting all of this right definitely takes time and effort.
From the Rebello paper:
> However, on restart, since the log entry is in the page cache, LevelDB includes it while creating an SSTable from the log file.
Pebble and RocksDB both inherited this behavior. The nuance here is that the sstable is then synced to disk and no reads are served until the sync is successful. If the machine were to crash before the sstable was synced, upon restart we'd rollback to the durable prefix of the log.
Yes. So far performance was worse in experiments, and the durability improvements are questionable because it is extremely difficult to get a clear understanding of the durability semantics of direct IO. If you can find a pointer to clear documentation of what those semantics are I'd be extremely interested in reading it.
RocksDB supports using Direct IO for flush and compaction (use_direct_io_for_flush_and_compaction), enabling that can improve write throughput for my workload in RocksDB. Any plan to do that in pebble?
Direct IO is on our radar, though when I experimented with enabling direct IO in RocksDB it only hurt CockroachDB benchmarks. This was several years ago. I believe newer releases of RocksDB have made improvements in this area.
Well, I think what they're saying is that they'd rather have bugs in code they've written than in code that is written by other people and in another language, and for which they don't control the patching pipeline.
If RocksDB had had no bugs, they wouldn't have needed to write Pebble.
Avoiding cgo would be a selling point for anyone else using go. Presumably other pure go kv stores like bbolt/badger/goleveldb would also solve that problem, but I don't know enough about them to understand the trade-offs.
Yes, but I think CockroadDB is in also in a position most other people are not: they are a database, so data storage is not only their expertise, it's their reason for existing. They have the people, the expertise, and it's core to their business. Most people don't have that, so they can't justify writing their own storage engine.
Hi, I'm on the team that works on Pebble. Partially synced WAL records are easier to detect, as they would just appear as corrupt records and we can stop WAL replay at that point. Non-WAL writes are even easier to handle as SSTable files are immutable once fully written and synced. We rely pretty heavily on fsync/fdatasync calls to guarantee that "all" the data in a given range made it.
In addition to randomized crash tests, we have a suite of end-to-end integration tests on top of Cockroach, called Roachtests, that put clusters under a combination of node crash/restart scenarios and confirm data consistency.
One of the cool things about the LevelDB codebase is the `Repository contents` section. A brief description of the most relevant modules so people can get familiarized with the code base quicker. As someone very interested in storage engines, I would love to see something similar here. Are you guys planning on adding some extra documentation to the project?
Fewer features and fewer lines of code, and those LOC are written in Go, which is the language in which CockroachDB is written and which, presumably, for which their team and tooling are best optimized. It's a reasonable thesis.
I definitely was. I still use one now, more than 3 years after their business failure. I really wish someone would make something like the Pebble Time 2.
3 years later and it looks like they are just at the trying to get hardware features accessible. At this rate all pebble devices will have died / been discarded before its able to show a notification on your wrist.
I still use my Pebble Time Steel from the Kickstarter campaign.
Thinking of switching to Apple watch because I have iPhone and I also want to use Apple Pay without the phone. If they had better battery life, I probably would have done it already.
Frankly, I'm satisfied enough with my Pebble 2 that I'm not sure I care whether anyone makes a new one. I just hope I don't end up in a situation where it breaks and I can't get a replacement.
I’ve run into serious house burning down problems with myrocks too. Simple recipe to crash MySQL in a way that is unrecoverable: do ALTER TABLE on a big table and it runs out of RAM, crashes, and refuses to restart, ever.
Googling and people have been reporting the error on restarting several times on lists and things. What help is it to report to Maria dB or something? But do FB notice? Seems not.
Here’s hoping someone at FB browses HN...
I don’t get why FB don’t have some fuzzing and chaos monkey stress test to find easy stability bugs :(
I am one of the creators of MyRocks at FB. We have a few common MySQL features/operations we don't use at FB. Notably:
1) Schema Changes by DDL (e.g. ALTER TABLE, CREATE INDEX)
2) Recovering primary instances without failover
We use our own open source tool OnlineSchemaChange to do schema changes (details: https://github.com/facebook/mysql-5.6/wiki/Schema-Changes), which is heavily optimized for MyRocks use cases like utilizing bulk loading for both primary and secondary keys. ALTER TABLE / CREATE INDEX support in MyRocks is limited and suboptimal -- it does not support Online/Instant DDL (so blocking writes to the same table during ALTER), and enters non bulk loading path and trying to load the entire table in one transaction -- which may hit row lock count limit or out of memory. We have plans to improve regular DDL paths in MyRocks in MySQL 8.0, including supporting atomic, online and instant schema changes.
I am also realizing that a lot of external MySQL users still don't have auto failover and try to recover primary instances if they go down. This means single instance availability and recoverability is much more important for them. We set rocksdb_wal_recovery_mode=1 (kAbsoluteConsistency) by default in MyRocks, which actually degraded recoverability (higher chances to refuse to start even if it can be recovered from binlog). We're changing defaults to 2 (kPointInTimeRecovery) so that it can be more robust without relying on replicas for recovery.
It would have been a really bad experience when hitting OOM by 1) then failing to restart because of 2). We have relations with MariaDB and Percona, and will make default behavior better for users.
Thanks for explaining this! Really appreciate that you joined in here.
We've been test running our real-time dwh etls on myrocks (and postgres and timescale and even innodb) to comppare with our previous workhorse, tokudb. We've chewed through cpu years iterating over every switch and setting we can think of, to find optimum config for our workloads.
Like for example we've found that myrocks really slows down if you do a SELECT ... WHERE id IN (....) from too long a list of ids.
So we have lots of thoughts and data points on things my team have found easy, hard, painful, better etc. I'd be happy to share with you folks.
(FWIW we are moving from tokudb to myrocks now, with tweaks to how we do data retention and gdpr depersonalization and things)
Ping me on willvarfar at google's freemail domain if that's useful!
That's a long standing problem with MySQL, DDL statements (ALTER TABLE and such) where not atomic until version 8.0. This required some serious work on InnoDB as you can read some details at https://mysqlserverteam.com/atomic-ddl-in-mysql-8-0/
The problem is that the program runs out of RAM. The challenge is to write the data and metadata in such a way that a program crashing at any point for any reason is recoverable.
This is the basic promise of the Durability in ACID, and people using MyRocks expect it.
Rather than actually making sure that MyRocks is durable, they simply slap on a 'max transaction rows' to make it unlikely you run out of RAM. Instead, you simply get an error, and can't do stuff like ALTER TABLE or UPDATE on large tables.
Of course its easy to run out of RAM despite these thresholds, and its easy to find advice when you google the error messages you get that lead you to up the thresholds and to even set a 'bulk load' flag that disables various checks you probably haven't investigated.
The whole approach is wrongheaded!
A database that crashes should not be corrupt!!! Isn't this reliability 101? Why doesn't myrocks have chaos monkey stress testing etc etc?
>A database that crashes should not be corrupt!!! Isn't this reliability 101? Why doesn't myrocks have chaos monkey stress testing etc etc?
Because Facebook has little incentive to ensure that RocksDB works well in your use case. MyRocks was built for Facebook and anything that Facebook doesn’t do probably isn’t particularly hardened. They aren’t going to invest time doing chaos monkey stress testing on codepaths they don’t use. Things like durability might not be super important to them because they will make it up in redundancy.
I remember being burned by something similar during the early days of Cassandra. I’m sure Cockroach has hit the same bugs.
"A final alternative would be to use another storage engine, such as Badger or BoltDB (if we wanted to stick with Go). This alternative was not seriously considered for several reasons. These storage engines do not provide all the features we require, so we would have needed to make significant enhancements to them.
...
Lastly, various RocksDB-isms have slipped into the CockroachDB code base, such as the use of the sstable format for sending snapshots of data between nodes. Removing these RocksDB-isms, or providing adapters, would either be a large engineering effort, or impose unacceptable performance overhead.
"
I looked into the source code of both badger and pebble during the lockdown. From what I learned, they don't operate on the same level. I don't want to publicly bad mouth any open source software, but if you just spend 15 minutes on the source code of pebble and badger, you wouldn't be asking the same question.
Really curious to hear your findings! We're happy users of Badger, but we have never looked it's internals. I guess you can list the differences you've found without using a bad mouth. Thanks in advance.
Badger is written by mad people from my point of view, who disabled issues on github, from my understanding declared it as "done" and "bug free", and any issue tracking is now done on the forum where the threads roll off to the void with no further trace.
Wao. You describe us as “mad people” because we choose to not use GitHub issues? Is that all it takes to dismiss an open-source software and badmouth its authors? You have gone really low on this.
All the issues have been ported over to Discourse. And no one has declared Badger, bug-free. I don’t know where you got that idea.
Concurrency and multithreading are a major focus of both Go and RocksDB. This introduction makes little mention of those areas, and I'm curious if there's any more to be said on this. The article lists several features being reimplemented, including:
> Basic operations: Set, Get, Merge, Delete, Single Delete, Range Delete
It makes no mention of RocksDB's MultiGet/MultiRead -- is CockroachDB/Pebble limited to query-at-a-time per thread? I'm genuinely curious how this all translates into Go's M:N coroutine model currently and moving forward with Pebble.
Pebble does not currently implement MultiGet as CockroachDB did not use RocksDB's MultiGet operation. CockroachDB can use multiple nodes to process a query by decomposing SQL queries along data boundaries and shipping the query parts to be executed next to the data. CockroachDB can't directly use MultiGet because that API was not compatible with how CockroachDB reads keys.
RocksDB MultiGet is interesting. Parallelism is achieved by using new IO interfaces (io_uring), not by using threads. That approach seems right to me. See https://github.com/facebook/rocksdb/wiki/MultiGet-Performanc.... My understanding is that io_uring support is still a work in progress. We experimented at one point with using goroutines in Pebble to parallelize lookups, but doing so was strictly worse for performance. Experimenting with io_uring is something we'd like to do.
Indeed the conceptual fork point mentioned is RocksDB 6.2.1 which came before those features. The problem with RocksDB is that one thread only makes one request at a time. I should've phrased my question more succinctly: Is Pebble/CockroachDB capable of saturating the backplane with requests in parallel? Does it multiplex a single query by dispatching smaller requests to a thread-pool?
> Is Pebble/CockroachDB capable of saturating the backplane with requests in parallel?
Yes.
> Does it multiplex a single query by dispatching smaller requests to a thread-pool?
Yes, though it depends on the query. Trivial queries (i.e. single-row lookups) are executed serially as that is the fastest way to execute them. Complex queries are decomposed along data boundaries and the query parts are executed in parallel next to where the data is located.
The TLDR is that the GC did cause problems so we had to avoid it for the block cache. Luckily we were able to do so without exposing the complexity in the API. Not for the faint of heart. Don't try this at home kids.
Any reason why the block cache needs to be 10s of GB in size? Cassandra, for example, usually has a rather small key cache on heap, and then just relies on the kernel page cache. I don't have experience with cockroach, it looks like the block cache is similar to cassandra row cache, which can be configured to be on heap or off heap, but usually not beneficial to enable.
Not to be confused with Let's Encrypt's ACME client testing CA server project (the "scaled down" version of Boulder), Pebble: https://github.com/letsencrypt/pebble
Why would someone remove a non-GC database engine with a database engine with GC?
Has Go evolved better low-GC features? As I understand Go GC vs JVM GC, Go avoids major GC by simply pushing it to the future and consuming memory more readily.
But a database is a long-running program, so you have to pay the piper eventually.
Upthread [0], you can find some notes and an explanation from one of the devs that they actively work around the GC, because the approach it uses just... Doesn't work for this kind of workload.
> This percentage can be configured by the GOGC environment variable or by calling debug.SetGCPercent. The default value is 100, which means that GC is triggered when the freshly allocated data is equal to the amount of live data at the end of the last collection period. This generally works well in practice, but the Pebble Block Cache is often configured to be 10s of gigabytes in size. Waiting for 10s of gigabytes of data to be allocated before triggering a GC results in very large Go heap sizes.
I wonder the same thing. I have not programed a DB but..
I would think the worst thing to program in a gc'd language would be the cache. So, if you write that layer of the database separately in a non-gc'd you would avoid most of the headache.
EDIT: I read the parent comment as why program a db in a gc language. I think the new wave 1st gen db's are doing alot of novel things. Minus the cache I imagine the velocity improvements form the 'simpler' language makes sense. Key value stores are become well trod ground however.
I have had long running Java programs and their memory usage if done correctly is a nice saw tooth. As long as you are not leaking you can run a well written Java server for long long periods of time (months) without a bounce.
Pebble and FoundationDB are apples and oranges. Pebble is per-node KV storage engine. FoundationDB is a distributed multi-modal database. Internally, FoundationDB uses a library like Pebble for the per-node data storage. I think at one point it used SQLite. I'm not up to date on what it currently use. I seem to recall FoundationDB was writing their own btree-based node-level storage engine to replace the usage of SQLite.
The equivalent of FoundationDB is present inside of CockroachDB: a distributed, replicated, transactional, KV layer. This is where a big chunk of CockroachDB value resides. This is where our use of Raft resides. Pebble lies underneath this.
The current production storage engine is an old-ish version of the SQLite btree. A new btree engine is being written now and is available but I don’t know if it is being used in production anywhere.
RocksDB is shipping soon thanks to some work by members of the community.
As a consumer, why would I want something like this written in Go vs. Rust?
Is it just that Rust is really good with developer relations? Because it feels like to me that all new foundational technology is safer and faster in a language like Rust, and things written in Go should be higher up the food chain.
This is not a standalone DB. Its a key-value store implemented as a library. You would want this if you were a Go developer working on an application that needed a built in key-value store.
If you were a Rust developer you'd want something similar written in Rust.
In addition to what others mentioned, even if Cockroach is written in Go, they could have used Rust, but the trade-off is that they would need to use cgo which introduces extra complexity for building, debugging, and has performance trade-offs that a pure-Go based solution doesn't necessarily have.
Reading this, I wonder how hard it would be to write or generate assembly bindings for libraries like rocksdb and sqlite, and whether you'd be able to do any better than cgo.
Cgo exists because goroutine stacks are small and grow on demand. C/C++ expect large fixed size stacks. While you can technically call into C++ without going through cgo (see https://github.com/petermattis/fastcgo), you can't avoid the stack problem. Cgo solves it by switch from the goroutine stack to the stack of the thread underlying the goroutine.
I really don’t understand the downvotes. I’m not experienced with either language - and this has nothing to do with a flame war.
The question, unstated and unopinionated AND intellectually honest was: does language impact community adoption - and if so, what are the drivers behind it.
If I were going to write a foundational technology, I probably wouldn’t write it in NodeJS, not that it couldn’t be done, but because I’d be concerned mainstream adoption might suffer. For example, I’d expect a hypothetical JsSql (a SQL engine written in JavaScript - assuming this doesn’t already exist) would achieve lower general adoption than writing it in C++.
I don’t think Cockroach cares about adoption. This is not meant to be a generally useful product in itself outside of their database. So the language was chosen mainly based on their familiarity with Go and its ability to integrate with their existing codebase.
I think their omission of major features such as transactions is more likely to limit adoption than the language choice, so language choice is kind of irrelevant from that perspective.
The integration with their existing codebase might benefit from an explanation. In Go, calls to C libraries are done using a compiler feature called cgo, and there's a performance penalty that the cockroachdb team measured at 171ns per call, plus you sometimes have to copy additional data around for memory safety[0]. So far as I know, there's no way to avoid this penalty that doesn't involve building a tool that does the same thing as cgo.
In this case, given that they already have a database written in Go, writing the backing kv store library in Go has a clear performance benefit of 171ns every operation. This is above and beyond just familiarity and easy integration, although those are also important.
They mention briefly how they are going about randomized crash testing:
> The random series of operations also includes a “restart” operation. When a “restart” operation is encountered, any data that has been written to the OS but not “synced” is discarded. Achieving this discard behavior was relatively straightforward because all filesystem operations in Pebble are performed through a filesystem interface. We merely had to add a new implementation of this interface which buffered unsynced data and discarded this buffered data when a “restart” occurred.
but this seems to only scratch the surface of possibilities that can come up with a crash. For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
Bugs in this area are still regularly found in e.g. Postgres, so I'm having a hard time seeing how Coackroach is making sure Pebble doesn't have similar problems.