
Slava @ rethink here.

This is a really interesting subject -- I should do a talk/blog post about this at some point. Here is a quick summary.

RethinkDB's storage engine heavily relies on the notion of immutability/append-only. We never modify blocks of data in place on disk -- all changes are recorded in new blocks. We have a concurrent, incremental compaction algorithm that goes through the old blocks, frees the ones that are outdated, and moves things around when some blocks have mostly garbage.
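To make that concrete, here's a toy sketch of the shape of the thing (not RethinkDB's actual code -- the class name and threshold are invented): writes only ever append, a map tracks which block holds the live version of each key, and compaction copies the live blocks forward once enough of the file is garbage.

    # toy append-only block store with incremental compaction
    class AppendOnlyStore:
        def __init__(self, compact_threshold=0.5):
            self.blocks = []   # append-only list standing in for the data file
            self.live = {}     # key -> index of the block with its latest value
            self.compact_threshold = compact_threshold

        def put(self, key, value):
            self.blocks.append((key, value))        # never modify a block in place
            self.live[key] = len(self.blocks) - 1   # the older block becomes garbage

        def get(self, key):
            return self.blocks[self.live[key]][1]

        def garbage_ratio(self):
            if not self.blocks:
                return 0.0
            return (len(self.blocks) - len(self.live)) / len(self.blocks)

        def maybe_compact(self):
            # once enough blocks are dead, copy only live blocks into a fresh "file"
            if self.garbage_ratio() < self.compact_threshold:
                return
            new_blocks, new_live = [], {}
            for key, idx in self.live.items():
                new_live[key] = len(new_blocks)
                new_blocks.append(self.blocks[idx])
            self.blocks, self.live = new_blocks, new_live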

The system is very fast and rock solid. But...

Getting a storage engine like that to production state is an enormous amount of work and takes a very long time. Rethink's storage engine is really a work of art -- I consider it a marvel of engineering, and I don't mean that as a compliment. If we were starting from scratch, I don't think we'd use this design again. It's great now, but I'm not sure if all the work we put into it was ultimately worth the effort.




I really think there are a couple of levels of immutability that it is easy to conflate.

Specifically immutability for

1. In-memory data structures -- this is the contention of the functional programming people.

2. Persistent data stores. This is the LSM style of data structure that substitutes linear writes and compaction for buffered in-place mutation.

3. Distributed system internals -- this is a log-centric, "state machine replication" style of data flow between nodes. This is a classic approach in distributed databases, and present in systems like PNUTS.

4. Company-wide data integration and processing around streams of immutable records between systems. This is what I have argued for (http://engineering.linkedin.com/distributed-systems/log-what...) and what I think Martin is mostly talking about.

There are a lot of analogies between these but they aren't the same. Success of one of these things doesn't really imply success for any of the others. Functional programming could lose and log-structured data stores could win or vice versa. Pat Helland has made an across the board call for immutability (http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf), but that remains a pretty strong assertion. So it is worth being specific about which level you are thinking about.

For my part I am pretty bullish about stream processing and data flow between systems being built around a log or stream of immutable records as the foundational abstraction. But whether those systems are internally built in functional languages or use an LSM-style data layout on disk is kind of an implementation detail. From my point of view immutability is a lot more helpful in the large than in the small -- I have never found small imperative for loops particularly hard to read, but process-wide mutable state is a big pain, and undisciplined dataflow between disparate systems, caches, and applications at the company level can be a real disaster.
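The abstraction at level 4 is deliberately tiny. A toy sketch (names made up, nothing like Kafka's real API): the only shared thing is an append-only sequence of immutable records, and each consumer owns nothing but an offset into it.

    # the log as the integration point between systems
    log = []

    def produce(record):
        log.append(record)   # records are immutable once written

    def consume(offset, max_records=10):
        # each consumer tracks only its own offset; re-reading is always safe
        batch = log[offset:offset + max_records]
        return batch, offset + len(batch)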


Excellent points, yes it's important to clarify what we're talking about here. Samza sounds like an event-sourcing-style immutable event log. You could think of it like the transaction or replication log of a traditional database. Having that be immutable is very sensible! But you can't always query that in "real time".

On the other hand, making the data structures you query in real time immutable is problematic, because then you'll need a LevelDB-style compaction step. That doesn't mean it can't be done well, but it's hard to do well.


LMDB does ACID MVCC using copy-on-write with no garbage collection or compaction needed. It delivers consistent, deterministic write performance with no pauses. It is actually now in use in a number of soft realtime systems.
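The core trick is path copying. A minimal sketch of the idea (a copy-on-write binary tree, not LMDB's actual B+tree or page layout): an update copies only the nodes on the path to the root and returns a new root, so every old root remains a valid read-only snapshot and readers never block writers.

    class Node:
        def __init__(self, key, value, left=None, right=None):
            self.key, self.value, self.left, self.right = key, value, left, right

    def insert(root, key, value):
        # copy-on-write: never mutate an existing node, only allocate new ones
        if root is None:
            return Node(key, value)
        if key < root.key:
            return Node(root.key, root.value, insert(root.left, key, value), root.right)
        if key > root.key:
            return Node(root.key, root.value, root.left, insert(root.right, key, value))
        return Node(key, value, root.left, root.right)   # overwrite -> fresh node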


I was specifically thinking of LMDB as a counter-example when I wrote that it's not impossible, just hard to do well. A much more sensible set of tradeoffs than LevelDB.


> I really think there are a couple of levels of immutability that it is easy to conflate.

Completely agree. I was talking about immutability at the storage engine level. Totally different tradeoffs apply at the different levels of the stack that you described.


It's that merge (edit: GC is a better term) step that's difficult to get right. Google screwed this up badly with LevelDB, which had (still has?) horrible performance issues caused by compaction. Even with concurrent compaction it can be difficult due to needing additional disk space, adding additional read and write pressure to the storage subsystem and the effects that has on latency. I'm not sure what RethinkDB's approach was there, but I'm very curious to know.


> Even with concurrent compaction it can be difficult due to needing additional disk space, adding additional read and write pressure to the storage subsystem and the effects that has on latency. I'm not sure what RethinkDB's approach was there, but I'm very curious to know.

Yes, we ran into all these issues with the RethinkDB storage engine. Unfortunately I can't summarize the solution, because there are no silver bullets. It took a long time to perfect the engine, and there was an enormous amount of tuning work to get everything right.

For example, we have a "young blocks" subsystem that treats recently updated blocks differently (since, empirically, recently written blocks are dramatically more likely to be updated again, so we hold off on trying to collect them). How long should you wait? How many young blocks should you consider?

Working out solutions to these questions takes a lot of trial and error, and that's where the bulk of the work is (and that's just one subsystem!)
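To give a flavor of the kind of knob involved (hypothetical names and numbers, not our actual code): a compaction pass might simply refuse to consider blocks written within the last N seconds, and picking N is exactly the sort of thing that takes endless measurement.

    import time

    def compaction_candidates(blocks, young_window_secs=30.0, now=None):
        # skip "young" blocks: recently written blocks are likely to be rewritten
        # (and become garbage) soon, so collecting them now is mostly wasted work
        now = time.time() if now is None else now
        return [b for b in blocks
                if b["dead"] and now - b["written_at"] >= young_window_secs]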

I'd love to write about it in depth, I'll try to make it a priority.


How much similarity is there between the JVM's G1GC, CMS or other collectors and Rethink's compaction? It looks like the heuristics and trade-offs are very much the same. (Latency, hard space constraint, usage patterns.) Okay, you don't have to do pointer/object graph chasing, but queries and consistency and whatnot have to do something similar.


Remember -- this is an on-disk compactor, so it's not quite the same as collecting garbage in memory. There are other differences -- database dependencies are typically trees (or in the worst case DAGs, as there aren't any circular references). So our compactor can be much simpler (in fact, it's closer to a purely functional collector like Haskell's).

But overall it's very similar to a programming language GC. The devil, as usual, is in the details.


In another thread, you mentioned that "I don't think we'd use this design again". Curious to hear how you would design such a system differently, without using the LSM tree approach that's popular these days?


I don't think the LSM tree approach really panned out (in the sense that the benefits don't outweigh the drawbacks). You get much better insert and update throughput at the cost of significant production stalls. Most people don't need that level of insert throughput (and if they do, they can get it by scaling horizontally). Even if you need that throughput on a single box, most people aren't ok with long stalls. Facebook has been doing some work to minimize stalls in an LSM storage engine, but this is a significant engineering effort that only really makes sense for a few companies.

RethinkDB's storage engine uses a different architecture -- it gets you better insert/update performance on SSDs without stalls (but not as good a throughput as LSM-based engines), in exchange for significant engineering effort to make the engine bulletproof. Again, most of the time, people can get that by scaling horizontally.

I think that in 99% of cases a traditional storage engine approach works just fine. We all tried to reinvent the wheel, but ultimately it turned out to be a lot of work for fairly little benefit.


> I think that in 99% of cases a traditional storage engine approach works just fine. We all tried to reinvent the wheel, but ultimately it turned out to be a lot of work for fairly little benefit.

Please publish this in a paper or at least a blog article so I can properly quote you the next time a discussion on ACID comes up. :)


If you do, please ping me at dan.eloff @ populargooglemailservice.com, I don't want to miss reading that one!


Please do -- I've been reading the linkedin/confluent/samza writeups and thinking there's a lot of truth to their ideas. It'd be great to hear more on-the-ground experience from a different perspective.


Is this not how most 'modern' (read: '90s) relational DBs work?


It's similar -- they use MVCC, which means no in-place updates or deletes. Postgres then has a compaction step called vacuum to clean up old tuples. Redis is one of the few "databases" that truly uses in-place updates, but it has that luxury because it's single-threaded.


MVCC doesn't necessarily mean no in-place updates; it just means that you can distinguish between multiple versions. For example, Oracle:

- keep most recent version of all keys in B-tree

- store updates in undo log ("rollback segments")

- queries for older versions dynamically undo recent changes

http://docs.oracle.com/cd/B19306_01/server.102/b14220/consis...
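Roughly, as a toy sketch of the undo-log idea (invented names, nothing like Oracle's actual machinery): the table holds only the newest version, every in-place write pushes the value it overwrote onto an undo log, and a reader with an older snapshot walks that log backwards, undoing anything newer than its snapshot.

    table = {}      # key -> (value, version); only the latest version lives here
    undo_log = []   # (version, key, previous_entry_or_None), in version order

    def write(key, value, version):
        undo_log.append((version, key, table.get(key)))  # remember what we overwrite
        table[key] = (value, version)                    # in-place update

    def read(key, snapshot_version):
        entry = table.get(key)
        # roll back changes made after our snapshot, newest first
        for version, k, old in reversed(undo_log):
            if version <= snapshot_version:
                break
            if k == key:
                entry = old
        return entry[0] if entry else None

    write("a", 1, version=1)
    write("a", 2, version=2)
    print(read("a", snapshot_version=1))   # -> 1, even though "a" is now 2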


Good point, but do they seriously do that? It's stupid.

If you overwrite data in place that's being concurrently read, you get garbled data. So you must guarantee nobody is reading it. One way is to lock the data for both readers and writers using a mutex of some form. Another way is Linux-RCU style[1]. Both make readers always pay a price for what should be an uncommon case.

It makes more sense to me to put your updates in a new place, and if need be copy them over the old data once nobody can see the old data anymore.

[1] http://lwn.net/Articles/262464/
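In a garbage-collected language the "new place plus pointer swap" version is almost trivial to sketch (toy Python, leaning on reference assignment being atomic and on the GC to reclaim old versions -- a C implementation needs RCU-style grace periods for that last part):

    import threading

    class CowCell:
        # readers grab the current reference and keep using that snapshot;
        # a writer builds the new value off to the side and publishes it with
        # a single reference swap, so nobody ever sees a half-written update
        def __init__(self, value):
            self._value = value
            self._write_lock = threading.Lock()   # serializes writers only

        def read(self):
            return self._value                    # no lock on the read path

        def update(self, fn):
            with self._write_lock:
                self._value = fn(self._value)     # one assignment publishes it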


Isn't that just called "journaling", as opposed to "immutable"?


Journaling typically implies that there is a separate log of operations/changes, but the main data file (the B-tree) is still updated in place. You can then use the journal to roll back the changes if necessary.

RethinkDB's storage engine doesn't have a separate journal -- the main data file essentially is the journal, which is quite different from the traditional meaning of the word.
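To make the distinction concrete, here's a toy version of the traditional journaled scheme (invented names): the journal is a separate structure recording old and new values, and the main file is still mutated in place, with the journal used only to undo or replay. In an append-only engine like the one described above, the data file itself plays that role.

    journal = []   # separate write-ahead log of changes
    btree = {}     # stand-in for the main data file, updated in place

    def journaled_put(key, value):
        journal.append(("put", key, btree.get(key), value))   # old value, new value
        btree[key] = value                                     # in-place update

    def rollback_last():
        _op, key, old, _new = journal.pop()
        if old is None:
            del btree[key]
        else:
            btree[key] = old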



