LogDevice: a distributed data store for logs

mleonard · on Aug 31, 2017

Would someone from Facebook be able to comment on LogDevice versus (Twitter's) Apache Bookkeeper/DistributedLog??

The two sound very similar. I've been studying Bookkeeper this week to understand how it works and was excited to see the blogpost about LogDevice.

My understanding isn't fully there yet, but BookKeeper seems to have a more involved protocol versus LogDevice: ie it has fencing to ensure that only one writer is writing to a log at a time a 2-phase-commit-like protocol, and opening/closing of ledgers.

To me LogDevice's sequencer and epoch number sounds much simpler. Does BookKeeper achieve some more consistency or other guarantees that LogDevice doesn't by having a more involved protocol? How do the two compare in terms of goals/pros/cons/tradeoffs/usecases/etc?

(From reading the blog post it seems to me that in LogDevice an old sequencer could still be writing to LogDevice when a new one with a higher epoch number is also writing. As I understand it BookKeeper uses a CAS operation on metadata version number (like the epoch number of LogDevice) AND fencing on the storage nodes to make sure that only a single writer is writing at a time).

martincmartin · on Sept 1, 2017

I'm on the LogDevice team at Facebook. Glad you're excited to hear about LogDevice!

I'm not familiar with Bookkeeper, so can't comment on that. I do know that, compared to most other systems, LogDevice emphasizes high write availability. So even if we're still sorting out the details of which records made it at the end of one epoch, we'll still take writes for future epochs. We just won't release them to readers until we've made enough copies of earlier records.

We do our best to ensure that only one sequencer is running at a time. We use Zookeeper to store information but the current epoch & sequencer, and whenever a new sequencer starts, it needs to talk to Zookeeper. So that helps with races where several machines want to become the sequencer for a single log at the same time.

Also, when clients are looking for the sequencer for a given log, they all try the same set of machines in the same order. Essentially there's a deterministic shuffling of the list of servers, seeded with the log id. So all clients will try to talk to the same server first, and only if that server is down, or learns that another sequencer is active, will the client try a different server.

Hope that helps!

kthielen · on Sept 1, 2017

This approach to processing real-time and historical data works very well. We use a similar approach at Morgan Stanley and have a PL/compiler for real-time index construction and query processing on top (https://github.com/Morgan-Stanley/hobbes).

In addition to the general model of log writing + reactive processing, I've also found that we spend a lot of time focused on getting the _type structure_ of our data right (which allows for flexible/generic downstream processes that can also be efficient and type-checked against the structure of the data ahead of time). I wonder if/how this aspect of the problem has been considered by the Facebook team.

jbit84 · on Aug 31, 2017

Reading the blog its not clear to me how they deal with gaps in the LSN sequence. The scalability & performance properties derive from 1) Using a separate sequencer that issues increasing sequence numbers, 2) Uncoordinated distributed writes of actual record value to storage nodes, 3) reconstitution of ordered log at consumer side.

How does a consumer that have retrieved N and N+2 know if N+1 is not yet written, or if it failed and will never be written? Perhaps they write-through the sequencer with subsequent writes waiting on acknowledgements, so 'gaps' only occur on epoch changes?

martincmartin · on Sept 1, 2017

Hi, I work on the LogDevice team at Facebook.

Within an epoch, the sequencer is a single process on a single machine and gives out LSNs sequentially. When a sequencer dies, and a new one is started, its first job is to fix up the end of the last epoch. If it can't find any copies of a given record, it inserts a "hole plug," to store the fact that the record is lost. So, except for hole plugs (which should be very rare), the only gaps are between epochs as you say.

Suppose you have 10 LogDevice servers, and you store 3 copies of every record. Then the client will have connections to all 10 machines, and each machine will push whatever records it has to the client. Crucially, the servers always push records in order. So if a client gets record N from machine 7, then gets record N+2, it can be sure that machine 7 doesn't have a copy of record N+1.

Once you've got record N+2 or higher from at least 8 machines, without getting record N+1 from any of them, you can be sure that at least one copy of the record has been lost. If those 8 servers have complete information, you can report to the user that the record has been lost.

jbit84 · on Sept 1, 2017

Hi, thank you for the answer.

Can you comment on how writes to the LogDevice servers are performed - particularly if the writes have to go through the sequencer, or if the individual producers can write directly after obtaining a LSN?

Since a given LogDevice server will only receive a non-deterministic sub-sequence of records, I would think it has to receive its set of writes in order? (To have enough information to push to clients in order). Unless there is actually some mechanism by which it can determine the elements it should receive.

Trying to understand if your per-log throughput will be capped by the max traffic that can be pushed through a single sequencer, or the round-trip latency for waiting for acknowledged writes from LogDevice servers.

martincmartin · on Sept 1, 2017

Trying to understand if your per-log throughput will be capped by the max traffic that can be pushed through a single sequencer,

Yes. You can think of an individual log like a database shard.

We have plans to allow the sequencer to just give out sequence numbers, and allow the clients to send the data directly to the storage nodes. But its not a high priority for us, since it's just a constant factor improvement. (Although a large one!) Users would still need to shard their data, although not as much.

joekrill · on Sept 1, 2017

I wish someone would explain why the licensing issue is not a pertinent discussion here And why those comments have been removed. In any other case that would be a perfectly reasonable point to consider.

onewaystreet · on Sept 1, 2017

Because there is no licensing issue because this project hasn't been released. This blog post is a preannouncement. When it's released and its known what license it uses then it will become a pertinent discussion.

joekrill · on Sept 1, 2017

That line of reasoning could be applied to pretty much every single comment in this thread. Literally. Pick any comment and you could simply respond with: "Well it's not released yet so it's not pertinent".

The very fact that the announcement includes "the ultimate goal of contributing it to the open source community later in 2017" makes the licensing issue pertinent, particularly given Facebook's licensing history.

justinjlynn · on Sept 1, 2017

Well, the underlying database RocksDB appears to have been relicensed under the Apache 2.0 license -- we'll have to see what Facebook chooses to do with this project when it's released. No sense getting the pitchforks out yet.

ccleve · on Sept 1, 2017

A couple questions:

1. Do the log records pass through the sequencer nodes on the way to their destination, or does the writer issue a request for the sequence number and wait for a response?

2. If you're maintaining the current epoch number (and sequencer location) in Zookeeper, and a sequencer starts going up and down, you're definitely going to have problems, because nodes will get potentially stale epoch/location data from Zookeeper. Unless, that is, you have Zookeeper run a full consensus algorithm for every single query for every single record you write, which would kill performance.

So as long as you're using single sequencer nodes, and dealing with replication, and doing some kind of leader election to pick a new sequencer node when one goes down, why not just run a normal consensus algorithm (like Raft) for storing records? The Raft leader node serves as the sequencer, it takes care of replication, it tolerates failures, it will be no slower than what you're doing now, and it will be way more reliable.

polskibus · on Aug 31, 2017

How does it differ from Apache Kafka?

prohor · on Aug 31, 2017

For me the difference seems to be in handling large number of distinct logs. In Kafka every log & partition is a separate file and moreover it keeps it open. So, storing multiple logs results in writing to many files so eventually random write IO; and also you may hit limits of open files. You can multiplex logical logs in each Kafka log, but then you read unnecessarily other logs.

Keeping SS tables makes it more sequential write and reasonably sequential write, as long as you have enough RAM to get multiple records of each log, so they constitute a continues blocks in flashed file.

Actually you could get very similar result using Cassandra, which also uses SS tables. The difference is that Cassandra keeps merging files, which actually makes much more IO traffic than clients. Cassandra will typically need 16x more IO for merging then actual data write rate. You can limit it a bit if you create time shard tables.

qaq · on Aug 31, 2017

written in C++ :)

mleonard · on Aug 31, 2017

>> We ensure that only one copy of every record is read from disk and delivered over the network by including the copy set in the header of every record copy. A simple server-side filtering scheme based on copy sets coupled with a dense copy set index guarantees that in steady state only one node in the copy set would read and delivery a copy of the record to a particular reader.

Can someone explain/expand on the above please? I've read the article a couple of times and tried to understand the above paragraph in context but I don't get it.

adriencrth · on Sept 1, 2017

Adrien here from the LogDevice team.

A copy set is the list of storage node ids that the sequencer chose as recipients for a record. This piece of metadata is stored on storage nodes alongside record payloads in an index that maps the sequence number to copy set mapping.

Storage nodes consult the index to filter out payloads they shall not send to readers. The filtering logic consists in sending the copy if the storage node sees itself as the first recipient in the copy set after it's been shuffled using the client id as a seed.

This results in readers receiving exactly one copy of the payload.

adrianratnapala · on Aug 31, 2017

They've neglected to define the term "copy set". I assumed it was some kind of record ID (although I don't get why that should be distinct from the Log Sequence Number).

martincmartin · on Sept 1, 2017

I'm on the LogDevice team at Facebook. The "copy set" is the subset of server machines that a given record is stored on. For example, if you have a cluster with 10 machines numbered 0 through 9, then the copy set for record 1 might be 3, 6, 8; for record 2 might be 0, 2, 5, etc. Copy set can change with each record. The node set (the list of 10 machines) is fixed for the length of an epoch.

captainpete · on Sept 1, 2017

Similarities to CORFU / Tango design (see Zlog post: http://noahdesu.github.io/2014/10/26/corfu-on-ceph.html).

Edit: This is a good thing

eternalban · on Aug 31, 2017

~OT but it would be excellent to hear HN feedback on https://bookkeeper.apache.org/ and Twitter's DistributedLog and Manhattan projects.

borlum · on Aug 31, 2017

This looks similar to Humio

politician · on Aug 31, 2017

At this stage, the only thing I care about is which license they are going to apply to this project. That'll be enough for me to know if I should bother investing any further time here.

unkown-unknowns · on Aug 31, 2017

Exactly. Specifically, whether it'll include the usual patent clause. I am sure it will and as such it will be of no interest to me.

walterbell · on Sept 1, 2017

Are there Facebook OSS projects without the patent clause?

shock · on Aug 31, 2017

> We continue to iterate on LogDevice with the ultimate goal of contributing it to the open source community later in 2017.

Frankly, if it's the same license as React (BSD + PATENTS), I'm not interested.

Edit: Here come the facebook fanboys with the downvotes.

sctb · on Aug 31, 2017

We've had many discussions about licensing issues, but this isn't one of them. Please don't post off topic and then complain about downvotes—it breaks the guidelines.

https://news.ycombinator.com/newsguidelines.html

kordless · on Aug 31, 2017

> we've had many

> don't post off topic

Users of HN, please note "we" is Facebook and Facebook, or a representative of Facebook, is telling us what we can and can't discuss here.

This is why I quit posting here and use /r/hackernews on Reddit for my links. However, I occasionally make an appearance when I see patterns emerge, such as control where we didn't assume control existed.

grzm · on Aug 31, 2017

'sctb is an HN mod and does not work for Facebook nor is a representative of Facebook.

shock · on Aug 31, 2017

I never imagined that the project's license would be off-topic. To me a project's license is very material, a deciding factor if I'm using it or not. That's what I was trying to express.

sctb · on Aug 31, 2017

The project under discussion doesn't have a license.

shock · on Aug 31, 2017

I'm assuming you mean it doesn't have a license because it's not released. By that same logic we shouldn't be discussing the project because it doesn't exist.

shock · on Aug 31, 2017

> We've had many discussions about licensing issues, but this isn't one of them.

Why do you decide what we discuss? Since you squashed my comment about licensing, another one popped up. It means people do think licensing is important to discuss.

shock · on Aug 31, 2017

I want to elaborate a bit why I dislike Facebook's BSD + PATENTS license:

I believe Facebook was built, in part, using software under MIT, BSD, Apache, etc – I see it as immoral when they turn around and release OSS software under BSD + PATENTS.

BSD + PATENTS sounds like a great idea in theory – if every company released under BSD + PATENTS it would essentially render the patent system toothless, except for the patent trolls who would still be free to sue whomever they want. In practice, however, it would be hard for BSD + PATENTS to have widespread adoption because it would be seen as a loss of value to shareholders, especially as for some companies their patent portfolio is an important part of their valuation.

IANAL.

TAForObvReasons · on Aug 31, 2017

What you are describing is essentially "ladder kicking", which was the top comment from the discussion a few months ago: https://news.ycombinator.com/item?id=14779881

> Having known so many people involved with Facebook for so long, I have come up with a phrase to describe the cultural phenomenon I’ve witnessed among them – ladder kicking. Basically, people who get a leg up from others, and then do everything in their power to ensure nobody else manages to get there. No, it’s not “human nature” or “how it works.” Silicon Valley and the tech industry at large weren’t built by these sorts of people, and we need to be more active in preventing this mind-virus from spreading.

shock · on Aug 31, 2017

Thank you for linking the other discussion, it is much more eloquent than my own comment. The bit that worries me is: They’ve built their entire business on the back of open source software that wasn’t ever encumbered with the sort of nonsense they’ve attached to their own projects. And this industry is just going to let them have it, because the stuff they are putting out is shiny and “convenient” and free?

Are we really allowing them to do this for the sake of convenience?

jasonwatkinspdx · on Aug 31, 2017

I'm not a facebook fanboy, but I did vote this down because it's a derailment based on an assumption. This isn't released yet and no one knows the terms.

Atop that, I'm just getting really tired of seeing the patent issue trotted out any and every time facebook or react are mentioned here.

qaq · on Aug 31, 2017

In this case since main alternatives are under licenses that come with less restrictive patent grants this is a very valid criticism. In case of React not so much.

amelius · on Aug 31, 2017

Frankly, if this comes from a company that treats its users as products, I'm not interested no matter what the license says.

shock · on Aug 31, 2017

These days I get the impression that there are fewer and fewer companies that don't see users as products. I hope I'm wrong.