The Log: Real-time data's unifying abstraction (linkedin.com)
271 points by boredandroid on Dec 16, 2013 | 17 comments



The idea of the log goes well beyond just real-time data (as the blog post describes, although the title does not). I think it might well turn out to be one of the core building blocks of _all_ stateful systems. Amazon's Kinesis has for the first time exposed a reliable Log-aaS on the cloud; I think we'll start to see more systems built around it.

My personal "24 commits for December" project is to build a set of open-source cloud data-stores, all backed by a distributed log using Raft, "blogging all the way". I'm half-way through, and I've implemented a simple key-value store, put a Redis front-end on it, used that to implement a Git server, and am currently working on building a document store with SQL querying. All with the same architecture: the log provides fault-tolerance and consistency, we have a data structure specific to the particular service (e.g. message queue or key-value store), we periodically take state snapshots so that we don't have to reply the whole log after every failure.

Feel free to follow along / provide feedback: http://blog.justinsb.com/blog/categories/cloudata/


Yeah totally agree. I focused on real-time data because offline data processing can often be less principled about its usage of time and doesn't need an explicit log (e.g. many ETL pipelines work this way).


Hi there - I see you're Mr Kafka :-) I wanted to use Kafka instead of Raft for my project, but for my application I couldn't tolerate the (even highly unlikely) possibility of losing a message (when we lose all the nodes in the In-Sync Replica set). I understand why this tradeoff is there (for the clickstream use-case), but I hope it might be possible to retrofit reliable writes to support other use cases as well. The more open-source reliable log services we have the better!


Yup, agreed. Wanna help? :-)


I'd love to help (I opened KAFKA-1050 on the issue), and I even started coding, but I got stuck on how to deal with the rollback problem when a write isn't acknowledged by a quorum. I think it requires a "real" distributed consensus protocol (unlike the current design), which is a fairly big change! It may be that the answer is to use Kafka's efficient storage design with Raft, which is something I explored a little on day 1 of my project: http://blog.justinsb.com/blog/2013/12/07/cloudata-day-1/
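
To make the sticking point concrete, here's a toy illustration (not Kafka or Raft code; the names are made up) of the gap between "appended on the leader" and "committed by a quorum":

    def commit_index(match_index, cluster_size):
        """Highest log index replicated on a majority of nodes."""
        acked = sorted(match_index.values(), reverse=True)
        majority = cluster_size // 2 + 1
        return acked[majority - 1]

    # 5 nodes; the leader has appended up to index 10 locally, but the
    # followers have only acknowledged up to 10, 8, 7 and 3 respectively.
    match_index = {"n1": 10, "n2": 10, "n3": 8, "n4": 7, "n5": 3}
    print(commit_index(match_index, 5))   # -> 8
    # Entries 9 and 10 are not committed; a new leader elected from a quorum
    # that never saw them may truncate (roll back) that suffix of the log.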


First, great article.

Coming across this article now was quite serendipitous, as I'm in the process of designing a Databus-like system myself. Regarding Kafka and Databus, one thing I've been wondering about is why the Databus relay isn't implemented with Kafka. Kafka seems to provide the same semantics (at-least-once, in-order delivery; clients can pull from arbitrary positions in the stream; etc.). I know Databus provides transaction semantics in the stream and a few other differences, but those differences don't seem too large. Is it because Databus is to be embedded in Espresso? Or maybe even something as simple as two different teams converging on the same solutions?
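
For what it's worth, the consumption model I have in mind is roughly this (a generic sketch, not the real Kafka or Databus client API):

    # The consumer owns its position and can rewind to any earlier offset;
    # re-reading already-seen entries is what gives at-least-once delivery.
    class LogConsumer:
        def __init__(self, log, start_offset=0):
            self.log = log
            self.offset = start_offset

        def poll(self, max_entries=100):
            entries = self.log.read(self.offset, max_entries)  # hypothetical log API
            self.offset += len(entries)
            return entries

        def seek(self, offset):
            self.offset = offset   # pull from an arbitrary position in the stream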

Anyway, thanks again for the article.


> Finally, we implemented another pipeline to load data into our key-value store for serving results. This mundane data copying ended up being one of the dominate items for the original development.

I'm always amazed by how difficult it is to simply get the data. Sounds so simple, but it's fraught with all kinds of issues in structure, timing, reliability, and scale -- and it's usually underestimated. Every BI project I've ever worked on was mostly spent simply getting the required data together into a single database. After that, it's a relative snap.

A bit off topic -- I'm guessing that healthcare.gov's biggest technical hurdles were similar. Simplistically put, it's a shopping comparison site and the UI/functionality is fairly trivial (which is why several HN comments suggested that a small team could build out the whole thing in days/weeks). But imagine if the data to support it was from 36 different legacy sources that were unreliable, poorly documented, and built/managed by completely different vendors. That's going to take up the majority of your time and frustration. Database-driven websites are easy if the data is already built for you.

Great article -- thanks for writing it.


Hmm.

>Event Sourcing. As far as I can tell this is basically the enterprise software engineer's way of saying "state machine replication". It's interesting that the same idea would be invented again in such a different context. Event sourcing seems to focus on smaller, in-memory use cases.

Fowler's article doesn't read like that at all if you ask me. He is talking about the general concept of using an event log as a storage mechanism, very similar to the OP's article. And if you look at the date, Martin's article is from 2005. Credit where credit is due.

In the Java world, the idea of event sourcing was made publicly known by the Prevayler project. These guys boldly suggested storing the current snapshot in RAM, yes, but had also built mechanisms for event logging, snapshotting and replay. The log was persisted on disk and replayed at startup.

The Prevayler guys were in fact not enterprise at all, quite the opposite. Their ideas caused quite a stir in the enterprise world.


A couple of years ago, Event Sourcing / CQRS was a really hot topic in the DDD enterprise world, especially within the .NET community:

http://msdn.microsoft.com/en-us/library/jj554200.aspx

http://codebetter.com/gregyoung/2010/02/13/cqrs-and-event-so...

http://blog.jonathanoliver.com/2011/05/why-i-still-love-cqrs...

http://vimeo.com/28457510

LinkedIn's approach sounds fairly similar in a lot of ways.


Great article. A highly relevant quote:

    The log is similar to the list of all credits and debits 
    and bank processes; a table is all the current account
    balances. If you have a log of changes, you can apply 
    these changes in order to create the table capturing the
    current state. This table will record the latest state 
    for each key (as of a particular log time). There is a
    sense in which the log is the more fundamental data
    structure: in addition to creating the original 
    table you can also transform it to create all kinds
    of derived tables.
Also, a good architecture diagram:

http://engineering.linkedin.com/sites/default/files/full-sta...
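
The credits-and-debits analogy is easy to make concrete (a toy example, not code from the article): folding the change log in order reproduces the table of current balances.

    # The log of credits and debits is primary; the balances table is derived.
    change_log = [
        {"account": "alice", "delta": +100},
        {"account": "bob",   "delta": +50},
        {"account": "alice", "delta": -30},
    ]

    balances = {}
    for entry in change_log:   # apply the changes in order
        balances[entry["account"]] = balances.get(entry["account"], 0) + entry["delta"]

    print(balances)   # {'alice': 70, 'bob': 50} -- the latest state for each key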

At Parse.ly, we just adopted Kafka widely in our backend to address exactly these data integration and real-time/historical analysis needs for our large-scale web analytics use case. Previously, we were using ZeroMQ, which is good, but Kafka is better for this use case.

We have always had a log-centric infrastructure, not born out of any understanding of theory, but simply of requirements. We knew that as a data analysis company, we needed to keep data as raw as possible in order to do derived analysis, and we knew that we needed to harden our data collection services and make it easy to prototype data aggregates atop them.

I also recently read the book by Nathan Marz (creator of Apache Storm), which proposes a similar "log-centric" architecture, though Marz calls it a "master dataset" and uses the fanciful term "Lambda Architecture". In his case, he describes how, atop a "timestamped set of facts" (essentially, a log), you can build any historical / real-time aggregates of your data via dedicated "batch" and "speed" layers. There is a lot of overlap of thinking in that book and in this article. It's great to see all the various threads of large-scale data analytics / integration coming together into a unified whole of similar theory and practice. Interestingly, I also recently discovered that Kafka + Storm are widely deployed at Outbrain, Loggly, and Twitter. That, together with LinkedIn running Kafka + Samza and AWS shipping a developer preview of Kinesis, suggests to me that real-time stream processing atop log architectures has gone mainstream.
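
The core idea is small enough to sketch (toy data, nothing taken from the book):

    from collections import Counter

    def pageviews(facts):
        """Count views per URL from (timestamp, url) facts."""
        return Counter(url for _, url in facts)

    # the immutable "master dataset" of timestamped facts, plus recent facts
    master_dataset = [(1, "/home"), (2, "/post/1"), (3, "/home")]
    recent_facts   = [(4, "/post/1"), (5, "/home")]

    batch_view = pageviews(master_dataset)   # recomputed periodically by the batch layer
    speed_view = pageviews(recent_facts)     # maintained in real time by the speed layer
    print(batch_view + speed_view)           # merged at query time: Counter({'/home': 3, '/post/1': 2})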


The link to Nathan Marz's book: http://www.manning.com/marz/

Still in early access as of Dec 17, 2013.


It makes me so happy to see such a clear picture of how service logs relate to the input and output token streams of finite state machines. We usually think of finite state machines as these little objects with a few states, but the category-theory version of them is an essential definition of what it means for a system to be deterministic and depend only on its current state and given inputs. It's a set of allowable input tokens (the input log), a set of allowable output tokens (the output log of actions taken by the system), an internal state Q, a transition function delta that decides the next state from the current state and input token, and an output function lambda that decides which output token to return given the current state and input.

By making the statement that logs are streams of tokens which have deterministic effect, this author is assuming that the services are finite state machines. This may not be the case if, for instance, they do not set random number generators to a known state. Any way in which a service depends on something other than its previous state and inputs violates this principle. If it meets this principle, then the logs are, by definition, taken from the strings of allowable input tokens X or the strings of allowable output tokens Y.

The debate in the article about at what level to log is a debate about which portion of the service to treat as an FSM. It boils down nicely.
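
In code the whole formalism is just a fold over the input log (a toy Mealy-machine sketch, nothing more):

    # State Q, transition function delta(q, x), output function lam(q, x).
    # Feeding the machine the input log deterministically reproduces both
    # the final state and the output log.
    def run(delta, lam, q0, input_log):
        q, output_log = q0, []
        for x in input_log:
            output_log.append(lam(q, x))   # output from current state + input
            q = delta(q, x)                # next state from current state + input
        return q, output_log

    # Toy example: a counter service that reports its value after each increment.
    delta = lambda q, x: q + x
    lam = lambda q, x: q + x
    print(run(delta, lam, 0, [1, 2, 3]))   # (6, [1, 3, 6])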

Oh, a dense but beautiful article on this is Machines in a Category by Arbib and Manes.


Logs should be studied in CS together with Turing machines - they are a vital component of today's architectures. I applaud the effort of clearly describing the role of logs in today's distributed architectures in a concise and easy-to-grasp way. Everyone studying database systems and distributed architectures should read this article. Thanks, Jay!

We too have arrived at using logs at Post.fm, however with a slightly different application: syncing email clients that can go offline with remote servers (similar to Exchange). Instead of the traditional approach taken by most web apps of calling remote APIs directly (the new-age remote procedure call, in effect), I believe the new client-server architectures for web apps will use logs to synchronise state. This is increasingly possible with the availability of local storage (Web SQL, IndexedDB, etc.).
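
In rough outline (hypothetical names, not our actual code), the client side looks something like this:

    # The client appends operations to a local log while offline and later
    # ships the suffix the server hasn't seen, instead of calling remote
    # APIs directly.
    class OfflineClient:
        def __init__(self):
            self.local_log = []    # persisted in IndexedDB / Web SQL in a real app
            self.synced_upto = 0   # how much of the log the server has acknowledged

        def do(self, op):
            self.local_log.append(op)   # apply locally first, sync later

        def sync(self, server):
            pending = self.local_log[self.synced_upto:]
            acked = server.append(pending)   # hypothetical server API: returns count accepted
            self.synced_upto += acked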

Another fascinating concept is acid-state (http://acid-state.seize.it), which "keeps a history of all the functions (along with their arguments) that have modified the state. Thus, recreating the state after an unforeseen error is as simple as rerunning the functions in the history log." The idea of a log being transparently generated at application run-time is very appealing. Function calls elegantly map to 'transactions' when modifying multiple rows this way.

Another interesting outcome of thinking about database systems as logs is that the tables are in effect read-only. You don't really "modify a row in a table", but rather add an entry to the log. At some point the database system updates the table to reflect the additions to the log (eventual consistency). If you make the database system wait for the log processing to complete before returning, you essentially get ACID.

Sometimes I wish there was a simpler, more transparent database system that made the log front and center, letting me specify if a SELECT requires the table to be updated with respect to the log or not. Current DBMSes seem to hide lots of functionality instead of providing a simple model that can be tweaked to a particular application.
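
Something like this toy model is what I have in mind (a hypothetical design, not any existing DBMS):

    # Writes only append to the log; a read can either take the possibly
    # stale table as-is, or force the pending log entries to be applied first.
    class LogFirstDB:
        def __init__(self):
            self.log = []
            self.table = {}
            self.applied = 0

        def write(self, key, value):
            self.log.append((key, value))   # append-only; the table is untouched

        def apply_log(self):
            for key, value in self.log[self.applied:]:
                self.table[key] = value
            self.applied = len(self.log)

        def select(self, key, consistent=False):
            if consistent:
                self.apply_log()   # wait for log processing: ACID-style read
            return self.table.get(key)   # otherwise an eventually consistent read

    db = LogFirstDB()
    db.write("k", "v1")
    print(db.select("k"))                    # None -- the table hasn't caught up yet
    print(db.select("k", consistent=True))   # 'v1' -- the log is applied before the read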


Just a coincidence I guess that logs will be denied to Linkedin's customers. From an email this week:

"... We'll be retiring the LinkedIn Network RSS Feed on Dec. 19th. All of your LinkedIn updates and content can still be viewed on LinkedIn, or through the LinkedIn mobile app. ..."


This is a fantastic and insightful article, and I'm sure the relatively few comments are a result of people's minds slowly bending to this new way of thinking :)

I particularly liked the list of related resources at the end. I have been looking through academic papers, open source, and also "enterprise integration" stuff, and it always strikes me how people re-invent the same things under different names.

One question though: What about access control and security? Everyone having the chance to subscribe to all data at a company is of course fantastic for product development and productivity. But as a company grows it will also become the case that not every system should potentially access all information.


> I suspect we will end up focusing more on the log as a commoditized building block irrespective of its implementation in the same way we often talk about a hash table without bothering to get in the details of whether we mean the murmur hash with linear probing or some other variant. The log will become something of a commoditized interface, with many algorithms and implementations competing to provide the best guarantees and optimal performance.

Very insightful. Thanks for the in-depth write-up.




