Runaway complexity in Big Data... and a plan to stop it (slideshare.net)
96 points by timf on Sept 25, 2012 | 61 comments



We invented relational databases (RDBMSes) to store large amounts of important data safely on extremely constrained storage hardware (kilobytes were expensive in the 1970s), where you could not economically store redundant copies due to astronomical cost.

Append-only immutable DBs are inherently better simply because they add a fourth dimension to your data (all history, all data, all the time), and they play to the fact that von Neumann machines love splitting things up into constrained sub-streams/problems and crunching through the entire data set in memory across clusters, using discrete memory frames/units of work (think Google search index/GPU framebuffers/integer arithmetic).

Storage is now effectively unlimited and random seeks are very expensive. The more you serialize your processing and split it into in-memory units that exploit cache locality, the faster you'll go. You turn performance problems into throughput problems - all you have to do is keep the data flowing.

RDBMSes are dead for the same reason I don't manage my YouTube bandwidth use or bother managing files or RAM - I have cable now, with terabyte hard disks and 32-64 GB of RAM.

The future is immutable: periodic/continuous historical stream compaction (pre-computation), queries running across hundreds of clusters, and searches reduced to essentially linear map-reduce (sometimes with b-trees/other speedups), plus a real-time dumb/inaccurate stream layer.

1 machine cannot store the Internet - a million machines can. 1 machine cannot process the Internet - but a million machines, each operating on one 64 GB slice of it held in RAM and using cache-friendly, memory-local, discrete operations, can run through the Internet millions of times a second.


Just a few facts that I have to deal with:

1) I can't afford a million machines or hundreds of clusters, or keeping petabytes of data around in a way that doesn't make the data completely useless. Yes, mutable state causes complexity, but I can't afford to rid myself of that problem by using brute force.

2) Performance is not replaceable by throughput if you have users waiting for answers to questions they just dreamt up a second ago and new data coming in all the time.

3) Cache locality and immutability don't go well together. Many indexes will always have to be updated, not just replaced wholesale with an entirely new version of the index.


1) Even with hard drive prices inflated today, we're at over 9 GB per $1 of storage. That means for $1 you can store over a billion 64-bit integers (see the back-of-the-envelope check at the end of this comment). From that perspective, there is very little data (at least in terms of traditional RDBMS data) that is actually cost-effective to throw out. To the extent that you do need to throw it out, you can wait until all transactions involving the data have long since completed.

2) I think you misunderstood what was meant here. His point was that your limiting factor really becomes throughput if you have the right architecture.

3) Indexes and cache locality also tend not to play well together. ;-)
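
A quick back-of-the-envelope check of point 1 (the exact figure depends on the $/GB you assume; the numbers below are just illustrative):

    # Back-of-the-envelope: 64-bit integers per dollar of disk,
    # assuming ~9 GB of raw storage per dollar (2012-era pricing).
    gb_per_dollar = 9
    bytes_per_dollar = gb_per_dollar * 10**9
    ints_per_dollar = bytes_per_dollar // 8   # 8 bytes per 64-bit integer
    print(ints_per_dollar)                    # 1,125,000,000 -- over a billion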


1) Obviously it's about tradeoffs - if you don't have the capital then operating with last gen environments is a perfectly reasonable decision. For others who do - it is not.

2) What kind of queries do you have that take that long? If it's scientific computing, there's no way around it. If it's just DB slicing/aggregating, then using last gen structures gets you last gen performance.

3) Cache locality and immutability do go together when you run repeatable compute units over local, in-memory data - continuous batched background updates to indices, with a real-time layer like the OP suggested, will become the new standard.


> if you don't have the capital then operating with last gen environments is a perfectly reasonable decision.

I am indeed operating in an environment of limited resources. It's called "The Real World". Reading marketing language like "last gen" makes me lose interest in a debate very quickly.


I also operate in this "Real World" and I'm not constrained by your limitations - i.e. large clusters, lots of data etc.

Just because you happen to be constrained by your resources it does not follow that your world is any more "Real" than mine - it's just different - which is what I said.

What part of "last gen" is marketing speak? Original Xbox is "last gen", the iPod is "last gen" - quite literally the last generation.


It is marketing speak because marketing people use these kinds of phrases to stop readers thinking about the merits of different choices they have, and instead focus their minds on the idea that old technologies are superseded by new ones. Other popular phrases are "legacy" or "conventional".

The original Xbox is no longer available in the stores. That's because it has been superseded by the new model, which does everything the old one did, only better! Poor me who just doesn't have "the capital" to get myself the shiny new one just yet.

Apparently, that's the way you want me to think about immutability versus mutability in data structures. Makes no sense.


You said that everyone else's use case is "dead".


"Conflating the storage of data with how it is queried" is a misinformed criticism of the relational model. Codd's Relational Algebra (Turing Award material) was in large part a move towards data independence, relaxing what was previously a tight coupling of storage format and application access patterns. Take a look at Rules 8 and 9, or just read the original paper: http://en.wikipedia.org/wiki/Codds_12_rules

> "Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed." http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

It's not clear from the slides how RDBMSs manage to conflate these two concerns in practice.

It's also unclear to me how the "Lambda Architecture" differs from what we've been calling [soft real-time] materialized view maintenance for decades.


It's not a criticism of the relational model. It's a criticism of relational databases as they exist in the world. The normalization vs. denormalization slides explain how these concerns are conflated. The fact that you have to denormalize your schema to optimize queries clearly shows that the two concerns are deeply complected.


Your ideas will work very well for small data, big storage - e.g. daily stock price histories.

They will not always work for big data, which often means "max out our storage with crap, don't worry, disk is cheap" - e.g. web metrics.

When a decision maker goes "don't worry disk is unlimited", the resulting application is prone to maxing out storage.

Whenever your application maxes out your storage, you have no space for previous versions of the data.


You realize he works at Twitter, right? And that he's responsible for Storm, and Cascalog?

I'm not going to say anything better than it was said in the slides / book draft, so I'll just encourage you to take these techniques seriously... they're born out of necessity, not because they sound like fun, and real people are using them to solve problems that are hell to solve any other way.

That said, these are not problems that everyone has. If you're not nodding your head along with the mutability / sharding / whatever complaints at the beginning of the deck, you can probably still get by with a more traditional architecture.

(Also, rereading... I should probably note that not everything needs to be kept forever; only the source data, since the views can be recomputed from them at any time. That makes things a bit cheaper.)


> problems that are hell to solve any other way.

'Bigness' of data != data size

'Bigness' of data == data size / budget

Twitter isn't a typical company. I assume they have both a budget and competent management that will let them get away with something like the Lambda architecture.

I reckon it's a lot harder to scale to even a terabyte under the constraints of a grubby setting like a data warehouse for some instrument-monitoring company.

Those guys will allow at best MS SQL for storage, and won't mind putting their developers through hell.


> They will not always work for big data, which often means "max out our storage with crap, don't worry disk is cheap" eg web metrics

Having worked on this kind of stuff myself, I'd have to argue the exact opposite. I've always ended up building something precisely like what is described when trying to tackle those kinds of problems at large scale.


> complected

This word, I don't think it means what you think it means.


Perhaps my DBA-fu is limited, but what databases are doing soft real-time materialized view maintenance? The big boys seem to do recomputation for materialized views rather than updating the materialized view on insert or update. I tried to google for this, but saw only research papers rather than implementations.

I know that in the places I've worked, materialized views were mostly kept to the application realm, because too many of them over enough data brought the DB to its knees.


Oracle updates views on update. Postgres allows you to write a trigger to do the same.
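
A minimal sketch of the trigger approach in Postgres, driven from Python with psycopg2. All the table, function, and trigger names here are made up for illustration, and a real version would need to handle concurrent upserts more carefully:

    # Keep a summary table current on every insert, instead of periodically
    # recomputing a materialized view. All names are illustrative.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS pageviews (url text, viewed_at timestamptz);
    CREATE TABLE IF NOT EXISTS pageview_counts (url text PRIMARY KEY, n bigint);

    CREATE OR REPLACE FUNCTION bump_pageview_count() RETURNS trigger AS $$
    BEGIN
      UPDATE pageview_counts SET n = n + 1 WHERE url = NEW.url;
      IF NOT FOUND THEN
        INSERT INTO pageview_counts (url, n) VALUES (NEW.url, 1);
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS maintain_counts ON pageviews;
    CREATE TRIGGER maintain_counts
      AFTER INSERT ON pageviews
      FOR EACH ROW EXECUTE PROCEDURE bump_pageview_count();
    """

    conn = psycopg2.connect("dbname=analytics")  # connection string is assumed
    cur = conn.cursor()
    cur.execute(DDL)
    conn.commit()
    cur.close()
    conn.close()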


Has anyone come up with a good explanation for why, given the recent proliferation of databases, nobody seems to be making anything more relational than the SQL DBMSes?

It should be easier than ever, particularly if you cheat and declare rotational media an unsupported legacy format.


more relational -> neo4j perhaps? not sure what you mean by rotational media


Rotational media = spinning disks


What is more relational?


As in Codd's 12 rules above.


I'm interested enough in the idea to pick up the (partially completed) book, but I wonder if this isn't far too complex a solution for 95% of cases. By the time your system is in the 5% of cases, you're looking at a massive rearchitecting effort.

Of course, to be fair, you're probably already looking at a rearchitecting effort at this point!

However, looking at the architecture diagram, much of the complexity could be hidden behind the scenes with a dedicated toolset built on top of Postgres, rather than trying to cobble together Kafka, Thrift, Hadoop, and Storm. Sometimes one big tool beats a lot of small tools.

For that matter, I wonder if a series of Postgres-XC (or Storm) servers couldn't do the same thing without learning a series of complex tools.

Step 1: Everything goes through a stored procedure. Deletes, updates, and creates are code-generated through a DSL. Migrations would be a pain under this system, but that's mitigated by the fact that the complexity is being handled by the tool.

This stored procedure writes to the database of record and then sends a series of updates to what is essentially a system of real-time materialized views that serve the same purpose as the standard NoSQL schema.

The purposes of the lambda architecture would still be fulfilled, with lower data-side complexity. After I read through Big Data I'll revisit this with a more nuanced view, but I wonder if Hadoop and NoSQL really give you anything.
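
A rough sketch of the write path being described, with the "stored procedure" played by a thin Python function for readability (the table names and psycopg2 wiring are assumptions, not anything from the deck):

    # One entry point per mutation: append to the system of record first,
    # then fan out to the denormalized real-time views that queries hit.
    def record_pageview(conn, user_id, url, ts):
        cur = conn.cursor()
        # 1. The authoritative, normalized log of what happened.
        cur.execute(
            "INSERT INTO pageview_log (user_id, url, viewed_at) VALUES (%s, %s, %s)",
            (user_id, url, ts))
        # 2. The real-time "materialized view" kept in step with the log.
        cur.execute("UPDATE pageview_counts SET n = n + 1 WHERE url = %s", (url,))
        if cur.rowcount == 0:
            cur.execute("INSERT INTO pageview_counts (url, n) VALUES (%s, 1)", (url,))
        conn.commit()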


I really like this. Has some overlap with Command-Query Responsibility Segregation, which I have found very useful: http://www.udidahan.com/2009/12/09/clarified-cqrs/


There was an interesting production-grade event store released recently which fits the described use-case: http://geteventstore.com

Basically, you store all of the data as events (event sourcing) and create separate query views (projections) which are populated from event streams.
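
A toy illustration of the pattern (not the Event Store API, just the shape of event sourcing plus a projection):

    # The append-only event log is the source of truth; query-side
    # "projections" are just folds over the event stream.
    events = []  # in reality: a durable, append-only event store

    def append(event):
        events.append(event)

    def cart_totals(event_stream):
        # Projection: running total per cart, rebuildable from scratch at any time.
        totals = {}
        for e in event_stream:
            if e["type"] == "item_added":
                totals[e["cart"]] = totals.get(e["cart"], 0) + e["price"]
            elif e["type"] == "item_removed":
                totals[e["cart"]] = totals.get(e["cart"], 0) - e["price"]
        return totals

    append({"type": "item_added", "cart": "c1", "price": 30})
    append({"type": "item_added", "cart": "c1", "price": 12})
    append({"type": "item_removed", "cart": "c1", "price": 30})
    print(cart_totals(events))  # {'c1': 12}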


Interestingly, I am just building a complete analytics setup for my e-commerce site, since Google Analytics doesn't give me everything I want to track. I looked at different things like statsd and Redis and whatnot, but realized that my data is not actually big. I will just put everything in Postgres and will be able to do that on cheap hardware even when I grow tenfold. Since revenue per pageview is high - as in multiple cents - there will never be an unsolvable performance issue.

tl;dr - the hand-me-down from the 90s can be good enough if you are not big yet.


Interestingly the chart at the end suggests implementations for everything except the raw data, but generating batch views from an arbitrary store and keeping your raw data reliably are hard problems. The charts motivate dumping this stuff in Hadoop (or some distributed file system), but any reliable store would do. (@nathanmarz: would love recommendations)


A distributed filesystem, such as HDFS or MapR, is ideal for the master dataset.


This is a reasonable architecture, but it introduces complexity of its own - you have to implement all your view creation logic in two places, which should go against every programmer's instincts. If you have a system capable of streaming all the events and writing out the reporting views, why not use it for everything?

At my previous^2 job, we had a distributed event store that captured all data and could give you all events (or events matching a very limited set of possible filters) either in a given time range, or streaming from a given time onwards. For any given view we'd have four instances of the database containing it, each populated by its own streaming inserter; if we discovered a bug in the view creation logic, we'd delete one database and re-run the (newly updated) inserter until it caught up, then repeat with the others. Queries automatically went to the most up-to-date view, so this was transparent to the query client - they'd simply see bugged data some of the time until all the views were rebuilt.

The events system guaranteed consistency at the cost of a bit of latency (generally <1 second in practice, good enough for all our query workloads); if an event source hard-crashed then all event streams would stop until it was manually failed (ops were alerted if events got out of date by more than a certain threshold). This could also happen if someone forgot to manually close the event stream after taking a machine out of service (but at least that only happened in office hours); hard-crashes were thankfully pretty rare. Rebuilding the full view after discovering a bug was obviously quite slow, but there's no way to avoid that (and again this was quite rare).

In use it was a very effective architecture; we handled consistency of the event stream in one place, and building the views in another. We only had to write the build-the-view code once, and we only had one view store and one event store to maintain. And we built it all on MySQL.


I like this idea, but my first impression is that it sounds like a lot of work to set up and scale. Am I wrong?

Secondarily, if this were a service, wow!


It does seem like a lot of work to me.

One of the problems is servicing two workloads, one for "transactional" processing and another for analysis.

For transactional systems you need to be able to change things quickly and consistently. For analysis you need to be able to query lots of data quickly.

For decades people have realized that these are two separate workloads, so have built 2 systems, a transactional system (on an RDBMS) and a data warehouse (generally on an RDBMS). Data is then shipped between the 2 in batch jobs.

The transactional system is normalized, and the data warehouse is normalized. Within the data warehouse you make denormalized copies of the data that fit the reporting workload required so that as much of the workload is pre-computed as possible.

The problem is that as reporting requirements change, you need to modify these pre-computed stores, as they are very heavily tuned for the particular reporting requirement. Building pre-computed stores is generally done in SQL and can be challenging, as you are shifting a lot of data and trusting the RDBMS to get its optimizations right.

There is a trend now to use Hadoop for the building of these pre-computed stores (and even to use Hadoop for the entire data warehouse). However, writing map-reduce jobs for queries is cumbersome compared to SQL so your productivity suffers. But you don't have to pay Oracle or IBM for licenses.

The key problem is that you need to pre-compute stuff to do fast aggregation, but you can't pre-compute everything. So what you pre-compute is dependent on what your users want, and that changes all the time.

So what you want is a system that lets you change what is pre-computed easily and efficiently.


The only novel part of this seems to be that you issue realtime queries over the batch view AND realtime view together; otherwise I don't see what else is interesting here. And I think that novelty is unneeded (and therefore only additional complexity).

Why don't you just:

1) Master your data in "normal form" (graph/entity-based data models like Datomic or Neo4j are perfect for this) in a scalable database (e.g., Cassandra)

2) Use Hadoop for offline analytics

3) Index all of the data needed for realtime queries in a search engine (e.g., elasticsearch), smartly partitioning the data for performance

As for "schema" changes -- you don't need them in an entity-based model because you can always be additive with the attributes in your schema. (You may wish to have a deprecation policy for some attributes but doing this allows you to as slowly as you'd like update any applications and batch jobs that depend on deprecated attributes.)


It all comes down to Query = Function(All data). The architecture I describe is a general approach for computing arbitrary functions on an arbitrary dataset (both in scale and content) in realtime. Any alternative to this approach must also be able to compute arbitrary functions on arbitrary data. Storing data in Cassandra while also indexing it in elasticsearch does not satisfy this, and it also fails to address the very serious complexity problems discussed in the first half of the presentation: mutability and the conflation of data with queries.

The Lambda Architecture is the only design I know of that:

1) Is based upon immutability at the core.

2) Cleanly separates how data is stored and modeled from how queries are satisfied.

3) Is general enough for computing arbitrary functions on arbitrary data (which encapsulates every possible data system)
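
To make the Query = Function(All data) claim above concrete, here's a toy sketch of how a query resolves against the two layers (the names are illustrative, and a real merge is per-view, not always a simple sum):

    # Batch layer: recomputed from scratch over the full immutable dataset.
    def batch_view(master_dataset):
        counts = {}
        for event in master_dataset:
            counts[event["url"]] = counts.get(event["url"], 0) + 1
        return counts

    # Speed layer: covers only events that arrived since the last batch run.
    def speed_view(recent_events):
        return batch_view(recent_events)  # same fold, tiny input

    # Query = Function(All data), answered as batch result + realtime delta.
    def pageviews(url, batch, realtime):
        return batch.get(url, 0) + realtime.get(url, 0)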


Okay, sure -- if your need is realtime Query = Function(All data) -- then your approach may be necessary. But this requirement (realtime Function(All data)) is a rarity, I would think. And in this rare case, perhaps, the complexity you're introducing is warranted.


I'm writing a distributed DBMS based on immutable data. I have a different way of providing fast realtime arbitrary querying of all data and it's much simpler. You just cache intermediate results.

For example, if you want to know the sum of the elements in a tree, but you've already calculated the sum of the elements of another tree that differs on only one path, then you won't need to do much work, you can use the precomputed work for most of the tree and just perform calculations for the parts that are different.
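
A toy version of that idea, assuming immutable trees with structural sharing (the cache key here is a stand-in for a content hash):

    # Cache intermediate results keyed by subtree identity: a new tree that
    # shares most of its structure with an old one only recomputes the
    # subtrees along the changed path.
    cache = {}

    def tree_sum(node):
        if isinstance(node, int):          # leaf
            return node
        key = id(node)                     # stand-in for a content hash
        if key not in cache:
            cache[key] = sum(tree_sum(child) for child in node)
        return cache[key]

    shared = (1, 2, 3)
    old_tree = (shared, (4, 5))
    new_tree = (shared, (4, 6))            # differs on only one path
    print(tree_sum(old_tree))              # 15 -- fills the cache
    print(tree_sum(new_tree))              # 16 -- reuses the cached sum of shared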


In fact, my approach is more general than yours because it isn't restricted to catamorphisms.


Afraid not. Nothing is more general than function(all data).


You rely on mapreduce, so you don't have the ability to calculate any function. You can only calculate catamorphisms of your data (http://en.wikipedia.org/wiki/Catamorphism). You can't solve a SAT problem effectively using mapreduce over the tree of the expression, for example. I can't think of a case where you'd want to calculate a function that isn't a catamorphism, but saying that you aren't restricted to them is wrong.


Is "median" expressible as a catamorphism? What about a neural network propagation?


First, let me explain what I mean by catamorphism. I mean a function that operates on the data within a node of a tree and the values returned by performing the catamorphism on its child nodes.

To find the median, you need the data sorted. You can express something like mergesort as a catamorphism. You take the lists of the child nodes and the singleton list of the value at the node and merge them together. However you have to do a linear amount of work at each node and store a linear amount of data, so it's not really within the spirit of mapreduce because it doesn't scale well, although it is possible.
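
In code, that definition is roughly a generic tree fold, shown here with the sum and the merge-sort-style catamorphism as the two combine functions (a sketch, not anyone's production code):

    import heapq

    # A catamorphism over a tree: combine a node's value with the results of
    # folding its children. Only the combine function varies.
    def cata(tree, combine):
        value, children = tree
        return combine(value, [cata(c, combine) for c in children])

    t = (3, [(1, []), (5, [(2, [])])])     # (value, [children])

    total = cata(t, lambda v, kids: v + sum(kids))                    # 11

    # The merge-sort flavour: merge the child lists with the singleton list
    # of the node's value -- note the linear work and storage per node.
    ordered = cata(t, lambda v, kids: list(heapq.merge([v], *kids)))  # [1, 2, 3, 5]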

Catamorphisms are operations on trees, so the first problem with neural networks is that you'd have to embed your neural network in a tree. In the worst possible case, a fully connected network, to find the next value of each neuron, you'll have to find the current value of every other neuron. The problem is that you don't have access to "cousin nodes'" data, so it isn't possible.


Datomic satisfies these three points.


The interesting part about timestamping changes to records is essentially "effective dating", and it is bread and butter in any transactional system that needs to record employee info. This is done to varying degrees, to individual column values or to entire rows. In an RDBMS, you create a compound key so you can know both the history and the current record. I guess the newfangled NoSQL will make it easier to store..?
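
For anyone who hasn't run into it, effective dating looks roughly like this (column names are made up; in SQL the compound key would be (employee_id, effective_from)):

    # Each change is a new row keyed by (employee_id, effective_from); the
    # "current" record is simply the latest one effective on a given date.
    rows = [
        {"employee_id": 7, "effective_from": "2011-01-01", "salary": 50000},
        {"employee_id": 7, "effective_from": "2012-06-01", "salary": 55000},
    ]

    def salary_as_of(employee_id, date):
        history = [r for r in rows
                   if r["employee_id"] == employee_id and r["effective_from"] <= date]
        return max(history, key=lambda r: r["effective_from"])["salary"]

    print(salary_as_of(7, "2012-01-01"))  # 50000
    print(salary_as_of(7, "2012-12-31"))  # 55000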


I don't have a critique for the content yet, and HN is usually the place to post this kind of stuff so here goes.

My initial reaction is to close the window immediately due to the font on the slides. I don't know if that is the fault of my browser (chrome 21.x on this PC), slideshare, or the presenter who put together the slides, but the staggered characters drive me bonkers.


My bet is on Chrome's sometimes clumsy font rendering. Go fullscreen on the presentation and the characters will render fine.

Edit: scratch that - each letter shows up as a separate span in the generated HTML. There's no sane way to typeset that.


It would be useful if there were a speaker narrative; it would be better still if there were a white paper that this presentation was just summarizing.

What I got out of it was "Just store the raw data", always compute results/queries, and BigTable and Map/Reduce are cool. I felt like I missed something - perhaps someone here can help.


The first chapter of my book discusses a lot of what's in the presentation: http://manning.com/marz/BD_meap_ch01.pdf


Out of curiosity, why is it called the lambda architecture?


Lambda is the standard symbol used to represent functions (coming from the lambda calculus). Since the purpose of this architecture is to compute arbitrary functions on arbitrary data, and since the internals of the architecture make heavy use of pure functions (batch view is a pure function of all data, and queries are pure functions of the precomputed views), "Lambda Architecture" seemed like an appropriate name.


Awesome, thanks for the link. So your target audience is SQL DBAs?


I'm not sure how that follows. Reading the linked chapter and the slideshow suggests that the target is anyone who is using a data storage system whose core data is mutable. A significant portion of the problems discussed apply equally to NoSQL, SQL, and NewSQL databases.

Basically, the 'lambda architecture' he refers to is event sourcing, or write-ahead logging, but with scalability in mind, and some cool hooks for maintaining correctness.

You use your HDFS store as your event log, plus a couple of layers that turn the batch processing (map-reduce jobs) into real-time queryable databases, along with disposable caches that can be updated in real time to handle the stuff that arrives between batch jobs.

The goal is to never lose the raw actions - so even updates to various layers (including the batch processing!) don't result in data corruption, just some time to reprocess all the raw inputs again.
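
A minimal stand-in for that idea, with a newline-delimited JSON file playing the role of the HDFS event log (everything here is illustrative):

    import json

    # The raw actions are only ever appended, never updated in place. Views
    # are disposable: if their logic changes, replay this file and rebuild.
    def append_event(path, event):
        with open(path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def replay(path):
        with open(path) as f:
            for line in f:
                yield json.loads(line)

    append_event("events.log", {"type": "follow", "who": "alice", "whom": "bob"})

    follower_counts = {}
    for e in replay("events.log"):
        if e["type"] == "follow":
            follower_counts[e["whom"]] = follower_counts.get(e["whom"], 0) + 1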


I think the book is a useful reference; it seemed to go over a lot of things that folks who are dealing with large data sets today are already familiar with. If, however, you were a DBA on a typical DBMS or RDBMS system and were told to develop a "Big Data" system, I could see how many of the things that Marz points out would trip you up. So I was asking if that was the target market for the book.

As to the content, see the papers on data flow architectures from the 70's and 80's [1]. They are very cool. We've done something similar at Blekko, where we store raw data in a table structure and build in pre-computed results with combinators [2]. The Map/Reduce paper [3] is an excellent introduction to a number of these concepts. This is all good stuff and something that is helpful for people to have in their toolboxes. The title of the post gave me the impression that there was something new here (I'm always on the prowl for new stuff on these problems), and I didn't see what the new stuff was; it seemed like the stuff we already know, just presented more coherently rather than as a collection of links. Perhaps that is more clear, perhaps not.

[1] https://en.wikipedia.org/wiki/Dataflow

[2] http://highscalability.com/blog/2012/4/25/the-anatomy-of-sea...

[3] http://research.google.com/archive/mapreduce-osdi04.pdf


I'm not sure how to square the normalized schema with the immutability; normalization implies schema changes which imply data changes, no?

In particular, what do I do when there is erroneous historical data that violates the new schema (the newly discovered constraint that was there all along)?


Essentially they end up rebuilding the "transaction log" that most RDBMSes use to commit data safely.

Before any data is written to the "data" files of the database, it is written to a "log" file sequentially. Once that has succeeded, it will write to the data file, then write to the log file again to say that the write was successful.

That way if the database breaks during the write, the system knows about it. It's also useful because it allows you to restore the database to a point in time, as you have a full log of all the transactions run (and you know how to un-run them).

In this case, they just make the transaction log a little more accessible.

For your example, erroneous historical data would remain in the system but with a new record being added indicating that at a certain date and time the old record was replaced with the new one.

Basically, think of a store processing a purchase of $100 and a refund of $100 as 2 transactions (for $100 and -$100), rather than simply deleting the first transaction.

This is important because things may have happened between the deposit of money and its refunding (e.g. interest payments).
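
The purchase/refund example as a toy append-only ledger (a sketch, with integer timestamps for brevity):

    # Corrections are new records, never edits, so the balance at any past
    # point in time (say, when interest accrued) can still be reconstructed.
    ledger = []  # (timestamp, amount) pairs, only ever appended

    def post(ts, amount):
        ledger.append((ts, amount))

    def balance_at(ts):
        return sum(amount for t, amount in ledger if t <= ts)

    post(1, 100)     # the $100 purchase
    post(30, -100)   # the refund, recorded as a second transaction
    print(balance_at(15))  # 100 -- the balance the interest was computed on
    print(balance_at(45))  # 0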


You have options. You can apply the schema restriction only to new inserts (since you have full control over where it's applied - the "all data" store is generally not an SQL DB). You can "retcon" the old data (and knowing when to violate immutability is what separates the good from the great). You can make a new "table", and put the logic for dealing with the difference between old and new formats in the batch processor.


Does anyone know what his NewSQL critique was? (The "“NewSQL” is misguided" slide. BTW, I've read the released chapters of his _Big Data_ book, so I know a fair bit of the slides' context.)


NewSQL inherits mutability and the conflation of data and queries. It's no longer 1980, and we now have the capability of cheaply storing massive amounts of data. So we should use that capability to build better systems that fix the complexity problems of the past.


... but it is no longer 1980, and we have massively more data.


> MapReduce is a framework for computing arbitrary functions on arbitrary data.

No, MapReduce is a framework for computing catamorphisms over trees.


I think he meant catamorphisms of arbitrary functions over trees of arbitrary data. As in, you can apply any function at the leaves, and the leaves can be any type.


This is directly contradicted in the first chapter of his book:

> The Lambda Architecture, which we will be introducing later in this chapter, provides a general purpose approach to implementing an arbitrary function on an arbitrary dataset and having the function return its results with low latency.


The OP is an engineer at Twitter... doesn't Twitter use NoSQL to store tweets? I know your critique of NoSQL is that it is an overreaction to the pain of schemas, so I'd like to see some discussion of how to implement lambda with a data architecture that, at some point, uses document databases.

Thanks for the post... SlideShare is annoying to read (on an iPad) but the content was digestible. Mostly, it helped remind me to download the updated version of the OP's book.



