Let's Build a Modern Hadoop

jandrewrogers · on Feb 10, 2015

The article makes many good points but also misses on a few in my opinion. Hadoop was designed to solve a pretty narrow problem a long time ago; modern use cases tend to be far outside its intended design envelope.

Being more explicitly Linux-centric and dropping the JVM paves the way for some major performance optimizations. In practice, virtually everyone uses Linux for these kinds of applications so portability is not a significant concern. Efficiency in the data center is becoming a major concern, both in terms of CapEx and OpEx, and the basic Hadoop stack can be exceedingly suboptimal in this regard. Free software is not cost effective if you require 10-100x the hardware (realistic) of a more optimal implementation.

However, I would use a different architecture generally rather than reimplementing the Hadoop one. The Achille's Heel of Hadoop (and Spark) is its modularity. Throughput for database-like workloads, which matches a lot of modern big data applications, is completely dependent on tightly scheduling execution, network, and storage operations without any hand-offs to other subsystems. If the architecture requires network and storage I/O to be scheduled by other processes, performance falls off a cliff for well-understood reasons (see also: mmap()-ed databases). People know how to design I/O intensive data processing engines, it was just outside the original architectural case for Hadoop. We can do better.

A properly designed server kernel running on cheap, modern hardware and a 10 GbE network can do the following concurrently on the same data model:

- true stream/event processing on the ingest path at wire speed (none of that "fast batch" nonsense)

- drive that ingest stream all the way through indexing and disk storage with the data fully online

- execute hundreds of concurrent ad hoc queries on that live data model

These are not intrinsically separate systems, it is just that we've traditionally designed data platforms to be single purpose. However, it does require a single scheduler across the complete set of operations required in order to deliver that workload.

From a software engineering standpoint it is inconvenient that good database kernels are essentially monolithic and incredibly dense but it is unavoidable if performance matters.

jdoliner · on Feb 10, 2015

Hi. Co-founder and lead Pachyderm dev here. This is a really interesting comment and raises some questions that I haven't fully considered. Building the infrastructure in this decoupled way has let us move quickly early on and made it really easy to scale out horizontally. We haven't paid as much attention to what you talk about here. However I think there might be some optimizations we can make within our current design that would help a lot. Probably the biggest source of slowness right now is that we post files in to jobs via HTTP, this is nice because it gives you a simple API for implementing new jobs but it's reaaally slow. A much more performant solution would be to give the data directly to the job as a shared volume (a feature Docker offers) this would lead to very performant i/o because the data could be read directly off disk by the container that's doing the processing.

hedgehog · on Feb 11, 2015

Andrew knows this stuff better than I do but one way to start thinking about this is to look at what the sources of latency are, where your data gets copied, and what controls allocation of storage (disk/RAM/etc). For example mmaped disk IO cedes control of the lifetime of your data in RAM to the OS's paging code. JSON over HTTP over TCP is going to make a whole pile of copies of your data as the TCP payloads get reassembled, copied into user space, then probably again as that gets deserialized (and oh the allocations in most JSON decoders). As for latency you're going to have some context switches in there which is not helpful. One way you might be able to improve performance is to use shared memory to get data in and out of the workers and process it in-place as much as possible. A big ring buffer and one of the fancy in-place serialization formats (capnproto for example) could actually make it pretty pleasant to write clients in a variety of languages.

jdoliner · on Feb 11, 2015

Thanks for the tips :D. These all seem like very likely places to look for low hanging fruit. We were actually early employees at RethinkDB so we've been around the block with low level optimizations like this. I'm really looking forward to get Pachyderm in a benchmark environment and tearing it to shreds like we did at Rethink.

angersock · on Feb 11, 2015

So, I'm not so sure that building a product that mandates both btrfs and Docker really counts as "decoupled".

jdoliner · on Feb 11, 2015

Hi, jd @ Pachyderm here.

Long term we don't want to mandate either of these technologies. We'd like to offer users a variety of job formats (docker and rocket come to mind more domain specific things like SQL are interesting as well) and a variety of storage options (zfs, overlayfs, aufs and in-memory to name a few). However we had to be pragmatic early on and pick what we thought were the highest leverage implementations so that we could ship something that worked.

We'll certainly be looking in to getting rid of this mandate in the near future, we like giving people a variety of choices.

SEJeff · on Feb 11, 2015

A very performant way would be native RDMA verbs access over an Infiniband fabric, 40g to 56g throughput. At that rate remote storage is faster latency wise than local SATA. Many many HPC shops operate in this fashion and the 100g Ethernet around the corner only solidifies this.

shmerl · on Feb 11, 2015

What about other options like HPCC?

frowaway001 · on Feb 11, 2015

I hate Hadoop with a passion, but this reads like written by people even more clueless than Hadoop developers.

> Contrast this with Pachyderm MapReduce (PMR) in which a job is an HTTP server inside a Docker container (a microservice). You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. Data is POSTed to the container over HTTP and the results are stored back in the file system.

Now I'm not even sure anymore whether they are trying to troll people or if they are that out of touch with reality.

- No focus on trying to take on the real issues, but working around a perceived problem which would have been a non-issue after 5 minutes of research? Check.

- Trying to improve things by looking at bad ideas which weren't even state-of-the-art ten years ago? Check.

- Being completely clueless about the real issues in distributed data processing? Check.

- Whole thing reads like some giant buzzword-heavy bullshit bingo? Check.

- Some Silicon-Valley-kids-turned-JavaScript-HTML-"programmers"-turned-NodeJS-developers-turned-Golang-experts thinking that they can build a Git-like distributed filesystem from scratch? Check.

- "A git-like distributed file system for a Dockerized world." ... Using Go? ... Please just kill me.

If I ever leave the software industry, it's because of shit like this.

batbomb · on Feb 11, 2015

> Second, Etcd and Fleet are themselves designed to be modular, so it’s easy to support a variety of other deployment methods as well.

Zookeeper isn't Modular? No mention of HTCondor+Dagman? No mention of Spark?

I've written a general DAG processor/workflow engine/metascheduler/whatever you want to call it. It's used by various physics and astronomy experiments. I've interfaced it with the Grid/DIRAC, Amazon, LSF, Torque, PBS, SGE, and Crays. There's nothing in it that precludes Docker jobs from running. I've implemented a mixed-mode (Streaming+batch processing + more DAG) version of it which just uses job barriers to do the setup and ZeroMQ for inter-job communication. I think something like Spark's resilient distributed dataset would be nice here as well.

We don't use hadoop because, as was said, because it's narrow in scope and it's not a good general purpose DAG/Workflow Engine/Pipeline.

I think this is a small improvement, but I don't really see it being much better than hadoop, or HTCondor, or Spark.

HTCondor is pretty amazing. I think somebody should be building a modern HTCondor, not a modern Hadoop.

jdoliner · on Feb 11, 2015

Hi, JD @ Pachyderm here.

Our feeling is that there's a big difference between having projects like this in the ecosystem and having a canonical implementation that comes preloaded in the software. "Batteries included but removable" is the best phrasing of this we've found (stolen from Docker). I've seen tons of pipeline management tools in the Hadoop ecosystem (even contributed to some) it's not that they're bad it's just that they're not standardized. This winds up being a huge cost because if 2 companies use different pipeline management it gets a lot harder to reuse code.

I think etcd is a pretty good example of this. Etcd is basically a database (that's the way CoreOS describes it) by database standards it's pretty crappy. But it's simple and it's on every CoreOS installation so you can use it without making your software a ton harder to deploy.

Hope this clears some stuff up!

on Feb 11, 2015

[deleted]

exacube · on Feb 11, 2015

Redis cannot replace etcd: etcd (like zookeeper and chubby) offer strong consistency, which Redis does not. These requirements are key for building things like a directory service (which is how CoreOS uses etcd).

_hyn3 · on Feb 11, 2015

Here's a Jepsen runthrough on etcd:

https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul

nickysielicki · on Feb 11, 2015

what's not modern about Condor?

sgt101 · on Feb 11, 2015

Isn't Tez good for DAG's?

a8da6b0c91d · on Feb 11, 2015

> somebody should be building a modern HTCondor

A bug-free, cross platform, rock-solid HTCondor would be great. Sadly the existing HTCondor is a buggy turd. I've fixed some stuff in it and a lot of the code is eye-roll worthy. I can practically guarantee you hours of wondering why something that by any reasonable expectation should be working isn't working, and not giving any meaningful error message.

But like you said, a really solid HTCondor rewrite with the Stork stuff baked in would pretty much be all anyone should need, I think. It's not like you'd even really need to dramatically re-architect the thing. The problems with HTCondor really are mostly just code quality and the user experience.

lrm242 · on Feb 11, 2015

I use HTCondor every day. I find it to be rock solid with top notch documentation. It is cross-platform today as well, so I'm not sure where you're coming from on this.

chubot · on Feb 11, 2015

Interesting -- does HTCondor solve the same problem as Hadoop? I know it is for running batch jobs across a cluster.

How big are the data sets? How does it compare with the MapReduce paradigm?

I see there is this Dagman project which sounds similar to some of the newer Big Data frameworks that use a DAG model. I will make an uneducated guess that it deals with lots of computation on smaller sized data (data that fits on a machine). Maybe you have 1 GB or 10 GB of data that you want to run through many (dozens?) of transformations. So you need many computers to do the computation, but not necessarily to store the data.

I would guess another major difference is that there is no shuffle (distributed sort)? That is really what distinguishes MapReduce from simple parallel batch processing -- i.e. the thing that connects the Map and Reduce steps (and one of the harder parts to engineer).

In MapReduce you often have data much bigger than a single machine, e.g. ~100 TB is common, but you are doing relatively little computation on it. Also MapReduce tolerates many node failures and thus scales well (5K+ machines for a single job), and tries to reign in stragglers to reduce completion time.

I am coming from the MapReduce paradigm... I know people were doing "scientific computing" when I was in college but I never got involved in it :-( I'm curious how the workloads differ.

a8da6b0c91d · on Feb 11, 2015

Do you administer the clusters, or has somebody set everything up for you? Do you use it entirely on linux, or have you made the mistake of trying to get it to work on Windows? Which universe do you use: java, vanilla, or standard? I can imagine a perception that it's rock solid in a particular configuration and usage pattern, especially one where somebody else went through the pain of setting it up.

lrm242 · on Feb 11, 2015

I admin the cluster, so you're gonna need to try again.

a8da6b0c91d · on Feb 11, 2015

[flagged]

peteforde · on Feb 11, 2015

Why are you being so mean? You look like a huge asshole in this thread.

lmm · on Feb 10, 2015

The JVM runs every language I care about, in a mature system with a simple, stable interface. It has its problems (startup time), but for the kind of jobs you use Hadoop for it's rarely an issue.

Switching to Docker feels like a real step backwards. Rather than a jar containing strictly-specified bytecode, cluster jobs are going to be random linux binaries that could do any combination of system calls. You need another layer of build/management system to make your docker containers useable. Worst of all, rather than defining common shared services for e.g. job scheduling, this is going to encourage users to put any random component in their docker image. We'll end up with a more fragmented ecosystem, not less.

jdoliner · on Feb 11, 2015

Hi, JD @ Pachyderm here.

Our feeling isn't that Java is bad it's that it shouldn't be the only option for big data processing. There are a lot of people that want to do this type of computing and a lot of them hate Java. We want to build a solution that can serve them better and Docker is actually an incredibly high leverage way to do it.

I feel like I don't fully understand your criticism of choosing Docker as platform.

> cluster jobs are going to be random linux binaries that could do any combination of system calls

Could you give an example of what could go wrong here? Docker prevents you from talking to the outside world so you can't call out to an outside service in the middle of a job or anything like that.

> Worst of all, rather than defining common shared services for e.g. job scheduling

In Pachyderm job scheduling happens outside the user submitted containers. The job scheduler is what actually spins up the containers and pushes data through them. I'm afraid I don't understand what it would even mean for one of those containers to then do job scheduling itself.

Hope this clears up our thoughts on this a bit.

Thaxll · on Feb 11, 2015

The JVM is the best VM out there, it has been developped for a long time by very smart people, I don't see how docker is better in that case.

timtadh · on Feb 11, 2015

Docker isn't a VM? In docker everything is native code. There is no emulation. It uses OS level sandboxing primitives to provide protection and isolation. Running an application in docker gives you the same performance profile as running it on the host because it is running it on the host. So if you write your job in C or Fortran you can beat any Java code on the JVM nearly everytime. Especially for linear algebra applications.

derefr · on Feb 11, 2015

But in this use-case, you don't want to pick a VM; you want each task to specify a VM. A docker container is, effectively, an app that bakes in whatever VM it requires.

I can see the argument for managing workload definitions as an {app repo, VM descriptor} tuple—kind of like Heroku app definitions—rather than a single app+VM image blob. But when you're actually deploying them to run on machines, nothing beats baking everything into a reproducable static-binary-equivalent slug.

You don't need to let people run arbitrary docker images as workloads, but the ability to let people pick any VM—rather than just your preferred VM—while having this be just as safe and sandboxed, opens up a lot of flexibility.

(This is the same argument I give for Chromium's PNaCl, by the way: the advantage doesn't come from your ability to run one-off native apps in the browser; the true advantage is in <script>s that can specify exactly the interpreter engine—a PNaCl-binary VM—they need. Nobody seems to see eye-to-eye on this one, either.)

politician · on Feb 11, 2015

> Our feeling isn't that Java is bad it's that it shouldn't be the only option for big data processing. There are a lot of people that want to do this type of computing and a lot of them hate Java. We want to build a solution that can serve them better and Docker is actually an incredibly high leverage way to do it.

Thank you.

lmm · on Feb 11, 2015

> Our feeling isn't that Java is bad it's that it shouldn't be the only option for big data processing. There are a lot of people that want to do this type of computing and a lot of them hate Java.

JVM != Java. It runs a wide variety of languages, and this is not just theory - you can do Spark in Python right now. "Every language I care about" was perhaps overly flippant, but between Groovy, Clojure and Scala there's something to suit most programming tastes.

Meanwhile I hate Linux, and Docker doesn't have a good answer for that; the suggestions I've seen for developing with Docker on other platforms boil down to "run Linux in a VM".

> Could you give an example of what could go wrong here? Docker prevents you from talking to the outside world so you can't call out to an outside service in the middle of a job or anything like that.

I'm more worried about forward compatibility, standardization - and partly attack surface I guess. I can run jars from 15 years ago, on today's JVM - and not because it bundles an old version; I can link today's programs against 15-year-old libraries and they'll just work. If a library I built against has disappeared from the internet, that's fine.

Meanwhile I can't run linux binaries from 15 years ago (I did try with some of the loki games - maybe it's possible but it's decidedly nontrivial). Even if I found the right .sos, if I wanted to use a library with the ABI of ten years ago in my program, I'd have to rebuild the whole world. I guess I'm meant to rebuild my docker image, but that doesn't work if the library source is no longer available, and I'm not at all confident that the tooling does fully reproducible builds to the same extent as e.g. maven (from what I can see plenty of dockerfiles use unpinned versions).

Finally, the spec for the JVM is in one place, and it's published; apache harmony shows that a ground-up reimplementation is possible (that the TCK isn't open is very unfortunate, but the patents will expire sooner or later). I have a reasonable level of confidence that we'll always be able to run the jars we're producing today, without having to emulate x86s or what have you. Linux doesn't have a spec like that (yes the C code is available), and I don't think a ground-up reimplementation would be feasible.

> In Pachyderm job scheduling happens outside the user submitted containers. The job scheduler is what actually spins up the containers and pushes data through them. I'm afraid I don't understand what it would even mean for one of those containers to then do job scheduling itself.

Job scheduling was a poor example; I guess I'm thinking more things like e.g. running your own database on the cluster nodes, instead of talking to one outside the cluster via a proper interface. (Admittedly that's possible on hadoop with derby or h2).

Thinking about it more I guess it's less the service part and more the interface part that I'm worried about. A docker image could implement critical functionality with shell scripts. I bet they will. I've already spent too much of my life trying to get people to move away from shell scripts into something safer and more maintainable.

vorg · on Feb 11, 2015

> between Groovy, Clojure and Scala there's something to suit most programming tastes

> the spec for the JVM is in one place, and it's published

Between Groovy, Clojure and Scala, there's something to suit most tastes on the scale of how much language specification there is. Scala has a published spec (along with many white papers on the R.I. internals), Clojure has its published spec in the comments in the source code, and Groovy actively changes the spec between versions to prevent addons and alternative implementations being built.

lmm · on Feb 11, 2015

Sure, and I'd prefer people used the more specified options. But even if you have a Groovy library that you can't compile under modern Groovy, the bytecode will still run on new JVMs - more than you can say in the Docker case.

enigmo · on Feb 12, 2015

PySpark does not run Python code on the JVM. It uses py4j and it is slow as molasses. This has the benefit of supporting native Python libraries like NumPy. If you're not using such libraries you'd be better off using a JVM language, including Jython.

eropple · on Feb 11, 2015

I'm not really sure why this post got downvoted, because I think there's some merit to it. In a spherical-cow universe I think there's an argument for opening up the universe to all tools, but I think in practice this makes good practices much harder to maintain and harder to proof against issues and failures at a glance.

Personally, I dig Spark on EMR or Spark on Mesos (with Scala as a domain language), and I'm not sure how this plays with the rest of the world.

sgt101 · on Feb 11, 2015

I think Spark on Yarn on Mesos co habiting with Docker on Mesos managed by Twill is the future.

I just need to take a big breath every time I say it though.

spacecowboy_lon · on Feb 11, 2015

Sounds like some one doesnt want to put the work into learn java - which I admit is a stone cold bitch to work with (and I have done MR with Pl/1G and Fortran 77 back in the day ) but you can write hadoopp jobs in python.

rkwasny · on Feb 11, 2015

Scheduling jobs by distributing docker container? Looks really heavyweight.

There is a good reason most of Map-Reduce frameworks are written on top of JVM - it is VERY easy to serialize Java/Scala code ship it to remote host and execute there.

This is not trivial in compiled languages ( I believe some hacks for Haskell exists )

What we need in Big Data space is more people that understand how computers really work. Some effort must be put into supporting TCP/IP Stacks in Userspace using new 10GBe cards or support for IB verbs. Also the long forgotten art of writing zero-copy code.

why-el · on Feb 11, 2015

> Some effort must be put into supporting TCP/IP Stacks in Userspace using new 10GBe cards or support for IB verbs

Obviously my understanding of networks is not at that level yet. What would a layman do to become somewhat vested in this? Any books or resrouces?

chaotic-good · on Feb 11, 2015

This technique usually called 'kernel bypass'. http://highscalability.com/blog/2013/5/13/the-secret-to-10-m... and http://lukego.github.io/blog/2013/01/04/kernel-bypass-networ...

why-el · on Feb 13, 2015

Thank you. :)

codygman · on Feb 12, 2015

I wouldn't consider StaticPointers a hack. I believe this is the feature you are talking about.

ericfrenkiel · on Feb 10, 2015

I find it very odd that the author completely glossed over the recent Spark developments in the Hadoop ecosystem. In many ways, Spark is meant to replace the Map-Reduce paradigm to enable easier access to the data.

jdoliner · on Feb 10, 2015

Hi, co-founder

This is something we probably should have spoken directly to in the article because of how popular spark is. The main reason we didn't is that we like Spark and don't feel it needs to be replaced. It doesn't seem to have most of the ecosystem problems we discuss because it's got a company behind it. My understanding of Spark is that it's designed to be used with different storage backends and I'm very curious to see what would happen if we got it talking to pfs. I think it could work very well because Spark's notion of immutable RDDs seems very similar to the way pfs handles snapshots.

Hope this clears a few things up and apologizes if I've mischaracterized Spark here I've only used it a little bit.

quadrature · on Feb 10, 2015

Its not that odd, spark is very similar to the hadoop way of doing things and there are a few other projects like it that are attempting to be a better hadoop. So I dont think that much was lost by not mentioning it.

threeseed · on Feb 10, 2015

How is Spark similar to the Hadoop way of doing things ?

It operates very differently from Hadoop for us. SparkSQL allows third parties apps e.g. analytics to use JDBC/ODBC rather than going HDFS. And the in memory model and ease of caching data from HDFS allows for different use cases. We do most work now via SQL.

Combining Spark with Storm, ElasticSearch etc also permits a true real time ingestion and searching architecture.

ianburrell · on Feb 10, 2015

Spark is a more general data processing framework than Hadoop. It can do map-reduce, can run on top of Hadoop clusters, and can use Hadoop data. It can also do streaming, interactive queries, machine learning, and graph processing.

CurtMonash · on Feb 11, 2015

This sounds bizarre. Spark is the replacement for MapReduce, with Map and Reduce being just 2 of its 15 (?) primitives. Who needs another MapReduce engine?

The bit about "We'll get around to Hive and Pig eventually" is also odd, as they're approaching legacy status too. (Albeit not there yet; please recall that I'm somewhat of an Impala skeptic, and of course there's the Tez booster to Hive.)

Also bizarre are some of the history claims in the article, e.g. that YARN is part of more or less original Hadoop.

jdoliner · on Feb 11, 2015

It's been talked about a few times in this thread already so I'll try not to rehash too much. We think Spark is way cool but disagree that it's a complete replacement for MapReduce. There were people else where in this thread who described pretty well their use cases that didn't work with Spark.

My experience is that Hive is pretty widely used, I have a lot fewer data points about Pig. We're not particularly attached to either of these. That sentence was more to indicate that we'd be adding some sort of interfaces for specific data types.

Seems like we got our history wrong here. When was YARN added we'll add a correction.

Hope this helps clear things up.

thinkmassive · on Feb 11, 2015

YARN was introduced with Hadoop 2.0, by MAPREDUCE-279: https://issues.apache.org/jira/browse/MAPREDUCE-279

Note that YARN also supports Docker containers as of Hadoop 2.6...

YARN-1964: Create Docker analog of the LinuxContainerExecutor in YARN https://issues.apache.org/jira/browse/YARN-1964

CurtMonash · on Feb 11, 2015

Thanks.

I'll stick with http://www.dbms2.com/2014/04/30/spark-on-fire/ however, for the most part. (The mid-2014 PR blitz I thought I was engineering didn't actually happen.)

trhway · on Feb 10, 2015

Writing new implementation of a batch processing misses the main point that batch processing in general is just a trade-off made in order to get things started. This is why new generations of systems are going the way of online and/or stream processing.

threeseed · on Feb 10, 2015

Have you actually seen this happening ?

I have terabytes of new data arriving per day that needs to be verified, ingested and translated. There simply is no way to do this via online or stream processing. If you ingesting clickstream or Twitter data then sure it will work. But more often than not you need to work with sets of data. And for that batch processing is the only option.

jdoliner · on Feb 11, 2015

JD @ Pachyderm here. We try to talk to as many people as possible about their data infrastructure so I have a decent sample size on this. My experience is that there a specific use cases for which streaming solutions are gaining ground but there's still a lot for which it isn't and probably never will. Streaming is great but there are a lot of situations where it just doesn't fit. I think what might be going on is that there isn't a lot of chatter surrounding batch based processing so it seems like it's going out of style but the use cases are still real

scott_s · on Feb 11, 2015

What is it about stream processing that doesn't fit your needs? I do research and development on a streaming system, so I'm very interested to hear about how you need to use your data.

angersock · on Feb 11, 2015

So, I've got to ask...what do you actually get for that data? Like, what does your business actually use it for?

Are you folks doing sensor analysis or something, or eating logfiles, or what?

threeseed · on Feb 11, 2015

Financial Services with millions of customers (and growing rapidly).

What we can get for that data is a major competitive advantage. We can offer much cheaper financial products since we model risk individually rather than as a cohort. It also allows us to have a single customer view despite reselling other companies products.

Building a single customer view with lots of disparate data sets is a big trend right now.

angersock · on Feb 11, 2015

Ah, okay, so that makes sense. I've seen a few cases where data is just indiscriminately hoovered because reasons, and I can never help but wonder what the expected ROI on it is.

jdoliner · on Feb 10, 2015

Hi cofounder / lead Pachyderm dev here. Pachyderm actually does a lot more with MapReduce than batch processing. We need to work on making this clearly in our marketing. The way this works under the hood is pretty cool though. Pfs is a commit based filesystem, which means that we think of the entire world as diffs. When your entire world is diffs then you can transition gracefully between batch processing and stream processing because you always know exactly what's changed.

This is generally one of the things that's frustrated us about the Hadoop ecosystem. Hadoop's implementation of MapReduce, which only supports batch processing, has become conflated with MapReduce as a paradigm, which can be used for way more than just batch processing.

Hope this clears a few things up!

yid · on Feb 10, 2015

So you're trying to build a realtime MapReduce? The other part about the diff-vs-batch tradeoff really depends on what the performance penalty of moving to a stream of diffs is going to be, over a batch. If it's uniformly better than batch processing, then you've also just invented a better batch processing transport.

jdoliner · on Feb 11, 2015

MapReduce is actually an incredibly flexible paradigm. If you have a clean implementation of MapReduce with well defined interfaces you can make behave in a number of different ways by putting it on top of different storage engines. We've built a MapReduce engine as well as a the commit based storage that lets us slide gracefully from streaming to batched.

Our storage layer is built on top of btrfs, we haven't put together comprehensive benchmarks yet but our experience with btrfs is that there isn't a meaningful penalty to reading a batch of data from a commit based system compared to more traditional storage layers. I really wish I had concrete performance numbers to give you but we haven't had the time to put them together yet. I will say that our observations match with btrfs' reported performance numbers and are what I'd expect given the algorithmic complexity of their system.

SEJeff · on Feb 11, 2015

So pretty much everyone here mentions Hadoop or Spark, but even spark streaming is not realtime, but glorified microbatches. Am I missing something here or has no one in this thread considered storm or LinkedIn's samza for true real time stream parsing / ETL type jobs? Seems like the slam dunk to me. Bonus points that they both use Kafka

kod · on Feb 11, 2015

Spark streaming will handle 0.5 second batches. If you deeply care about sub-second latency, JVM in general is not where you want to be.

kev009 · on Feb 11, 2015

No disclosure of inspiration from Manta https://github.com/joyent/manta?

Also, why yet another DFS that makes excuses for shitty local filesystems instead of leveraging ZFS? Use ZFS as the base for integrity, snapshots, send/receive and make a small service on top for managing cluster metadata.

jdoliner · on Feb 11, 2015

Hi, JD @ Pachyderm here.

We think Manta is pretty cool but it wasn't actually an inspiration for us.

I'm not sure I fully understand your criticism of the filesystem. Pfs was actually initially conceived as almost exactly what you're describing, a small service on top of ZFS. But we realized early on that ZFS was going to be a pain because of its patent status so we used btrfs instead. I think btrfs is worse than ZFS but I'm not sure I'd call it a shitty local filesystem.

bipin_nag · on Feb 11, 2015

I agree with your findings on Hadoop. It is a very complicated ecosystem. The core is tied to JVM and abstractions of map-reduce are heavy and difficult to use. There are multiple domain specific languages/api to do the same thing : pig, cascading. Yet if you pick one you cant use other. Hadoop feels like tower of babel.

But I am not in the favour of creating another stack as Hadoop replacement. Spark fills the role of distributed computation framework very well. It uses scala a functional language which provides neat abstractions for map,reduce,filter and other operations. It introduces RDD to abstract data. All the components graphx, sparksql, blinkdb use scala.

Also building a new map-reduce engine is a difficult task. The computation engine will lie in the middle of stack not on top as shown. Libraries will sit on top of that, so if you build an engine you will have to build everything on top of that.

See the Spark ecosystem : https://amplab.cs.berkeley.edu/software/

eaxitect · on Feb 18, 2015

TL,DR; Although I'm not a fan of entire Hadoop system, some judgements about it is totally wrong.

1) Do one thing well - Hadoop is so stacked up because of this (which has pros and cons). Every single component of Hadoop is designed specific to a job (there are many overlapped components, though)

2) Hadoop's main idea is to be data-local, yet this Pachy thing not to be: "Data is POSTed to the container over HTTP and the results are stored back in the file system". Data-locality (almost) always win.

3) There is no scheduler like YARN specified, yet fleet and etcd are not schedulers.

4) You may not like Java or JVM-based things (like I do), but there is no problem of being Java-based or JVM-based.

…

spennant · on Feb 11, 2015

Funny... I just put up a hadoop cluster last week and named it pachyderm.mydomain.com (i'm probably not the first to do that either)

sytse · on Feb 10, 2015

Sounds awesome, do you have any performance stats?

jdoliner · on Feb 11, 2015

Hi!

We unfortunately don't right now. It's surprisingly hard to get benchmarks that give you a true apples to apples comparison between us and Hadoop. However it seems from the comments in this thread that there's a lot of interest in those though so we'll try to put those together ASAP.

sytse · on Feb 11, 2015

Thanks!

thinkmoore · on Feb 11, 2015

Still trying to wrap my head around the idea that 10 year old software is not "modern."

That said, I have no opinion on this article.

amelius · on Feb 11, 2015

Personally, I'd like to run purely functional jobs, and get progress information while they are running (this is not as easy as it sounds). Automatic memoization of jobs. And automatic killing of jobs and removal of data when results are no longer required (this should work transitively/recursively).

sp332 · on Feb 11, 2015

You want a purely functional job with a side effect of updating progress info?

scott_s · on Feb 11, 2015

It's a reasonable request. The job is purely functional in the computation on its incoming data. That means you don't need to worry about persistent state management, which is a problem if you want to easily move jobs around a distributed system. The job (or part of the job) can still send metrics about what its doing to some other service, and still remain functional with respect to the data it's processing.

sp332 · on Feb 11, 2015

Functional doesn't just mean repeatable. If you modify global state that is not represented in the "input" and "output" of the job, then it's not purely functional.

scott_s · on Feb 11, 2015

Yes, I am aware of what "functional" means. But I think you are being too literal. If you view the computation through the lens of the actual data involved in the computation, then it is functional. That it also has some internal state that gets updated and reported as part of the monitoring of the computation - but not part of the computation - then it is still functional with respect to the actual data in the job.

The key distinction is that this internal metrics data that is just used for monitoring is incidental, and can be thrown away without impacting the computation's result. It's just convenient information that humans might want to have so they can reason about the internal state of the overall system.

amelius · on Feb 11, 2015

You are right. It is also possible to view the process as a lazy functional computation, which produces a list of progress-items, with the last element of the list being the result of the computation. Such a view is useful when job A invokes another job B, and job A needs to compute new progress information based on the progress information from job B.

Anyway, I think such a tool would be highly useful. Also, "progress" is an often overlooked feature, but users actually appreciate it a lot, and often even require it. And you simply can't easily add progress-computation to a system as an after-thought. You have to "weave" it through the design as you build it.

scott_s · on Feb 11, 2015

You generally don't want things such as metrics to be on the critical path for the computation itself. What you're proposing is clean by functional programming standards, but it would probably effect performance because now computing the metrics is tied to the computation. You often want such things to be disjoint for both performance and software engineering reasons.

There's also the fact that in such distributed systems, the entity that consumes the result of the computation is generally not the same entity that consumes the metrics.

pjmlp · on Feb 11, 2015

It is not modern if GNU/Linux is the only deployment option.

I love the ability to run the cluster in any OS that has a JVM available, from big commercial UNIXes, Windows, *BSD and GNU/Linux distributions.

KaiserPro · on Feb 11, 2015

Sorry, but using HTTP for transport is just terrible.

Hadoop is a poor imitation of a mainframe. When you are processing large lumps of data you need the following:

A task scheduler A message passing system(although in some workloads this can be proxied by the scheduler or filesystem) File storage

In mainframe land, you have a large machine with many processors all linked to the same kernel. As its one large machine, there is a scheduler that's so good you don't even know its there. However mainframes require specialist skills.

If you want shared memory cluster, you're going to need inifiband (which is actually quite cheap now, and has 40gig ethernet type majigger in it aswell)

but people are scared of that, so that leaves us with your standard commodity cluster.

In VFX we have this nailed. before leaving the industry I was looking after a ~25k CPU cluster backed to 15pb of storage. No fancy map reduce bollocks, just plain files and scripts. Everything was connected by ten gigs ethernet. Sustained write rate of 25 gigabytes a second.

firstly task placement is dead simple: https://github.com/mikrosimage/OpenRenderManagement/wiki/Int... look how simple that task mapping is. It is legitimate to point out that there are no primitives for IPC, but then your IPC needs are varied. Most log churn apps don't really need IPC, they just need a catcher to aggregate the results. Map:reduce has fooled people into thinking that everything is must be that way.

This is designed to run on bared metal. Anything else is a waste of CPU/IO. We used cgroups to limit process memory, so that we could run concurrent tasks on the same box. But everything ran the same image, anything else is just a plain arse ache.

The last item is file storage. HDFS is just terrible. the reason that its able to scale is that it provides literally nothing. Its just a key:value store, based on a bunch of key:value stores.

If you want fast super scalable posix filesystem, then use GPFS(or elastic storage). The new release has a kind of software raid that allows you to recover from a dead disk in less than an hour. (think ZFS with shards)

failing that you can use lustre (although that requires decent hardware raids, as its a glorified network raid0)

or if you are lucky you can split your file system into name spaces that reside on different NFS servers.

Either way, Hadoop, and hadoop inspired infrastructures are normally not the right answer. Measos is also probably not the answer unless you want a shared memory cluster. but then if you want this you need inifinband...

bra-ket · on Feb 11, 2015

fantastic idea, unfortunate name (Pachyderm, seriously?)

TeMPOraL · on Feb 11, 2015

Excuse me for going off-topic, but heck, Hadoop itself was released 3 years ago. It's a testament to how insanely fast things change in web that Hadoop is no longer considered "modern"...

michaelochurch · on Feb 11, 2015

This sounds promising.

I'd love to see a Haskell-centric (performance and robust code) but ultimately language-agnostic alternative to the JVM-heavy stack that's currently in vogue. (Spark is a promising and powerful technology.) I choose Haskell because pretty much anything that the JVM does well (e.g. concurrency) Haskell also does well and, while I like Clojure and can enjoy Scala when written by competent people, Java is ugly and its community has worse aesthetic sense than a taint mole.

necubi · on Feb 11, 2015

You may be interested in Cloud Haskell[0], which is an erlang-like distributed computation engine. The main thing that's missing in Haskell-land is an alternative to HDFS.

[0] https://wiki.haskell.org/Cloud_Haskell

enigmo · on Feb 11, 2015

Why not use Ceph? There are solid Haskell bindings for RADOS: https://github.com/anchor/rados-haskell