Hadoop's Uncomfortable Fit in HPC

jandrewrogers · on May 18, 2014

As someone who has straddled the world of HPC and Big Data for years, this is a pretty solid summary of the issues. From my perspective, both sides could learn quite a bit from the other but there is little cross-fertilization.

Basically, Big Data is much more sophisticated at distributed systems computer science and HPC is much more sophisticated at massive parallelism algorithms computer science. HPC tends to throw money at hardware to solve distributed computing problems, and Big Data is still very limited when it comes to parallel algorithm design techniques because their sharding models are incredibly primitive (hash and range only? no thanks) partly because of their original use cases.

In my perfect world, a good platform would combine the areas of sophistication from both but people in one domain tend to lack the perspective and experience of the other. And in many cases, the only way you can learn the more advanced techniques is by working in the field; I can't tell you how much really sophisticated HPC computer science is not documented anywhere on the web but I used it every day when I was working on those systems.

KaiserPro · on May 18, 2014

The one thing that HPC (and VFX) have over anyone else, is rational thought.

A thing is used because it is the very fastest. we know its fast because we have the numbers to prove it.

Infiniband is expensive, but it's much faster than ethernet for DMA. HTTP is far to inefficient for MPI. Posix file systems are brilliant. Object storage is far to slow and unreliable (see Http) ans usually miss important features like locking or concurrent access.

when you're paying by the hour, that 5-30% VM/bytecode penalty is ruinously expensive.

anonymousDan · on May 18, 2014

What other sharing models are commonly used in HPC?

colin_mccabe · on May 18, 2014

This article is a good overview-- one of the most informed I've seen from the HPC side of the world.

I don't think "HDFS is very slow and very obtuse" is true, though. If you use HDFS for what it was intended for-- reading large files that are mostly local-- it does pretty well. MapReduce and Impala explicitly support this paradigm by looking at where files are and scheduling computation there. If your workload has a ton of metadata operations, this can become a bottleneck, but that is not unique to HDFS. For example, if you use "ls" with the wrong options on Lustre, things can get quite slow due to the large number of metadata operations generated: https://wikis.nyu.edu/display/NYUHPC/Lustre+FAQ

Lustre is a cool filesystem in many ways, and has been battle-hardened for HPC. However, I think Lustre is relatively unlikely to gain ground in Hadoop due to the complexity of administering and upgrading its in-kernel server. Ceph might have a chance, though-- we'll see.

The article also doesn't discuss a lot of the areas in which Hadoop is way ahead. Hadoop handles faults in individual nodes, rather than just assuming we will use gold-plated hardware with infalliable RAID. Node failures happen, and HDFS and Hadoop can recover-- even from NameNode failures, these days. Google's Spanner paper points the way towards having strong consistency without sacrificing (as much) availability.

I agree that cramming traditional HPC problems into the Map-Reduce framework is not really a great idea. Instead, it would be better to use something like Spark (which, to his credit, Lockwood mentions). Spark is getting a huge amount of attention right now, and I think it's going to resolve some of these issues.

batbomb · on May 18, 2014

Hadoop will not penetrate HPC.

I'll say it again:

Hadoop will not penetrate HPC.

The reasons why vary quite a bit. But the core reasons are this:

Data is rarely local.

Resources are shared.

Users are heterogenous, even if the cluster isn't.

Users don't typically have CS degrees or understand more than basic networking (Computational Chemists and biologists tend to be a but more advanced)

Users still heavily rely on Fortran.

Users rarely know much about SQL, so even Hive is often out of the question.

Clusters are rarely under the control of users. On top of that, Users are rarely able to launch VMs.

Lots of other reasons too.

seanmcdirmid · on May 18, 2014

And yet, CUDA is currently dominating in many HPC applications. There are reasons Hadoop/MapReduce is a poor fit for HPC, but your argument points above aren't very universal.

KaiserPro · on May 18, 2014

In HPC they are.

CUDA wins because it is wonderfully fast. There is a huge benifit for implementing it, even if it requires lots of work to get efficient MPI, because the rewards are literally massive.

Hadoop uses REST ffs. You're never going to get low latency with REST. Its a nasty hack in the world of web, let alone HPC where there really isn't any excuse for using HTTP for something its really not designed to do.

But the question is this:

what do people actually use Hadoop for?

growse · on May 18, 2014

What do people use Hadoop for? Are you asking because you genuinely don't know?

Hadoop lets me pick up PBs of (typically) record-format data and stick it on a bunch of cheap servers that are already standard items in my datacenter catalogue and are well-understood by my datacenter/ops teams. I can then open up any Java IDE, write a program now I can process that data. Up until this point, I've not really needed much specialist knowledge, perhaps with the exception of understanding the map/reduce paradigm which is a fairly basic programming construct anyway.

There's lots of companies out there who have data problems that are bigger than a few TB, and they're bored of shovelling money at Oracle so they can have a fancy, expensive database when their data isn't actually relational to begin with.

KaiserPro · on May 18, 2014

Thats the thing, So I get that it gives you network aware Fork() Its just the scheduling is a bit, well, lacking.

In my datacenter, we run dirty cheap iron. The only thing special are the core switches(lots and lots of ten gig ports, to cope with the petabytes that we shove across it daily)

we use standard linux tools, sharded posix file systems with a decent task dispatcher that understands dependencies.

cabinpark · on May 18, 2014

I guess, but at an HPC conference I went to last year there was a big session on Hadoop and introducing it to the researchers as a potential alternative for certain tasks. I also believe that technologies take a lot longer to penetrate academia then industry given the nature of academics so I don't doubt Hadoop's slow acceptance into the HPC community. Look at GPUs for example. They've been around for a while but it's really only been in the last few years that the tool has been put into the toolkits of researchers.

I do know one professor math professor who just a few hundred thousand on a Hadoop cluster to experiment with numerical algorithms. Talking with him, and I trust his opinion since he does research specifically in high performance linear algebra, was that he thought the technology was very appealing and can be an excellent addition to the repertoire. However there is still much work to be done since numerical linear algebra (which is the backbone of almost every single HPC calculation) using the Hadoop framework is very young. I believe it'll take a few years for researchers to determine whether or not Hadoop could be a useful tool for solving the types of problems academics do or whether it's just not as effective as the current tools.

tdurden · on May 18, 2014

> Its runtime environment exposes a lot of very strange things to the user for no particularly good reason (-Xmx1g? I'm still not sure why I need to specify this to see the version of Java I'm running

I am pretty sure you don't need to specify a maximum memory allocation size to determine the version of Java being used.

EdwardDiego · on May 18, 2014

Yep, it's just `javac -version` or `java -version`

MichaelGG · on May 18, 2014

He added(?) a note explaining that the default is 4GB and will fail if the user has limits which is common in HPC configurations.

EdwardDiego · on May 20, 2014

Default is 1/4 of physical memory for a server JVM. I'd emphasise the word server. Client JVMs are 1/4 of physical memory to a max of 256MB.

I always enjoying reading blogs by people unfamiliar with a technology about why their poor understanding of a technology indicates that the technology they know better is better.

ChuckMcM · on May 18, 2014

This was a fascinating look at an HPC person't thought on Hadoop. Interesting that before Beowulf "super computing" was exemplified by the Cray series. The word was that NUMA was incompatible with HPC, but folks found ways around that.

Its a totally valid observation that many HPC problems are not amenable to a Hadoop cluster, but it is not clear to me yet that there are no HPC problems that a distributed data approach might do better than the current system :-)

KaiserPro · on May 18, 2014

Scheduling is what HPC can do much better. The concept of data/message/cache/storage/memory affinity seems to be a novelty.

epaulson · on May 18, 2014

The real mismatch is handling failures. The HPC world has grown up handling failures by failing the entire computation. As a defense against failure, they usually checkpoint the entire computation.

Part of this is the fault of the MPI standardization process: MPI didn't even try to deal with dynamic-but-expected changes to the computation (much less changes caused by "node" failure) because not all of the initial systems MPI was targeting could handle nodes coming or going. In this way, MPI was sort of a step backwards - PVM was much more flexible than MPI from the get-go.

I haven't paid much attention to MPI in a few years, so maybe some of the MPI-FT approaches have caught on.

One of the initial targets for YARN was MPI, but it was never clear to me that any MPI users actually wanted to use YARN.

With just a few tweaks, PVM and AFS could have given us a pretty good Big Data stack 15 years ago (And in the case of security, a better stack than what we've got today.) The hardware wasn't quite cheap enough to make the case as compelling as it is today, and the MapReduce programming model helped sell the entire approach, but it was a real missed opportunity.

TallGuyShort · on May 18, 2014

I've discussed the MPI-on-YARN thing with quite a few people from both communities and really only found one use-case: MPI jobs you want to run with data in your Hadoop cluster, where it will be done so rarely it's not worth moving the data from your cluster to your supercomputer. Even then, I can't name an instance of this type of use case I've actually seen in the field.

anonymousDan · on May 18, 2014

Out of interest, what are the security issues you see with Big Data stacks?

TallGuyShort · on May 19, 2014

Most of the tools are written under the assumption that you're behind a firewall or airgap that will do all your security for you, and the controls are very weak and poorly enforced in the younger projects. Even in the mature ones, there's hasn't really been a cohesive model of security across the stack that meets the needs of stricter organizations. That's improving - I'm doing some work with Apache Sentry (incubating) that lets you apply more sophisticated security policies to fine-grained data in Hive, Impala and Solr (and in the future, hopefully many others). There are also projects like Apache Knox that, as I understand it, is a gateway for your cluster - so taking the idea of having an external tool protect your cluster, but at least one that understands what's happening in the cluster and handles authentication, etc. Multi-tenancy is also in it's infancy for Hadoop - but I'd expect as more of the projects start integrating better with YARN's resource management this will also be improving in the very near future.

pjmlp · on May 18, 2014

I think a possible good outcome of people trying to squeeze Hadoop in HPC world is already happening with the pressure to give more focus to performance in Java 9 roadmap.

If it will really happen, remains to be seen. But I am following it with interest.

frozenport · on May 18, 2014

The real problem is everybody trying to jump on the Big Data™ train. I work a block from the NCSA and everybody is saying big data, even when it means a few gigabytes in a traditionally non-data format like images. The molecular dynamics simulations at UIUC don't generate Big Data unless you refer to the log files, which are generated once and of an order smaller than most people see.

"What is Big Data?" is frequently answered with "What I do!"

TallGuyShort · on May 18, 2014

I've always thought of "Big Data" as being big in the same sense as "Big Oil". It's not any specific capacity of storage or computation - it's the value being extracted from the data to make companies a powerful influence in their industry. I still think the term is overused as well, but I think the "Big" doesn't have to refer to the dataset itself.

saryant · on May 18, 2014

This is completely off-topic, but "Big Oil" actually did have a specific historic meaning: it's the descendants of the Seven Sisters, the oil companies party to the Consortium of Iran. Those seven were Anglo-Persian (now BP), Gulf + Standard of California + Texaco (now Chevron), Shell, Standard of New Jersey + Standard of New York (Exxon). By popular convention, Total has sort of slipped into this group despite not being party to the Iran agreement, which obviously undermines the historic meaning of the term.

EdwardDiego · on May 18, 2014

The only working definition of "Big Data" that I've come across is "Data too large to work with in a traditional RDBMS".

slurry · on May 18, 2014

"Big Data is any thing which is crash Excel." - @DEVOPS_BORAT

EdwardDiego · on May 20, 2014

Oh, that is fantastic. :D

KaiserPro · on May 18, 2014

if its not a petabyte its not even close :)