Cloudera and Hortonworks merge

alexnewman · on Oct 3, 2018

I was an early employee at Cloudera, am a Hadoop contributor and think this entire market is garbage. Basically the big data field went off the rails and this is Cloudera's way of trying to remain relevant. It's hard to point to a product that came after my tenure (I was on the team that original made the POS called Cloudera Manager) that's really used by anyone at scale. Cloud is displacing all of these tools and never got the clouds to play ball.

colin_mccabe · on Oct 4, 2018

> It's hard to point to a product that came after my tenure (I was on the team that original made the POS called Cloudera Manager) that's really used by anyone at scale.

There are tons of them. Spark, Hue, Sentry, Kafka, the list goes on.

alexnewman · on Oct 4, 2018

Spark, Sentry and Kakfa are not cloudera projects. They all existed as open source before i left. Also cloudera never truely dominated any of them. Spark has databrix and Kafka has confluence.

dswalter · on Oct 4, 2018

I think the parent comment was saying that none of these were primarily made by cloudera.

colin_mccabe · on Oct 4, 2018

Hue is a Cloudera project, not even an Apache project. Sentry is an Apache project but primarily developed by Cloudera.

alexnewman · on Oct 8, 2018

Hue was made before cloudera manager. I know because when I first wrote it, it was based on hue. I think if sentry and hue are all cloudera has brought to the table...

alexnewman · on Oct 4, 2018

indeed. or controlled them

capkutay · on Oct 3, 2018

Good move for both companies. The surplus of 'enterprise hadoop' companies was created by a mixture of hype and a peak in VC investment in open source.

The fact that Cloudera, Hortonworks, MapR were all founded and raised $100m+ around the same time was a bit superfluous for the whole market.

Hard to say where this leaves MapR now. They seem to be the odd man out in terms of growth and adoption.

jamesblonde · on Oct 3, 2018

We have a startup, Logical Clocks (www.logicalclocks.com), that just raised money to sell our data-science next-generation Hadoop stack. It has a new verison of HDFS called HopsFS (Nvme storage, distributed metadata) and support for GPUs in YARN. And distributed tensorflow, Spark, Airflow, Flink.

So, what does that mean for us? I am shellshocked. I expect prices to increase (good for us). What else should we expect?

agibsonccc · on Oct 3, 2018

Both Horton and Cloudera require a huge number of partner resellers and consultants. What horton and cloudera have isn't just software, but brand, an install base, and the ability to back it up.

They merged partially because they were both being cannibalized by services revenue they couldn't get rid of. Now they are struggling to move to the cloud (see Atlas now)

I'm not sure yet another hadoop distro with a bunch of 1 off tooling that is supposedly faster is the answer. Why is all that stuff even needed?

jamesblonde · on Oct 3, 2018

Adam, what about Java for deep learning? Is that needed?

agibsonccc · on Oct 3, 2018

According to my customers and users yes? Granted, we do a lot more than just that though.

Dl4j itself has a decent sized user base. Ranging likely from your phone maker to your bank and retail store.

We have our own software distro too which is why I'm commenting on this. We don't try to boil the ocean with a bunch of tech though.

There's a whole new crop of companies focusing on solving bits of the ML problem well rather than trying to do storage and god knows what else.

My point here about you guys is you're trying to compete in what is largely a commodity market. People don't need all this stuff. Simplicity won here. It's not about better tech.

You guys have the same pitch MapR does and largely the same problem: Better tech is only part of the problem with adoption. You need customers, users, and a clear business model when going to market.

Cloudera and Horton ran one playbook that at least somewhat worked (it got them public) and now they can focus on competing with the cloud vendors, which made the right decision and just made commonly used software easy to use.

jamesblonde · on Oct 3, 2018

We don't. We are the only on-premise vendor with proper support for GPUs and python. We are a full data science stack, backed by Hadoop. We even have Kubernetes for model serving. And python in the cluster (with conda environments). And we have customers and funding. Nothing like MapR. And none of the legacy mapreduce crap.

agibsonccc · on Oct 3, 2018

You bundle way more than they do and on top of that have your own file system just like mapr does.

Your pitch is still about differentiated tech, not a large install base, a differentiated business model

and something related to people like a good partner ecosystem.

Your pitch here requires tons of services.

People don't know how to use all of this stuff especially on prem.

It takes more than just code to build a business.

I say this as someone who's been doing this since 2013. It's not easy.

jamesblonde · on Oct 3, 2018

Ok, now you're changing your angle. Differentiated tech is what we are all about - that is ok by me (for now).

If you want to train DNNs on a hundred GPUs today on-premise on TensorFlow, come to us, we can do it. They can't.

agibsonccc · on Oct 3, 2018

I don't think I changed my angle here? I'm still addressing

the same point. Tech doesn't matter. Simplicity does.

Even in our own product line, we only do a small

subset of this. We don't even require a cluster

to run. We also work with tech that people use.

You are currently competing with horovod

and kubeflow. eg: "competing with free"

You need more than that to survive.

Generally, that comes down to services.

ironchef · on Oct 4, 2018

A slightly different wording is it comes to using the tech... which is either through services or through adoption / operationalization (which... with complexity... is very difficult)

agibsonccc · on Oct 4, 2018

Exactly!

xapata · on Oct 4, 2018

Who needs to do that? Not many companies. Deep Learning is overhyped, just as MapReduce was.

HenryR · on Oct 4, 2018

I am surprised that anyone is trying to differentiate on storage at this time, precisely when that's the part of the stack that's being cannibalized by the cloud vendors (look at the rate of innovation in HDFS over time; the effort is going elsewhere). Are you just targeting on-premise clusters, or is there some differentiation planned for the cloud as well?

jamesblonde · on Oct 4, 2018

We think that there is a niche for a higher performance dist FS than S3. We have integrated NVMe hardware with our HDFS implementation (HopsFS) and made its metadata layer distributed. NVMe means you can, for example, work with datasets with millions of files for deep learning - instead of having to munge them into parquet files because your FS is slowing down your machine learning pipeline.

Reference: https://www.logicalclocks.com/millions-and-millions-of-files...

We have also redesigned the stack around our distributed metadata layer.

We are primarily targeting on-prem right now, but HopsFS would be the fastest DFS in the cloud if you ran it there today.

Game_Ender · on Oct 4, 2018

How does HopsFS compare to Lustre?

SpaceManNabs · on Oct 3, 2018

You support Flink too? Nice! I am glad to see more and more people eyeing it as a viable framework for streaming.

Do you support Kubernetes too?

jamesblonde · on Oct 3, 2018

Yes, but for model serving. Not for Flink or Spark. That's on YARN, for now.

batoure · on Oct 4, 2018

Don’t count MapR down, their commitment to developing ports and tools for the ecosystem outside the jvm was very forward thinking. Their internal engineering talent either makes them a prime aquaition target for CDH in future or a fierce competitor as systems like kudu become more common and production ready.

tumanian · on Oct 4, 2018

MapR is great while you are in their ecosystem - once you step outside and try to bring in technology thats not supported by them, you run into little showstopper bugs and complete lack of documentation and community support.

purplezooey · on Oct 4, 2018

Our experience has not been consistent with that. For example we use MapR-DB and have never had a single issue with HBase compatibility or bugs. My DS team loves the database especially it's quite good.

purplezooey · on Oct 4, 2018

My current company uses MapR and I have to say their software is impressive. It's much easier to use than CDH or HDP. I think they'll do fine.

dirkgently · on Oct 4, 2018

> surplus

Define surplus. Why having 3 or 4 companies in 1B plus market is surplus?

dfee · on Oct 4, 2018

Because there are typically uneven returns due to the winner-takes-most style of internet businesses.

Differentiation is of course key, then.

PanosJee · on Oct 4, 2018

This is not an Internet business (no network effects), switching is not that hard on OSS stacks

ameyamk · on Oct 3, 2018

Good move! Would help consolidate Hadoop ecosystem. MapR would find it hard to compete now.

Both these companies are being challenged with a new crop of Databricks and the confluents of the world.

Tsarbomb · on Oct 4, 2018

Can someone explain to me what the big draw was for Hortonworks or Cloudera?

Working as a lead in a small team that deals with a colossal amount of data (human genomics), it was always easier for us to hand roll deployments with terraform/ansible in either baremetal or OpenStack environments.

In the public clouds like AWS we are using the managed services like EMR.

The whole sales pitch I've always got from either hortonworks or Cloudera always seemed more aimed at nontechnical stakeholders than the technical ones. Am I wrong? Have I missed out on some cool stuff?

coredog64 · on Oct 4, 2018

You’re not wrong. It would be hard to sell CDH if you only engage with technical stakeholders. I can imagine the pitch now...

“So what you get is really old versions of everything, plus some shaded jars that will break if your classloader ever changes load order. As far as cloud goes, we’ll give you this half-baked automation (Director) and we’ll also constrain you from using any features of your cloud provider (availability zones, load balancers, Amazon Linux 2). Finally, we require you to use Oracle’s JDK but we won’t distribute it so that you’re indemnified.”

computronus · on Oct 4, 2018

But you have heard of Director :)

dikei · on Oct 4, 2018

I have used both Ansible and and Ambari (from Hortonworks) to deploy and maintain our big data stack. They each have pros and cons

Ansible

Pros:

- More flexible

- Easier to update separate components

- Bring-your-own monitoring & alerting

Cons

- Higher initial time to setup playbooks

- Rolling update configuration is a pain

- Upgrading components require more compatibility testing

- Lack built-in monitoring & alerting

Ambari

Pros:

- Easier to setup from scratch (step-by-step wizard for everything)

- Changing configuration is easy, you change one components, and other components are update accordingly. It also notify you which software in which nodes need to be restarted and allow you to do rolling restart seamlessly.

- Components integrate well with each others, so less time is needed for compatibility testing.

- Back-ported of important patches .

- Built-in monitoring and alerting

Cons:

- Older software version, albeit with back-ported patches

- Cannot upgrade separated components

- Less easy to integrate with your existing monitoring infrastructure.

I also find it easier to train new team members to use an existing Ambari installation than to maintain Ansible playbooks. We are now using Ambari to maintain the more stable parts of our big data stacks (HDFS, YARN, etc), and use Ansible for the part which are still improving rapidly (Kafka, Flink, Presto, etc)

ewencp · on Oct 4, 2018

terraform: 2014 ansible: 2012 openstack: 2010 cloudera: 2008 hortonworks: 2011 emr: 2011

(those were quickly googled, apologies if i got any dates wrong)

timing is important. cloudera had good, early timing and looked promising because of it. you are right that EMR definitely hurt all the other hadoop vendors, though i think people overestimate how comfortable big enterprises are with moving to public cloud. way more comfortable today; back then everything was a lot less certain. cloudera's name is unfortunate given they never got anything cloud-based successful, maybe they'd be in a better place now if they had.

but some of the key technologies you suggest as better options came 4-6 years later. that's 4-6 years of providing value, gaining traction, and building a committed customer base. 4-6 years is a long time, and even with how slow many enterprise projects run, more than enough to get entrenched, build tooling that makes a bunch of stuff easier, build mindshare, etc.

> Have I missed out on some cool stuff?

stuff that makes companies money doesn't always look cool.

rawoke083600 · on Oct 4, 2018

Plain-Old-Bash-Scripts 2018 ~

batoure · on Oct 4, 2018

Both companies make most of their money providing professional services. I was going to touch on EMR but I think the “small team” comment is a better differentiator. When you have a large cluster with a large volume of workloads distributed computing becomes much less forgiving. The algorithm that was so fast and so valueable when you first deployed suddenly becomes a bottle neck for the rest of your operations because not enough time was spent thinking about partition strategies... Horton and Cloudera have made their money giving big companies advice when they get in over their heads. It’s rarely in the beginning but usually several years after companies have made a hard commitment to Hadoop and distributed computing.

francoisprunier · on Oct 3, 2018

Interestingly, none of them mention Hadoop. It makes sense to merge from a business point of view, and on a technical level it could finally mean more work on polishing the gazillion tools they include, which is badly needed.

Thaxll · on Oct 3, 2018

Hadoop very "fancy" 5-6 years ago, what's the trend now? I guess with managed services from AWS / Google it makes Hadoop less useful?

tumanian · on Oct 4, 2018

Its still there - storage is still mainly HDFS, but computation layer morphed into Spark on Yarn (Mesos as a scheduler is barely used). Barely anyone runs classic Mapreduce jobs, Spark all around, written in Scala(or java or python).

threeseed · on Oct 4, 2018

The managed data science services from AWS/Google is Hadoop/Spark.

medh2000 · on Oct 4, 2018

True. Some one in the cloud is managing these Hadoop stacks for you. You just pay usage of these services.

thinkmassive · on Oct 4, 2018

> The two companies are committed to supporting existing offerings from the two companies for at least three years but will work on a "unity release" of software, drawing on technologies from both companies' portfolios, Reilly said.

I’m really intrigued to see whether Ambari or Cloudera Manager wins out in the long run. This is unexpected but interesting.

(I worked at Hortonworks 2014-2016, no current affiliation)

jamesblonde · on Oct 4, 2018

This is, of course, an anti open-source move. So, Cloudera Manager will win, there. The stack will go largely proprietary, as this is a defensive move - AWS are just killing them in the cloud. They need to prevent AWS just repackaging the Hadoop core and reselling it - what better way than by making your new platform AGPL-v3 or something, so you're still open-source, but AWS can't just resell it as EMR.

medh2000 · on Oct 4, 2018

That's a good question. Both CM and Ambari are front end for managing all the services and more.

xab9 · on Oct 3, 2018

That's interesting. Been at their Hungarian office for a job interview (Hortonworks, a year or so ago), it was weird though. Haven't tried Cloudera, but they too have an office at Budapest, I wonder how it will effect their workforce in the region.

gnanesh · on Oct 4, 2018

Looks like MapR has got something to say to people visiting their homepage, "Two wrongs don't make a right... SEE WHY CLOUDERA AND HORTONWORKS CUSTOMERS HAVE MOVED TO MAPR!"

docker_up · on Oct 4, 2018

Lots of overlap between the two companies. I unfortunately expect a lot of layoffs on the Horton side of things, which is disappointing because I know quite a few people there.

batoure · on Oct 4, 2018

I work for neither company but know people at both and think that this is highly inaccurate. Though there will be operational redundancy both companies make most of their revenue from support contracts and professional services since they provide a mostly FOSS product. Services and support are the two most heavily labor intensive parts of the tech industry every customer requires x amount of labor present. Additionally internal engineers paid to contribute to the FOSS products are still needed. Because both companies have been benifitting from the contributions of both teams of engineers for years. Since Horton has no closed source products it has very few redundant engineering resources.

softliving · on Oct 4, 2018

Hmmm, merger slowdown market, employees beware of the trickery! this happens and then the mgmt thinks of redundancy to show markets how efficient they are ... is there a humane company which maybe does not pay much but secure enough and challenges us to rise together ... seems like a distant dream!

SimplyUseless · on Oct 3, 2018

This is not to eliminate threats from new comers but to find an identity in the new era of cloud.

nerpderp83 · on Oct 3, 2018

Yikes!

medh2000 · on Oct 4, 2018

I have both used Cloudera stack and Hortonworks. I would say in my experience that installation and management is much straightforward with HW distribution. Both companies stacks have similar Apache projects and management software like Ambari and Cloudera Manager.

nerpderp83 · on Oct 4, 2018

It doesn't bode well for either of them. This is a pure contraction of the markets they are trying to serve.