Introduction to Apache Mesos (antonlindstrom.com)
145 points by aant on March 29, 2015 | 54 comments



As noted in another comment, we, a growing startup with 5 people working on software out of 15 people total, run a Mesos cluster with a couple of hundred machines. By FAR our largest challenge has been to adapt our thinking to break the "this machine is 'production', that machine is 'staging'" mindset. We have a production compute infrastructure and you're welcome to launch 'production', 'staging' or whatever jobs into it. The friction around "running Mesos" has mostly been the friction of the air from exhalations of joy buffeting our esophagi...

We have a separate, much smaller cluster to test new Mesos (and Chronos and Marathon) versions. But the distinctions between "production", "staging" and "dev" have become much more nuanced, so we've settled on discussing the "application" environment versus the "infrastructure" environment. Much as a startup on AWS wouldn't track the version of the RDS instance itself but would track the version of the database running on it (e.g. 9.3.1), we distinguish the versions of our apps on the production cluster and not the version of the production cluster. One of the team is an ex-Googler and he said that Google did much the same.

The one thing Marathon and Chronos currently lack is a prioritization mechanism, so we're building that as a Chronos task that monitors and scales Marathon tasks down/up by their priority (as represented in their id or tag).
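
A rough sketch of what such a scaler could look like, run periodically as a Chronos task. Everything specific here is an assumption for illustration: the Marathon URL, the convention of encoding a priority in the app id (e.g. /p2/batch-report, smaller number = lower priority), and the pressure check are placeholders, not our actual code:

    # Hypothetical priority-based scaler, run periodically by Chronos.
    import re
    import requests

    MARATHON = "http://marathon.example.com:8080"  # placeholder URL

    def app_priority(app_id):
        # Assumed convention: app ids look like /p2/batch-report.
        match = re.search(r"/p(\d+)/", app_id)
        return int(match.group(1)) if match else 0

    def cluster_is_under_pressure():
        # Placeholder: in practice, inspect Mesos master metrics here.
        return False

    def scale_down_lowest_priority():
        apps = requests.get(MARATHON + "/v2/apps").json()["apps"]
        running = [a for a in apps if a["instances"] > 0]
        if not running:
            return
        victim = min(running, key=lambda a: app_priority(a["id"]))
        requests.put(MARATHON + "/v2/apps" + victim["id"],
                     json={"instances": victim["instances"] - 1})

    if __name__ == "__main__":
        if cluster_is_under_pressure():
            scale_down_lowest_priority()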


> The friction around "running Mesos" has mostly been the friction of the air from exhalations of joy buffeting our esophagi...

Is there some common phrase relating belching and being happy that I haven't heard?


Not that I know of, but it'd be awesome to have a word for joyous belching...


An interesting talk about Docker and Mesos by a former coworker of mine; he also contributed to 0.22:

https://www.youtube.com/watch?v=ZZXtXLvTXAE


Maybe this is a good place to ask a question that I've been pondering for a while: Why is Mesos based on the concept of resource offers? AFAIK this is backwards compared to other schedulers, where you ask for resources and the scheduler gives them to you (or not).


I can't quite tell you why it was implemented like this. The Omega paper by Google describes different kinds of scheduling algorithms and their pros and cons.

I don't think that it's backwards compared to other schedulers (many of them are monolithic) but I think it can be improved by using transactions and optimistic concurrency control. This is discussed in the Omega paper and I think there's work to integrate the shared state scheduler into Mesos.


The most puzzling thing for me is why schedulers and executors are so tightly coupled in Mesos. In my mind, those are separate concerns: schedulers should concern themselves solely with locating an available resource for you (and should be part of the Mesos core), and the executor should take care of launching and managing the program to be run.

With the current design, if I want to launch a Docker container, I have to use its scheduler, even though its scheduler might not be as good as some other resource's scheduler. And I might have to accept a regression in the scheduler in order to benefit from a bugfix in the executor.


The point is to load the framework (application) with the concern of deciding what resources it wants. The scheduler shouldn't care about what you get, just about fairness. Two-tiered scheduling achieves this:

Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them.

http://mesos.berkeley.edu/mesos_tech_report.pdf
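
From a framework's side, tier two looks roughly like this with the Mesos Python bindings (mesos.interface); the resource threshold, the task command, and the amounts are invented for illustration:

    # Sketch of a framework scheduler: Mesos decides what to offer us;
    # we decide what to accept and what to run on it.
    from mesos.interface import Scheduler, mesos_pb2

    class ExampleScheduler(Scheduler):
        def resourceOffers(self, driver, offers):
            for offer in offers:
                cpus = sum(r.scalar.value for r in offer.resources
                           if r.name == "cpus")
                if cpus < 0.5:
                    # Not useful to us: hand it back so Mesos can
                    # offer it to another framework.
                    driver.declineOffer(offer.id)
                    continue
                task = mesos_pb2.TaskInfo()
                task.task_id.value = "task-" + offer.id.value
                task.slave_id.value = offer.slave_id.value
                task.name = "example"
                task.command.value = "echo hello"  # illustrative command
                res = task.resources.add()
                res.name = "cpus"
                res.type = mesos_pb2.Value.SCALAR
                res.scalar.value = 0.5
                driver.launchTasks(offer.id, [task])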


I guess that makes sense if frameworks care about placement (which I generally don't) because otherwise they'd have to express placement constraints to Mesos.

I still can't find documentation about what fairness policies are actually implemented, though.


Mesos uses dominant resource fairness:

https://www.cs.berkeley.edu/~alig/papers/drf.pdf
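
The idea in one toy calculation (cluster totals and allocations invented): each framework's dominant share is its largest fractional share of any single resource, and the allocator offers resources next to the framework whose dominant share is currently smallest:

    # Toy DRF dominant-share calculation; all numbers are made up.
    total = {"cpus": 100.0, "mem": 1000.0}
    allocated = {
        "framework_a": {"cpus": 20.0, "mem": 50.0},   # dominant share 0.20 (cpus)
        "framework_b": {"cpus": 5.0,  "mem": 300.0},  # dominant share 0.30 (mem)
    }

    def dominant_share(alloc):
        return max(alloc[r] / total[r] for r in total)

    # framework_a gets offered resources next, since 0.20 < 0.30.
    next_up = min(allocated, key=lambda f: dominant_share(allocated[f]))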


Allow me to take the opportunity and ask: is anybody here running Elasticsearch with Mesos (and Marathon) in Docker containers?

I'm running Elasticsearch nodes on dedicated Mesos slaves and I'm still not sure how much memory I should allocate to the Elasticsearch task, e.g. all available memory as reported by Mesos, or some smaller amount to leave something for the system? Please note that I'm asking about the memory allocated to the Marathon task, not about the JVM heap size.


I've taken the view of treating our database slaves as special and allocating all the memory on the slave to ES. That way ES (or any other db) always runs on the right slave, and no other potentially long-running/resource-hungry process runs alongside it (leaving just enough that Jenkins tasks and Chronos tasks can still run if need be).

AFAIK, there isn't a right way to run databases yet because the persistent storage layer in Mesos is still being baked.
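
That said, the shape of an answer to the mem question usually looks something like the Marathon app definition below. Every number, the image tag, and the URL are illustrative, and the heap-at-roughly-half sizing is the usual Elasticsearch guidance, not something Mesos enforces:

    # Illustrative dockerized ES app on a 32 GB slave: "mem" is the
    # task's allocation (MB), ES_HEAP_SIZE is the JVM heap inside it,
    # leaving headroom for the OS and small cluster tasks.
    import requests

    es_app = {
        "id": "/elasticsearch",
        "cpus": 4,
        "mem": 28672,                    # whole-task memory, not the heap
        "instances": 1,
        "env": {"ES_HEAP_SIZE": "14g"},  # roughly half the task memory
        "container": {
            "type": "DOCKER",
            "docker": {"image": "elasticsearch:1.5", "network": "HOST"},
        },
    }
    requests.post("http://marathon.example.com:8080/v2/apps", json=es_app)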


> the persistent storage layer in Mesos is still being baked

That's a pretty important point; I would have thought persistent storage would have been one of the first things to get working for a project like Mesos. Also, their homepage ( http://mesos.apache.org/ ) outright states: "Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines [...]". What storage are they talking about if not persistent storage?


This Mesos 0.22 talk goes a bit more into what they are imagining - http://mesosphere.com/2015/03/27/mesos-0-22-0-released/

Essentially, the current Mesos disk quotas just ensure that tasks run with a minimum amount of disk space. I think what they are trying to accomplish in future releases is a way to build EBS-style disk space management on top of Mesos.


Yes, they are calling it dynamic reservations:

https://issues.apache.org/jira/browse/MESOS-2018

With a first draft of the user documentation at:

https://gist.github.com/mpark/e8ee4eb9671bdb252c4f

It will be really slick once this all makes it into Mesos 0.23.


Question/thought: it seems like Mesos could evolve to be an alternative to OpenStack, especially if a tenant layer is developed. Yes/no?


Probably. And OpenShift 3.0 (also on the front page today) is a PaaS (Cloud Foundry competitor) built on Mesos.


OpenShift is most definitely not built on Mesos. RedHat's plan over time is to integrate with Mesos at a lower level for fine-grained resource allocation, but for now Kubernetes is used as the cluster scheduling and orchestration runtime.


Okay, maybe I'm getting old, but I can't really see the point of Mesos. It feels like it's adding too much magic to the mix, and I can't see myself using it to deploy web app servers/databases. Is there a different intended purpose that I'm missing? I suspect it's great for deploying task workers, for example.


You can use it for web apps/services (https://mesosphere.com/docs/getting-started/service-discover...). But I'm with you on databases. I guess I just like knowing that those hosts are being managed and cared for (pets vs. cattle).

With that said, I often wonder how many people are using Mesos/Marathon before they have any need for it? Using it for 4 hosts vs. 40 or 400?


There's nothing wrong with Mesos on 4 hosts, though...

Anything more than a single host requires additional co-ordination - picking Mesos is mostly no different from rolling alternate methods for doing so.

It's not the choice with the least overhead, but it's a possibility for those for whom it has the right conveniences...


Yeah, I have seen Mesos used for managing big data frameworks like Spark. Any large distributed computing framework has to deal with some of the same problems - managing resources on worker nodes, sending tasks, getting results, scheduling, etc. - and Mesos/YARN provide a common base here, better than each framework duplicating the same code and solving the same problems.


That makes sense, thanks. The workers are pretty independent from each other, too, so this setup would be more fitting for that.


I've looked at Mesos briefly, but it seems completely JVM-based, and that if you're not prepared to build your whole stack around the current Java stack (Hadoop, Spark, Storm, Kafka, Zookeeper, etc.), Mesos has zero utility. Is this accurate?

As someone working on scaling microservices, I keep being disappointed by potentially useful services that turn out to require a JVM-based language such as Java or Scala. For example, Kafka looks very decent, but the high-level client is written in Java; if you're not on the JVM, you're stuck implementing a lot of the client yourself. As far as I know (from the last time I looked at this stuff), the Zookeeper client is similar, whereas Spark and Storm both require that you write processing code on the JVM, and libhdfs is apparently still JNI-based, not native.

For someone using Docker, is there anything competing with Mesos that isn't wedded to Javaland?


> [Not using Java -> ] Mesos has zero utility. Is this accurate?

Not at all. My company has a cluster with a few hundred Slaves and we're mostly a Python shop, with some C++ for machine vision.

Mesos certainly has its issues (as does everything), but it's awfully nice for micro-services: if you can package your service into a Docker container, then you can launch it into the cluster and Chronos/Marathon/Mesos will take care of making sure that it's run/running.

I missed the deadline for the Mesos conference (I was unaware of the deadline), but I'm trying to squeak in a talk about "Using Mesos at [small] Scale" because we're a small company and Mesos has allowed us to do a bunch of big company stuff.

>is there anything competing with Mesos that isn't wedded to Javaland?

Yes, I can look at the source code and see that it uses the JVM (Chronos uses Scala), but, AFAICT, Mesos isn't "wedded" to anything. All of the components are API-driven. I apt-get install it, I run it, I send jobs to it, it works and it behaves well. Better yet, I can poke at the APIs of any of the services to find out what is happening. So we use Marathon for service discovery and run Chronos, a framework, under Marathon. That makes finding Chronos, which could be on any one of 200 machines, quick and easy.
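
For example, finding Chronos through Marathon is a single API call (the Marathon URL and the /chronos app id here are assumptions for illustration, not our actual setup):

    # Ask Marathon where it is currently running the Chronos app.
    import requests

    r = requests.get("http://marathon.example.com:8080/v2/apps/chronos/tasks")
    for task in r.json()["tasks"]:
        print(task["host"], task["ports"])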


Thanks for the detailed reply!

I have to say I'm wary of big frameworks like this that insert themselves as a kind of monolithic control structure for everything.

My ideal setup is always one where I pick and mix the best modules for the job, and where I can write some interface glue to let my apps slot into the system, as opposed to writing my apps for a specific API (as tends to be the case with, say, Hadoop, and which of course would tie the whole platform to that API, making it hard to migrate to something else).

Sounds like Mesos is pretty modular and open in that respect?


Once you get to know it, Mesos is really quite simple: slaves emit events about their capabilities and running processes; masters collect and distribute an inventory of slaves and their processes and of events that occur on slaves. The frameworks (i.e. Marathon for long-running jobs; Chronos for scheduled jobs) then listen to those inventories/events and ask the Mesos master to add and delete processes from slaves. So Mesos is quite modular.


> For someone using Docker, is there anything competing with Mesos that isn't wedded to Javaland?

Lattice[0].

It's built from a few components extracted out of Cloud Foundry, all written in Go. In particular it includes Diego, an orchestrator. It can be run locally, on AWS, Google and DigitalOcean.

In terms of the key architectural difference, Onsi Fakhouri explained it to me this way: Mesos is supply-driven, Diego is demand-driven. Mesos keeps a list of pending workloads and waits for a report of available resources that fits. Diego instead receives a request for resources and then stages an auction amongst available cells.

Disclaimer: I have worked on Cloud Foundry (on the Buildpacks team here in NYC) and I am employed by Pivotal; we do the lion's share of the work on Cloud Foundry.

[0] http://lattice.cf/


Mesos itself is written in C++. Building Mesos will generate Java and Python bindings.


There are frameworks that run on Mesos capable of running Docker containers. Check out Singularity, for example.


AFAIK, the Docker capability is largely due to Mesos. Frameworks need the ability to specify a container type so that they can request that a task run in a Docker container. But actually running the Docker container is the responsibility of Mesos. So Marathon and Chronos, both frameworks, are able to specify a job which uses a Docker container, while Mesos is able to run the job which uses a Docker container.
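
At the API level, that split looks roughly like this with the Python bindings (the image and names are illustrative): the framework only fills in a ContainerInfo on the task, and the slave's Docker containerizer does the actual running:

    # Framework side: declare the container; Mesos runs it.
    from mesos.interface import mesos_pb2

    task = mesos_pb2.TaskInfo()
    task.name = "dockerized-task"
    task.task_id.value = "task-1"
    # slave_id and resources would be filled in from an offer, as usual.
    task.command.value = "redis-server"
    task.container.type = mesos_pb2.ContainerInfo.DOCKER
    task.container.docker.image = "redis:2.8"  # illustrative image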


Mesos is written in C++ FYI.


Mesos itself is, but it seems every framework out there is written in Java or Scala: Chronos, Marathon, Aurora, Spark, Singularity etc. Also, it relies on Zookeeper, which is of course Java.


If you don't want to manage ZooKeeper manually to bootstrap a Mesos cluster, you need another Mesos cluster, haha.


Or just use Exhibitor. I don't remember the last time I manually anything'd with Zookeeper.

Between Exhibitor and Curator, I honestly find Zookeeper so straightforward and easy to work with that I don't quite understand the popularity of etcd.


I'm running a small dockerized Exhibitor/Zookeeper cluster (5 nodes, several hundred clients), and it's extremely straightforward. Etcd isn't what we'd consider production-grade yet.


Having worked with etcd in production for the last few months, I have to agree. The CoreOS stack needs some more time to marinate.


Thanks for this comment. Glad to know I made the right choice.

I'm not saying etcd won't ever overshadow Zookeeper, it probably will with the momentum behind it, but as an ops guy, I wasn't willing to bet production application service discovery on it.


My distaste for the Go community is pretty well-established in these parts; I think worse-is-better is screwing us all, and etcd seems to me to be the worse-is-better Zookeeper. And for things that don't matter, sure, worse-is-better your life away; a Rails app can be whatever you want, but the infrastructure I manage had better be bulletproof. I won't say etcd will never be competitive, but without some significant changes, I don't see it getting my vote--and those changes are largely around the parts of the feature set that etcd doesn't support, at which point...why use it, anyway?


What particular issues have you run into with etcd and/or CoreOS?


Lots of split brains. Serious bugs making it through the alpha and beta channels into stable (and our boxes auto-updating only to become useless). Fleet units dying purely due to problems with fleetd/systemd. A particularly painful one was an Akka deployment on top of CoreOS where a sidekick unit would fail to start because fleet hadn't actually copied the unit file to the remote host. Only happened with sidekicks but due to how we ran our networking, it effectively killed the application. Almost every redeploy required manually getting fleet to copy the unit over.


Just to add on: I've had fleet misreport unit status and btrfs reporting lack of disk space for no apparent reason. Also the inability to restart individual failed units which are part of a global unit.

Also there was that one time they changed how cloud-config was parsed, so if "#cloud-config" wasn't on the very first line without preceding spaces, initialisation would fail. That was when I switched the reboot strategy to manual.


Btrfs is no longer the default for CoreOS for this reason. Overlayfs doesn't have this issue.


Oh man, yes. I'd blocked out all my scarring memories of btrfs biting me in the ass.


Matches up pretty well with my experience, too. I do not trust fleet as far as I can throw it.


Yeah, the whole project was something of a disaster. Eventually things stabilized a bit but every few weeks etcd or fleetd would throw a curveball and I'd lose a day of time chasing down the problem.


> Between Exhibitor and Curator, I honestly find Zookeeper so straightforward and easy to work with that I don't quite understand the popularity of etcd.

Tools like etcd and consul fit into the Unix philosophy of small, composable tools. Zookeeper is more a part of the Enterprise Java philosophy which many people have written off for various reasons, both rational and irrational.

Having run Zookeeper in the past and now having run Consul in production for the past ~6 months, I can't imagine ever running Zookeeper again, unless I'm using a tool that's built on top of it. Consul is just easier to use/maintain and we've yet to run into any problem with it. Zero problems in six months. In all the time I ran Zookeeper, I could never say that.


Wait a tick, how does consul fit into the "Unix philosophy" (which is asinine, bullshit, and wrongheaded, in whatever order you please, but set that aside) but Zookeeper doesn't? Your description of running Consul problem-free but being stricken with issues with Zookeeper is foreign to me, but okay, anecdotes, but this is bonkers, man. Consul being a DNS server and a K/V store and a health checker is so not-Unix-philosophy it should hurt you physically to say that.

I mean, Consul is fine for what it does. I've used both, whatever. But if you're going to have something that does a bunch of things, I'd much rather have the one that supports the primitives to do what somebody needs, rather than trying to do it all itself.

(Personally, after trying to work with Terraform, I don't much trust Hashicorp's attempts to write code I have to rely on to work correctly and never ever break. YMMV, of course.)


> how does consul fit into the "Unix philosophy"

- Single binary executable

- Compatible with ps (Zookeeper has the traditional java problem of showing up as .../bin/java followed by 4 lines of classpath)

- Arguments to consul don't need to be prefaced with -D (another common java problem)

- Passing -h to consul actually helps you figure out how to run it.

Oh, and the download is 1/3 the size of Zookeeper, and the executable includes the Go runtime whereas Zookeeper's java runtime is separate.

Etcd is actually much closer to the Unix philosophy. Consul seems to go more in the direction of similar Go tools like Docker, where it bundles related activities together into one executable. But, then again, parts of the Unix ecosystem do this too (openssl, for one).


You could run it on top of something like CoreOS, but then you're just swapping Zookeeper for etcd.


I'm curious which of the Marathon bugs caused your downtime?


I'm afraid I don't remember exactly which one it was but after an upgrade of the framework it worked perfectly again.


The 0.7X releases contained a number of bugs. 0.8X has been significantly better!


Don't poll the apps API if you have a decent number of tasks or it can bring the whole system to its knees.
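
One alternative to polling at the time: Marathon's HTTP callback event bus, which pushes task events to you instead (requires starting Marathon with --event_subscriber http_callback; the URLs here are placeholders):

    # Register a callback so Marathon pushes events rather than being polled.
    import requests

    requests.post(
        "http://marathon.example.com:8080/v2/eventSubscriptions",
        params={"callbackUrl": "http://listener.example.com:9000/events"},
    )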



