Lmctfy: Google's Linux application container stack (github.com/google)
298 points by friism on Oct 3, 2013 | 74 comments



There is a great Wired article [1], which outlines how Google uses orchestration software to manage Linux containers in its day-to-day operations. Midway through the article there is this great diagram, which shows Omega (Google's orchestration software) and how it deploys containers for images, search, Gmail, etc. onto the same physical hardware. There is an amazing talk by John Wilkes (Google Cluster Management, Mountain View) about Omega at Google Faculty Summit 2011 [2]; I would highly recommend watching it!

By the way, one of the key concepts behind containers is control groups (cgroups) [3, 4], which were initially added to the kernel back in 2007 by two Google engineers, so they have definitely given back in this area. I know all this because I have spent the last two weeks researching control groups for an upcoming screencast.

I am happy Google released this, and cannot wait to dig through it!

[1] http://www.wired.com/wiredenterprise/2013/03/google-borg-twi...

[2] http://www.youtube.com/watch?v=0ZFMlO98Jkc

[3] http://en.wikipedia.org/wiki/Cgroups

[4] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt


I just spent the last hour or so digging through the code and playing with the CLI. It's pretty neat. The "specification" for creating new containers (how to describe limits, namespacing, etc.) is not very well documented, so it takes some trial and error... But this feels like a nice, clean low-level component.

It can be used as a C++ library, too - I'm going to evaluate it as a possible low-level execution backend for docker :)


In response to the other comments mentioning docker - this definitely does not compete with docker. A better comparison would be with the lxc tools or maybe something like libvirt-lxc or systemd's nspawn.


Quick summary of my early explorations:

* Building is relatively straightforward on an Ubuntu system. You'll need to install re2 from source, but that's about it.

* No configuration necessary to start playing. lmctfy just straight up mounts cgroups and starts playing in there.

* Containers can be nested, which is nice.

* I really couldn't figure out useful values for the container spec. Even the source doesn't seem to have a single reference - it's all dynamically registered by various subsystems in a rather opaque way. I opened a few issues to ask for more details.

* This is a really low-level tool. Other than manipulating cgroups it doesn't seem to do much, which is perfect for my particular use case (docker integration). I couldn't figure out how to set namespaces (including mnt and net namespaces, which means all my containers shared the host's filesystem and network interfaces). I don't know if that functionality is already in the code, or has yet to be added.

* Given the fairly small footprint, limited feature set, and clean build experience, this really looks like an interesting option to use as a backend for docker. I like that it carries very few dependencies. Let's see what the verdict is on these missing features.


> useful values for the container spec

Are you referring to the container spec in the proto file? https://github.com/google/lmctfy/blob/master/include/lmctfy.... Which attributes are you having trouble setting a useful value for?


According to the docs all that can be set is cpu and memory limits. So maybe that's the extent of it for now, even though the proto identifies more.
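For reference, a spec along those lines can be built and parsed roughly like this. Treat it as a sketch only: the generated header name (lmctfy.pb.h) and the containers::lmctfy package are my assumptions from skimming the repo, and the limit units are my guess from the docs.

  // Sketch only: parse a text-format spec with just CPU and memory limits.
  #include <iostream>
  #include <string>
  #include <google/protobuf/text_format.h>
  #include "lmctfy.pb.h"  // assumed name of the generated proto header

  int main() {
    containers::lmctfy::ContainerSpec spec;  // package name is an assumption
    const std::string text =
        "cpu { limit: 100 } "            // CPU limit
        "memory { limit: 1000000000 }";  // memory limit, presumably in bytes
    if (!google::protobuf::TextFormat::ParseFromString(text, &spec)) {
      std::cerr << "failed to parse spec" << std::endl;
      return 1;
    }
    // The same text string is what you'd hand to the CLI when creating
    // a container.
    return 0;
  }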


Thank you! I was scanning .h files for a type declaration.. Silly old-school me :)


I only knew because I've worked with the internal analogue of this library recently. Glad I could help.


Most apps running on Google servers are aware that they're running in a shared environment, so they don't need the overhead of virtualized network interfaces. So I doubt that there will be any specific support for network namespaces.

And you can approximate mount namespaces with chroots and bind mounts. (In some ways that's better, since it's a bit easier for a process outside the container to interact with the container's filesystem).
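Roughly speaking, the approximation looks like this (a sketch only; the paths are made up, it needs root, and error handling is trimmed):

  // Approximate a mount namespace with a bind mount plus a chroot.
  #include <sys/mount.h>
  #include <unistd.h>
  #include <stdio.h>

  int main() {
    // Make some shared data visible inside the job's private root.
    if (mount("/exports/dataset", "/containers/job42/data", nullptr,
              MS_BIND, nullptr) != 0) {
      perror("mount");
      return 1;
    }
    // Confine the job to its own root. A process outside the container can
    // still reach /containers/job42 to inspect or update the job's files.
    if (chroot("/containers/job42") != 0 || chdir("/") != 0) {
      perror("chroot");
      return 1;
    }
    // ...exec the job's binary here.
    return 0;
  }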


We (lmctfy team) are in the middle of designing and adding namespace support. It will start trickling in soon.


Damn. This means it's much less useful to me (and 99% of applications outside of google). I guess I could combine lmctfy with a namespacing library of my own. But that's more extra work than I was anticipating.


Namespaces will be coming, but we're not there yet. This captures some of what we are already doing internally, but not all of it, yet.


Perhaps they'd be open to a collaboration where you add that functionality and then you use it to make docker (even more!) awesome.


The namespacing part is much simpler, if you have specific use cases.


Google is probably holding back the piece that competes with Docker or OpenVZ; they presumably have something like Docker internally.


I'm sure they do - but again google is probably holding a proprietary alternative to virtually every piece of software in the world :) Most of it is too tied to "the google way" to be useful to anybody else.


This is more true than the idea that there's a competitive advantage. We would love to open-source more of our stack (and are working towards it), but it's all very tied to the rest of our cluster environment. Piece by piece.


As an ex-Googler I agree. Much of it just wouldn't be useful as-is to most people, as it's an environment that few match in load, resources, and diversity of services. Plus it's all integrated tightly across the board.


Am I the only one who feels like cgroups are extraordinarily complex for the problem they're trying to solve? It seems like a simpler structure could have achieved most of the same goals and not required one or two layers of abstraction (in Docker's case, cgroups -> lxc -> docker) to find widespread use.

In particular, was the ability to migrate a process or have a process in two cgroups really essential to containerization? It seems like without those it'd be a simple matter of nice/setuidgid-style privilege de-escalation commands to get the same kinds of behaviour without adding a whole 'nother resource management scheme to the mix (the named groups).

The cgroups document you link to as [4] has such a weirdly contrived use case example it makes me think they were trying really hard to come up with a way to justify the complexity they baked into the idea.


(Original cgroups developer here, although I've since moved on from Google and don't have time to play an active role anymore.)

It's true that cgroups are a complex system, but they were developed to solve a complex group of problems (packing large numbers of dynamic jobs on servers, with some resources isolated, and some shared between different jobs). I think that pretty much all the features of cgroups come either from real requirements, or from constraints due to the evolution of cgroups from cpusets.

Back when cgroups was being developed, cpusets had fairly recently been accepted into the kernel, and it had a basic process grouping API that was pretty much what cgroups needed. It was much easier politically to get people to accept an evolution of cpusets into cgroups (in a backward-compatible way) than to introduce an entirely new API. With hindsight, this was a mistake, and we should have pushed for a new (binary, non-VFS) API, as having to fit everything into the metaphor of a filesystem (and deal with all the VFS logic) definitely got in the way at times.

If you want to be able to manage/tweak/control the resources allocated to a group after you've created the group, then you need some way of naming that group, whether it be via a filesystem directory or some kind of numerical identifier (like a pid). So I don't think a realistic resource management system can avoid that.

The most common pattern of the need for a process being in multiple cgroups is that of a data-loader/data-server job pair. The data-loader is responsible for periodically loading/maintaining some data set from across the network into (shared) memory, and the data-server is responsible for low-latency serving of queries based on that data. So they both need to be in the same cgroup for memory purposes, since they're sharing the memory occupied by the loaded data. But the CPU requirements of the two are very different - the data-loader is very much a background/batch task, and shouldn't be able to steal CPU from either the data-server or from any other latency-sensitive job on the same machine. So for CPU purposes, they need to be in separate cgroups. That (and other more complex scenarios) is what drives the requirement for multiple independent hierarchies of cgroups.

Since the data-loader and data-server can be stopped/updated/started independently, you need to be able to launch a new process into an existing cgroup. It's true that the need to be able to move a process into a different cgroup would be much reduced if there was an extension to clone() to allow you to create a child directly in a different set of cgroups, but cpusets already provided the movement feature, and extending clone in an intrusive way like that would have raised a lot of resistance, I think.
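To make that concrete: with each resource mounted as its own hierarchy, attachment is just writing a pid into the relevant tasks files. A rough sketch (the /sys/fs/cgroup mount points and group names are illustrative, not anything lmctfy-specific):

  #include <stdio.h>
  #include <unistd.h>

  // Append a pid to a cgroup's tasks file (cgroup v1 style).
  static int attach(const char *tasks_path, pid_t pid) {
    FILE *f = fopen(tasks_path, "w");
    if (!f) return -1;
    fprintf(f, "%d\n", (int)pid);
    return fclose(f);
  }

  int main() {
    pid_t loader = getpid();  // pretend this process is the data-loader
    // Shares the memory cgroup with the data-server...
    attach("/sys/fs/cgroup/memory/dataset/tasks", loader);
    // ...but sits in a low-priority CPU cgroup of its own.
    attach("/sys/fs/cgroup/cpu/dataset_loader/tasks", loader);
    return 0;
  }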


Cool, thanks for the details.


The good news is that namespaces (the most interesting part of containers) are simpler than cgroups, and the API is stable.

cgroups are indeed a mess. The API is highly unstable and there is an effort underway to sanitize it, with the help of a "facade" userland API. In other words kernel devs are basically saying: "use this userland API while we fix our shit". (I don't claim to understand the intricacies of this problem. All I know is that, as a developer of docker, it is better for my sanity to maintain an indirection between my tool and the kernel - until things settle down, at least).
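For what it's worth, the namespace side really is a couple of syscalls. A minimal sketch (needs CAP_SYS_ADMIN, and older kernels don't support all of these flags):

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif
  #include <sched.h>
  #include <stdio.h>

  int main() {
    // Give the current process its own mount, hostname (UTS) and network
    // namespaces.
    if (unshare(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET) != 0) {
      perror("unshare");
      return 1;
    }
    // From here on, mounts, the hostname and network interfaces are private
    // to this process and its future children.
    return 0;
  }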


cgroups are extremely powerful, but they are fairly complex; it took me some hands-on experience to wrap my mind around them. Red Hat has done a great job on the integration side. You can watch a demo @ http://www.youtube.com/watch?v=KX5QV4LId_c


There is also warden, but it doesn't get much press.

https://github.com/cloudfoundry/warden/tree/master/warden


Yes, Cloud Foundry has been using Warden for PaaS isolation between hosted apps for a while. It was originally authored by Redis and Cloud Foundry contributor Pieter Noordhuis, currently working for VMware [1]. The ongoing work has been continued by the Cloud Foundry team at Pivotal.

Warden has a C server core [2] wrapping cgroups and other features currently on the lmctfy roadmap, like network and file system isolation [3]. The current file system isolation uses either aufs or overlayfs, depending on the distro/Linux version you are using [4]. The networking uses namespaces and additional features.

Warden also has early/experimental support for CentOS in addition to Ubuntu, although some of the capabilities are degraded. For example, disk isolation uses a less efficient, but still workable, copy-the-file-system approach.

The client orchestration of Warden is currently written in Ruby, but there is also a branch started to move that to Go [5] that has not yet been hardened and merged into master.

Recently Cloud Foundry started using bosh-lite [2], leveraging Warden to do full dev environments using Linux containers instead of separate Linux hosts on many virtual machines from an IaaS provider, which has dramatically reduced the resources and time required to create, develop, and use the full system.

[1] https://twitter.com/pnoordhuis

[2] https://github.com/cloudfoundry/warden/tree/master/warden/sr...

[3] https://github.com/cloudfoundry/warden/blob/master/warden/RE...

[4] https://github.com/cloudfoundry/warden/blob/master/warden/RE...

[5] https://github.com/cloudfoundry/warden/tree/go


Just a follow up to my comment. I have released the "Introduction to Linux Control Groups (cgroups)" screencast which I talked about in my previous comment. View it @ http://sysadmincasts.com/episodes/14-introduction-to-linux-c...


You may want to check out Mesos: http://mesos.apache.org/


My understanding (from reading various articles) is that Mesos is very analogous to Google's original scheduler, Borg.


I assume the name is a reference to http://lmgtfy.com ?


haha that's true! let me contain that for ya!


Speaking of references, in a previous life working in neuro, I encountered the drug Lamictal a lot. As a result, I can only 'hear' the name as "lamictify" instead of their version "lem-kut-fee".


Can someone explain why you would use this instead of LXC? Is it just that Google built this before LXC existed, or are there differences in what it's useful for or capable of?


One of the pieces of code that has a more general purpose:

https://github.com/google/lmctfy/blob/master/util/task/codes...

One thing I really like about working with Google software is that you can count on the same namespace of error codes being used pretty much everywhere. Generally speaking, the software I write and work with can't differentiate between errors finer than these. The machine readable errors are the ones you can respond to differently, and you put detailed messages in the status message. This is how it should be, according to this semi-humble engineer!
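To illustrate the pattern (with a hypothetical Status type of my own, not lmctfy's actual util/task API): callers branch on the coarse, machine-readable code and surface the free-form message to humans.

  #include <iostream>
  #include <string>

  // A few codes mirroring the canonical set in codes.proto; the Status type
  // itself is just an illustration, not the real util/task implementation.
  enum class Code { OK, NOT_FOUND, INVALID_ARGUMENT, UNAVAILABLE };

  struct Status {
    Code code = Code::OK;
    std::string message;  // detailed and human-readable, not for branching
  };

  void Handle(const Status& s) {
    switch (s.code) {
      case Code::OK:          return;
      case Code::UNAVAILABLE: /* retry later */ break;
      case Code::NOT_FOUND:   /* maybe create the resource */ break;
      default:                /* give up */ break;
    }
    std::cerr << s.message << std::endl;  // the details are for humans/logs
  }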


I wouldn't get super excited about this... there's very little that's new here (it's almost like a LMRTFY: let me reimplement that for you!). As a layer on top of kernel functionality, this code really seems very thin, basically doing the same thing but far less of it than the existing LXC userspace tools. It is targeted at process CPU and memory isolation, rather than entire system image virtualization.

Key quotes:

(1) Currently only provides robust CPU and memory isolation

(2) In our roadmap... Disk IO Isolation ... Network Isolation ... Support for Namespaces ... Support for Root File Systems ... Disk Images ... Support for Pause/Resume ... Checkpoint Restore


FWIW, we have a lot of these things done internally, but not in a releasable form yet.


Could you perhaps explain what the overall motivations were for not using LXC userspace tools and instead creating an alternative?

Watching the Google Omega talk video linked in the comment above, I am guessing your implementation probably mostly exists to instantiate Google Omega cluster cell-manager specified jobs including intra-google properties of resource shape, constraints and preferences in a local container. I am guessing that part is not released because the current code has too much to do with your internal standards for expressing the above job-related metadata.


LXC didn't really exist in a usable form when Google started work on kernel-based resource isolation.


Hrrm OK... when was that? 2009 or earlier I suppose. I was using it in 2010 with few hiccups.


We started building a large-scale datacenter automation system in 2003, and by late 2005 it was deployed on most machines but it became apparent that achieving the high density of job packing we wanted was going to be impossible by relying on post-hoc user-space enforcement of resource allocation (killing jobs that used too much memory, nicing jobs that used too much CPU, etc). Sensitive services like websearch were insisting on their own dedicated machines or even entire dedicated clusters, due to the performance penalties of sharing a machine with a careless memory/CPU hog. We clearly needed some kind of kernel support, but back then it didn't really exist - there were several competing proposals for a resource control system like cgroups but none of them made much progress.

One that did get in was cpusets, and on the suggestion of akpm (who had recently joined Google) we started experimenting with using cpusets for very crude CPU and memory control. Assigning dedicated CPUs to a job was pretty easy via cpusets. Memory was trickier - by using a feature originally intended for testing NUMA on non-NUMA systems, we broke memory up into many "fake" NUMA nodes, and dynamically assigned them to jobs on the machine based on their memory demands and importance. This started making it into production in late 2006 (I think), around the same time that we were working on evolving cpusets into cgroups to support new resource controls.


Interesting history, makes sense. (I thought your name was familiar: you are the author of the cgroups.txt kernel documentation! Do you still get to work on this stuff much? What is your take on the apparent popularization of container-based virt? What are the kernel features you would like to see in the area that do not yet exist?)

Was there a reason you guys didn't open source this many years ago?


I left the cluster management group over three years ago, so I've not had much chance to work on / think about containers since then.

This code grew symbiotically with Google's kernel patches (big chunks of which were open-sourced into cgroups) and the user-space stack (which was tightly coupled with Google's cluster requirements). So open-sourcing it wouldn't necessarily have been useful for anyone. It looks like someone's done a lot of work to make this more generically-applicable before releasing it.


What menage said is true. As it came time to clean up what we were using internally it became a decision of adapting the code and semantics we already have to be cleaner, or starting anew with something like LXC (which does more than we need in some ways and less in others).

We have a lot of really well tested and battle-hardened code and behavior, so we chose to keep that.

lmctfy is designed from the ground up as automation which can be used by humans, and never the other way around.


p.s. Hello menage! Nice to see you commenting here :)


Great to see some of this code finally making it out into the world, although it seems to have changed a bit over the last few years. (I still recognize a few low-level functions.)

Who came up with the name?


The name was a collaborative effort. We suck at names :)

And yeah, it's been retooled a lot to extract it more cleanly from the surrounding code.


People are comparing this to Docker, but lmctfy seems to be much more limited -- it doesn't even try to isolate the filesystem, for instance.

In fact, based on the documentation, I don't see how this is anything different from the "cgroup-bin" scripts that have shipped with Ubuntu for years: http://linuxaria.com/article/introduction-to-cgroups-the-lin...


Decent description in the readme, but could not find the part that explains "what does this buy me?"

Finally, it came down to:

"This gives the applications the impression of running exclusively on a machine."

OK but as an outsider that still doesn't tell me what it buys me (or what it buys you, or Google).

(By outsider, I mean I have reasonable ability to administer my own Linux system, but wouldn't trust myself to do so in a production environment... so I'm not up on the latest practices in system administration or especially Google-scale system administration.)

Setting aside whether I need it (I'm pretty sure I don't, so no need to tell me that) I'm really curious what this is good for. Can someone explain it in more lay person's terms? Sounds like applications can still stomp on each others files, and consume memory that then takes away from what's available for other applications, so what is the benefit?

I'm not questioning that there's a benefit, just wondering what it is, and how this is used.


cgroups (and lmctfy) do support limiting memory usage on a per-application basis, as well as a bunch of other resources (ability to run on particular CPUs, access to certain network ports, disk I/O, etc.).

You can also prevent applications from stomping on each others' files, with a combination of permissions, chroots and mount namespaces.

This is basically a low-level API for a controller daemon. The daemon knows (via a centralized cluster scheduler) what jobs need to be running on the machine, and how much of each kind of resource they're guaranteed and/or are limited to. lmctfy translates those requirements into kernel calls to set up cgroups that implement the required resource limits/guarantees.

While you could use it for hand administration of a box, or even config-file-based administration of a box, you probably wouldn't want to (lxc may well be more appropriate for that).
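Concretely, per job that translation amounts to little more than this (a sketch against the cgroup v1 filesystem API; the mount points, group names and limits are illustrative, not anything lmctfy-specific):

  #include <sys/stat.h>
  #include <sys/types.h>
  #include <stdio.h>
  #include <unistd.h>

  // Write a single value into a cgroup control file.
  static void write_value(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (f) { fputs(value, f); fclose(f); }
  }

  int main() {
    // One group per job in each hierarchy the scheduler cares about.
    mkdir("/sys/fs/cgroup/memory/job42", 0755);
    mkdir("/sys/fs/cgroup/cpu/job42", 0755);

    // Limits/guarantees handed down by the cluster scheduler.
    write_value("/sys/fs/cgroup/memory/job42/memory.limit_in_bytes",
                "1073741824");                                  // 1 GiB cap
    write_value("/sys/fs/cgroup/cpu/job42/cpu.shares", "512");  // relative weight

    // Finally, place the job's processes into the group.
    char pid[32];
    snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
    write_value("/sys/fs/cgroup/memory/job42/tasks", pid);
    write_value("/sys/fs/cgroup/cpu/job42/tasks", pid);
    return 0;
  }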


You can run many different types of applications on the same host without the risk of one of them utilizing all of the system resources and leaving none for the others.


Great, some competition for Docker! Which is good news for everyone (esp. Docker). Containers are the future of hosting web apps, in my opinion. The more implementations there are, the better.


It's not competition for docker, more a complement to it.

edit: see shykes reply: https://news.ycombinator.com/item?id=6487080


It's positioned somewhat to compete with Docker, but in no way to complement it. In fact, it is recommended not to run lmctfy alongside LXC (which Docker uses at its core).


What about linking to the lmctfy .so instead of shelling out to LXC, as Docker currently does? There's a lot of promise in such an approach, should lmctfy solidify further.


Hacked together a Vagrantfile if anyone wants to try it out:

https://github.com/silas/vagrant-lmctfy


After hearing about this and Docker, I'm interested in learning more about containers from a high level point of view. What are they used for, beyond very slim VMs? What direction is the technology heading?


Hey Tom, here's a high-level overview of containers in general (and docker in particular): http://www.docker.io/learn_more


Consider it very fine-grained resource control: You can restrict exactly what memory, cpu, process visibility, network access etc. a process, or group of processes can have.


So you have a base set necessary for an OS of some sort, and then pick and choose whatever you want?


- Off Topic -

Is it just me, or do you guys think it's strange that Google is using GitHub instead of its own code hosting infrastructure?


I don't think it's strange that Google lets teams use their preferred code hosting/versioning system.


that, and GitHub's offering isn't going to be spring-cleaned anytime soon.


> it is not recommended to run lmctfy alongside LXC or another container system

I guess this makes it a choice between this and Docker.io, unless it becomes a docker backend.


It should have a better name. More pronounceable, and more resistant to typos :)

Other than that, why open source it now? Is it a race against Docker and LXC? Or is it simply Google giving back to FLOSS?


We've wanted to open source this for some time, but it takes some time to make it less Google-specific. Things just lined up recently and we were able to finally release it.


Someone (probably me, eventually) needs to write a libcontainer and a libresource that can be used on BSD / Linux without the LXC mess


Do you have experience with BSD RACCT?

https://wiki.freebsd.org/Hierarchical_Resource_Limits

If so I'm curious how they compare with cgroups.

I agree a platform-independent API would be very useful, but I wonder how close the semantics are.

I think a process-isolation model, possibly with Capsicum, is more interesting than the LXC-like "VM model" (which does seem messy to me). I don't need an init process and fake hardware inside the container. Just Unix process tree isolation.

For example, I think BSD jails have the option to use host networking, which in Linux is analogous to not using network namespaces.


Like this? http://libvirt.org/

It seems the FreeBSD port originally failed due to deficiencies in kernel features[1] but is now available[2] at least partially, possibly only to manage workloads on remote systems (typically Linux).

[1] http://forums.freebsd.org/showthread.php?p=100894

[2] http://svnweb.freebsd.org/ports/head/devel/libvirt/


No, libvirt is a disgusting mess; look at the internals sometime.


Interesting that they are using Github and not Google Code. Wonder if the latter is next on the Google kill list?


Others have commented that this is a very thin API on top of the already existing LXC system. This worries me - maybe this is a move from Google to be able to switch away from Linux (their last GPL component in Android).


From README:

> lmctfy was originally designed and implemented around a custom kernel with a set of patches on top of a vanilla Linux kernel.

No sign of said patches though. Anyone know if they're available?


We'll put a kernel image on the site along with the set of Google-specific patches. We're in the process of cleaning them up to work with some of the upstream kernels.


Those are coming too, they just take a bit longer to release. Most are things we've shared before though.


Interesting that this is C++ (vs Docker in Go).



