Putting GPUs to work with Kubernetes

Seanny123 · on April 19, 2017

I keep seeing Kubernetes appear on HackerNews. Is there a quick thing I can read to explain why everyone's so excited about? I know it's container orchestration, but I'm not sure what people are using it for or what pain point it is revealing.

moondev · on April 19, 2017

https://vishh.github.io/docs/concepts/overview/what-is-kuber...

Seanny123 · on April 19, 2017

If you'd like to engage with me further, how does a company know it needs Kubernetes? If I'm Soylent and I'm processing a few orders a minute, I'm probably safe with a few redundant monoliths. Do I have to be Uber? What's the middle-ground between Soylent and Uber that would still need this?

Is the answer the same as the question "who needs a microservice architecture"?

samnco · on April 19, 2017

There are a few killer features that you would benefit at any size and that I really love

* self healing: when you create a deployment/replica set. it will be maintained at all cost, so if the app has a memory leak or anything goes wrong, it will be contained and kept up and running

* Rolling update: even when you run 5 frontends, it is a pain to use capistrano or other tools to just update at git repo. it is literally a one liner in Kubernetes. If you use CI/CD the setup is just a few lines in any Jenkins/Gitlab/Travis...

* Service discovery: the combination of ENV and predictable DNS endpoints is just awesome

* Ecosystem: PaaS, Serverless... Many of the new world infra is built on K8s, so it is a door to the next gen, whether you know you will use it or not.

As for Micro Service Architecture, just starting with the web frontend and a couple of lightly dockerize middleware makes it sooooo simple that you instantly want to get more out of it.

As the overhead of running K8s vs. set of servers is relatively low especially at small scale, it is definitely worth looking at. Happy to do a run through with you and show you how the deployment of a tiered app works as a demo, ping me on @SaMnCo_23 if interested.

tokenizerrr · on April 19, 2017

When running a kubernetes cluster on your own hardware what do you use for storage?

samnco · on April 19, 2017

You have several options:

* Run Ceph in separate nodes and connect it to the cluster. With Juju, you can do that from the bundle, as Ceph is also a supported workloads. This gives you scale for storage

* Run Ceph within the cluster with a Helm chart. We see that for openstack-helm for example. Also gives you scale, but the lack of device discovery makes it less interesting in my opinion

* Run an NFS server, plain easy but not very scalable.

* Use hostpath, which is the default but doesn't get you scale.

tokenizerrr · on April 19, 2017

So Ceph is the preferred storage provider? I've noticed there is a huge list, including GlusterFS. Do you have experience with any of the other ones?

ferrantim · on April 19, 2017

Ceph is not a good solution at all for databases. Your performance is going to be terrible.

Ceph is object-based SDS solution designed to take servers with local drives and create a SAN out of them. In order to do this, they take each LUN (Ceph volume) and scatter the data across all nodes in the cluster. They do not assume that applications will run on these servers themselves... they assume compute is elsewhere, like a traditional SAN. The goal is to replace a SAN with servers, not create a converged platform. Also, Ceph was designed during an age where an Intel server did NOT have tier-1 capacity (8 - 20 TB), which is why they shard a volume across so many servers.

This causes a problem for modern applications like Cassandra, Mongo, Kafka etc, where they like to scaleout themselves and want a converged system, where data is not scattered, but on the node where an instance of that cluster runs. Ceph also disrupts (undo) the HA capabilities that these scaleout applications have (For example, a Cassandra instances data will not be on a node on which it thinks it is).

tokenizerrr · on April 20, 2017

Do you happen to have any suggestions for alternatives to look into?

ferrantim · on April 24, 2017

You could look at Portworx (disclosure, I work there so am biased, but you can test it yourself for free)

marcoceppi · on April 19, 2017

Gluster is a good alternative, Ceph can be prickly if you don't want a block device and instead just need a filesystem. CephFS fills sort of the same role, but does so on top of Ceph rbd.

Since GlusterFS /can use/ NFSv4 as a client, it should work with the stuff @samco_23 uses

samnco · on April 19, 2017

Ah you are right, I forgot about glusterfs. My bad.

Canonical at this stage only supports Ceph commercially, but it doesn't mean GlusterFS is not a good option. I haven't tried it myself, so can't tell.

Anyone?

eicnix · on April 19, 2017

When you are running on your own hardware you usually have multiple options for storage:

- Use the local node storage although this is very simple but can get complicated on more complex installations

- Connect to your existing storage solution using ISCSI or NFS

- Running your own distributed storage solution on top of Kubernetes for Kubernetes. e.g. https://github.com/rook/rook

aquadrop · on April 19, 2017

So, how well is Kubernetes suited for working with local hardware? As I understand it's mainly supposed to work with some external abstracted storage, like AWS EBS, cephs, NFS etc But it's much slower that local SSD and for some small local installation maybe be not optimal. Like, running some not large non-critical service which requires database, several workers, some monitoring etc, overall 3-4 local servers. Is Kubernetes a good fit for this or it's only supposed to work with hive-like stateless workers connected to external storage over network?

samnco · on April 19, 2017

You have several options for this. If it is non HA, then you can pin a RC to a specific node, and use hostpath storage. if the container fails, it will always respawn on the same node, maximizing uptime and also having max capacity from your local SSD. Alternatively, you can also run rook, which is backed by Ceph, and use affinity to make sure that your pods are very close to storage, and gain back some of the speed.

In general and in my opinion, it is always better to run with k8s since you have, for the stateless pieces, cluster awareness. So there is never a downside to it, especially as the control plane is very lightweight for small clusters, and you can colocate many parts.

aquadrop · on April 19, 2017

Thanks for the answer. How easy will it be to transfer this cluster to another set of servers (with data copy)? Like, stop the service for several minutes, push the button "Transfer" and start service on new servers after that. As I understand you'll need rook for something like that?

samnco · on April 19, 2017

Typically, you would have a set of "helm charts" (packages) for your application(s). So deploying, without data, would be something like a sequence of "helm install app-appId --values /path/to/config/for/environment.yaml"

At this stage, you have no data. Normally, if you app depends on data, the pods will keep failing until data has been moved and is available (at which point they stabilize in an equilibrium state).

If you run Ceph in both environments, then you may want to use Ceph replication. If you use another storage layer, then it's also up to you to make sure the data is moved around.

rook should have support for that use case. However, it is still alpha as per the GitHub readme, so use at your own risk. However, you may want to consider the data replication problem outside of the scope of k8s.

If the 2 sets of clusters are physically close to each other, you may want to just point the new apps to the old data and pull the switch from the first one.

Another option would be to run another beta feature in K8s called Federation, which allows to manage several cluster via a single control plane.

Sorry for the long comment, your question has a broad scope, and it's hard to answer without diving into details.

aquadrop · on April 19, 2017

Thanks for your explanation. Looks like k8s is designed by Google with Google in mind - where data is somewhere over network and main problem is to manage code. I just thought it could be perfect solution to completely abstract and isolate such small services and move them around on local hardware with one push of a button, but it seems I'll have to think about external ways to manage data replication.

yebyen · on April 19, 2017

So, your thought process has apparently led you quickly to the same conundrum that I have found myself in, after experimenting with this stuff for several generations.

You might be interested in Deis v1 PAAS as a historical reference. Deis is a company that specializes in Kubernetes (was just bought by Microsoft). They have been in the container orchestration game since before Kubernetes was a kid (and even before containers were really en vogue.) Deis v1 PAAS is the ancestor of Deis Workflow (or v2) which is a product that runs solely on Kubernetes.

Workflow does not do distributed filesystems internally where PAAS v1 did. That is why I'm telling you about it. PAAS v1 had its own storage layer called deis-store, which is (was) essentially CephFS and RBD under the hood. They did the best they could to make sure you did not have to be a competent Ceph admin just to get it started, but as it happens you would be running Ceph and susceptible to all of the Ceph issues.

Distributed filesystems are complicated business.

Deis was running Ceph for internal purposes. Deis used the Store component to take care of log aggregation ("Logger"), container image storage ("Registry"), and Database backups. When Workflow was released, it was targeting Kubernetes and required PVC support (AWS S3, or GCE PD, or one of the other storage drivers).

It still handles Log Aggregation, Database Backups, and Image Storage, but it uses the platform services to do this in an agnostic way (that is, whatever way you have configured to enable PVC support in Kubernetes.)

The Ceph support provided by Deis v1 was never intended to be an end-user service, it was for internal platform support. I thought about using it for my own purposes but never got around to it. The punchline is this: porting your applications to Deis requires you to re-think the way they are built to support 12factor ideology. Porting your applications to Kubernetes requires no such thing... but it helps!

Also that distributed storage is a complicated problem, and if you undertake to solve it for yourself, you should not take it lightly. (OR do take it lightly, but with the understanding that you haven't given much rigor to that part.)

What was good advice for Deis v1 is still good advice for Kubernetes today. If you are building a cluster or distributed architecture to scale, you should really consider separating it into tiers or planes. In Deis v1, the advice was to have a control plane (etcd, database), storage plane (deis-store or Ceph), data plane (your application / worker nodes), and routing mesh plane (deis-router, traefik, or the front-end HTTP serving nodes.) All of those planes may require special attention to make them reliable and scalable.

In my opinion none of this has anything to do with AWS or Google, but those two providers have positioned themselves well to be the people that do work on solving those hard problems for you. I would certainly start experimenting with Rook, I had good experiences with deis-store and I've been looking for something to fill the void for me.

aquadrop · on April 19, 2017

Thanks, I know that distributed storage is hard, that's why I would be ok if K8s could just work with something like docker compose volumes on local storage and copy them between servers if needed.

yebyen · on April 19, 2017

It would be cool if Kubernetes had a native distributed filesystem. I don't read the future roadmaps but I wouldn't be too surprised to see it coming in a future release.

The first thing that anyone doing serious deployments needs is an image registry. For that to be HA hosted on a cluster, you need some kind of distributed filesystem.

But those PD/EBS solutions are pretty compelling and they're not going away.

ferrantim · on April 19, 2017

You can use Portworx[1] to transfer data between servers and datacenters. It has a native kubernetes driver, is production ready being used by GE and Lufthansa airlines among others and does sync and async replication of data between environments. (disclosure I work for Portworx). [1] https://portworx.com/

aquadrop · on April 19, 2017

Yes, looks interesting, but seems enterprisey and not open source?

ferrantim · on April 19, 2017

Correct, it is not opensource but you can use it for free on up to 3 nodes. Not sure what "enterprisey" means but to me that sounds like it works well for large deployments and that is true. We have customers running us with Mesos on up to 1000 nodes and K8s on hundreds of nodes.

aquadrop · on April 19, 2017

"Request a demo" and no pricing is what I mean with "enterprisey".

eicnix · on April 19, 2017

You can run distributed stateful workloads on kubernetes with local storage and disable rescheduling when the node goes down. This means that you need to manually migrate/restart workloads if a node goes down. There is work being done on improving handling local node storage https://github.com/kubernetes/community/pull/306 but it doesn't scale as well as network storage.

If you have an existing SAN solution you can connect to it via fiber channel over iSCSI.

moondev · on April 19, 2017

In my opinion, if you are running containers in prod, you need to be on Kubernetes, regardless of scale.

Kubernetes is so much more than just "planet scale". It encourages patterns and mindsets for efficient software delivery that can really pay dividends.

Here are some of my favorite things:

Cloud agnostic. Your team and business are not at the mercy of pricing, features or availability of a third party. You can run it on everything from a massive cluster on AWS to some cheap mini computers off ebay: https://hackernoon.com/diy-kubernetes-cluster-with-x86-stick... Moving between cloud-providers when they both run Kubernetes is fairly trivial. You can also run on multiple clouds at the same time. Kubernetes abstracts the infrastructure away. It's also really easy to run a single node cluster on your own machine for local development. Try doing that with AWS services in a reliable way.

Immutable infrastructure:

The fact that containers don't hold state FORCES you to develop your applications in a 12-factor pattern. Deploy images by tag forces you to create a pipeline that automates their builds. It also allows you to effortlessly roll-back. It's not an afterthought or something you need to glue together.

High availability:

Just define how many replicas of your service you want and k8s does the rest. If they crash so what. Not only will they be restarted automatically but they will automatically be distributed across your fleet for you. Node goes down? Who cares. It's self-healing.

Service discovery:

Just put a k8s service in front of your application replicas and everything is automatic. Nothing to install, simply refer to the stable dns service name and everything will be routed. Software agnostic.

Config Management:

Very easy to inject secrets and configs as env vars or mounted into the pod. No third party library or framework needed to leverage it.

Dev - Stage - Prod envs made easy:

The same container image can move through each env effortlessly and you can be sure there is no "artifact rot"

Extensible and open

You can run different container runtimes such as rkt or different pod networks and persistent storage options. There is not a single company trying to steer it in some way. Also recently with helm charts it's becoming very easy to "apt-get install" whatever you want on your cluster. Very powerful and portable.

It does take some time getting ramped up but once it clicks there is no turning back.

Tossrock · on April 19, 2017

Unprofitable for ETH mining maybe, but it seems like a natural fit to rent time on it to deep learning people with slow training models. Although that could still be unprofitable after the cost of electricity, I guess it's a question of market size/demand. A lot of deep learning is already at big infrastructure players anyway who wouldn't need the service, leaving academics / smaller companies. But maybe some people would find a reliable, scalable GPU cluster valuable.

samnco · on April 19, 2017

Ahah, good point. Really the ETH stuff was "because I can". But in the same charts repository you will find a Tensorflow chart. My previous series of blogs [0] was about exactly that. A nice addition as well for compute intensive workloads is the use of LXD [1]

Another use case is in media for transcoding. It is not a trivial job to orchestrate transcoding at scale, and Kubernetes with or without GPUs is an excellent solution for that as it is trivial to setup a completely automated job queue.

Also another interesting field will eventually be HPC but there are some constraints about compute that K8s does not tick scheduling wise at this point in time. There is a pluggable scheduler in the works I think, and this will eventually help. Also the LXD example is a nice optimization but it would not replace the scheduler in any way.

[0]: https://medium.com/intuitionmachine/gpus-kubernetes-for-deep...

[1]: https://hackernoon.com/job-concurrency-in-kubernetes-lxd-and...

aub3bhat · on April 19, 2017

Great read, on a smaller scale, I have found nvidia-docker with nvidia-docker-compose to be a great solution for deploying docker containers on AWS P2 machines with 8 GPUs.

nrki · on April 19, 2017

"1060GTX at home but on consumer grade Intel NUC"

A bit OT, but I'd like to see how this works...

Ah, very cool - https://www.youtube.com/watch?v=wyY-lTmgb8c

samnco · on April 19, 2017

Actually, it was a fun DIY project I did a while ago. You can read about it here: https://hackernoon.com/installing-a-diy-bare-metal-gpu-clust...

It works, but the GPUs aren't very stable at 4x vs. a normal 16x.

jacquesm · on April 19, 2017

That's one problem, another is the size of the powersupply. And maybe that's the only problem, I don't see why a GPU would become unstable when using fewer lanes, all it should do is get slower.

samnco · on April 19, 2017

I don't know. Maybe the make of the extenders isn't very good, I saw other people with similar issues.

The PSU is the Corsair AX1500i (1500W), with 10x lines for GPUs. It's robust on paper, didn't have any problem with just 4 plugged in.

But I must say... The T630 are very noisy compared to these, but so much more powerful #NotGoingBack

jacquesm · on April 19, 2017

I just bought a GTX1080ti + a similar corsair as an upgrade for my 3 year old Dell, it works like a charm.

If you have a PSU that big then that probably isn't the problem. I thought you might be using the PSU that comes with those extender boxes and they usually are very puny (250 W or so).

Do you use it for gaming or for CUDA?

Do you run the 4 GPUs in the extender?

samnco · on April 19, 2017

Yes, each GPUs has a 4x -> 16x and a 4x-4x extender, in addition to the m.2 -> PCI-e 4x adapter.

So many potential failure points in there. The sole use case is CUDA. Essentially I wanted a portable cluster with GPUs and that did the work for a couple of month. Now it's getting more serious so the switch to T630 makes sense, and I repurposed the NUCs into the control plane of the K8s cluster.

jacquesm · on April 19, 2017

I built this long ago:

https://clustercompute.com/images/image4.jpg

Which was a lot of fun.

Do you have all the GPUs internal to the T630?

Any chance of a picture (of the guts)?

I'm seriously thinking of duplicating your effort.

samnco · on April 19, 2017

Here you go :) https://drive.google.com/file/d/0B1CCk51NQ4koSmkxSmxWb1E5Y0E...

Replicating is not very hard. You need a lightweight x86 machine for MAAS, which takes ~20min to install, one VLAN for the iDRAC (IPMI), another for networking that can connect to internet, and off you go. You can also enable KVM power management in MAAS to run the Juju control plane in VMs and save a box if you're limited in compute power.

https://maas.io https://jujucharms.com/docs https://www.ubuntu.com/containers/kubernetes for all the goodies.

If you run into problems, I am SaMnCo on #juju in freenode.

jacquesm · on April 19, 2017

Ok, so 2 GPUs in there. Have you tried 4 or is that not possible for some reason?

I have plenty of other hardware floating around here so no problem on hooking it all up.

Thank you for the image.

samnco · on April 19, 2017

I have not, but it is technically possible. the PSU is the double 1100W with the GPU enablement kit. Up to 4x PCI-x 16x full speed. Also up to 1.5TB RAM, and 8x 3.5" HDD or 16x 2.5". I didn't go this far though ($$$...)

shaklee3 · on April 19, 2017

Is the author of this working on official support or just testing? I know there's a gpu roadmap for k8s, but I can't tell from this blog if this was part of it.

samnco · on April 19, 2017

Canonical will officially support GPUs when they lands GA upstream. The flag is beta as of now in the Canonical Distribution of Kubernetes. Paying customers either for the managed or supported solutions get a best effort for GPU, and this feature is enabled by default.

puzzle · on April 19, 2017

What is the requirement for privileged containers? The post never explains it.

samnco · on April 19, 2017

privileged containers are required for the GPU to be shared with the containers.

By default, the bundle come with a "auto" tag, which will activate privileged containers just when GPUs are detected.

You can enforce "false" to remove that, but then you won't be able to run GPU workloads.

Or you can enforce "yes" and have them activated all the time.

Does that answer the question? Not sure if I understood it right.

puzzle · on April 19, 2017

The Kubernetes docs don't say anything about having to use privileged containers for GPU support. Privileged containers are given tens of Linux capabilities; which of those are actually needed in your setup? Or, conversely, which specific step would fail for an unprivileged container?

Just because I want to use a GPU shouldn't require the power to change the clock, switch UIDs, chown files, mess with logs, reboot the machine, etc.

marcoceppi · on April 19, 2017

Since the GPU libraries are hosted on the node, privileged flag is typically required to make that possible. I'm sure there will be improvements to not require privileged, but today it's mostly a requirement to get anything useful out of containers tapping into GPU.

That said, if you set the allow-privileged flag to false GPU drivers will still be installed but you may not be able to make use of the cuda cores

puzzle · on April 19, 2017

That's weird, because all the times I tried the experimental support, it didn't need privileged containers. From the YAML files, it looks like it's using hostPath directories, but those don't require special privileges, unless you need to write to them:

https://kubernetes.io/docs/concepts/storage/volumes/#hostpat...

I suspect that there is a bug somewhere.

puzzle · on April 19, 2017

Ah, wait:

https://github.com/madeden/blogposts/blob/master/k8s-gpu-clo...

You don't need to mount the /dev entries into the container at all. The experimental support creates them automatically for you when you are using GPU resources. Perhaps it's device nodes, not the libraries that required privileges?

samnco · on April 20, 2017

Hello,

OK I gave it a try and you are absolutely right. For the nvidia-smi, I could run it the /dev/nvidia0, which is cool.

I was also able to run it unprivileged. I guess my mistake was to believe the example from the docs and not test without.

Thanks for sharing that, I'll update my charts and the post accordingly.

puzzle · on April 20, 2017

Awesome! Happy to hear that more containers will run without unneeded privileges.

samnco · on April 19, 2017

Aaah that is interesting. Let me dive into this later today and test my charts without that. It would actually make my life way easier for charting. I got that from a very early stage work and never questioned it again (the /dev stuff). Thanks for pointing that out.

stubish · on April 19, 2017

Its the Canonical distribution of Kubernetes. Supported by Canonical.