I recently rebuilt my Kubernetes cluster running across three dedicated servers hosted by Hetzner and decided to document the process. It turned into an eight-part (so far) series covering everything from bootstrapping and firewalls to setting up persistent storage with Ceph.
While the documentation was initially intended as a future reference for myself and a log of the decisions I made and why, I've already received some really good feedback and ideas, and figured it might be interesting to the hacker community :)
Great write-up. What I especially enjoyed was that you kept the bits where you ran into the classic sort of issues, diagnosed them, and fixed them. The flow felt very familiar from whenever I do anything dev-opsy.
I'd be interested to read about how you might configure cluster autoscaling with bare-metal machines. I noticed that the IP addresses of each node are kinda hard-coded into firewall and network policy rules, so that would have to be automated somehow. Similarly with automatically spawning a load balancer from declaring a k8s Service. I realise these things are very cloud-provider specific, but I would be interested to see if any folks are doing this with bare metal. For me, the ease of autoscaling is one of the primary benefits of k8s for my specific workload.
I also just read about Sidero Omni [1] from the makers of Talos, which looks like a SaaS for installing Talos/Kubernetes across any kind of hardware sourced from pretty much any provider (cloud VM, bare metal, etc.). Perhaps it could make the initial bootstrap phase and future upgrades to these parts a little easier?
When it comes to load balancing, I think the hcloud-cloud-controller-manager[1] is probably your best bet, and although I haven't tested it, I'm sure it can be coerced into some kind of working configuration with the vSwitch/Cloud Network coupling, even if none of the cluster nodes are actually Cloud-based.
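As a rough, untested sketch of what the consuming side might look like: a Service annotated so the hcloud controller places the load balancer in a specific location and reaches the nodes over the private network. The annotation names are from memory and the values are illustrative, so double-check them against the hcloud-cloud-controller-manager docs.

```yaml
# Untested sketch: a Service that should prompt hcloud-cloud-controller-manager
# to provision a Hetzner Cloud Load Balancer. Annotation names are from memory;
# the location, selector and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  annotations:
    load-balancer.hetzner.cloud/location: "fsn1"
    # Reach the nodes over the vSwitch/Cloud Network rather than public IPs.
    load-balancer.hetzner.cloud/use-private-ip: "true"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: https
      port: 443
      targetPort: 443
```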
I haven't used Sidero Omni yet, but if it's as well architected as Talos is, I'm sure it's an excellent solution. It still leaves open the question of ordering and provisioning the servers themselves. For simpler use-cases it wouldn't be too difficult to hack together a script to interact with the Hetzner Robot API to achieve this goal, but if I wanted any level of robustness, and if you'll excuse the shameless plug, I think I'd write a custom operator in Rust using my hrobot-rs[2] library :)
As far as the hard-coded IP addresses go, I think I would simply move that one rule into a separate ClusterWideNetworkPolicy which is created per node during onboarding and deleted again afterwards. The hard-coded IP addresses are only used before the node is joined to the cluster, so technically the rule is obsoleted by the generic "remote-node" one immediately after the node joins the cluster.[3]
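Roughly something like the following, assuming Cilium's host firewall; the resource name, IP and ports are purely illustrative, and the policy would be created when onboarding a node and deleted once it has joined:

```yaml
# Illustrative sketch of a temporary per-node policy created during onboarding
# and deleted once the node has joined. IP and ports are placeholders.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: onboarding-node-4
spec:
  description: "Allow the not-yet-joined node to reach the existing nodes"
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromCIDR:
        - 203.0.113.4/32   # public IP of the node being onboarded
      toPorts:
        - ports:
            - port: "6443"   # kube-apiserver
              protocol: TCP
            - port: "50000"  # Talos apid
              protocol: TCP
```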
Have you tried KubeOne? It also comes with the benefits of MachineDeployments. Works like a charm. I didn't go through your blog posts, but KubeOne on Hetzner [0] seems easier than your deployment. And yes, it's also open source, with German support available.
Hetzner Cloud is officially supported, but that means setting up VPSs in Hetzner's Cloud offering, whereas this project was intended as a more or less independent pure bare-metal cluster. I see they offer Bare Metal support as well, but I haven't dived too deep into it.
I haven't used KubeOne, but I have previously used Syself's https://github.com/syself/cluster-api-provider-hetzner which I believe works in a similar fashion. I think the approach is very interesting and plays right into the Kubernetes Operator playbook and its self-healing ambitions.
That being said, the complexity of the approach, probably from trying to span and resolve inconsistencies across such a wide landscape of providers, caused me quite a bit of grief. I eventually abandoned it after some operator somewhere consistently attempted and failed to spin up a secondary control-plane VPS against my wishes. After poring over loads of documentation and half a dozen CRDs in an attempt to resolve it, I threw in the towel.
Of course, Kubermatic is not Syself, and this was about a year ago, so it is entirely possible that both projects are absolutely superb solutions to the problem at this point.
If you ever want to have fun setting up your own k8s, I recommend starting small. The author is already knowledgeable, so they probably knew from the start what they wanted, but a lot of this complexity is not essential.
When I deployed my first Kubernetes "cluster", I just spun up a single-node "cluster" using kubeadm (today k3s is an option too) and started deploying services, with no distributed storage: everything was stored using hostPath. You only need to know Kubernetes basics to do this. Then you probably want to configure a CNI (I recommend flannel when starting, later Cilium), spin up an ingress controller (I recommend nginx or Traefik), and deploy cert-manager (this was hard for me when I started), and you can go a long way. With time I scaled up, decided to use GitOps, and deployed many more services, including my own registry: I started with Docker's own, then migrated to Gitea; Harbor is too heavy for me. And of course over time you add monitoring, alerting, etc. The fun never ends, but it's all optional, and you get to decide when the right time is.
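To give an idea of how simple the storage side is at that stage, a hostPath volume is just a directory on the node; a minimal, illustrative example (names and paths are made up), perfectly fine on a single-node cluster but not something you'd keep once you scale out:

```yaml
# Minimal single-node example: a Deployment keeping its data in a directory
# on the host. Image, names and paths are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitea
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gitea
  template:
    metadata:
      labels:
        app: gitea
    spec:
      containers:
        - name: gitea
          image: gitea/gitea:latest
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          hostPath:
            path: /srv/gitea
            type: DirectoryOrCreate
```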
Interesting read. I have just set up a very similar cluster this week: a 3-node bare-metal cluster in a 10G mesh network. I decided on Debian, RKE2, Calico and Longhorn. Encryption is done using LUKS FDE. For load balancing I am using the HCloud Load Balancer (in TCP mode).
At first I had some problems with the mesh network, as the CNI would only bind to a single interface. I finally solved it using a bridge, veth pairs and isolated ports.
Initially Ubuntu 20.04, but I upgraded to 22.04. Finally got it working -- turns out a lot of things that reference `--cgroup-driver="systemd"` write it as if it were run in a shell, which would strip the quotes around "systemd"; used anywhere that isn't shell-interpreted, the literal quotes lead to an error and ignored options (the config-file workaround sketched below avoids this entirely).
Nothing was showing whatsoever when using 20.04, so I wonder if there were some missing dependencies somewhere there...
I'll probably write up everything I discovered at some point; there are a lot of pieces that you have to cobble together from pretty disparate sources (network plugins, config files (which one!?), etc.).
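For what it's worth, setting the cgroup driver in the kubelet config file instead of via the flag sidesteps the quoting problem entirely; a minimal sketch:

```yaml
# Sketch: configure the cgroup driver via the kubelet config file rather than
# the --cgroup-driver flag, so no shell quoting is involved at all.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
```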
Thankfully we've never had the need for such complexity and are happy with the current GitHub Actions > Docker Compose > GCR > SSH solution [1] we're using to deploy 50+ Docker containers.
It requires no infrastructure dependencies: stateless deployment scripts are checked into the same repo as the project, and after the GitHub organization is set up (4 secrets) and the deployment server has Docker Compose + nginx-proxy installed, deploying an app only requires 1 GitHub Actions secret. It doesn't get any simpler for us, and we'll look to continue using this approach for as long as we can.
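Not our actual scripts, but the general shape is roughly this; the image name, host and secret names are all placeholders:

```yaml
# Illustrative sketch of the general shape: build, push to GCR, then deploy
# over SSH with Docker Compose. Names, hosts and secrets are placeholders,
# not the actual scripts referenced above.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          echo '${{ secrets.GCR_JSON_KEY }}' | docker login gcr.io -u _json_key --password-stdin
          docker build -t gcr.io/my-project/my-app:${{ github.sha }} .
          docker push gcr.io/my-project/my-app:${{ github.sha }}
      - name: Deploy over SSH
        run: |
          echo "${{ secrets.DEPLOY_SSH_KEY }}" > key && chmod 600 key
          ssh -i key -o StrictHostKeyChecking=accept-new deploy@my-server \
            "cd /srv/my-app && docker compose pull && docker compose up -d"
```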
I used to do something similar at a previous company, and this works well if you don't have to worry about scaling. YAGNI principle and all that. When you run hundreds of containers for different workloads, k8s bin packing and autoscaling (both at the pod and node level) tips the balance, in my experience.
Yeah, if we ever need to autoscale then I can see Kubernetes being useful, but I'd be surprised if this is a problem most companies face.
Not even when working at StackOverflow (serving 1B+ pages, 55 TB/mo [1]) did we need any autoscaling solution; it ran great on a handful of fixed servers. Although they were fairly beefy bare-metal servers, which I'd suspect would require significantly more VMs if it were to run in the Cloud.
I've been a k8s contributor since 2015, version 1.1. I even worked at Rancher and Google Cloud. If you don't need absolutely granular control over a PaaS/SaaS (complex networking with circuit breaking yadda yadda, deep stack tracing, VMs controlled by k8s (KubeVirt etc.), multi-tenancy in CPU or GPU), you don't need k8s and will absolutely flourish using a container solution like ECS. Use Fargate and arm64 containers and you will save an absolute fortune. I dropped our AWS bill from $350k/mo to around $250k converting our largest apps to ARM from x86.
GKE is IMO the best k8s PaaS that exists, but quite frankly few companies need that much control and granularity in their infrastructure.
My entire infrastructure now is AWS ECS and it autoscales and I literally never, ever, ever have had to troubleshoot it outside of my own configuration mishaps. I NEVER get on call alerts. I'm the Staff SRE at my corp.
> Ceph is designed to host truly massive amounts of data, and generally becomes safer and more performant the more nodes and disks you have to spread your data across.
I'm very pessimistic about Ceph usage in the scenario you have. Maybe I've missed it, but I've seen nothing about upgrading the networking; by default you're going to have 1 Gbit on a single interface used for both the public network and the internal vSwitch.
Even by your own benchmarks, the write test shows 19 IOPS (though the block size is huge):
Max bandwidth (MB/sec): 92
Min bandwidth (MB/sec): 40
Average IOPS: 19
Stddev IOPS: 2.62722
Max IOPS: 23
Min IOPS: 10
while a single HDD would give ~120 IOPS, and a single three-year-old datacenter NVMe drive gives ~33,000 IOPS with a 4k block size + fdatasync=1.
Ceph over 1 Gbit networking would be a very limiting factor, I believe; I'd put a clear disclaimer on that for fellow sysadmins.
P.S. The amount of work you've done is huge and appreciated.
Here's what I don't really get. So, let's say you have three hosts and create your cluster.
But now you still need a reverse proxy or load balancer in front, right? I mean not inside the cluster, but something to route requests to the nodes of the cluster that are not currently down.
So you could set up something like HAProxy on another host. But now you once again have a single point of failure. So do you replicate that part also and use DNS to make sure one of the reverse proxies is used?
Maybe I'm just misunderstanding how it works, but multiple nodes in a cluster still need some sort of central entry point, right? So what is the correct way to do this?
My solution for this setup is having ingress controllers on all three nodes, and then specifying all three IPs in all DNS records. That way the end user will "load balance" based on the DNS randomization.
Of course, if a node goes down, a third of the traffic will be lost, but with low TTLs and some planning, you can minimize the impact of this.
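Concretely, the ingress side of that is just the controller running as a DaemonSet binding ports 80/443 on every node, so each node's public IP can be listed as an A record. A trimmed, illustrative sketch (the real controller needs more args and RBAC than shown):

```yaml
# Illustrative: run the ingress controller on every node and bind 80/443
# directly on the host, so every node's public IP can serve traffic and be
# listed in DNS. Image tag is a placeholder; controller args, ServiceAccount
# and RBAC are omitted for brevity.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ingress-nginx
    spec:
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.9.4
          ports:
            - name: http
              containerPort: 80
              hostPort: 80
            - name: https
              containerPort: 443
              hostPort: 443
```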
It's an interesting approach.
I did it a bit differently. I set up three Proxmox nodes on three hetzner servers. Then I deployed virtual routers. I then set up HAProxy and k3s nodes as LXC containers.
What's nice about the whole setup is that a Proxmox node can go down and it all still works. I will now set up keepalived as mentioned in the other reply, so the HAProxies will also be fully HA. Proxmox also works well with ZFS and backups.
I set up the Proxmox nodes manually and did the rest with Terraform + Ansible. One `terraform destroy` cleans up everything nicely.
I wonder what the performance difference is between bare metal and a k8s node in LXC.
You almost answered your own question. One common solution is to have 2 nodes with HAProxy (or similar) sharing a virtual IP via keepalived, which load-balances the traffic to the control-plane nodes and to the nodes where your ingress controller runs.
There are other options, like running HAProxy on the control-plane nodes themselves.
I've come to the conclusion (after trying kops, kubespray, kubeadm, KubeOne, GKE, EKS) that if you're looking for a < 100 node cluster, Docker Swarm should suffice. Easier to set up, maintain and upgrade.
Docker swarm is to Kubernetes what SQLite is to PostgreSQL. To some extent.
The Docker Swarm ecosystem is very poor as far as tooling goes. You're better off using docker-compose (or maybe Docker Swarm?) and then migrating to k3s if you need a cluster.
My docker swarm config files are nearly the same craziness as my k3s config files so I figured I might as well benefit from the tooling in Kubernetes.
Edit for more random thoughts: being able to use helm to deploy services helped me switch to k3s from swarm.
This is almost exactly my experience with Docker Compose, which is lionized by commenters in nearly every Kubernetes thread I read on HN. It's great and super simple and easy ... until you want to wire multiple applications together, you want to preserve state across workload lifecycles for stateful applications, and/or you need to stand up multiple configurations of the same application. The more you want to run applications that are part of a distributed system, the uglier your compose files get. Indeed, the original elegant Docker Compose syntax just couldn't do a bunch of things and had to be extended.
IMO a sufficiently advanced Docker Compose stack is not appreciably simpler than the Kubernetes manifests would be, and you don't get the benefits of Kubernetes' objects and their controllers because Docker Compose is basically just stringing low-level concepts together with light automation.
Helm and Kustomize are low-budget custom resource definitions. They serve their purpose well and they have few limitations considering how much they can achieve before you write your own controllers.
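To make that concrete, the "multiple configurations of the same application" case mentioned above is only a few lines in a Kustomize overlay; a minimal sketch with made-up paths and names:

```yaml
# Minimal illustrative overlay: reuse a shared base and vary only what differs
# per environment. Paths, names and tags are placeholders.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: myapp-staging
resources:
  - ../../base
namePrefix: staging-
replicas:
  - name: myapp
    count: 1
images:
  - name: myapp
    newTag: v1.2.3-rc1
```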
In my opinion, the complexity is symptomatic of success: once you make a piece of some kind of seemingly narrowly focused software that people actually use, you wind up also creating a platform, if not a platform-of-platforms, in order to satisfy growth. Kubernetes can scale for that business case in ways Docker Swarm, ELB, etc. do not.
Is system configuration avoidable? In order to use AWS, you have to know how a VPC works. That is the worst kind of configuration. I suppose you can ignore that stuff for a very long time, but you'll be paying ridiculous amounts of money for the privilege: almost as much in bandwidth costs (transiting NAT gateways, all your load balancers, whatever mistakes you made) as you pay for compute. Once you learn that bullshit, you know, Kubernetes isn't so tedious after all.
Any sufficiently complicated Docker Swarm, Heroku, Elastic Beanstalk, Nomad or other program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of vanilla Kubernetes.
A pithy response to be sure, but is it true? Every Kubernetes object type exists within a well-specified hierarchy, has a formal spec, an API version, and documentation. Most of the object families' evolution is managed by a formal SIG. I'm not sure how any of that qualifies as ad hoc or informal.
I'm not sure what to say here. The kubernetes docs and code speak for themselves. If you actually think that it's clean, simple, well designed, and easy to operate, with smooth interop between the parts, I can't change your mind. But in practice, I have found it very unpleasant. It seems this is common, and the usual suggestion is to pay someone else to operate it.
I agree in part - the features and simplicity of Docker Swarm are very appealing over k8s, but it also feels so neglected that I'd be waiting every day for the EOL announcement.
It's built on a separate project called SwarmKit, so if it ever comes to the point where it's abandoned, forks would be out in the wild soon enough.
I see more risk of Docker Engine as a whole pulling a Terraform/Elasticsearch-style licensing move someday as investors get desperate to cash out.
Docker is largely irrelevant in modern container orchestration platforms. Kubernetes dropped Docker support (the dockershim) as of 1.24 in favor of CRI runtimes such as containerd and CRI-O.
Docker is just one of many implementations of the Open Container Initiative (OCI) specifications. It's not even fully open source at this point.
Under the hood, Docker leverages containerd, which in turn leverages runc, which leverages libcontainer for spawning processes.
Linux containers at this point will exist perfectly fine even if Docker as a corporate entity disappears. The biggest impact would be Docker Hub being shut down.
They also already sort of pulled a HashiCorp with their Docker Desktop product for macOS.
That’s a little different than if Docker disappeared completely, but one could easily switch to Podman (which has a superset of the docker syntax).
> I've come to the conclusion (after trying kops, kubespray, kubeadm, KubeOne, GKE, EKS) that if you're looking for a < 100 node cluster, Docker Swarm should suffice. Easier to set up, maintain and upgrade.
Personally, I'd also consider throwing Portainer in there, which gives you both a nice way to interact with the cluster, as well as things like webhooks: https://www.portainer.io/
With something like Apache, Nginx, Caddy or something else acting as your "ingress" (taking care of TLS, reverse proxy, headers, rate limits, sometimes mTLS etc.) it's a surprisingly simple setup, at least for simple architectures.
If/when you need to look past that, K3s is probably worth a look, as some other comments pointed out. Maybe some of Rancher's other offerings as well, depending on how you like to interact with clusters (the K9s tool is nice too).
When I was deploying Swarm clusters I would have a default stack.yml file with Portainer for admin, Traefik for reverse proxying, and Prometheus, Grafana, Alertmanager, unsee and cAdvisor for monitoring and metrics gathering. All were running on their own Docker network, completely separated from the app, and were only accessible by ops (and dev if requested, but not end users). It was quite easy to deploy with Heat + Ansible or Terraform + Ansible; the hard part was the CI/CD for every app, each in its own tenant, but it worked really, really well.
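Boiled down, the skeleton of that default stack looked something like this; heavily trimmed and illustrative, since the real file had the full monitoring stack, labels and configuration:

```yaml
# Illustrative skeleton of such a default stack: ops services on their own
# overlay network, separate from the application network. Images are real,
# but command-line config, labels and volumes are mostly omitted.
version: "3.8"
networks:
  ops:
    driver: overlay
services:
  traefik:
    image: traefik:v2.10
    networks: [ops]
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
  portainer:
    image: portainer/portainer-ce:latest
    networks: [ops]
  prometheus:
    image: prom/prometheus:latest
    networks: [ops]
  grafana:
    image: grafana/grafana:latest
    networks: [ops]
```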
I've been at a company running Swarm in prod for a few years. There have been several nasty bugs that were fun to debug, and we've accumulated several layers of slapped-on bandaids trying to handle Swarm's deficiencies. I can't say I'd pick it again, nor would I recommend it to anyone else.
Node count driven infrastructure decisions make little sense.
A better approach is to translate business requirements to systems capabilities and evaluate which tool best satisfies those requirements given the other constraints within your organization.
Managed Kubernetes solutions like GKE require pretty minimal operational overhead at this point.
PostgreSQL is more complex to use and operate and requires more setup than SQLite. If you don’t need the capabilities of PostgreSQL then you can avoid paying the setup and maintenance costs by using the simpler SQLite.
I have never had a Postgres install go that easily. There’s still initialization and setup of the server and users. And you’ll have to do something about upgrades as well. Postgres isn’t difficult to set up but SQLite is just a file. It’s much simpler.
And that’s only the installation. Interaction with SQLite as a database is also simpler.
They both have uses but it’s strange to me to assert that they’re equally complex.
I was using Docker Swarm because of the simplicity and easy setup, but the one feature that I really, really needed was the ability to specify which runtime to use per workload: either I used runsc everywhere (and Docker plugins don't work with runsc) or runc as the default, and it was too inefficient to keep groups of nodes with a certain runtime. I really do like Swarm, but it's missing too many features that are important.
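(For comparison, this is the gap Kubernetes fills with RuntimeClass: register the runsc handler once and opt individual workloads into it. A rough sketch, assuming containerd is configured with a handler named runsc:)

```yaml
# Rough sketch of the Kubernetes equivalent: a RuntimeClass pointing at a
# runsc handler (assumes containerd is configured with that handler name),
# which individual pods can then opt into via runtimeClassName.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor
  containers:
    - name: app
      image: nginx:alpine
```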
I haven't had much opportunity to work with Docker Swarm, but the one time I did, we hit certificate expiration and other issues constantly, and it was not always obvious what was going on. It soured my perception of it a bit, but like I said I hadn't had much prior experience with it, so it might have been on me.
Besides my local VirtualBox cluster, I have tried Kubernetes on three clouds with at least a dozen different installers/distributions, and my gut feeling has always been that operational pain would be a factor going forward.
That's where the author also has the following to say:
>My conclusion at this point is that if you can afford it, both in terms of privacy/GDPR and dollarinos then managed is the way to go.
And I agree. Managed Kubernetes is also really hard for those offering it, who have to manage it for you behind the scenes.[0]
This looks really nice, but the main feature of Docker Swarm over Docker Compose is the ability to run on a cluster of servers, not just a single node.
Speaking of k8s, does anyone here know of ready-made solutions for getting Xcode (i.e. xcodebuild) running in pods? As far as I'm aware, there are no good solutions for getting Xcode running on Linux, so at the moment I'm just futzing about with a virtual-kubelet[0] implementation that spawns macOS VMs. This works just fine, but the problem seems like such an obvious one that I expect there to be some existing solution(s) I just missed.
Someone has submitted patches to containerd and authored “rund” (d for darwin) to run HostProcess containers on macOS.
The underlying problem is poor familiarity with Kubernetes on Windows among Kubernetes maintainers and users. Windows is where all the similar problems have been solved, but the journey is long.
I ran rados benchmarks and it seems writes are about 74 MB/s, whereas both random and sequential reads run at about 130 MB/s, which is about wire speed given the 1 Gbit/s NICs.
I haven't had an excuse to test it yet, but since it's only 6 OSDs across 3 nodes and all of them are spinning rust, I'd be surprised if performance was amazing.
I'm definitely curious to find out though, so I'll run some tests and get back to you!
I believe the recommended[1] way to deploy Talos to Hetzner Cloud (not bare metal) is to use the rescue system and HashiCorp Packer to upload the Talos ISO, deploy your VPS using this image, and then configure Talos using the standard bootstrapping procedure.
This post series is specifically aimed at deploying a pure-metal cluster.
Great post. We (Koor) have been going through something similar to create a demo environment for Rook-Ceph. In our case, we want to show different types of data storage (block, object, file) in a production-like system, albeit at the smaller end of scale.
Our system is hosted at Hetzner on Ubuntu. KubeOne does the provisioning, backed by Terraform. We are using Calico for networking, and we have our own Rook operator.
What would have made the Rook-Ceph experience better for you?
Just finished reading part one and wow, what an excellently written and presented post. This is exactly the series I needed to get started with Kubernetes in earnest. It’s like it was written for me personally. Thanks for the submission MathiasPius!
Part I: Talos on Hetzner https://datavirke.dk/posts/bare-metal-kubernetes-part-1-talo...
Part II: Cilium CNI & Firewalls https://datavirke.dk/posts/bare-metal-kubernetes-part-2-cili...
Part III: Encrypted GitOps with FluxCD https://datavirke.dk/posts/bare-metal-kubernetes-part-3-encr...
Part IV: Ingress, DNS and Certificates https://datavirke.dk/posts/bare-metal-kubernetes-part-4-ingr...
Part V: Scaling Out https://datavirke.dk/posts/bare-metal-kubernetes-part-5-scal...
Part VI: Persistent Storage with Rook Ceph https://datavirke.dk/posts/bare-metal-kubernetes-part-6-pers...
Part VII: Private Registry with Harbor https://datavirke.dk/posts/bare-metal-kubernetes-part-7-priv...
Part VIII: Containerizing our Work Environment https://datavirke.dk/posts/bare-metal-kubernetes-part-8-cont...
And of course, when it all falls apart: Bare-metal Kubernetes: First Incident https://datavirke.dk/posts/bare-metal-kubernetes-first-incid...
Source code repository (set up in Part III) for node configuration and deployed services is available at https://github.com/MathiasPius/kronform