I'd be really interested in how you're approaching persistence. I've also found self-managed clusters provisioned with kubeadm fairly hassle-free until persistence is involved. Not so much setting it up (e.g. Rook is fairly easy to get going with now), but the ongoing maintenance, dealing with transitioning workloads between nodes, etc.
tl;dr - Rook is the way to go, with automatic backups set up. Using Rook means your cluster's storage is Ceph-managed, so you basically have a mini EBS: Ceph does replication across machines for you in the background, and all you have to do is write out snapshots of the volume contents from time to time, just in case you get spectacularly unlucky and X nodes fail all at once, in just the right order to make you lose data. Things get better/easier with CSI (Container Storage Interface) adoption and support -- snapshotting and restore are right in the standard -- so barring catastrophic failures you can lean super hard on Ceph (plus probably one more cluster-external place for colder backups).
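For the curious, the replication bit is just a pool setting on the Rook side. Here's a minimal sketch of a replicated block pool (pool name and replica count are made up for illustration):

```yaml
# A Rook CephBlockPool that keeps 3 replicas of every object, spread across
# hosts -- this is the "mini EBS" replication happening in the background.
# (Pool name and replica count are illustrative.)
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool
  namespace: rook-ceph
spec:
  failureDomain: host   # place replicas on different machines, not just different OSDs
  replicated:
    size: 3             # tolerate losing hosts, up to replicas-minus-one
```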
I'd love to share! In the past I've handled persistence in two ways:
- hostPath setting on pods[0]
- Rook[1] (operator-provisioned Ceph[2] clusters; I free up one drive on my dedicated server and give it to Rook to manage, usually /dev/sdb -- see the sketch just below)
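Handing that single drive to Rook is just part of the CephCluster spec. This is a trimmed-down sketch -- the node name is hypothetical, the image tag is whatever matches your Rook release, and the exact fields have drifted a bit between versions:

```yaml
# Roughly how "give one drive to Rook" looks -- trimmed to the relevant parts.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18    # pick the Ceph image appropriate for your Rook release
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: my-dedicated-server   # hypothetical node name
        devices:
          - name: "sdb"             # the freed-up second drive
```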
While Rook worked well for me for a long time, it fell short in two situations:
- Setting up a new server without visiting Hetzner's rescue mode (which is where you would be able to disassemble RAID properly)
- Using rkt as my container runtime. The Rook controller/operator does a lot of things that require a bunch of privileges, which rkt doesn't give you by default, and I was too lazy to work it out. I use and am happy with containerd[3] as my runtime (and will be for the foreseeable future), so I just switched off of rkt.
Right now I actually use hostPath volumes, which isn't the greatest (for example, you can't really limit them properly). I had to switch away from Rook due to my distaste for needing to go into Hetzner's rescue mode to disassemble the pre-configured RAID (there's currently no way to ensure they don't RAID the two drives you normally get after the automated operating system setup). Normally RAID1 on the two drives they give you is a great thing, but in this case I don't care much about the main server's contents, since I try to treat my servers as cattle (if the main HDD somehow goes down it should be OK). And I know that as long as Ceph is running on the second drive, I get reliability from having more machines, which is the only way to really improve reliability anyway.
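For context, the hostPath setup looks roughly like this (names and paths are made up). Note there's no capacity limit or isolation anywhere, which is the "can't really limit them properly" part, and you have to pin the pod to the node that has the data yourself:

```yaml
# A workload using a hostPath volume -- the pod just gets whatever is at
# that path on whichever node it lands on, so we pin it with a nodeSelector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: some-app
  template:
    metadata:
      labels:
        app: some-app
    spec:
      nodeSelector:
        kubernetes.io/hostname: my-dedicated-server   # the node that holds the data
      containers:
        - name: app
          image: nginx:stable
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          hostPath:
            path: /var/lib/some-app   # directory on the node itself
            type: DirectoryOrCreate
```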
Supposedly you can just "shrink" the RAID array to one drive and then remove the second drive from it -- then I could format that drive and give it to Rook. With Rook, though (from the last time I set up the cluster and went through the RAID-disassembly shenanigans), things are really awesome -- you can store PVC specs right next to the resources that need them, which is much better/safer than just giving the deployment/daemonset/pod a hostPath.
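A rough sketch of what I mean (names are illustrative; "rook-ceph-block" is the storage class name Rook's examples conventionally use):

```yaml
# The claim lives alongside the workload that needs it, and the Rook-backed
# storage class does the provisioning -- no hostPath, no caring which node
# the pod lands on.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: some-app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 10Gi
---
# ...and the pod template just references the claim:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: some-app
  template:
    metadata:
      labels:
        app: some-app
    spec:
      containers:
        - name: app
          image: nginx:stable
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: some-app-data
```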
These days there are also local volumes[4], which are similar to hostPath but offer a benefit in that your pod will know where to go, because the node affinity is written right into the volume: your pod won't ever try to run on a node where the volume it's expecting isn't present. The downside is that local volumes have to be pre-provisioned, which is basically a non-starter for me.
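A rough sketch of what that pre-provisioning looks like (node name and path are hypothetical):

```yaml
# A pre-provisioned local volume -- the node affinity is part of the PV itself,
# so the scheduler will only place consumers on that node.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning, hence the pre-provisioning pain
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: some-app-local-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/some-app        # must already exist on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - my-dedicated-server
```

With volumeBindingMode: WaitForFirstConsumer the claim only binds once a pod shows up, so scheduling and storage end up agreeing on the node.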
I haven't found a Kubernetes operator/add-on that can dynamically provision/move/replicate local volumes, and I've actually been meaning to write a simple PoC one of these weekends -- I think it can be done naively by maintaining a folder full of virtual disk images[4] and creating/mounting them locally when someone asks for a volume. If you pick the virtual disks' filesystem wisely, you get a lot of snapshotting, replication, and other things for free/near-free.
One thing Kubernetes has coming that excites me is CSI (the Container Storage Interface)[5], which is in beta now and standardizes all of this even more. Features like snapshotting are right in the RPC interface[6], which means that once people standardize on it, you'll get consistent snapshot/restore semantics across compliant storage drivers.
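A sketch of what that looks like from the cluster side (the snapshot API group has moved through alpha/beta versions over time, so match the version to your cluster; names are illustrative):

```yaml
# Snapshotting a PVC through a CSI driver's snapshot class...
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: some-app-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # provided by whichever CSI driver you run
  source:
    persistentVolumeClaimName: some-app-data
---
# ...and restoring is just a new PVC that points at the snapshot as its dataSource.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: some-app-data-restored
spec:
  storageClassName: rook-ceph-block
  dataSource:
    name: some-app-data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```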
What I could (and probably should) do is just use a Hetzner Storage Box[7].