
Haha "super fucked up" is a much better way of describing the "usual, rare" situations I was putting it into as well :P



Care to give some examples of what you were doing to a storage pool? I guess I'm just not imaginative enough to think of ways of using a storage pool other than storing data in it.


In our case, we ran a free, no-signup-required way of testing Kubernetes. You could just go to the site and spin up pods. Looking back, it was a bit insane.

Anyway, you can imagine we had all sorts of attacks, crypto miners, and other abusive software running. That, on top of using ephemeral nodes for the free service, meant hosts were always coming and going and Ceph was constantly busy migrating data around. The wrong combination of nodes dying, bursty traffic, and beta versions of Rook meant we ran into a huge number of edge cases. We did some optimization and redesign, but it turned out there just weren't enough folks interested in paying for multi-tenant Kubernetes. We did learn an absolute ton about multi-tenant K8s, so if anyone is running into those challenges, feel free to hire us :P


Not OP, but I would start with filling the disk to 100%, or creating zillions of empty files (rough sketch of both below). For distributed filesystems: maybe removing one node (preferably under heavy load), or "cloning" nodes so they have the same UUIDs (preferably nodes with some data on them, to see whether the data gets de-duplicated somehow).

Or just a disk with an unreliable USB connection?
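
To make the first two concrete, here's a rough sketch I threw together (my own illustration, not anyone's actual tooling; the mount path is hypothetical, so only point this at a scratch filesystem):

    # Fill a filesystem to 100% and create huge numbers of empty files.
    # Illustration only: MOUNT is a made-up scratch mountpoint.
    import errno
    import os

    MOUNT = "/mnt/scratch"

    def fill_disk(path, chunk=64 * 1024 * 1024):
        # Append zero-filled chunks until the filesystem reports ENOSPC.
        with open(os.path.join(path, "filler.bin"), "wb") as f:
            try:
                while True:
                    f.write(b"\0" * chunk)
                    f.flush()
                    os.fsync(f.fileno())
            except OSError as e:
                if e.errno != errno.ENOSPC:
                    raise
                print("hit ENOSPC, filesystem is full")

    def make_empty_files(path, count=1_000_000):
        # Chew through inodes/metadata with zero-byte files.
        d = os.path.join(path, "zillions")
        os.makedirs(d, exist_ok=True)
        for i in range(count):
            open(os.path.join(d, f"f{i}"), "w").close()

    fill_disk(MOUNT)
    make_empty_files(MOUNT)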


Administering the storage pool.

The worst that comes to mind for me was a node failure in the middle of a major version upgrade. Probably not a big deal for proper deployments, but I don't have enough nodes to tolerate a complete node failure for most of my data.

Grabbed a new root/boot SSD, reinstalled the OS, reinstalled the OSDs on each disk, told Ceph what OSD ID each one had previously (not actually sure that was required), and... voila, they just rejoined the cluster and started serving their data like nothing ever happened.
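
For anyone curious what that looks like in practice, here's the general shape of it, as a rough sketch rather than my exact commands (assumes bluestore OSDs created with ceph-volume on LVM, data disks that survived the root-SSD failure, and ceph.conf back in place after the reinstall):

    # Sketch: re-adopt surviving OSD disks on a freshly reinstalled host.
    # Assumptions as noted above; adjust to your own setup.
    import subprocess

    # ceph-volume scans the LVM tags Ceph left on each data disk and brings
    # every OSD it finds back up under its original OSD ID.
    subprocess.run(["ceph-volume", "lvm", "activate", "--all"], check=True)

    # Sanity check that the OSDs rejoined the cluster.
    subprocess.run(["ceph", "osd", "tree"], check=True)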



