Unless you have a large cluster with many tens of nodes/OSDs (and who does in a homelab?), using Ceph is a bad idea (I've run large Ceph clusters at previous jobs).
Disagree. I run Ceph in Proxmox and have for years on a small cluster of 3 used R620 servers without any SSDs.
It’s just worked. I’ve lost two of the machines to memory failures at two different points in time, and the k8s clusters sitting on top didn’t fail; even the Postgres databases running with cnpg remained ready and available during both hardware failures.
Oh sure it works, not denying that. My point is that performance isn't great and if you only have a small cluster then it doesn't take much to make everything fall over because your failure domains are huge (in your case, you only have 3).
But then to offset the above, it also depends on how important your environment is; homelabs don't usually require five nines.
I am a big Proxmox fan but I dislike how easy it makes Ceph to run (or rather, how it appears to be easy). Ceph can fail in so many ways (I've seen a lot of them) and most people who set a Ceph cluster up through the UI are going to have a hard time recovering their data when things go south.
Ceph's design is to avoid a single bottleneck or single point of failure, with many nodes that can all ingest data in parallel (high bandwidth across the whole cluster) and be redundant/fault tolerant in the face of disk/host/rack/power/room/site failures. In exchange it trades away low latency, efficient disk-space use, simplicity of design, and some kinds of flexibility. If you have a "small" use case then you will have a much easier life with a SAN or a bunch of disks in a Linux server with LVM, and probably get better performance.
How does it work with no single front end[1] and no centralised lookup table of data placement (because that could be a bottleneck)? All the storage nodes use the same deterministic algorithm for data placement, known as CRUSH (Controlled Replication Under Scalable Hashing), guided by placement rules which the admin has written into the CRUSH map, things like:
- these storage servers are grouped together by some label (e.g. same rack, same power feed, same data centre, same site).
- I want N copies of data blocks, separated over different redundancy / failure boundaries like different racks or different sites.
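Written out in a CRUSH map, that second kind of rule looks roughly like this (a sketch in decompiled-CRUSH-map syntax; the rule name is made up, and `default` is the usual root bucket):

```
# Keep each of the pool's N replicas on a leaf under a different rack
rule replicated_across_racks {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```

In practice most admins never hand-edit this; a CLI shortcut like `ceph osd crush rule create-replicated replicated_across_racks default rack` generates an equivalent rule.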
There's a monitor daemon which shares the CRUSH map out to each node. A node receives some data over the network, works through the CRUSH algorithm, and then sends the data internally to the target node. The algorithm is probabilistic and not perfectly balanced, so some nodes end up with more data than others, and because there's no central data placement table this design is "as full as the fullest disk": one full disk anywhere in the cluster will put the entire cluster into read-only mode until you fix it. Ceph doesn't easily run well with random cheap different-size disks for that reason; the smallest disk or host will be a crunch point. It runs best with raw storage below 2/3rds full. It also doesn't have a single head which can hold a fast RAM cache the way a RAID controller can.[2] Nothing about this is designed for the small business or home use case; it's all designed to spread out over a lot of nodes[3].
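The core idea can be sketched in a few lines (a toy illustration, NOT the real CRUSH algorithm; the cluster map and names are made up): placement is a pure, deterministic function of the object name plus the shared cluster map, so every node computes the same answer with no central lookup table.

```python
import hashlib

# Hypothetical cluster map: rack -> hosts (in real Ceph this is the
# CRUSH hierarchy distributed by the monitors).
CLUSTER = {
    "rack1": ["host1", "host2"],
    "rack2": ["host3", "host4"],
    "rack3": ["host5", "host6"],
}

def rank(obj, item):
    # Deterministic pseudo-random score for an (object, item) pair.
    return hashlib.sha256(f"{obj}/{item}".encode()).hexdigest()

def place(obj, replicas=3):
    """Choose `replicas` hosts for `obj`, each in a different rack."""
    chosen = []
    # Rank racks by hash: deterministic, but only statistically balanced.
    for rack in sorted(CLUSTER, key=lambda r: rank(obj, r))[:replicas]:
        chosen.append(min(CLUSTER[rack], key=lambda h: rank(obj, h)))
    return chosen
```

Because the scores are hash-based, the spread is only probabilistic: some hosts land more objects than others, which is exactly why one full disk can cap the whole cluster.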
It’s got a design where the units of storage are OSDs (Object Storage Devices) which correspond roughly to disks/partitions/LVM volumes, each one has a daemon controlling it. Those are pulled together as RADOS (Reliable Autonomic Distributed Object Store) where Ceph internally keeps data, and on top of that the admin can layer user-visible storage such as the CephFS filesystem, Amazon S3 compatible object storage, or a layer that presents as a block device which can be formatted with XFS/etc.
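As a concrete example of that layering, carving a block device out of RADOS and formatting it goes roughly like this (the command names are real, but the pool/image names are made up and flags vary by release; this needs a live cluster):

```shell
# Create a replicated pool and tag it for RBD use
ceph osd pool create myrbdpool 128
ceph osd pool application enable myrbdpool rbd

# Carve a 100 GiB block device image out of the pool
rbd create myrbdpool/myimage --size 100G

# On a client host: map it, then treat it like any other disk
rbd map myrbdpool/myimage      # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt
```

Under the hood the image is striped into RADOS objects, so "one disk" from the client's point of view is actually scattered across many OSDs by CRUSH.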
The result is a distributed system that can ingest a lot of data in parallel streams using every node’s network bandwidth, but with quite a lot of internal shuffling of data between nodes and layers, which adds latency. Monitor and manager daemons oversee the whole cluster, keeping track of failed storage units and making the CRUSH map available to all nodes, and those ought to be duplicated and redundant as well. It's a bit of a "build it yourself storage cluster kit": pretty nicely designed and flexible, but complex, layered, and non-trivial.
There are some talks on YouTube by people who managed and upgraded it at CERN as a target for particle accelerator data which are quite interesting. I can only recommend searching for "Ceph at CERN"; there are many hours of talks, and I can't remember which ones I've seen. Titles like: "Ceph at CERN: A Year in the Life of a Petabyte-Scale Block Storage Service", "Ceph and the CERN HPC Infrastructure", "Ceph Days NYC 2023: Ceph at CERN: A Ten-Year Retrospective", "Ceph Operations at CERN: Where Do We Go From Here?".
[1] If you are not writing your own software that speaks to Ceph's internal object storage APIs, then you are fronting it with something like a Linux machine running an XFS filesystem or the S3-compatible gateway, and that machine becomes your single point of failure and bottleneck. Then you front one Ceph cluster with many separate Linux machines as targets, and have your users point their software to different front ends, and in that case why use Ceph at all? You may as well have had many Linux machines with their own separate internal storage and rsync, and no Ceph. Or two SANs with data replication between them. Do you need (or want) what Ceph does, specifically?
[2] I have only worked on HDD based clusters, with some SSDs for storing metadata to speed up performance. These clusters were not well specced and the metadata overflowed onto the HDDs which didn't help anything.
[3] There are ways to adjust the balance of data on each node to work with different-size or nearly full disks, but if you get to read-only mode you end up waiting for it to internally rebalance while everything is down. This isn't so different to other storage like SANs; it's just that if you are going for Ceph you probably have a big cluster with a lot of things using it, so a lot of things offline. You still have to consider running multiple Ceph clusters to limit the blast radius of failures: if you are thinking "I don't want to bother with multiple storage targets, I want one Ceph", you still need to plan for the possibility that you don't just want one Ceph.
While most of what you speak of re Ceph is correct, I want to strongly disagree with your view of not filling up Ceph above 66%. It really depends on implementation details. If you have 10 nodes, yeah then maybe that's a good rule of thumb. But if you're running 100 or 1000 nodes, there's no reason to waste so much raw capacity.
With upmap and balancer it is very easy to run a Ceph cluster where every single node/disk is within 1-1.5% of the average raw utilization of the cluster. Yes, you need room for failures, but on a large cluster it doesn't require much.
80% is definitely achievable, 85% should be as well on larger clusters.
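For reference, turning that on is only a few commands on a reasonably recent cluster (a sketch; defaults and output vary by release, so check your version's docs):

```shell
# upmap needs every client to speak at least the Luminous protocol
ceph osd set-require-min-compat-client luminous

# switch the built-in balancer to upmap mode and enable it
ceph balancer mode upmap
ceph balancer on

# watch the spread: balancer progress, then per-OSD fill vs the average
ceph balancer status
ceph osd df tree
```

Once it converges, the per-OSD utilisation column from `ceph osd df tree` is the number to watch; the tight 1-1.5% spread is what lets you push average utilisation higher safely.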
Also, re scale: depending on how small we're talking, of course, but I'd rather have a small Ceph cluster with 5-10 tiny nodes than a single Linux server with LVM if I care about uptime. It makes scheduled maintenance much easier, and a disk failure on a regular server means a RAID group (or ZFS/btrfs) rebuild. With Ceph, even at fairly modest scale you can have very fast recovery times.
Source: I've been running production workloads on Ceph at Fortune 50 companies for more than a decade, and yes, I'm biased towards Ceph.
I defer to your experience and agree that it really depends on implementation details (and design). I've only worked on a couple of Ceph clusters built by someone else who had left: around 1-2PB, 100-150 OSDs, <25 hosts, and not all the same disks in them. They started falling over because some OSDs filled up, and I had to quickly learn about upmap and rebalancing. I don't remember how full they were, but numbers around 75-85% were involved, so I get nervous around 75% from my experiences. We'd suddenly commit 20TB of backup data and that's a 2% swing. It was a regular pain in the neck and stress point, and the creaking, amateurishly managed, under-invested Ceph clusters caused several outages and some data corruption. Just having some more free-space slack would have spared us.[1]
That whole situation is probably easier the bigger the cluster gets; any system with three "units" that has to tolerate one failing can only have 66% usable. With a hundred "units" then 99% are usable. Too much free space is only wasting money, too full is a service down disaster, for that reason I would prefer to err towards the side of too much free rather than too little.
Other than Ceph I've only worked on systems where one disk failure needs one hotspare disk to rebuild, anything else is handled by a separate backup and DR plan. With Ceph, depending on the design it might need free space to handle a host or rack failure, and that's pretty new to me and also leads me to prefer more free space rather than less. With a hundred "units" of storage grouped into 5 failure domains then only 80% is usable, again probably better with scale and experienced design.
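That capacity arithmetic is simple enough to write down (a back-of-the-envelope sketch; it ignores CRUSH imbalance and Ceph's full-ratio safety margins, so real headroom needs to be larger):

```python
def usable_fraction(units, failures_to_absorb=1):
    """Fraction of raw capacity you can fill while still leaving room
    to re-replicate the contents of `failures_to_absorb` failed units
    onto the survivors."""
    return (units - failures_to_absorb) / units

usable_fraction(3)    # 3 nodes, absorb 1 -> ~0.67 usable
usable_fraction(100)  # 100 nodes -> 0.99
usable_fraction(5)    # 5 failure domains -> 0.80
```

The same formula shows why the 66% rule of thumb is really a small-cluster rule: the safety margin it buys shrinks rapidly as the number of independent failure units grows.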
If I had 10,000 nodes I'd rather have 10,100 nodes and better sleep than play "how close to full can I get this thing", constantly on edge waiting for a problem which takes down a 10,000-node cluster and all the things that needed such a big cluster. I'm probably taking some advice from Reddit threads about 3-node Ceph/Proxmox setups which say 66%, and from YouTube videos about Ceph at CERN; I think their use case is a bursty massive dump of particle accelerator data to ingest, followed by a quieter period of read-heavy analysis and reporting, so they need to keep enough free space for large swings. My company's use case was more backup data churn: lower peaks, less tidal, quite predictable, and we did run much fuller than 66%. We're now down below 50% used as we migrate away, and the clusters are much more stable.
[1] It didn't help that we had nobody familiar with Ceph once the builder had left, that the clusters had been running a long time and partially upgraded through different versions, and that they had one of everything: some S3 storage, some CephFS, some RBDs with XFS to use block cloning, some N+1 replicated pools, some erasure-coded pools, some physical hardware and some virtual machines, some Docker-containerised services but not all, multiple frontends hooked together by password-based SSH, no management will to invest or pay for support/consultants, some parts running over IPv6 and some over IPv4, none with DNS names, some front-ends with redundant multiple back-end links, others with only one. A well-designed, well-planned, management-supported cluster with skilled admins can likely run with finer tolerances.