Aside from running Ceph as my day job, I have a 9-node Ceph cluster on Raspberry Pi 4s at home that I've been running for a year now, and I'm slowly starting to move things away from ZFS to this cluster as my main storage.
My setup is individual nodes, with 2.5" external HDDs (mostly SMR), so I actually get slightly better performance than this cluster, and I'm using 4+2 erasure coding for the main data pool for CephFS.
CephFS has so far been incredibly stable and all my Linux laptops reconnect to it after sleep with no issues (in this regard it's better than NFS).
I like this setup a lot better than ZFS now, and I'm even thinking of setting up a second Ceph cluster. The best thing about Ceph is that I can do maintenance on a node at any time and storage availability is never affected. With ZFS I've always dreaded any kind of upgrade, and any reboot requires an outage. Plus with Ceph I can add just one disk at a time to the cluster and disks don't have to be the same size. Also, I can now move the physical nodes individually to a different part of my home, or change switches and network cabling, without an outage. It's a nice feeling.
I want to preface this: I don't have a strong opinion here, and I'm curious about Ceph. As someone who runs a 6-drive raidz2 at home (w/ ECC RAM), does your Ceph config give you similar data integrity guarantees to ZFS? If so, what are the key points of the config that enable that?
When Ceph migrated from Filestore to Bluestore, that enabled data scrubbing and checksumming for data (older versions before Bluestore were only verifying metadata).
Ceph (by default) does metadata scrubs every 24 hours, and data scrubs (deep-scrub) weekly (configurable, and you can manually scrub individual PGs at any time if that's your thing). I believe the default checksum used is "crc32c", and it's configurable, but I've not played with changing it. At work we get scrub errors maybe weekly on average now. At home I've not had a scrub error yet on this cluster in the past year (I did have a drive that failed and still needs to be replaced).
My RPi setup certainly does not have ECC RAM as far as I'm aware, but neither does my current ZFS setup (also a 6 drive RAIDZ2).
Nothing stopping you from running Ceph on boxes with ECC RAM, we certainly do that at my job.
Weekly scrub errors are definitely not normal. There have been a few bug fixes in Ceph and the kernel. I would check how up to date your packages are.
Hardware errors are also possible, but there have been a few software bugs, so it's worth checking.
Here's a really curly one we recently found and solved at work (only in March) that causes both scrub errors and sometimes bluefs aborts, it's a kernel patch. Likely to happen under memory pressure:
That depends on the scale of your cluster. The Backblaze drive report shows they lose 1.3% of their disks per year on average. Larger operations will have one or more people who just replace disks all day.
@mike_d you are right I made a very bad assumption about the size of the environment :)
@antongribok sounds like you have it in hand, having seen your subsequent comments. I just didn't want to leave someone assuming that error rate was normal in the smaller clusters that are more commonly, but not universally, the case. Evidently not the case for you :)
I was running glusterfs on an array of ODROID-HC2s ( https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/ ) and it was fun, but I've since migrated back to just a single big honking box (specifically a threadripper 1920x running unraid). Monitoring & maintaining an array of systems was its own IT job that kinda didn't seem worth dealing with.
I just set up a test cluster at work to test this for you:
4 nodes, each node with 2x SAS SSDs, dual 25Gb NICs (one for front-end, one for back-end replication). The test pool is 3x replicated with Snappy compression enabled.
On a separate client (also with 25Gb) I mounted an RBD image with krbd and ran FIO:
For the standard 3x replicated setup, 3 nodes is the minimum for any kind of practical redundancy but you really want 4 so that after failure of 1 node all the data can be recovered onto the other 3 and still have failure resiliency.
For erasure coded setups, which are not really suited to block storage but mainly to object storage via radosgw (S3) or CephFS, you need a minimum of k+m nodes and realistically k+m+1. That would translate to 6 minimum but realistically 7 nodes for k=4,m=2. That's 4 data chunks and 2 redundant chunks, which means you use 1.5x the storage of the raw data (half that of a 3x replicated setup). You can do k=2,m=1 also, so 4 nodes in that case.
I would say the minimum is whatever your biggest replication or erasure coding config is, plus 1. So, with just replicated setups, that's 4 nodes, and with EC 4+2, that's 7 nodes. With EC 8+3, which is pretty common for object storage workloads, that's 12 nodes.
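To make the "biggest config plus 1" rule of thumb concrete, here is a small Python sketch of the arithmetic. The function names are made up for illustration and nothing here talks to Ceph; it only restates the k+m and k+m+1 numbers from the comments above.

```python
def ec_profile(k, m):
    """Summarise an erasure-coded pool with k data chunks and m coding chunks."""
    return {
        "overhead": (k + m) / k,         # raw bytes stored per byte of user data
        "min_hosts": k + m,              # one chunk per failure domain
        "recommended_hosts": k + m + 1,  # one spare domain to recover onto
    }


def replicated_profile(size):
    """Same summary for a pool that keeps `size` full copies."""
    return {
        "overhead": float(size),
        "min_hosts": size,
        "recommended_hosts": size + 1,
    }


if __name__ == "__main__":
    profiles = {
        "3x replicated": replicated_profile(3),
        "EC 2+1": ec_profile(2, 1),
        "EC 4+2": ec_profile(4, 2),
        "EC 8+3": ec_profile(8, 3),
    }
    for name, p in profiles.items():
        print(f"{name}: {p['overhead']:.3f}x raw, "
              f"min {p['min_hosts']} hosts, realistically {p['recommended_hosts']}")
```

"Hosts" here really means failure domains, so the same numbers apply if your failure domain is a rack instead of a node.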
Note, a "node" or a failure domain, can be configured as a disk, an actual node (default), a TOR switch, a rack, a row, or even a datacenter. Ceph will spread the replicas across those failure domains for you.
At work, our bigger clusters can withstand a rack going down. Also, the more nodes you have, the less of an impact it is on the cluster when a node goes down, and the faster the recovery.
I started with 3 RPis then quickly expanded to 6, and the only reason I have 9 nodes now is because that's all I could find.
Can I ask an off topic/in-no way RPi related question?
For larger ceph clusters, how many disks/SSD/nvme are usually attached to a single node?
We are in the middle of transitioning from a handful of big (3x60 disk, 1.5PB total) JBOD Gluster/ZFS arrays and I’m trying to figure out how to migrate to a ceph cluster of equivalent size. It’s hard to figure out exactly what the right size/configuration should be. And I’ve been using ZFS for so long (10+ years) that thinking of not having healing zpools is a bit scary.
For production, we have two basic builds, one for block storage, which is all-flash, and one for object storage which is spinning disks plus small NVMe for metadata/Bluestore DB/WAL.
The best way to run Ceph is to build as small a server as you can get away with economically and scale that horizontally to 10s or 100s of servers, instead of trying to build a few very large vertical boxes. I have run Ceph on some 4U 72-drive SuperMicro boxes, but it was not fun trying to manage hundreds of thousands of threads on a single Linux server (not to mention NUMA issues with multiple sockets). An ideal server would be one node to one disk, but that's usually not very economical.
If you don't have access to custom ODM-type gear or open-19 and other such exotics, what's been working for me have been regular single socket 1U servers, both for block and for object.
For block, this is a very normal 1U box with 10x SFF SAS or NVMe drives, single CPU, a dual 25Gb NIC.
For spinning disk, again a 1U box, but with a deeper chassis you can fit 12x LFF and still have room for a PCI-based NVMe card, plus a dual 25Gb NIC. You can get these from SuperMicro, Quanta, HP.
Your 3x60 disk setup sounds like it might fit in 12U (assuming 3x 4U servers). With our 1U servers I believe that can be done with 15x 1U servers (1.5 PiB usable would need roughly 180x 16TB disks with EC 8+3, you'll need more with 3x replication).
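For anyone who wants to redo the disk-count estimate with their own numbers, this is roughly the arithmetic in Python. The 80% maximum fill level is my own assumption for headroom (it is not stated above), and the helper name is hypothetical.

```python
import math

PIB = 2**50   # pebibyte, binary units
TB = 10**12   # terabyte, as printed on the drive label


def disks_needed(usable_bytes, disk_bytes, k, m, max_fill=1.0):
    """Disks required to hold `usable_bytes` of data in a k+m pool.

    max_fill is the fraction of raw capacity you are willing to fill
    (an assumption you pick, e.g. 0.8 to leave recovery headroom).
    N-way replication can be modelled as k=1, m=N-1.
    """
    raw_bytes = usable_bytes * (k + m) / k / max_fill
    return math.ceil(raw_bytes / disk_bytes)


if __name__ == "__main__":
    target = 1.5 * PIB
    print("EC 8+3, filled to 100%:", disks_needed(target, 16 * TB, 8, 3))
    print("EC 8+3, filled to 80%: ", disks_needed(target, 16 * TB, 8, 3, max_fill=0.8))
    print("3x replica, 80% fill:  ", disks_needed(target, 16 * TB, 1, 2, max_fill=0.8))
```

With the 80% fill assumption the EC 8+3 case lands close to the 180-disk figure, and the replicated case shows why you'd need a lot more disks with 3x replication.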
Of course, if you're trying to find absolute minimum requirements that you can get away with, we'd have to know a lot more details about your workload and existing environment.
EDITING to add:
Our current production disk sizes are either 7.68 or 15.36 TB for SAS/NVMe SSDs at 1 DWPD or less, and 8 TB for spinning disk. I want to move to 16 TB drives, but haven't done so for various tech and non-tech reasons.
I would love to hear more about your Ceph setup. Specifically, how are you connecting your drives, and how many drives per node? I imagine with the Pi's limited USB bus bandwidth, your cluster performs as more of an archive data store compared to realtime read/write like the backing block storage of VMs. I have been wanting to build a Ceph test cluster and it sounds like this type of setup might do the trick.
Each node is completely separate, housed in a good quality aluminum enclosure with a fan, and sitting on top of an external USB Seagate 2.5" portable drive (either 4TB or 5TB), connected via USB 3 cable. I'm pretty sure these drives are SMR, but they've been good to me, and they're fast enough for my needs.
Power is provided either using official RPi power supplies, or a couple of multi-port Anker USB power supplies that I had previously. A limit of 2.5 amps does not seem to cause any issues.
Currently everything is connected to a single switch, but I move things around my office sometimes, and sometimes have the RPis connected to two different switches.
Right now, everything including the 1 switch is connected to a single APC UPS, and that thing is super old, so that's another SPOF.
My clients currently are a few wired desktops and laptops over wifi, all connecting via CephFS. I haven't tested with librbd or krbd, I imagine it wouldn't be fast.
The RPis are mostly 8GB, but I do have a couple 4GB, and one RPi 400, which is kind of hilarious.
Everything is running Ubuntu 20.04, Ceph Pacific, and deployed from the first node with cephadm.
I use only Samsung microSD cards, either 32GB or 64GB. I don't think it matters what kind, but getting bigger cards makes me feel like they'll last longer. Most of the nodes have the var partition on the external drives (on a small partition at the beginning of the drive), but I do have a few where I didn't set that up early on, and haven't gotten around to redoing it.
I partition the drives and put LVM on manually, and tell cephadm to use the specific LV instead of the bare drive.
If you want any kind of performance, definitely set your expectations very low, but for me this works. I can stream at least a pair of 4K movies off this simultaneously, and I also run an instance of Paperless-NG off this over a CephFS mount and haven't had any issues.
I tried using Ceph twice at home. Once was via Proxmox, and it installed and ran perfectly fine, although tbf I didn't load it with much.
The next was via Rook, since I have a Kubernetes cluster, and it was a nightmare. I spent a week or so reading through all the docs I could find before I felt prepared to go through with it, only to have random clock sync issues that Reddit informed me were due to me enabling power savings mode in the BIOS for my nodes.
ZFS's biggest hiccup for me is when I do a kernel update and DKMS borks the upgrade. Other than that, it's been rock-solid. I run a normal and backup node with it, no regrets.
I solved the ZFS DKMS bork issue by moving to Debian 11 from Centos 8. I've had zero openZFS issues since the move. On Centos it would require work every time a sufficient kernel upgrade came in.
Since I'm familiar with RHEL I just swapped some of the Debian default services for RHELish alternatives (Firewalld, Podman, etc.).
I'm also using Debian; have been since Wheezy. I also template my VMs with Packer/Ansible though, so in the event it gets too messy or annoying to fix, I just export the pool and import it to a new VM.
The last issue I had was that linux-headers didn't get installed with the kernel update, so DKMS couldn't build on them. I assumed they were a dependency for zfs-dkms, but no.
> Plus with Ceph I can add just one disk at a time to the cluster and disks don't have to be the same size.
I'd like to note that ZFS now has RAID-Z expansion which allows us to do exactly that! It's an essential feature for home users since it allows us to gradually expand capacity instead of buying up all the storage up front at great cost.
I too researched ceph for this exact reason but was told the hardware requirements were too high for a typical home lab, yet you're running ceph on raspberry pis... I should probably look into ceph once more.
I'm also running Ceph (using the Rook Kubernetes operator) in my homelab. I've been running this setup for 9 months now with 2 cheap HP EliteDesk workstations I picked up on eBay and 2 8TB HDDs in each.
Since this setup has run incredibly smoothly so far, I plan on using SolidRun's HoneyComb LX2 as a Ceph node with bigger disks and an NVMe write cache in the future. I looked at the Raspberry Pi 4, but was not too impressed by the single PCIe 2.0 lane, since I also plan on using NVMe disks as Ceph's metadata storage device to speed up the hard disks holding the normal data, and Ceph recommends 10GbE NICs.
The HoneyComb LX2 has 4 built in 10GbE ports, 16 A72 cores, actual DDR4 RAM slots, a 4 lane PCIe 3.0 m.2 slot and an open-ended (so you can put in a full x16 device) PCIe 3.0 slot with 8 lanes for a max of 8Gbyte/s bandwidth.
Since it's an ARM box it's incredibly energy efficient, which is important since energy prices are increasing in my country. Also, it's the only affordable performant ARM device at 800 USD.
Man, Ceph really doesn't get enough love. For all the distributed systems hype out there - be it Kubernetes or blockchains or serverless - the ol' rock solid distributed storage systems sat in the background iterating like crazy.
We had a huge Rook/Ceph installation in the early days of our startup before we killed off the product that used it (sadly). It did explode under some rare unusual cases, but I sometimes miss it! For folks who aren't aware, a rough TLDR is that Ceph is to ZFS/LVM what Kubernetes is to containers.
This seems like a very cool board for a Ceph lab - although - extremely expensive - and I say that as someone who sells very expensive Raspberry Pi based computers!
Ceph is fantastic. I use it as the storage layer in my homelab. I've done some things that I can only concisely describe as super fucked up to this Ceph cluster, and every single time I've come out the other side with zero data loss, not having to restore a backup.
Care to provide examples of what these things were that you were doing to a storage pool? I guess I'm just not imaginative enough to think about ways of using a storage pool other than storing data in it.
In our case we were a free-to-use-without-any-signup way of testing Kubernetes. You could just go to the site and spin up pods. Looking back, it was a bit insane.
Anyways, you can imagine we had all sorts of attacks and miners or other abusive software running. This on top of using ephemeral nodes for our free service meant hosts were always coming and going and ceph was always busy migrating data around. The wrong combo of nodes dying and bursting traffic and beta versions of Rook meant we ran into a huge number of edge cases. We did some optimization and re-design, but it turned out there just weren't enough folks interested in paying for multi-tenant Kubernetes. We did learn an absolute ton about multi-tenant K8s, so, if anyone is running into those challenges, feel free to hire us :P
not OP, but I would start with filling disk space up to 100%, or creating zillions of empty files. In case of distributed filesystems - maybe removing one node (under heavy load preferably), or "cloning" nodes so they had same UUIDs (preferably nodes storing some data on them - to see if the data will be de-duplicated somehow).
The worst that comes to mind for me was a node failure in the middle of a major version upgrade. Not likely a big deal for proper deployments, but I don't have enough nodes to tolerate complete node failure for most of my data.
Grabbed a new root/boot SSD, reinstalled the OS, reinstalled the OSDs on each disk, told Ceph what OSD ID each one had previously (not actually sure if that was required), and....voila, they just rejoined the cluster and started serving their data like nothing ever happened.
I think many people (myself included) had been burned by major disasters on earlier clustered storage solutions (like early Gluster installations). Ceph seems to have been under the radar for a bit of time when it got to a more stable/usable point, and came more into the limelight once people started deploying Kubernetes (and Rook, and more integrated/holistic clustered storage solutions).
So I think a big part of Ceph's success (at least IMO) was its timing, and its adoption into a more cloud-first ecosystem. That narrowed the use cases down from what the earliest networked storage software was trying to solve.
We're more and more feeling we made the wrong call with gluster... The underlying bricks being a POSIX fs felt a lot safer at the time but in hindsight ceph or one of the newer ones would probably have been a better choice. So much inexplicable behavior. For your sake I hope the grass really is greener.
Can someone with experience with Ceph and MinIO or SeaweedFS comment on how they compare?
I currently run a single-node SnapRAID setup, but would like to expand to a distributed one, and would ideally prefer something simple (which is why I chose SnapRAID over ZFS). Ceph feels too enterprisey and complex for my needs, but at the same time, I wouldn't want to entrust my data to a simpler project that can have major issues I only discover years down the road.
SeaweedFS has an interesting comparison[1], but I'm not sure how biased it is.
SeaweedFS has problems with large "pools". It's based on an old Facebook paper (Haystack) and is meant for block storage to distribute large image caches. I found it mediocre at best, as its documentation was lacking, performance was lacking (in my tests), and the multitude of components was hard to get working.
The idea behind it is that every daemon uses one large file as data store to skip slow metadata access. There are different ways to access the storage over gateways.
MinIO has changed so much in the last few years that I can't give a competent answer, but compared to SeaweedFS it uses many small local databases. Right now it's deprecating many features like the gateway, and it is split into 2 main components (CLI and server). Compared to SeaweedFS, the deployment is dead simple, but I don't know which direction the project is going. From what I saw, it went from a normal open source project to a more business-like deal, but like I said, I didn't quite follow the process.
Ceph is based on block storage. It offers an object gateway (S3/Swift), a filesystem (CephFS) and block storage (RBD). You can access everything through librados directly as well. For a minimal setup you need a "larger" cluster, but it is the most flexible solution (imho). It uses the most resources as well, but you can do nearly everything you want with it without limit.
SeaweedFS author here. Thanks for your candid answer. You do not need to use multiple SeaweedFS components. Just download the binary and run "weed server -s3".
There are many other components, but you do not really need to use them. This default mode should be good enough for most cases. I have often seen people try to optimize too early, which is often unnecessary and sometimes done in the wrong way.
I would like to know what kind of setup you are running. It should beat most other options if the use case needs lots of small files, e.g. millions or billions of files. If it's just a small use case, e.g. a few personal files, it would be overkill.
Another aspect is how to increase capacity for existing clusters. It should be most simple for SeaweedFS, just start one more volume server. And it will linearly increase the throughput.
Yeah sorry my answer was more than insufficient to be honest. I wrote it _in bed_ and was embarrassed the next day because it was of really low quality. Thought of expanding it later. So yeah I screwed the pooch here and I'm sorry, I will try to do better now by expanding on my answer.
First of all this is all from memory and I didn't try seaweedfs again for this.
So first things first. I evaluated SeaweedFS for HPC cluster usage in 2020 (oh my, this is some time ago), but my test setup was VMs. I tried it with many small and larger files and it didn't scale at all (at least when I tested it) for parallel loads. The response time was acceptable, but the throughput was very low. When I tried it, "weed server" spun up everything more or less fine, but had problems binding correctly so that a distributed setup worked. Based on the wiki documentation I configured a master server, a filer and a few volume servers (iirc).
My main gripes at that time were as follows:
* the syntax of the different clients was inconsistent
* the throughput was rather low
* the components didn't work well together in a certain configuration and I had to specify many things manually
* the wiki was lacking
I tried filer (fuse), s3 and hadoop. s3 wasn't compatible enough to work with everything I tried with it so I spun up a minIO instance as a gateway to test the whole thing.
When working over a longer period I had some hangs as well.
That's sadly everything I remember on it but I made a presentation if you are interested I can look for it and give you the benchmarks I tried and the limitations I found (although they will be all HORRIBLY out of date). When I tested it there were 2 versions with different size limitations iirc. I just now looked over your gitlab releases and can't find these.
Sorry again if I misrepresented SeaweedFS here with my outdated tests. I looked at the GitHub wiki and it looks much better than when I last played with it. I will give it a spin again soon, and if I find my old experience of it to be not representative, maybe write something about it and post it here.
---
MinIO, when I tried it, was mainly an S3 server and gateway. It had a simple web framework that allowed you to upload and share files. One of our use cases that we thought we could use MinIO for was as a bucket browser/web interface. It was easy to set up as a server as well. Like I said, I didn't track it after testing it for about a month. Today it boasts about its performance and AI/ML use cases. Here is their pricing model https://min.io/pricing and you can see how they add value to their product.
---
Ceph is, like I said, the most complex product of the three, with the most components that need to be set up (even though it's quite easy now). Performance is being optimized in their Crimson project https://next.redhat.com/2021/01/18/crimson-evolving-ceph-for... (this is a WIP and not enabled by default). It's not the most straightforward to tune since many small things can lead to big performance gains and losses (for instance the erasure code k and m you choose), but I found that the defaults got more sane with time.
Thanks for the detailed clarification! I am too deep into the SeaweedFS low level details and am all ears on how to make it simpler to use. SeaweedFS has weekly releases and is constantly evolving.
Depending on your case, you may need to add more filers. UCSD has a setup that uses about 10 filers to achieve 1.5 billion iops. https://twitter.com/SeaweedFS/status/1549890262633107456 There are many AI/ML users switching from MinIO or CEPH to SeaweedFS, especially with lots of images/text/audio files to process.
I found MinIO benchmark results are really, well, "marketing". MinIO is basically just an S3 API layer on top of the local disks. Any object is mapped to at least 2 files on disk, one for metadata and one for the object itself.
Thanks for your perspective. Ceph does sound the most appealing for my use case. I'm hoping that the learning curve is mild, and that it has a mostly set-and-forget UX.
I love it, but when it fails at scale, it can be hard to reason about. Or at least that was the case when I was using it a few years back. Still keen to try it again and see what's changed. I haven't run it since bluestore was released.
Yeah, I've been running a small Ceph cluster at home, and my only real issue with it is the relative scarcity of good conceptual documentation.
I personally learned about Ceph from a coworker and fellow distributed systems geek who's a big fan of the design. So I kind of absorbed a lot of the concepts before I ever actually started using it. There have been quite a few times where I look at a command or config parameter, and think, "oh, I know what that's probably doing under the hood"... but when I try to actually check that assumption, the documentation is missing, or sparse, or outdated, or I have to "read between the lines" of a bunch of different pages to understand what's really happening.
I've run Ceph at two Fortune 50 companies since 2013 to now, and I've not lost a single production object. We've had outages, yes, but not because of Ceph, it was always something else causing cascading issues.
Today I have a few dozen clusters with over 250 PB total storage, some on hardware with spinning rust that's over 5 years old, and I sleep very well at night. I've been doing storage for a long time, and no other system, open source or enterprise, has given me such a feeling of security in knowing my data is safe.
Any time I read about a big Ceph outage, it's always a bunch of things that should have never been allowed in production, compounded by non-existent monitoring, and poor understanding of how Ceph works.
Can you talk about the method that Ceph has for determining whether there was bit rot in a system?
My understanding is that you have to run a separate task/process that has Ceph go through its file structures and check it against some checksums. Is it a separate step for you, do you run it at night, etc.?
Just to add to the other comment: Ceph checksums data and metadata on every read/write operation. So even if you completely disable scrubbing, if data on a disk becomes corrupted, the OSD will detect it and the client will transparently fail over to another replica, rather than seeing bad data or an I/O error.
Scrubbing is only necessary to proactively detect bad sectors or silent corruption on infrequently-accessed data, so that you can replace the drive early without losing redundancy.
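If it helps to see the idea, here is a toy Python sketch of checksum-on-read with failover to another replica. It is not how the OSD code is actually structured, and read_verified is a made-up helper: in real Ceph the checksums are stored and checked per OSD, not by the client. The checksum itself is a plain bitwise CRC-32C (Castagnoli polynomial), the same algorithm family as the "crc32c" default mentioned upthread.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def read_verified(replicas, stored_crc):
    """Return (data, replica_index) for the first copy whose checksum matches.

    `replicas` is a list of byte strings standing in for the copies on
    different disks; `stored_crc` is the checksum recorded at write time.
    """
    for index, blob in enumerate(replicas):
        if crc32c(blob) == stored_crc:
            return blob, index
    raise IOError("every replica failed its checksum")


if __name__ == "__main__":
    good = b"some object data"
    rotted = bytes([good[0] ^ 0x01]) + good[1:]   # one flipped bit
    data, used = read_verified([rotted, good], crc32c(good))
    print("served from replica", used, "->", data)
```

The point is just that a single flipped bit changes the CRC, so the bad copy is never handed back to the reader even if no scrub has run yet.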
By default it “scrubs” basic metadata daily and does a deep scrub where it fully reads the object and confirms the checksum is correct from all 3 replicas weekly for all of the data in the cluster.
So what amount of disk bandwidth/usage is involved?
For instance, say that I have 30TB of disk space used and it is across 3 replicas, thus 3 systems.
When I kick off the deep scrub operation, what amount of reads will happen on each system? Just the smaller amount of metadata or the actual full size of the files themselves?
In Ceph, objects are organized into placement groups (PGs), and a scrub is performed on one PG at a time, operating on all replicas of that PG.
For a normal scrub, only the metadata (essentially, the list of stored objects) is compared, so the amount of data read is very small. For a deep scrub, each replica reads and verifies the contents of all its data, and compares the hashes with its peers. So a deep scrub of all PGs ends up reading the entire contents of every disk. (Depending on what you mean by "disk space used", that could be 30TB, or 30TBx3.)
The deep scrub frequency is configurable, so e.g. if each disk is fast enough to sequentially read its entire contents in 24 hours, and you choose to deep-scrub every 30 days, you're devoting 1/30th of your total IOPS to scrubbing.
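The back-of-the-envelope version of that, as a Python sketch (function names are mine, and it deliberately ignores scrub scheduling windows and seek overhead):

```python
def deep_scrub_share(full_read_days, interval_days):
    """Fraction of each disk's time spent deep-scrubbing.

    full_read_days: how long one disk takes to read its whole contents.
    interval_days:  how often every PG gets deep-scrubbed.
    """
    return full_read_days / interval_days


def bytes_read_per_cycle(logical_bytes, replicas):
    """Total bytes read cluster-wide by one full deep-scrub pass."""
    return logical_bytes * replicas


if __name__ == "__main__":
    TB = 10**12
    print(f"disk time spent scrubbing: {deep_scrub_share(1, 30):.1%}")
    print(f"bytes read per pass: {bytes_read_per_cycle(30 * TB, 3) / TB:.0f} TB")
```

So if the 30TB is the logical data size, a full deep-scrub pass reads 30TB on each of the 3 replicas; if 30TB is already the raw total across disks, the pass reads 30TB in total.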
Note that "3 replicas" is not necessarily the same as "3 systems". The normal way to use Ceph is that if you set a replication factor of 3, each PG has 3 replicas that are chosen from your pool of disks/servers; a cluster with N replicas and N servers is just a special case of this (with more limited fault-tolerance). In a typical cluster, any given scrub operation only touches a small fraction of the disks at a time.
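To illustrate the "replicas are picked from a pool of hosts" point, here is a toy placement function in Python. To be clear, this is NOT CRUSH: it uses plain rendezvous hashing as a stand-in, and the host names and PG ids are invented. It only shows the property that matters here: each PG deterministically lands on 3 distinct hosts out of however many you have.

```python
import hashlib


def place_pg(pg_id, hosts, replicas=3):
    """Pick `replicas` distinct hosts for a PG by ranking hash(pg_id, host)."""
    if replicas > len(hosts):
        raise ValueError("not enough hosts for the requested replica count")

    def score(host):
        # Deterministic pseudo-random rank for this (PG, host) pair.
        return hashlib.sha256(f"{pg_id}:{host}".encode()).hexdigest()

    return sorted(hosts, key=score)[:replicas]


if __name__ == "__main__":
    hosts = [f"node{i}" for i in range(9)]
    for pg in ("1.0", "1.1", "1.2", "1.3"):
        print(pg, "->", place_pg(pg, hosts))
```

With 9 hosts, scrubbing one PG touches only 3 of them. With exactly 3 hosts, every PG necessarily lands on all 3, which is the special case described above.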
Not to be too glib, but "a bunch of things that should have never been allowed in production, compounded by non-existent monitoring, and poor understanding of how <thing> works" is the cause of most outages.
I will once again lament the fact that WD Labs built SBCs that sat with their hard drives to make them individual CEPH nodes but never took the hardware to production. It seems to me there's still a market for SBCs that could serve a CEPH OSD on a per-device basis, although with ever increasing density in the storage and hyperconverged space that's probably more of a small business or prosumer solution.
Yeah, those were really cool. I saw some homelab setups using the ODroid HC2 from Hardkernel in a similar way.
The 2 issues with this setup were that the HC2 was using a low-performance armv7 processor, with armv7 being a very unsupported platform by most software and the fact that you can't use a flash-based disk for a ceph bluestore metadata device since it only had one SATA port.
Credit where it's due - this is some 18 watt awesomeness at idle. Is it more "practical" than doing a Mini-ITX (or smaller, like one of those super small tomtom with up to 5900HX) build and equipping it with one or more NVMe expansion cards? Probably not. But it's cool.
Now, if there were a new Pi to buy. Isn't it time for the 5? It's been 3 years, for most of which they've been hard to find. Mine broke and I really miss it because having a full blown desktop doing little things makes no sense, especially during the summer.
18 W idle is kinda horrible if you just want a small server (granted, this isn't one server, but instead six set-top boxes in one). That's recycled entry-level rack server range, which come with ILOM/BMC. Most old-ish fat clients can do <10 W, some <5 W, no problem. If you want a desktop that consumes little power when idle or not loaded a lot, just get basically any Intel system with an IGP since 4th gen (Haswell). Avoid Ryzen CPU with dGPU if that's your goal; those are gas guzzlers.
1. I would bet at least half of all that wattage is the SSDs.
2. Buddy, you're spewing BS at someone who used to run a Haswell in a really small Mini-ITX case. It was a fine HTPC back in 2014. But now everything, bar my dead Pi, is some kind of Ryzen. All desktops and laptops. The various 4800u/5800u/6800u and lower parts offer tremendous performance at 15W nominal power levels. The 5800H I am writing this message on is hardly a guzzler, especially when compared to Intel's core11/12 parts.
This random drive-by intel shilling really took me by surprise.
If someone is trying to find a Pi, you can try the Telegram bot I made for rpilocator.com. It will notify you as soon as there is stock, with filters for specific Pis and your location/preferred vendor.
Would buy this in an instant if it weren't hobbled as hell by the onboard realtek switch. If it had an upstream 2.5/5/10g port it would be instantly 6 times more capable.
Would 6 Pis be able to handle more than 1g? It says that they got around 70MB write and 100MB read. 2.5/5/10 seems like it would be a waste unless I'm overlooking something
The AXI bus internal to the Pi's SoC is only capable of about 4gbps, and it carries DMA, so ~2gbps is more or less the hard limit for any kind of combined IO operation like disk<=>network no matter what kind of hardware you use for disk and network.
So yes, each pi can easily saturate its own 1gbps interface, so a system like ceph that parallelizes reads and writes among nodes is severely crippled by the onboard switch choking off bandwidth to external clients. For the same reason, you can't easily scale this platform beyond a single board, which puts your clustered system back into a single point of failure.
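A rough Python model of that bottleneck, using only the figures quoted in this thread (about 4 Gbps on the AXI bus, halved because disk-to-network data crosses it twice, and 1 Gbps NICs). The function names are mine, and it ignores replication traffic between nodes, so treat it as an upper bound rather than a prediction.

```python
def node_storage_gbps(axi_gbps=4.0, nic_gbps=1.0):
    """Best-case disk<->network rate for one Pi: bus limit halved, capped by NIC."""
    return min(nic_gbps, axi_gbps / 2)


def cluster_external_gbps(nodes, uplink_gbps, axi_gbps=4.0, nic_gbps=1.0):
    """Best-case rate to clients outside the board: sum of nodes, capped by uplink."""
    return min(uplink_gbps, nodes * node_storage_gbps(axi_gbps, nic_gbps))


if __name__ == "__main__":
    print("6 Pis behind a 1G uplink: ", cluster_external_gbps(6, 1.0), "Gbps")
    print("6 Pis behind a 10G uplink:", cluster_external_gbps(6, 10.0), "Gbps")
    print("one Pi with a 2.5G NIC:   ", node_storage_gbps(nic_gbps=2.5), "Gbps")
```

Under these assumptions the 1G uplink caps six nodes at the throughput of one, which is where the "6 times more capable" estimate comes from.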
The internal AXI bus on the Pi SoC can only move data at ~4gbps by DMA, so no, the Pi 4 simply cannot round trip real data from disk to network any faster than about 2gbps. It can saturate a 2.5gbps NIC with a synthetic datastream, but for a storage application the 1gbps interface is as much as a single Pi is going to realistically be capable of.
> Many people will say "just buy one PC and run VMs on it!", but to that, I say "phooey."
I mean with VM-leaking things like Spectre (not sure how much similar things affect ARM tbh) having physical barriers between your CPUs can be seen as a positive thing.
Sure, it's just that the Raspberry Pi isn't really fast enough for most production workloads. Having a cluster of them doesn't really help, you'd still be better off with a single PC.
As a learning tool, having the ability to build a real hardware cluster in a Mini-ITX case is awesome. I do sort of wonder what the business case for these boards is. I mean, are there actually enough people who want to do something like this... schools maybe? I still think it's beyond weird that there is so much hardware available for building Pi clusters, but I can't get an ARM desktop motherboard, with a PCI slot capable of actually being used as a desktop, for a reasonable price.
I think a lot of these types of boards are built with the business case of either "edge/IoT" (which still for some reason causes people to toss money at them since they're hot buzzwords... just need 5G too for the trifecta), or for deploying many ARM cores/discrete ARM64 computers in a space/energy-efficient manner. Some places need little ARM build farms, and that's where I've seen the most non-hobbyist interest in the CM4 blade, Turing Pi 2, and this board.
The future of cloud is Zero Isolation... With all the mitigation slowing it down, and the current energy prices and rising, having super-small nodes that are always reserved to one task seems interesting.
Unless you are constrained in space to a single ITX case as in this example, you can get whole x86 machines for <$100 with RAM and storage included.
There is a lot of choice in the <$150 range. You could get eight of these and a cheap 10-port switch for any kind of clustering lab you want to set up.
Surprised someone mentioned these. I picked up 2 of them a while back when you could get them individually for ~$30. I'm using one at a vacation home with zigbee/zwave2mqtt which then lets me integrate it all into home assistant at the main house. Been super stable and small enough to tuck away in a closet.
Only downside is the limited 8gb storage, but it's enough for a minimal linux install with a couple containers. You just can't beat a regular x86 pc when it comes to just being able to install an up to date distro of choice.
A low-end x86 CPU will perform better than the RasPis. My current NAS is an Intel G4560 with 40GB of RAM and 4 HDDs and it barely does over ~40W on average. The article's cluster does 18W, which is better, but even over a year that's only a 192kWh difference (assuming that it runs all the time), which would amount to about $40 at $0.20/kWh.
It's not really worth comparing further as the configurations are significantly different, but if your goal is doing 110MB/s R/W, even when accounting for power consumption the product in the article is much more expensive.
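For anyone who wants to plug in their own numbers, the power-cost estimate above works out roughly like this. The wattages and the $0.20/kWh price are the ones quoted in the comment; the function name is just made up for this sketch, and it assumes both machines run 24/7 at their average draw.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the machines are never powered off


def yearly_cost_difference(watts_a: float, watts_b: float, price_per_kwh: float):
    """Return (kWh difference per year, cost difference per year)."""
    kwh = (watts_a - watts_b) * HOURS_PER_YEAR / 1000
    return kwh, kwh * price_per_kwh


# 40 W NAS vs. 18 W Pi cluster at $0.20/kWh, as in the comment.
kwh, cost = yearly_cost_difference(40, 18, 0.20)
print(f"{kwh:.0f} kWh/year, about ${cost:.2f}")
```

That lands a little under the "about $40" quoted above, so the rounding in the comment is on the generous side.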
I don't know much about NAS and thought they were just a bundle of drives with some [media] access related apps on a longer cable ... 40G RAM? What's that for, is it normal for a NAS to be so loaded? I was looking at NAS and people were talking about 1G as being standard (which conversely seemed really low).
G4560 suggests you're not processing much, is the NAS caching a lot?
Even for mainstream x86 Intel chips, idle power consumption is mostly down to peripherals, power supply (if you build a small NAS that idles on 2-3 W on the 12 V rail and can't pull more than 50 W, don't use a 650 W PSU), cooling fans, and whether someone forgot to enable power management.
Picked one up off craigslist for ~$50 and use it as a plex transcoder since it has QuickSync and can simultaneously transcode around 20 streams of 1080p content.
I hear it's still possible, through heretic magic, to limit a CPU's power draw and, most importantly, it will not affect speed on any level (load will increase).
There's even people selling their souls to the devil for the ability to control the actual voltage of their chips, increasing performance per watt drawn!
But only Gods and top OSS contributors can control the power draw of chips integrated into drives/extension cards/etc
Here is a YouTube review of the linked model with input power listed at 12 volt, 1.5 amp (link to timestamp of bottom of unit): https://youtu.be/56UA2Uto1ns?t=129
Adafruit had some in stock a few minutes ago: https://twitter.com/rpilocator ... I think every Wednesday around 11am ... I almost got one this time, but because they had me set up 2FA I couldn't check out in time.
I think there's something to be said for sizing a raspberry pi or a clone to fit into a hard drive slot.
I also think the TuringPi people screwed up with the model 2. Model 2 of a product should not have fewer slots than the predecessor, and in the case of the Turing Pi, orchestrating 4 devices is not particularly compelling. It's not that difficult to wire 4 Pi's together by hand. I had 6 clones wired together using risers and powered by my old Anker charging station and an 8 port switch, with a few magnets to hold the whole thing together.
I find these boards - that get little boards and network them into a cluster - very interesting. I'd like to see more of these in the future. I hope someone makes a Framework-compatible motherboard with these at some point.
The Intel Edison module could have been a viable building block for one (and it happened a long time ago in computing terms - 2014) - it was self-contained, with RAM and storage on the module - but it lacked ethernet to connect multiple ones on a board - and I don't remember it having a fast external bus to build a network on. And it was quickly discontinued.
> I was able to get 70-80 MB/sec write speeds on the cluster, and 110 MB/sec read speeds. Not too bad, considering the entire thing's running over a 1 Gbps network. You can't really increase throughput due to the Pi's IO limitations—maybe in the next generation of Pi, we can get faster networking!
110MB/s is gigabit. It’s limited to gigabit networking and only has 1Gbps out from the cluster board. So there’s no way to do an aggregate speed of more than 1Gbps (about 110MB/s) on this particular cluster board.
If each Pi’s Ethernet was broken out individually and you used a 10G uplink switch or multiple 1G client ports then you could do better.
The write speed being lower than the read speed will be because writes have to be replicated to two other nodes in the ceph cluster (everything has 3 replicas) which are also sharing bandwidth on those same 1G links. Reads don’t need to replicate so can consume the full bandwidth.
So basically it’s all network limited for this use case. Needs a 2.5G uplink, LACP link aggregation or individual Ethernet ports to do better.
Just a random search on Mouser, but something like the BCM53134O[1] has four 1GbE ports and one 2.5GbE port. A bit pricier, you have the BCM53156XU[2] with eight 1GbE ports and a 10G port for fiber.
The device has an onboard 8 port unmanaged gigabit switch. The two external ports are just switch ports and cannot be aggregated in any way. The entire cluster is therefore limited effectively to 1gbps throughput.
IMO it ruins the product utterly and completely. They should have integrated a switch IC similar to what's used in the netgear gs110mx which has 8 gigabit and 2 multi-gig interfaces.
It would be really cool if they could split out 2.5G networking to all the Pis, but with the current generation of Pi it only has one PCIe lane, so you'd have to add in a PCIe switch for each Pi if you still wanted those M.2 NVMe slots... that adds a lot of cost and complexity.
Failing that, a 2.5G external port would be the simplest way to make this thing more viable as a storage cluster board, but that would drive up the switch chip cost a bit (cutting into margins). So the last thing would be allowing management of the chip (I believe this Realtek chip does allow it, but it's not exposed anywhere), so you could do link aggregation... but that's not possible here either. So yeah, the 1 Gbps is a bummer. Still fun for experimentation, and very niche production use cases, but less useful generally.
110 Megabytes/sec == 880 Megabits/sec, which is approaching the top speed of the network interface, which is the main bottleneck. A board with more IO, like the rk3568 which has 2x PCIe 2 lanes and 2x PCIe 3 lanes, or a hypothetical rpi5, can deliver more throughput.
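The unit conversion behind that comparison can be sketched as follows. The 70-80 MB/s write and 110 MB/s read figures are the ones quoted from the article earlier in the thread; the 1000 Mbit/s line rate is the nominal gigabit figure and ignores Ethernet/TCP framing overhead, so the real usable ceiling is somewhat lower.

```python
LINK_MBITS = 1000  # nominal 1 Gbps line rate, before protocol overhead


def mbytes_to_mbits(mb_per_s: float) -> float:
    """Convert megabytes per second to megabits per second (8 bits per byte)."""
    return mb_per_s * 8


# Throughput figures quoted from the article.
for label, mb_s in [("read", 110), ("write (low)", 70), ("write (high)", 80)]:
    mbits = mbytes_to_mbits(mb_s)
    print(f"{label}: {mb_s} MB/s = {mbits} Mbit/s ({mbits / LINK_MBITS:.0%} of line rate)")
```

Reads come out at 880 Mbit/s, close to what a single gigabit link can carry once overhead is accounted for, while writes sit noticeably below that.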
Do you think Raspberries without ECC RAM are fine to use for a Ceph storage cluster? I did some research yesterday on the same topic, and many say ECC RAM is essential for Ceph (and ZFS too). But I'm not sure what to believe. Sure, data could get corrupted in RAM before being written to the cluster, but how likely is that?
DDR5 on-die ECC is not the same as traditional ECC. To that point, there are DDR5 modules with full ECC. On-die DDR5 ECC is there because it needs to be for the modules to really work at all.
Imagine being able to buy 6 Raspberry Pis! I have so many projects I'd like to do, both personal and semi-commercial, but it's been literal years since I've seen a Raspberry Pi 4 available in stock somewhere in the USA, let alone 6.
If you are not constrained to the Pi's form factor, there are lots of options for more powerful x86 systems for <$150 per system, often with RAM and storage included. See the linked thread for more details. If you are willing to go used, you can get good rates on lots of multiple thin clients (sample in linked thread).
Micro Center often has them in stock. My local Micro Center currently has 17 RPi 4 4GB in stock. They are available only in store, but you can check stock on their website. Find a store near someone you know who is willing to purchase one for you and ship it.
I have a Micro Center within 15 minutes. I've checked them a handful of times with no luck. When they do have them, it's usually only the Pi 400s. They've also really jacked up the prices at the one near me, around $80 for a Pi 4 4GB and $120 for the 8GB version.
This looks incredible. Is it possible to expose a full PCI interface from an NVMe slot? I have an old SAS controller that I want to keep running. If I could do that from a Pi, that would be incredible.
Which is probably for the best - I don't know how these newer cards behave, but a commonality of all the older RAID/HBA cards seems to be "no power management allowed". Maybe they improved that area, because it's pretty unreasonable for an idle RAID card to burn double digit Watts if you ask me...
The 9405W cards I most recently tested seem to consume about 7W steady state (which is more than the Pi that was driving it!), so yeah... they're still not quite as efficient as running a smaller SATA card if you just need a few drives. But if you want SAS or Tri-mode (NVMe/SAS/SATA) and have an HBA or RAID card, this is a decent enough way to do it!
There's no such thing as an NVMe slot, just M.2 slots that have PCIe lanes. NVMe is a protocol that runs on top of PCIe, and is something that host systems support at the software level, not in hardware. (Firmware support for NVMe is necessary to boot off NVMe SSDs, but the Pi doesn't have that and must boot off eMMC or SD cards.)
I don't see how they could have hooked a 2.5Gb/s Ethernet NIC to the CM4 modules without using up the single PCI-e 2.0 lane other than adding a power hungry, expensive and often out of stock PCI-e switching chip.
This looks really cool!
There was a tutorial posted on HN about building a mobile proxy pool with an RPi that had obvious limitations: https://scrapingfish.com/blog/byo-mobile-proxy-for-web-scrap...
It seems this could be a solution to scale capabilities of a single RPI.
Considering that this is custom made for the CM4 form factor, the Turing Pi with carrier boards looks much more attractive because it is future-proof. If only it were already available.
It also has SATA and USB 3.0 which is nice
Until I can preorder one I will slowly stock up on CM4s and hope I’ll get there before pi5 comes out.
pastel-mature-herring~> Is this where compute is going?
awesome-zebra*> There is no definitive answer to this question, as the direction of compute technology is always changing and evolving. However, the trend in recent years has been towards smaller, more powerful devices that are able to pack more processing power into a smaller form factor. The DeskPi Super6c is an example of this trend, as it offers a trim form factor and six individual Raspberry Pi Compute Module 4s, each of which offers a high level of processing power.
https://rpilocator.com/ is probably the best place to keep an eye out for them. This is unfortunately also the case for non-CM rpis. Been wanting to get some more pi4s to replace some rather old pi3 (non+) that i've got running just because i want the uefi boot on everything since it makes managing things that much easier.
So far it seems like "maybe 2023"... this year supplies have been slightly better, but not amazing. Usually a couple CM4 models pop up over the course of a week on rpilocator.com.