I have to imagine that by making Storage Pods 1.0 through 6.0, they "encouraged" Dell (and other manufacturers) to see this particular 60+ hard drive server as a good idea.
And now that multiple "storage pod-like" systems exist in the marketplace (not just from Dell, but also Supermicro), selling 60-bay or 90-bay 3.5" hard drive storage servers in 4U rack form factors, there's not much reason for Backblaze to build their own?
At least, that's my assumption. After all, if the server chassis is a commodity now (and it absolutely is), there's no point in doing small custom runs for a hypothetical Storage Pod 7. Economies of scale are too big a benefit (worst case scenario: it's now Dell's or Supermicro's problem rather than Backblaze's).
EDIT: I admit that I don't really work in IT, I'm just a programmer, so I don't really know how popular 4U / ~60 HDD servers were before Backblaze's Storage Pod 1.0.
3U and 4U x86 whitebox servers designed for any standard 12"x13" motherboard, where the entire front panel was hot-swap 3.5" HDD bays, were already a thing many, many years before Backblaze existed.
What wasn't really a thing was servers with hot-swap HDD trays on both ends (like the Supermicros), or designs with vertical hard drives dropped in through a top-opening lid to achieve even higher density.
I remember when I worked at a company that was a NetApp shop, and in 2007 Sun sent one of the ZFS developers and Andy Bechtolsheim to sell us Thumper and ZFS. The developer and I had a spirited discussion of the relative merits of having volume management built into the filesystem.
At the time I advised my company to skip the storage servers (which were overdesigned IMHO, but unquestionably stout hardware), but said that ZFS was an interesting filesystem to explore.
Hah! That was the system my small university IT department bought to act as a storage server. Actually they bought 2, with one to act as a backup.
Unfortunately for them, no one knew how to administrate Solaris, so they... installed Windows. No idea if they actually succeeded in doing anything useful with the systems.
That's not the backblaze design. The backblaze design is that the drives are individually hot-swappable without a tray. 60 commodity SATA drives that can be removed and serviced individually while a 4U server continues to operate normally is pretty amazing.
> things that were designed with vertical hard drives dropped down from a top-opening lid to achieve even higher density.
The Backblaze Storage Pod system and some 3rd-party derived designs have the SATA/power ports on PCBs facing upwards, mounted at the bottom interior of the rack chassis, with 3.5" HDDs mounted vertically down into the system.
You'd think they could build an ARM-powered, credit-card-sized controller for them with a disk breakout card and network I/O. A PC motherboard and full-sized cards seem like overkill.
They're running a fair bit of math (probably Reed-Solomon matrix multiplications for error correction) over all the data accesses.
Given the bandwidth of 60+ hard drives (150MB/s per hard drive x 60 == 9GB/s in/out), I'm pretty sure you need a decent CPU just to handle the PCIe traffic. At least PCIe 3.0 x16, just for the hard drives. And then another x16 for network connections (multiple fiber PHYs in/out that can handle that 9GB/s to a variety of switches).
We're looking at PCIe 3.0 x32 just for HDDs and networking. Throw down an NVMe cache or other stuff and I'm not seeing any kind of small system working out here.
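For a rough sense of those numbers, here's a back-of-the-envelope sketch (the 150MB/s per drive and ~985MB/s of usable bandwidth per PCIe 3.0 lane are my assumptions, not figures from the article):

    # Back-of-the-envelope I/O budget for a 60-drive chassis.
    # Assumed figures: ~150 MB/s sustained sequential per 3.5" HDD,
    # ~985 MB/s usable per PCIe 3.0 lane, per direction, after encoding overhead.
    DRIVES = 60
    MB_PER_DRIVE = 150     # MB/s, sustained sequential (assumption)
    LANE_MB = 985          # usable MB/s per PCIe 3.0 lane, each direction

    disk_bw = DRIVES * MB_PER_DRIVE          # aggregate drive bandwidth
    lanes_for_disks = disk_bw / LANE_MB      # lanes just for the HBAs
    lanes_for_network = disk_bw / LANE_MB    # roughly the same again for the NICs

    print(f"aggregate drive bandwidth: {disk_bw / 1000:.1f} GB/s")
    print(f"PCIe 3.0 lanes for drives:  ~{lanes_for_disks:.0f}")
    print(f"PCIe 3.0 lanes for network: ~{lanes_for_network:.0f}")
    # ~9 GB/s and ~9-10 lanes each way, i.e. an x16 slot (or a well-fed x8)
    # each for storage and networking once you leave some headroom.

PCIe lanes are full duplex, so the per-direction figures are what matter; that's also why the reply further down can get away with a narrower x8 link for the drives.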
---------
Then the math comes in: matrix multiplications over every bit of data to verify checksums and do Reed-Solomon error correction start to get expensive. Maybe if you had an FPGA or some kind of specialist DSP (lol, GPUs maybe, since they're good at matrix multiplication), you could handle the bandwidth. But it seems nontrivial to me.
A server CPU seems to be a cheap and simple answer: get the large number of PCIe I/O lanes plus a beefy CPU to handle the calculations. Maybe a cheap CPU with many I/O lanes going to a GPU / FPGA / ASIC for the error-checking math, but... specialized chips cost money. I don't think a cheap low-power CPU would be powerful enough to perform real-time error correction calculations over 9GB/s of data.
--------
We can set Backblaze's specific workload aside and think about typical SAN or NAS workloads too. More I/O is needed if you add NVMe storage to cache hard drive reads/writes, and tons of RAM is needed if you plan to dedupe.
RS is normally used as an erasure code: it's used when writing (to compute code blocks), and when reading _only when data is missing_. Checksums are used to detect corrupt data, which is then treated as missing, and RS is used to reconstruct it. Using RS to detect/correct corrupt data is very inefficient.
Checksums are also normally free (CRC + memcpy on most modern CPUs runs in the same time that memcpy alone does: it's entirely memory bound).
The generation of code blocks is also fairly cheap: certainly no large matrix multiplications! This is because the erasure code generally only spans a small number of blocks (e.g. 10 data blocks), so every code byte is only dependent on 10 data bytes. The math for this is reasonably simple, and further simplified with some reasonably sized look-up tables.
That's not to say that there is no CPU needed, but it's really not all that much, certainly nothing that needs acceleration support.
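To make the "simple math plus look-up tables" point concrete, here's a minimal sketch of table-driven GF(256) parity generation. The 10-data/2-parity layout and the coefficient rows are my own illustrative choices, not anything Backblaze has published:

    # Toy GF(256) Reed-Solomon-style parity generation using look-up tables.
    # Illustrative only: real erasure-code libraries (e.g. Intel ISA-L, jerasure)
    # pick proper coding matrices and vectorize the inner loop, but the per-byte
    # work really is this small.

    # Build log/exp tables for GF(2^8) with the common 0x11d polynomial.
    EXP = [0] * 512
    LOG = [0] * 256
    x = 1
    for i in range(255):
        EXP[i] = x
        LOG[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D
    for i in range(255, 512):
        EXP[i] = EXP[i - 255]

    def gf_mul(a, b):
        """Multiply two GF(256) elements via the look-up tables."""
        if a == 0 or b == 0:
            return 0
        return EXP[LOG[a] + LOG[b]]

    def encode_parity(data_blocks, coeff_rows):
        """Each parity block is a GF(256) linear combination of the data blocks."""
        parity = []
        for coeffs in coeff_rows:                      # one row per parity block
            out = bytearray(len(data_blocks[0]))
            for c, block in zip(coeffs, data_blocks):  # one lookup + XOR per byte
                for i, byte in enumerate(block):
                    out[i] ^= gf_mul(c, byte)
            parity.append(bytes(out))
        return parity

    # Hypothetical 10-data + 2-parity layout over 1 KiB blocks.
    data = [bytes([d] * 1024) for d in range(1, 11)]
    parity = encode_parity(data, [[1] * 10, list(range(1, 11))])
    print(len(parity), len(parity[0]))   # 2 parity blocks of 1024 bytes each

The inner loop is one table-driven multiply and an XOR per data byte per parity block; real implementations batch this with SIMD, but it's a far cry from large matrix multiplications over the whole data stream.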
> They're running a fair bit of math (probably Reed Solomon matrix multiplications for error correction) over all the data accesses.
Do those run on this machine? I imagine Backblaze has redundancy at the cluster level rather than the machine level. That allows them to lose a single machine without any data becoming unavailable. It also means we shouldn't assume the erasure code calculations happen on a machine with 60 drives attached. That's still possible, but alternatively the client [1] could do those calculations and the drive machines could simply read and write raw chunks. This can mean less network bandwidth [2] and better load balancing (heavier calculations done further from the stateful component).
[1] Meaning a machine handling a user-facing request or re-replication after drive/machine loss.
[2] Assume data is divided into slices that are reconstructed from any N of M chunks, such that each chunk is smaller than its slice. [3] On read, the client-side erasure code design means N chunk transfers from drive machines to the client. If instead the client queries one of the relevant drive machines, that machine has to receive N-1 chunks from the others and send back a full slice. (Similar for writes.) More network traffic on the drive machine and across the network in total, less on the client.
[3] This assumption might not make sense if they care more about minimizing seeks on read than minimizing bytes stored. Then they might have at least one full copy that doesn't require accessing the others.
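To put rough numbers on footnote [2] (the N=17 and 1 MB slice below are arbitrary values I picked for illustration, not anything from the comment):

    # Rough comparison of network bytes moved for one read, under footnote [2]'s
    # assumption that a slice is rebuilt from N equally sized chunks.
    N = 17                    # chunks needed to rebuild one slice (assumed)
    SLICE = 1_000_000         # slice size in bytes (assumed)
    chunk = SLICE / N

    # Option A: the client does the erasure-code math.
    a_total     = N * chunk                  # N chunks flow to the client
    a_per_drive = chunk                      # each drive machine sends one chunk

    # Option B: one drive machine reassembles the slice and forwards it.
    b_total = (N - 1) * chunk + SLICE        # peer chunks in, full slice out
    b_hot_node = b_total                     # and all of it hits that one machine

    print(f"client-side: {a_total / 1e6:.2f} MB total, {a_per_drive / 1e3:.0f} KB per drive machine")
    print(f"drive-side:  {b_total / 1e6:.2f} MB total, {b_hot_node / 1e6:.2f} MB on the reassembling machine")

Roughly twice the bytes on the wire, all of it concentrated on one drive machine, which is the load-balancing argument above.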
> Given the bandwidth of 60+ hard drives (150MB/s per hard drive x 60 == 9GB/s in/out)
Given their scale and goal, it would be pretty wasteful to build it to max out the write speed of all the hard drives. Considering you rarely write to a given pod, you would be better off getting a fraction of that speed and writing to multiple pods at the same time to get the required peak performance.
In fact it makes much more sense to put that math on some ingest server; these hard drive servers would simply write the resulting data. It makes it much easier and faster to divide the data over 20 pods like they currently do.
And no, you want to calculate checksums and fix bit errors right there in the RAM buffers you just read or received, because at such scales the hardware is not error-free.
I'm not familiar with the algorithms, but matrix multiplication sounds well suited to GPUs. I wonder if you could get away with a much cheaper CPU paired with a cheap GPU, for less total cost?
But the main issue with GPUs (or FPGAs / ASICs) is that now you need to send 9GB/s to some other chip AND back again.
Which means 9GB/s downstream (to be processed by the GPU) + 9GB/s upstream (GPU is done with the data), or a total bandwidth of 18GB/s aggregate to the GPU / FPGA / ASIC / whatever coprocessor you're using.
So that's what? Another x32 worth of PCIe 3.0 lanes? Maybe an x16 PCIe 4.0 GPU can handle that kind of I/O... but you can see that moving all this data around is non-trivial, even if we assume the math is instantaneous.
---------
Practically speaking, it seems like any CPU with enough PCIe bandwidth to handle this traffic is beefy enough to run the math.
PCIe 3.0 is 1GB/s per lane in each direction. A 3.0 x8 link would do a good job of saturating the drives. And basically any CPU could run x8 to the storage controllers and x8 to a GPU. Get any Ryzen chip and you can run 4 lanes directly to a network card too.
The HDD ASIC certainly is doing those computations.
The issue is that Backblaze has a 2nd layer of error correction codes. This 2nd layer of error correction codes needs to be calculated somewhere. If enough errors come from a drive, the administrators take down the box, replace the hard drives, and resilver the data.
Backblaze physically distributes the data over 20 separate computers in 20 separate racks. Some computer needs to run the math to "combine" the data (error correction + checksums and all) back into the original data on every single read. No single hard drive can do this math on its own, because the data has been dispersed across so many different computers for reliability.
With 60+ hard drives in a single enclosure, you want all of the processing power, PCIe lanes, and bandwidth you can get.
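For context on where that "combine" work lands, here's a sketch of the per-read decision across a vault-style layout. The 17 data + 3 parity split is taken from Backblaze's public Vault write-ups; the helper and shard numbering are purely illustrative:

    # Sketch of the read path across a 20-machine "vault" style layout.
    DATA_SHARDS = 17
    PARITY_SHARDS = 3
    TOTAL = DATA_SHARDS + PARITY_SHARDS      # one shard per pod, 20 pods

    def plan_read(available):
        """Return (needs_reed_solomon, shards_to_fetch) for one file read."""
        missing_data = [i for i in range(DATA_SHARDS) if not available[i]]
        if not missing_data:
            # Fast path: every data shard is online, no decode math needed.
            return False, list(range(DATA_SHARDS))
        # Otherwise fetch any 17 surviving shards and run the RS decode.
        survivors = [i for i in range(TOTAL) if available[i]]
        if len(survivors) < DATA_SHARDS:
            raise RuntimeError("fewer than 17 shards left: data unavailable")
        return True, survivors[:DATA_SHARDS]

    # Example: the pod holding shard 3 is down for a drive swap.
    available = [True] * TOTAL
    available[3] = False
    print(plan_read(available))   # (True, [17 surviving shard indexes])

In the common case every data shard is reachable and a read is just fetch-and-checksum; the Reed-Solomon decode only runs when shards are actually missing, as noted further up the thread.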
There are a small number of high end ARM server boards that could do it, but you’re not saving much money at that point. Might be more expensive due to lack of scale.
Amortize a server-grade CPU and motherboard across 60+ high capacity drives and it’s not really worth pinching pennies at the risk of lower performance.
I have seen some people build Ceph clusters using the HC2 board [1] before. I'm not sure what the performance is like, but it seems like a neat way to scale out storage. The only real shortcoming is that there's a single NIC... If there were two, you could use an HA stack for your network and have a very robust system for very cheap.
It’s actually interesting to me that Backblaze has reached a size where global logistics plays a bigger part in costs than the actual servers. (And the servers got cheaper.)
Also, Dell and Supermicro have storage servers inspired by the BB Pods.
Glad to see this scrappy company hit this amount of scale; a long way from shucking hard drives.
Any idea what Dell is actually selling them? The DVRs we buy (Avigilon) are white Dell 7x0s with a custom white bezel, but those only fit 18 3.5" drives.
We deploy our storage with a 2U "head unit"[1] that has the actual computer and all of the SSDs for boot and SLOG and L2ARC and "zfs special"[2] devices. Then we attach 60-drive JBODs externally to those "head units".
With this dual-front design you could have 24 2.5" drives in the first row and 12 3.5" drives in the second row, and that would really be helpful when we sometimes need to quickly spin up an adjacent zpool without attaching another whole JBOD.
Sadly, I do not see any configurations with 2.5" drives on the Dell config page:
Up to Storage Pod 5.0 they used 45 drives per 4U; 6.0 was indeed 60, but it hung out of the rack, so it is understandable that no major OEM has a chassis like that. 52 serviceable drives is not bad at all.
Dell's densest server is the PowerEdge XE7100 [1] (100 3.5" drives in 5U), but the bezel cover picture looks more like a standard 2U, maybe an R740xd2 (26 3.5" in 2U).
Based on the pictured bezel, it looks like they've got three rows of 3.5" 14TB SATA drives up front in 14th-generation carriers. Best guess would be something like an R740XD2, which has 26 total drive bays per 2U.
Not quite the same, but they do have something like the Pods, but a bit more modular:
It's their PowerEdge MX platform, which allows you to slot in different "sleds" for storage/compute etc. as needed. It can take 7 storage sleds for a total of 112 drives per chassis.
I'm curious what is being used for the drives (and to a lesser extent, memory) - Dell or OEM and how does support work?
We sell a lot of Dell, and for base models it is very economical compared to self-built.
The moment we add a few high-capacity hard drives or memory, however, all bets are off, and it's usually 1.75-4x the price of a white-box part.
I get not supporting the part itself, but I've had them refuse to support a RAID card error (corrupt memory) after they saw we had a third-party drive.... We only buy a handful of servers a month - I can imagine this possibly being a huge problem for Backblaze though...
That appeared to depend on whether the vendor imposed massive markups on the drives. However, they also mentioned service etc.: if they struck a deal with Dell, then Dell might be perfectly happy to sell the servers at a very modest profit while making their money on the service agreement.
Flash storage especially goes through the roof with enterprise purchasing. I've bought the drive trays and used consumer SSDs in servers more than a few times with no real ill effects where SATA is acceptable. If you need SAS, you just need to accept the pain that is about to come when you order.
> That’s a trivial number of parts and vendors for a hardware company, but stating the obvious, Backblaze is a software company.
Stating the obvious: Backblaze wants investors to value them like a SaaS company. This blog post suggests they're more of a logistics and product company: huge capex and depreciating assets on hand. As a customer, I like their product, but they're no Dropbox. If they allowed personal NAS backup, then I could see them being a software company.
For me the big deal is Backblaze B2, especially when fronted by Cloudflare: zero traffic costs. Storage is cheap as far as cloud storage providers go, and traffic is decidedly the cheapest possible.
We tried to use B2 for storing image and video files. Time to first byte is very slow (as expected). We experienced bigger issues, however: if you try to upload millions of files per minute to their system, it crumbles. Huge numbers of errors were thrown and their support could not help.
It's easy to buy a bunch of hard drives and connect in a data center. Managing petabytes per user for thousands of users is the hard part, and it's a software problem.
Backblaze is definitely a SaaS company... though the quality of their offering certainly lags behind Dropbox, both in terms of feature set and user experience. They're also in a very competitive industry. Storage/backup is basically a commodity nowadays.
1TB = $5 on B2. It's pretty great pricing, and there is no reason for them to give you that for free in the backup plan, when it would obviously be at a loss. The "unlimited" plan pricing reflects the fact that very few people carry 1TB+ laptops around.
In general, for larger (> 10TB) drives, most definitely. < 10TB drives have been mostly unaffected.
The direct impact on larger drives is entirely dependent on brand (Toshiba's Enterprise drives appear to be less in demand than Seagate's Exos, for instance), recording process (CMR versus SMR), and, to a lesser extent, power consumption. In virtually all instances though, the price has jumped significantly [0]. 16TB Exos more than doubled, for instance [1]. In large parts of Europe, large drives have been back-ordered since the beginning of April; my orders are scheduled for delivery in August, yet I would be surprised if I saw anything before September.
I've been watching drive prices, and it's hard to say exactly why, but around mid-April disk prices at Newegg and Amazon jumped significantly. One drive that had been $300 jumped to $400, $500, and even spiked to $800 for a bit. By June 1st it had dropped to $550, and only this week has it dropped to $400. Still above the original $300, but at least not a terribly painful premium.
Chia mining before transactions were released -> Chia transactions released -> Chia debuts at a high price -> Chia price halves shortly after, and it turns out it's virtually impossible for US users to get on any of the exchanges handling XCH.
Doesn't look like it, at least in the EU. You can check for yourself here[0]. Just find a drive such as this one[1] and check the chart on the upper-right corner for price history.
Not for hyperscalers like Backblaze. They have contracts with specific purchase quotas and guaranteed price deltas. Chia has certainly affected prices on the secondary markets, though; there hasn't been a better time in the past decade to be a secondary server "junk" seller on eBay! NetApp 4246 JBODs are going for $1000! Absolutely insane!
I looked into Chia mining as a hobby and was directed to Burstcoin. I don't know much about either, but Burstcoin advocates claim it's the "better" PoC.
Note: I went to double check something on the Burstcoin website and realized today, June 24th they changed their name to Signum - https://www.burst-coin.org/
Cool thanks! I didn't know about either, I'll check them out. Are you suggesting Burstcoin is not useful? How do the 4 (Chia/Burst/FileCoin/Arweave) compare? What are FileCoin and Arweave used for now?
FWIW the Burstcoin community have been very helpful, and they have a Windows client which was nice for a "hobbyist" like me.
Bitcoin may be useful, but the calculations done for its proof-of-work don't do anything useful outside of Bitcoin. The same is true of Chia. It uses up hard drive space as a mechanism to validate transactions, but IIRC that space is essentially filled with random bytes. On the other hand, with Arweave or Filecoin you are storing files for people.
Dell may be very happy to have a high-volume customer with very standardized & predictable needs, and so they're happy with modest markups & extra profit on the service agreements, which is a nice benefit for Backblaze, since building their own pods doesn't give them any service guarantee/warranty.
Dell (and, really, all server providers apart from Supermicro) have crazy markups on storage.
It's where they make most of their margin.
And then, most of the time, the drives they sell come on custom sleds that they don't sell separately, as a form of DRM/lock-in.
Then you get a nice little trade in Chinese-made sleds that sort of work, but not for anything recent like hot-swap NVMe drives.
I'm sure BB was able to negotiate down a lot (Dell usually comes down 50% off the list price if you press them hard enough for one-off projects), but... yeah. That's how it generally goes.
I imagine if you're buying sixty pods a month every month you have some leverage with Dell to get better prices, especially if you have a demonstrated ability to just walk away and build your own if you don't like their offer.
If you are buying in bulk, the pricing can be a lot better than the list prices. But generally there isn't a problem buying a server without drives: Dell will still support the server for non-disk-related warranty issues, and you don't need special firmware or disks.
May just be me, but I feel like it’s probably a bad idea to outsource one of your core competencies. Backblaze is quite literally built on their storage pods.
It’s like the outsourcing of telecommunications infrastructure in the EU, which we’ve seen a bunch of articles about recently.
By my reading of the article, making chassis and building out servers was already outsourced, and it was a pain because it wasn’t a core competency.
At least that was my take.
Boxes full of disks used to be special; in a past time there was only the glorious Thumper. BB Pods and hyperscaler servers built by Quanta and Supermicro have pushed "boxes full of disks" into race-to-the-bottom territory.
Software is now the only thing that separates services like BB, Dropbox, et al.
Don't cry for BB pods, they did their job and now you can get a higher quality chassis at a similar cost basis from Dell as a result.
If low volume is a problem for manufacturers because you don't need that much, the obvious solution is to increase volume by selling them. Of course that would introduce even more problems to solve, but at least volume wouldn't be one of them.
I don't think Backblaze ever sold them, but 45Drives does. I'm not sure if they assemble them for BB or if they were just using their published design.
they did at one point, fta: “Right after we introduced Storage Pod 1.0 to the world, we had to make a decision as to whether or not to make and sell Storage Pods in addition to our cloud-based services. We did make and sell a few Storage Pods—we needed the money—but we eventually chose software.”
Not any more they don't, fta: “Right after we introduced Storage Pod 1.0 to the world, we had to make a decision as to whether or not to make and sell Storage Pods in addition to our cloud-based services. We did make and sell a few Storage Pods—we needed the money—but we eventually chose software.”
fta: “Right after we introduced Storage Pod 1.0 to the world, we had to make a decision as to whether or not to make and sell Storage Pods in addition to our cloud-based services. We did make and sell a few Storage Pods—we needed the money—but we eventually chose software.”
I can't say that I'm surprised, and honestly anyone can open source the architecture for a storage array of comparable use. I think the only unique thing here really is the chassis, but there are plenty of whitebox vendors that sell storage chassis. You may not get as many drives in one, but the other components in these things are usually pretty cheap, minus the storage. I don't really see this being a loss to the community at all, and maybe someone else will get creative and build something better.
I used to work in a data centre doing infrastructure; I’m a metal fabricator by trade and now run and operate a laser cutter full time; I’ve done plenty of industrial spray painting and heaps of press operating; plenty of CAD / CAM experience.
With a trainee under me, I could have built these for you in-house and been useful the rest of the month once the fabrication and painting etc. were done.
I guess it's kind of like building your own gaming PC: even paying retail prices for the parts, you can build your own for significantly less than a comparable pre-built system. Since their business model is "extremely cheap unlimited backup storage," they had to go it alone, but now there are more COTS options suited to their needs.
> and there never will be, according to the article.
The article doesn't say that. It says:
> So the question is: Will there ever be a Storage Pod 7.0 and beyond? We want to say yes. We’re still control freaks at heart, meaning we’ll want to make sure we can make our own storage servers so we are not at the mercy of “Big Server Inc.” In addition, we do see ourselves continuing to invest in the platform so we can take advantage of and potentially create new, yet practical ideas in the space (Storage Pod X anyone?). So, no, we don’t think Storage Pods are dead, they’ll just have a diverse group of storage server friends to work with.
"The Next Backblaze Storage Pod" is commercially available storage from Dell. It's a little clickbaity, but it's both a) the title of the article, which HN encourages using and b) accurate.