Hacker News
WesternDigital/blb: Distributed object storage system for use on bare metal (github.com/westerndigitalcorporation)
95 points by mastabadtomm on Nov 21, 2018 | 23 comments



> In September 2017, Upthere was acquired by Western Digital, and it was decided to pause development on Blb and move data to other storage systems

So wait, this is abandoned?


> Although Blb is not being actively developed as a production system, its authors plan to continue improving the system in their spare time as an educational project.


It is using go.mod, so it can't be that abandoned.


Yes, it has commits to master from 4 days ago ...


The commit date can't really be used to determine activity in this case. There is only an initial commit.


Notably, Upthere is what Bertrand Serlet, the eccentric French software chief at Apple, did next.


What does "bare metal" mean here? Not virtualized?

It's hard to tell sometimes; one man's "bare metal" is another man's abstraction layer.


I think in this case it means "there's not even a RAID card in the way".


Maybe no filesystem, only block storage?


It says they currently use ext4 but might bypass the filesystem later.


Looks broadly similar to the bottom layer of the system I work on, except that blb only claims that it "should" scale to very high levels whereas the one I work on has already been running at many-petabyte scale for years.

Not too surprisingly, a lot of things that seem "impossible" at smaller scale become everyday occurrences for a large enough system. For example, you will get some kinds of inconsistencies that necessitate various forms of active GC or scrubbing. You will have hot spots, which you need to deal with explicitly instead of relying on statistical distribution guarantees. You will have to migrate whole racks' worth of data at once as equipment (not just disks and hosts but also network switches and power infra) gets upgraded. And of course you'll have to monitor the hell out of it so you can fix these problems as they occur, instead of letting them multiply until your system is irretrievably broken.

Don't take claims of super-duper scalability (from blb or Minio) on faith. Look for these "extras" as evidence that the system has actually been run at scale. BTW, they don't seem to be there in the blb source.

What does seem to be there is a reference buried in the deployment docs to a "master" component, not mentioned in the architectural overview. It seems to be responsible for assigning partitions of the blob space to curators and directing clients to the right one. Seems like a bottleneck but OK, let's take a look anyway.

From master.go lines 35-37:

    // Since we don't persist the address and last heartbeat time, when a master
    // failover happens, the new leader cannot service requests until it hears
    // heartbeat from the curators. See PL-1102.

Hm. That seems like a pretty big disruption, even if it's rare. Also, what kinds of heartbeats are these? I think it's generally a bad idea for systems like this to implement their own liveness checking. It's a hard problem, there are tried-and-true specialist-written systems for doing it, and systems that roll their own are almost certainly drifting away from their core competency. This comment is getting long enough, so I won't do a full analysis of the blb heartbeat system, but I invite others to look at it with an eye toward how much load it imposes on masters in a large cluster, how reliable failure detection is, and what should be done (but isn't) when heartbeats fail.

It looks like a pretty good start to a distributed blob store. The basic architectural principles are sound, the code looks pretty clean and well commented, etc. OTOH, it seems a bit light on tests, and the lower-level implementation details suggest that in its current form it might not handle even a hundred-node cluster all that well. Caveat emptor.


Just out of curiosity, what system are you working on?


It's a system within Facebook called Warm Storage. There have been some public presentations about it, so I'm comfortable mentioning the name, but unfortunately I can't provide many other details about architecture or scale. I'll just warn people that the public information on it is way out of date. Most of it seems to be from 2014, and what it describes is really a separate system from what we have now.


I (and I'm sure many others on HN) would love to learn more about this when possible.


How does it compare to minio?


Minio is an S3-compatible object store with optional distributed storage capability.

This appears to just be an object layer. I suppose you could build an access layer on top, similar to Minio.


It looks like it isn't as user-friendly as Minio, and it doesn't provide an S3-compatible API, but it appears to be more scalable and less opinionated than Minio (which, for example, forces the authors' replication preferences on you by hard-coding them).


No one knows what that is.



Was WD planning on making a block storage service or something?


Haven't they? WD MyCloud is an end-user storage service.


Seagate has their Kinetic KVS hard drive project too...


Are you sure? Even their own website for it seems to be dead.

https://www.seagate.com/tech-insights/kinetic-vision-how-sea...

None of their code repos seem to have been updated for years. Good riddance, too. Object storage is a fine thing, but their implementation was laughably bad.

https://pl.atyp.us/2013-10-comedic-open-storage.html



