Hacker News
WesternDigital/blb: Distributed object storage system for use on bare metal (github.com/westerndigitalcorporation)
95 points by mastabadtomm on Nov 21, 2018 | 23 comments



> In September 2017, Upthere was acquired by Western Digital, and it was decided to pause development on Blb and move data to other storage systems

So wait, this is abandoned?


> Although Blb is not being actively developed as a production system, its authors plan to continue improving the system in their spare time as an educational project.


It is using go.mod, so it can't be that abandoned.


Yes, it has commits to master from 4 days ago ...


The commit date can't really be used to determine activity in this case. There is only an initial commit.


Notably, Upthere is what Bertrand Serlet, the eccentric French software chief at Apple, did next.


What does "bare metal" mean here? Not virtualized?

It's hard to tell sometimes; one man's "bare metal" is another man's abstraction layer.


I think in this case it means "there's not even a RAID card in the way".


Maybe no filesystem, only block storage?


It says they currently use ext4 but might bypass the filesystem later.


Looks broadly similar to the bottom layer of the system I work on, except that blb only claims that it "should" scale to very high levels whereas the one I work on has already been running at many-petabyte scale for years.

Not too surprisingly, a lot of things that seem "impossible" at smaller scale become everyday occurrences for a large enough system. For example, you will get some kinds of inconsistencies that necessitate various forms of active GC or scrubbing. You will have hot spots, which you need to deal with explicitly instead of relying on statistical distribution guarantees. You will have to migrate whole racks' worth of data at once as equipment (not just disks and hosts but also network switches and power infra) gets upgraded. And of course you'll have to monitor the hell out of it so you can fix these problems as they occur, instead of letting them multiply until your system is irretrievably broken.

Don't take claims of super-duper scalability (from blb or Minio) on faith. Look for these "extras" as evidence that the system has actually been run at scale. BTW, they don't seem to be there in the blb source.

What does seem to be there is a reference buried in the deployment docs to a "master" component, not mentioned in the architectural overview. It seems to be responsible for assigning partitions of the blob space to curators and directing clients to the right one. Seems like a bottleneck but OK, let's take a look anyway.

From master.go lines 35-37:

    // Since we don't persist the address and last heartbeat time, when a master
    // failover happens, the new leader cannot service requests until it hears
    // heartbeat from the curators. See PL-1102.

Hm. That seems like a pretty big disruption, even if it's rare. Also, what kinds of heartbeats are these? I think it's generally a bad idea for systems like this to implement their own liveness checking. It's a hard problem, there are tried-and-true specialist-written systems for doing it, and systems that roll their own are almost certainly drifting away from their core competency. This comment is getting long enough, so I won't do a full analysis of the blb heartbeat system, but I invite others to look at it with an eye toward how much load it imposes on masters in a large cluster, how reliable failure detection is, and what should be done (but isn't) when heartbeats fail.

It looks like a pretty good start to a distributed blob store. The basic architectural principles are sound, the code looks pretty clean and well commented, etc. OTOH, it seems a bit light on tests, and the lower-level implementation details suggest that in its current form it might not handle even a hundred-node cluster all that well. Caveat emptor.


Just out of curiosity, what system are you working on?


It's a system within Facebook called Warm Storage. There have been some public presentations about it, so I'm comfortable mentioning the name, but unfortunately I can't provide many other details about architecture or scale. I'll just warn people that the public information on it is way out of date. Most of it seems to be from 2014, and what it describes is really a separate system from what we have now.


I (and I'm sure many others on HN) would love to learn more about this when possible.


How does it compare to minio?


Minio is an S3-compatible object store with optional distributed storage capability.

This appears to just be an object layer. I suppose you could build an access layer on top, similar to Minio.


It looks like it isn't as user-friendly as Minio, and it doesn't provide an S3-compatible API, but it appears to be more scalable and less opinionated than Minio (which, for example, forces the authors' replication preferences on you by hard-coding them).


No one knows what that is.



Was WD planning on making a block storage service or something?


Haven't they? WD MyCloud is an end-user storage service.


Seagate has their Kinetic KVS hard drive project too...


Are you sure? Even their own website for it seems to be dead.

https://www.seagate.com/tech-insights/kinetic-vision-how-sea...

None of their code repos seem to have been updated for years. Good riddance, too. Object storage is a fine thing, but their implementation was laughably bad.

https://pl.atyp.us/2013-10-comedic-open-storage.html



