
These days most people know Hadoop as distributed storage. In my opinion, though, Ceph [1] has the bigger potential.

[1] https://ceph.com/




There are several distributed storage systems like Ceph and they all have problems. Ceph is not good because it's an object storage system trying to provide block storage and a filesystem on top, which will never work well.


Absolutely right - I wrote an object-storage-backed filesystem for Hadoop, and synchronizing it was a complete nightmare.

It certainly worked well within our required architecture - but as a general purpose system it would have several issues over typical network topologies.

As active volume size increases and/or changes, the object storage layer's latency can climb to minutes, or even hours. We stopped testing at 100 TB volumes, as the object storage layer was backed up by hours.

Object storage is very convenient, but the lack of good metadata and the latencies involved basically mean that it's only good as an archival backing store for active data. If your active data source goes out, you could lose hours of data unless you have local copies.


Not that this really has anything to do with FoundationDB, but why do you say that object storage is a poor substrate for file and block abstractions? There are many high-performance block and file systems built on object storage.


Block-level access is the lowest form of addressing bytes on devices. Filesystems are an abstraction on top of block devices. Object stores are an abstraction on filesystems.

Emulating a low-level layer on a higher-level abstraction (which itself is using this hierarchy) will never match the speed, scale, or reliability of doing it correctly.


These abstraction statements just aren't true.

Yes, you can architect a storage system this way. But 1) Even if you do, many, many, many high-performance systems are "on top" of a filesystem but don't actually use the filesystem for anything except perhaps as a block allocator. Consider databases.

2) Many, many object stores do not abstract on top of filesystems. Modern RADOS, the distributed object store, stores its local data in a local object store called BlueStore. BlueStore speaks directly to the block device; there's no filesystem involved.

3) Even if you did store part of your distributed object store data on top of a local filesystem, that's not necessarily an issue. HDFS does this. (HDFS, despite the "FS" in its name, is an object store as most practitioners understand them.)


1) Yes, databases are just object stores with indexing and querying, which are an abstraction on filesystems, which are abstractions on block devices.

2) RADOS is an object store, which is an abstraction on BlueStore (effectively a filesystem and a replacement for FileStore), which is an abstraction on block devices.

3) HDFS is an object store, which is an abstraction on filesystems, which are an abstraction on block devices.

I'm not sure what your point is because you just restated what I already said. They are abstractions, and they work just fine without any performance issues because that is the trade-off of having an abstraction.

What I also said is that emulating low-level layers on a higher-level interface (like a block device on top of a database or object store) will never match the original block device. What is untrue about this?


It appears that you want to make a very simple statement: "Abstractions tend to introduce overhead. If any software layers, an OS, or a network are added on top of a block device, I/O overhead will be introduced somewhere." As a conversational seed, I would wager most people would agree with that statement, in general.

One issue in this thread is that abstractions are concepts, not CPU instructions. In order to discuss overhead, one needs to reify the abstraction. For example, if you care about latency overhead, the block scheduler will definitely introduce overhead. But if you care about throughput, you probably /want/ abstractions like queues and schedulers.

> "What I also said is that emulating low-level layers on a higher-level interface (like a block device on top of a database or object store) will never match the original block device. What is untrue about this?"

Nothing is untrue about the sentiment of your statement. But from a practical standpoint, storage devices are useless pieces of junk without software. So to say abstractions slow down storage devices while ignoring their utility feels arbitrary: why not talk about the length of the SATA cable, or the firmware in the disk controller? If the answer is that you just wanted to make a simple statement like the one I quoted at the start of this post, then that's great; I think we are all in agreement. Otherwise, it's not clear what your point is, and many of the supporting examples you list are stated as fact but are in reality either generally untrue or very nuanced, both of which tend to attract strong opinions :)


> Block level is the lowest form of addressing bytes on devices. Filesystems are an abstraction on top of block devices. Object stores are an abstraction on filesystems.

I don't agree with this, but I think you may be confused because "Object Storage" can mean several different things.

"Object Store" in Ceph (as in RADOS - Reliable Autonomous Distributed Object Store) basically means key-value store. I typically say "blob store" instead to avoid the confusion with more sophisticated systems. It is exposed through a S3-like API. As far as I know, this layer of CEPH is pretty good, and you need a layer like this in most distributed systems anyway.

Ceph provides something called RBD, the RADOS Block Device, which exposes a block device interface and is implemented on top of RADOS blob storage. It is useful for VM disks and has decent performance because it makes heavy use of caching.
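To show how thin that layering is, a short sketch with the rbd Python bindings (the pool and image names are hypothetical):

    # Sketch: create a 1 GiB RBD image and write to it like a disk.
    import rados, rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')               # pool name is hypothetical

    rbd.RBD().create(ioctx, 'vm-disk', 1024 ** 3)   # size in bytes
    with rbd.Image(ioctx, 'vm-disk') as image:
        image.write(b'\x00' * 512, 0)  # write a 512-byte "sector" at offset 0

    ioctx.close()
    cluster.shutdown()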

Some people use filesystems on top of RBD, but as far as I know CephFS itself does not sit on top of RBD. It is not as widely used as RBD because it is pretty recent (first stable release in 2016). The data is stored in RADOS, and the metadata (which is the hardest part of a distributed filesystem) is handled by a cluster of Metadata Servers (MDS). This sounds like a typical distributed filesystem architecture to me, similar to GFS (the MDS replaces the GFS master, and RADOS is used instead of chunk servers).

People tend to have a lot of issues with Ceph, but I think this is because:

1) It is used in reasonably large-scale production settings where you are going to have issues anyway;

2) It is not as easy to understand and fine-tune as it should be;

3) Some people expect it to solve all their issues magically with perfect performance...

4) Some people use filesystems on top of RBD when they should have used CephFS or even direct interfaces to RADOS when possible.

But in general, I think Ceph is an example of a decently architected complex distributed system.


Just for future reference, RADOS is actually not very S3-like. It is an object store; it does map from object names to buckets of bytes. But unlike S3 and many similar object stores or key-value DBs, RADOS allows you to do file-like operations: you can append, write to random offsets in the object, overwrite pieces of it but not the whole object, etc. (That's all in addition to some stunningly complex stuff like injecting custom code to do specific kinds of transactional read-writes on the OSD [storage node] itself.)

That's all key to RBD being useful, or indeed CephFS itself. There are systems that map a filesystem layer on top of S3, but they have trouble because there aren't good ways to overwrite random small pieces of an S3 object. With RADOS, there are! :)
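To make that concrete, a sketch of those file-like calls in the librados Python bindings (pool and object names made up); doing the same against S3 would mean fetching, patching, and re-uploading the whole object:

    # Sketch: RADOS supports partial, in-place updates that S3 does not.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')   # pool name is hypothetical

    ioctx.write_full('obj', b'hello world')
    ioctx.write('obj', b'HELLO', 0)      # overwrite 5 bytes at offset 0
    ioctx.append('obj', b'!')            # append without rewriting the object
    print(ioctx.read('obj'))             # b'HELLO world!'

    ioctx.close()
    cluster.shutdown()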


Sure, and key/value systems are at a similar level to object stores, meaning they are abstractions on filesystems (which are abstractions on block devices). This is the hierarchy.

Using Ceph for block and file access is like using AWS S3 to emulate block devices and filesystems. It'll work, and there is software for it, but it will never be very good. And Ceph is far from S3.


> It'll work, and there is software for it, but it will never be very good

What are some examples of distributed file systems and block devices that _are_ very good?


As a general rule of thumb, I would definitely agree that increasing the number of abstraction layers should be done with consideration to performance concerns.

Interestingly, in the latest version of Ceph the abstractions are a bit different than you listed. Ceph is now using an object store built directly on top of raw devices. It's the file system, block, and object abstractions that exist on top of that.


I mostly agree, but if you want a distributed, reliable, scalable block store, it is pretty easy to build one on FDB. Here is a super simple one that acts as an NBD server you can mount from Linux:

https://github.com/spullara/nbd

Remember to format it with something like XFS rather than ext4 to avoid writing superblocks all over the place.
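(Not necessarily how that repo does it, but the core idea is easy to sketch with the foundationdb Python bindings: map each fixed-size block to one key and let FDB transactions handle consistency. The block size and key layout here are arbitrary choices, and the NBD wire protocol is omitted.)

    # Toy block store on FoundationDB: one key per fixed-size block.
    import fdb

    fdb.api_version(630)
    db = fdb.open()
    BLOCK = 4096  # bytes per block; a hypothetical choice

    @fdb.transactional
    def write_block(tr, n, data):
        assert len(data) == BLOCK
        tr[fdb.tuple.pack(('blk', n))] = data

    @fdb.transactional
    def read_block(tr, n):
        v = tr[fdb.tuple.pack(('blk', n))]
        return bytes(v) if v.present() else b'\x00' * BLOCK  # unwritten blocks read as zeros

    write_block(db, 0, b'\xab' * BLOCK)
    assert read_block(db, 0) == b'\xab' * BLOCK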



