I've been following the technical discussions around this move, and I'm wondering if you guys looked at making architectural changes to shard your data into more manageable chunks?
Naively it seems like you should be able to reduce your peak filesystem IOPS by sharding the data at the application layer. That does introduce application complexity, but it might shake out to be less work than the operational complexity of running your own metal.
Of course, easier said than done -- I just didn't spot any discussion of this option, and it seemed like the design choice of having one filesystem served by Ceph was taken for granted.
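To make this concrete, here's a minimal sketch (in Go, with made-up mount paths and repo names) of what application-layer sharding could look like: route each repository to one of several filesystems by hashing its path, so no single filesystem has to absorb the whole peak load.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// Separate filesystems, e.g. distinct NFS mounts. The paths are
// hypothetical, purely for illustration.
var shards = []string{
	"/mnt/storage-0",
	"/mnt/storage-1",
	"/mnt/storage-2",
	"/mnt/storage-3",
}

// shardFor picks a shard for a repository by hashing its path, so
// repositories (and their IOPS) spread evenly across filesystems.
func shardFor(repo string) string {
	h := crc32.ChecksumIEEE([]byte(repo))
	return shards[h%uint32(len(shards))]
}

func main() {
	for _, repo := range []string{"alice/app", "bob/lib", "carol/site"} {
		fmt.Printf("%s -> %s\n", repo, shardFor(repo))
	}
}
```

Note that naive modulo hashing reshuffles almost every repo when you add a shard; a real deployment would want consistent hashing (or an explicit placement table) to keep rebalancing cheap.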
Then we have to think about redundancy. The simple solution is to have a secondary NFS server and use DRBD. For the shortcomings of that approach, see http://githubengineering.com/introducing-dgit/
The next step is introducing more granular redundancy, failover, and rebalancing. For that you have to be good at distributed computing, which is not something we are right now, so we'd rather outsource it to the experts who make CephFS.
The problem with CephFS is that every file needs to be tracked. If we did it ourselves, we could track locations at the repository level instead. But we'd rather reuse a project that many people have already made better than go through the pain of making all the mistakes ourselves. It could be that CephFS won't solve our latency problems and we end up having to do application sharding anyway.
That's a fair comment re: outsourcing, but at my company I'd bias towards bringing distributed computing knowledge in-house rather than taking on ops expertise plus the maintenance burden of bare metal; it sounds like you're going to have to add new expertise to your team either way.
Worth investigating if you can bolt on a distributed datastore like etcd or ZooKeeper to store the cluster membership and data locations; this might not be as complex as it sounds at first. etcd gives you some very powerful primitives to work with.
(For example, etcd has the concept of expiring keys, so you can keep an up-to-date list of live nodes in your network. And you can use those same primitives to keep a strongly consistent prioritized list of repos and their backed up locations. The reconciliation component might just have to listen for node keys expiring and create and register new data copies in response.)
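To sketch what I mean (this is just an illustration using the Go etcd clientv3 API; the key layout, node name, and TTL are all invented): each node registers a key under a short-lived lease and keeps it alive, and the reconciler watches the prefix for DELETE events, which fire when a node's lease expires.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Register this node under a 10-second lease; if the process dies
	// and stops refreshing, etcd deletes the key automatically.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		panic(err)
	}
	if _, err := cli.Put(ctx, "nodes/node-1", "alive", clientv3.WithLease(lease.ID)); err != nil {
		panic(err)
	}

	// Keep the lease refreshed for as long as this process is healthy.
	ka, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		panic(err)
	}
	go func() {
		for range ka {
			// Drain keepalive acks so the channel doesn't back up.
		}
	}()

	// The reconciler side: watch the nodes/ prefix; a DELETE event
	// means a lease expired, i.e. a node is gone and whatever repos it
	// held need to be re-replicated elsewhere.
	for resp := range cli.Watch(ctx, "nodes/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			if ev.Type == clientv3.EventTypeDelete {
				fmt.Printf("node %s is gone; re-replicate its repos\n", ev.Kv.Key)
			}
		}
	}
}
```

ZooKeeper's ephemeral nodes give you the same building block if you'd rather go that route.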
Think about it this way: your EE customers probably already have the (comparatively easy) bare metal knowledge, but would be willing to pay for you to solve the distributed systems problems for them :)