
Love how they did some number crunching on rent vs. own and decided that owning won. I think that if more places looked they would find the same. There must be a margin in it since the big players are making money at it.

I'm interested to see what they end up with.




Thanks! The decision to move to metal was made because of performance problems: https://about.gitlab.com/2016/11/10/why-choose-bare-metal/

It is nice that we'll save on costs, but we anticipate a lot of extra complexity that will slow us down. So if it weren't needed we would have stayed in the cloud. But it is interesting that both our competitors (GitHub.com and BitBucket.org) also moved to metal.


Have you considered hosting with Packet.net? You'd be on bare metal, thus solving your performance problems, but you'd still be renting by the hour as you are now, and you wouldn't have to deal with buying your own hardware and all the complexity that comes with that.


I looked at their site and they talk about bring-your-own IP block, anycast, and IPv6. But I can't find any information about network speeds. What if we end up needing 40 Gbps between the CephFS servers?


They provide dual 10Gb as standard. But talk to them about options.


Would love to chat with you to see if a switch to GCP might solve your performance and pricing issues. It's also always interesting to see how GCP vs. AWS vs. bare metal fares.

Email me at bookman@google.com and I'll be sure to get you in contact with the right people at Google.


I've been following the technical discussions around this move, and I'm wondering if you guys looked at making architectural changes to shard your data into more manageable chunks?

Naively it seems like you should be able to reduce your peak filesystem IOPS by sharding the data at the application layer. That does introduce application complexity, but it might shake out as being less work than the operational complexity of running your own metal.

Of course, easier said than done -- I just didn't spot any discussion of this option, and it seemed like the design choice of having one filesystem served by Ceph was taken for granted.
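
For what it's worth, a minimal sketch of what hash-based sharding at the application layer could look like (Go, with made-up shard paths and a naive modulo mapping; not anyone's actual implementation):

    package main

    import (
        "fmt"
        "hash/crc32"
    )

    // Hypothetical list of NFS shard mount points; in a real deployment this
    // would come from configuration.
    var shards = []string{
        "/mnt/nfs-shard-0",
        "/mnt/nfs-shard-1",
        "/mnt/nfs-shard-2",
    }

    // shardFor picks a shard for a repository by hashing its path, so the
    // IOPS for different repositories spread across the NFS servers.
    func shardFor(repoPath string) string {
        h := crc32.ChecksumIEEE([]byte(repoPath))
        return shards[int(h)%len(shards)]
    }

    func main() {
        fmt.Println(shardFor("gitlab-org/gitlab-ce.git"))
    }

The catch with a plain modulo mapping is rebalancing: adding a shard moves most repositories, so consistent hashing or a persisted lookup table tends to show up quickly once you go down this road.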


We have sharding at the application layer in GitLab right now (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7273) and we're using it heavily to split the load among NFS servers.

Then we have to think about redundancy. The simple solution is to have a secondary NFS server and use DRBD. For the shortcomings of that, read http://githubengineering.com/introducing-dgit/

The next step is introducing more granular redundancy, failover, and rebalancing. For this you have to be good at distributed computing. That is not something we are right now, so we would rather outsource it to the experts who make CephFS.

The problem with CephFS is that each file needs to be tracked. If we did it ourselves we could track placement at the repository level. But we would rather reuse a project that many people have already improved than go through the pain of making all the mistakes ourselves. It could be that using CephFS will not solve our latency problems and we have to do application sharding anyway.
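
To make the granularity point concrete: repository-level tracking can be as little as one entry per repo mapping it to the servers that hold a copy, whereas a distributed filesystem keeps metadata per file. A rough sketch (the type and server names are made up for illustration):

    package main

    import "fmt"

    // replicaMap tracks placement per repository, not per file: one entry per
    // repo, mapping the repo path to the storage nodes that hold a copy.
    type replicaMap map[string][]string

    func main() {
        placement := replicaMap{
            "gitlab-org/gitlab-ce.git": {"storage-01", "storage-03"},
            "gitlab-org/gitlab-ee.git": {"storage-02", "storage-03"},
        }
        // Serving any file in a repository needs only this one lookup, which
        // is what keeps the metadata small compared to per-file tracking.
        fmt.Println(placement["gitlab-org/gitlab-ce.git"])
    }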


That's a fair comment RE: outsourcing, but at my company I'd bias towards bringing some distributed computing knowledge in-house rather than bringing ops expertise plus maintenance burden in-house; sounds like you're going to have to add new expertise to your team either way.

Worth investigating if you can bolt on a distributed datastore like etcd or ZooKeeper to store the cluster membership and data locations; this might not be as complex as it sounds at first. etcd gives you some very powerful primitives to work with.

(For example, etcd has the concept of expiring keys, so you can keep an up-to-date list of live nodes in your network. And you can use those same primitives to keep a strongly consistent prioritized list of repos and their backed up locations. The reconciliation component might just have to listen for node keys expiring and create and register new data copies in response.)
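
Roughly, using the etcd v3 Go client (the endpoint, key names, and TTL below are placeholders, and this is only a sketch of the membership half):

    package main

    import (
        "context"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        // Connect to the etcd cluster (endpoint is an assumption).
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()
        ctx := context.Background()

        // Register this node under a 10-second lease; the key disappears
        // automatically if the node stops refreshing the lease.
        lease, err := cli.Grant(ctx, 10)
        if err != nil {
            log.Fatal(err)
        }
        if _, err := cli.Put(ctx, "nodes/storage-01", "alive",
            clientv3.WithLease(lease.ID)); err != nil {
            log.Fatal(err)
        }
        ka, err := cli.KeepAlive(ctx, lease.ID)
        if err != nil {
            log.Fatal(err)
        }
        go func() {
            for range ka {
                // Drain keep-alive acks so the channel never fills up.
            }
        }()

        // The reconciliation component watches for node keys being deleted
        // (lease expired) and can schedule new repo copies in response.
        for resp := range cli.Watch(ctx, "nodes/", clientv3.WithPrefix()) {
            for _, ev := range resp.Events {
                if ev.Type == clientv3.EventTypeDelete {
                    log.Printf("node %s is gone, re-replicate its repos", ev.Kv.Key)
                }
            }
        }
    }

The lease TTL is the knob that trades failover speed against false positives from GC pauses or network blips, so it's worth keeping configurable.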


I agree we need to add new expertise to our team anyway. But I think adding bare metal expertise is easier than adding distributed systems expertise.

etcd is indeed very interesting. I'm thinking about using it for active-active replication in https://gitlab.com/gitlab-org/gitlab-ee/issues/1381


Think about it this way: your EE customers probably already have the (easier) bare metal knowledge, but would be willing to pay for you to solve the distributed systems problems for them :)


> Love how they did some number crunching on rent vs. own and decided that owning won. I think that if more places looked they would find the same

I wouldn't be so quick to jump to that conclusion. It's not just the cost of owning and renewing the hardware, it's everything else that comes with it: designing your network, performance tuning, and debugging everything. Suddenly you have a capacity issue; now what? You're not likely to have a spare 100 servers racked and ready to go, or to be able to spin them up in two minutes. Autoscaling?

Companies spend enormous amounts of engineering time maintaining their on-premise solutions. And sometimes that's fine b/c you have requirements that you can't easily meet in the cloud (think of high-frequency trading, for example). However, once you tally all that up, plus all the value-added services you can buy in the cloud (just take a look at the AWS portfolio, for example), the price might well be worth it. That's not to say you won't need engineers to help you with cloud stuff, but you'll probably need fewer and they'll be able to focus on solving a different class of problems for you.

> There must be a margin in it since the big players are making money at it.

From what I've seen the players aren't making (lots of) money on providing compute power. They're basically racing against each other to the bottom. What they're making money on is all the value-added services, the rest of the portfolio that AWS/Google Cloud Platform/Azure offer.


So in GitLab's case, they have a load that they can monitor and predict. They are looking at 60+ processors, so they can plan to add 10% (6 procs) at a time and grow. They know their load, so a sudden need to spin up to 150% of current capacity isn't something on their plan. I'll give you that there are companies with erratic loads that are hard to predict; those make sense to place somewhere that can grow 100% on an email.

At big companies, most servers have a pretty stable load; it's unlikely that things like internal email, SharePoint, or ERP/MRP systems will take a spike. It's only things like front-end order processing that take the hit.

There are lots of businesses where owning makes sense and some where it doesn't.

I like the concept of "racing to the bottom", but they are still making money. Let's take your point about the value-added services other than the ability to spin up capacity: what's the cost to GitLab to pull all this together and keep it running? There is an overflow of HN articles every day about operations monitoring, containers, and network monitoring. The tools are there; it's an effort to glue them together, but then they are there.

So I'll still posit that there are cases where the dollars to own are less than the dollars to rent. And I'll agree that your case of renting because of capacity blowouts is key. The issue is: is your ops team savvy enough to figure out what to keep/own and what to rent?



