We run 26 services in production on DigitalOcean. Every single VPS in our setup uses the block storage feature as its persistence layer (system logs, app logs, databases, etc).
Thanks to this architecture we can rebuild machines at will. The _function_ and _state_ are nicely separated.
The downside is that we are now fucked.
You should re-architect without network block storage as a requirement imho. Ever since that big AWS EBS outage in 2012 or whatever, I avoid it like the plague.
Databases like low-latency local storage (the newer NVMe instances on AWS are very good), and logs and whatnot are aggregated with other systems (Fluentd, Logstash, etc). I actually do not miss EBS much at all -- if I have a problem with a VM, it's disposable and redundant, and I've designed out most of the SPOFs.
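For reference, a minimal sketch of that "ship logs off the box" pattern, using Python's stdlib syslog handler; the aggregator host and port are placeholders, and rsyslog/Fluentd/Logstash typically have syslog inputs that can receive this:

```python
import logging
import logging.handlers

# Send application logs straight to a remote aggregator over syslog, so the
# local disk never becomes the system of record. Host and port are placeholders.
handler = logging.handlers.SysLogHandler(address=("logs.internal.example", 514))
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("order processed")  # shipped over the network (UDP by default), not written locally
```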
I wouldn't throw the baby out with the bathwater. Across Google, we rely almost exclusively on networked storage (Colossus) unless we need extremely high-performance local flash. Being able to separate compute from storage is a huge part of our ability both to scale out and to do live migration for GCE (where Persistent Disk, our equivalent of EBS, is built on Colossus).
Persistent Disk has never had an outage like the EBS one, but I attribute that to the Colossus and underlying teams having run this at Google for a really long time. Fwiw, the AWS folks have also massively improved EBS over the years. You can still worry, but I prefer to think about overall MTBF rather than consider networked storage in particular as the plague :).
If you haven't noticed, the MTU in Google Cloud is < 1500, vs AWS where you can get jumbo frames (9k). I have no reason to believe persistent disks are different.
Enabling live migrations? You mean the choice between having an instance terminated with zero notice or migrated with a 60-second warning if you subscribe to the right API? Oh yeah, and this is happening constantly.
Live migrations (and by extension VM attrition), persistent disks, and network performance are my least favorite aspects of Google Cloud today.
That being said, Google Cloud does have a lot of advantages over EC2. These just aren't among them.
First, I'm sorry you've had a bad time. We track our VM MTBF closely, so hearing "this is happening constantly" is really worrying. Feel free to reach out to me (email in profile), or Support so we can dig into your experience. If something is wrong, we should diagnose and fix it.
Can you say why you prefer explicitly separated egress caps? We let our networking egress be shared between all sources of traffic on purpose, because it lets you go full throttle rather than hardcap on "flavor". That is, why restrict someone that doesn't write to disk much by "stealing" several Gbps for the PD/EBS they aren't going to use?
Finally, it's true that our MTU is too damn low. But that also isn't particularly material for PD: when the guest issues a write, we handle it all behind the scenes (it's not like your guest sees the write get fragmented into packets).
Thank you for the fresh perspective and the thoughtful reply. I also appreciate the offer to reach out. Rest assured we are actively engaged during the events and have found support incredibly responsive. At a high level none of these are reasons for us to stop investing in our google cloud stack, nor have they caused any major outages. Think of them as quality of life comments.
Live migrations
To clarify "this is happening constantly": I meant we see live migrations happen frequently throughout the day. Going back over the last 24 hours, I see "hundreds" of migrations. We do have days where we will see 3-4x that number. The majority of these were successful, and our logging, probes, and graphs show nothing exciting.
On the positive side, we have noticed a marked improvement in migration times, probe failures, and instance fatalities over the last 6 months. Where before we would regularly see live migrations take upwards of 15 minutes or longer, they are now at or under 2 minutes (with only a handful of exceptions barely worth mentioning).
I do appreciate the facility of live migration and the proactive approach Google takes to host maintenance. The 60-second notification window is just too damn short for some of our services to properly drain themselves. So instead, we hold on to our butts and hope for the best on those boxes.
If there were one improvement to live migration, it would be the option of a 15-minute (or even 30-minute... am I being greedy?) notice.
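(For reference, a rough sketch of consuming that notice via the instance metadata server. The maintenance-event path and return values here are from memory and may differ from the current API; drain() is a placeholder for service-specific shutdown logic.)

```python
import urllib.request

# Long-poll the GCE metadata server for a maintenance event and start draining
# as soon as the value changes away from NONE. The ~60 second notice starts
# roughly when that happens.
METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event?wait_for_change=true")

def wait_for_maintenance_event() -> str:
    req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req) as resp:  # blocks until the value changes
        return resp.read().decode().strip()

def drain() -> None:
    # Placeholder: stop accepting new work, finish in-flight requests,
    # deregister from load balancing -- whatever fits in the notice window.
    pass

while True:
    event = wait_for_maintenance_event()
    if event != "NONE":  # e.g. MIGRATE_ON_HOST_MAINTENANCE
        drain()
```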
Networking egress caps shared between instance and persistent disk
The edge case that hurts here is a high-bandwidth service that also writes a lot of data to a PD disk, combined with the Comcast-style burst bandwidth throttling that happens on the instances (this is pure speculation and may have improved since the last time we investigated; the observation at the time was that the throttling is a bit too efficient and hits the instance disproportionately). We have since migrated to either local SSD or tmpfs for these types of hosts in GCE. Their sister services are still running fine in EC2 on EBS-backed instances.
MTU
Yay! 1500 would be better, 9k would be great. This hurts when connectivity starts to see an increase in packet loss and the corresponding increase in packet retransmission (and latency). Not to mention the overhead the extra packets incur (those 20-60+ bytes just in headers add up quickly).
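Rough arithmetic on that overhead, assuming a flat 40-byte IPv4+TCP header per packet (options push this toward the 60+ bytes mentioned above) and ignoring Ethernet framing and retransmits:

```python
HEADER_BYTES = 40  # plain IPv4 (20) + TCP (20) headers; options add more

def per_gib_overhead(mtu: int, payload: int = 1 << 30) -> tuple[int, float]:
    mss = mtu - HEADER_BYTES                 # payload bytes carried per packet
    packets = -(-payload // mss)             # ceiling division
    return packets, 100 * packets * HEADER_BYTES / payload

for mtu in (1460, 1500, 9000):
    packets, pct = per_gib_overhead(mtu)
    print(f"MTU {mtu}: ~{packets:,} packets/GiB, ~{pct:.2f}% header overhead")
```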
TLDR wishlist
Longer notification window before live migration actually starts
PD-optimized instances
1500/9k MTU
One would think, right? We are not running preemptible instances.
Yes I have verified all the scheduling flags for our instances.
Preemptible: false
OnMaintenance: MIGRATE (the only other option is TERMINATE, which for any sizable shop happens far too often to be usable, and you lose the 60-second warning)
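(For anyone else double-checking, the same flags can be read from inside the instance via the metadata server. A sketch only; the scheduling paths here are from memory and may vary between metadata API versions.)

```python
import urllib.request

# Read the scheduling flags off the instance's own metadata server.
BASE = "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/"

def scheduling(key: str) -> str:
    req = urllib.request.Request(BASE + key, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode().strip()

print("preemptible:        ", scheduling("preemptible"))          # expect FALSE
print("on-host-maintenance:", scheduling("on-host-maintenance"))  # expect MIGRATE
```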
Key word being "usually". Even if it's not a complete failure, we have seen all sorts of fun, from packet loss and latency spikes to segfaults.
I have no doubt things will improve, but right now PD, networking, and live migrations are pain points for our group in GCE.
Full disclosure: I'm evaluating GCP pretty heavily -- and I can support what you're saying from conversations with multiple existing customers. Live migrations cause both latency/CPU slowdowns (brownouts) and actual (short) interruptions (blackouts). This isn't even a secret; it's fairly well documented:
To each their own. I don't like getting paged at 3am, so I choose to remove complexity and external dependencies where I can. And not operating at Google-scale, this is one of those things I can live without.
Coming from an enterprise background for a couple of years: network block storage screws you one way or another. It's a single point of failure that, no matter how reliable it is, will always go down at some point or have failures that affect your application layer.
With Ceph you can distribute data across hosts, switches, racks, rows, fabrics, rooms, datacenters. If you design correctly, you can have a very resilient storage system.
Source: I come from enterprise storage and have been running Ceph in production (successfully) for ~5 years.
Ceph has some weird failure modes, though. Your radius of failure absolutely includes the entire cluster, regardless of how resilient it's been made.
Sure, you're insulated from hardware failures, but not from bugs in the underlying system. Data loss is rare, but having a cluster go unavailable does happen.
> radius of failure absolutely includes the entire cluster
Completely agree, same goes for any other storage system, network-based or otherwise.
I'm simply disagreeing with the blanket statement that all network block storage is unreliable; that is simply not true.
Don't do silly things with Ceph, don't just have a single network fabric, don't buy cheap switches that drop packets on the floor at medium load, don't test things in production.
Can't overstate the need for a good network, especially with the traffic amplification of replication.
I was running 2x10GE and had plans to scale up the cluster network to 2x or 3x10GE.
My biggest issue was the MTTR of a failed cluster. If one is RBD mirroring to a different cluster, recovery time may be hours or days.
That said, I had one cluster run continuously for a couple years with basically zero administration. It had great performance and good reliability other than that one week-long service impacting incident as a result of a bug.
Sounds like you've been in environments that have done network block storage incorrectly. If you do multi-path, redundant/replicated network storage with proper configurations in place, you shouldn't have a single point of failure. Ceph oddities aside.
Everything has issues given a long enough time span though. Each layer has an accounted level of risk, and it definitely is a balancing act.
We are actually very happy with database performance over block storage.
As for the architecture, in DigitalOcean the external block storage is the only way to separate function from state. We need this to be able to rebuild VPS-es regularly. This is to avoid configuration drift and to prove/enforce reproducibility.
I guess I've solved that in other ways -- configuration drift is really a config management and orchestration problem. And recycling or OS upgrades can happen if your story around replication is sound -- I only typically do that every two years anyway following the Ubuntu LTS releases.
Add a cache layer for read-only content that doesn't invalidate in case of a failure like this, allowing your system to go into read-only mode.
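A rough sketch of that kind of serve-stale cache, with fetch_from_store(), the TTL, and the in-process dict standing in for whatever store and cache layer is actually in use:

```python
import time

TTL = 300  # seconds before an entry is considered stale (still usable during a failure)

_cache: dict[str, tuple[float, bytes]] = {}

def fetch_from_store(key: str) -> bytes:
    """Placeholder: read from the block-storage-backed database/service."""
    raise NotImplementedError

def read(key: str) -> bytes:
    now = time.time()
    cached = _cache.get(key)
    if cached and now - cached[0] < TTL:
        return cached[1]                  # fresh hit
    try:
        value = fetch_from_store(key)
        _cache[key] = (now, value)
        return value
    except Exception:
        if cached:
            return cached[1]              # store is down: serve stale, i.e. read-only mode
        raise                             # nothing cached; surface the failure
```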
Local copies of puppet/ansible can be very useful as well.
For logs and metrics, I'm not aware of an out-of-the-box solution that could replay from the last successfully transmitted line; this is something rsyslog, graphite, etc. could certainly benefit from. (Please let me know if you are aware of these kinds of buffers.)
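A rough sketch of the kind of checkpoint-and-replay buffer being described here: a forwarder that records the last acknowledged byte offset and resumes from it after an outage. The file paths are placeholders and send() stands in for whatever transport is in use; log rotation handling is omitted.

```python
LOG_FILE = "/var/log/app.log"                      # placeholder paths
OFFSET_FILE = "/var/lib/shipper/app.log.offset"

def send(line: str) -> None:
    """Placeholder: forward one line to the aggregator; raise if not acknowledged."""
    raise NotImplementedError

def ship_once() -> None:
    # Resume from the last acknowledged byte offset, so lines buffered locally
    # during an aggregator outage get replayed instead of silently dropped.
    try:
        with open(OFFSET_FILE) as f:
            offset = int(f.read() or 0)
    except FileNotFoundError:
        offset = 0

    with open(LOG_FILE, "rb") as log:
        log.seek(offset)
        while True:
            raw = log.readline()
            if not raw or not raw.endswith(b"\n"):
                break                               # EOF or partial line; retry next run
            send(raw.decode(errors="replace"))
            offset = log.tell()
            with open(OFFSET_FILE, "w") as f:       # checkpoint after each acknowledged line
                f.write(str(offset))
```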
However, distributed network block storage is usually something unavoidable after a certain size; local filers are really expensive.
As somebody who's been looking really hard at a project/side business that'd use Spaces (DO's object storage system), this makes me super, super nervous. To say nothing of block storage--yikes.
Can anyone speak to quality/reliability of other object storage providers that have S3-compatible (including presigned URL) APIs? S3's pricing is absolutely ridiculous by comparison, but they have the reliability argument on their side...
>S3's pricing is absolutely ridiculous by comparison, but they have the reliability argument on their side...
Well, unless your S3 buckets are in us-east-1. For some reason Amazon keeps having issues with S3 in that region.
Since the storage costs appear to be the same between Spaces and S3 ($0.02/GB/month) and neither charges for inbound transfer, I'm assuming your problem is with the outbound transfer pricing (S3 charges 9x what DO charges) and/or the per-request pricing. GCP's Regional Cloud Storage has the same storage pricing, even higher outbound transfer pricing, and the same request pricing. I haven't looked at any other providers, but if you want reliability, you're going to have to pay for it.
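To make that concrete, a quick back-of-the-envelope using the figures cited in this thread ($0.02/GB-month storage on both, ~$0.09/GB egress on S3 vs roughly a ninth of that on Spaces); the workload numbers are hypothetical and request charges and any included transfer are ignored:

```python
# Rough monthly cost: storage plus outbound transfer only.
def monthly_cost(stored_gb: float, egress_gb: float, egress_rate: float,
                 storage_rate: float = 0.02) -> float:
    return stored_gb * storage_rate + egress_gb * egress_rate

stored_gb, egress_gb = 500, 5_000      # hypothetical workload: 500 GB stored, 5 TB out
print(f"S3:     ${monthly_cost(stored_gb, egress_gb, 0.09):,.2f}")   # ~$460
print(f"Spaces: ${monthly_cost(stored_gb, egress_gb, 0.01):,.2f}")   # ~$60
```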
Their us-east-1 region is notorious for having the least reliability. Unless there's a good reason not to, I always recommend people default to us-west-2.
I can't speak about S3, but we're using GCS (Google Cloud Storage) heavily and haven't experienced any problems. From time to time we see a slowdown, but we've never seen an outage.
I also looked at B2 [0] once or twice. The price is great, but the traffic cost (egress from GCE) renders it unusable for us.
There are actually a few options in egress land for us if cost is your primary concern.
If you're doing serving over HTTP(S), you should probably be using Cloud CDN with your bucket [1] or put one of our partners like Cloudflare or Fastly in front via CDN Interconnect [2]. Both of these get you closer to $.04-$.08/GB depending on src/dest.
If not, and you don't care that we have a global backbone, you can get a more AWS-like network with our Standard Tier [3] (curiously with pricing squirreled away at [4], I'll file a bug). The packets will hop off our network in a hot potato / asap fashion, so you're not riding our backbone as much.
I know about Standard Tier, though it had slipped my mind--thanks for pointing it out.
It's still way too expensive. And Cloud CDN isn't an appropriate tool for my use case. I really do just need a bunch of egress from a single location that isn't insanely expensive. $0.085/GB is in that insanely-expensive tier, for me.
Don't know yet! But I expect to see between 100x and 1000x on a per-megabyte basis. Not evenly distributed across objects (objects between 40MB and 150MB), but as a rough estimate.
I'd be happy to talk further via email; it's not secret, just not public.
Your bandwidth pricing is a joke. Yes, you've got a nice network, and yes, you pay premiums to get transit from providers that are "hard to work with". And yes, you have dark fiber between your locations, which costs a lot of money. But even considering those facts, you are still charging your customers at least 10x what your bandwidth should cost.
How have you even calculated those prices?
"Let's look at AWS and make it even more expensive"?
The bandwidth charges are hefty if you're moving a lot of bits, but I wouldn't use anything other than S3 or GCS probably -- the other guys just don't have a track record of reliability yet.
But you can build a poor man's CDN -- Varnish caches on DO/Linode/whatever, where you get multiple terabytes of bandwidth for a small VM. So you use the best object storage provider, but move most of the bits cheaply using Varnish + Route 53 geo-DNS.
I mean, DigitalOcean itself has been "jokeish". I had a VM go down for 12 hours, and it took 6 hours to even get them to confirm they had an issue on that machine. It was stupid.
There is the Sheepdog distributed file system, which recently hit version 1.0. Sheepdog is similar to Ceph but seems simpler to operate.
For distributed object storage: I have also used MooseFS and LizardFS, and both run very steadily on production workloads. Steady as in: set it up and then no ops issues.
Also on the short list is BeeGFS, created by Fraunhofer; it is a seriously fast distributed file system.
I can't speak for the quality of other object storage providers, but being in the storage business I can say that if someone is running Ceph, find another provider.
If you are relying on a single object storage provider and cannot survive downtime, data loss, or simply being very slow at times, you will never find a good one. Expect things to fail. Distributed systems are not trivial, and a random object storage provider rarely has enough expertise to run Ceph or any other open source solution at scale with no issues.
Or you should purchase commercial support for said storage...
Salesforce runs several large Ceph clusters, and they have a dedicated team to run it. If you can't invest in the employees, you should invest in commercial support.
Salesforce also commits a lot of updates and patches back to the Ceph community
The issue with Ceph isn't that it's somehow deficient. It's amazing. The problem is that it's difficult to engineer correctly and hard to troubleshoot.
There is no other software, open source or otherwise, that works quite as well as Ceph for providing durability and scale.
ScaleIO gets high marks for block storage performance compared to Ceph. It's not quite as durable and lacks some other features, but people seem to like it.
A lot of companies using Ceph at scale are facing huge issues (OVH, etc.), so he is not wrong. Why take the risk of going with a solution that is known to cause issues?
I've talked to a lot of large-ish commercial Ceph customers and they seem to spend a lot of time building kludge-arounds for support. And tend to live terrified that the whole clumsy edifice will come crashing down at the cost of their jobs.
Also, Ceph is block, object, and file. Block is OK up to a point, object is dubious, and file is utterly untrustworthy. At least at any kind of real scale -- 3 servers in a rack aren't "scale".
Why must someone who isn't a Ceph fan (and I fail to see why storage systems are a "fan" activity) live in the evil pockets of EMC? I know people who've smoked for years and don't have any sign of lung cancer either.
OVH isn't exactly a shining example of a quality engineering organization. Simple web searches show how they have misused things and caused large outages.
Ceph is very reliable and durable. We've actually gone out of our way to try and corrupt data, but we failed every time. It always repaired the data correctly and brought things back into a good working state.
CERN and Yahoo run very large Ceph clusters at scale, too.
You can use Ceph together with OpenStack. They used Ceph for their cloud services but had huge problems. If I am not mistaken, they have completely thrown out Ceph by now.
Any idea what the underlying issues with Ceph were?
My story is a bit dated, but we went from Gluster to Ceph to MooseFS at one startup. Gluster had odd performance problems (slow metadata operations -- scatter/gather RPCs and whatnot, I would guess), and it was hard to know from the logs what was going on.
Ceph was very very early at this point, but part of it ran as a kernel module and the first time it oops'd, I deleted that with fire. MooseFS ran all in userspace, had good tools for observability into the state of the cluster, and the source code was simple and clean. It didn't have a good story around multi-master at that time, but I think that is improved now.
Ceph is extraordinarily complicated to run correctly. The docs aren't great and commercial support is pretty mediocre.
It's an amazing piece of software, but takes a great deal of engineering to get right. Most folks won't invest that much engineering into their storage.
This is why providers like EMC and NetApp can extract 10x the cost of the raw storage from enterprises.
The Red Hat Ceph docs are great and open to everyone for free.
The Red Hat commercial support has been pretty good for us. We presented them with 2 bugs, and they addressed both. One took a few weeks, but the other only took a few hours to get a hotfix started.
EMC storage is absolute trash post-Dell merger. A pure, 100% dumpster fire. Their customers know their systems better than they do. It's pathetic.
No clue what the underlying issue was but when reading:
"We have about 200 harddisk in this cluster... 1 of the disks was broken and we removed it. For some reasons, Ceph stopped to working : 17 objectfs are missed. It should not."
This isn't related to block storage at all, but I was a big fan of DO until I hit a weird issue where they wanted me to prepay via PayPal to spin up more than 50 droplets at a time. I work in an organization that is spinning up many nodes at once for a short time and then destroying them soon after, for various but totally legitimate reasons. One look at our account history can demonstrate that this is almost exclusively how we use their services, so it's not like this was a weird request. And we've never missed a payment or paid late or otherwise ever given Digital Ocean any reason to think we wouldn't be good for the charges at the end of the month (especially considering we were already spending sometimes in the thousands of dollars every month). This was so off-putting. I stopped using Digital Ocean that day.
We do have a business account, if I'm reading the console correctly. I was wrong about the limit of 50: it is a limit of 100 droplets, but this is an artificial limitation that they wouldn't budge on. It's clear from the (lengthy) account history that this is normal behavior for us, and I work for a company whose name everyone knows, so it's not like we're some no-name scammers. Regardless, asking customers to buy vouchers for what really is only moderate use of a service is really off-putting. It was so off-putting, actually, that more than halfway through writing it I scrapped a driver for some pretty popular software that would have enabled the use of DigitalOcean as a backend.
Minio is cool. Unfortunately, performance is anyone's guess.
It has erasure coding as well. You could deploy on bare VMs with local storage in any cloud provider and have no dependency on network block storage.
With k8s 1.10 you get persistent local storage as well, so you could probably build a fairly highly available system. Pro tip: do it in GCP, as they have nice local SSDs you can attach to any instance. They're 375GB, 25k IOPS, and $.08/GB, way cheaper than AWS I2 instances.
So I have no problem with network storage. I have a cost problem (the project I'm working on is not intended to make a bunch of money and I'm trying to keep costs low so I can keep prices extremely low). What you're describing would functionally be even more expensive than just using S3 directly, if it were to be done in AWS or in GCS.
Right now the leading (uncomfortable) solution is probably DigitalOcean Spaces and a little bit of prayer.
But...I don't care about the technology. I care about the object storage available to me without caring about the technology. So what's this do for me?
Thanks for the suggestion, but a dollar per gigabyte per month is ridiculous unless you're downloading everything in your store eleven times a month. Even S3 only costs $0.02/GB storage and $0.09/GB egress ($0.02 + 11 x $0.09 ≈ $1.01/GB at that download rate).
I would suggest you take a look at Wasabi for object storage. I'm just a customer, but have been using them for close to a year for off-site backup storage and it's been great.
Reposting from my comment yesterday[0]: how are the speeds from your location (and where is that)? It has been ridiculously slow from Northern Europe when I've tried it, like not even 1 MB/s down.
> Wasabi’s hot cloud storage service is not designed to be used to serve up (for example) web pages at a rate where the downloaded data far exceeds the stored data or any other use case where a small amount of data is served up a large amount of times
I have heard of a number of Ceph nightmares like this. A few years back, Logos Bible Software, which has a Ceph-based content platform (a huge library of e-books with a massive amount of metadata), was down for a week because of a cascading Ceph cluster failure.
It really doesn't speak well of the Ceph architecture. It is highly performant, but at what cost? Failures on this scale can ruin a business.
Well, you can always partition a large cluster into many small clusters and prevent cascading failures and other issues from affecting everyone, or from taking too long to recover from. This is a very basic reliability technique everyone should know.
They've just started sending out SLA credit notices:
-----
Hello,
On 2018-04-01 at 7:08 UTC, one of several storage clusters in our FRA1 region suffered a cascading failure. As a result, multiple redundant hosts in the storage cluster suffered an Out Of Memory (OOM) condition and crashed nearly simultaneously.
We have identified that you, or your team account, were impacted by this incident and will grant an SLA credit equal to 30% of your entire Block Storage spend for April, not just usage in FRA1. This credit will appear on your account at the end of April, and will be reflected on your April 2018 invoice.
We apologize for the incident and recognize the impact this outage had on your work and business. You can read the full detail of our public post-mortem here: http://status.digitalocean.com/incidents/8sk3mbgp6jgl