We were loyal customers of DigitalOcean for over 2 years. We showed up to work one morning and had a client email stating that their website was down. We checked, and sure enough, it was. We tried logging in to our DO account and it said it was suspended. After searching around, we realized our CC had expired. No biggie, we'll just update it, turn the server back on, and be on our way.
Nope. We had to contact customer service to get back into our account. After updating the card info, we realized that all our droplets were gone. We reached out to customer service again, at which point they let us know that when a CC expires, an automated process kicks off and deletes the droplets. We've been a customer for 2 years; surely they could pick up the phone and call us. We've spent thousands of dollars with them.
Plan B, let's restore them from the backups we've been paying for. Nope! When they delete your droplets, they also delete your backups.
This is where we ask to speak to them on the phone and are denied. I then ask if they are insane and why they would delete someone's servers and backups via an automated process without a human at least checking to see if it is a loyal multi-year customer with a simple lapse in their CC expiry.
We had to find a backup on a developer's machine from over a month prior and rebuild data using various methods, which took close to a week. In the end, DO gave us a $500 credit.
You get what you pay for. Anyone who uses DO for production or anything more than a hobby app is playing with fire. They do not care, they are apathetic, they will screw you over and then throw you a credit for their terrible service as a half-assed apology.
One time my CC expired on DO because there was a fraudulent purchase on it and I had to get a new one. I got a brief e-mail from DO noting there was a problem charging my card, and updated it the same day.
What if I was on vacation, and didn't see this e-mail for a few days? shudder
In what world is automating the deletion of customer data a good idea?
The world where employees care more about management satisfaction than customer satisfaction? And it's difficult to prove that bothering some customers here and there is impacting corporate revenue. Policies like this slip through the cracks.
The timeframe is 3 weeks until suspension, then another 2 weeks until deletion. At the time this user was affected, it was 3 weeks until suspension, then 3 days until deletion, but that period was raised to 2 weeks due to feedback that 3 days wasn't enough.
I'm not saying that excuses things, but it's far from "at the first sign of trouble".
I had the exact same experience! Except I got no credit.
My CC expired but I was on vacation and didn't have access to email. When I was back, the droplet and all backups and snapshots were completely removed. For the record, I was only 15 days late. Very sad. I tried my best to get the backups from the support engineer but no luck.
>We've been a customer for 2 years, surely they could pick up the phone and call us.
I don't understand that reasoning from customers, to be honest. That right there throws away the $5 and $10 plans. Having someone try to contact a customer just because their credit card expired costs more than the actual hosting.
Top-tier service does not come from companies with $5-a-month plans. I'm sorry, it just doesn't.
Thank you for this. I have a large client where we were considering DigitalOcean and the yearly spend could easily be in the thousands. Now I'll strike them off the list. Automated deletion of VMs and backups is absolutely unacceptable.
The VM hosting market isn't looking good at the moment; it seems like each provider has some dark history at this point. It's really frustrating.
For what it's worth, this happened to me through a client's account (the client wasn't checking email and didn't see the warning emails about the CC expiring). I reached out to customer support once the site went down and, even though we were beyond the cut-off deadline, they re-activated all the droplets. Perhaps they have updated their process in this regard. I also learned my lesson and made sure I got forwarding on all emails related to DO.
I'm Zach, Director of Support here at DigitalOcean. Thanks for raising this topic.
I was able to locate your account, and I see that I was the one who granted the credit and followed up via email. I hadn't heard back from you until now, and I'm happy that we have an open line of communication. I'm hopeful that this thread starts a conversation, that it clarifies what steps we take, and we might even uncover a different solution that works well for everyone.
To start with, I want to be really clear, this is the absolute worst case scenario. I absolutely don't want it to happen to any customer, let alone someone who has been with us for such a long time. There's really a delicate balance that we must strike between customers who forget to pay and customers who do not want to pay.
Currently, our notifications for overdue balances are sent via email. If a customer account is unpaid on the first of the month, we send ~15 emails total notifying the customer of the situation, and subsequently power off servers 21 days after the account is on hold as a further way to gain your attention. At this time, droplets are removed from your account 14 days following power off, which is an increased amount of time from what you experienced. It was increased from 3 days based on past user feedback.
Why do we do it this way? In the past, we had no hold or suspension process for non-payment, which enabled bad actors to run for months and months without paying. From a business perspective, we made a decision to put a scalable process in place that limited how long an account could go unpaid.
As a support team and business, we are always willing to work with our customers who are unable to pay. We are always available via ticket, our contact form, and have made wide-ranging attempts to help users who aren't able to pay due to banking regulations: https://insidedigitalocean.com...
I'd like some input, so I'll publicly ask for feedback on questions that I've asked privately before. I'd also like any other feedback that we can consider on how to make this a more positive experience for everyone.
-Knowing that we do not do phone support, what's the best way to notify you of a past-due balance?
-Is SMS effective at times like this?
Nope, the only assurance you can get is that you won't be able to access your data. I am not sure how one can come to the conclusion that "there's no way they have a copy or keep it in an archive/deep backup."
Having a backup and letting you access it are two entirely different issues, and most of the time customer service staff don't know how data management is being handled by the engineering/operations team.
If you have used shared hosting, you know that as soon as you pay for your expired account, everything is back as it was. You never think this is a problem until you test how long you can go without paying and still restore your data.
If DO's customer care can't handle data management problems, they will probably connect you to their engineering/ops team. So I would assume they would prefer to resolve it (by handing over the data if they had it) instead of keeping it and letting the user rant/post about it (which they know will probably happen).
The second paragraph hinges on the rationale that if customer data were available, then CS (Customer Service) agents would have made every effort to contact the ops/engineering team to retrieve those to avoid a social disaster.
Unfortunately, this isn't true in the companies that I worked for, which are big customer-facing MNCs.
First of all, for US consumers, the CS teams are most likely contractors in the Philippines, Malaysia, or India, trained with scripted responses. For any issue beyond the scripts, their standard response will be "Sorry, we can't help you because we can't do x." One reason is that they have no incentive to escalate the issue and create more follow-up work for themselves; other times, management has explicitly decided this, to prevent too many cases from hitting the ops and, eventually, the engineering teams. It is not that management is stupid or nasty, just that the 80:20 rule dictates that 80% of the issues are usually raised by the 20% of particularly picky customers. For the 20% of customers who generate 80% of the revenue, each will be assigned an AM (Account Manager) who can get things done much faster and more efficiently than the CS team. This group of high-value customers is often given access to VPs to make sure all their issues are heard and handled properly.
The second reason is that most teams in a big company will not concern themselves with anything that doesn't hit them directly. For example, the CS team at my ex-company, after getting repeated complaints about an erratic component that didn't get a bug fix for months, eventually told the upset customer to fuck off and complain to the CEO directly, since it was the engineering department's fault.
Unbelievable? Maybe, and that company remains one of the largest high-tech companies in the world.
For the record, that's not how it works at DO. The support department encompasses Platform Support (like T1), Trust & Safety (handles cases of possible fraud or abuse complaints), and CloudOps (sysadmins who help maintain the cloud).
I'm on Platform Support and while I can't access the actual hypervisor hardware like CloudOps can, we do not have "scripts", and we always do our best to help customers in any way we can. We also have a myriad of tools that we can use to monitor our platform and help troubleshoot any issues that may arise in our tickets.
My roommate and bestie is on CloudOps, and she sits right next to me at our HQ in NYC. While we have a great remote culture, all of us on support are in the US, with the exception of one guy in London (who started last week and is AWESOME).
Not only that, but I work very closely with all of the other departments here at DO, and our executives are extremely accessible as well. If we need to pass on feedback, we always do so.
We're also encouraged to go out of scope and do whatever possible to help out our customers, and I wouldn't change a thing. I love what I do, I love the people I help, and I love my coworkers.
I had the opposite experience: my CC expired with DO, but I didn't have a new one for about 2 weeks. I didn't have any issues with just waiting until the next billing cycle and paying double then to cover the negative balance.
This is why you use two providers with two credit cards.
I'm quite serious. Our production environment operates on two cards with different issue & expiration dates from two different issuers and uses two different cloud providers.
This is a big danger of "move fast and break things". I imagine the process they put in place seemed reasonable internally before catching this edge case, but what a crappy result.
This seems like a non-issue to me. If you're using an IaaS provider you should be treating the network as volatile from the get-go. This is the reason AWS has things like auto-scaling groups. You should be designing for failure in "the cloud".
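For a sense of what that looks like in practice, here's a minimal sketch using the AWS CLI (the AMI ID, names, and sizes are hypothetical placeholders):

    # launch configuration: how to build a replacement instance from scratch
    aws autoscaling create-launch-configuration \
        --launch-configuration-name web-lc \
        --image-id ami-12345678 \
        --instance-type t2.micro \
        --key-name deploy-key

    # auto-scaling group: keep at least 2 instances alive across two AZs,
    # replacing any that die
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name web-asg \
        --launch-configuration-name web-lc \
        --min-size 2 --max-size 4 --desired-capacity 2 \
        --availability-zones us-east-1a us-east-1b

If an instance dies, the group just launches a replacement from the launch configuration; nothing on the dead box is assumed to survive.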
While my first intuition was to agree with you, there's certainly an upcoming generation of developers who have never operated their own root servers and the abstraction level in the cloud nowadays is so high it makes you easily forget that "Droplets" are just VMs are just servers running software. On the other hand, hardware reliability has increased in recent years due to RAID, fully redundant networking & power adaptors, you name it.
These facts combined certainly make it easy for the younger generation to forget that redundancy and backup still don't replace each other (never did).
This seems like the wrong lesson is being taught somewhere, though - the delete button is right there. It should be much more obvious that it is very easy to make a server go away.
This is so true. I wish there was a way to make this more noticeable (flashing red lights on the order forms, regular email warnings about the single point of failure, etc.). This simple fact is taught in entry-level system administration courses (RAID, Disaster Recovery Plans, Replication, High Availability, Fault Tolerance, etc.), but in reality some seem not to plan for it when moving to cloud services. I always suggest people set their apps up to be able to run out of multiple availability zones in case one goes down, if it is very important to them. For those that are not able to, I will suggest they at least set up replication to another server within the same data center and to an offsite location, in case their server has a hardware failure, an account gets locked, or some other situation occurs - to help take care of the just-in-case scenarios that happen to everyone.
Yep, somehow cloud customers have decided that by being in the cloud everything is redundant and durable and has several other magical properties.
At one of my positions we had stupidly put too many eggs in the basket of a single physical machine. Its disk controller failed in a way that it trashed the data volumes. I was unable to convince anyone that "move to Amazon" was not a one-step solution to "how do we make sure this never happens again".
>somehow cloud customers have decided that by being in the cloud everything is redundant and durable and has several other magical properties
To be fair, the point of the cloud is that they deal with redundancy, HA, distributing to multiple datacenters, etc. through their services - but you need to use and understand the implications of said services to leverage that.
If your server does something weird and corrupts the filesystem, you still have access to it. That can be a huge difference when it comes to restoring the most recent data.
The CAP theorem applies everywhere, but most software practices still do not assume that you must code for your app to handle performance variation and partial failure states, both of which are common scenarios in cloud infra.
I have to disagree - DO's "cloud servers" are equivalent to virtual private servers, which you would never expect to lose in this manner from other providers.
The lack of explanation is what worries me most - it leads me to think this might have been a case of "we forgot to replace a bad drive, then the second in the pair failed".
I remember a ~36 hour downtime with a serious UK provider (in my case just a shell account, but they also housed managed and unmanaged servers). They had redundant fiber links to the NOC, from competing providers. But it turns out they both ran through the same box under a highway. And then someone blew that box up in order to knock out alarms while they committed a robbery...
My family business website got killed once by a flood.
And many years ago it suffered a truck crash (the truck crashed into the lamp post right outside the server company's building and destroyed the telecom wiring).
Agreed. This shouldn't be an issue for anyone using cloud services - or any computer, really. Except for some super duper redundant mainframes maybe, you should always be able to deal with losing a component, no matter what it is. If you really care, always have multiple instances, in multiple (availability) zones, etc.
Not a huge company like DO, but iwStack [1] provides a SAN backed cloud, with selectable KVM/XEN instances, custom ISO and virtual network support. The prices are similar to DO.
[1] http://iwstack.com/
I think one of the reasons is that they only have a small number of datacenters, have a small number of (very friendly) staff and are not going after the mass market like DO, and I guess they don't spend anything on marketing.
That's what AWS's "Elastic Block Storage" is. You can turn it off and just use instance storage (and I personally prefer to, for truly ephemeral nodes), but it increases spawn time since your disk image actually has to get copied over to the VM host machine in that case, rather than just "attached" over EBS.
Then why is the EBS failure rate several orders of magnitude higher than in SAN deployments? A SAN provider would quickly be out of business with a 0.1-0.5% annual failure rate.
Just because it's a SAN doesn't mean a given abstract block device from it is backed by RAID. It's literally just a multiplexed and QoSed network-attached storage cluster.
I actually prefer the lower-level abstraction: if you want a lower failure rate (or higher speed), you can RAID together attached EBS volumes yourself on the client side and work with the resultant logical volume.
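A rough sketch of that client-side approach with mdadm, assuming two already-attached EBS volumes showing up as /dev/xvdf and /dev/xvdg (the device names are placeholders):

    # RAID0 for speed/IOPS; use --level=1 instead if you want a lower failure rate
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg
    sudo mkfs.ext4 /dev/md0
    sudo mkdir -p /data && sudo mount /dev/md0 /data

    # persist the array and the mount across reboots
    sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
    echo '/dev/md0 /data ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab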
On AWS, an EBS volume is only usable from one availability zone. You still need to use application-level replication to get geographic redundancy for important data, and when you have that, EBS just lets you be lazy rather than eager about copying a snapshot to local instances.
I guess I was thinking in terms of using EBS for ephemeral business-tier nodes, rather than as the backing store of your custom database-tier. (I usually use AWS's RDS Postgres for my database.)
For ephemeral business-tier nodes, EBS gives you a few advantages, but none of them are that astounding:
• the ability to "scale hot" by "pausing" (i.e. powering off) the instances you aren't using rather than terminating them, then un-pausing them when you need them again;
• the ability for EC2 to move your instances between VM hosts when Xen maintenance needs to be done, rather than forcibly terminating them. (Which only really matters if you've got circuit-switched connections without auto-reconnect—the same kind of systems where you'd be forced into doing e.g. Erlang hot-upgrades.)
• the ability to RAID0 EBS volumes together to get more IOPS, unlike instance storage. (But that isn't an inherent property of EBS being network-attached; it's just a property of EBS providing bus bandwidth that scales with the number of volumes attached, where the instance storage is just regular logical volumes that all probably sit on the same local VM host disk. A different host could get the same effect by allocating users isolated local physical disks per instance, such that attaching two volumes gives you two real PVs to RAID.)
• the ability to quickly attach and detach volumes containing large datasets, allowing you to zero-copy "pass" a data set between instances. Anything that can be done with Docker "data volumes" can be done with EBS volumes too. You can create a processing pipeline where each stage is represented as a pre-made AMI, where each VM is spawned in turn with the same "working state" EBS volume attached; modifies it; and then terminates. Alternately, you can have an EC2 instance that attaches, modifies, and detaches a thousand EBS volumes in turn. (I think this is how Amazon expected people would use AWS originally—the AMI+EBS abstractions, as designed, are extremely amenable to being used in the way most people use Docker images and data-volumes. The "AMI marketplace" makes perfect sense when you imagine Docker images in place of AMIs, too. Amazon just didn't consider that the cost for running complete OS VMs, and storing complete OS boot volumes, might be too high to facilitate that approach very well. Unikernels might bring this back, though.)
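For the attach/detach pattern in that last point, the underlying AWS CLI calls are roughly the following (volume and instance IDs are placeholders; the volume and instance must be in the same availability zone):

    # hand the data set to the next stage of the pipeline
    aws ec2 detach-volume --volume-id vol-0abc1234
    aws ec2 attach-volume --volume-id vol-0abc1234 \
        --instance-id i-0def5678 --device /dev/sdf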
I worked for a company that believed that, for a while.
They bought a hugely expensive SAN solution (HP, I think). One of my questions was: what happens when the SAN fails? Well you see, because it has redundancy, that can't happen. Clever, huh?
The first time it failed, everything ground to a halt for two days. The second time they did better and had it running within eight hours.
I suggest Digital Ocean should be using some mechanism that stops a single server failure from affecting data. A SAN-backed block store allows you to set rules for the level of redundancy required, as well as replication mechanisms for things like firmware failures, to a much greater degree than a host-backed RAID array.
That's completely different from suggesting 'SANs are magical boxes incapable of losing data'. Have you considered apologising?
Servers affect data. That's kind of the whole point of servers. Maybe a multi-SAN setup with synchronous replication would have prevented the data loss. Maybe not. Not enough information has been provided to know if the cause of the data loss was the storage or the server.
But if DO had a multi-SAN setup with synchronous replication this thread wouldn't exist because DO's business model of "super cheap VPS w/ fast SSD storage" would have failed due to costs. Everybody wants Five Nines until they have to pay for it...
Digital Ocean sell 'Simple Cloud Infrastructure'. Server data might be transient, but most web developers aren't infrastructure engineers and will have no idea of this - certainly it seems to have come as a surprise to some people. Digital Ocean need to manage their customers' expectations - and upsell for people who want permanent storage - as a rude shock doesn't do well for their brand.
> DO should do what's written in their terms of service
Agreed.
> the customer should read them carefully.
Sure, but they shouldn't need to, as Digital Ocean should set expectations clearly.
Keep in mind Digital Ocean's <title>: Simple Cloud Infrastructure. Being surprised because there's an unsafe default hidden in a document somewhere didn't work out for MongoDB and it won't work out for DO.
Isn't "non-issue" a bit of an exaggeration? If the dry cleaners lost my clothes, the bank lost my money, a valet lost my car, or gmail lost my inbox, I'd be angry.
If the bank lost the particular dollar bill that you deposited last week, but offered you a new dollar bill, would you get angry? If you rent a car from the airport every time you visit a city, and it's always been the same car, but you show up one night and they tell you the car was in an accident and they'll get you another car, would you be angry?
Yes, but those aren't expected outcomes when using those services. That's the difference gdgtfiend was pointing out — you should expect cloud servers to sometimes go away.
They are expected outcomes. Banks fail (think 2008). Dry cleaners do lose clothes. Cars do get stolen. Plenty of stories in the news of all 3 of these. Not so sure about gmail losing all your email, but I was able to accidentally access someone else's gdocs once. I just logged in as me and saw a complete stranger's docs.
The expectation with these services you mentioned, however, is reliability.
The implicit (and explicit) volatility of cloud hosting should change that expectation with such services.
"Non-issue" is an exaggeration because it was potentially a catastrophe. But this is sort of the way these services "work." Servers/droplets/instances are ephemeral and replaceable, and their underlying data is not guaranteed in a failure.
I'm glad the author didn't turn this into a negative review of DO. "What will you do when your server is lost?" is exactly the right question to learn from a scenario like this.
Agreed. There may also be a customer service lesson in that refund email. Typically, your customer is not happy with you even after they receive a refund, so exclamation points and language like "Booyah!" are really not a good idea for that sort of message. It's always going to sound a little bit tone deaf -- and very tone deaf when the refund was for a major incident like the loss of an entire server.
It's the standard DO credit email, which typically comes from referring friends or doing something else to acquire credit, not from getting a refund because DO accidentally deleted your droplet.
Though there should be another avenue for such things.
Hmm, maybe that's what they're thinking. But I wonder if it's really true that a refer-a-friend or similar happy scenario is the "typical" cause for a refund. Certainly in my life I've received far more refunds due to company mistakes than credits due to the happier reasons you mention.
And when in doubt, if your refund system is not capable of discriminating between the two, it seems to me wiser to err on the side of a less jubilant message.
DO credits from positive events are pretty common in my experience; not just refer-a-friend, but they give out credits through ads on podcasts and the like, through promotions for students and whatnot. I think I've had something like $150 total gifted DO credit.
If you're relying on backups for servers other than your database then you're keeping state on your servers and that's a Bad Thing. You should regularly destroy your own servers and recreate them using your configuration / deployment scripts if the prospect of this happening worries you. Do it before your business starts to rely on it.
For database servers you need to have procedures in place to quickly switch the production application over to a database you've just spun up and loaded from a recent backup; doing this regularly is also how you test your data backups. You want this to happen as smoothly as possible in the case of a failure. Keep data backups in three different places and test your procedure on all of them.
On a production system, do not keep the database on the same server as the application.
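A rough sketch of that restore drill, assuming Postgres, a nightly pg_dump archive in S3, and a freshly provisioned database host (all names and paths are hypothetical placeholders):

    # fetch the most recent dump
    aws s3 cp s3://example-backups/pg/latest.dump /tmp/latest.dump

    # load it into the fresh database server
    createdb -h new-db.internal -U app appdb
    pg_restore -h new-db.internal -U app -d appdb --no-owner /tmp/latest.dump

    # sanity check before repointing the application's DB config at new-db.internal
    psql -h new-db.internal -U app -d appdb -c 'SELECT count(*) FROM orders;'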
It's so insanely easy to get started with puppet or ansible I wouldn't start rolling out a product without using them from the get-go at this point. At the very least properly package your software (read: DEB or RPM, not docker containers) so it's easy to reinstall quickly if need be.
Personally, I like platform libraries rather than OS packages. I have several Ruby projects set up as gems. What's neat about that is it allows code reuse across projects.
So go ahead and vendor them in; outside of inclusion in distribution software packages (with exceptions), there's nothing stopping you from including your own dependencies. Packaging your application properly makes deployment a hell of a lot easier and less error-prone.
And the nice thing about DO and some of their cohort is that they charge by the hour for servers. Fire up new droplets and install your upgrades on them, smoke test the system, swap your routing over to the new machines, back up the old droplets (just in case you missed something), and then destroy them.
If that takes 4 hours you've hardly spent any money at all, and you're running your cluster the way a lot of people already do.
Because state left on servers is inevitably unmanaged and will get lost eventually. At my job, I have cron jobs running on the production system from years ago that I have no clue what they do and no time to try to figure it out. I find out when they fail and someone, usually customer service, complains.
If the server goes away, then even once I redeploy, I've lost all those cron jobs. Who knows what will happen then.
On systems I build, cron jobs are added as part of the deployment process and managed as code, in the git repository.
That's just one example. Others include iptables rules, FTP server configuration, startup scripts, and such. If it's not in your codebase, it's unmanaged state on the server that you will lose if your cloud provider takes a shit. All of that needs to be managed as code or as configuration and deploy scripts should contain idempotent commands that add them if they are not present. I consider shelling into a server for any reason to be unideal and will think of ways to avoid it.
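For instance, a minimal sketch of what those idempotent deploy commands can look like, with a cron file shipped from the repo and a firewall rule added only if missing (the paths and the rule itself are placeholders):

    # cron: keep the jobs as a file in the repo and install it verbatim on deploy
    install -m 644 deploy/cron.d/app-jobs /etc/cron.d/app-jobs

    # iptables: -C checks whether the rule already exists; append only if it doesn't
    iptables -C INPUT -p tcp --dport 443 -j ACCEPT 2>/dev/null \
        || iptables -A INPUT -p tcp --dport 443 -j ACCEPT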
If you are asking about database being on the same server as the application, it's because it makes infrastructure management more difficult. Application servers are set up quickly, torn down quickly, and work as soon as you deploy. Database servers are fortresses, you don't build new ones and tear them down half as quickly, unless you've got scaling requirements.
It is tempting to want to put data on the same server as the application early on in order to save on hosting costs, with the understanding that you'll separate them later, but that's a rookie move. It's much less work to separate them now, when there's nothing riding on it, much less headache.
The other reason to keep the DB separate is that they have different IO patterns that are not always complementary, and it's easier to gather diagnostic metrics if the data isn't munged. In the days of physical servers you'd also choose different hardware for the DB machine.
Configuration capture is a big thing, and it separates the adults from the kids. I've had hardware failure of my dev box twice in my career, and when I tell certain people that I have to rebuild my machine you can see their faces blanch. These are the people I know I haven't converted yet.
Everything important, except for the handful of things I'm actively investigating, is always stored on shared machines. Version control. Wiki. Some sort of artifact repository. The instructions for getting an environment set up with version X of our software are also stored in one of those places, and have been vetted by every new developer and some of the QA team. I can have my machine back up and running in a couple of hours, and most of that is waiting for downloads we didn't have the presence of mind to store copies of locally.
Why do I care about this stuff? Seems sort of OCD on the face of it, and maybe you're right. But sooner or later you're going to have a high severity issue in production that should be an all-hands-on-deck affair, and if you haven't done this work, losing a hard drive on a dev laptop will be the least of your worries. Everyone busy doing work on X+1 needs to be able to get back to version X and all of its dependencies in under an hour, and by themselves, because the people who could help them are most likely on the front lines of fixing the bug as fast as possible.
What's more, someone probably needs to get versions X-1 and X-2 running to figure out if you need to warn people using older versions. So that's getting people running 3 or 4 versions of your software autonomously, so that you can identify repro steps, long and short term mitigation strategies, verify that they work, formulate a bulletin and provide patches for people. Not only do you need to get your configuration captured/documented, you need it captured 2-3 versions of your software ago, which means you need to start thinking about this stuff pretty early in your project.
I think the other point of contention was your term "database". There are lots and lots of systems out there that rely on data that is not stored in a database. The data is the same kind of thing you would store in a database, but it exists as files on a filesystem instead (or in some other data storage device that is usually not referred to as a database).
That said you could easily use the same advice for that data. But in some cases the application itself needs to be on the same server as the database. Think if you were building postgres for instance...
> There are lots and lots of systems out there that rely on data that is not stored in a database. The data is the same kind of thing you would store in a database, but it exists as files on a filesystem instead (or in some other data storage device that is usually not referred to as a database).
If you can't afford to lose it, then it needs to be appropriately managed so it can be recreated in the case that it's lost. If you can afford to lose it, then it's not really state and you can forget about it.
I think people will understand more easily if instead of saying "database" specifically, you say "database tier."
I'm pretty sure the usual understanding of the terms "business tier" and "database tier" is that the "business tier" is a bunch of ephemeral nodes, and the "database tier" is where-ever the data from the "business tier" goes to be persisted.
Probably due to optimization of the conversion funnel, it's VERY easy to build things on DigitalOcean without understanding that - unlike many other *-as-a-service - droplets do not include backups and that is your responsibility. As a sysadmin, this is fine since I don't trust a single provider for both prod and backups anyhow, but I fear for those who are just getting started and don't realize this.
I agree that DO handled this situation appropriately, but they should warn people a little more deliberately when they sign up and create a droplet. A checkbox along the lines of "I, user, understand that I'm responsible for backing up my server", with a few links to some how-to articles of the excellent quality that DO is known for, would suffice and empower without reducing the conversion rate.
I would really hope that the presence of a "Backups" checkbox on the droplet creation page is enough to tell the (presumably tech-savvy) user that backups don't happen unless you check that box.
That's fine, but they didn't introduce this functionality until recently. One day, I logged into my DO console (I don't have a reason to do this very often), and saw a new backup tab. Older customers might not even be aware.
I don't host anything I couldn't live without though, so I'm not really worried.
The functionality has been there for a long time. It's possible the checkbox on droplet creation is new, and it used to require activating backups afterwards, I don't recall.
That's probably wise, I really should do the same. My important stuff is in git with a remote repository, but losing all my configuration and setup would still not be great.
SolusVM, the software that powers the majority of the VPS market (I would love to argue how DO isn't anything 'cloud'), provides providers with backups out of the box. It's a matter of them setting it up, and it almost always is. If it's not, well, you're probably with someone who has no idea what they're doing.
This is yet another reason to push forward the idea of static websites/webapps. There is no single argument to support the idea of using a database to power a 5-6 page website. Not even a blog with daily posts. How many sites out there have fewer than 10 pages, are updated maybe once per month, and sport a contact form or newsletter subscription box? 90%? And how many of them use an insecure behemoth like WP or Joomla?
My new model: the client wants a pretty responsive theme, I get the HTML version of it and turn it into a template I can use to spit out a static site (from my homebrewed CMS). I tell clients their monthly support charges are $0. If they ever need updates, I charge hourly (most never call in months). Some clients want "a list of something", say products. If it's small and won't grow: a static CSV file that also gets munched by the homebrewed CMS. Large list? A BaaS service with an API.
I really don't care if the servers are gone/hacked/ransomware. Git clone, go.
Those 'behemoth-like' CMSes have spawned a new range of amateur web-development agencies who spit out a design and theme and finish the complete project with less cost, less maintenance, and fast delivery.
The client (who doesn't want to contact the agency again) is given instructions on how to update, which makes them happy. And then such projects spread like wildfire.
I think Google Compute Engine does the right thing here: by default it uses "persistent disks" (network-attached redundant/highly-available block devices) for all disks. The only case I've heard of where persistent disk data was lost was a few acknowledged writes occurring just before an unusual lightning-induced power outage: https://status.cloud.google.com/incident/compute/15056
For added protection, you can take regular snapshots. You only pay to store the diff from the last snapshot (so go ahead and snapshot often), and snapshot storage is geographically distributed.
(Note: I have no idea what EC2 does, maybe it's similar.)
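A rough sketch of that snapshot workflow with the gcloud CLI (disk, zone, and snapshot names are placeholders):

    # take an incremental snapshot of a persistent disk
    gcloud compute disks snapshot web-disk --zone us-central1-a \
        --snapshot-names web-disk-2016-01-15

    # restore by creating a new disk from any snapshot
    gcloud compute disks create web-disk-restored \
        --source-snapshot web-disk-2016-01-15 --zone us-central1-a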
I've been suspicious of network-attached block storage ever since the Amazon EBS cascading failure of April 2011 and these three blog posts that Joyent did in response:
I'm having trouble unpacking useful information out of these rambly blog posts. It sounds like the basic complaint is that abstractions add complexity, which is more trouble when it goes wrong.
I agree that if you have a large-scale sophisticated operation you will probably want to choose local disks and handle availability your own way (and GCE provides those options). But for small-scale operations that can't justify as much engineering, abstractions like persistent disks and automatic migration will save a ton of time and avoid data loss. Ironically, Digital Ocean seems to target this smaller scale, yet they've set the wrong defaults for their target market.
Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% - 0.2%, where failure refers to a complete or partial loss of the volume, depending on the size and performance of the volume. This makes EBS volumes 20 times more reliable than typical commodity disk drives, which fail with an AFR of around 4%. EBS also supports a snapshot feature, which is a good way to take point-in-time backups of your data.
I definitely wouldn't recommend DO's built in backup service in production based on my experience. It brings my site down every week since all I/O freezes during the backup. Sometimes the droplet doesn't recover and needs a hard reset. Apparently it's a known issue but hard to fix.
"Luckily, we made the decision at Spatie to host every site on it’s own droplet, so only one site was affected."
I think that's a poor lesson learned here. Were this me, I would have said:
"Luckily, all of our sites run on several servers, access data in a shared, replicated cluster, and a small shell script I wrote kept me from writing this entire blog post."
IaaS has only surfaced what has always been true: your data lives on little physical things that are screwed into a thing and goes through a controller that could fuck up due to cosmic rays.
1. Never have a single point of failure. Relying on DO for server + backups is putting all your eggs in one basket.
2. Your server state should be programmable. This is not all that easy for complex configurations, but today DO has an API, we have Docker, and quite modern deployment tools.
Here is my setup:
1. GitHub for the server state. Basically, a repository to configure and deploy my infrastructure.
2. Enable DO backups in case I mess up something and want a quick way back.
3. File backups through Tarsnap. Since I use Docker, I have a volume container, and I back up the volume container with Tarsnap (rough sketch below).
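A rough sketch of step 3, assuming a data-volume container named app_data that keeps its data under /data, and a Tarsnap key already set up with tarsnap-keygen (container and path names are placeholders):

    # export the volume container's data to a directory on the host
    docker run --rm --volumes-from app_data -v /srv/backup:/backup alpine \
        tar czf /backup/app-data.tar.gz /data

    # push it to Tarsnap as a dated archive
    tarsnap -c -f "app-data-$(date +%Y-%m-%d)" /srv/backup/app-data.tar.gz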
That really stinks, but this should be a warning to you that in the future, you should have some kind of redundancy plan so data does not get lost. Even if Amazon decided to terminate all of my app instances right now, I would just need to rebuild them. No data is actually lost, and furthermore backups are made frequently so that if a server decides to explode one day, we're still covered. DigitalOcean droplets are really not meant to be holding your database and application server...you shouldn't have experienced "data loss" by losing a droplet because your database should not have been hosted on a droplet.
Which is the irony of the entire post because the tldr was: "our DO instance was lost, fortunately we keep multiple backups of the data so we were able to restore without any data loss". It seems more like a passive aggressive note against Digital Ocean.
Why not? How is a droplet different than any other server, including your own hardware? All of them will fail eventually, the important bit is having a sound replication/backup strategy as you said yourself.
They are regularly unable to create new droplets [or have other control issues where you can't perform normal functions with normal latency] and/or have full DC outages.
For pretty much any use case where I'd use something like DO, being unable to create new VMs, etc. is the same as an outage.
I asked why as I work for DO's operations team and I wanted to know what your concerns were. Do you have a specific instance of a problem? What you're describing would be considered a MAJOR outage for us and we have not had one in quite some time.
That is annoying enough that I'd want to be able to fail over if it exceeded 15 minutes. I get it might not be an issue for anyone else, but failing over is less complex than mitigating latency issues with droplet creation or whatever.
I get "regularly" to me might mean something different to you but if the 2 DO DCs I was using are hit literally every month with a problem of some kind...it seems regular to me.
To my surprise, this HN thread has no link to any external backup solution guide and little to no suggestion of the best way to back up your server to an external service or another VPS/backup server.
What if you need to scale up to more servers? What if your server gets hacked and you have to recreate it? What if you accidentally delete the server or perform an upgrade that completely breaks it?
None of these would be an issue if you've scripted server recreation from scratch and permanent data is stored in high availability shared databases and services like S3. If you're relying on weekly backups to save your server state and customer data you're just asking for trouble. I like that Heroku recreates your server once a day to force you to do this properly.
The weekly backup deal is something I really do not like with Digital Ocean; I think Linode does it better. I've been planning to move for a while now, I've just been super lazy, but this might be the push I need. 7 days of lost data is unacceptable to me, especially when I'm paying for backups.
> Three backup slots are executed and rotated automatically: a daily backup, a 2-7 day old backup, and an 8-14 day old backup. A fourth backup slot is available for on-demand backups.
I see DO's weekly backup as a convenience feature for if you need to restore your server, not as your primary backup method. Use the DO snapshot to restore most of the way, then run an up-to-date differential restore from your other (real) backup solution (that is hosted somewhere else).
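Concretely, that second step might be nothing more than pulling the recent delta from wherever the real backups live (hostnames, paths, and database names are placeholders):

    # droplet already restored from the DO snapshot; now catch up to last night
    rsync -av --delete backup.example.com:/backups/www/latest/ /var/www/
    aws s3 cp s3://example-backups/pg/latest.dump /tmp/ \
        && pg_restore -d appdb --clean --no-owner /tmp/latest.dump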
But wouldn't it be much more convenient (largely because of the complexity) to have the complete backup solution at the same place (i.e. DO)?
Obviously, you want some offsite backups rather than completely relying on DO but that should be for disaster recovery (massive failures in DO's DCs, etc) rather than a more routine one.
Agreed. When you actually need to restore something from backup, often you need a fairly recent state. Hours or days ago (and not 7 days ago).
Having something 7 days old might be useful in some cases (especially e.g. DB dump) but it is far from 'good enough'.
I am not sure what the reason is that they do not provide more recent backups (for an additional fee, obviously).
Would it be so complicated that the engineering effort required wouldn't make it worth it? Or such a service wouldn't be popular enough among DO users?
Yes, exactly, besides that. What makes you feel your data is so public that it's better to host it on a public tracker?
Because all those earlier breaches (which you are referring to) revealed customer info, not their data. If your point was about user info, I wouldn't have asked.
Although I have backups of my critical files, I don't want to try to rebuild from those files if I can help it.
After learning that Digital Ocean does no backups of their own (even for critical hardware failure on their own side) I've enabled weekly backups for an additional $1 per month.
Glad you posted this so that I know that option is available and really necessary. And who wouldn't pay $1?
Edit: It costs 20% but I only pay $5 for my personal server.
If you're running in "the cloud", you should _always_ be able to destroy your instances and have them auto-rebuild from configuration management and a persistent object storage system (git repo + S3 or Google Nearline).
Do you have any guides or tutorials I could possibly follow to have a setup like that? I'm working on configuring a site and would like the ability to 1) scale as quickly as possible or 2) rebuild in the event of total failure.
I just don't even know where to begin. Is this something I could stand up locally and push to like s3, spin up a new server and install 1 thing and have it pull the configs and software installs down?
> I just don't even know where to begin. Is this something I could stand up locally and push to like s3, spin up a new server and install 1 thing and have it pull the configs and software installs down?
Yes. You could use shell scripts, Ansible, Salt, Puppet, whatever. I'm fond of Ansible or shell scripts, depending on complexity.
It's a semi-religious topic, so you will get varied and loud opinions.
My opinion is Ansible. Basically Python + YAML: you list out the steps to set up your server. It takes a few minutes longer the first time around, but you can remake the server in 60 seconds in 6 months. Lots of Ansible tutorials around, try it out.
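If it helps, the day-to-day loop might look roughly like this, assuming the playbook lives in a git repo and a fresh droplet's IP is listed in an inventory file (repo and file names are placeholders):

    git clone git@github.com:example/infrastructure.git && cd infrastructure
    # inventory/production lists the new droplet; site.yml holds the steps you wrote once
    ansible-playbook -i inventory/production site.yml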
Does anyone have any good backup solutions to mitigate this? My agency uses a script we wrote in-house to more-or-less rsync our data to an AWS instance, but it's always seemed a poor way to handle it. I'm using DO's backup service for my personal site which was always supposed to be a temporary solution (the timing of the backups is inconvenient as mentioned by the article).
This isn't exactly turnkey but for my personal servers, I use rsync.net (ask for the HN discount) and attic [0]. It's easy to script and call with cron. (Of course, this is still somewhat DIY, and any backup solution is only as good as your monitoring + restore testing.)
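A rough sketch of what that script-plus-cron setup might look like with attic against rsync.net (the hostname, repository path, and retention settings are placeholders):

    # one-time: create the remote repository
    attic init user@usw-s001.rsync.net:backups.attic

    # nightly cron job: deduplicated archive plus pruning of old ones
    attic create user@usw-s001.rsync.net:backups.attic::"$(hostname)-$(date +%Y-%m-%d)" \
        /etc /var/www /home
    attic prune --keep-daily=7 --keep-weekly=4 --keep-monthly=6 \
        user@usw-s001.rsync.net:backups.attic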
The comment thread on this post makes me very happy. Having recently switched all my production instances to individual docker containers, I'm very happy to hear that the consensus seems to be that server instances should be killed and spun back up like bacteria. There's hope for humanity yet.
I do understand the part where they terminated and killed off the VMs. I do not, at all, understand why they deleted the backups. How can the CEO sit in his chair and make that decision? If he didn't and he doesn't know, why is he still in that chair?
Even damn Netflix preserves data for 10 months. Deleting backups is making 100% sure that the customer doesn't come back to you. It's amazingly stupid.
Unless you're as smart and as big as Google, use the damn phone. Don't use the word automated at all. It's not automated if one fool made a decision and another one wrote the script.
Oh well, it's not like DigitalOcean doesn't tell you to enable backups. And you should not expect that hardware failures are now a thing of the past just because it's in "the cloud".
If you're paying for backups and they get deleted the moment there's an issue with payment, ur doin' it rong. I don't care whatever the TOS says, they're being completely shitty to customers. Sure there should be regular off-site backups. No one is arguing against that. But the fastest restore is usually from the closest source, and that's partly why you'd pay them for it.
Anecdotally: Every company I've worked for that uses hardware raid controllers, it feels like the controllers break about as often as one of the disks. Sure, still an improvement over no RAID, but still ridiculous compared to any decent software raid (md or ZFS).
You lose a RAID card as often as spinning metal disks? Sure that's not a wild exaggeration? Sure sounds like one. In 15+ years of doing this, I've only been a part of a RAID card swap three times. Once was just to be safe with an overly paranoid customer.
That said, I'm a fan of moving away from RAID. It's just more complexity in the face of a movement towards simpler architectures. The one benefit of a good RAID card: stupid fast, safe writes with a BBWC.
I hadn't heard of the backup providers mentioned in the article. Does anyone have experience with those or with other recommended solutions?
I can re-implement the infrastructure of my VMs fairly easily (thank you Ansible), but backing up content outside of the provider's built-in options is something I haven't played with yet, and would obviously be the crucial piece.
We use Backupninja to drive backups, mostly Duplicity against S3. Some things, like Postgres dumps, are custom commands that just upload individual files to S3. The reason you want to wrap things in Backupninja is to centralize logging, scheduling and monitoring.
Note that we only back up specific things like databases, logs (central syslog server) and data directories. We use Puppet to configure our boxes, and consider everything to be expendable that isn't data.
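For what it's worth, the "custom commands that just upload individual files to S3" part might boil down to something like this (database and bucket names are placeholders; in Backupninja it would live as a handler under /etc/backup.d):

    # dump in custom format, then ship the single file to S3
    pg_dump -Fc appdb > /var/backups/appdb-$(date +%Y-%m-%d).dump
    aws s3 cp /var/backups/appdb-$(date +%Y-%m-%d).dump s3://example-backups/pg/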
Tell me if I'm doing something wrong: my two EC2 servers both use EBS, including for the root device, so I'm not using any locally attached instance storage. If Amazon "lost" my instance, it would appear to me to be just a reboot.
And yes, I do a nightly backup of databases, webroot, /etc and some other directories.
I use DO as a test bed for a lot of stuff. I ended up destroying/recreating so many droplets I just started using Chef. Also, I come from a major background in AWS, ChaosMonkey in prod, and a very strict push-button + nuke & pave deployment strategy.
So, all this strikes me as "lol you didn't know this?"
What's the point of RAID if a RAID card failure results in complete loss of data? I came across an article a week or two ago that discussed various alternatives, I think ZFS with checksums, which unlike RAID will not replicate corrupt data from a drive nearing failure to the healthy drive.
They should have been able to swap out the RAID card and import the configs from the disks. This is pretty common practice with LSI/Avago-based cards. Curious as to why they didn't do that. Maybe the card went really bad and started writing garbage all over the place?
Holy hell, obviously where critical data is concerned, redundancy is something you should insist on, but this definitely sucks. Seems ridiculous that all you got from DigitalOcean is a $15 credit though. I'd have expected something like 6 months of paid backups comped.
We are also experiencing a huge problem with the performance of DO servers as we scale, especially databases. With stories like that, the decision to move to AWS seems pretty obvious.
Anyone have any experience with running VMs on Ceph storage? I've used it for other things, but I know that VM hosting is a quite popular use. Care to share any stories?
I had two DO machines with an unrepairable file system after a crash and reboot. Shit happens. Never rely on _any_ VPS. I treat them as throw-away boxes that can fail anytime.
No surprise for me. It also happened to me on DO one time. It seems I had my machine on a bad rack/server there. In the end you get what you pay for, I would say... you can have luck with your DO server, but there is a reason why Rackspace and Amazon still exist nowadays! If you can afford one of the bigger ones, go for it.
Rackspace!!1 ha ha ha ha ha.. (neurotic laugh..)
They took our whole infrastructure down for a day just because their highly paid Linux support team changed all root passwords, screwed up the firewalls, and added unneeded load balancers so that servers weren't able to communicate with each other. We didn't even request anything (as I remember)... Just bought their support and this shit happened. That was a nightmare.
How many servers? 1, 100, 1,000, 10,000? Where are you hosting them, in your home or in a datacenter? How much are you paying for bandwidth and power? What SLAs do you have in place for network and power? How much do you pay for your remote hands?
I hope you see what I'm getting at. While I love ZFS and think self-hosting can make a lot of economic sense, without knowing all the variables you can't really make a fair comparison.