Back in the late 90s, I implemented the first systematic monitoring of WalMart Stores' global network, including all of the store routers, hubs (not switches yet!) and 900 MHz access points.
Did you know that WalMart had some stores in Indonesia? They did until 1998.
Around that same time, the equipment in the Jakarta store started sending high temperature alerts prior to going offline. Our NOC wasn't able to reach anyone in the store.
The alerts were quite accurate: that was one of the many buildings that had been burned down in the riots. I guess it's slightly surprising that electrical power to the equipment in question lasted long enough to allow temperature alerting. Most of our stores back then used satellite for their permanent network connection, so it's possible telecom died prior to the fire reaching the UPC office.
In a couple of prominent places in the home office, there were large cutouts of all of the countries WalMart was in at the time up on the walls. A couple of weeks after this event, the Indonesia one was taken down over the weekend and the others re-arranged.
Thanks for sharing this interesting story. Part of my family immigrated from Indonesia due to those riots, but I was unaware up until today of the details covered by the Wikipedia article you linked.
I remember during the 2000s and 2010s that WalMart in the USA earned a reputation for its inventories primarily consisting of Chinese-made goods. I'm not sure if that reputation goes all the way back to 1998, but it makes me wonder if WalMart was especially targeted by the anti-Chinese element of the Indonesian riots because of it.
I can't recall (and probably didn't know at the time...it was far from my area) where products were sourced for the Indonesia stores.
Prior to the early 2000s, WalMart had a strong 'buy American' push. It was even in their advertising at the time, and literally written on the walls at the home office in Bentonville.
Realities changed, though, as whole classes of products were more frequently simply not available from the United States, and that policy and advertising approach were quietly dropped.
Just for the hell of it, I did a quick youtube search: "walmart buy american advertisement" and this came up: https://www.youtube.com/watch?v=XG-GqDeLfI4 "Buy American - Walmart Ad". Description says it's from the 1980s, and that looks about right.
What the hell, here's another story. The summary to catch your attention: in the early 2000s, I first became aware of WalMart's full scale switch to product sourcing from China by noting some very unusual automated network to site mappings.
Part of what my team (Network Management) did was write code and tools to automate all of the various things that needed to be done with networking gear. A big piece of that was automatically discovering the network. Prior to our auto discovery work, there was no good data source for or inventory of the routers, hubs, switches, cache engines, access points, load balancers, VOIP controllers...you name it.
On the surface, it seems scandalous that we didn't know what was on our own network, but in reality, short of comprehensive and accurate auto discovery, there was no way to keep track of everything, for a number of reasons.
First was the staggering scope: when I left the team, there were 180,000 network devices handling the traffic for tens of millions of end nodes across nearly 5,000 stores, hundreds of distribution centers and hundreds of home office sites/buildings in well over a dozen countries. The main US Home Office in Bentonville, Arkansas was responsible for managing all of this gear, even as many of the international home offices were responsible for buying and scheduling the installation of the same gear.
At any given time, there were a dozen store network equipment rollouts ongoing, where a 'rollout' is having people visit some large percentage of stores intending to make some kind of physical change: installing new hardware, removing old equipment, adding cards to existing gear, etc.
If store 1234 in Lexington, Kentucky (I remember because it was my favorite unofficial 'test' store :) was to get some new switches installed, we would probably not know what day or time the tech to do the work was going to arrive.
ANYway...all that adds up to thousands of people coming in and messing with our physical network, at all hours of the day and night, all over the world, constantly.
Robust and automated discovery of the network was a must, and my team implemented that. The raw network discovery tool was called Drake, named after this guy: https://en.wikipedia.org/wiki/Francis_Drake and the tool that used many automatic and manual rules and heuristics to map the discovered networking devices to logical sites (ie, Store 1234, US) was called Atlas, named after this guy: https://en.wikipedia.org/wiki/Atlas_(mythology)
All of that background aside, the interesting story.
In the late 90s and early 2000s, Drake and Atlas were doing their thing, generally quite well and with only a fairly small amount of care and feeding required. I was snooping around and noticed that a particular site of type International Home Office had grown enormously over the course of a few years. When I looked, it had hundreds of network devices and tens of thousands of nodes. This was around 2001 or 2002, and at that time, I knew that only US Home Office sites should have that many devices, and thought it likely that Atlas had a 'leak'. That is, as Atlas did its recursive site mapping work, sometimes the recursion would expand much further than it should, and incorrectly map things.
After looking at the data, it all seemed fine. So I made some inquiries, and lo and behold, that particular international home office site had indeed been growing explosively.
In the early 2000s I was working as a field engineer installing/replacing/fixing network equipment for Walmart at all hours. It's pretty neat to hear the other side of the process! If I remember correctly there was some policy that would automatically turn off switch ports that found new, unrecognized devices active on the network for an extended period of time, which meant store managers complaining to me about voip phones that didn't function when moved or replaced.
Ah neat, so you were an NCR tech! (I peeked at your comment history a bit.) My team and broader department spent a lot of hours working with, sometimes not in the most friendly terms, people at different levels in the NCR organization.
You're correct, if Drake (the always running discovery engine) didn't detect a device on a given port over a long enough time, then another program would shut that port down. This was nominally done for PCI compliance, but of course having open, un-used ports especially in the field is just a terrible security hole in general.
In order to support legit equipment moves, we created a number of tools that the NOC and I believe Field Support could use to re-open ports as needed. I think we eventually made something that authorized in-store people could use too.
As an aside, a port being operationally 'up' wasn't by itself sufficient for us to mark the port as being legitimately used. We had to see traffic coming from it as well.
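Not the actual Drake/Perl code, obviously, but a minimal Python sketch of that "up isn't enough, we also need traffic" check, shelling out to net-snmp's command-line tools; the switch hostname, community string and interface index are all placeholders:

```python
# Sketch only: a port counts as "in use" when it is operationally up AND its
# traffic counters move between two polls. All device details are made up.
import subprocess
import time

HOST = "switch.example.net"      # hypothetical switch
COMMUNITY = "public"             # read-only SNMP community (placeholder)
IF_INDEX = 12                    # interface index of the port being checked

IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"    # IF-MIB::ifOperStatus
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10"     # IF-MIB::ifInOctets

def snmp_get(oid: str) -> int:
    """Fetch a single numeric value with net-snmp's snmpget."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqve", HOST, f"{oid}.{IF_INDEX}"],
        text=True,
    )
    return int(out.strip())

def port_in_use(poll_interval: int = 300) -> bool:
    if snmp_get(IF_OPER_STATUS) != 1:       # 1 == up
        return False
    first = snmp_get(IF_IN_OCTETS)
    time.sleep(poll_interval)
    second = snmp_get(IF_IN_OCTETS)
    return second != first                  # "up" alone isn't enough; we need traffic

if __name__ == "__main__":
    print("port in use" if port_in_use() else "candidate for shutdown")
```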
You mentioned elsewhere that you're working with a big, legacy Perl application, porting it to Python. 99% of the software my team at WalMart built was in Perl. (: I'd be curious to know, if you can share, what company/product you were working on.
The NCR/Walmart relationship was fairly strained during my tenure. Given the sheer number of stores/sites that Walmart had and NCR's own problems, it was not always possible to provide the quality of service people might expect, especially with a smile. From a FE perspective, working on networking gear at Walmart meant that you were out at 11pm at night (typically after working 10-11 hours already and spending an hour or two onsite waiting for the part to arrive via courier) and your primary concern was to get the job done and get back home. The worst was to plug a switch in, watch it not power up, and realize you'd need to be back in the same spot three or four hours later to try again.
Walmart must have been an interesting place to work during the late 90s, early 2000s - I imagine that most everywhere they had to solve problems at scale before scale was considered a thing. I'd be very interested to see how the solutions created in that period match to best-practices today, especially since outside of the telecom or perhaps defense worlds there probably wasn't much prior art.
As for the Perl application, I probably shouldn't say since I'm still employed at the same company and I know coworkers who read HN. If you're interested, DM me and I can at least provide the company name and some basic details.
> The NCR/Walmart relationship was fairly strained ...
Definitely. (: I didn't hold it against the hands on workers like yourself. Even (and perhaps especially) back then, WalMart was a challenging, difficult and aggressive partner.
> working on networking gear at Walmart meant that you were out at 11pm at night
That sounds about right; the scheduling I was directly aware of was very fast paced. Our Network Engineering Store Team pushed and pushed and pushed, just as they were pushed and pushed and pushed.
> Walmart must have been an interesting place to work during the late 90s, early 2000
Yup. Nowhere I've worked before or since had me learning as much or getting nearly as much done. It was an amazingly positive experience for me and my team, but not so positive for a lot of others.
> I imagine that most everywhere they had to solve problems at scale before scale was considered a thing.
Sometimes I imagine writing a book about this, because it's absolutely true, all over Information Systems Division.
For a time in the early 2000s, we were, on average, opening up a new store every day, and a typical new store would have two routers, two VOIP routers, two cache engines, between 10 and 20 switches, two or four wireless access point controllers and dozens of AP endpoints. That was managed by one or two people on the Network Engineering side, so my team (Network Management) wrote automation that generated the configs, validated connections, uploaded configs, etc etc etc. (Not one or two people per store: one or two people for ALL of the new stores.)
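The real system was Perl and far more involved, but the core idea of generating per-store configs from templates can be sketched in a few lines of Python; every name, address and value below is made up purely for illustration:

```python
# Toy illustration of template-driven config generation (NOT the original
# WalMart tooling); every value here is hypothetical.
from string import Template

ROUTER_TEMPLATE = Template("""\
hostname ${hostname}
interface Vlan1
 ip address ${mgmt_ip} 255.255.255.0
snmp-server community ${community} RO
""")

def store_mgmt_ip(store_number: int) -> str:
    # Hypothetical addressing scheme: derive a management IP from the store number.
    return f"10.{store_number // 256}.{store_number % 256}.1"

def render_router_config(store_number: int, country: str = "US") -> str:
    return ROUTER_TEMPLATE.substitute(
        hostname=f"str{store_number:04d}-{country.lower()}-rtr1",
        mgmt_ip=store_mgmt_ip(store_number),
        community="example-ro",
    )

if __name__ == "__main__":
    print(render_router_config(1234))   # e.g. a test store
```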
The networking equipment was managed by a level of automation that is pretty close to what one sees inside of Google or Facebook today, and we were doing it 20 years ago.
> ... telecom ... prior art ...
John Chambers, the long-time CEO of Cisco, was at the time on WalMart's board of directors. He was always a bit of a tech head, and so when he came to Bentonville for board meetings, he'd often come and visit us in Network Engineering.
Around 2001-2002, we were chatting with him and he asked why we weren't using Cisco Works (https://en.wikipedia.org/wiki/Cisco_Prime) to manage our network. Back then it was mostly focused on network monitoring and, to a lesser extent, config management. We chuckled and told him that there was no way Cisco Works could scale to even a fraction of our network. He asked what we used, and of course we showed him the management system we'd written.
He was so impressed that he went back to San Jose, selected a group of Cisco Works architects, had them sign NDAs, and sent them to Bentonville, Arkansas for a month. The intent was to have them evaluate our software with an eye toward packaging it up and re-selling it.
Those meetings were interesting, but ultimately fruitless. The Cisco Works architects were Ivory Tower Java People. The first thing they wanted to see was our class hierarchy. We laughed and said we had scores of separate and very shallow classes, all written in Perl, C and C++.
Needless to say, they found the very 'rough and ready' way our platform was designed to be shocking and unpalatable. They went back and told Chambers that there was literally no way our products could be tied together.
> ... match to best-practices today ...
Professionally, I've been doing basically the same kinds of things since then, and I'll say that while our particular methods and approaches were extremely unusual, the high level results would meet or perhaps exceed what one gets with 'best practices' seen today.
Not because we were any smarter or better, but because we had no choice but to automate and automate effectively. At that scale, at that rate of change, at those uptime requirements, 'only' automating 99% would be disastrous.
FWIW, my brain was going "book! book! book! book! book!" back at the top-level comment, and the beeper may have got slightly overloaded and broke as I continued reading. :)
Yes please.
As a sidenote, the story about "the CEO vs the architects" was very fascinating: the CEO could see the end-to-end real-world value of what you'd built, but the architects couldn't make everything align. In a sense the CEO was more flexible than the architects, even though stereotypes might suggest the opposite would be more likely.
Also, the sentiment about your unusual methodology exceeding current best practice makes me wonder whether you achieved so-called "environmental enlightenment" - where everything clicks and just works and makes everyone who touches the system a 5x developer - or whether the environment simply had to just work really really well. Chances are the former is what everyone wishes they'll find one day, while the latter (incredibly complex upstream demands that are not going to go away anytime soon and which require you to simply _deliver_) definitely seems like the likelier explanation for why the system worked, regardless of the language it was written in - it was the product of a set of requirements that would not accept anything else.
Hmm. Now I think about that a bit and try and apply it to "but why is current best practice worse", I was musing the other day about how a lot of non-technical environments don't apply tech in smart ways to increase their efficiency, because their fundamental lack of understanding in technology means they go to a solutions provider, get told "this will cost $x,xxx,xxx", don't haggle because they basically _can't_, and of course don't implement the tech. I wonder if the ubiquitification (that seems to be a word) of so-called "best practices" in an area doesn't function in a similar way, where lack of general awareness/understanding/visibility in an area means methodology and "practices" (best or not) aren't bikeshed to death, and you can just innovate. (Hmm, but then I start wondering about how highly technically competent groups get overtaken by others... I think I'll stop now...)
I love hearing stories from "old" Walmart. I was a Walmartian from 2017 to 2019, and I still miss my co-workers. (Shout-out to the Mobile Client team.)
Some interesting facts to know for those who don't dig into it. Walmart:
- has 80+ internal apps, mostly variants but still unique
- runs k8s inside of Distribution Centers
- maintains a fleet of >180k mobile devices in the US alone
- has a half-dozen data centers in the US
- has most International infrastructure separate from US Stores'
I've got some stories of my own, maybe I'll post them in a bit.
Wow, that's a hell of a change! When I left in 2009, there were exactly two datacenters: NDC and EDC. Not surprising really.
From where I was sitting, the best era was definitely 1997-2004 or so. ISD really went downhill, pretty quickly, in my last five years there, for many different reasons.
Really though, I feel truly awful for anyone affected by this. The post recommends implementing a disaster recovery plan. The truth is that most people don't have one. So, let's use this post to talk about Disaster Recovery Plans!
Mine: I have 5 servers at OVH (not at SBG) and they all back up to Amazon S3 or Backblaze B2, and I also have a dedicated server (also OVH/Kimsufi) that gets the backups. I can redeploy in less than a day on fresh hardware, and that's good enough for my purposes. What's YOUR Disaster Recovery Plan?
I'm at OVH as well (in the BHS datacenter, fortunately). I run my entire production system on one beefy machine. The apps and database are replicated to a backup machine hosted with Hetzner (in their Germany datacenter). I also run a tiny VM at OVH which proxies all traffic to Hetzner. I use a failover IP to point at the big rig at OVH. If the main machine fails, I move the failover IP to the VM, which sends all traffic to Hetzner.
If OVH is totally down, and the fail over IP doesn't work, I have a fairly low TTL on the DNS.
I back up the database state to S3 every day.
Since I'm truly paranoid, I have an Intel NUC at my house that also replicates the DB. I like knowing that I have a complete backup of my entire business within arm's reach.
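In pseudo-Python, the failover flow described above boils down to something like this; the health-check URLs are placeholders and the two helper functions are hypothetical stand-ins for the providers' failover-IP and DNS APIs (or their control panels):

```python
# Rough sketch of the failover decision only. The helpers are hypothetical
# placeholders -- the real steps go through the providers' APIs or panels.
import time
import urllib.request

PRIMARY_URL = "https://primary.example.com/healthz"    # big OVH machine (placeholder)
PROXY_VM_URL = "https://proxy.example.com/healthz"     # tiny OVH VM that proxies to Hetzner

def healthy(url: str, timeout: int = 5) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def move_failover_ip_to_proxy() -> None:
    raise NotImplementedError("call the hosting provider's failover-IP API here")

def point_dns_at_hetzner() -> None:
    raise NotImplementedError("update the low-TTL DNS record to the Hetzner replica here")

def check_once() -> None:
    if healthy(PRIMARY_URL):
        return                               # primary is fine, nothing to do
    if healthy(PROXY_VM_URL):
        move_failover_ip_to_proxy()          # OVH still reachable: shift the failover IP
    else:
        point_dns_at_hetzner()               # OVH fully down: fall back to DNS

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(60)
```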
I also run our entire production system on one beefy machine at OVH, and replicate to a similar machine at Hetzner. In case of a failure, we just change DNS, which has a 1 hour TTL. We've needed to do an unplanned fail-over only once in over 10 years.
And like you, I have an extra replica at the office, because it feels safe having a physical copy of the data literally at hand.
Same but with a regular offline physical copy (cheap NAS). One of my worries is malicious destruction of the backups if anything worms its way into my network.
Which is why "off" is still a great security tool. A copy on a non-powered device, even if that device is attached to the network, is immune to worms. There is something to be said for a NAS solution that requires a physical act to turn on and perform an update.
Hetzner has storage boxes and auto snapshots, so even if someone deletes the backups remotely there are still snapshots which they can't get unless they have control panel access.
My threat model is someone who would have full access to my computer without me knowing. So they could over time get access to passwords, modify my OS to MITM YubiKeys... Overly cautious, most likely, but it doesn't cost me much more.
Mostly a browser zero-day or some kind of malicious Linux package that gets distributed. I don't think I have a profile that would make people bother with physical attacks.
Not done any research into it, but I always thought OVH was supposed to be a very budget VPS service primarily for personal use rather than business. Although thought it was akin to having a Raspberry Pi plugged in at home.
Again, I may be completely wrong but why would you not use AWS/GCP? Even if it's complexity, Amazon have Lightsail, or if it's cost I thought DigitalOcean was one of the only reputable business-grade VPS providers.
I just can't imagine many situations where a VPS would be superior to embracing the cloud and using cloud functions, containers, instances with autoscaling/load balancers etc.
You can't imagine it, yet a big chunk of the independent internet runs on small VPS servers. There isn't much difference between DO and OVH, Hetzner, Vultr, Linode... not sure why DO would be better. I mean, it's a US company doing marketing right. That's the difference. Plus OVH/Hetzner have only EU locations.
I think small businesses like smaller, simpler providers instead of big clouds. It's a different philosophy: if you are afraid of extreme centralisation of the internet, it makes sense.
I can think of a lot of big differences. For one you can get much larger machines at OVH and Hetzner with fancy storage configurations for your database if desired (e.g. Optane for your indices, magnetic drives for your transaction log, and raided SSDs for the tables)
They also don't charge for bandwidth, although some of those other providers have a generous free bandwidth and cheap overage.
I didn't realize they had US datacenters before now. It's possible that's no longer an option. It was on the largest servers in the Montreal datacenter when I specced that out.
Much cheaper and better performance at the high end. Doesn't compete at all at the low-end, except through their budget brand Kimsufi. I don't see them really as targeting the same market.
I rent a server from OVH for $32 a month. It's their So You Start line... doesn't come with fancy enterprise support and the like.
It's a 4 core 8 thread Xeon with 3x 1TB SATA with 32GB of ECC RAM IIRC (E3-SAT-1-32, got it during a sale with a price that is guaranteed as long as I keep renewing it)
The thing is great, I can run a bunch of VM's on it, it runs my websites and email.
Overall to get something comparable elsewhere I would be paying 3 to 4 times as much.
I would consider $50 a month or less low end pricing. ¯\_(ツ)_/¯
Yeah, I forgot they also have the so you start brand. It's probably more expensive than the majority of what digital ocean sells, but there is some overlap for sure.
I don't know about OVH but Hetzner beats DO at the lower end: for $5/month you get 2 CPUs vs 1, 2 GB RAM vs 1, 40 GB disk vs 25 and 20 TB traffic vs 1. They have an even lower-end package for 2.96 Euro/month as well.
OVH has at least one large North American datacenter in Beauharnois, located just south of Montreal. I've used them before for cheap dedicated servers. They may have others.
If all you need is compute, storage, and a pipe, all the big cloud providers are a total ripoff and you should look elsewhere. The big ones only make sense if you are leveraging their managed features or if you need extreme elasticity with little chance of a problem scaling up in real time.
OVH is one of the better deals for bare metal, but there are even better ones for bandwidth. You have to shop around a lot.
Also be sure you have a recovery plan... even with the big providers. These days risks include not only physical stuff but some stupid bot shutting you off because it thinks you violated TOS or is reacting to a possibly malicious complaint.
We had a bot at AWS gank some test systems once because it thought we were cryptocurrency mining with free credits. We weren’t, but we were doing very CPU intensive testing. I’ve heard of this and worse happening elsewhere. DDOS detector and IDS bots are particularly notorious.
Twice the revenue of DigitalOcean still puts it < $1B ARR, or am I missing something? I can’t see how that’s the third largest in the world, or does your definition of “hosting provider” exclude clouds?
OVH is one of the largest providers in the world. They run a sub brand for personal use (bare metal for $5/m, hardware replacements in 30 min or less usually).
..and they do support all of those things you just listed, not just API-backed bare metal.
Their sub-brand soyoustart has older servers (that are still perfectly fine), roughly E3 Xeon/16-32GB/3x2TB to 4x2TB for $40/m ex vat.
Their other sub brand kimsufi for personal servers has Atom low-power bare metal with 2TB HDD (in reality it is advertised 500GB/1TB, but they don't really have any of those in stock left, if your drive fails they replace it with a 2T - so far this has been my exp) for $5.
All of this is powered by automation, you don't really get any support and you are expected to be competent. If your server is hacked you get PXE-rebooted into a rescue system and can scp/rsync off your contents before your server is reinstalled. OS installs, reboots, provisioning are all automated, there's essentially no human contact.
PS: Scaleway, in Paris, used to offer $2 bare metal (ultra low voltage, weaker than an Atom, 2GB ram), but pulled all their cheap machines, raised prices on existing users, and rebranded as enterprisey. The offer was called 'kidechire'
--
It is kind of interesting that on the US side everyone is in disbelief, or like "why not use AWS" - while most of the European market knows of OVH, Hetzner, etc.
My own reason for using OVH? It's affordable and I would not have gotten many projects (and the gaming community I help out with) off the ground otherwise. I can rent bare metal with NVMe, and several terabytes of RAM for less than my daily wage for the whole month, and not worry about per-GB billing or attacks. In the gaming world you generally do not ever want to use usage based billing - made the mistake of using Cloudfront and S3 once and banned script kiddies would wget-loop the largest possible file from the most expensive region botnet repeatedly in a money-DoS.
I legitimately wouldn't have been able to do my "for-fun-and-learning" side projects (no funding, no accelerator credits, ...) without someone like them. The equivalent of a digitalocean $1000/m VM is about $100 on OVH.
Edit: Seems like they stopped publishing videos for that datacenter, but this appears to be a 2013 video of the datacenter that burned down:
https://www.youtube.com/watch?v=Y47RM9zylFY
OVH STARTED as a budget VPS service some 20 years ago... but they have grown a lot over the last 6-7 years, adding more "cloud" services and capabilities, even if not on par with the main players...
Why not use AWS/GCP? From my personal point of view: as a French citizen, I'm more and more convinced that I can't completely trust the (US) big boys for my own safety. Trump showed that "US interest" is far more important than "customer interest" or even "ally interest". And moreover, Google is showing quite regularly that it's not a reliable business partner (AWS looks better for this).
Yeah, I was thinking about all the horror stories that can be found on this site.
As a customer (or maybe an "involuntary data provider"), I do as much as I can to avoid Google being my SPOF, not technically (it's really technically reliable) but on the business side. I had to set up my own mail server just to avoid any risk of a Google ban, for example... just in case.
I won't use Google Authenticator for the same reason. I'm happy to have left Google Photos some years ago, to avoid problems of Google shutting it down. And the list could go on...
As a business, I like to program Android apps but the Google Store is really a risk too. Risk to have any Google account blacklisted because some algorithm thought I did something wrong. And no appeal.
Maybe all this doesn't apply to GCP customers. Maybe GCP customers have a human direct line, with someone to really help and the capacity to do it. Or maybe it's just Google: as long as it works, enjoy. If it doesn't, go to (algorithmic) hell.
Nope. I was at a company with a $1M dedicated spend contract w/ GCP and what that got us was support through a VAR. It then became the VAR's job to file support tickets that took two weeks to get the response "oh well that's not how we do it at Google. Have you read these docs you already said you read, and can you send logs you already sent?" instead of my job to do that.
Enterprise-level projects often have only light protection against wrongful hosting account termination, reasoning that spending a lot of money and having an account manager keeps them safe from clumsy automated systems.
So they might have their primary and replica databases at different DCs from the same hosting provider, and only their nightly backup to a different provider. Four copies to four different providers is a step above three copies with two providers!
A large enterprise would probably be using a filesystem with periodic snapshots, or streaming their redo log to a backup, to protect against a fat-fingered DBA deleting the wrong thing. Of course, filesystem snapshots provide no protection against loss of DC or wrongful hosting account termination, so you might not count them as true backup copies.
This is why you should have a “Cloud 3-2-1” backup plan. Have 3 copies of your data, two with your primary provider, and 1 with another.
e.g., if you are an AWS customer, have your back ups in S3 and use simple replication to sync that to either GCS or Azure, where you can get the same level of compliance attestation as from AWS.
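A minimal sketch of that kind of cross-provider sync in Python, assuming boto3 and the google-cloud-storage client, with placeholder bucket names; a managed replication or transfer service would do the same job with less code:

```python
# Minimal sketch of copying backup objects from S3 into a GCS bucket, so a
# second copy lives with a different provider. Bucket names are placeholders,
# and objects are read into memory, which is only fine for modest backup sizes.
import boto3
from google.cloud import storage

S3_BUCKET = "example-backups"        # primary copies (AWS)
GCS_BUCKET = "example-backups-dr"    # off-provider copies (GCP)

def sync_s3_to_gcs() -> None:
    s3 = boto3.client("s3")
    gcs_bucket = storage.Client().bucket(GCS_BUCKET)

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=S3_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            blob = gcs_bucket.blob(key)
            if blob.exists():                       # naive "already copied" check
                continue
            body = s3.get_object(Bucket=S3_BUCKET, Key=key)["Body"].read()
            blob.upload_from_string(body)           # write the object to the GCS side
            print(f"copied {key}")

if __name__ == "__main__":
    sync_s3_to_gcs()
```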
It's not paranoia if you're right. All of the risks GP is protecting against are things that happen to someone every day, and they should be seen like wearing the seat belt in a car.
I have a reliability and risk avoidance mindset, but I’ve had to stand back because my mental gas tank for trying to keep things going is near empty.
I’ve really struggled working with others that either are both ignorant and apathetic about the business’s ability to deal with risk or believe that it’s their job to keep putting duct tape over the duct tape that breaks multiple times a day while users struggle.
I like seeing these comments reminding others to a wear seat belt or have backups for their backups, but I don’t know whether I should care more about reliability. I work in an environment that’s a constant figurative fire.
I also like to spend time with my family. I know it’s just a job, and it would be even if I were the only one responsible for it; that doesn’t negate the importance of reliability, but there is a balance.
If you are dedicated to reliability, don’t let this deter you. Some have a full gas tank, which is great.
> ... [F]inance is fundamentally about moving money and risk through a network. [1]
Your employer has taken on many, many risks as part of their enterprise. If every risk is addressed the company likely can't operate profitably. In this context, your business needs to identify every risk, weigh the likelihood and the potential impact, decide whether to address or accept the risk, and finally, if they decide to address the risk, whether to address it in-house or outsource it.
You’ve identified a risk that is currently being “accepted” by your employer, one that you’d like to address in-house. Perhaps they’ve taken on the risk unintentionally, out of ignorance.
As a professional the best I can do is to make sure that the business isn’t ignorant about the risk they’ve taken on. If the risk is too great I might even leave. Beyond that I accept that life is full of risks.
This resonates with me. I notice my gas tank rarely depletes because of technology. It doesn't matter how brain-dead the '00s Oracle Forms app with the absurd unsupported EDI submission Excel thinga-ma-bob that requires a modem is... <fill in the rest of the dumpster fire as your imagination deems>. Making a tech stack safe is a fun challenge.
Apathetic people though, that can be really tough going. It's just that way "because". Or my favourite, "oh we don't have permission to change that", how about we make the case and get permission? _horrified looks_ sometimes followed by pitchforks.
Reliability is there to keep your things running smoothly during normal operations. Backups are there for when you reach the end of your reliability rope. Neither is really a good replacement for the other. The most reliable systems will still fail eventually, and the best of backups can't run your day to day operations.
At the end of the day you have a budget (of any kind) and a list of priorities on which to spend it. It's up to you or your management to set a reasonable budget, and to set the right priorities. If they refuse, leave or you'll just burn the candle at both ends and just fade out.
When a backup is used to bring something back, the amount of downtime may be decreased. When it is, that's reliability: we keep things usable and functioning, more often than not.
Are your domains at OVH too? If yes, I'd consider changing that: this morning the manager was quite flooded and the DNS service was down for some time...
For small firms, CEO / CTO maintaining off-sites at a residence is reasonable and not an uncommon practice. As with all security / risk mitigation practices, there is a balance of risks and costs involved.
And as noted, encrypted backups would be resistant to casual interdiction, or even strongly-motivated attempts. Data loss being the principal risk mitigated by off-site, on-hand backups.
> There is nothing magical about data centers making them safe while your local copy isn't.
Is this a serious comment? My house is not certified as being compliant with any security standards. Here's the list that the 3rd party datacenter we use is certified as compliant with:
The data centers we operate ourselves are audited against several of those standards too. I guess you're right that there's nothing magic about security controls, but it has nothing to do with trust. Sensitive data should generally never leave a secure facility, outside of particularly controlled circumstances.
You are entirely missing the point by quoting the compliance programs followed by AWS, whose sole business is being a third-party hoster.
For most businesses, what you call sensitive data is customer and order listings, payment history, inventory if you are dealing in physical goods, and HR-related files. These are not state secrets. Encryption and a modicum of physical security go a long way.
I personally find the idea that you shouldn't store a local backup of this kind of data out of security concern entirely laughable. But that's me.
This is quite a significant revision to your previous statement that there’s nothing about a data center that makes it more secure than your house.
This attitude that your data isn't very important, so it's fine to not be very concerned about its security, while not entirely uncommon, is something most organisations try to avoid when choosing vendors. It's something consumers are generally unconcerned about, until a breach occurs, and The Intercept write an article about it. At which point I'm sure all the people ITT who are saying it's fine to take your production database home would be piling on with how stupid the company was for doing ridiculous things like taking a copy of their production database home.
> This is quite a significant revision to your previous statement that there’s nothing about a data center that makes it more secure than your house.
I said there was nothing magical about data centers security, a point I stand with.
It's all about proper storage (encryption) and physical security. Obviously, the physical security of an AWS data center will be tighter than at your typical SME, but in a way which is of no significance to storing backups.
> This attitude that your data isn’t very important
You are once again missing the point.
It's not that your data isn't important. It's that storing it encrypted in a sensible place (and to be clear by that I just mean not lying around - a drawer in an office or your server room seems perfectly adequate to me) is secure enough.
The benefits of having easily available backups by far trump the utterly far fetched idea that someone might break into your office to steal your encrypted backups.
> It's that storing it encrypted in a sensible place (and to be clear by that I just mean not lying around - a drawer in an office or your server room seems perfectly adequate to me) is secure enough.
In the SME space some things are "different", and if you've not worked there it can be hard to get one's head around it:
A client of mine was burgled some years ago.
Typical small business, offices on an industrial estate with no residential housing anywhere nearby. Busy in the daytime, quiet as the grave during the night. The attackers came in the wee small hours, broke through the front door (the locks held, the door frame didn't), which must have made quite a bit of noise. The alarm system was faulty and didn't go off (later determined to be a 3rd party alarm installer error...)
All internal doors were unlocked, PCs and laptops were all in plain sight, servers in the "comms room" - that wasn't locked either.
The attacker(s) made a cursory search at every desk, and the only thing that was taken at all was a light commercial vehicle which was parked at the side of the property, its keys had been kept in the top drawer of one of the desks.
The guy who looked after the vehicle - and who'd lost "his" ride - was extremely cross, everyone else (from the MD on downwards) felt like they'd dodged a bullet.
Physical security duly got budget thrown at it - stable doors and horses, the way the world usually turns.
Once you're big enough to afford a CISO, you're likely big enough to afford office space with decent physical security to serve as a third replicated database site to complement your two datacenters.
These solutions are not one-size-fits-all. What works for a small startup isn't appropriate for a 100+ person company.
Not in my experience. Worked at some small shops that were lightyears ahead in terms of policy, procedures and attitude compared to places I've worked with 50k+ employees globally.
Large organisations tend not to achieve security compliance with overly sophisticated systems of policy and controls. They tend to do it using bureaucracy, which while usually rather effective at implementing the level of control required, will typically leave a lot to be desired in regards to UX and productivity. Small organisations tend to ignore the topic entirely until they encounter a prospective client or regulatory barrier that demands it. At which point they may initially implement some highly elegant systems. Until they grow large enough that they all devolve into bureaucratic mazes.
I'm aware, but that's not been my experience. I've been in large places where there's been a laissez-faire attitude because it was "another team's job", and general bikeshedding over smaller features because the bigger-picture security wasn't their area, or because a diktat from above forced the use of X because they're on the board, whilst X is completely unfit for purpose. There's no pushback.
However I've worked at small ISPs where we took security extremely seriously. Appropriate background checks and industry policy, but more so the attitude... we wanted to offer customers security because we had pride in our work.
If you are a corporate entity of some kind, the final layer of your plan should always be "Go bankrupt". You can't successfully recover from every possible disaster and you shouldn't try to. In the event of a sufficiently unlikely event, your business fails and every penny spent attempting the impossible will be wasted, move on and let professional administrators salvage what they can for your creditors.
Lots of people plan for specific elements they can imagine and forget other equally or even more important things they are going to need in a disaster. Check out how many organisations that doubtless have 24/7 IT support in case a web server goes down somehow had no plan for what happens if it's unsafe for their 500 call centre employees to sit in tiny cubicles answering phones all day even though pandemic respiratory viruses are so famously likely that Gates listed them consistently as the #1 threat.
"Go bankrupt" is not a plan. Becoming insolvent might be the end result of a situation but it's not going to help you deal with it.
Let's take an example which might lead to bankruptcy. A typical answer to a major disaster (let's say your main and sole building burning, as a typical case) for an SME would be to cease activity, furlough employees and stop or defer every payment you can while you claim insurance and assess your options. Well, none of these things are obvious to do, especially if all your archives and documents just burnt. If you think about it (which you should), you will quickly realise that you at least need a way to contact all your employees, your bank and your counsel (which would most likely be the accountant certifying your results rather than a lawyer if you are an SME in my country) offsite. That's the heart of disaster planning: having solutions at the ready for what was easy to foresee so you can better focus on what wasn't.
Yes it is. (Though it's better, as GP suggested, as a final layer of a plan and not the only layer.)
> Becoming insolvent might be the end result of a situation but it's not going to help you deal with it.
Insolvency isn't bankruptcy. Becoming insolvent is a consequence, sure. Bankruptcy absolutely does help you deal with that impact, that's rather the point of it.
Bankruptcy when dealt with correctly is a process not an end.
If everything else fails, it's better to file for bankruptcy when there is still something to recover with the help of others than to burn everything to ashes because of your vanity.
At least that's how I understood parent's comment.
As a quick interlude, since this may be confusing to non-US readers: bankruptcy in the United States in the context of business usually refers to two concepts, whereas in many other countries it refers to just one.
There are two types of bankruptcies in the US used most often by insolvent businesses: Chapter 7, and Chapter 11.
A Chapter 7 bankruptcy is what most people in other countries think of when they hear "bankruptcy" - it's the total dissolution of a business and liquidation of its assets to satisfy its creditors. A business does not survive a Chapter 7. This is often referred to as a "bankruptcy" or "liquidation" in other countries.
A Chapter 11 bankruptcy, on the other hand, is a process by which a business is given court protection from its creditors and allowed to restructure. If the creditors are satisfied with the reorganisation plan (which may include agreeing to change the terms of outstanding debts), the business emerges from Chapter 11 protection and is allowed to continue operating. Otherwise, if an agreement can't be reached, the business may end up in Chapter 7 and get liquidated. Most countries have an equivalent to a Chapter 11, but the name for it varies widely. For example, Canada calls it a "Division 1 Proposal," Australia and the UK call it "administration," and Ireland calls it "examinership."
Since there's a lot of international visitors to HN I just thought I'd jump in and provide a bit of clarity so we can all ensure we're using the same definition of "bankruptcy." A US Chapter 7 bankruptcy is not a plan, it's the game over state. A US Chapter 11 bankruptcy, on the other hand, can definitely be a strategic maneuver when you're in serious trouble, so it can be part of the plan (hopefully far down the list).
> Bankruptcy when dealt with correctly is a process not an end.
Yes, that's why "Go bankrupt" is not a plan which was the entire point of my reply. That's like saying that your disaster recovery plan is "solve the disaster".
Going bankrupt is a plan. However, it is a somewhat more involved one than it sounds, at first. That's why there should be a corporate lawyer advising on stuff like company structure, liabilities, continuance of pension plans, ordering and reasons for layoffs, etc.
It's not quite that simple, the data you might have may be needed for compliance or regulatory reasons. Having no backup strategy might make you personally liable depending on the country!
The more insecure your workers, the easier it is to get them to come in, regardless of what the supposed rules may or may not be.
Fast Fashion for example often employs workers in more or less sweatshop conditions close to the customers (this makes commercial sense, if you make the hot new items in Bangladesh you either need to expensively air freight them to customers or they're going to take weeks to arrive after they're first ordered - there's a reason it isn't called "Slow fashion"). These jobs are poorly paid, many workers have dubious right-to-work status, weak local language skills, may even be paid in cash - and so if you tell them they must come in, none of them are going to say "No".
In fact the slackening off in R for the area where my sister lives (today the towering chimneys and cavernous brick factories are just for tourists, your new dress was made in an anonymous single story building on an industrial estate) might be driven more by people not needing to own new frocks every week when they've been no further than their kitchen in a month than because it would actually be illegal to staff their business - if nobody's buying what you make then suddenly it makes sense to take a handout from the government and actually shut rather than pretend making mauve turtleneck sweaters or whatever is "essential".
Just to clarify: trans-atlantic shipments take a week port-to-port, e.g. Newark, NJ, USA to Antwerp, Belgium. (Bangladesh to Italy via Suez-channel looks like a 2-week voyage, or 3 weeks to the US west coast. Especially the latter would probably have quite a few stops on the way along the Asian coast.)
You get better economics than shipping via air-freight from one full pallet and up. Overland truck transport to and from the port is still cheaper than air freight, at least in the US and central Europe.
For these major routes, there are typically at least bi-weekly voyages scheduled, so for this kind of distance, you can expect about 11 days pretty uniformly distributed +-2 days, if you pay to get on the next ship.
This may lead to (committing to) paying for the spot on the ship when your pallet is ready for pickup at the factory (not when it arrives at the port) and using low-delay overland trucking services.
Which operate e.g. in lockstep with the port processing to get your pallet on the move within half a day of the container being unloaded from the ship, ideally having containers pre-sorted at the origin to match truck routes at the destination.
So they can go on a trailer directly from the ship and rotate drivers on the delivery tour, spending only a few minutes at each drop-off.
Because those can't rely on customers to be there and get you unloaded in less than 5 minutes, they need locations they can unload at with on-board equipment. They'd notify the customer with a GPS-based ETA display, so the customer can be ready and immediately move the delivery inside.
Rely on 360-degree "dashcam" coverage and encourage the customer to have the drop-off point under video surveillance, just to easily handle potential disputes. Have the delivery person use some suitable high-res camera with a built-in light to get some full-surface-coverage photographic evidence of the condition it was delivered in.
I'd guess with a hydraulic lift on the trailer's back and some kind of folding manual pallet jack stuck on that (fold-up) lift. So they drive up to the location, unlock the pallet jack, un-fold the lift, lower the lift almost to the ground, detach the pallet jack to drop it the last inch/few cm to the ground, pull the jack out, lower the lift the rest of the way, and drive it on to the lift. They open the container, get up with the pallet jack, drive the pallets (one-by-one) for this drop-off out of the container and leave them on the ground, then close and lock the container. Finally they re-arm the jack's hooks, shove the jack back under the slightly-lowered folding lift, make it hook back in, fold it up, lock the hooking mechanism (against theft at a rest stop (short meal and toilet breaks exist, but showering can be delayed for up to 2 nights)), fold it all the way up, and go on to drive to their next drop-off point.
Not really, the insurance won't make things right in an instant. They will usually compensate you financially, but often only after painstaking evaluation of all circumstances, weighing their chances in court to get out of paying you and maybe a lengthy court battle and a race against your bankruptcy.
So yes, getting insurance can be a good idea to offset some losses you may have, as long as they are somewhat limited compared to your company's overall assets and income. But as soon as the insurance payout matches a significant part of your net worth, the insurance might not save you.
There are always uninsurable events and for large enough companies/risks there are also liquidity limits to the size of coverage you can get from the market even for insurable events.
As such, it makes sense to make the level of risk you plan to accept (by not being insured against it and not mitigating) a conscious economic decision rather than pretending you've covered everything.
As long as you don't have outside shareholders you can decide that. If you do, you'd be surprised at how they will respond to an attitude like that. After all: you can decide the levels of risk that you personally are comfortable with leading to extinguishing of the business, but a typical shareholder is looking at you to protect their investment, and not insuring against a known risk which at some point in time materializes is an excellent way to find yourself in the crosshairs of a minority shareholder lawsuit against a (former) company executive.
In my work life I am a professional investor, so I've been through the debate on insure/prepare or not many times. It's always an economic debate when you get into "very expensive" territory (cheap and easy is different obviously).
The big example of this which springs to mind is business interruption cover - it's ruinously expensive so it's extremely unusual to have the max cover the market might be prepared to offer. It's a pure economic decision.
Yes, but it is an informed decision and typically taken at the board level; very few CEOs who are not 100% owners would be comfortable with the decision to leave an existential risk uncovered without full approval of all those involved, which is kind of logical.
Usually you'd have to show your homework (offers from insurance companies proving that it really is unaffordable). I totally get the trade-off, and the fact that if the business could not exist if it was properly insured that plenty of companies will simply take their chances.
We also both know that in case something like that does go wrong everybody will be looking for a scapegoat, so for the CEO's own protection it is quite important to play such things by the book, on the off chance the risk one day does materialize.
Absolutely - but that's kind of my point. You should make the decision consciously. The corporate governance that goes around that is the company making that decision consciously.
And this is the heart of the problem: a lot of times these decisions are made by people who shouldn't be making them, or they aren't made at all, they are just made by default without bringing the fact that a decision is required to the level of scrutiny normally associated with such decisions.
This has killed quite a few otherwise very viable companies, it is fine to take risks as long as you do so consciously and with full approval of all stakeholders (or at least: a majority of all stakeholders). Interesting effects can result: a smaller investor may demand indemnification, then one by one the others also want that indemnification and ultimately the decision is made that the risk is unacceptable anyway (I've seen this play out), other variations are that one shareholder ends up being bought out because they have a different risk appetite than the others.
It's true: most companies do not have a disaster recovery plan, and many of them confuse a breach protocol with a disaster recovery plan ('we have backups').
Fires in DCs aren't rare at all, I know of at least three, one of those in a building where I had servers. This one seems to be worse than the other two. Datacenters tend to concentrate a lot of flammable stuff, push a ton of current through it, and do so 24x7. The risk of a fire is definitely not imaginary, which is why most DCs have fire suppression mechanisms. Whether those work as advertised depends on the nature of the fire. An exploding on-prem transformer took out a good chunk of EV1's datacenter in the early 2000s, and it wasn't so much the fire that caused problems for their customers, but the fact that someone got injured (or even died, I don't recall exactly), and it took a long time before the investigation was completed and the DC was released to the owners again.
Being paranoid and having off-site backups is what allowed us to be back online before the fire was out. If not for that I don't know if our company would have survived.
No, SBG2 was a building in the "tower design", as is SBG3 behind it. The containers in the foreground are SBG1, from the time when OVH didn't know if Straßburg was going to be a permanent thing.
Funnily enough, I think it was the fire risk that caused them to ditch the idea and move to their current design. Though I know modular design is highly likely to be used by all players as edge nodes spring up worldwide.
It was also that the container had literally no advantages. It was just a meme that did not survive rational analysis. The building in which the datacenter is located is the simplest, cheapest part of the design. Dividing it up into a bunch of inconveniently-sized rectangles solves nothing.
Got burned once (no pun intended), learned my lesson.
Hot spare on a different continent with replicated data along with a third box just for backups. The backup box gets offsite backups held in a safe with another redundant copy in another site in another safe.
Probably this is the most important part of your plan. It's not the backup that matters; it's the restore. And if you don't practice it from time to time, it's probably not going to work when you need it.
A few years ago I worked on the British Telecom Worldwide intranet team and we had a matrix mapping various countries encryption laws.
This was so we remained legal in all of the countries BT worked in, which required a lot of behind-the-scenes work to make sure we didn't serve "illegally encrypted" data.
Yeah, there are lots of countries with regulations that certain data can't leave the geographical boundary of the country. Often, it is the most sensitive data.
These laws generally don't work how people think they do.
For example, the Russian data residency law states that a copy of the data must be stored domestically, not that it can't be replicated outside the country.
The UAE has poorly written laws that have different regulations for different types of data - including fun stuff like only being subject to specific requirements if the data enters a 270 acre business park in Dubai.
Don't even get me started on storing encrypted data in one country and the keys in another...
Also stupid things not to forget: make sure your dns provider is independent otherwise you won’t be able to point to your new server (or have a secondary DNS provider). Make sure any email required for 2FA or communicating with your hosting service managing your infrastructure isn’t running on that same infrastructure.
We test rolling over the entire stack to another AWS DR region (just one we don't normally use) from S3 backups, etc. We do this annually and try to introduce some variations to the scenarios. It takes us about 18 hours realistically.
Documentation / SOPs that have been tested thoroughly by various team members are really important. It helps work out any kinks in interpretation, syntax errors etc.
It does feel a little ridiculous at the time for all the effort involved, but incidents like this show why it's so important.
As an immediate plan, the 2-3 business critical systems are replicating their primary storages to systems in a different datacenter. This allows us to kick off the configuration management in a disaster, and we need something in between 1-4 hours to setup the necessary application servers and middlewares to get critical production running again.
Regarding backups, backups are archived daily to 2 different borg repo hosts on different cloud providers. We could lose an entire hoster to shenanigans and the damage would be limited to ~2 days of data loss at worst. Later this year, we're also considering exporting some of these archives to our sister team, so they can place a monthly or weekly backup on tape in a safe in order to have a proper offline backup.
Regarding restores - there are daily automated restore tests for our prod databases, which are then used for a bunch of other tests after anonymization. On top, we've built most database handling on top of the backup/restore infra in order to force us to test these restores during normal business processes.
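A minimal sketch of such an automated restore test in Python, assuming PostgreSQL custom-format dumps; the dump path, scratch database name and sanity-check table are placeholders:

```python
# Minimal sketch of an automated restore test, assuming PostgreSQL
# custom-format dumps. Paths, database names and the sanity query are placeholders.
import subprocess

SCRATCH_DB = "restore_test"
DUMP_FILE = "/backups/prod-latest.dump"   # hypothetical path to the newest dump

def run(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

def restore_test() -> None:
    # Recreate a scratch database and restore the latest dump into it.
    run(["dropdb", "--if-exists", SCRATCH_DB])
    run(["createdb", SCRATCH_DB])
    run(["pg_restore", "--no-owner", "--dbname", SCRATCH_DB, DUMP_FILE])

    # Cheap sanity check: the restored data should not be empty.
    # ("orders" is a stand-in for whatever table matters to you.)
    out = subprocess.check_output(
        ["psql", "-At", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM orders;"],
        text=True,
    )
    assert int(out.strip()) > 0, "restore produced an empty orders table"

if __name__ == "__main__":
    restore_test()
    print("restore test passed")
```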
As I keep saying, installing a database is not hard. Making backups also isn't hard. Ensuring you can restore backups, and ensuring you are not losing backups almost regardless of what happens... that's hard and expensive.
* All my services are dockerized and have gitlab pipelines to deploy on a kubernetes cluster (RKE/K3s/baremetal-k8s)
* git repos containing the build scripts/pipelines are replicated on my gitlab instance and multiple work computers (laptop & desktop)
* Data and databases are regularly dumped and stored in S3 and my home server
* Most of the infrastructure setup (AWS/DO/Azure, installing kubernetes) is in Terraform git repositories. And a bit of Ansible for some older projects.
Because of the above, if anything happens all I need to restore a service is a fresh blank VM/dedicated machine or a cloud account with a hosted Kubernetes offering. From there it's just configuring terraform/ansible variables with the new hosts and executing the scripts.
One of my backup servers used to be in the same datacenter as the primary server. I only recently moved it to a different host. It's still in the same city, though, so I'm considering other options. I'm not a big fan of just-make-a-tarball-of-everything-and-upload-it-to-the-cloud backup methodology, I prefer something a bit more incremental. But with Backblaze B2 being so cheap, I might as well just upload tarballs to B2. As long as I have the data, the servers can be redeployed in a couple of hours at most.
The SBG fire illustrates the importance of geographical redundancy. Just because the datacenters have different numbers at the end doesn't mean that they won't fail at the same time. Apart from a large fire or power outage, there are lots of things that can take out several datacenters in close vicinity at the same time, such as hurricanes and earthquakes.
> I'm not a big fan of just-make-a-tarball-of-everything-and-upload-it-to-the-cloud backup methodology, I prefer something a bit more incremental.
Pretty much a textbook use case for ZFS with some kind of snapshot-rolling utility. Snap every hour, send backups once a day, prune your backups according to some timetable. Transfer as incrementals against the previous stored snapshot. Plus you get great data integrity checking on top of that.
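A minimal version of that loop, with made-up pool and host names (tools like sanoid/syncoid automate the same idea):

    # hourly, from cron: take a snapshot
    zfs snapshot tank/data@$(date +%Y-%m-%d-%H%M)

    # daily: send only the delta between the last two retained snapshots off-site
    zfs send -i tank/data@2021-03-09 tank/data@2021-03-10 \
        | ssh backup.example.com zfs receive -u backuppool/data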
with all due respect here - I've never heard of it either, and that's not what you want with a filesystem.
The draw of ZFS is that it's the log-structured filesystem with 10 zillionty hours of production experience that says that it works. And that's why BTRFS is not a direct substitute either. Or Hammer2. There are lots of things that could be cool, the question is are you willing to run them in production.
There is a first-mover advantage in filesystems (that occupy a given design and provide a given set of capabilities). At some point a winner sucks most of the oxygen out of the atmosphere here. There is maybe space for a second place winner (btrfs), there isn't a spot for a fourth-place winner.
I use tarballs because it allows me to not trust the backup servers. ssh is set up such that the backup servers' keys are certified to run only a single backup script that returns the encrypted data, and nothing else.
It's very easy to use spare storage in various places to do backups this way, as ssh, gpg and cron are everywhere, and you don't need to install any complicated backup solutions or trust the backup storage machines much.
All you have to manage centrally is private keys for backup encryption, and CA for signing the ssh keys + some occasional monitoring/tests.
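Roughly, that pattern might look like this (the key, script name, and paths are illustrative): the backup host can fetch data on a schedule but never gets a shell and never sees plaintext.

    # on the primary, in ~/.ssh/authorized_keys (or via a force-command in the signed cert):
    command="/usr/local/bin/backup-to-stdout",restrict ssh-ed25519 AAAA... backup-host

    # /usr/local/bin/backup-to-stdout: stream a tarball encrypted to an offline key
    #!/bin/sh
    set -eu
    tar -C / -czf - etc var/lib/app | gpg --encrypt --recipient backups@example.com

    # on the backup host, from cron: whatever command it asks for, only the script above runs
    ssh primary.example.com > /backups/primary-$(date +%F).tar.gz.gpg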
I thought so too for a long while. Until I was trying to restore something (just to test things), and wasn’t able to... it might have been specific to our GPG or an older version or something... but I decided to switch to restic and am much happier now.
Restic has a single binary that takes care of everything. It feels more modern and seems to work really well. Never had any issue restoring from it.
Just one data point. Stick to whatever works for you. But important to test not only your backups, but also restores!
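For reference, the core restic workflow is short; the repository location and paths below are just examples:

    export RESTIC_REPOSITORY=s3:s3.amazonaws.com/my-backup-bucket
    export RESTIC_PASSWORD_FILE=/etc/restic/password
    restic init                                        # once, to create the repository
    restic backup /etc /var/lib/app                    # deduplicated, incremental snapshots
    restic check                                       # verify repository integrity
    restic restore latest --target /tmp/restore-test   # actually test a restore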
I've been using Duplicati forever. The fact that it's C# is a bit of a pain (some distros don't have recent Mono), but running it in Docker is easy enough. Being able to check the status of backups and restore files from a web UI is a huge plus, so is the ability to run the same app on all platforms.
I've found duplicity to be a little simplistic and brittle. Purging old backups is also difficult, you basically have to make a full backup (i.e. non-incremental) before you can do that, which increases bandwidth and storage cost.
Restic looks great feature-wise, but still feels like the low-level component you'd use to build a backup system, not a backup system in itself. It's also pre-1.0.
Interesting, I will check Restic out, I’ve heard other good things about it. Duplicity is a bit of a pain to set up and Restic’s single binary model is more straightforward (Go is a miracle). Thanks for the recommendation!
GPG is a bit quirky but I do regularly check my backups and restores (if once every few months counts as regular).
Ditto. Moved to rclone after having a bunch of random small issues with Duplicity that on their own weren't major but made me lose faith in something that's going to be largely operating unsupervised except for a monthly check-in.
Self-hosted Kubernetes and a FreeNAS storage system at home, and a couple of VMs in the cloud. I've got a mixed strategy, but it covers everything to remote locations.
Personal: I run a webserver for some website (wordpress + xenforo), I've set up a cronjob that creates a backup of /var/www, /etc and a mysql database dump, then uploads it to an S3 bucket (with automatic Glacier archiving after X period set up). It should be fairly straightforward to rent a new server and set things back up. I still dislike having to set up a webserver + php manually though, I don't get why that hasn't been streamlined yet.
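For illustration, a nightly cron job of that shape can stay very small; the bucket and paths here are invented, and the Glacier archiving is just a lifecycle rule on the bucket:

    #!/bin/sh
    set -eu
    STAMP=$(date +%F)
    mysqldump --single-transaction --all-databases | gzip > /tmp/db-$STAMP.sql.gz
    tar -czf /tmp/www-$STAMP.tar.gz /var/www /etc
    aws s3 cp /tmp/db-$STAMP.sql.gz  s3://example-backups/$STAMP/
    aws s3 cp /tmp/www-$STAMP.tar.gz s3://example-backups/$STAMP/
    rm -f /tmp/db-$STAMP.sql.gz /tmp/www-$STAMP.tar.gz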
My employer has a single rack of servers at HQ. It's positioned at a very specific angle with an AC unit facing it, their exact positions are marked out on the floor in tape. The servers contain VMs that most employees work on, our git repository, issue trackers, and probably customer admin as well. They say they do off-site backups, but honestly, when (not if) that thing goes it'll be a pretty serious impact on the business. They don't like people keeping their code on their take-home laptop either (I can't fathom how my colleagues work and how they can stand working in a large codebase using barebones vim over ssh), but I've employed some professional disobedience there.
Basically the same (offsite backups), but the details are in the what and how, which is subjective... For my purposes I decided that offsite backups should only comprise user data, and that all server configuration should be 100% scripted, with some interactive parts to speed up any customization, including recovering backups. I also have my own backup servers rather than using a service, and implement immutable incremental backups with rotated ZFS snapshots (this is way simpler than it sounds). I can highly recommend ZFS as an extremely reliable incremental backup solution, but you must enable block-level deduplication and expect it to gobble up all the server RAM to be effective (which is why I dedicate a server to it and don't need masses of cheap slow storage)... The backup server itself is also restorable by script and only relies on having at least one of the mirrored block devices intact, which I make a local copy of occasionally.
I'm not sure how normal this strategy is outside of container land but I like just using scripts, they are simple and transparent - if you take time and care to write them well.
This sounds like what I want to do for the new infrastructure I'm setting up in one of OVH's US-based data centers. Are you running on virtual machines or bare metal? What kind of scripting or config management are you using?
VPS although there is no dependency on VPS manager stuff so I don't see any issue with running on bare metal. No config managers, just bash scripts.
They basically install and configure packages using sed or heredocs with a few user prompts here and there for setting up domains etc.
If you are constantly tweaking stuff this might not suit you, but if you know what you need and only occasionally do light changes (which you must ensure the scripts reflect) then this could be an option for you.
It does take some care to write reliable clear bash scripts, and there are some critical choices like `set -e` so that you can walk away and have it hit the end and know that it didn't just error in the middle without you noticing.
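A typical defensive header for such scripts, so a failure is loud rather than silent, might be:

    #!/usr/bin/env bash
    # stop on errors, on unset variables, and on failures anywhere in a pipeline
    set -euo pipefail
    trap 'echo "FAILED at line $LINENO" >&2' ERR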
Servers are at a mix of "cloud" providers, and on-site. Most data (including system configs!) is backed up on-site nightly, and to B2 nightly with historical copies - and critical data is also live-replicated to our international branches. (Some "meh" data is backed up only to B2, like our phone logs; we can get most of the info from our carrier anyway).
Our goal and the reason we have a lot of stuff backed up on-prem is to have our most time-critical operations back up within a couple of hours - unless the building is destroyed, in which case that's a moot point and we'll take what we can get.
A dev wiped our almost-monolithic sales/manufacturing/billing/etc MySQL database a month or two ago. (I have been repeatedly overruled on the topic of taking access to prod away from devs) We were down for around an hour. Most of that time was spent pulling gigs of data out of the binlog without also wiping it all again. Because our nightly backups had failed a couple weeks prior - after our most recent monthly "glance at it".
Less than a day for disaster recovery on fresh hardware? Same as my case. As you say, good enough for most purposes, but I'm also looking for improvement. I have offsite realtime replicas for data and MariaDBs, and offsite nightly backups (a combo of rsnapshot, lsyncd, MariaDB multi-source replication, and a post-install script that sets up almost everything in case you have to recover on bare metal, i.e. no available VM snapshots).
Currently trying to reduce that "less than a day", though. I recently discovered "ReaR" (Relax and Recover) from RedHat, which sounds really nice for bare-metal servers. Not everybody runs virtualized/in the cloud (being able to recover from VM snapshots is really a plus). Let's share experiences :)
We have two servers at OVH (RBX and GRA, not SBG). I make backups of all containers and VMs every day and keep the last three, plus one per month. Backups are stored on a separate OVH storage disk and also downloaded to an on-premise NAS. In case of a disaster, we'd have to rent a new server, reprovision the VMs and containers, and restore the backups. About two days of work to make sure everything works fine, and we could lose about 24 hours of data.
It's not the best in terms of Disaster Recovery Plan but we accept that level of risk.
Nothing too crazy, just a simple daily cron to sync user data and database dumps on our OVH boxes to Backblaze and rsync.net. This simple setup has already saved our asses a few times.
Most people/companies don't have the money to set up those disaster plans, which require you to have a similar server ready to go and also a backup solution like Amazon S3.
I was affected: my personal VPS is safe but down, and the other VPSes I was managing I don't know anything about. I have the backups, and right now I'd love for them to just set me up a new VPS so I can restore the backups and get the services running again.
I only have a personal server running in Hetzner but it's mirrored onto a tiny local computer at home.
They both run postfix + dovecot, so mail is synced via dovecot replication. Data is rsync-ed daily, and everything has ZFS snapshots. MySQL is not set up for replication - my home internet breaks often enough to cause serious issues - so instead I drop everything every day, import a full dump from the main server, and do a local dump as backup on both sides.
Not saying that you should never do a full mysql dump. Nor that you should not ensure that you can import a full dump.
But when you already use ZFS you can do a very speedy full backup with:
mysql << EOF
FLUSH TABLES WITH READ LOCK;
system zfs snapshot data/db@snapname
UNLOCK TABLES;
EOF
Transfer the snapshot off-site (and test!). Either as a simple filecopy (the snapshot ensured a consistent database) or a little more advanced with zfs send/receive. This is much quicker and more painless than mysql dump. Especially with sizeable databases.
Do you even need to flush the tables and grab a read lock while taking the ZFS snapshot? My understanding was that since ZFS snapshots are point-in-time consistent, taking a snapshot without flushing tables or grabbing a read lock would be safe; restoring from that snapshot would be like rebooting after losing power.
I think you are correct. But then you risk data loss, just as you would with an unclean shutdown.
I much prefer to have a known clean state which all things considered should be a safer bet.
Just like some are OK running without fsync.
I haven't had to "back up servers" for a long time now. I have an Ansible playbook to deploy and orchestrate services, which, in turn, are mostly dockerized. So my recovery plan is to turn on a "sorry, maintenance" banner via the CDN, spin up a bunch of new VPSes, run the Ansible deployment playbook, and restore the database from a hidden replica or the latest dump.
My recovery plan: tarball & upload to Object Store. I'm going to check out exactly how much replication the OVH object store offers, and see about adding a second geographic location, and maybe even a second provider, tomorrow.
If your primary data is on OVH, I'd look at using another company's object store if feasible (S3, B2, etc). If possible, on another payment method. (If you want to be really paranoid, something issued under another legal entity.)
There's a whole class of (mostly non-technical) risks that you solve for when you do this.
If anything happens with your payment method (fails and you don't notice in time; all accounts frozen for investigation), OVH account (hacked, suspended), OVH itself (sudden bankruptcy?), etc, then at least you have _one_ other copy. It's not stuff that's likely to happen, but the cost of planning for it at least as far as "haven't completely lost all my data even if it's going to be a pain to restore" here is relatively minimal.
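Even copying the same tarballs to a second, unrelated provider covers most of that; e.g. with rclone, where the remote names are assumptions set up beforehand via rclone config:

    rclone copy /backups/site-2021-03-10.tar.gz ovh-swift:backups/
    rclone copy /backups/site-2021-03-10.tar.gz b2-offsite:backups/   # different company, different account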
I have three servers (1 OVH - different location, 2 DO). The only thing I backup is the DB, which is synced daily to S3. There's a rule to automatically delete files after 30 days to handle GDPR and stop the bucket and costs spiralling out of control.
Everything is managed with Ansible and Terraform (on DO side), so I could probably get everything back up and running in less than an hour if needed.
That makes it sound like you didn't try/practice. I imagine that in a real-life scenario things will be a little more painful than in one's imagination.
Exactly. Having a plan is only part of it. Good disaster plans do dry runs a couple of times a year (when the clocks change is always a convenient reminder). If you rehearse the recovery when you're not panicked, you have a better chance of not skipping a step when the timing is much more crucial. Also, some sort of guide with steps given procedurally is a great idea.
I don't think this is necessarily true for all parts of a disaster plan. Some mechanisms may be untestable because it is unknown how to actually trigger it (think certain runtime assertions, but on a larger scale).
Even if it possible to trigger and test, actually using the recovery mechanism may have some high cost either monetarily or maybe losing some small amount of data. These mechanisms should almost always be an additional layer of defense and only be invoked in case of true catastrophe.
In both cases, the mechanisms should be tested as thoroughly as possible, either through artificial environments that can simulate improbable scenarios or, in the latter case, on a small test environment to minimize cost.
I haven't ever deleted everything and timed how long I could get it up and running again, but I have tested it works by spinning up new machines and moving everything over to there (it was easier than running "sudo apt-get dist-upgrade").
Here's what i do for my homelab setup that has a few machines running locally and some VPSes "in the cloud":
I personally have almost all of the software running in containers with an orchestrator on top (Docker Swarm in my case, others may also use Nomad, Kubernetes or something else). That way, rescheduling services on different nodes becomes less of a hassle in case of any one of them failing, since i know what should be running and what configuration i expect it to have, as well as what data needs to be persisted.
At the moment i'm using Time4VPS ( affiliate link: https://www.time4vps.com/?affid=5294 ) for the stuff that needs decent availability and because they're cheaper than almost all of the alternatives i've looked at (DigitalOcean, Vultr, Scaleway, AWS, Azure) and that matters to me.
Now, in case the entire data centre disappears, all of my data would still be available on a few HDDs under my desk (which are then replicated to other HDDs with rsync locally), given that i use BackupPC for incremental scheduled backups with rsync: https://backuppc.github.io/backuppc/
For simplicity, the containers also use bind mounts, so all of the data is readable directly from the file system, for example, under /docker (not really following some of the *nix file system layout practices, but this works for me because it's really easy to tell where the data that i want is).
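For example (the image and host path here are purely illustrative), running Postgres with its data directory bind-mounted under /docker keeps the files the backup job needs in plain sight on the host:

    docker run -d --name db \
        -v /docker/postgres/data:/var/lib/postgresql/data \
        -e POSTGRES_PASSWORD=example \
        postgres:13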
I actually had to migrate over to a new node a while back, took around 30 minutes in total (updating DNS records included). Ansible can also really help with configuring new nodes. I'm not saying that my setup would work for most people or even anything past startups, but it seems sufficient for my homelab/VPS needs.
My conclusions:
- containers are pretty useful for reproducing software across servers
- knowing exactly which data you want to preserve (such as /var/lib/postgresql/data/pgdata) is also pretty useful, even though a lot of software doesn't really play nicely with the idea
- backups and incremental backups are pretty doable even without relying on a particular platform's offerings, BackupPC is more than competent and buying HDDs is far more cost effective than renting that space
- automatic failover (both DNS and moving the data to a new node) seems complicated, as does using distributed file systems; those are probably useful but far beyond what i actually want to spend time on in my homelab
- you should still check your backups
A status update on the OVH tracker for a different datacenter (LIM-1 / Limburg) says "We are going to intervene in the rack to replace a large number of power supply cables that could have an insulation defect." [0][1] The same type of issue is "planned" in BHS [3] and GRA [2].
Eerie timing: do they possibly suspect some bad cables?
>Eerie timing: do they possibly suspect some bad cables?
Why not? Cables rated lower than the load they are carrying are a prime cause of electrical fires. If the load is too high for long enough, the insulation melts away, and if other material is close enough to catch fire, then that's the ball game. It's a common cause of home electrical fires: some lamp with poor wiring catches the drapes on fire, etc. Wouldn't think a data center would have flammable curtains, though.
This could well be the probable cause. When I was a teen I witnessed such a fire once. Basically, a friend had a heater but couldn't find a mains cable, and eventually decided to disconnect the mains cable from a radio and use that. After a few minutes the insulation on the mains cable melted away, the cable turned glowing red, and it started burning the table it was on. Fortunately we were not asleep and got it under control quickly. Lesson learned.
We have several bare-metal servers in GRA/Gravelines and RBX/Roubaix; 3 weeks ago we had a 3-hour downtime on RBX because they were replacing power cords without prior notification.
Maybe they were aware this could happen and were in the process of fixing it.
Treated lumber is generally considered to be fairly fireproof (comparable to steel or concrete, though with different precise failure modes). It depends on exactly what kind of wood is being used. A treated 12x12 beam is very fire resistant; plywood is less so.
The issue is you'll have lots of plastic (cabling) in a DC, and plastic will burn
There is self-extinguishing cable insulation, though. I'm actually surprised this (DC flammability) is still an issue, and not already solved by making components self-extinguishing and banning non-tiny batteries inside the equipment. If you want to have a battery for your RAID controller, put something next to it that will stop your system from burning down its surroundings.
Data centres I used to work in back in the early 2000s had argonite gas dumps in place (prior to argonite, halon used to be popular but is an ozone depleting gas so was phased out)
In the case of a fire, it would dump a lot of argonite gas in and consume a large amount of the oxygen in the room, depriving the fire of fuel. It's also safe and leaves minimal clean-up work afterwards, doesn't harm electronics etc. unlike sprinklers and the like.
The amount of oxygen left is sufficient for human life, but not for fires, though my understanding is that it can be quite unpleasant when it happens. You won't want to hang around.
One of ours had a giant red button you could hold to pause the 60 second timer before all the oxygen was displaced. Every single engineer was trained to immediately push that if people were in the room because it was supposedly a guaranteed death if you got stuck inside once the system went off.
Well, yeah, these normal inert gas fire suppression systems don't do a good job if humans can still breathe.
The Novec 1230 based ones can actually be sufficiently effective for typical flammability properties you can cheaply adhere to in a datacenter, but even then you iirc would want to add both that and some extra oxygen, because the nitrogen in the air is much more effective at suffocating humans than at suffocating fire. This stuff is just a really, really heavy gas that's liquid below about body temperature (boils easily though), and the heat capacity of gasses is mostly proportional to their density.
Flames are extinguished by this cooling effect (identical to water in that regard), but humans rely on catalytic processes that aren't affected by the cooling effect.
If you could keep the existing oxygen inside, while adding Novec 1230, humans could continue to breathe while the flames would still be extinguished, but this would require the building/room to be a pressure chamber that holds about half an atmosphere of extra pressure. I'm pretty sure just blowing in some extra oxygen with the Novec 1230 would be far cheaper to do safely and reliably.
I mean, in principle, if you gave re-breathers to the workers and have some airlocks, you could afford to keep that atmosphere permanently, but it'd have to be a bit warm (~30 C I'd guess). Don't worry, the air would be breathable, but long-term it'd probably be unhealthy to breathe in such high concentrations and humans breathing would slightly pollute the atmosphere (CO2 can't stay if it's supposed to remain breathable).
Just to be clear: in an effective argonite extinguishing system you'd have about a minute or two until you pass out and need to be dragged out, ideally get oxygen, get ventilated (no brain, no breathing) and potentially also be resuscitated (the heart stops shortly after your brain from a lack of oxygen, so if you're ventilated fast enough, it never stops and you wake up a few externally-forced breaths later). Having an oxygen bottle to supplement your breaths would fix that problem for as long as it's not empty.
> I mean, in principle, if you gave re-breathers to the workers and have some airlocks, you could afford to keep that atmosphere permanently,
At this point I feel like it would be cheaper just to not have workers go there. Fill the place completely full of nitrogen with an onsite nitrogen generator (and only 1atm pressure). Have 100% of regular maintenance and as much irregular maintenance as possible be done by robots. If something happens that requires strength/dexterity beyond the robots (e.g. a heavy object falling over), either have humans go in in some form of scuba gear, or if you can work around it just don't fix it.
That seems reasonable.
But just to clarify what I meant with airlock: some thick plastic bag with a floor-to-ceiling zipper on the "inner" and "outer" ends, and for entry, it's first collapsed by a pump sucking the air out of it. Then you open the zipper on the outer end, step in, close the zipper, let the pump suck away the air around you, and open the inner zipper (they should probably be automatically operated, as you can't move well/much when you are "vacuum bagged").
For exit, basically just the reverse, with the pump pumping the air around the person to wherever the person came from.
The general issue with unbreathable atmospheres is that a failure in their SCBA gear easily kills them.
And re-breathers that are only there so you don't have to scrub the atmosphere in the room as often shouldn't be particularly expensive. You may even get away with just putting a CO2 scrubber on the exhaust path, and giving them slightly oxygen-enriched bottled air so you can keep e.g. a 50:50 oxygen:nitrogen ratio inside (so e.g. 20% O2, 20% N2, 60% Novec 1230).
And it doesn't even need to be particularly effective, as you can breathe in quite a bit of the ambient air without being harmed, and the environment can tolerate some of your CO2. Like, as long as it scrubs half of your exhausted CO2 it won't even feel stuffy in there (you could handle the ambient air you'd have without re-breathers being used, as it'd be just 1.6% CO2, but you'd almost immediately get a headache).
They'd have an exhaust vent for pressure equalization, which would chill the air to condense and re-cycle the Novec 1230. For pressure equalization in the other direction, they'd probably just boil off some of that recycled Novec 1230.
So yeah, re-breather not needed, if you just get a mouth+nose mask to breathe bottled 50:50 oxygen:nitrogen mix.
That 50% oxygen limit (actually 500 mBar) is due to long-term toxicity, btw. Prolonged exposure to higher levels causes lung scarring and myopia/retina detachment, so not really fun.
I think Microsoft had some experimental datacenter containers they submerged in the northern Atlantic for passive cooling, and I believe those were filled with an inert gas as well. I guess that would come very close to an actual fireproof datacenter.
Yep you’re right, learning about this was part of some onboarding training we had to complete...
it was an interesting proof of concept, but finding the right people to maintain the infra with both IT and Scuba skills was a narrow niche to nail down ;)
I don't think opening it at any point before decommissioning it completely is even an afterthought with that. They just write off any failures and roll with it as long as it's viable.
Well, you only need to get shared infrastructure reliable enough that you can afford to not design it with repair in mind.
The cloud servers are already designed without unit-level maintenance work in mind, which saves money by eliminating rails and similar. They get populated racks from the factory and just cart them from the dock to their place, plug them in (maybe run some self-test to make sure there are no loose connectors or so), and eventually decommission them after some years.
From the outside (and from a position lacking all inside knowledge) it looks highly interconnected and very well ventilated. I'm not sure where you'd put an inert gas suppression system or beefy firewalls to slow the fire's progress.
I feel like there should be place to report infrastructure suppliers with misleading status pages, some kind of crowdsourced database. Without this information, you only find out that they are misleading when something goes very wrong.
At best you might be missing out on some SLA refunds, but at worst it could be disastrous for a business. I've been on the wrong side of an update-by-hand status system from a hosting provider before and it wasn't fun.
Agreed, though. A fake status page is worse than no status page. I don't mind if the status page states that it's manually updated every few hours as long as it's honest. But don't make it look like it's automated when it's not.
Wtf is this disclaimer on Down Detector for? (Navigate to the OVH page.) It sits in front of user comments, I think:
> Unable to display this content due to missing consent.
> By law, we are required to ask your consent to show the content that is normally displayed here.
They are not the only ones though. All too common. Well, it's tricky to set this up properly. The only proper way would be to use external infra for the status page.
It's not difficult to make a status page with minimal false negatives. Throw up a server on another host that shows red when it doesn't get a heartbeat. But then instead you end up with false positives. And people will use false positives against you to claim refunds against your SLA.
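A bare-bones version of that heartbeat, with made-up hostnames and paths: the monitored host touches a marker file on the status host every minute, and the status host flips a static page to red if the marker goes stale.

    # on the monitored host (cron, every minute)
    ssh status.example.net touch /var/lib/heartbeat/primary

    # on the status host (cron, every minute)
    if find /var/lib/heartbeat/primary -mmin -5 2>/dev/null | grep -q .; then
        cp /var/www/status/green.html /var/www/status/index.html
    else
        cp /var/www/status/red.html /var/www/status/index.html
    fi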
As someone who maintained a status page (poorly), I'm sorry on behalf of all status pages.
But, they're usually manual affairs because sometimes the system is broken even when the healthcheck looks ok, and sometimes writing the healthcheck is tricky, and always you want the status page disconnected from the rest of the system as much as possible.
It is a challenge to get "update the status page" into the runbook. Especially for runbooks you don't review often (like the one for when the building is on fire, probably).
Luckily my status page was not quite public; we could show a note when people were trying to write a customer service email in the app; if you forget to update that, you get more email, but nobody posts the system is down and the status page says everything is ok.
Yep. I guess what could be done is a two-tiered status page: automated health check which shows "possible outage, we're investigating" and then a manual update (although some would say it looks lame to say "nah, false positive" which is probably why this setup is rare).
Well, it sucks to catch fire and I care for the employees and the firemen, but if their status page is a lie then I have a whole lot less sympathy for the business. That's shady business and they should feel bad.
I can appreciate an honest mistake though, like the status page server cron is hosted in the same cluster that caught fire and hence it burnt down and can't update the page anymore.
Is the status page relevant though? At the very least, OVH immediately made a status announcement on their support page and they've been active on Twitter. I don't see anything shady here. From their support page:
> The whole site has been isolated, which impacts all our services on SBG1, SBG2, SBG3 and SBG4. If your production is in Strasbourg, we recommend to activate your Disaster Recovery Plan
What's the point of a status page then if it does not show you the status? I don't want to be chasing down twitter handles and support pages during an outage.
It seems to be a static site, which seems reasonable since it aggregates a lot of data and might encounter high load when something goes wrong, so generating it live without caching is not viable. So maybe the server that normally updates it is down too (not that this would be a good excuse)?
SBG2 was HUGE and if this isn't a translation error on the part of Octave (which I could understand given the stress and ESL) I have a hard time fathoming what kind of fire could destroy a whole facility with nearly 1000 racks of equipment spread out across separated halls.
I'm really hoping "destroyed" here means "we lost all power and network core and there's smoke/fire/physical damage to SOME of that"
I can't even fathom a worst-case scenario of a transformer explosion (which does occur and I've seen the aftermath of) having this big of an impact. Datacenters are built to contain and mitigate these kinds of issues. Fire breaks, dry-pipe sprinkler systems and fire-extinguishing gas systems are all designed to prevent a fire from becoming large-scale.
Really glad nobody was hurt. OVH is gonna have a bad time cleaning all this up.
If fire suppression kicks in or the fire department shows up with their hoses, would they still say the fire destroyed it, or just say it was destroyed due to fire and water damage?
Also, fire suppression systems do fail. There was an infamous incident in LA for one of the studios. They built a warehouse to be a tape vault, with tapes going back to the 80s. A fire started, but the suppression system failed because there was not enough pressure in the system. Total loss. Got to keep your safety equipment properly tested!
It wasn’t just “tapes going back to the 80s.” Those were just the media Universal initially admitted to losing. No, that building was the mother lode. It had film archives going back over 50 years, and worst of all — unreleased audio tape masters for thousands upon thousands of recording artists. The amount of “remastered” album potential that fire destroyed is probably in the billions of dollars, let alone the historical loss of all those recordings by historical persons that will never be heard due to a failure of a fire prevention system. Fascinating case study in why you should never put all your eggs in one basket.
In 1996 Surinam lost many of their government archives after a fire burned the building down.
I can find surprisingly few English-language resources for this (only Dutch ones); guess it's a combination of a small country + before the internet really took off.
This is why I'm such a fan of digitizing. If you have 1 film master you effectively have none. Do some 8K or 16K scans of that master and effectively manage the resulting data and you're effectively 100% immune from future loss in perpetuity.
There's a problem with testing sprinklers: engaging them can be damaging to contents and even structures. So, we're talking about completely emptying the facility, then taking it offline to dry for a time. I've never heard about this being done to anything that was already operational (but I wasn't researching this either).
There are methods of testing the water pressure in the pipes without actually engaging the sprinkler heads. It is part of the normal checks done during the maintenance/inspection a business is supposed to have done. In fact, one place I was in had sensors, and would sound the actual fire alarm if the pressure fell below tolerance at any time. The lack of pressure in the Universal vault was 100% inexcusable.
It’s common for sprinklers in parking garages in cold climates to be “dry” where there’s a bubble of air that needs to run through first before water shoots out from a non-freezeable source.
Want to test the system? Just turn off that water valve, hook up your air compressor and pressure test to your heart’s content.
Halon has been pretty much banned for years now; other agents have been introduced.
Sadly, making an actual full test of a gaseous extinguishing system (such as FM200 or Novec 1230) can be prohibitively expensive (mainly costs of the "reloading" of the system with new gas). Those are just mostly tested for correct pressure in tanks and if the detection electronics are working fine, making an actual dump would be very impractical (evacuation of personnel, ventilating the room afterwards etc.)
Good point. But, apart from cost of the gas, mentioned by sibling comment, that also isn't disruption-free. Gas-based systems are by definition (they displace oxygen) dangerous to humans. But yeah, not being able to run maintenance is orders of magnitude less problematic.
OVH design their own datacenters, so it's possible that they missed something or some system or another didn't work as intended, thus the heavy damage.
There are photos of the fire suppression system in one of their data centres in this (French) forum thread: https://lafibre.info/ovh-datacenter/ovh-et-la-protection-inc.... They have sprinklers, with the reasoning being that the burning racks' data is gone anyway if there's a fire, and at least sprinklers don't accidentally kill the technicians.
> They have sprinklers, with the reasoning being that the burning racks' data is gone anyway if there's a fire
I think the real problem, per that post, is this:
>> They are simple sprinklers that spray with water. It has nothing to do with very high pressure misting systems, where water evaporates instantly, and which save servers. Here, it's watering, and all the servers are dead. It's designed like that. Astonishing, isn't it?
>> Obviously, they rely above all on humans to extinguish a possible fire, unlike all conventional data centers.
(all thanks to Google Translate)
This strikes me as a terrible safety system because even if a human managed to detect the fire, they have to make a big call: is the risk of flooding the facility and destroying a ton of gear worth putting out a fire? By the time the human decides "yes, it is", it may well be too late for the sprinklers.
> and at least sprinklers don't accidentally kill the technicians.
Not a real risk with modern 1:1 argon:nitrogen systems – the goal is to pump in inert gases and reduce oxygen content to around 13%, a point where the fire is suppressed and people can survive. You wouldn't want to be in a room breathing 13% oxygen for a long time, but it won't kill you.
All in all, it looks like this was a "normal accident"[1] for a hosting company that aggressively competes on price. The data center was therefore built with less expensive safeties and carried a higher risk of catastrophic failure.
Given that they're headquartered in Europe and most popular there, why is the satellite location better? Is it because the Canadian data center is newer, because Canada has stronger regulations in this area, or something else? Also, does anyone know how the OVH US data centers compare?
According to a 2013 forum post by the now CEO of Scaleway, a competitor of OVH, it's due to North American building regulations that basically force you into sprinklers and stuff for insurance reasons.
"Every data center room is fitted with a fire detection and extinction system, as well as fire doors. OVHcloud complies with the APSAD R4 rule for the installation of mobile and portable fire extinguishers and has the N4 certificate of conformity for all our data centers."
https://us.ovhcloud.com/about/company/security
I find it very hard to believe that that would pass code anywhere in the US/EU or most of the world. They may not have had sprinklers but that doesn't mean there isn't fire suppression.
Data centers frequently burn down, or are destroyed by natural disaster.
These days, fire suppression systems need to be non-lethal, so inert gasses are out. Water is too, for obvious reasons. Last I checked, they flooded the inside of the DC with airborne powder that coats everything (especially the inside of anything with a running fan). Once that deploys, the machines in the data center are a write off even if the fire was minor.
Just guessing, but maybe a fire suppression system going off could wipe out all the machines?
The couple datacenters I've been inside were small, old and used halon gas which wasn't supposed to destroy the machines. No idea how it works in big places these days.
I've also seen a weird video (lost to time unfortunately), where someone showed that they could yell at their servers in a data centre and introduce errors (or something similar). Was very strange to see, but they had a console up clearly showing that it was having an impact.
We had the same issue at a customer site. To add: since the decibels were outside the rated environment, the warranty was void on the hard disks and they had to be replaced even if they were not destroyed.
And there's a 2017 (!) article where they were planning to remove SBG1 and SBG2 because of power issues!
> OVH to Disassemble Container Data Centers after Epic Outage in Europe
> “This is probably the worst-case scenario that could have happened to us.”
> OVH [...] is planning to shut down and disassemble two of the three data centers on its campus in Strasbourg, France, following a power outage that brought down the entire campus Friday, causing prolonged disruption to customer applications that lasted throughout the day and well into the evening.
> In the case of Roubaix 4, the Datacenter is made with a lot of wood:
> Finally, we have other photos of the floor of the OVH "Roubaix 4" tower. It is clearly wood! Hope it's fireproof wood! A wooden datacenter ... is still original, we must admit.
> In France, data centers are mainly regulated by the labor code, by ICPE recommendations (with authorization or declaration) and by insurers.
> At the purely regulatory level, the only things that are required are:
> - Mechanical or natural smoke extraction for blind premises or those covering more than 300m2
> - The fire compartmentalization beyond a certain volume / m2
> - Emergency exits accessible with a certain width
> - Ventilation giving a minimum of fresh air per occupant
> - Access for firefighters from the facade, for premises where the low floor of the last level is more than 8 meters up
> - 1 toilet for 10 people (occupying a position considered "fixed")
Am I looking at the wrong thing or am I right to wonder why we still bother with public status pages if it never shows the real status?
Edit: nvm just saw another comment pointing out the same further down the thread (I randomly came across this page while looking for the physical location of another DC)
> Due to a fire at one of our data centres, a few of our servers are down and may be down permanently. We are restoring these servers from backups and will enable puzzles and storm as soon as possible.
>
> We hope that everyone who is dealing with the fire is safe, including the firefighters and everyone at OVH. <3.
I opened two tabs to relax tonight: Hacker News and Lichess. This was the top HN thread, and Lichess is having issues because of the fire.
I didn't know what OVH was before 10 minutes ago, but this seems really impactful. I hope everyone there is safe and that the immediate disaster gets resolved quickly.
Look them up; they're one of the biggest hosting providers in the world (especially in France), and due to cheap prices they're especially popular with smaller-scale stuff.
Yeah, they scale down to their kimsufi line which used to have quite powerful dedicated servers for the price of basic VPSes from other providers.
e.g. They have a 4core, 16gb ram server for $22/mo which is 25% of what my preferred provider, Linode, charges.
Now, it comes with older consumer hardware (that one is a sandy bridge i5), and about as much support as the price tag suggests, as well as a dated management interface, but when I used to run a modded minecraft server as a college student, which needed excessive amounts of RAM and could be easily upset by load spikes on other clients, then it was a no-brainer, even if I would expect the modern-ish Xeons Linode uses to win on a core for core basis.
Dated? They're probably the only place that isn't $comedy "bare metal cloud" pricing that not only has an API for their $5/m servers but also the panel is a reasonably modern SPA that implements that API and uses OAuth for login
Yeah this is not at all what you use now. It's an Angular SPA.
It is still a mess of separate accounts, but you can use email addr to log in instead of random-numbers generated handle.
The OVHCloud US is completely separated for legal purposes, from what I remember. No account sharing, different staff, OVH EU cannot help you at all with US accounts.
Yes it changed. It was replaced by a more modern version a few years ago (but the transition was painful, as not everything was implemented in the new version when they started to deploy it).
I think playing "from position" is also broken? I was playing chess with my friend and we usually play "from position" but it wasn't working just now so we're playing standard instead. It might be an unrelated bug.
This is certainly a good reminder to have regular backups. I have (had?) a VPS in SBG1, the shipping-container-based data centre in the Strasbourg site, and the latest I know is that out of the 12 containers, 8 are OK but 4 are destroyed [1]. Regardless, I imagine it will be weeks / probably months before even the OK servers can be brought back online.
Naturally, I didn't do regular backups. Most of the data I have from a 2019 backup, but there's a metadata database that I don't have a copy of, and will need to reconstruct from scratch. Thankfully for my case reconstructing will be possible - I know that's not the case for everyone.
Right now I'm feeling pretty stupid, but I only have myself to blame. For their part, OVH have been really good at keeping everyone updated (particularly the Founder and CEO, Octave Klaba).
I believe that when I signed up back in 2017, the Strasbourg location was advertised as one of the cheapest, so I can imagine a lot of people with a ~$4 / month OVH VPS are in the same situation, desperately scrambling to find a backup.
(For those that have a OVH VPS that's down right now, you can find what location it is in by logging onto the OVH control panel.)
I can remember several San Diego fires that threatened the original JohnCompanies datacenter[1] circa mid 2000s and thinking about all of the assets and invested time and care that went into every rack in the facility.
Very interested to read the post-mortem here ... even more interested in any actionable takeaways from what is a very rare event ...
[1] Castle Access, as it was known, at the intersection of Aero Dr. and the 15 ... was later bought by Redit, then Kio ...
I have the same issue on my XPS 13 (4K screen), the header takes up a good 30% or so of the height of the screen and it's like reading through a mailbox slit.
The key differentiator of OVH is the very compact datacenters they achieve thanks to water cooling. Some OVH execs were touting that in a recent podcast.
Interestingly in this case, having a very compact data center was probably an aggravating factor. This shows how complex these technical choices are, you have to think of operating savings, with a trade off on the gravity of black swan events...
Interesting. That said, the technique is not the issue here, losing a whole datacenter can always happen. This event would have been much less serious if all the four SBG* datacenters were not all so close to each other on the same plot of land.
They are so close to each other that they are basically the same physical datacenter with 4 logical partitions.
They are all annexes built up over time and a victim of their own success (the site was meant to be more of an edge node in size). The container-based annexes were meant to be dismantled 3 years ago, but profit probably got in the way.
TBF "data" here means the state of the gameworld which anyway resets entirely every 2 weeks or every month, so it's not exactly a big deal, everyone constantly start over in Rust.
I can't see anything about a fire suppression system mentioned? Doesn't OVH have one, except for colocation datacenters?
A fire detection system using e.g. lasers, with Inergen (or Argonite) for putting the fire out, is commonly used in datacenters. The gas fills the room and reduces the amount of oxygen in it, so most fires are put out within a minute.
The cool thing is that the gas is designed to be used in rooms with people, so it can be triggered at any time. It is, however, quite loud, and some setups have been known to be too loud, even destroying hard drives.
At a datacentre I used to visit years ago, part of the site induction was learning that if the fire suppression alarm went off, you had a certain amount of time to get out of the room before the argon would deploy, so you should always have a path to the nearest exit in mind. The implication was that it wasn't safe to be in the room once it deployed, but I don't know for sure.
Yeah, that was always my understanding. GP saying "the gas is designed to be used in rooms with people, so that [it] can be triggered any time" made me second-guess that though.
Maybe there is a concentration of oxygen that is high enough for humans to survive, yet too low to sustain combustion?
The "gas" is more like a fog of capsules. FM-200 is a common one. Basically it has a Fire suppression agent inside crystals which are blasted into the room by compressed air. These crystals melt when they get over a certain temperature and therefore won't kill you; however, breathing that in isn't really pleasant.
A few years ago the company I work for installed suppressors on the Inergen system. It did trigger from time to time, which was tracked to the humidifiers. And yes -- it did destroy harddrives because of the pressure/sound waves before we installed the supressors. Haven't had any incidents after we fixed the humidifiers.
But Inergen (and other gases) is more or less useless if you allow it to escape too quickly. So the cooling system should be a fairly closed circuit.
So how is it safe with people if there is no oxygen left to breathe?
Reminds me of my first trip to a datacenter, where the guy who accompanied us said: "In the event of a fire this room is filled with nitrogen in 20 seconds. But don't worry: nitrogen is not toxic!"
Well, I was a little worried :)
Newer systems like the ones mentioned above are designed to reduce the amount of oxygen in the room to around 12% (down from around 21%). That's low enough to extinguish fires, but allows people to safely evacuate and prevent people from suffocating if they're incapacitated.
They aren't super common, but halon and other gas systems are just the right tool for the job. It can get inside the server chassis and doesn't damage equipment like a chemical application would.
We won't know what went wrong at OVH until a proper post mortem comes out. These systems work by suppressing the flame reaction, but if the actual source was not addressed, it could reignite after a while.
The decision to design and implement specialised tech comes from a combination of how likely the risk is and the magnitude of the potential loss. Fires are not that common in DCs, but the potential loss can be enormous (as OVH is currently finding out).
Seems like data.gouv.fr [1], the government platform for open data, is impacted; we might not get the nice COVID-19 graphs from non-governmental sites ([2], [3]) today.
I can't wait for the conspiracy theories about how the fire is a "cover up" to "hide" bad COVID-19 numbers...
[3] https://www.meteo-covid.com/trouillocarte (Just wanted to share the "trouillocarte" - which roughly translates to "'how badly is shit hitting the fan today' map" ;) )
In French but with pictures: https://lafibre.info/ovh-datacenter/ovh-et-la-protection-inc... - probably a different OVH data centre, but they clearly have sprinklers there. The argument they made against gas extinguishers is by the time they have to use it, the data is gone anyway, and it's only going to trigger in the affected areas. It's also far safer for the people working there.
That's a very interesting find! Google Translate seemed to do a good enough job on it for me. Eight years ago, and there are people in that thread ripping on what they see in the photos.
Almost 11 years ago (March 27, 2010) in Ukraine, the datacenter of the company Hosting.ua went up in flames as clients watched their systems go unresponsive across various rows of the datacenter.
The fire suppression systems didn't kick in. The reason? For a couple of days the system had been detecting a little bit of smoke from one of the devices. Operators weren't able to pinpoint the exact location, considered it a false alarm, and manually switched the fire suppression system off.
“Update 7:20am
Fire is over. Firefighters continue to cool the buildings with the water.
We don’t have the access to the site. That is why SBG1, SBG3, SBG4 won’t be restarted today.”
I sent this story to my colleagues and one of them asked "where is the FM200?"
I don't really know how FM200 systems work in data centres, but I'm guessing that if the fire didn't start from within the actual server room, FM200 might not save you? e.g. if a fire started elsewhere and went out of control, it would be able to burn through the walls/ceiling/floor of the server room, in which case no amount of FM200 gas can save you, right?
Another possibility, of course, is that the FM200 system simply failed to trigger even though the fire started from within the server room.
There is no published investigation details about this incident yet, I believe. Can somebody chime in about past incidents where FM200 failed to save the day?
I think most of these gases are or will eventually be banned in Europe because of their impact on the environment. I've seen newer datacenters use water mist sprays.
Nitrogen makes up 78% of the atmosphere, so I doubt it will be banned. Most datacenters don't actually use halocarbons despite the common "FM200" name.
You might be thinking of Halons, which are CFCs that depletes the ozone layer? They are mostly phased out worldwide but existing installations might still be in use.
FM200 is something else that is often used in modern builds (not just datacenters).
I've heard that one. I thought it mostly affects refrigerants, but I didn't notice that FM200 is also an HFC. There are other fire suppression gasses with a low global warming potential, which probably can still be used in the future.
How... what. What if the fire is electrical? You can't just go "well the triple interlocked electrical isolation will trip and cut the current" if a random fully-charged UPS decides to get angry...
Ah, so it's possible that they also used a water sprinkler system at SBG2. But still, I wonder why the fire protection system (water sprinkler, FM200, or otherwise) didn't save SBG2.
It doesn't really surprise me that the machines are dead, but the whole place being destroyed is much more surreal.
"Everyone is safe. Fire has destroyed SBG2. A part of SBG1 is destroyed. The firefighters are protecting SBG3. No impact on SBG4". Tweet from Octave Klaba, founder of OVHcloud. "All our clients on this site are possibly impacted"
"Everyone is safe and sound. The fire destroyed SBG2. Part of SBG1 is destroyed. Firefighters are currently protecting SBG3. No impact on SBG4," tweeted Octave Klaba, the founder of OVHcloud, referring to the different parts of the site. "All our clients on this data centre may be impacted," the company added on Twitter.
Reminder to not only have backups, but also have some periodic OFFLINE backups.
If your primary is set up with credentials to automatically transfer a copy to the backup destination over the network, what happens if your primary gets pwned and the access is used to encrypt or delete the backup?
Secondly, test doing restores of your backups, and have methods/procedures in place for exactly what a restore looks like.
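One mitigation for the "pwned primary deletes its own backups" case, if the backups go over ssh to something like borg: give the primary's key append-only access on the backup host, and do pruning and verification only from a trusted machine. The paths and key below are examples, and this only reduces the blast radius; it is still no substitute for genuinely offline copies.

    # on the backup host, in ~/.ssh/authorized_keys:
    command="borg serve --append-only --restrict-to-path /srv/borg/primary",restrict ssh-ed25519 AAAA... primary

    # from a trusted admin machine, periodically:
    borg prune --keep-daily 14 --keep-monthly 12 /srv/borg/primary
    borg check --verify-data /srv/borg/primary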
I think they announced their IPO plans yesterday [1], which is probably the worst timing one can have (if there even is a good time for a datacenter to burn down; probably there isn't).
If they have good insurance, I'm confident this will have little impact on their operations, and I really hope they do. I host a few components on OVH/SoYouStart dedicated servers, luckily nothing mission critical, but I've still had a rather good experience with them, especially in terms of price to performance.
> The publicity damage alone will be on par with (if not bigger) their replacement costs. I wouldn’t be surprised if they had to rebrand.
Honest question: which publicity damage?
A fire in a datacenter is very much part of the things you should expect to see happen when you operate a large number of datacenters and will obviously cause some disruption to your customers hosting physical servers there.
Provided the disruption doesn't significantly extend to their cloud customers and doesn't affect people paying for guaranteed availability (which it shouldn't - OVH operates datacenters throughout the world), this seems to me to be an unfortunate incident but not a business threatening one.
Most people, I feel, would expect fire suppression to kick in and prevent the whole data center (and the adjacent ones) from catching on fire. The fact that it didn't is concerning for their operations, since they build their own custom data centers. The fire isn't the issue; how much damage it did is the issue. So one can ask whether there was a systematic set of planning mistakes of which this is just the first to surface.
Wow, SBG3 seems to be OK: «Update 11:20am: All servers in SBG3 are okey. They are off, but not impacted. We create a plan how to restart them and connect to the network. no ETA. Now, we will verify SBG1.»
"Update 7:20am
Fire is over. Firefighters continue to cool the buildings with the water.
We don’t have the access to the site. That is why SBG1, SBG3, SBG4 won’t be restarted today."
Most modern data centres wouldn't have had this issue; at least in Australia they use Argonite suppression systems, which work by using a gas that is a mixture of argon and nitrogen and suppresses fire by depleting oxygen in the data hall.
I don't think Aussie DCs are all the same. Globalswitch in Sydney uses Inergen, but Equinix is using water, at least at SY1. I'm reasonably sure that GS is much older, though.
One of the things I'm still extremely grateful for is that I learnt the basics of computer science from an ex-oracle guy turned secondary school teacher, who wasn't the best programmer let's say but who absolutely drilled into us (I was probably the only one listening but still) the importance of code quality, backups, information security etc.
Nothing fancy, but it's the kind of bread and butter intuition you need to avoid walking straight off a cliff.
He also let me sit at the back writing a compiler instead of learning VB.Net. Top dude.
Trust but verify. As a developer it doesn't matter what sysadmins or anyone else says about backups of your data; if you haven't run your DR plan and verified the results, it doesn't exist.
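In that spirit, even a tiny scheduled job that actually restores the latest backup into a scratch area and checks it beats taking anyone's word for it. A rough sketch, assuming the backup is a tarball plus a JSON manifest of file hashes written at backup time (the paths and manifest format here are made up):

    import hashlib, json, pathlib, tarfile, tempfile

    BACKUP = pathlib.Path("/backups/app/latest.tar.gz")            # placeholder path
    MANIFEST = pathlib.Path("/backups/app/latest.manifest.json")   # {"relative/path": "<sha256>", ...}

    def verify_restore() -> bool:
        expected = json.loads(MANIFEST.read_text())
        with tempfile.TemporaryDirectory() as scratch:
            with tarfile.open(BACKUP) as tar:
                tar.extractall(scratch)                 # the actual restore step
            for rel_path, digest in expected.items():
                restored = pathlib.Path(scratch, rel_path).read_bytes()
                if hashlib.sha256(restored).hexdigest() != digest:
                    return False                        # corrupt or missing file
        return True

    if __name__ == "__main__":
        print("restore OK" if verify_restore() else "restore FAILED")

It doesn't replace a full DR drill, but it would catch the "backups have been silently failing since January" case mentioned elsewhere in this thread.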
There have been a handful of talks at computer security conferences about setting up physical traps in server chassis (such as this one: https://www.youtube.com/watch?v=XrzIjxO8MOs). Since seeing those, I've been waiting for some idiot to try something like that in a physical server and burn down a data center.
There is NO evidence that is what happened here, and I don't think OVH allows customers to bring their own equipment, making it even less likely. Still, I'm waiting and hoping to hear a root cause for this one.
> At 00:47 on Wednesday, March 10, 2021, a fire broke out in a room in one of our 4 datacenters in Strasbourg, SBG2. Please note that the site is not classified as a Seveso site.
> Firefighters immediately intervened to protect our teams and prevent the spread of the fire. At 2:54 am they isolated the site and closed off its perimeter.
> By 4:09 am, the fire had destroyed SBG2 and continued to present risks to the nearby datacenters until the fire brigade brought the fire under control.
> From 5:30 am, the site has been unavailable to our teams for obvious security reasons, under the direction of the prefecture. The fire is now contained.
If your backup snapshots are stored through OVH's normal backup functionality, then create a new server at e.g. RBX now, and restore from those backups. That'll take a few hours and it'll all be up again quickly.
Really sorry to hear, hope you get it restored. I cannot judge - I use the default backup options in Azure and hope they store it in another data centre but never though to check too hard. This is very bad luck.
Hopefully you had the code in GitHub, but that still leaves the DB. It looks like yours has something to do with command line or Linux lessons, so I'm not sure how much user data is critical. Maybe you can get this up and running again to some extent.
So far I haven't received any communication from OVH alerting me about this. I think that's the first thing they should do: alert their customers that something is happening.
Anyway, I was running a service that I was about to close, so this may be it. I do have a recovery plan, but I don't know if it is worth executing at this point.
I'm never using OVH again. The fire can happen, but don't ask me about my recovery plan; what about yours?
It says they've alerted customers, but I expect some have been missed through inaccurate email records, email hosted on the very systems that were destroyed, etc.
Was the building made from stacked shipping containers? Containers are such a budget-friendly and trendy structural building block these days. They even click with the software engineers - "Hey, it's like Docker".
Containers would seem to be at a disadvantage when it comes to dissipating, rather than containing, heat. I hope improved thermal management and fire suppression designs can be implemented.
My impression is that they tried very hard to maintain uptime, which was probably a bad idea when we see the extent of the damages. This VPS just hosts external facing services and is easy to set back up.
I see several people talking about advanced backup systems for businesses.
I do not have a company; I work as a freelancer. I am Brazilian, and the roughly 70 euros a month I was paying was already straining my income, since my local currency is quite devalued against the dollar.
Then imagine the situation: my websites are down, and the only backup I have is a copy of the VPS that I made in November last year, when I was planning to set up a server at my house because it was getting expensive to maintain this server at OVH. It would be unacceptable for a company of this size not to have its servers backed up, or to keep the backups in the same location as the incident, since their networks are all connected. I hope they come up with a satisfactory and quick solution to this problem.
If OVH backed up everything, the cost of the service would be double.
Many customers don't need a backup, so it's up to each customer to arrange their own backups — perhaps with tools and services provided by the hosting company, or their own solution.
Running a company with no backup (for cost or any other reason) is very risky, as some people will have found out today.
I had my server in SBG2 and sadly the backups had been failing since the end of January. Yep, it is my mistake for not checking the backups. Now I've lost about a month of data.
The only good thing is that my backup was offsite.
Does OVH offer automatic snapshots for VPSes? I know Hetzner does, for an additional 20% of the server cost. If they do, the next question would be whether those snapshots were destroyed too.
With OVH, depending on the situation, it can double your cost: if you've got a 3€ VPS, the backup option can be another 3€ (the price depends on size).
Well, not having managed backups is obviously part of choosing to go bare-metal. They do have triple-redundancy backups in their cloud offerings. Nobody to blame but yourself.
Also, if you’re hosting clients’ static websites, you were burning your money, there are way cheaper options out there (and fully managed).
Interesting. I wonder if the cladding was a major problem here? It looks like it has all burnt out and could have let the fire spread extremely rapidly on the outside.
They are different physical buildings. OVH does not generally claim anything regarding AZ or distance. SBG1, 2, 3, etc is just denoting the building your server is in - they are not like AWS style AZ or similar, quite literally just building addresses.
I have used them for years and I don't believe they've ever said anything like deploy in both SBG1 and SBG2 for safety or availability, because you don't get that choice.
When you provision a machine (eg via API) they tell you "SBG 5 min, LON 5 min, BHS 72h" and you pick SBG and get assigned first-available. There is no "I want to be in SBG4" generally.
Local as in your home/office. While your application may run in AWS or whatever remote server, its necessary to have copies of your data that you can physically touch and access.
One main deployment, one remote backup and one onsite physically accessible backup (a sanity-check sketch of this topology follows the scenarios below).
(site a)---[replicate local LUNs/shares to remote storage arrays]--->(site b),
(site a)---[replicates local VMs to remote HCI]--->(site b),
(site a)---[local backups to local data archive]--->(site a),
(site a)---[local data archive replicates to remote data archive]--->(site b),
(site b)---[remote data archive replicates to remote air gapped data archive]--->(site b),
(site a)---[replicates to cold storage on aws/gcp/azure]--->(site c),
(site c)---[replicate to another geo site on cloud]--->(site d)
scenario 1: site a is down
plan: recover to site b by most convenient means
scenario 2: site b is down
plan: restore services, operate without redundancy out of site a
scenario 3: site c is down
plan: restore services, catch up later. continue operating out of site a
scenario 4: site b and c down
plan: restore services, operate without redundancy out of site a
scenario 5: site a and b down
plan: cross fingers, restore to new site from cold storage on expensive cloud VM instances
scenario 6: data archive corrupted ransomware
plan: restore from air gapped data archive, hope ransomware was identified within 90 days
scenario 7: site b and c down, then site a down
plan: quit
scenario 8: staff hates job and all quit
plan: outsource
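Just to make the topology sketch above concrete: you can encode those replication legs as data and mechanically check that every site holding primary data has at least one copy somewhere else. A throwaway sketch; the site names are just the placeholders from the list above:

    # Each tuple is (source site, destination site, mechanism), mirroring the list above.
    REPLICATION = [
        ("site a", "site b", "LUN/share replication"),
        ("site a", "site b", "VM replication to remote HCI"),
        ("site a", "site a", "local backups to local archive"),
        ("site a", "site b", "archive replication"),
        ("site b", "site b", "air-gapped archive"),
        ("site a", "site c", "cold storage on aws/gcp/azure"),
        ("site c", "site d", "cloud geo replication"),
    ]

    def offsite_copies(site: str) -> set[str]:
        """Sites that hold a copy of `site`'s data, other than `site` itself."""
        return {dst for src, dst, _ in REPLICATION if src == site and dst != site}

    if __name__ == "__main__":
        for site in ("site a", "site b", "site c"):
            copies = offsite_copies(site)
            print(f"{site}: {'OK' if copies else 'NO OFF-SITE COPY'} -> {sorted(copies)}")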
I just recently started moving some services for my business to one of OVH's US-based data centers. Should I take this fire as evidence that OVH is incompetent and get out? I really don't want AWS, or the big three hyperscalers in general, to be the only option.
IMO you should take this fire as evidence that you need to have (working!) backups wherever you host your data. AWS, GCP and Azure are not fire-resistant either, same as OVH.
I don't know if OVH is more or less competent than big three, I choose to trust no one.
I read multiple times that they didn't even have sprinklers, only smoke detectors in their EU datacenter(s). I'm 100% sure AWS, Azure and Google have better fire prevention.
This thread has people saying they have sprinklers, don't have sprinklers, have / don't have gas suppression, and have puppies / actually have toilets.
Wait for the misinformation hose to dry up, and decide in a few weeks.
And this is why the big 3 will continue to dominate. AWS, Microsoft and Google can throw a lot more money at their physical infrastructure than any other cloud provider.
After this sorry episode, I don't think any CTO or CIO of a public company will even be able to consider using the other guys.
edit: I am not implying that we should put all our eggs in one basket with no failover and DR. I am implying the big companies will pay a 2x premium on infrastructure to project reliability.
I could replicate my whole infrastructure on 3 different OVH datacenters, with enough provision to support twice the peak load - it would still be cheaper than a single infrastructure at AWS, and I would get a better uptime than AWS: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
I agree with you 100%. However, I did state *any CTO or CIO of any public company*... the executives don't worry about costs, they worry about being able to *project reliability*.
Executives that worry about reliability would insist on deploying on multiple data centers, which would make the project more reliable than any single AWS availability zone.
Also, cost matters if the AWS bill is one of the company's top expenses.
reserved a1.large instances are about half the price of OVH's b2-7 instance. a1.xlarge are still cheaper (and larger). So you get more raw compute per dollar on AWS.
OVH dedicated instances start at about the size of an a1.metal instance, which is ~30% more than the comparable OVH instance, but you can get discounts in various ways.
Or you could use t4g.2xlarge, which is cheaper. There's no situation where OVH is 3x cheaper (I mean maybe if bandwidth is your thing, but IDK).
With Google arbitrarily killing accounts, and with Amazon showing that they’ll do the same if it’s politically expedient, I’m not sure I’d trust the big three, either. It’s a case of “pick your poison”.
Imagine AWS, with fewer features, but 10x-100x lower prices. And now you know why, until a few years ago, they were larger in traffic, customers, and number of servers than even AWS.
OVH aren't nice in that regard either, and have trolled competitors with "leave it to the pros" jabs before when there were serious incidents (of which OVH have had their fair share), so it's not surprising.
/shrug/ We have our infrastructure partially in OVH, I see it as a friendly jab at them and a way to get updates without having to navigate to twitter.
I'm not sure if it's because my tolerance of Graham Linehan has snapped or not, but I barely laugh at the IT Crowd any more. As with other GL shows, I find it's mostly held together by the cast's delivery and such.
The laugh track and the writing are honestly dated, even by the standards of Dad's Army.
I don't remember the details, but I think that season 2 kind of retroactively ruined season 1. They used to have all those O'Reilly and EFF stickers, and working at a help desk at the time, it felt very authentic. Then everything got super nice in season 2 -- leather couches, people were dressing nicely, etc. It kind of lost its charm. You can't rewatch it because you know Denholm is just going to randomly jump out of a window.
(Having said that, I think "Fire" was a memorable episode that is still amusing. The 0118911881999119 song, "it's off, so I'll turn it on... AND JUST WALK AWAY".)
It might have been ahead of its time. Silicon Valley was well received and is as nerdy and intricately detailed as Season 1 of the IT Crowd. "Normal people" thought it was far out and zany. People that work in tech have been to all those meetings. And, a major character was named PG!
> They used to have all those O'Reilly and EFF stickers, and working at a help desk at the time, it felt very authentic. Then everything got super nice in season 2 -- leather couches, people were dressing nicely, etc. It kind of lost its charm.
That sounds like a pretty realistic allegory of the last two decades in Free Software (or software in general, or the web...)
The IT Crowd's comedy became dated incredibly quickly, just like Father Ted's.
Comedies that came later ditched the laugh track. They had to work harder to get viewers at home to laugh, but ultimately a bunch of them (starting with The UK Office) hold up much better as a result.
Unfortunately, a lot of people are going to find out the hard way today why AWS/GCP/Big Expensive Cloud is so expensive (hint: they have redundancy and failover procedures, which drive up costs).
Keep in mind I’m talking not of “downtime” but of actual data loss which might affect business continuity.
This is really tragic. I'm hoping they have some kind of multi-regional backup/replication and not just multiple zones (although from Twitter it appears that only one of the zones was destroyed, the others just don't seem to be operational at the moment).
“Big cloud” has had fires take out clusters, and somehow they manage to keep it out of the news. In spite of the redundancy and failover procedures, keeping your data centers running when one of the clusters was recently *on fire* is something that is often only possible due to heroic efforts.
When I say “heroic efforts”, that’s in contrast to “ordinary error recovery and failover”, which is the way you’d want to handle a DC fire, because DC fires happen often enough.
The thing is, while these big companies have a much larger base of expertise to draw on and simply more staff time to throw at problems, there are factors which incentivize these employees to *increase risk* rather than reduce it.
These big companies put pressure on all their engineers to figure out ways to drive down costs. So, while a big cloud provider won’t make a rookie mistake—they won’t forget to run disaster recovery drills, they won’t forget to make backups and run test restores—they *will* do a bunch of calculations to figure out how close to disaster they can run in order to save money. The real disaster will then reveal some false, hidden assumption in their error recovery models.
Or in other words, the big companies solve all the easy problems and then create new, hard problems.
You know, those are excellent observations. But they don’t change the decision calculus in this case. Using bigger cloud providers doesn’t eliminate all risk, it just creates a different kind of risk.
What we call “progress” in humanity is just putting our best efforts into reducing or eliminating the problems we know how to solve without realizing the problems they may create further down the line. The only way to know for sure is to try it, see how it goes, and then re-evaluate later.
California had issues with many forest fires. They put out all fires. Turns out, that solution creates a bigger problem down the line with humongous uncontrollable fires which would not have happened if the smaller fires had not been put out so frequently. Oops.
I encourage you to have a look at the operating income that AWS rakes in.
Sure, the amount of expertise, redundancy and breadth of service offerings they provide is worth a markup, but they are also significantly more expensive than they need to be.
Thanks to being the leader in an oligopoly, and due to patterns like making network egress unjustifiably expensive to keep you (/your data) from leaving.
I think the question here, then is of subjective value.
AWS may charge more for egress, but that’s not high enough for it to be a concern for most clients.
A bigger, independent concern is probably that there should be sufficient redundancy, backups and such to allow for business continuity. (Note again that I'm not saying all companies make full use of these features, but those that care about such things do. Additionally, I've honestly never heard of an AWS DC burning down. Either it doesn't happen frequently or it doesn't have enough of an effect on regular customers; both situations are equivalent for my case.)
Most businesses choose to prioritize the second aspect. Even if they have to pay extra for egress sometimes, it’s just not big enough of a concern as compared to businesses continuity.
An availability zone (AZ) in AWS eu-west-2 was flooded by a fire protection system going off within the last year. It absolutely did affect workloads in that AZ. That shouldn't have had a large impact on their customers since AWS promote and make as trivial as is viable multi-AZ architectures.
Put another way: one is guided towards making good operational choices rather than being left to discover them yourself. This is a value proposition of public clouds, since it commoditises that specialist knowledge.
What surprised me most about today's fire is that their datacenters have so little physical separation. I expected them to be far enough apart to act as separate availability zones.
I've never heard of any data centre burning down (and I work in this industry), so never hearing of an AWS DC burning down isn't really saying anything about AWS.
That's true, but it seems the whole SBG region for OVH is within the same disaster radius for one fire, with SBG2 destroyed and SBG1 partly damaged.
"The whole site has been isolated, which impacts all our services on SBG1, SBG2, SBG3 and SBG4. "
Wonder if those SBGx sites were advertised as being the same as "Availability Zones", when other cloud providers ensure zones are far enough apart from each other (~1km at least) to likely survive events such as a fire.
They were not, and were never advertised as, anything similar to AZs. You could not choose to deploy in SBG1, 2, 3, etc.; you only pick city = Strasbourg at deploy time. It's merely a building marker.
> but it seems whole of SBG region for OVH is within same disaster radius for one fire
SBG is for Strasbourg. That's not a region. It's a city. Obviously, SBG1 to 4 are in the radius of one fire. It's four different buildings on the same site.
Why? They aren't making any sort of guarantees with those locations; those aren't advertised as separate fault zones. Those buildings are meant as expansions of the existing site.
If the data is that critical, surely you would be backing it up frequently and also mirror it on at least one geographically separate server?
I use a single server at OVH, and I'm not in the affected DC, but if this DID happen to me I could get back up and running fairly quickly. All our data is mirrored on S3 and off site backups are made frequently enough it wouldn't be an issue.
Plus, you still need to plan for a scenario like this even with AWS or any other cloud provider. It is less likely to happen with those, given the redundancy, but there is still a chance you lose it all without a backup plan.
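For what it's worth, the "mirrored on S3" part can be as small as a nightly job that pushes the latest dump to a bucket hosted somewhere other than your server's provider. A sketch using boto3; the bucket name and dump path are placeholders, and credentials are assumed to come from the usual AWS config:

    import datetime, pathlib
    import boto3

    BUCKET = "example-offsite-backups"               # placeholder bucket in another provider/region
    DUMP = pathlib.Path("/var/backups/app/db.dump")  # placeholder nightly database dump

    def mirror_to_s3():
        key = f"nightly/{datetime.date.today().isoformat()}/{DUMP.name}"
        boto3.client("s3").upload_file(str(DUMP), BUCKET, key)
        print(f"uploaded s3://{BUCKET}/{key}")

    if __name__ == "__main__":
        mirror_to_s3()

The same idea works against any S3-compatible object store if you'd rather not depend on AWS specifically.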
Yup, I've never heard of a fire taking out a Big Cloud DC. They actually know what they're doing and don't put server racks in shipping containers stacked on top of each other. If you want quality in life, sometimes you have to pay for it.
Personally I'll continue to use these third world cloud providers. But I like to live on the edge.
Their solar panels on the roof caught on fire. Presumably there was no actual disruption in service computing-wise.
> Google fire
2006, still a young company at that point. Also this was way before GCP where they were responsible for other people's services.
> AWS fire
The data center caught fire while it was under construction. Drunk construction workers messed up, not AWS. Presumably no computers were even inside yet. This is completely different from an OVH datacenter burning down completely while all their customers' data went up in flames.
If these are the best examples of Big Cloud messing up, this is quite the endorsement.
It's funny how the OVH CEO tells everyone to go activate their DRPs while the company didn't have enough foresight to install fire suppression systems.
These are just the first examples I could find. Fires account for 20%+ of all DC outages in the industry. But if you take the time to find a reason to invalidate each one, it seems like you have your mind made up.
White collar jobs are not paid well in the EU. If a saleswoman at a nearby shop earns the same salary, why should I care about the quality of my work at all? The entire data center burned down? Fine! There are a lot of other places with the same tiny salary.
As you can see below, an expert developer only makes 24K-63K euros per year at OVH (in US dollars it's almost the same amount):
So when https://en.wikipedia.org/wiki/May_1998_riots_of_Indonesia started happening, we heard some harrowing stories of US employees being abducted, among other things.